LFM2.5-8B-A1B Hardware Requirements: VRAM and GPU Guide
VRAM figures are computed from the published 8.3B total parameter count using the sitewide methodology. Tokens/sec estimates scale with the 1.5B active parameter set and the listed memory bandwidth.
Updated May 2026 · 8.3B total / 1.5B active MoE · 128K context · Reasoning model · Ollama / MLX / llama.cpp setup
Liquid AI shipped LFM2.5-8B-A1B on 28 May 2026 — a reasoning Mixture-of-Experts with 8.3B total parameters but only 1.5B active per token. That mix gives it the on-device speed of a 1.5B model with the depth of an 8B, plus a 128K context window. At Q4_K_M the full weight set fits in about 6 GB of VRAM, so it runs on an RTX 4060 8GB and any 12 GB+ card runs it comfortably. Apple Silicon gets day-one MLX support.
What is LFM2.5-8B-A1B?
LFM2.5-8B-A1B is Liquid AI's on-device reasoning model. It builds on the LFM2-8B-A1B release from late 2025 with a doubled vocabulary, scaled-up pretraining (12T to 38T tokens), an explicit chain-of-thought before every answer, and a 128K context window.
- Architecture: Mixture-of-Experts. 8.3B total parameters across experts, 1.5B active per token. Speed of a 1.5B model, knowledge depth of an 8B.
- Context: 128K tokens. Long enough for full codebases, long documents, or multi-turn agent traces.
- Reasoning: Always emits an explicit chain-of-thought before its final answer. Treat the <think> block as internal scratchpad and present only the post-think output to end users.
- Hugging Face: LiquidAI/LFM2.5-8B-A1B (base + instruct + GGUF + MLX). License: lfm1.0.
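Hiding the reasoning trace from end users can be done with a small post-processing step. The sketch below is one possible approach, not Liquid AI's reference code: it assumes the trace is wrapped in `<think>...</think>` (the source names the `<think>` block; the closing tag is an assumption), and `strip_reasoning` is a hypothetical helper name.

```python
import re

# Assumption: the reasoning trace is delimited by <think> and </think>.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_reasoning(raw_output: str) -> str:
    """Return only the text outside the model's <think> scratchpad."""
    visible = THINK_BLOCK.sub("", raw_output)
    # If generation was cut off mid-thought there is no closing tag;
    # in that case nothing after the opening tag is safe to show.
    open_tag = visible.find("<think>")
    if open_tag != -1:
        visible = visible[:open_tag]
    return visible.strip()


raw = "<think>User wants a sum. 2 + 2 = 4.</think>\nThe answer is 4."
print(strip_reasoning(raw))
```

If the runtime already separates the trace into its own field, this step is unnecessary; check what your serving stack returns before adding it.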
LFM2.5-8B-A1B VRAM by Quantization
MoE models still load every expert into VRAM at inference time. The figures below use the sitewide formula: VRAM ≈ total_params × bytes + 2 GB overhead. Q4_K_M uses ~0.5 bytes/param, Q8 uses ~1.0 bytes/param, FP16 uses ~2.0 bytes/param. Add KV-cache headroom for the 128K context.
| Quantization | VRAM (8.3B total) | Fits on | Notes |
|---|---|---|---|
| Q4_K_M | ~6 GB | RTX 4060 8GB and up | Default Ollama / llama.cpp quant. Best speed/quality balance. |
| Q5_K_M | ~7 GB | RTX 4060 8GB (tight) and up | A small VRAM step up for noticeably better quality. |
| Q8_0 | ~10 GB | 12 GB GPU and up | Near-FP16 quality, fits comfortably on the RTX 4070 12GB. |
| FP16 | ~19 GB | 24 GB GPU and up | Reference precision. Use a 24 GB card or Mac mini M4 Pro 48GB. |
Want a context-aware estimate? The VRAM Calculator projects KV-cache growth for the 128K window.
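The sitewide formula can be checked with a few lines of arithmetic. This is a minimal sketch using only the bytes-per-parameter values and 2 GB overhead quoted in this section; Q5_K_M is left out because the text does not state its bytes/param, and `estimate_vram_gb` is a hypothetical helper name.

```python
# Bytes per parameter and fixed overhead as quoted in this guide.
BYTES_PER_PARAM = {"Q4_K_M": 0.5, "Q8_0": 1.0, "FP16": 2.0}
OVERHEAD_GB = 2.0


def estimate_vram_gb(total_params_b: float, quant: str) -> float:
    """VRAM ≈ total_params × bytes + 2 GB overhead (no long-context KV cache)."""
    return total_params_b * BYTES_PER_PARAM[quant] + OVERHEAD_GB


for quant in BYTES_PER_PARAM:
    print(f"{quant}: {estimate_vram_gb(8.3, quant):.2f} GB")
```

For 8.3B total parameters this gives roughly 6.15, 10.3, and 18.6 GB, which round to the ~6, ~10, and ~19 GB figures in the table.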
Why The Active vs Total Distinction Matters
LFM2.5-8B-A1B is one of a wave of on-device MoE releases, and its memory and speed behaviour is unlike a dense 8B model.
VRAM is set by TOTAL params
All 8.3B parameters must be resident in VRAM because routing happens per-token and any expert can be selected. Plan VRAM as if it were a dense 8B.
Speed is set by ACTIVE params
Only 1.5B parameters fire per forward pass. Tokens/sec on the same hardware are 3-5x what you would see from a dense 8B model.
Great for low-power devices
Liquid AI quotes ~30 tok/s on a phone and 253 tok/s on M5 Max. That speed comes from the small active set, not the total size.
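Long context needs its own budget
KV-cache size is set by the attention layers (layer count, KV heads, head dimension) and the context length, not by how many experts fire per token, so MoE routing alone does not shrink it. Budget KV-cache memory for the 128K window on top of the weight figures above.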
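KV-cache memory at a given context length can be estimated from the attention configuration. The sketch below uses the standard estimate (keys plus values, per attention layer, per KV head, per token). The layer count, KV-head count, and head dimension are placeholders chosen for illustration, not values from the model card, and 128K is assumed to mean 131,072 tokens.

```python
def kv_cache_bytes(
    attn_layers: int,
    kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # FP16 cache entries
) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    keys_and_values = 2
    return (
        keys_and_values
        * attn_layers
        * kv_heads
        * head_dim
        * context_tokens
        * bytes_per_value
    )


# Hypothetical attention config; substitute the real values from config.json.
size = kv_cache_bytes(attn_layers=8, kv_heads=8, head_dim=64, context_tokens=128 * 1024)
print(f"{size / 1024**3:.2f} GiB")
```

Nothing in this estimate depends on the number of experts, which is why the cache has to be budgeted separately from the weights.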
What LFM2.5-8B-A1B Can You Run on Your GPU?
Find your GPU below. Each card shows which quantization tier fits and what to watch out for.
RTX 4060 8GB
Runs:
- LFM2.5-8B-A1B at Q4_K_M (~6 GB) with ~2 GB free for KV cache
- Short-to-medium context windows
Does not fit:
- Q8_0 (~10 GB)
- FP16 (~19 GB)
- Full 128K context at high batch sizes
Surprisingly capable here. Only 1.5B active params per token means generation speed comparable to a dense 1-2B model — much faster than running a dense 8B on the same card.
Intel Arc B580 12GB
Runs:
- Q4_K_M and Q5_K_M comfortably
- Q8_0 with ~2 GB headroom
Does not fit:
- FP16 (~19 GB)
Strong budget pick. 456 GB/s bandwidth and 12 GB VRAM mean Q8 fits with room for the long-context KV cache. Confirm llama.cpp Vulkan/SYCL build before purchase.
RTX 4070 12GB
Runs:
- Q4_K_M, Q5_K_M, and Q8_0 all fit
- Long contexts at Q4 with room to spare
Does not fit:
- FP16 (~19 GB)
Sweet spot for this model. CUDA tensor cores and 504 GB/s bandwidth keep the active 1.5B params moving quickly — expect very fast token rates.
RTX 4060 Ti 16GB
Runs:
- Every quant up to Q8_0 with generous headroom
- Full 128K context at Q4_K_M without OOM
Does not fit:
- FP16 (~19 GB)
Best mid-range VRAM-per-dollar for this model. 16 GB lets you keep the full expert set at Q8 plus a big KV cache for the 128K window.
RTX 4090 24GB
Runs:
- Every quant including FP16 (~19 GB)
- 128K context at Q8 with batch size > 1
Does not fit:
- Nothing in the LFM2.5-8B-A1B family is out of reach
1,008 GB/s bandwidth and 24 GB VRAM are overkill for a 1.5B-active MoE, but if you already own one, expect very high tokens/sec at full FP16 precision.
AMD RX 7900 XTX 24GB
Runs:
- All quants up to FP16
- 128K context with KV-cache headroom
Does not fit:
- Nothing in the LFM2.5-8B-A1B family is out of reach
24 GB VRAM at 960 GB/s — strong AMD pick. Confirm your llama.cpp build uses ROCm or Vulkan for hardware acceleration.
Mac mini M4 16GB
Runs:
- Q4_K_M and Q5_K_M
- Q8_0 fits but is tight at 128K context
Does not fit:
- FP16 (~19 GB)
Excellent on-device pick. MLX support landed on day one. Liquid AI quotes ~253 tok/s on M5 Max; with 120 GB/s of memory bandwidth, the M4 16GB lands far lower, around ~22 tok/s at Q4 by this guide's estimate.
Mac mini M4 24GB
Runs:
- All quants up to Q8_0 comfortably
- 128K context at Q8 without memory pressure
Does not fit:
- FP16 (~19 GB): technically loads, but leaves only ~5 GB for the OS, which is very tight
The sweet spot on Mac. The M4 Pro 24GB is better still, with 273 GB/s of memory bandwidth against the base M4's 120 GB/s; pick that if you can stretch the budget.
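The per-card fit checks above reduce to comparing each quant's footprint against available memory. A minimal sketch using the approximate footprints from this guide's quantization table; `quants_that_fit` and its `reserve_gb` parameter are hypothetical names, and the 6 GB reserve in the last call is only an illustrative figure for memory held back on a unified-memory Mac.

```python
# Approximate footprints from this guide's quantization table (GB).
QUANT_VRAM_GB = {"Q4_K_M": 6, "Q5_K_M": 7, "Q8_0": 10, "FP16": 19}


def quants_that_fit(memory_gb: float, reserve_gb: float = 0.0) -> list[str]:
    """List quants whose footprint fits after holding back reserve_gb."""
    budget = memory_gb - reserve_gb
    return [name for name, need in QUANT_VRAM_GB.items() if need <= budget]


print(quants_that_fit(8))                  # 8 GB discrete GPU
print(quants_that_fit(12))                 # 12 GB discrete GPU
print(quants_that_fit(24))                 # 24 GB discrete GPU
print(quants_that_fit(24, reserve_gb=6))   # 24 GB unified memory, OS reserve assumed
```

The check covers weights only; long-context KV cache still needs headroom on top, which is why a quant that technically fits can be marked as tight.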
Inference Speed by Hardware
Generation speed is set by memory bandwidth and the active 1.5B parameter set. Real-world results vary by quantization, context length, and runtime build.
| Hardware | Bandwidth | Q4 tok/s | Q8 tok/s | Notes |
|---|---|---|---|---|
| RTX 4090 24GB | 1,008 GB/s | ~180 t/s | ~120 t/s | Bandwidth ceiling, not compute |
| RTX 4070 12GB | 504 GB/s | ~95 t/s | ~62 t/s | Strong sweet-spot pick |
| RTX 4060 Ti 16GB | 288 GB/s | ~55 t/s | ~36 t/s | Best 16 GB value |
| Intel Arc B580 12GB | 456 GB/s | ~85 t/s | ~55 t/s | Budget winner |
| RTX 4060 8GB | 272 GB/s | ~50 t/s | — | Q4 only |
| Mac mini M4 Pro 24GB | 273 GB/s | ~50 t/s | ~33 t/s | MLX backend |
| Mac mini M4 16GB | 120 GB/s | ~22 t/s | ~14 t/s | Q4 recommended |
Estimate: tokens/sec ≈ bandwidth (GB/s) / (active_params × bytes). The 1.5B active set keeps the denominator low, so speeds run well above what dense 8B models hit on the same cards.
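The ratio in this estimate can be computed directly. This sketch evaluates only the raw bandwidth-over-bytes-read ceiling, using the 1.5B active set and the bytes/param values quoted earlier in the guide; `ceiling_tok_per_s` is a hypothetical helper name. The table's figures are considerably lower than this ceiling, so the sitewide methodology evidently applies further derating that is not modelled here.

```python
BYTES_PER_PARAM = {"Q4": 0.5, "Q8": 1.0}
ACTIVE_PARAMS_B = 1.5  # billions of parameters read per generated token


def ceiling_tok_per_s(bandwidth_gb_s: float, quant: str) -> float:
    """Upper bound on tokens/sec if every token only reads the active weights."""
    gb_read_per_token = ACTIVE_PARAMS_B * BYTES_PER_PARAM[quant]
    return bandwidth_gb_s / gb_read_per_token


for name, bandwidth in [("RTX 4090", 1008), ("RTX 4070", 504), ("Mac mini M4", 120)]:
    q4 = ceiling_tok_per_s(bandwidth, "Q4")
    q8 = ceiling_tok_per_s(bandwidth, "Q8")
    print(f"{name}: Q4 ceiling {q4:.0f} tok/s, Q8 ceiling {q8:.0f} tok/s")
```

Treat the result as a bound for comparing cards against each other rather than a prediction; real throughput also depends on the runtime build, context length, and kernel efficiency.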
How to Run LFM2.5-8B-A1B Locally
llama.cpp / Ollama
ollama run hf.co/LiquidAI/LFM2.5-8B-A1B-GGUF
Liquid AI publishes a first-party GGUF repo. Ollama auto-detects GPU on NVIDIA, AMD (ROCm), and Apple Silicon. Default quant is Q4_K_M (~6 GB).
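Once the model has been pulled, it can also be called programmatically through Ollama's local HTTP API. A minimal sketch using only the standard library, assuming the Ollama server is running on its default port (11434) and that the model tag matches the `ollama run` command above; `build_request` and `generate` are hypothetical helper names.

```python
import json
import urllib.request

# Assumptions: default local Ollama endpoint, model tag as used with `ollama run`.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "hf.co/LiquidAI/LFM2.5-8B-A1B-GGUF"


def build_request(prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": MODEL, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("Explain Mixture-of-Experts routing in one sentence.")` returns the model's reply as a string. Setting `stream` to `False` trades incremental output for a single JSON response, which keeps the example short.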
MLX (Apple Silicon)
mlx_lm.generate --model LiquidAI/LFM2.5-8B-A1B
Day-one MLX support. Best path on Mac for both performance and the 128K context window. Use the MLX 4-bit quant for the M-series mini class machines.
vLLM / SGLang
vllm serve LiquidAI/LFM2.5-8B-A1B
For server deployments or high-concurrency on-prem use. Liquid reports 18,500 tokens/sec at high concurrency on a single H100.
For installation walkthroughs, see how to run LLMs locally and the Ollama cheat sheet.
Which Hardware Should You Buy for LFM2.5-8B-A1B?
RTX 4060 8GB
Q4_K_M fits with ~2 GB free for short-context KV cache. Generation speed is driven by the 1.5B active params, not the 8.3B total — surprisingly snappy.
Intel Arc B580 12GB
Cheapest card that fits Q8_0 with headroom. 456 GB/s bandwidth keeps the active set moving. Confirm llama.cpp Vulkan/SYCL support.
RTX 4070 12GB
The all-rounder: every quant up to Q8 fits, CUDA support is rock-solid, and 504 GB/s bandwidth pushes high tokens/sec on this MoE.
RTX 4060 Ti 16GB
Comfortable Q8 with room for the full 128K context KV cache. The best mid-range NVIDIA pick for long-context use.
Mac mini M4 24GB
MLX support landed day one. 24 GB unified memory holds Q8 plus the 128K KV cache. Silent and fast — Liquid AI quotes ~253 tok/s on M5 Max.
AMD RX 7900 XTX 24GB
24 GB at 960 GB/s for FP16-quality inference. Cheaper than an RTX 4090 with the same VRAM. ROCm and Vulkan both work with current llama.cpp builds.
For the full cross-budget GPU comparison, see the best GPU for LLMs guide.
Frequently Asked Questions
How much VRAM does LFM2.5-8B-A1B need?
LFM2.5-8B-A1B has 8.3B total parameters across its expert layers. At Q4_K_M the full weight set fits in about 6 GB of VRAM, at Q8 about 10 GB, and at FP16 about 19 GB. Inference speed scales with the 1.5B active parameters, not the 8.3B total — so even an 8 GB GPU runs it noticeably faster than a dense 8B model.
Can an 8GB GPU run LFM2.5-8B-A1B?
Yes. At Q4_K_M the full 8.3B expert set occupies about 6 GB, leaving roughly 2 GB on an RTX 4060 8GB for the OS, KV cache, and short-to-medium context windows. Because the model is a Mixture-of-Experts that activates only 1.5B parameters per token, it generates faster than dense 8B models on the same hardware.
What is the difference between total and active parameters?
Total parameters (8.3B) is the full size of the model and determines VRAM. Active parameters (1.5B) is the number used per forward pass and determines inference speed. LFM2.5-8B-A1B routes each token through a small subset of experts, so it has the speed of a 1.5B model with the knowledge depth of an 8B model — the core trade-off of MoE architectures.
Does LFM2.5-8B-A1B run on Apple Silicon?
Yes. Liquid AI shipped day-one MLX support and reports about 253 tokens/sec on a Mac with M5 Max. A Mac mini M4 16GB fits the model at Q4_K_M with plenty of headroom for the 128K context window. For Q8 quality, the Mac mini M4 24GB is the comfortable pick. Unified memory suits MoE models because the entire expert set must stay resident, and on Apple Silicon the GPU draws from the same pool as system RAM.
Is LFM2.5-8B-A1B open source?
The model weights are openly downloadable from Hugging Face under the lfm1.0 license. It is best described as open-weight rather than fully open-source — review the Liquid AI license terms before commercial use. Day-one runtime support covers llama.cpp, MLX, vLLM, SGLang, ONNX, and Liquid LEAP.
How does LFM2.5-8B-A1B compare to dense 8B models?
On the same hardware, LFM2.5-8B-A1B generates tokens 3-5x faster than a dense 8B like Llama 3.1 8B because only 1.5B parameters fire per token. It also ships with a 128K context window and an explicit reasoning trace before its final answer. Dense 8B models still tend to win on raw quality at single-turn tasks, but LFM2.5-8B-A1B is the better pick for low-latency on-device use, long-context tasks, and multilingual workloads.
Check VRAM for LFM2.5-8B-A1B, or compare hardware options.
Related Guides
Best GPU for Running LLMs Locally
Top picks across every budget tier in 2026.
How Much VRAM Do I Need?
Match your GPU VRAM to the largest model you can run.
LLM Quantization Guide
Q4 vs Q8 vs FP16 — what each tier costs in quality.
Best LLMs to Run Locally
Curated picks across model families and sizes.
LM Studio vs Ollama
Which local-LLM runtime fits your workflow.
Sources & methodology
VRAM and tokens-per-second figures are computed from the published spec using the sitewide formula on the methodology page. Primary sources for this guide:
- Hugging Face model card. Parameter count, license, runtime support.
- Liquid AI release blog. 8.3B total / 1.5B active, 128K context, 18.5K tok/s on H100, 253 tok/s on M5 Max.
- Ollama. GGUF runtime used for the install steps.
Spot a number that does not match the linked source? Email [email protected] and the guide will be updated.