What LLMs Can I Run? Complete VRAM Guide
"I have X GB of VRAM — what can I actually run?" is one of the most common questions in local LLM communities. This guide gives you a direct answer organized by VRAM tier, with specific model recommendations, quantization choices, and speed estimates for each. Jump to your tier below, or start with the quick reference table.
Quick Reference: VRAM to Models
| VRAM | Max model at Q4_K_M | Top recommendation |
|---|---|---|
| 8 GB | 7B at Q4_K_M | Qwen3 8B Q4_K_M |
| 12 GB | 13B at Q4_K_M | Qwen3 14B Q4_K_M |
| 16 GB | 13B at Q8 | Qwen3 14B Q8 |
| 24 GB | 34B at Q4_K_M | Qwen3 32B Q4_K_M |
| 32 GB | 34B at Q8 | Qwen3 32B Q8 |
| 48 GB | 70B at Q4_K_M | Llama 3.3 70B Q4_K_M |
| 64 GB | 70B at Q5_K_M | Llama 3.3 70B Q5_K_M |
| 128 GB | 70B at FP16 / 100B+ at Q4 | Llama 4 Scout 109B Q4 |
Use the VRAM Calculator to get exact figures for any model and quantization combination.
8 GB VRAM
RTX 4060, RTX 3070 (8 GB), older 8 GB cardsCan run
- ✓ 7B models at Q4_K_M (~4.5 GB) — roughly 30–40 tok/s on RTX 4060
- ✓ 3B models at Q8 (~3.5 GB) — faster, lower quality ceiling
- ✓ 1B models at FP16 — very fast, lightweight tasks
Cannot run
- ✗ 13B models at any quantization without CPU offloading
- ✗ 34B or 70B models at useful speed
Best models for this tier
| Model | Quantization | Why |
|---|---|---|
| Qwen3 8B | Q4_K_M | Top reasoning and instruction-following at this size |
| DeepSeek-R1-Distill-8B | Q4_K_M | Best thinking model for 8 GB — chain-of-thought reasoning |
| Gemma 3 4B | Q4_K_M | Google's 4B model — excellent quality-to-size ratio, ~3 GB |
| Llama 3.1 8B Instruct | Q4_K_M | Rock-solid general purpose chat and coding |
| Phi-4-mini | Q4_K_M | Punches above its weight on math and code |
12 GB VRAM
RTX 5070 12GB, RTX 4070 12GB, Intel Arc B580, RTX 3080 10GBCan run
- ✓ 13B at Q4_K_M (~8–9 GB) with good context headroom
- ✓ 7B at Q8 (~8.5 GB) — near-lossless quality for 7B
- ✓ 7B at FP16 — maximum quality, tight fit at ~14 GB (use Q8 instead)
Cannot run
- ✗ 34B models comfortably — they need 20+ GB at Q4_K_M
- ✗ 13B at Q8 without headroom issues (~14 GB)
Best models for this tier
| Model | Quantization | Why |
|---|---|---|
| Qwen3 14B | Q4_K_M | Best 14B in 2026 — strong reasoning and coding |
| DeepSeek-R1-Distill-14B | Q4_K_M | Reasoning/thinking model at 14B class |
| Gemma 3 12B | Q4_K_M | Google's 12B — ~8 GB at Q4, solid general performance |
| Llama 3.1 8B Instruct | Q8 | Near-lossless 8B — excellent quality |
| Phi-4 14B | Q4_K_M | Strong on math and code at 13B class |
16–20 GB VRAM
RTX 4060 Ti 16GB, RTX 5070 Ti 16GB, RTX 4070 Ti Super 16GB, RTX 4080 16GB, RTX 5080 16GB, M4 Mac mini 16 GB unifiedCan run
- ✓ 13B at Q8 (~14 GB) — excellent quality, fits comfortably
- ✓ 7B at FP16 (~14 GB) — full precision inference
- ✓ 34B at aggressive Q3 quantization — reduced quality, not recommended
Cannot run
- ✗ 34B at Q4_K_M comfortably (~20 GB — very tight or impossible)
- ✗ 70B at any quantization without heavy CPU offloading
Best models for this tier
| Model | Quantization | Why |
|---|---|---|
| Qwen3 14B | Q8 | Near-lossless 14B quality — the sweet spot for 16 GB |
| DeepSeek-R1-Distill-14B | Q8 | High-quality reasoning at full Q8 fidelity |
| Llama 3.1 8B | FP16 | Full precision 8B — best 8B quality possible |
24 GB VRAM
RTX 4090, RTX 3090, RX 7900 XTX, Mac mini M4 24 GB unifiedCan run
- ✓ 34B at Q4_K_M (~20 GB) — flagship consumer GPU sweet spot
- ✓ 13B at Q8 (~14 GB) — excellent quality with headroom
- ✓ 7B at FP16 (~14 GB) — full precision 7B
- ✓ 22B models like Codestral comfortably at Q4_K_M or Q5_K_M
Cannot run
- ✗ 70B without CPU offloading (40 GB at Q4_K_M)
- ✗ 34B at FP16 (~70 GB)
Best models for this tier
| Model | Quantization | Why |
|---|---|---|
| Qwen3 32B | Q4_K_M | Flagship 32B — strong reasoning, fits comfortably |
| DeepSeek-R1-Distill-32B | Q4_K_M | Best thinking/reasoning model at this tier |
| Qwen3 14B | Q8 | Near-lossless 14B — maximum quality for 14B class |
| Codestral 22B | Q4_K_M | Best dedicated coding model at this tier |
32 GB VRAM
RTX 5090 (32 GB), some pro/workstation GPUsCan run
- ✓ 34B at Q8 (~34 GB — very tight, minimal context headroom)
- ✓ 34B at Q6_K (~27 GB) — comfortable with good headroom
- ✓ 70B at Q4_K_M with ~10 GB CPU offload — usable but slower
Cannot run
- ✗ 70B fully in VRAM (needs ~40 GB at Q4_K_M)
- ✗ 34B at FP16 (~70 GB)
Best models for this tier
| Model | Quantization | Why |
|---|---|---|
| Qwen3 32B | Q8 | Near-lossless 32B quality — flagship tier for RTX 5090 |
| DeepSeek-R1-Distill-32B | Q8 | Full-quality thinking model, fits at ~34 GB |
| Llama 3.3 70B | Q4_K_M | Partial offload — reduced speed but 70B quality |
48 GB VRAM
Mac mini M4 Pro 48 GB · dual RTX 3090 NVLink (48 GB) · Quadro A6000Can run
- ✓ 70B at Q4_K_M (~40 GB) — fits comfortably, the key milestone of this tier
- ✓ 34B at FP16 (~70 GB — tight on dual 3090; fine on M4 Pro 48 GB)
- ✓ 34B at Q8 (~34 GB) with significant headroom
Cannot run
- ✗ 70B at Q8 (~74 GB) — exceeds 48 GB
- ✗ 100B+ models at any practical quantization
Best models for this tier
| Model | Quantization | Why |
|---|---|---|
| Llama 3.3 70B Instruct | Q4_K_M | This tier unlocks 70B — the flagship open model for 2026 |
| Qwen3 32B | FP16 | Full precision 32B — maximum quality at this tier |
| DeepSeek-R1-Distill-70B | Q4_K_M | Best reasoning/thinking model that fits here |
64 GB VRAM
Mac Studio M4 Max 64 GB unifiedCan run
- ✓ 70B at Q5_K_M (~47 GB) — comfortable fit, strong quality
- ✓ 70B at Q6_K (~55 GB) — high quality with headroom
- ✓ 70B at Q8 (~74 GB) — tight to impossible; use Q6_K instead
Cannot run
- ✗ 70B at Q8 comfortably — 74 GB exceeds 64 GB
- ✗ 100B+ models at Q4_K_M comfortably
Best models for this tier
| Model | Quantization | Why |
|---|---|---|
| Llama 3.3 70B Instruct | Q5_K_M | Excellent quality, comfortable headroom at ~47 GB |
| DeepSeek-R1-Distill-70B | Q5_K_M | High-quality reasoning at full 70B scale |
| Llama 3.3 70B Instruct | Q6_K | Near-lossless quality, still fits at ~55 GB |
128 GB VRAM
Mac Studio M4 Max 128 GB unifiedCan run
- ✓ 70B at Q8 (~74 GB) — full near-lossless quality with headroom
- ✓ 70B at FP16 (~140 GB) — very tight, use Q8 instead
- ✓ 100B+ models at Q4_K_M — this tier unlocks early 100B+ models
Cannot run
- ✗ 70B at FP16 comfortably — 140 GB exceeds 128 GB; use Q8
- ✗ Very large 200B+ models
Best models for this tier
| Model | Quantization | Why |
|---|---|---|
| Llama 3.3 70B Instruct | Q8 | Full near-lossless quality at 70B — the best 70B inference |
| Llama 4 Scout 109B | Q4_K_M | First MoE 100B+ model accessible at this tier (~60 GB) |
| Qwen3 32B | Q8 | Maximum quality 32B — extremely fast on Mac Studio |
Rules of Thumb
- 1. Estimate VRAM with: parameters × bytes-per-param. A Q4_K_M model uses ~0.5 bytes per parameter. A 7B model at Q4_K_M: 7,000,000,000 × 0.5 = 3.5 GB of weights, plus ~1 GB overhead = roughly 4.5 GB total.
- 2. Leave ~2 GB headroom for context and KV cache. A 7B model that weighs 4.5 GB can still exhaust an 8 GB GPU if you feed it very long prompts. Keep headroom, especially with larger context windows.
- 3. CPU offloading cuts speed dramatically. Offloading 20–30% of a model's layers to CPU RAM typically reduces token generation to under 5 tokens per second — barely usable for interactive chat. Prefer running a smaller model fully in VRAM over a larger model with offloading.
- 4. Apple unified memory counts the same as VRAM. On Mac, the full unified memory pool is available for model loading. A 48 GB M4 Pro Mac mini can load a 70B Q4_K_M model just like a 48 GB discrete GPU setup.
- 5. Q4_K_M is the right default for almost everyone. It gives excellent quality at 4× the VRAM savings over FP16. Only go lower (Q3) if you have no other option. Go higher (Q6_K, Q8) when you have VRAM headroom and quality matters. See the quantization guide for full details.
- 6. A smaller model fully in VRAM beats a larger model split across GPU and CPU. A 7B model at Q8 running entirely on an 8 GB GPU will be faster and more responsive than a 13B model offloaded 40% to RAM.
Frequently Asked Questions
Can I run a 7B model on 8 GB VRAM?
Yes. A 7B model at Q4_K_M quantization uses roughly 4.5 GB VRAM, which fits comfortably on an 8 GB GPU like the RTX 4060 with room left for context. You can expect around 30–40 tokens per second. You cannot run 13B or larger models at this tier without CPU offloading, which significantly reduces speed.
Can I run a 70B model on a 24 GB GPU?
Not fully in VRAM. A 70B model at Q4_K_M requires roughly 40 GB VRAM. On a 24 GB GPU like the RTX 4090 you would need to offload a significant portion of layers to CPU RAM, which typically cuts speed to under 5 tokens per second. For 70B models without offloading, you need 48 GB or more of VRAM.
Does Apple unified memory count the same as VRAM?
Yes, for LLM purposes Apple Silicon unified memory works like GPU VRAM — the model loads entirely into it. A 24 GB M4 Mac mini behaves similarly to a 24 GB discrete GPU for inference. Unified memory bandwidth is typically lower than high-end discrete GPU VRAM, so token speeds may differ, but large models fit and run without CPU offloading.
What is the best model for an 8 GB GPU?
The best models for 8 GB VRAM are Qwen3 8B at Q4_K_M for top reasoning quality, DeepSeek-R1-Distill-8B at Q4_K_M for thinking/reasoning, Phi-4-mini for coding and math, and Gemma 3 4B at Q4_K_M for a fast, capable 4B option. All fit within 8 GB with headroom for context.
How much VRAM does a 13B model need?
A 13B model at Q4_K_M quantization requires approximately 8–9 GB VRAM, fitting on a 12 GB GPU with some headroom. At Q8 it needs around 14 GB, requiring a 16 GB or 24 GB GPU. At FP16 it needs roughly 27 GB, placing it in 32 GB or 48 GB territory.
Related Guides
Popular hardware for local LLMs
Know your VRAM tier? Find the right hardware or calculate exact requirements.
Sources & methodology
VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide specifically I leaned on:
- XiongjieDai GPU-Benchmarks-on-LLM-Inference — primary source for the per-tier tokens-per-second ranges (RTX 4060 through RTX 5090).
- Modal: How much VRAM do I need for inference — sanity-checked the bytes-per-parameter math and the overhead allowance.
- Home GPU LLM Leaderboard — used to confirm that the "best model at each tier" picks match what the community actually runs.
- Hugging Face Hub — model cards I checked for each recommended model's parameter count, quantization options, and file sizes.
Spot a number that does not match the linked source? Email [email protected] and I will update the guide.