How Much VRAM Do I Need for Local LLMs?
Updated May 2026 — covers all model sizes, quant levels, and context overhead
VRAM by Model Size and Quantization
| Model Size | Q4_K_M | Q8_0 | FP16 | Minimum GPU |
|---|---|---|---|---|
| 1B–3B | 1–2 GB | 1–3 GB | 2–6 GB | Any GPU (even GTX 1060 6GB) |
| 7B | 4.5–5.5 GB | 7–8 GB | 14–16 GB | RTX 4060 8GB (Q4/Q8) |
| 13B | 7–9 GB | 13–14 GB | 26–28 GB | RTX 4060 Ti 16GB (Q4) |
| 20B–30B | 12–17 GB | 20–30 GB | 40–60 GB | RTX 3090 / RTX 4090 (Q4) |
| 70B | 37–40 GB | 70–75 GB | ~140 GB | 2× RTX 3090 or Mac Studio (Q4) |
| 405B | ~220 GB | ~405 GB | ~810 GB | Multi-GPU workstation |
These numbers include ~2 GB system overhead but not the KV cache for your conversation context. See the KV cache section below.
The Formula
VRAM (GB) = (parameters × bytes_per_param) / 10^9 + overhead
bytes_per_param:
Q4_K_M = 0.5 bytes
Q8_0 = 1.0 bytes
FP16 = 2.0 bytes
FP32 = 4.0 bytes
overhead = ~2 GB (activations, KV cache baseline, CUDA runtime)
Example: 13B model at Q4_K_M = (13 × 10^9 × 0.5) / 10^9 + 2 = 8.5 GB
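The arithmetic above can be sketched as a small Python helper. This is a minimal illustration, not a measurement tool: the function name `estimate_vram_gb` is made up for this example, and the byte costs and the flat 2 GB overhead are the approximations listed in this guide.

```python
# Approximate bytes per parameter, as listed in this guide.
BYTES_PER_PARAM = {"Q4_K_M": 0.5, "Q8_0": 1.0, "FP16": 2.0, "FP32": 4.0}

# Flat allowance used by this guide (activations, KV cache baseline, CUDA runtime).
OVERHEAD_GB = 2.0


def estimate_vram_gb(params_billions: float, quant: str,
                     overhead_gb: float = OVERHEAD_GB) -> float:
    """Estimate VRAM in GB for a model, excluding conversation KV cache."""
    weights_gb = params_billions * 1e9 * BYTES_PER_PARAM[quant] / 1e9
    return weights_gb + overhead_gb


if __name__ == "__main__":
    for size in (7, 13, 70):
        for quant in ("Q4_K_M", "Q8_0", "FP16"):
            print(f"{size}B {quant}: {estimate_vram_gb(size, quant):.1f} GB")
```

Treat the result as a starting estimate; real files and runtimes vary by a few hundred MB either way.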
Don't Forget the KV Cache
Every token in your conversation occupies VRAM in the KV (key-value) cache. For long conversations or large context windows, this can add several GB on top of the model weights.
| Context (tokens) | 7B model | 13B model | 70B model |
|---|---|---|---|
| 4K (typical chat) | ~0.5 GB | ~0.8 GB | ~2 GB |
| 16K | ~1.5 GB | ~3 GB | ~8 GB |
| 32K | ~3 GB | ~6 GB | ~16 GB |
| 128K | ~12 GB | ~24 GB | ~64 GB |
For most chat use cases with 4K–8K context, add 1–2 GB on top of model weights. Running at 32K+ context? Budget significantly more.
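To see why the cache grows linearly with context, here is a rough sketch of the commonly used per-token estimate: one key and one value vector per layer, per KV head, per token. The function name `kv_cache_gb` and the model shape in the example (32 layers, 8 KV heads, head dimension 128, FP16 cache) are assumptions chosen for illustration, not figures from this guide's sources. Real models differ, so expect results that only roughly track the table above.

```python
def kv_cache_gb(context_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: float = 2.0) -> float:
    """Rough KV cache size in GB for a given context length.

    The leading factor of 2 accounts for storing both a key and a
    value vector per layer, per KV head, per token.
    """
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 1e9


if __name__ == "__main__":
    # Hypothetical 7B-class shape; check your model's config for real values.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    for ctx in (4_096, 16_384, 32_768, 131_072):
        print(f"{ctx:>7} tokens: {kv_cache_gb(ctx, **shape):.2f} GB")
```

Doubling the context doubles the cache, which is why 32K+ contexts need a much larger budget than a typical 4K chat.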
Q4 vs Q8: Is Quality Affected?
Q4_K_M is the sweet spot for most users. It cuts VRAM roughly in half compared to Q8_0 while retaining 95–98% of the model's quality, a difference most users will not notice in normal use. Q8_0 is only worth the extra VRAM if you need maximum accuracy for tasks like math or coding benchmarks.
| Format | VRAM vs FP16 | Quality loss | Use when |
|---|---|---|---|
| Q2_K | ~88% less | Noticeable | Desperate for VRAM |
| Q4_K_M | ~75% less | Minimal | Most users (recommended) |
| Q5_K_M | ~69% less | Negligible | You have headroom |
| Q8_0 | ~50% less | None | Accuracy-critical tasks |
| FP16 | Baseline | None | Fine-tuning / training |
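The "VRAM vs FP16" column follows from simple bit-width arithmetic, sketched below. The nominal bit widths (2, 4, 5, 8) are an assumption for illustration; real quantized files also store scaling metadata, so their effective bits per weight are somewhat higher and the actual savings slightly smaller.

```python
# Nominal bits per weight (assumed for illustration; effective values are higher).
NOMINAL_BITS = {"Q2_K": 2, "Q4_K_M": 4, "Q5_K_M": 5, "Q8_0": 8, "FP16": 16}


def savings_vs_fp16(fmt: str) -> float:
    """Fraction of weight memory saved relative to 16-bit weights."""
    return 1.0 - NOMINAL_BITS[fmt] / 16


if __name__ == "__main__":
    for fmt in NOMINAL_BITS:
        print(f"{fmt}: {savings_vs_fp16(fmt) * 100:.1f}% less than FP16")
```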
What Can I Run on My GPU?
| GPU / Memory | Best model fit | Notes |
|---|---|---|
| 4 GB VRAM | 3B Q8 or 7B Q2 | Very limited; consider upgrading |
| 6 GB VRAM | 7B Q4 | Tight fit; little context headroom |
| 8 GB VRAM | 7B Q8 or 13B Q4 (partial offload) | RTX 4060; sweet spot for 7B |
| 12 GB VRAM | 13B Q4 | RTX 4070; comfortable for 13B |
| 16 GB VRAM | 13B Q8 or 30B Q4 | RTX 4060 Ti 16GB; excellent value |
| 24 GB VRAM | 30B Q4 / 34B Q4 | RTX 4090; best consumer GPU |
| 48 GB VRAM | 70B Q4 | 2× RTX 3090 or RTX A6000 |
| 96–192 GB unified | 70B Q8 (96 GB) / 70B FP16 (192 GB) | Mac Studio M4 Max/Ultra; 405B Q4 (~220 GB) needs more |
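A quick way to check a card that is not listed is to compare its VRAM against the Q4_K_M figures from the first table. The sketch below is a hypothetical helper (`largest_q4_model` is not from any library) that uses the upper end of each Q4_K_M range, so it is deliberately more conservative than the tier picks above. At 16 GB, for example, it returns 13B rather than 30B.

```python
# Upper end of this guide's Q4_K_M ranges: (model size in billions, VRAM in GB).
Q4_VRAM_GB = [(3, 2.0), (7, 5.5), (13, 9.0), (30, 17.0), (70, 40.0), (405, 220.0)]


def largest_q4_model(vram_gb: float, kv_budget_gb: float = 0.0):
    """Return the largest model size (in billions) that fits, or None."""
    best = None
    for size_b, need_gb in Q4_VRAM_GB:
        # Reserve room for the conversation KV cache on top of the table figure.
        if need_gb + kv_budget_gb <= vram_gb:
            best = size_b
    return best


if __name__ == "__main__":
    for vram in (4, 6, 8, 12, 16, 24, 48):
        print(f"{vram} GB -> {largest_q4_model(vram, kv_budget_gb=1.0)}B at Q4_K_M")
```

Pass a larger `kv_budget_gb` if you plan to run long contexts.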
Frequently Asked Questions
How much VRAM do I need for a 7B LLM?
At Q4_K_M: about 4.5–5.5 GB. An 8 GB GPU like the RTX 4060 runs 7B models comfortably with room for context. At Q8, plan for 7–8 GB.
How much VRAM do I need for a 13B model?
About 7–9 GB at Q4_K_M. A 12 GB GPU handles 13B at Q4 for typical chat contexts. A 16 GB GPU (RTX 4060 Ti 16GB) gives comfortable headroom for longer contexts.
How much VRAM do I need for a 70B model?
About 37–40 GB at Q4_K_M. No single consumer GPU can hold this — you need two RTX 3090s (48 GB combined), an RTX 6000 Ada (48 GB), or a Mac Studio with 96 GB+ unified memory.
Does VRAM include context window memory?
No. The model-size table at the top covers model weights plus ~2 GB of overhead, not the KV cache for your conversation. The KV cache adds 0.5–2 GB for typical chats (4K–8K context). For 32K+ context, budget several GB extra.
What happens if my model does not fit in VRAM?
The runtime offloads the layers that do not fit to system RAM, where they can run 10–50x slower. Fitting the entire model in VRAM is the single biggest factor for inference speed: offloading even one layer can halve your tokens per second.
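If partial offload is unavoidable, a rough estimate of how many layers stay on the GPU can help pick a value for a runtime's GPU-layer setting (llama.cpp, for example, exposes this as `--n-gpu-layers`). The sketch below assumes all layers are the same size and reserves the guide's ~2 GB overhead. The function name and the 40-layer figure in the example are illustrative assumptions.

```python
import math


def gpu_layers_that_fit(weights_gb: float, n_layers: int, vram_gb: float,
                        reserve_gb: float = 2.0) -> int:
    """Estimate how many equally sized layers fit in VRAM after a reserve."""
    per_layer_gb = weights_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # Hypothetical 13B model at Q4_K_M: 6.5 GB of weights across 40 layers.
    for vram in (6, 8, 12):
        layers = gpu_layers_that_fit(6.5, 40, vram)
        print(f"{vram} GB VRAM: about {layers} of 40 layers on the GPU")
```

When the result equals the layer count, the whole model should fit and no offload is needed.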
Sources & methodology
VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide specifically I leaned on:
- Modal: How much VRAM do I need for inference — primary source for the bytes-per-parameter math and the ~2 GB overhead allowance.
- XiongjieDai GPU-Benchmarks-on-LLM-Inference — empirical "what fits where" check against real llama-bench runs on a wide range of GPUs.
- Home GPU LLM Leaderboard — community VRAM/throughput leaderboard used to confirm the per-tier model picks.
- llama.cpp llama-bench discussion — upstream benchmarking thread documenting the quantization formats (Q4_K_M, Q5_K_M, Q8_0) and their per-weight byte costs.
Spot a number that does not match the linked source? Email [email protected] and I will update the guide.