How Much VRAM Do I Need for Local LLMs?

Updated May 2026 — covers all model sizes, quant levels, and context overhead

Quick rule: VRAM needed = model parameters × bytes per parameter + ~2 GB overhead. At Q4 quantization, that's roughly 0.5 bytes per parameter, so a 7B model at Q4 needs ~5.5 GB and a 70B model at Q4 needs ~37 GB.

VRAM by Model Size and Quantization

Model Size | Q4_K_M | Q8_0 | FP16 | Minimum GPU
1B–3B | 1–2 GB | 1–3 GB | 2–6 GB | Any GPU (even GTX 1060 6GB)
7B | 4.5–5.5 GB | 7–8 GB | 14–16 GB | RTX 4060 8GB (Q4/Q8)
13B | 7–9 GB | 13–14 GB | 26–28 GB | RTX 4060 Ti 16GB (Q4)
20B–30B | 12–17 GB | 20–30 GB | 40–60 GB | RTX 3090 / RTX 4090 (Q4)
70B | 37–40 GB | 70–75 GB | 140 GB | 2× RTX 3090 or Mac Studio
405B | ~220 GB | ~405 GB | ~810 GB | Multi-GPU workstation

These numbers include ~2 GB system overhead but not the KV cache for your conversation context. See the KV cache section below.

The Formula

VRAM (GB) = (parameters × bytes_per_param) + overhead

bytes_per_param:
  Q4_K_M = 0.5 bytes
  Q8_0   = 1.0 bytes
  FP16   = 2.0 bytes
  FP32   = 4.0 bytes

overhead = ~2 GB (activations, KV cache baseline, CUDA runtime)

Example: 13B model at Q4_K_M = (13 × 10^9 × 0.5) / 10^9 + 2 = 8.5 GB
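
The formula translates directly into a few lines of Python. This is a minimal sketch rather than a measurement tool: the names `estimate_vram_gb`, `BYTES_PER_PARAM`, and `OVERHEAD_GB` are illustrative, the byte sizes and the ~2 GB overhead are the rule-of-thumb values listed above, and KV cache growth with context length is not included.

```python
# Rule-of-thumb bytes per parameter, copied from the formula in this guide.
BYTES_PER_PARAM = {"Q4_K_M": 0.5, "Q8_0": 1.0, "FP16": 2.0, "FP32": 4.0}

# Flat allowance for activations, a KV cache baseline, and the CUDA runtime.
OVERHEAD_GB = 2.0


def estimate_vram_gb(params_billion: float, quant: str) -> float:
    """Estimate VRAM in GB for a model with `params_billion` billion parameters."""
    weights_gb = params_billion * BYTES_PER_PARAM[quant]
    return weights_gb + OVERHEAD_GB


if __name__ == "__main__":
    for size in (7, 13, 70):
        print(f"{size}B at Q4_K_M: ~{estimate_vram_gb(size, 'Q4_K_M'):.1f} GB")
```

Run as a script, it reproduces the 5.5 GB, 8.5 GB, and 37 GB figures quoted in this guide for 7B, 13B, and 70B at Q4_K_M.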

Don't Forget the KV Cache

Every token in your conversation occupies VRAM in the KV (key-value) cache. For long conversations or large context windows, this can add several GB on top of the model weights.

Context (tokens) | 7B model | 13B model | 70B model
4K (typical chat) | ~0.5 GB | ~0.8 GB | ~2 GB
16K | ~1.5 GB | ~3 GB | ~8 GB
32K | ~3 GB | ~6 GB | ~16 GB
128K | ~12 GB | ~24 GB | ~64 GB

For most chat use cases with 4K–8K context, add 1–2 GB on top of model weights. Running at 32K+ context? Budget significantly more.
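
To see where numbers like these come from, the standard KV cache size calculation can be sketched as follows: two tensors (keys and values) per layer, each holding one vector per KV head per token. The model shape used in the example (32 layers, 8 KV heads, head dimension 128, 16-bit cache) is an assumption chosen to resemble a 7B–8B class model with grouped-query attention, not a figure from this guide, and the function name is illustrative.

```python
def kv_cache_gib(tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> float:
    """Estimate KV cache size in GiB for a given context length.

    The leading factor of 2 covers the separate key and value tensors.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens
    return total_bytes / 2**30


if __name__ == "__main__":
    # Assumed shape: 32 layers, 8 KV heads, head_dim 128, 16-bit cache entries.
    for ctx in (4096, 16384, 32768):
        print(f"{ctx} tokens: ~{kv_cache_gib(ctx, 32, 8, 128):.1f} GiB")
```

With the assumed shape, a 4K context comes out at 0.5 GiB, in line with the 7B column. The estimate scales linearly with context length and depends heavily on the number of KV heads, so it will not match every rounded figure in the table.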

Q4 vs Q8: Is Quality Affected?

Q4_K_M is the sweet spot for most users. It cuts VRAM roughly in half compared to Q8 while retaining 95–98% of the model's quality, a difference most users will not notice in normal use. Q8 is only worth it if you need maximum accuracy for tasks like math or coding benchmarks.

Format | VRAM vs FP16 | Quality loss | Use when
Q2_K | ~88% less | Noticeable | Desperate for VRAM
Q4_K_M | ~75% less | Minimal | Most users (recommended)
Q5_K_M | ~69% less | Negligible | You have headroom
Q8_0 | ~50% less | None | Accuracy-critical tasks
FP16 | Baseline | None | Fine-tuning / training
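
The "VRAM vs FP16" column follows from the number of bits stored per weight relative to FP16's 16 bits. The sketch below assumes nominal bit widths (2, 4, 5, and 8). Real quantized files also store scaling metadata, so actual savings are somewhat smaller. The function name is illustrative.

```python
def savings_vs_fp16(bits_per_weight: float) -> float:
    """Percent reduction in weight memory compared with 16-bit weights."""
    return (1 - bits_per_weight / 16) * 100


if __name__ == "__main__":
    # Nominal bit widths: an assumption, not measured file sizes.
    nominal_bits = {"Q2_K": 2, "Q4_K_M": 4, "Q5_K_M": 5, "Q8_0": 8}
    for name, bits in nominal_bits.items():
        print(f"{name}: {savings_vs_fp16(bits):.2f}% less than FP16")
```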

What Can I Run on My GPU?

GPU / Memory | Best fit (model and quant) | Notes
4 GB VRAM | 3B Q8 or 7B Q2 | Very limited; consider upgrading
6 GB VRAM | 7B Q4 | Tight fit; little context headroom
8 GB VRAM | 7B Q8 or 13B Q4 (partial) | RTX 4060; sweet spot for 7B
12 GB VRAM | 13B Q4 | RTX 4070; comfortable for 13B
16 GB VRAM | 13B Q8 or 30B Q4 | RTX 4060 Ti 16GB; excellent value
24 GB VRAM | 30B Q4 / 34B Q4 | RTX 4090; best consumer GPU
48 GB VRAM | 70B Q4 | 2× RTX 3090 or RTX A6000
96–192 GB unified | 70B FP16 | Mac Studio M4 Max/Ultra; 405B Q4 needs ~220 GB, so it does not fit in 192 GB
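
Going the other way, the guide's formula can be inverted to ask how large a model a given card can hold. This is a rough sketch under the same rule-of-thumb assumptions (~2 GB overhead, nominal bytes per parameter). The function name and the default 1 GB KV cache allowance are illustrative choices, and the result is an upper bound rather than a recommendation.

```python
# Rule-of-thumb values from this guide's formula.
BYTES_PER_PARAM = {"Q4_K_M": 0.5, "Q8_0": 1.0, "FP16": 2.0}
OVERHEAD_GB = 2.0


def largest_params_billion(vram_gb: float, quant: str,
                           kv_cache_gb: float = 1.0) -> float:
    """Largest model size (billions of parameters) that fits in `vram_gb`."""
    usable_gb = max(vram_gb - OVERHEAD_GB - kv_cache_gb, 0.0)
    return usable_gb / BYTES_PER_PARAM[quant]


if __name__ == "__main__":
    for vram in (8, 12, 24, 48):
        size = largest_params_billion(vram, "Q4_K_M")
        print(f"{vram} GB VRAM: up to ~{size:.0f}B parameters at Q4_K_M")
```

Because the formula is conservative about overhead, it can come out slightly stricter than some rows in the table.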

Recommended GPUs by VRAM

RTX 4060 8GB: Best 8 GB GPU for 7B models
RTX 4060 Ti 16GB: Best value for 13B–30B models
RTX 4090 24GB: Best consumer GPU for LLMs

Frequently Asked Questions

How much VRAM do I need for a 7B LLM?

At Q4_K_M: about 4.5–5.5 GB. An 8 GB GPU like the RTX 4060 runs 7B models comfortably with room for context. At Q8, plan for 7–8 GB.

How much VRAM do I need for a 13B model?

About 8–9 GB at Q4_K_M. A 12 GB GPU works but is tight. A 16 GB GPU (RTX 4060 Ti 16GB) is the comfortable minimum for 13B at Q4 with reasonable context.

How much VRAM do I need for a 70B model?

About 37–40 GB at Q4_K_M. No single consumer GPU can hold this — you need two RTX 3090s (48 GB combined), an RTX 6000 Ada (48 GB), or a Mac Studio with 96 GB+ unified memory.

Does VRAM include context window memory?

No. The table above covers model weights plus ~2 GB of overhead, not the KV cache. The KV cache for your conversation adds 0.5–2 GB for typical chats (4K–8K context). For 32K+ context, budget several GB extra.

What happens if my model does not fit in VRAM?

The runtime offloads the layers that do not fit to system RAM, where inference is roughly 10–50x slower. Fitting the entire model in VRAM is the single biggest factor for inference speed: even one offloaded layer can halve your tokens/second.
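
When a model cannot fit entirely, a rough way to estimate how many layers stay on the GPU is to divide the usable VRAM by the per-layer weight size. The sketch below assumes all layers are the same size and ignores embeddings and KV cache growth. The function name, the 2 GB reserve, and the 40-layer count for a 13B model are illustrative assumptions. A count like this is the kind of value passed to runtime options such as llama.cpp's `--n-gpu-layers`.

```python
import math


def layers_that_fit(weights_gb: float, n_layers: int, vram_gb: float,
                    reserved_gb: float = 2.0) -> int:
    """Estimate how many equally sized layers fit in VRAM after a fixed reserve."""
    per_layer_gb = weights_gb / n_layers
    usable_gb = max(vram_gb - reserved_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # 13B at Q4_K_M: 13 x 0.5 = 6.5 GB of weights, assumed to span 40 layers.
    for vram in (6, 8, 12):
        n = layers_that_fit(6.5, 40, vram)
        print(f"{vram} GB VRAM: ~{n} of 40 layers on the GPU")
```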

Sources & methodology

VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide specifically I leaned on:

Spot a number that does not match the linked source? Email [email protected] and I will update the guide.