How Much VRAM Do I Need for Local LLMs?

Q: How much VRAM do I need for a 7B LLM?

A 7B model needs about 4-5 GB VRAM at Q4_K_M quantization, 7 GB at Q8, and 14 GB at FP16 (full precision). An 8 GB GPU like the RTX 4060 runs 7B models at Q4/Q8 comfortably. A 6 GB GPU can run Q4 but leaves little headroom.

Q: How much VRAM do I need for a 13B model?

A 13B model needs about 8 GB VRAM at Q4_K_M, 13 GB at Q8, and 26 GB at FP16. A 12 GB GPU can handle Q4 with context headroom, but 16 GB (RTX 4060 Ti 16GB or RTX 4080) gives more comfortable room for longer conversations.

Q: How much VRAM do I need for a 70B model?

A 70B model needs about 38-40 GB VRAM at Q4_K_M. No single consumer GPU can hold this. Options: two RTX 3090s (48 GB combined), an RTX 6000 Ada (48 GB), or Mac Studio with 96-192 GB unified memory. You can also use CPU offloading but speed drops significantly.

Q: Does VRAM include context window memory?

Yes. The VRAM numbers above are for the model weights alone. The KV cache for your conversation context also uses VRAM. A 128K context window can add 4-16 GB on top of model weights depending on the model architecture. Budget 2-4 GB extra for typical chat sessions.

Q: What happens if my model does not fit in VRAM?

The model falls back to CPU offloading, which is 10-50x slower. For example, an RTX 4060 8GB running a 13B Q4 model will offload ~3 GB to RAM and drop from ~30 tokens/s to ~2-5 tokens/s. Fitting the entire model in VRAM is the single biggest factor for inference speed.

Updated May 2026 — covers all model sizes, quant levels, and context overhead

Quick rule: VRAM needed = model parameters × bytes per parameter + ~2 GB overhead. At Q4 quantization, that's roughly 0.5 bytes per parameter. A 7B model at Q4 needs ~5.5 GB; a 70B model needs ~37 GB.

VRAM by Model Size and Quantization

Model Size	Q4_K_M	Q8_0	FP16	Minimum GPU
1B–3B	1–2 GB	1–3 GB	2–6 GB	Any GPU (even GTX 1060 6GB)
7B	4.5–5.5 GB	7–8 GB	14–16 GB	RTX 4060 8GB (Q4/Q8)
13B	7–9 GB	13–14 GB	26–28 GB	RTX 4060 Ti 16GB (Q4)
20B–30B	12–17 GB	20–30 GB	40–60 GB	RTX 3090 / RTX 4090 (Q4)
70B	37–40 GB	70–75 GB	140 GB	2× RTX 3090 or Mac Studio
405B	~220 GB	~405 GB	~810 GB	Multi-GPU workstation

These numbers include ~2 GB system overhead but not the KV cache for your conversation context. See the KV cache section below.

The Formula

VRAM (GB) = (parameters × bytes_per_param) + overhead

bytes_per_param:

  Q4_K_M = 0.5 bytes

  Q8_0   = 1.0 bytes

  FP16   = 2.0 bytes

  FP32   = 4.0 bytes

overhead = ~2 GB (activations, KV cache baseline, CUDA runtime)

Example: 13B model at Q4_K_M = (13 × 10⁹ × 0.5) / 10⁹ + 2 = 8.5 GB

Don't Forget the KV Cache

Every token in your conversation occupies VRAM in the KV (key-value) cache. For long conversations or large context windows, this can add several GB on top of the model weights.

Context (tokens)	7B model	13B model	70B model
4K (typical chat)	~0.5 GB	~0.8 GB	~2 GB
16K	~1.5 GB	~3 GB	~8 GB
32K	~3 GB	~6 GB	~16 GB
128K	~12 GB	~24 GB	~64 GB

For most chat use cases with 4K–8K context, add 1–2 GB on top of model weights. Running at 32K+ context? Budget significantly more.

Q4 vs Q8: Is Quality Affected?

Q4_K_M is the sweet spot for most users. It cuts VRAM roughly in half compared to Q8 while retaining 95-98% of the model's quality. The difference is imperceptible in normal use. Q8 is only worth it if you need maximum accuracy for tasks like math or coding benchmarks.

Format	VRAM vs FP16	Quality loss	Use when
Q2_K	~88% less	Noticeable	Desperate for VRAM
Q4_K_M	~75% less	Minimal	Most users (recommended)
Q5_K_M	~69% less	Negligible	You have headroom
Q8_0	~50% less	None	Accuracy-critical tasks
FP16	Baseline	None	Fine-tuning / training

What Can I Run on My GPU?

GPU / Memory	Best model at Q4	Notes
4 GB VRAM	3B Q8 or 7B Q2	Very limited; consider upgrading
6 GB VRAM	7B Q4	Tight fit; little context headroom
8 GB VRAM	7B Q8 or 13B Q4 (partial)	RTX 4060; sweet spot for 7B
12 GB VRAM	13B Q4	RTX 4070; comfortable for 13B
16 GB VRAM	13B Q8 or 30B Q4	RTX 4060 Ti 16GB; excellent value
24 GB VRAM	30B Q4 / 34B Q4	RTX 4090; best consumer GPU
48 GB VRAM	70B Q4	2× RTX 3090 or RTX A6000
96–192 GB unified	70B FP16 / 405B Q4	Mac Studio M4 Max/Ultra

Recommended GPUs by VRAM

💡

RTX 4060 8GB

Best 8 GB GPU for 7B models

View on Amazon

⭐

RTX 4060 Ti 16GB

Best value for 13B–30B models

View on Amazon

🏆

RTX 4090 24GB

Best consumer GPU for LLMs

View on Amazon

Frequently Asked Questions

How much VRAM do I need for a 7B LLM?

At Q4_K_M: about 4.5–5.5 GB. An 8 GB GPU like the RTX 4060 runs 7B models comfortably with room for context. At Q8, plan for 7–8 GB.

How much VRAM do I need for a 13B model?

About 8–9 GB at Q4_K_M. A 12 GB GPU works but is tight. A 16 GB GPU (RTX 4060 Ti 16GB) is the comfortable minimum for 13B at Q4 with reasonable context.

How much VRAM do I need for a 70B model?

About 37–40 GB at Q4_K_M. No single consumer GPU can hold this — you need two RTX 3090s (48 GB combined), an RTX 6000 Ada (48 GB), or a Mac Studio with 96 GB+ unified memory.

Does VRAM include context window memory?

No — the table above is for model weights only. The KV cache for your conversation adds 0.5–2 GB for typical chats (4K–8K context). For 32K+ context, budget several GB extra.

What happens if my model does not fit in VRAM?

The model offloads layers to RAM. This is 10–50x slower. Fitting the entire model in VRAM is the single biggest factor for inference speed. Even 1 layer offloaded can halve your tokens/second.

Related Guides

Sources & methodology

VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide specifically I leaned on:

Modal: How much VRAM do I need for inference — primary source for the bytes-per-parameter math and the ~2 GB overhead allowance.
XiongjieDai GPU-Benchmarks-on-LLM-Inference — empirical "what fits where" check against real llama-bench runs on a wide range of GPUs.
Home GPU LLM Leaderboard — community VRAM/throughput leaderboard used to confirm the per-tier model picks.
llama.cpp llama-bench discussion — upstream benchmarking thread documenting the quantization formats (Q4_K_M, Q5_K_M, Q8_0) and their per-weight byte costs.

Spot a number that does not match the linked source? Email [email protected] and I will update the guide.