LLM VRAM Calculator
Calculate how much GPU memory you need to run any language model locally.
Compatible Hardware
- Apple Mac mini (M4, 16GB): 16 GB unified memory
- Apple Mac mini (M4, 24GB): 24 GB unified memory
- Apple Mac mini (M4 Pro, 24GB): 24 GB unified memory
- Apple Mac mini (M4 Pro, 48GB): 48 GB unified memory
- Apple Mac Studio (M4 Max, 64GB): 64 GB unified memory
- Apple Mac Studio (M4 Max, 128GB): 128 GB unified memory
- NVIDIA GeForce RTX 4060 (8GB): 8 GB VRAM
- NVIDIA GeForce RTX 4070 (12GB): 12 GB VRAM
- NVIDIA GeForce RTX 4090 (24GB): 24 GB VRAM
- NVIDIA GeForce RTX 3090 (24GB): 24 GB VRAM
- NVIDIA GeForce RTX 5090 (32GB): 32 GB VRAM
- Intel Arc B580 (12GB): 12 GB VRAM
- AMD Radeon RX 7900 XTX (24GB): 24 GB VRAM
- NVIDIA GeForce RTX 4080 (16GB): 16 GB VRAM
- NVIDIA GeForce RTX 4060 Ti (16GB): 16 GB VRAM
- NVIDIA GeForce RTX 5080 (16GB): 16 GB VRAM
- NVIDIA GeForce RTX 5070 (12GB): 12 GB VRAM
- NVIDIA GeForce RTX 5070 Ti (16GB): 16 GB VRAM
- NVIDIA GeForce RTX 4070 Ti Super (16GB): 16 GB VRAM
- NVIDIA DGX Spark: 128 GB unified memory
- Beelink SEi12 + eGPU RTX 4090 (24GB): 24 GB VRAM
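The page does not say exactly how it decides which devices are compatible. A minimal sketch, assuming a device qualifies when its memory covers the estimated requirement plus an optional safety margin (the `Device` class, `compatible_devices` function and `headroom_gb` parameter are hypothetical names for illustration). A margin is worth leaving on Apple Silicon and DGX Spark, where the OS and other apps share the same memory pool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    memory_gb: float  # VRAM, or unified memory on Apple Silicon / DGX Spark


# A subset of the Compatible Hardware list above.
DEVICES = [
    Device("NVIDIA GeForce RTX 4060 (8GB)", 8),
    Device("NVIDIA GeForce RTX 4070 (12GB)", 12),
    Device("NVIDIA GeForce RTX 4090 (24GB)", 24),
    Device("NVIDIA GeForce RTX 5090 (32GB)", 32),
    Device("Apple Mac Studio (M4 Max, 64GB)", 64),
    Device("NVIDIA DGX Spark", 128),
]


def compatible_devices(devices, needed_gb, headroom_gb=0.0):
    """Names of devices whose memory covers the estimate plus a safety margin."""
    return [d.name for d in devices if d.memory_gb >= needed_gb + headroom_gb]


# A 32B model at Q4_K_M is listed at ~18 GB in the table below.
print(compatible_devices(DEVICES, 18))
# Leaving 8 GB spare (e.g. for a longer context) drops the 24 GB cards.
print(compatible_devices(DEVICES, 18, headroom_gb=8))
```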
How Much VRAM Do You Need for LLMs?
GPU VRAM (or Apple Silicon unified memory) is the primary bottleneck when running large language models locally. The model weights must fit entirely in VRAM for efficient inference; otherwise the runtime offloads part of the model to system RAM, which is much slower.
The VRAM requirement scales linearly with parameter count and quantization bit-width:
- Q4_K_M: the most popular choice for consumer GPUs. Roughly 30% of the FP16 footprint (about 4.8 bits per weight on average) with minimal quality loss (~1–3%).
- Q8: near-lossless quality at about half the FP16 footprint. Good for developers who need reliable outputs.
- FP16: maximum quality and maximum VRAM. Typically used for full fine-tuning and research.
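The bit-width is what drives the ratios in the list above. A small sketch of the weights-only arithmetic, assuming "Q8" means llama.cpp's Q8_0 (32 int8 weights plus one 16-bit scale per block, so 8.5 bits per weight) and taking 4.8 bits per weight as an assumed average for Q4_K_M, which mixes 4-bit and 6-bit blocks so the true figure varies slightly by model. `weight_gb` is a hypothetical helper name.

```python
# Average bits per weight. FP16 and Q8_0 follow from their storage layout;
# the Q4_K_M value is an assumed typical average, not an exact constant.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,    # 32 int8 weights + one 16-bit scale per block
    "Q4_K_M": 4.8,  # assumption: mix of 4-bit and 6-bit k-quant blocks
}


def weight_gb(params_billions: float, bits: float) -> float:
    """Size of the weights alone in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits / 8


for name, bits in BITS_PER_WEIGHT.items():
    ratio = bits / BITS_PER_WEIGHT["FP16"]
    print(f"{name:7s} {weight_gb(8, bits):5.1f} GB for 8B params ({ratio:.0%} of FP16)")
```

For an 8B model this prints 16.0, 8.5 and 4.8 GB, i.e. about 53% of FP16 for Q8_0 and 30% for Q4_K_M, before any runtime overhead.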
VRAM Requirements by Model Size
| Model Size | Q4_K_M | Q8 | FP16 | Example Models |
|---|---|---|---|---|
| 1B | 2 GB | 3 GB | 4 GB | Qwen3 0.6B, SmolLM2 |
| 3B | 3 GB | 5 GB | 8 GB | Qwen3 4B, Phi-3.5 Mini, Llama 3.2 3B |
| 8B | 5 GB | 9 GB | 17 GB | Qwen3 8B, Llama 3.1 8B, DeepSeek-R1-Distill-8B |
| 14B | 8 GB | 15 GB | 29 GB | Qwen3 14B, Phi-4 14B, DeepSeek-R1-Distill-14B |
| 32B | 18 GB | 34 GB | 66 GB | Qwen3 32B, DeepSeek-R1-Distill-32B |
| 70B | 37 GB | 72 GB | 142 GB | Llama 3.3 70B, Qwen2.5 72B, DeepSeek-R1-Distill-70B |
The Formula
VRAM (GB) ≈ parameters (billions) × bits per weight ÷ 8 + 1.5 GB

The +1.5 GB base overhead covers the KV cache at 4K context, activation buffers, and framework runtime. Longer context windows increase this significantly: a 128K context window adds roughly 20–25 GB depending on model architecture.
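As a sketch of the estimate, assuming weights = parameters × bits per weight ÷ 8 with nominal widths of 4, 8 and 16 bits, plus the fixed 1.5 GB overhead described above. This reproduces every row of the size table to within 0.5 GB; the page's rounding rule is not stated, so exact matches are not expected. `estimate_vram_gb` and `QUANT_BITS` are hypothetical names.

```python
OVERHEAD_GB = 1.5  # base overhead at 4K context, as described above

# Assumed nominal bit-widths used for the size table.
QUANT_BITS = {"Q4_K_M": 4, "Q8": 8, "FP16": 16}


def estimate_vram_gb(params_billions, bits, overhead_gb=OVERHEAD_GB):
    """Weights (parameters x bits / 8, in GB) plus a fixed overhead."""
    return params_billions * bits / 8 + overhead_gb


for size in (1, 3, 8, 14, 32, 70):
    row = "  ".join(
        f"{name} {estimate_vram_gb(size, bits):6.1f}" for name, bits in QUANT_BITS.items()
    )
    print(f"{size:>3}B  {row}  (GB)")
```

For 70B this gives 36.5, 71.5 and 141.5 GB against the table's 37, 72 and 142 GB.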
For Mixture-of-Experts models (Llama 4 Scout, Mixtral), use the total parameter count in the calculator: all expert weights must be loaded into VRAM even though only a fraction are active for each token. For example, Llama 4 Scout is marketed by its 17B active parameters but has ~109B total parameters, requiring ~60 GB at Q4_K_M. See the Llama 4 guide for details.
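To see why the active count is misleading, here is a sketch comparing the memory for all experts with a naive estimate based on active parameters only. It assumes the same params × bits ÷ 8 + 1.5 GB estimate and a nominal 4 bits per weight, so it lands a little under the ~60 GB quoted above (real Q4_K_M files average slightly more than 4 bits). `moe_vram_gb` is a hypothetical helper.

```python
def moe_vram_gb(total_params_b, active_params_b, bits, overhead_gb=1.5):
    """Return (memory to load all experts, naive estimate from active params).

    Every expert must be resident in memory, so the first value is what
    actually counts; the second is the common mistake.
    """
    must_load = total_params_b * bits / 8 + overhead_gb
    naive = active_params_b * bits / 8 + overhead_gb
    return must_load, naive


# Llama 4 Scout: ~109B total, 17B active, at a nominal 4 bits per weight.
load, naive = moe_vram_gb(109, 17, bits=4)
print(f"load all experts: {load:.1f} GB, active-only guess: {naive:.1f} GB")
```

The gap (56 GB versus 10 GB here) is the difference between a model that needs a 64 GB machine and one that would seem to fit on a 12 GB card.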
Frequently Asked Questions
How much VRAM does Llama 3.3 70B need?
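Llama 3.3 70B needs approximately 26 GB at Q2_K, 43 GB at Q4_K_M, and 75 GB at Q8. The Q4_K_M and Q8 figures are higher than the 70B row in the table above because those formats average slightly more than 4 and 8 bits per weight in practice. Q4_K_M is the recommended quality level and requires a Mac Studio M4 Max 64 GB or, on NVIDIA hardware, dual RTX 4090 GPUs.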
How much VRAM does Gemma 3 need?
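Gemma 3 1B needs ~1 GB (runs on CPU). Gemma 3 4B needs ~3 GB (any GPU). Gemma 3 12B needs ~7.5 GB (an 8 GB GPU at Q4). Gemma 3 27B needs ~16 GB at Q4_K_M, which is a tight fit on a 16 GB GPU: the weights take nearly all of the memory, so keep the context short. Grouped query attention helps by keeping the KV cache small, but it does not shrink the weights.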
How much VRAM does Phi-4 need?
Phi-4 (14B) needs approximately 9 GB at Q4_K_M and 15 GB at Q8. Any 12 GB GPU like the Intel Arc B580 or RTX 4070 can run it at Q4_K_M. Phi-4-mini (3.8B) needs only 2.5 GB — runs on any hardware.
How much VRAM does Qwen3 need?
Qwen3 8B needs ~5.5 GB (fits any 8 GB GPU). Qwen3 14B needs ~9 GB (requires 12 GB+). Qwen3 32B needs ~18 GB (fits 24 GB GPUs at Q4_K_M). Qwen3 30B-A3B MoE needs ~20 GB at Q4_K_M: only about 3B parameters are active per token, but all ~30B weights must be loaded.
Can I run a 13B model on 8 GB VRAM?
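Not fully on the GPU. A 13B model needs about 8.5 GB at Q4_K_M, which exceeds 8 GB, so part of the model would have to be offloaded to system RAM and generation slows down sharply. For 13B models, plan on at least 12 GB of VRAM; with 8 GB, models up to 8B at Q4_K_M are the practical limit.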
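Runtimes such as llama.cpp let you choose how many layers go to the GPU (its `--n-gpu-layers` / `-ngl` option), with the rest running from system RAM. A rough sketch of how many layers fit, assuming weights are spread evenly across layers and 1.5 GB is held back for KV cache and buffers; it ignores the embedding and output layers, and the 40-layer, ~4.8 bits-per-weight 13B model is a hypothetical example (40 layers matches Llama 2 13B). `gpu_layers_that_fit` is a hypothetical name.

```python
import math


def gpu_layers_that_fit(weights_gb, n_layers, vram_gb, reserve_gb=1.5):
    """How many whole layers fit in VRAM, assuming weights are spread evenly
    across layers and reserve_gb stays free for KV cache and buffers."""
    per_layer_gb = weights_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Hypothetical 13B model, 40 layers, ~4.8 bits per weight:
# 13 * 4.8 / 8 = 7.8 GB of weights on an 8 GB GPU.
on_gpu = gpu_layers_that_fit(weights_gb=7.8, n_layers=40, vram_gb=8)
print(f"{on_gpu} of 40 layers on the GPU, {40 - on_gpu} offloaded to system RAM")
```

Here roughly a sixth of the layers end up in system RAM, and those layers set the pace for every generated token.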
Does context length affect VRAM?
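Yes. The KV cache grows linearly with context length: 4K adds ~1.5 GB, 32K adds ~8–10 GB, and 128K adds ~20–25 GB, with the exact amount depending on the model's layer count and attention design. Use the context overhead slider above to account for your target window.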
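The KV cache stores one key and one value vector per layer, per KV head, for every token in the context, so its size is a simple product. A sketch assuming an FP16 cache (2 bytes per element) and a Llama 3.1 8B-style configuration (32 layers, 8 KV heads, head dimension 128); `kv_cache_gb` is a hypothetical helper, and the runtime's own buffers are not included.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache size in GB: a key and a value vector per layer, per KV head, per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1e9


# Llama 3.1 8B-style config: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 cache.
for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens: {kv_cache_gb(32, 8, 128, ctx):5.2f} GB")
```

For this configuration the sketch gives about 0.54 GB at 4K, 4.29 GB at 32K and 17.18 GB at 128K. A model with more layers or KV heads scales up proportionally (Llama 3.x 70B has 80 layers), and an 8-bit cache (`bytes_per_elem=1`) halves every figure.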
How much VRAM for a 7B model?
A 7B model needs approximately 5 GB at Q4_K_M, 8 GB at Q8, and 15 GB at FP16. Any 8 GB GPU like the RTX 4060 can run it at Q4_K_M with good performance.
How much VRAM for a 70B model?
A 70B model needs approximately 37 GB at Q4_K_M, 72 GB at Q8, and 142 GB at FP16. To run it at Q4_K_M on consumer hardware you need a Mac Studio M4 Max 64 GB or dual RTX 4090s (48 GB combined).
Embed This Calculator
Add the VRAM calculator to your blog or documentation site. Copy the snippet below — no signup required.
```html
<iframe src="https://llmhardware.io/embed/vram-calculator" width="100%" height="420" style="border:none;border-radius:10px;" loading="lazy" title="LLM VRAM Calculator"></iframe>
```
Related Guides
- What Can I Run on My GPU? VRAM tier lookup (8GB, 12GB, 24GB, 48GB, 64GB): which models fit each tier.
- LLM Model Sizes Explained: understand 7B, 13B, 70B and what the parameter count means for your hardware.
- Quantization Explained: Q4_K_M vs Q8 vs FP16, and how quantization affects VRAM and quality.
- Best LLMs to Run Locally: top model picks by VRAM tier (Qwen3, DeepSeek R1, Gemma 3, Phi-4, Llama 3.3).
- Best GPU for LLMs: every budget tier, ranked by VRAM value.