What LLMs Can I Run? Complete VRAM Guide

"I have X GB of VRAM — what can I actually run?" is one of the most common questions in local LLM communities. This guide gives you a direct answer organized by VRAM tier, with specific model recommendations, quantization choices, and speed estimates for each. Jump to your tier below, or start with the quick reference table.

Quick Reference: VRAM to Models

VRAMMax model at Q4_K_MTop recommendation
8 GB 7B at Q4_K_M Qwen3 8B Q4_K_M
12 GB 13B at Q4_K_M Qwen3 14B Q4_K_M
16 GB 13B at Q8 Qwen3 14B Q8
24 GB 34B at Q4_K_M Qwen3 32B Q4_K_M
32 GB 34B at Q8 Qwen3 32B Q8
48 GB 70B at Q4_K_M Llama 3.3 70B Q4_K_M
64 GB 70B at Q5_K_M Llama 3.3 70B Q5_K_M
128 GB 70B at FP16 / 100B+ at Q4 Llama 4 Scout 109B Q4

Use the VRAM Calculator to get exact figures for any model and quantization combination.

8 GB VRAM

RTX 4060, RTX 3070 (8 GB), older 8 GB cards

Can run

  • 7B models at Q4_K_M (~4.5 GB) — roughly 30–40 tok/s on RTX 4060
  • 3B models at Q8 (~3.5 GB) — faster, lower quality ceiling
  • 1B models at FP16 — very fast, lightweight tasks

Cannot run

  • 13B models at any quantization without CPU offloading
  • 34B or 70B models at useful speed

Best models for this tier

ModelQuantizationWhy
Qwen3 8B Q4_K_M Top reasoning and instruction-following at this size
DeepSeek-R1-Distill-8B Q4_K_M Best thinking model for 8 GB — chain-of-thought reasoning
Gemma 3 4B Q4_K_M Google's 4B model — excellent quality-to-size ratio, ~3 GB
Llama 3.1 8B Instruct Q4_K_M Rock-solid general purpose chat and coding
Phi-4-mini Q4_K_M Punches above its weight on math and code

12 GB VRAM

RTX 5070 12GB, RTX 4070 12GB, Intel Arc B580, RTX 3080 10GB

Can run

  • 13B at Q4_K_M (~8–9 GB) with good context headroom
  • 7B at Q8 (~8.5 GB) — near-lossless quality for 7B
  • 7B at FP16 — maximum quality, tight fit at ~14 GB (use Q8 instead)

Cannot run

  • 34B models comfortably — they need 20+ GB at Q4_K_M
  • 13B at Q8 without headroom issues (~14 GB)

Best models for this tier

ModelQuantizationWhy
Qwen3 14B Q4_K_M Best 14B in 2026 — strong reasoning and coding
DeepSeek-R1-Distill-14B Q4_K_M Reasoning/thinking model at 14B class
Gemma 3 12B Q4_K_M Google's 12B — ~8 GB at Q4, solid general performance
Llama 3.1 8B Instruct Q8 Near-lossless 8B — excellent quality
Phi-4 14B Q4_K_M Strong on math and code at 13B class

16–20 GB VRAM

RTX 4060 Ti 16GB, RTX 5070 Ti 16GB, RTX 4070 Ti Super 16GB, RTX 4080 16GB, RTX 5080 16GB, M4 Mac mini 16 GB unified

Can run

  • 13B at Q8 (~14 GB) — excellent quality, fits comfortably
  • 7B at FP16 (~14 GB) — full precision inference
  • 34B at aggressive Q3 quantization — reduced quality, not recommended

Cannot run

  • 34B at Q4_K_M comfortably (~20 GB — very tight or impossible)
  • 70B at any quantization without heavy CPU offloading

Best models for this tier

ModelQuantizationWhy
Qwen3 14B Q8 Near-lossless 14B quality — the sweet spot for 16 GB
DeepSeek-R1-Distill-14B Q8 High-quality reasoning at full Q8 fidelity
Llama 3.1 8B FP16 Full precision 8B — best 8B quality possible

24 GB VRAM

RTX 4090, RTX 3090, RX 7900 XTX, Mac mini M4 24 GB unified

Can run

  • 34B at Q4_K_M (~20 GB) — flagship consumer GPU sweet spot
  • 13B at Q8 (~14 GB) — excellent quality with headroom
  • 7B at FP16 (~14 GB) — full precision 7B
  • 22B models like Codestral comfortably at Q4_K_M or Q5_K_M

Cannot run

  • 70B without CPU offloading (40 GB at Q4_K_M)
  • 34B at FP16 (~70 GB)

Best models for this tier

ModelQuantizationWhy
Qwen3 32B Q4_K_M Flagship 32B — strong reasoning, fits comfortably
DeepSeek-R1-Distill-32B Q4_K_M Best thinking/reasoning model at this tier
Qwen3 14B Q8 Near-lossless 14B — maximum quality for 14B class
Codestral 22B Q4_K_M Best dedicated coding model at this tier

32 GB VRAM

RTX 5090 (32 GB), some pro/workstation GPUs

Can run

  • 34B at Q8 (~34 GB — very tight, minimal context headroom)
  • 34B at Q6_K (~27 GB) — comfortable with good headroom
  • 70B at Q4_K_M with ~10 GB CPU offload — usable but slower

Cannot run

  • 70B fully in VRAM (needs ~40 GB at Q4_K_M)
  • 34B at FP16 (~70 GB)

Best models for this tier

ModelQuantizationWhy
Qwen3 32B Q8 Near-lossless 32B quality — flagship tier for RTX 5090
DeepSeek-R1-Distill-32B Q8 Full-quality thinking model, fits at ~34 GB
Llama 3.3 70B Q4_K_M Partial offload — reduced speed but 70B quality

48 GB VRAM

Mac mini M4 Pro 48 GB · dual RTX 3090 NVLink (48 GB) · Quadro A6000

Can run

  • 70B at Q4_K_M (~40 GB) — fits comfortably, the key milestone of this tier
  • 34B at FP16 (~70 GB — tight on dual 3090; fine on M4 Pro 48 GB)
  • 34B at Q8 (~34 GB) with significant headroom

Cannot run

  • 70B at Q8 (~74 GB) — exceeds 48 GB
  • 100B+ models at any practical quantization

Best models for this tier

ModelQuantizationWhy
Llama 3.3 70B Instruct Q4_K_M This tier unlocks 70B — the flagship open model for 2026
Qwen3 32B FP16 Full precision 32B — maximum quality at this tier
DeepSeek-R1-Distill-70B Q4_K_M Best reasoning/thinking model that fits here

64 GB VRAM

Mac Studio M4 Max 64 GB unified

Can run

  • 70B at Q5_K_M (~47 GB) — comfortable fit, strong quality
  • 70B at Q6_K (~55 GB) — high quality with headroom
  • 70B at Q8 (~74 GB) — tight to impossible; use Q6_K instead

Cannot run

  • 70B at Q8 comfortably — 74 GB exceeds 64 GB
  • 100B+ models at Q4_K_M comfortably

Best models for this tier

ModelQuantizationWhy
Llama 3.3 70B Instruct Q5_K_M Excellent quality, comfortable headroom at ~47 GB
DeepSeek-R1-Distill-70B Q5_K_M High-quality reasoning at full 70B scale
Llama 3.3 70B Instruct Q6_K Near-lossless quality, still fits at ~55 GB

128 GB VRAM

Mac Studio M4 Max 128 GB unified

Can run

  • 70B at Q8 (~74 GB) — full near-lossless quality with headroom
  • 70B at FP16 (~140 GB) — very tight, use Q8 instead
  • 100B+ models at Q4_K_M — this tier unlocks early 100B+ models

Cannot run

  • 70B at FP16 comfortably — 140 GB exceeds 128 GB; use Q8
  • Very large 200B+ models

Best models for this tier

ModelQuantizationWhy
Llama 3.3 70B Instruct Q8 Full near-lossless quality at 70B — the best 70B inference
Llama 4 Scout 109B Q4_K_M First MoE 100B+ model accessible at this tier (~60 GB)
Qwen3 32B Q8 Maximum quality 32B — extremely fast on Mac Studio

Rules of Thumb

  1. 1. Estimate VRAM with: parameters × bytes-per-param. A Q4_K_M model uses ~0.5 bytes per parameter. A 7B model at Q4_K_M: 7,000,000,000 × 0.5 = 3.5 GB of weights, plus ~1 GB overhead = roughly 4.5 GB total.
  2. 2. Leave ~2 GB headroom for context and KV cache. A 7B model that weighs 4.5 GB can still exhaust an 8 GB GPU if you feed it very long prompts. Keep headroom, especially with larger context windows.
  3. 3. CPU offloading cuts speed dramatically. Offloading 20–30% of a model's layers to CPU RAM typically reduces token generation to under 5 tokens per second — barely usable for interactive chat. Prefer running a smaller model fully in VRAM over a larger model with offloading.
  4. 4. Apple unified memory counts the same as VRAM. On Mac, the full unified memory pool is available for model loading. A 48 GB M4 Pro Mac mini can load a 70B Q4_K_M model just like a 48 GB discrete GPU setup.
  5. 5. Q4_K_M is the right default for almost everyone. It gives excellent quality at 4× the VRAM savings over FP16. Only go lower (Q3) if you have no other option. Go higher (Q6_K, Q8) when you have VRAM headroom and quality matters. See the quantization guide for full details.
  6. 6. A smaller model fully in VRAM beats a larger model split across GPU and CPU. A 7B model at Q8 running entirely on an 8 GB GPU will be faster and more responsive than a 13B model offloaded 40% to RAM.

Frequently Asked Questions

Can I run a 7B model on 8 GB VRAM?

Yes. A 7B model at Q4_K_M quantization uses roughly 4.5 GB VRAM, which fits comfortably on an 8 GB GPU like the RTX 4060 with room left for context. You can expect around 30–40 tokens per second. You cannot run 13B or larger models at this tier without CPU offloading, which significantly reduces speed.

Can I run a 70B model on a 24 GB GPU?

Not fully in VRAM. A 70B model at Q4_K_M requires roughly 40 GB VRAM. On a 24 GB GPU like the RTX 4090 you would need to offload a significant portion of layers to CPU RAM, which typically cuts speed to under 5 tokens per second. For 70B models without offloading, you need 48 GB or more of VRAM.

Does Apple unified memory count the same as VRAM?

Yes, for LLM purposes Apple Silicon unified memory works like GPU VRAM — the model loads entirely into it. A 24 GB M4 Mac mini behaves similarly to a 24 GB discrete GPU for inference. Unified memory bandwidth is typically lower than high-end discrete GPU VRAM, so token speeds may differ, but large models fit and run without CPU offloading.

What is the best model for an 8 GB GPU?

The best models for 8 GB VRAM are Qwen3 8B at Q4_K_M for top reasoning quality, DeepSeek-R1-Distill-8B at Q4_K_M for thinking/reasoning, Phi-4-mini for coding and math, and Gemma 3 4B at Q4_K_M for a fast, capable 4B option. All fit within 8 GB with headroom for context.

How much VRAM does a 13B model need?

A 13B model at Q4_K_M quantization requires approximately 8–9 GB VRAM, fitting on a 12 GB GPU with some headroom. At Q8 it needs around 14 GB, requiring a 16 GB or 24 GB GPU. At FP16 it needs roughly 27 GB, placing it in 32 GB or 48 GB territory.

Related Guides

Popular hardware for local LLMs

RTX 4060 (8 GB)
Budget pick. Runs 7B-8B models at 25-35 tok/s.
Buy on Amazon
RTX 4060 Ti 16 GB
Sweet spot. Runs 13B-14B at full speed. Best value.
Buy on Amazon
RTX 4090 (24 GB)
Top consumer GPU. Runs 70B models with offloading.
Buy on Amazon

Know your VRAM tier? Find the right hardware or calculate exact requirements.

Sources & methodology

VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide specifically I leaned on:

Spot a number that does not match the linked source? Email [email protected] and I will update the guide.