What LLMs Can I Run? Complete VRAM Guide

Q: What is the best model for an 8 GB GPU?

The best models for 8 GB VRAM in 2026 are Qwen3 8B at Q4_K_M for top reasoning quality, DeepSeek-R1-Distill-8B at Q4_K_M for thinking/reasoning tasks, Llama 3.1 8B at Q4_K_M for general chat, and Phi-4-mini for math and coding. All fit within 8 GB with headroom for context.

"I have X GB of VRAM — what can I actually run?" is one of the most common questions in local LLM communities. This guide gives you a direct answer organized by VRAM tier, with specific model recommendations, quantization choices, and speed estimates for each. Jump to your tier below, or start with the quick reference table.

Quick Reference: VRAM to Models

VRAM	Max model at Q4_K_M	Top recommendation
8 GB	7B at Q4_K_M	Qwen3 8B Q4_K_M
12 GB	13B at Q4_K_M	Qwen3 14B Q4_K_M
16 GB	13B at Q8	Qwen3 14B Q8
24 GB	34B at Q4_K_M	Qwen3 32B Q4_K_M
32 GB	34B at Q8	Qwen3 32B Q8
48 GB	70B at Q4_K_M	Llama 3.3 70B Q4_K_M
64 GB	70B at Q5_K_M	Llama 3.3 70B Q5_K_M
128 GB	70B at FP16 / 100B+ at Q4	Llama 4 Scout 109B Q4

Use the VRAM Calculator to get exact figures for any model and quantization combination.

8 GB VRAM

RTX 4060, RTX 3070 (8 GB), older 8 GB cards

Can run

✓ 7B models at Q4_K_M (~4.5 GB) — roughly 30–40 tok/s on RTX 4060
✓ 3B models at Q8 (~3.5 GB) — faster, lower quality ceiling
✓ 1B models at FP16 — very fast, lightweight tasks

Cannot run

✗ 13B models at any quantization without CPU offloading
✗ 34B or 70B models at useful speed

Best models for this tier

Model	Quantization	Why
Qwen3 8B	Q4_K_M	Top reasoning and instruction-following at this size
DeepSeek-R1-Distill-8B	Q4_K_M	Best thinking model for 8 GB — chain-of-thought reasoning
Gemma 3 4B	Q4_K_M	Google's 4B model — excellent quality-to-size ratio, ~3 GB
Llama 3.1 8B Instruct	Q4_K_M	Rock-solid general purpose chat and coding
Phi-4-mini	Q4_K_M	Punches above its weight on math and code

12 GB VRAM

RTX 5070 12GB, RTX 4070 12GB, Intel Arc B580, RTX 3080 10GB

Can run

✓ 13B at Q4_K_M (~8–9 GB) with good context headroom
✓ 7B at Q8 (~8.5 GB) — near-lossless quality for 7B
✓ 7B at FP16 — maximum quality, tight fit at ~14 GB (use Q8 instead)

Cannot run

✗ 34B models comfortably — they need 20+ GB at Q4_K_M
✗ 13B at Q8 without headroom issues (~14 GB)

Best models for this tier

Model	Quantization	Why
Qwen3 14B	Q4_K_M	Best 14B in 2026 — strong reasoning and coding
DeepSeek-R1-Distill-14B	Q4_K_M	Reasoning/thinking model at 14B class
Gemma 3 12B	Q4_K_M	Google's 12B — ~8 GB at Q4, solid general performance
Llama 3.1 8B Instruct	Q8	Near-lossless 8B — excellent quality
Phi-4 14B	Q4_K_M	Strong on math and code at 13B class

16–20 GB VRAM

RTX 4060 Ti 16GB, RTX 5070 Ti 16GB, RTX 4070 Ti Super 16GB, RTX 4080 16GB, RTX 5080 16GB, M4 Mac mini 16 GB unified

Can run

✓ 13B at Q8 (~14 GB) — excellent quality, fits comfortably
✓ 7B at FP16 (~14 GB) — full precision inference
✓ 34B at aggressive Q3 quantization — reduced quality, not recommended

Cannot run

✗ 34B at Q4_K_M comfortably (~20 GB — very tight or impossible)
✗ 70B at any quantization without heavy CPU offloading

Best models for this tier

Model	Quantization	Why
Qwen3 14B	Q8	Near-lossless 14B quality — the sweet spot for 16 GB
DeepSeek-R1-Distill-14B	Q8	High-quality reasoning at full Q8 fidelity
Llama 3.1 8B	FP16	Full precision 8B — best 8B quality possible

24 GB VRAM

RTX 4090, RTX 3090, RX 7900 XTX, Mac mini M4 24 GB unified

Can run

✓ 34B at Q4_K_M (~20 GB) — flagship consumer GPU sweet spot
✓ 13B at Q8 (~14 GB) — excellent quality with headroom
✓ 7B at FP16 (~14 GB) — full precision 7B
✓ 22B models like Codestral comfortably at Q4_K_M or Q5_K_M

Cannot run

✗ 70B without CPU offloading (40 GB at Q4_K_M)
✗ 34B at FP16 (~70 GB)

Best models for this tier

Model	Quantization	Why
Qwen3 32B	Q4_K_M	Flagship 32B — strong reasoning, fits comfortably
DeepSeek-R1-Distill-32B	Q4_K_M	Best thinking/reasoning model at this tier
Qwen3 14B	Q8	Near-lossless 14B — maximum quality for 14B class
Codestral 22B	Q4_K_M	Best dedicated coding model at this tier

32 GB VRAM

RTX 5090 (32 GB), some pro/workstation GPUs

Can run

✓ 34B at Q8 (~34 GB — very tight, minimal context headroom)
✓ 34B at Q6_K (~27 GB) — comfortable with good headroom
✓ 70B at Q4_K_M with ~10 GB CPU offload — usable but slower

Cannot run

✗ 70B fully in VRAM (needs ~40 GB at Q4_K_M)
✗ 34B at FP16 (~70 GB)

Best models for this tier

Model	Quantization	Why
Qwen3 32B	Q8	Near-lossless 32B quality — flagship tier for RTX 5090
DeepSeek-R1-Distill-32B	Q8	Full-quality thinking model, fits at ~34 GB
Llama 3.3 70B	Q4_K_M	Partial offload — reduced speed but 70B quality

48 GB VRAM

Mac mini M4 Pro 48 GB · dual RTX 3090 NVLink (48 GB) · Quadro A6000

Can run

✓ 70B at Q4_K_M (~40 GB) — fits comfortably, the key milestone of this tier
✓ 34B at FP16 (~70 GB — tight on dual 3090; fine on M4 Pro 48 GB)
✓ 34B at Q8 (~34 GB) with significant headroom

Cannot run

✗ 70B at Q8 (~74 GB) — exceeds 48 GB
✗ 100B+ models at any practical quantization

Best models for this tier

Model	Quantization	Why
Llama 3.3 70B Instruct	Q4_K_M	This tier unlocks 70B — the flagship open model for 2026
Qwen3 32B	FP16	Full precision 32B — maximum quality at this tier
DeepSeek-R1-Distill-70B	Q4_K_M	Best reasoning/thinking model that fits here

64 GB VRAM

Mac Studio M4 Max 64 GB unified

Can run

✓ 70B at Q5_K_M (~47 GB) — comfortable fit, strong quality
✓ 70B at Q6_K (~55 GB) — high quality with headroom
✓ 70B at Q8 (~74 GB) — tight to impossible; use Q6_K instead

Cannot run

✗ 70B at Q8 comfortably — 74 GB exceeds 64 GB
✗ 100B+ models at Q4_K_M comfortably

Best models for this tier

Model	Quantization	Why
Llama 3.3 70B Instruct	Q5_K_M	Excellent quality, comfortable headroom at ~47 GB
DeepSeek-R1-Distill-70B	Q5_K_M	High-quality reasoning at full 70B scale
Llama 3.3 70B Instruct	Q6_K	Near-lossless quality, still fits at ~55 GB

128 GB VRAM

Mac Studio M4 Max 128 GB unified

Can run

✓ 70B at Q8 (~74 GB) — full near-lossless quality with headroom
✓ 70B at FP16 (~140 GB) — very tight, use Q8 instead
✓ 100B+ models at Q4_K_M — this tier unlocks early 100B+ models

Cannot run

✗ 70B at FP16 comfortably — 140 GB exceeds 128 GB; use Q8
✗ Very large 200B+ models

Best models for this tier

Model	Quantization	Why
Llama 3.3 70B Instruct	Q8	Full near-lossless quality at 70B — the best 70B inference
Llama 4 Scout 109B	Q4_K_M	First MoE 100B+ model accessible at this tier (~60 GB)
Qwen3 32B	Q8	Maximum quality 32B — extremely fast on Mac Studio

Rules of Thumb

1. Estimate VRAM with: parameters × bytes-per-param. A Q4_K_M model uses ~0.5 bytes per parameter. A 7B model at Q4_K_M: 7,000,000,000 × 0.5 = 3.5 GB of weights, plus ~1 GB overhead = roughly 4.5 GB total.
2. Leave ~2 GB headroom for context and KV cache. A 7B model that weighs 4.5 GB can still exhaust an 8 GB GPU if you feed it very long prompts. Keep headroom, especially with larger context windows.
3. CPU offloading cuts speed dramatically. Offloading 20–30% of a model's layers to CPU RAM typically reduces token generation to under 5 tokens per second — barely usable for interactive chat. Prefer running a smaller model fully in VRAM over a larger model with offloading.
4. Apple unified memory counts the same as VRAM. On Mac, the full unified memory pool is available for model loading. A 48 GB M4 Pro Mac mini can load a 70B Q4_K_M model just like a 48 GB discrete GPU setup.
5. Q4_K_M is the right default for almost everyone. It gives excellent quality at 4× the VRAM savings over FP16. Only go lower (Q3) if you have no other option. Go higher (Q6_K, Q8) when you have VRAM headroom and quality matters. See the quantization guide for full details.
6. A smaller model fully in VRAM beats a larger model split across GPU and CPU. A 7B model at Q8 running entirely on an 8 GB GPU will be faster and more responsive than a 13B model offloaded 40% to RAM.

Frequently Asked Questions

Can I run a 7B model on 8 GB VRAM?

Yes. A 7B model at Q4_K_M quantization uses roughly 4.5 GB VRAM, which fits comfortably on an 8 GB GPU like the RTX 4060 with room left for context. You can expect around 30–40 tokens per second. You cannot run 13B or larger models at this tier without CPU offloading, which significantly reduces speed.

Can I run a 70B model on a 24 GB GPU?

Not fully in VRAM. A 70B model at Q4_K_M requires roughly 40 GB VRAM. On a 24 GB GPU like the RTX 4090 you would need to offload a significant portion of layers to CPU RAM, which typically cuts speed to under 5 tokens per second. For 70B models without offloading, you need 48 GB or more of VRAM.

Does Apple unified memory count the same as VRAM?

Yes, for LLM purposes Apple Silicon unified memory works like GPU VRAM — the model loads entirely into it. A 24 GB M4 Mac mini behaves similarly to a 24 GB discrete GPU for inference. Unified memory bandwidth is typically lower than high-end discrete GPU VRAM, so token speeds may differ, but large models fit and run without CPU offloading.

What is the best model for an 8 GB GPU?

The best models for 8 GB VRAM are Qwen3 8B at Q4_K_M for top reasoning quality, DeepSeek-R1-Distill-8B at Q4_K_M for thinking/reasoning, Phi-4-mini for coding and math, and Gemma 3 4B at Q4_K_M for a fast, capable 4B option. All fit within 8 GB with headroom for context.

How much VRAM does a 13B model need?

A 13B model at Q4_K_M quantization requires approximately 8–9 GB VRAM, fitting on a 12 GB GPU with some headroom. At Q8 it needs around 14 GB, requiring a 16 GB or 24 GB GPU. At FP16 it needs roughly 27 GB, placing it in 32 GB or 48 GB territory.

Related Guides

GPU Buying Guide

Which GPU to buy for LLMs at every budget

Running LLMs Without a GPU

CPU-only inference: speed, RAM needs, best models

DeepSeek R1 Hardware Guide

What hardware you need for each DeepSeek R1 variant

Qwen3 Hardware Guide

Running Qwen3 4B through 32B locally

Best LLMs to Run Locally (2026)

Top model picks by VRAM tier — Qwen3, Gemma 3, Phi-4

Gemma 3 Hardware Guide

Google's 1B–27B models — VRAM and GPU picks

Phi-4 Hardware Guide

Microsoft's 14B that beats most 70B models on benchmarks

Llama 3.3 70B Guide

What hardware you need for the best open-weights 70B model

Mac mini M4 Pro 48GB Guide

One of the most affordable consumer devices for 70B models

How to Run LLMs Locally

Found your tier? Set up Ollama or LM Studio in 5 minutes

How to Run Qwen3 Locally

Best model at every VRAM tier — step-by-step Ollama setup

How to Run Phi-4 Locally

Microsoft Phi-4 14B — needs 10 GB VRAM, beats Llama 3.1 8B on reasoning

How to Run Llama 4 Scout Locally

Llama 4 Scout needs 58 GB — why, and what to do with 8-24 GB

Best LLMs for 8 GB VRAM

RTX 3060/4060: Qwen3 7B Q8 at 35 tok/s — full model fit table

Best LLMs for 16 GB VRAM

RTX 4060 Ti/4080: Qwen3 14B Q8 (14.8 GB), Gemma 3 27B Q4

Best LLMs for 24 GB VRAM

RTX 4090/3090: Qwen3 32B Q4 (19 GB) at 38 t/s — the 32B sweet spot

Popular hardware for local LLMs

RTX 4060 (8 GB)

Budget pick. Runs 7B-8B models at 25-35 tok/s.

Buy on Amazon

RTX 4060 Ti 16 GB

Sweet spot. Runs 13B-14B at full speed. Best value.

Buy on Amazon

RTX 4090 (24 GB)

Top consumer GPU. Runs 70B models with offloading.

Buy on Amazon

Know your VRAM tier? Find the right hardware or calculate exact requirements.

VRAM Calculator GPU Buying Guide Browse All Models

Sources & methodology

VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide specifically I leaned on:

XiongjieDai GPU-Benchmarks-on-LLM-Inference — primary source for the per-tier tokens-per-second ranges (RTX 4060 through RTX 5090).
Modal: How much VRAM do I need for inference — sanity-checked the bytes-per-parameter math and the overhead allowance.
Home GPU LLM Leaderboard — used to confirm that the "best model at each tier" picks match what the community actually runs.
Hugging Face Hub — model cards I checked for each recommended model's parameter count, quantization options, and file sizes.

Spot a number that does not match the linked source? Email [email protected] and I will update the guide.