RTX 4070 vs RTX 4080 vs RTX 4090 for Local LLMs: Which Should You Buy? (2026)
Editorial: AI structured the three-way comparison. The break-point logic ("when each tier wins") was hand-edited from the linked community benchmarks, not from spec-sheet extrapolation.
Updated May 2026 · RTX 4070 12 GB vs RTX 4080 Super 16 GB vs RTX 4090 24 GB · VRAM comparison · Speed benchmarks · 32B model compatibility
The RTX 4070, RTX 4080 Super, and RTX 4090 are three of NVIDIA's most popular consumer GPUs for local AI, each at a different price point with a meaningfully different VRAM tier: 12 GB, 16 GB, and 24 GB respectively. That VRAM difference determines which models you can run, and the bandwidth difference determines how fast they generate tokens. This guide breaks down what you get at each tier so you can pick the right card for your workload.
Quick Verdict
RTX 4070 12GB — Budget
Best budget pick
- Runs 7B-14B models at Q4 comfortably
- ~80 t/s on 7B, ~48 t/s on 14B
- Low 200W power draw
- Cannot run 32B or 70B models
- Best for casual and budget users
RTX 4080 Super 16GB — Sweet Spot
Best mid-range value
- Runs 14B at Q8 (near-lossless quality)
- ~110 t/s on 7B, ~65 t/s on 14B
- Handles 20-22B models at Q4
- Cannot run 32B in VRAM
- Best for enthusiast users
RTX 4090 24GB — Max Performance
Best for power users
- Runs Qwen3 32B and DeepSeek 32B at Q4
- ~145 t/s on 7B, ~90 t/s on 14B
- Fastest token generation of the three cards
- Still cannot run 70B models
- Best for power users and researchers
None of these three cards run 70B models
Llama 3.3 70B at Q4_K_M needs approximately 37 GB of VRAM — above even the RTX 4090's 24 GB. For 70B models you need a Mac mini M4 Pro (48 GB unified memory) or Mac Studio M4 Max (64 GB). If 70B is your goal, the GPU comparison here is irrelevant — consider Apple Silicon instead.
Full Comparison: RTX 4070 vs RTX 4080 Super vs RTX 4090
| Metric | RTX 4070 12GB | RTX 4080 Super 16GB | RTX 4090 24GB |
|---|---|---|---|
| VRAM | 12 GB GDDR6X | 16 GB GDDR6X | 24 GB GDDR6X |
| Memory bandwidth | 504 GB/s | 736 GB/s | 1008 GB/s |
| 7B model Q4 speed | ~80 t/s | ~110 t/s | ~145 t/s |
| 14B model Q4 speed | ~48 t/s | ~65 t/s | ~90 t/s |
| 32B model Q4 speed | doesn't fit | doesn't fit | ~38 t/s |
| Llama 3.3 70B Q4 | No (needs ~37 GB) | No (needs ~37 GB) | No (needs ~37 GB, well above 24 GB) |
| DeepSeek-R1-14B Q4 | Yes (~48 t/s) | Yes (~65 t/s) | Yes (~90 t/s) |
| Can run Qwen3 32B Q4 | No (~18.5 GB needed) | No (~18.5 GB needed) | Yes (~18.5 GB fits) |
| Power draw (LLM) | ~200W | ~320W | ~450W |
| Best for | 7B-14B models, budget efficiency | 14B at Q8 and 20-22B at Q4, best mid-range | Everything up to 32B at high speed |
Speed figures are approximate for Ollama with llama.cpp backend at default context length. Actual results vary by model, quantization, context length, and system RAM. Prices are approximate retail in 2026 — check current listings before purchasing.
VRAM Comparison: The 12 GB to 16 GB to 24 GB Cliff
VRAM capacity is the single most important specification for local LLMs. It determines which models you can run in VRAM at all — and models that don't fit in VRAM fall back to CPU offloading, typically dropping from 80+ tokens per second to under 10 tokens per second. Each jump in VRAM tier unlocks a meaningful new capability tier.
12 GB (RTX 4070): 7B-14B at Q4, nothing larger
At 12 GB you comfortably run any 7B model at Q8 (~8.5 GB) and any 14B model at Q4 (~9 GB). Qwen3 14B at Q4_K_M fits with headroom. What you cannot run: 14B at Q8 (~14 GB), any 20B+ model, 32B at any quantization, or 70B at any quantization. The 12 GB tier is the entry point for local LLMs — useful and capable, but the ceiling is real.
16 GB (RTX 4080 Super): 14B at Q8, 20-22B at Q4
At 16 GB you unlock 14B models at Q8 (~14 GB) — near-lossless quality for Qwen3 14B, Phi-4 14B, and similar models. You can also run Mistral 22B at Q4 (~14 GB) and some 20B models. What you still cannot run: Gemma 3 27B at Q4 (~17 GB — just above 16 GB), Qwen3 32B at Q4 (~18.5 GB), or any 70B model. The 16 GB tier is the sweet spot for most serious users.
24 GB (RTX 4090): 32B at Q4, everything up to the 70B wall
At 24 GB you unlock the entire 32B tier: Qwen3 32B at Q4 (~18.5 GB), DeepSeek-R1-Distill-32B at Q4, and Gemma 3 27B at Q4 (~17 GB) with room for context. You also run all 14B models at Q8 with significant headroom. The 24 GB ceiling hits at 70B: Llama 3.3 70B at Q4_K_M needs ~37 GB, well above the 4090's capacity. Within the RTX 40 series, 24 GB is the most VRAM available on a single consumer card.
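The tier boundaries above reduce to a simple comparison, sketched here in Python. The footprints are the approximate figures quoted in this guide. The 1 GB headroom for KV cache and runtime overhead is an assumption chosen for illustration, and the names `fits` and `models_for` are hypothetical helpers, not part of any tool.

```python
# Approximate VRAM footprints in GB, taken from the figures quoted in this guide.
MODEL_GB = {
    "7B Q8": 8.5,
    "Qwen3 14B Q4": 9.0,
    "Qwen3 14B Q8": 14.0,
    "Mistral 22B Q4": 14.0,
    "Gemma 3 27B Q4": 17.0,
    "Qwen3 32B Q4": 18.5,
    "Llama 3.3 70B Q4_K_M": 37.0,
}

CARD_GB = {"RTX 4070": 12, "RTX 4080 Super": 16, "RTX 4090": 24}

# Assumption: reserve about 1 GB for KV cache and runtime overhead.
HEADROOM_GB = 1.0


def fits(model_gb: float, vram_gb: float, headroom_gb: float = HEADROOM_GB) -> bool:
    """Return True if the model plus headroom fits entirely in VRAM."""
    return model_gb + headroom_gb <= vram_gb


def models_for(card: str) -> list:
    """List the models from MODEL_GB that fit on the given card."""
    return [name for name, gb in MODEL_GB.items() if fits(gb, CARD_GB[card])]


for card in CARD_GB:
    print(f"{card}: {', '.join(models_for(card))}")
```

Longer contexts need more headroom than this, so treat a borderline fit as a likely spill to CPU.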
Speed Comparison: Bandwidth Drives Token Speed
LLM token generation is memory-bandwidth-limited, not compute-limited. The RTX 4090's 1008 GB/s bandwidth is 37% higher than the RTX 4080 Super's 736 GB/s, which is 46% higher than the RTX 4070's 504 GB/s. That bandwidth ratio maps almost directly to tokens per second on the same model.
Llama 3.1 8B Q4_K_M
- RTX 4070: ~80 t/s
- RTX 4080 Super: ~110 t/s
- RTX 4090: ~145 t/s
Qwen3 14B Q4_K_M
- RTX 4070: ~48 t/s
- RTX 4080 Super: ~65 t/s
- RTX 4090: ~90 t/s
DeepSeek-R1-32B Q4
- RTX 4070: cannot run
- RTX 4080 Super: cannot run
- RTX 4090: ~38 t/s
All three cards feel fast in conversational chat at 7B. The difference becomes meaningful at 14B where 48 t/s (4070) vs 90 t/s (4090) is noticeable for long responses — roughly half the wait time. At 32B only the 4090 runs the model in VRAM, so the comparison is moot for the other two cards.
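To see how closely bandwidth tracks speed, the sketch below compares bandwidth ratios with the speed ratios from the figures in this guide. It also computes a rough ceiling of bandwidth divided by model size, on the common simplifying assumption that each generated token reads every weight from VRAM once. The ceiling ignores KV cache traffic and kernel overhead, so it is an upper bound for illustration, not a prediction.

```python
BANDWIDTH_GBPS = {"RTX 4070": 504, "RTX 4080 Super": 736, "RTX 4090": 1008}

# Tokens per second quoted in this guide.
MEASURED_TPS = {
    "7B Q4": {"RTX 4070": 80, "RTX 4080 Super": 110, "RTX 4090": 145},
    "14B Q4": {"RTX 4070": 48, "RTX 4080 Super": 65, "RTX 4090": 90},
}


def ratio(values: dict, faster: str, slower: str) -> float:
    """How many times larger the faster card's value is."""
    return values[faster] / values[slower]


def ceiling_tps(bandwidth_gbps: float, model_gb: float) -> float:
    """Rough upper bound: one full read of the weights per generated token."""
    return bandwidth_gbps / model_gb


pairs = [("RTX 4080 Super", "RTX 4070"), ("RTX 4090", "RTX 4080 Super")]
for faster, slower in pairs:
    print(f"{faster} vs {slower}: bandwidth x{ratio(BANDWIDTH_GBPS, faster, slower):.2f}")
    for model, tps in MEASURED_TPS.items():
        print(f"  {model}: speed x{ratio(tps, faster, slower):.2f}")

# A 14B model at Q4 is about 9 GB per this guide.
for card, bw in BANDWIDTH_GBPS.items():
    measured = MEASURED_TPS["14B Q4"][card]
    print(f"{card}: 14B Q4 ceiling ~{ceiling_tps(bw, 9.0):.0f} t/s, quoted ~{measured} t/s")
```

The quoted speeds land below the ceiling on every card, which is what a bandwidth-bound workload with some overhead should look like.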
Power and Running Cost: 450W vs 200W
The RTX 4090 draws approximately 450W under LLM load. The RTX 4070 draws approximately 200W. That 250W difference adds up over sustained use — and it affects PSU requirements and electricity bills.
Monthly electricity cost at 8 hours/day ($0.15/kWh)
RTX 4070 (200W)
~$7/month
~$88/year
RTX 4080 Super (320W)
~$12/month
~$140/year
RTX 4090 (450W)
~$16/month
~$197/year
The RTX 4090 costs roughly $9/month more in electricity than the RTX 4070 at 8 hours/day — about $108/year extra. Over a three-year period that adds roughly $325 to the effective cost of the 4090. Rates are higher in Europe and parts of Australia, widening this gap.
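The electricity figures come from a single formula: kilowatts times hours times days times rate. A minimal sketch, assuming the same 8 hours/day and $0.15/kWh as above and treating each card's quoted draw as constant under load (the function names are illustrative):

```python
def monthly_cost(watts: float, hours_per_day: float = 8,
                 rate_per_kwh: float = 0.15, days: int = 30) -> float:
    """Electricity cost in dollars for one 30-day month."""
    return watts / 1000 * hours_per_day * days * rate_per_kwh


def yearly_cost(watts: float, hours_per_day: float = 8,
                rate_per_kwh: float = 0.15) -> float:
    """Electricity cost in dollars for a 365-day year."""
    return watts / 1000 * hours_per_day * 365 * rate_per_kwh


for card, watts in {"RTX 4070": 200, "RTX 4080 Super": 320, "RTX 4090": 450}.items():
    print(f"{card}: ~${monthly_cost(watts):.0f}/month, ~${yearly_cost(watts):.0f}/year")
```

Pass your own `rate_per_kwh` and `hours_per_day` to adapt the estimate to your tariff and usage.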
PSU requirements also differ: the RTX 4090 needs at minimum an 850W PSU with a modern CPU. The RTX 4080 Super runs on 750W. The RTX 4070 runs comfortably on 650W. If you are upgrading and your PSU is borderline, factor in upgrade cost for the higher-draw cards.
The 70B Question: None of These Cards Can Do It
If your primary goal is running 70B-class models (Llama 3.3 70B, Qwen2.5 72B, DeepSeek-R1 70B), none of the three cards in this comparison will satisfy you. Here is why:
Llama 3.3 70B Q4_K_M needs ~37 GB VRAM
The most popular 70B quantization requires approximately 37 GB to load in VRAM. The RTX 4090 tops out at 24 GB. You can run it with GPU+CPU hybrid offloading, but speed drops to approximately 2-5 tokens per second, which is frustrating for interactive use.
Qwen2.5 72B needs ~40 GB VRAM
Qwen2.5 72B at Q4_K_M is larger at approximately 40 GB. No single consumer GPU can run this in VRAM today. Even with hybrid offloading on a 4090, generation is too slow for practical use.
For 70B, you need a Mac or dual-GPU setup
The Mac mini M4 Pro with 48 GB unified memory holds Llama 3.3 70B entirely in memory, but expect single-digit tokens per second: at the M4 Pro's 273 GB/s memory bandwidth, reading ~37 GB of weights per token caps generation at roughly 7 t/s. That is slow but usable for non-interactive tasks. The Mac Studio M4 Max with 64 GB runs it more comfortably. A dual RTX 3090 setup (each card at PCIe x8) reaches 48 GB of combined VRAM but requires a workstation motherboard and careful configuration. These are the realistic paths to 70B, not any single consumer NVIDIA GPU.
What the RTX 4090 does unlock at 24 GB
The jump from 16 GB to 24 GB specifically unlocks 32B models: Qwen3 32B at Q4 (~18.5 GB), DeepSeek-R1-Distill-32B at Q4, and Gemma 3 27B at Q4 (~17 GB). These are excellent, capable models that deliver near-70B quality on many benchmarks. If 32B is your target, the 4090 is the card. If 70B is your target, look elsewhere.
Price-to-Performance Analysis
Raw price-per-token-per-second is one lens, but the model compatibility cliff matters more. A card that runs a model 30% slower but actually runs the model is infinitely better than one that cannot run it at all.
| GPU | 14B Q4 speed | Unlocks 32B? |
|---|---|---|
| RTX 4070 12GB | ~48 t/s | No |
| RTX 4080 Super 16GB | ~65 t/s | No |
| RTX 4090 24GB | ~90 t/s | Yes |
The RTX 4070 delivers the best raw price efficiency per token at 14B. The RTX 4080 Super costs 67% more for 35% more speed — a modest efficiency step down. The RTX 4090 costs 167% more than the 4070 for 88% more speed — reasonable if you need 32B models, poor value if you only run 14B. The decision framework: pick the RTX 4070 if 14B is your ceiling, the RTX 4080 Super if you want 14B at Q8 quality, and the RTX 4090 if you want 32B models or absolute fastest generation.
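The efficiency comparison can be made explicit by dividing speed by relative price. The sketch below uses only the percentages and 14B Q4 speeds quoted in this section, with the RTX 4070 as the 1.0 baseline. Actual street prices move, so the relative prices here are placeholders taken from this guide, not current listings.

```python
# Relative price with the RTX 4070 as baseline: +67% and +167% per this guide.
RELATIVE_PRICE = {"RTX 4070": 1.00, "RTX 4080 Super": 1.67, "RTX 4090": 2.67}
TPS_14B_Q4 = {"RTX 4070": 48, "RTX 4080 Super": 65, "RTX 4090": 90}
RUNS_32B = {"RTX 4070": False, "RTX 4080 Super": False, "RTX 4090": True}


def relative_efficiency(card: str, baseline: str = "RTX 4070") -> float:
    """Tokens per second per unit of price, normalised to the baseline card."""
    per_price = TPS_14B_Q4[card] / RELATIVE_PRICE[card]
    base = TPS_14B_Q4[baseline] / RELATIVE_PRICE[baseline]
    return per_price / base


for card in RELATIVE_PRICE:
    unlocks = "runs 32B" if RUNS_32B[card] else "no 32B"
    print(f"{card}: {relative_efficiency(card):.2f}x speed per dollar, {unlocks}")
```

On these inputs the 4080 Super delivers about 0.81x and the 4090 about 0.70x the 4070's speed per dollar, which is why the 32B capability, not raw efficiency, has to justify the 4090.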
Who Should Buy Which GPU
Buy the RTX 4070 12GB if...
- ✓ You are on a tight budget
- ✓ You primarily run 7B-14B models (Llama 3.1 8B, Qwen3 14B, Phi-4 14B)
- ✓ Q4 quality is acceptable — you do not need Q8 near-lossless on 14B
- ✓ You want low power draw (~200W) and a quiet, efficient build
- ✓ You are new to local LLMs and want to try without heavy investment
Avoid if: you want Qwen3 32B, DeepSeek-R1-Distill-32B, or Gemma 3 27B — none fit in 12 GB.
Buy the RTX 4080 Super 16GB if...
- ✓ You want 14B models at Q8 (near-lossless quality)
- ✓ You run Mistral 22B or similar 20B-class models at Q4
- ✓ You want noticeably faster generation than the 4070 (~35% faster)
- ✓ You are an enthusiast who wants the best mid-range option
- ✓ You do not specifically need 32B models
Avoid if: you need Qwen3 32B or DeepSeek-R1-32B at Q4 — they need 18.5-19 GB, just above 16 GB.
Buy the RTX 4090 24GB if...
- ✓ You want to run Qwen3 32B, DeepSeek-R1-Distill-32B, or Gemma 3 27B
- ✓ You want the fastest token generation of the three cards
- ✓ You run multiple users or applications against the same local GPU
- ✓ You do heavy long-context inference where speed compounds over generations
- ✓ Budget is secondary to capability
Avoid if: your only goal is 70B models — the 4090 still cannot run them. Budget goes further on Apple Silicon for 70B.
Running LLMs on Any of These Cards: Ollama Setup
All three cards are fully supported by Ollama and LM Studio. NVIDIA GPU detection is automatic — no special configuration needed.
Install Ollama
Windows
Download from ollama.com. Installs as a background service and auto-detects NVIDIA GPUs once the NVIDIA driver is installed.
Linux
curl -fsSL https://ollama.com/install.sh | sh
RTX 4070 12GB — recommended models
Qwen3 14B Q4_K_M (~9 GB — fits with 3 GB headroom)
ollama run qwen3:14b
Llama 3.1 8B Q8_0 (~8.5 GB — high quality 8B)
ollama run llama3.1:8b-instruct-q8_0
Phi-4 14B Q4_K_M (~9.2 GB)
ollama run phi4:14b
RTX 4080 Super 16GB — recommended models
Qwen3 14B Q8_0 — near-lossless quality (~14 GB)
ollama run qwen3:14b-q8_0
Mistral Small 22B at 4-bit (~14 GB)
ollama run mistral-small:22b
Phi-4 14B Q8_0 (~14.5 GB)
ollama run phi4:14b-q8_0
RTX 4090 24GB — recommended models
Qwen3 32B Q4_K_M (~18.5 GB) — unlocked by 24 GB
ollama run qwen3:32b
DeepSeek-R1-Distill-32B Q4_K_M
ollama run deepseek-r1:32b
Qwen3 14B Q8_0 with headroom (~14 GB)
ollama run qwen3:14b-q8_0
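If you are unsure which tier your card falls into, the sketch below reads total VRAM with `nvidia-smi` and maps it to one headline tag per tier from the lists above. The tier table and the helper names `suggest_tag` and `detect_vram_mib` are illustrative assumptions. The driver can report slightly less than the nominal capacity, so the value is rounded to the nearest GB before comparison.

```python
import subprocess
from typing import Optional

# One headline Ollama tag per VRAM tier (GB), taken from the lists in this guide.
SUGGESTED_TAG = {12: "qwen3:14b", 16: "qwen3:14b-q8_0", 24: "qwen3:32b"}


def suggest_tag(vram_mib: int) -> Optional[str]:
    """Pick the tag for the largest tier that does not exceed the card's VRAM."""
    vram_gb = round(vram_mib / 1024)
    eligible = [tier for tier in SUGGESTED_TAG if tier <= vram_gb]
    return SUGGESTED_TAG[max(eligible)] if eligible else None


def detect_vram_mib() -> int:
    """Read the first GPU's total memory in MiB via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.splitlines()[0].strip())
```

On a machine with an NVIDIA driver installed, `suggest_tag(detect_vram_mib())` returns the tag to pass to `ollama run`, or `None` for cards below 12 GB.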
Verify GPU load after starting a model
ollama ps
The PROCESSOR column should show 100% GPU for full VRAM inference. If it shows a CPU/GPU split instead, some layers are on the CPU and generation will be significantly slower. Reduce context length or choose a smaller quantization if the model is spilling to CPU.
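The same check can be scripted against Ollama's local HTTP API. This sketch assumes the server is on its default address and that the `/api/ps` response lists each loaded model with `size` and `size_vram` fields, as I understand the API. The helper names are illustrative.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_PS_URL = "http://localhost:11434/api/ps"


def gpu_fraction(model: dict) -> float:
    """Share of a loaded model resident in VRAM (1.0 means fully on GPU)."""
    size = model.get("size", 0)
    return model.get("size_vram", 0) / size if size else 0.0


def running_models(url: str = OLLAMA_PS_URL) -> list:
    """Fetch the list of loaded models from the local Ollama server."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp).get("models", [])


def report(models: list) -> list:
    """Format one status line per loaded model."""
    lines = []
    for m in models:
        frac = gpu_fraction(m)
        status = "fully on GPU" if frac >= 0.999 else "partly offloaded to CPU"
        lines.append(f"{m.get('name', '?')}: {frac:.0%} in VRAM ({status})")
    return lines
```

With a model loaded, `print("\n".join(report(running_models())))` prints one line per model. Anything below 100% means layers have spilled to system RAM.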
Frequently Asked Questions
Which NVIDIA GPU is best for running LLMs locally?
The RTX 4090 delivers maximum capability with 24 GB VRAM and 1008 GB/s bandwidth — it runs 32B models at Q4 and generates tokens fastest across all sizes. The RTX 4080 Super 16GB is the best value for most users, covering 14B at Q8 and 20B-class models. The RTX 4070 12GB is the budget pick for 7B-14B use. The right choice depends on which models you want to run — if you need Qwen3 32B or DeepSeek-R1-32B, only the 4090 works.
Can the RTX 4070 run Llama 3 70B?
No. Llama 3.3 70B at Q4_K_M needs approximately 37 GB of VRAM. The RTX 4070 has 12 GB. Even the RTX 4090 at 24 GB cannot run 70B in VRAM — none of these three cards can. For 70B models you need a Mac mini M4 Pro with 48 GB unified memory or a Mac Studio M4 Max with 64 GB.
Is the RTX 4090 worth it over the RTX 4080 Super?
Yes, if you want 32B models: the 4090's 24 GB unlocks Qwen3 32B and DeepSeek-R1-Distill-32B at Q4. The 4090 also generates tokens roughly 32-38% faster than the 4080 Super on the 7B and 14B figures above, in line with its 37% bandwidth advantage. If you only run 14B models, the RTX 4080 Super costs less and delivers the same output quality at roughly 70% of the speed, still very fast at ~65 t/s. The 4090 is worth it for power users and researchers who need the 32B tier or the fastest generation of the three cards.
What is the cheapest GPU that runs Qwen3 14B well?
The RTX 4070 12GB runs Qwen3 14B at Q4_K_M (~9 GB) comfortably with headroom. For near-lossless Q8 quality (~14 GB), you need 16 GB of VRAM — the RTX 4060 Ti 16GB is the cheapest option for Q8, though slower than the RTX 4080 Super 16GB.
Is the RTX 4080 Super faster than the RTX 4090 for 7B models?
No. The RTX 4090's 1008 GB/s memory bandwidth is approximately 37% higher than the RTX 4080 Super's 736 GB/s. Since LLM token generation is memory-bandwidth-limited, the 4090 generates tokens faster across all model sizes — roughly 145 t/s vs 110 t/s on a 7B Q4 model. The 4080 Super is never faster than the 4090 at inference.
Sources & methodology
VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide I leaned on:
- Hardware Corner GPU ranking. Tokens per second for the 4070, 4080 and 4090 at 2k and 8k context.
- XiongjieDai GPU-Benchmarks-on-LLM-Inference. Cross-validated llama-bench runs for all three Ada cards on the same models.
- Home GPU LLM Leaderboard. VRAM tier mapping (12/16/24 GB) that drives the 'which one for which model' calls.
Spot a number that does not match the linked source? Email [email protected] and I will update the guide.