RTX 4070 vs RTX 4080 vs RTX 4090 for Local LLMs: Which Should You Buy? (2026)

Q: Can the RTX 4070 run Llama 3 70B?

No. Llama 3.3 70B at Q4_K_M requires approximately 37 GB of VRAM. The RTX 4070 has only 12 GB, the RTX 4080 Super has 16 GB, and even the RTX 4090 has only 24 GB — none of these three cards can run 70B models in VRAM. For 70B models you need a Mac mini M4 Pro with 48 GB unified memory, a Mac Studio M4 Max with 64 GB, or a dual-GPU server setup. Running 70B with GPU+CPU split on any of these cards drops speed to 2-5 tokens per second, which is essentially unusable for interactive chat.

Q: Is the RTX 4090 worth it over the RTX 4080 Super?

Yes, if you want to run 32B models — the RTX 4090's 24 GB VRAM unlocks Qwen3 32B at Q4 (~18.5 GB) and DeepSeek-R1-Distill-32B at Q4, which the RTX 4080 Super's 16 GB cannot hold. The 4090 also generates tokens 37% faster than the 4080 Super thanks to 1008 GB/s vs 736 GB/s bandwidth. If you only run 14B models, the RTX 4080 Super costs less and delivers similar quality at roughly 70% the speed — still very fast at ~65 t/s. The 4090 is the right buy for power users who want the absolute fastest generation and access to the full 32B model tier.

Editorial: AI structured the three-way comparison. The break-point logic ("when each tier wins") was hand-edited from the linked community benchmarks, not from spec-sheet extrapolation.

Updated May 2026 · RTX 4070 12 GB vs RTX 4080 Super 16 GB vs RTX 4090 24 GB · VRAM comparison · Speed benchmarks · 32B model compatibility

This guide compares three of NVIDIA's most popular consumer GPUs for local AI, each at a different price point and VRAM tier: the RTX 4070 with 12 GB, the RTX 4080 Super with 16 GB, and the RTX 4090 with 24 GB. VRAM determines which models you can run, and memory bandwidth determines how fast they generate tokens. The sections below break down what you get at each tier so you can pick the right card for your workload.

Quick Verdict

RTX 4070 12GB — Budget

Best budget pick

Runs 7B-14B models at Q4 comfortably
~80 t/s on 7B, ~48 t/s on 14B
Low 200W power draw
Cannot run 32B or 70B models
Best for casual and budget users

RTX 4080 Super 16GB — Sweet Spot

Best mid-range value

Runs 14B at Q8 (near-lossless quality)
~110 t/s on 7B, ~65 t/s on 14B
Handles 20-22B models at Q4
Cannot run 32B in VRAM
Best for enthusiast users

RTX 4090 24GB — Max Performance

Best for power users

Runs Qwen3 32B and DeepSeek 32B at Q4
~145 t/s on 7B, ~90 t/s on 14B
Fastest token generation of the three cards
Still cannot run 70B models
Best for power users and researchers

None of these three cards run 70B models

Llama 3.3 70B at Q4_K_M needs approximately 37 GB of VRAM — above even the RTX 4090's 24 GB. For 70B models you need a Mac mini M4 Pro (48 GB unified memory) or Mac Studio M4 Max (64 GB). If 70B is your goal, the GPU comparison here is irrelevant — consider Apple Silicon instead.

Full Comparison: RTX 4070 vs RTX 4080 Super vs RTX 4090

Metric	RTX 4070 12GB	RTX 4080 Super 16GB	RTX 4090 24GB
VRAM	12 GB GDDR6X	16 GB GDDR6X	24 GB GDDR6X
Memory bandwidth	504 GB/s	736 GB/s	1008 GB/s
7B model Q4 speed	~80 t/s	~110 t/s	~145 t/s
14B model Q4 speed	~48 t/s	~65 t/s	~90 t/s
32B model Q4 speed	doesn't fit	doesn't fit	~38 t/s
Llama 3.3 70B Q4	No (needs ~37 GB)	No (needs ~37 GB)	No (needs ~37 GB, well above 24 GB)
DeepSeek-R1-14B Q4	Yes (~48 t/s)	Yes (~65 t/s)	Yes (~90 t/s)
Can run Qwen3 32B Q4	No (~18.5 GB needed)	No (~18.5 GB needed)	Yes (~18.5 GB fits)
Power draw (LLM)	~200W	~320W	~450W
Best for	7B-14B models, budget efficiency	14B at Q8 and 20-22B at Q4, best mid-range	Everything up to 32B at high speed

Speed figures are approximate for Ollama with llama.cpp backend at default context length. Actual results vary by model, quantization, context length, and system RAM. Prices are approximate retail in 2026 — check current listings before purchasing.

VRAM Comparison: The 12 GB to 16 GB to 24 GB Cliff

VRAM capacity is the single most important specification for local LLMs. It determines which models you can run in VRAM at all — and models that don't fit in VRAM fall back to CPU offloading, typically dropping from 80+ tokens per second to under 10 tokens per second. Each jump in VRAM tier unlocks a meaningful new capability tier.

12 GB (RTX 4070): 7B-14B at Q4, nothing larger

At 12 GB you comfortably run any 7B model at Q8 (~8.5 GB) and any 14B model at Q4 (~9 GB). Qwen3 14B at Q4_K_M fits with headroom. What you cannot run: 14B at Q8 (~14 GB), any 20B+ model, 32B at any quantization, or 70B at any quantization. The 12 GB tier is the entry point for local LLMs — useful and capable, but the ceiling is real.

16 GB (RTX 4080 Super): 14B at Q8, 20-22B at Q4

At 16 GB you unlock 14B models at Q8 (~14 GB), which gives near-lossless quality for Qwen3 14B, Phi-4 14B, and similar models. You can also run Mistral Small 22B at Q4 (~14 GB) and some 20B models. What you still cannot run: Gemma 3 27B at Q4 (~17 GB, just above 16 GB), Qwen3 32B at Q4 (~18.5 GB), or any 70B model. The 16 GB tier is the sweet spot for most serious users.

24 GB (RTX 4090): 32B at Q4, everything up to the 70B wall

At 24 GB you unlock the entire 32B tier: Qwen3 32B at Q4 (~18.5 GB), DeepSeek-R1-Distill-32B at Q4, and Gemma 3 27B at Q4 (~17 GB) with room for context. You also run all 14B models at Q8 with significant headroom. The 24 GB ceiling hits at 70B: Llama 3.3 70B at Q4_K_M needs ~37 GB, well above the 4090's capacity. Within the RTX 40 series, 24 GB is the most VRAM you can get on a single consumer card.
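
The tier boundaries above reduce to comparing a model's footprint against a card's VRAM. The Python sketch below uses the approximate sizes quoted in this guide. The 1 GB headroom for context and runtime overhead is an assumption for illustration, not a measured value, and the function and variable names are hypothetical.

```python
# Approximate VRAM footprints (GB) as quoted in this guide.
MODEL_GB = {
    "Qwen3 14B Q4_K_M": 9.0,
    "Qwen3 14B Q8_0": 14.0,
    "Gemma 3 27B Q4": 17.0,
    "Qwen3 32B Q4": 18.5,
    "Llama 3.3 70B Q4_K_M": 37.0,
}

CARD_VRAM_GB = {"RTX 4070": 12, "RTX 4080 Super": 16, "RTX 4090": 24}

# Assumed allowance for KV cache and runtime overhead; real needs grow with context length.
HEADROOM_GB = 1.0


def fits(model_gb: float, vram_gb: float, headroom_gb: float = HEADROOM_GB) -> bool:
    """Return True if the model plus headroom fits entirely in VRAM."""
    return model_gb + headroom_gb <= vram_gb


def cards_that_fit(model: str) -> list[str]:
    """List the cards in this comparison that hold the model fully in VRAM."""
    return [card for card, vram in CARD_VRAM_GB.items() if fits(MODEL_GB[model], vram)]


if __name__ == "__main__":
    for name in MODEL_GB:
        print(f"{name}: {cards_that_fit(name) or 'none of the three'}")
```

Raising the headroom value is a quick way to see which models stop fitting once you plan for longer contexts.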

Speed Comparison: Bandwidth Drives Token Speed

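LLM token generation is limited by memory bandwidth, not compute. The RTX 4090's 1008 GB/s is 37% higher than the RTX 4080 Super's 736 GB/s, which in turn is 46% higher than the RTX 4070's 504 GB/s. Those ratios track tokens per second on the same model reasonably well: in the figures below the 4090 leads the 4080 Super by 32-38%, and the 4080 Super leads the 4070 by about 35-38%, somewhat less than its 46% bandwidth advantage.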

Llama 3.1 8B Q4_K_M

RTX 4070: ~80 t/s
RTX 4080 Super: ~110 t/s
RTX 4090: ~145 t/s

Qwen3 14B Q4_K_M

RTX 4070: ~48 t/s
RTX 4080 Super: ~65 t/s
RTX 4090: ~90 t/s

DeepSeek-R1-32B Q4

RTX 4070: cannot run
RTX 4080 Super: cannot run
RTX 4090: ~38 t/s

All three cards feel fast in conversational chat at 7B. The difference becomes meaningful at 14B where 48 t/s (4070) vs 90 t/s (4090) is noticeable for long responses — roughly half the wait time. At 32B only the 4090 runs the model in VRAM, so the comparison is moot for the other two cards.
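
To see how closely the benchmark figures follow the bandwidth ratios, the short Python sketch below compares the two and converts tokens per second into wait time. The bandwidths and speeds are the approximate figures from this guide. The 500-token response length is an arbitrary assumption for illustration, and the function names are hypothetical.

```python
# Bandwidth (GB/s) and approximate Q4 speeds (t/s) from this guide.
BANDWIDTH = {"RTX 4070": 504, "RTX 4080 Super": 736, "RTX 4090": 1008}
SPEED_7B = {"RTX 4070": 80, "RTX 4080 Super": 110, "RTX 4090": 145}
SPEED_14B = {"RTX 4070": 48, "RTX 4080 Super": 65, "RTX 4090": 90}


def ratio(table: dict[str, float], faster: str, slower: str) -> float:
    """How many times larger the faster card's figure is than the slower card's."""
    return table[faster] / table[slower]


def wait_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to generate a response of the given length at a steady rate."""
    return tokens / tokens_per_second


if __name__ == "__main__":
    for faster, slower in [("RTX 4090", "RTX 4080 Super"), ("RTX 4080 Super", "RTX 4070")]:
        print(
            f"{faster} vs {slower}: "
            f"bandwidth x{ratio(BANDWIDTH, faster, slower):.2f}, "
            f"7B x{ratio(SPEED_7B, faster, slower):.2f}, "
            f"14B x{ratio(SPEED_14B, faster, slower):.2f}"
        )
    # Assumed 500-token answer on the 14B model.
    for card, tps in SPEED_14B.items():
        print(f"{card}: {wait_seconds(500, tps):.1f} s for 500 tokens")
```

On these figures the measured speedups land close to, and mostly just under, the raw bandwidth ratios.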

Power and Running Cost: 450W vs 200W

The RTX 4090 draws approximately 450W under LLM load. The RTX 4070 draws approximately 200W. That 250W difference adds up over sustained use — and it affects PSU requirements and electricity bills.

Monthly electricity cost at 8 hours/day ($0.15/kWh)

GPU	Monthly cost	Yearly cost
RTX 4070 (200W)	~$7	~$88
RTX 4080 Super (320W)	~$12	~$140
RTX 4090 (450W)	~$16	~$197

The RTX 4090 costs roughly $9/month more in electricity than the RTX 4070 at 8 hours/day, or about $110/year extra. Over three years that adds roughly $330 to the effective cost of the 4090. Electricity rates are higher in much of Europe and parts of Australia, which widens the gap.
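
The electricity figures follow directly from watts, hours, and rate. Here is a minimal Python sketch of that arithmetic, assuming a 30-day month and a 365-day year (the guide does not state which day counts it used). The function name is hypothetical.

```python
def energy_cost(watts: float, hours_per_day: float, days: int, usd_per_kwh: float) -> float:
    """Cost in USD of running a constant load for the given number of days."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * usd_per_kwh


# Approximate draw under LLM load, as quoted in this guide.
CARD_WATTS = {"RTX 4070": 200, "RTX 4080 Super": 320, "RTX 4090": 450}

if __name__ == "__main__":
    for card, watts in CARD_WATTS.items():
        monthly = energy_cost(watts, hours_per_day=8, days=30, usd_per_kwh=0.15)
        yearly = energy_cost(watts, hours_per_day=8, days=365, usd_per_kwh=0.15)
        print(f"{card}: ${monthly:.2f}/month, ${yearly:.2f}/year")
```

Swap in your local rate and daily hours to get a figure for your own setup.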

PSU requirements also differ: the RTX 4090 needs at minimum an 850W PSU with a modern CPU. The RTX 4080 Super runs on 750W. The RTX 4070 runs comfortably on 650W. If you are upgrading and your PSU is borderline, factor in upgrade cost for the higher-draw cards.

The 70B Question: None of These Cards Can Do It

If your primary goal is running 70B-class models such as Llama 3.3 70B, Qwen2.5 72B, or DeepSeek-R1-Distill-Llama-70B, none of the three cards in this comparison will satisfy you. Here is why:

Llama 3.3 70B Q4_K_M needs ~37 GB VRAM

The most popular 70B quantization requires approximately 37 GB to load in VRAM. The RTX 4090 tops out at 24 GB. You can run it with GPU+CPU hybrid offloading, but speed drops to approximately 2-5 tokens per second — frustrating for interactive use.
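
The size of the shortfall helps explain why hybrid offloading is so slow: a large share of the weights ends up in system RAM. A rough Python sketch using the ~37 GB figure above follows. It ignores KV cache and runtime overhead, so the real spill would be somewhat larger, and the function names are hypothetical.

```python
MODEL_GB = 37.0  # Llama 3.3 70B at Q4_K_M, approximate figure from this guide
CARD_VRAM_GB = {"RTX 4070": 12, "RTX 4080 Super": 16, "RTX 4090": 24}


def spill_gb(model_gb: float, vram_gb: float) -> float:
    """GB of weights that cannot fit in VRAM and must be offloaded to system RAM."""
    return max(0.0, model_gb - vram_gb)


def spill_share(model_gb: float, vram_gb: float) -> float:
    """Fraction of the model (0.0 to 1.0) that lands outside VRAM."""
    return spill_gb(model_gb, vram_gb) / model_gb


if __name__ == "__main__":
    for card, vram in CARD_VRAM_GB.items():
        print(
            f"{card}: {spill_gb(MODEL_GB, vram):.0f} GB offloaded "
            f"({spill_share(MODEL_GB, vram):.0%} of the model)"
        )
```

Even on the RTX 4090, roughly a third of the model would run from system RAM.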

Qwen2.5 72B needs ~40 GB VRAM

Qwen2.5 72B at Q4_K_M is larger, at approximately 40 GB. No single consumer GPU can hold this in VRAM today. Even with hybrid offloading on a 4090, speed falls to a few tokens per second at best.

For 70B, you need a Mac or dual-GPU setup

The Mac mini M4 Pro with 48 GB unified memory runs Llama 3.3 70B at approximately 12-15 t/s — slow but usable for non-interactive tasks. The Mac Studio M4 Max with 64 GB runs it more comfortably. A dual-RTX-3090 PCIe x8 setup can reach 48 GB combined VRAM but requires a workstation motherboard and careful configuration. These are the realistic paths to 70B, not any single consumer NVIDIA GPU.

What the RTX 4090 does unlock at 24 GB

The jump from 16 GB to 24 GB specifically unlocks 32B models: Qwen3 32B at Q4 (~18.5 GB), DeepSeek-R1-Distill-32B at Q4, and Gemma 3 27B at Q4 (~17 GB). These are excellent, capable models that deliver near-70B quality on many benchmarks. If 32B is your target, the 4090 is the card. If 70B is your target, look elsewhere.

Price-to-Performance Analysis

Raw price-per-token-per-second is one lens, but the model compatibility cliff matters more. A card that runs a model 30% slower but actually runs the model is infinitely better than one that cannot run it at all.

GPU	Price	14B Q4 speed	Unlocks 32B?
RTX 4070 12GB	Check price on Amazon	~48 t/s	No
RTX 4080 Super 16GB	Check price on Amazon	~65 t/s	No
RTX 4090 24GB	Check price on Amazon	~90 t/s	Yes

The RTX 4070 delivers the best raw price efficiency per token at 14B. At the approximate retail prices behind this guide, the RTX 4080 Super costs about 67% more than the 4070 for 35% more speed, a modest step down in efficiency. The RTX 4090 costs about 167% more than the 4070 for 88% more speed, which is reasonable if you need 32B models and poor value if you only run 14B. The decision framework: pick the RTX 4070 if 14B at Q4 is your ceiling, the RTX 4080 Super if you want 14B at Q8 quality, and the RTX 4090 if you want 32B models or the fastest generation of the three.
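
The percentages above can be turned into a speed-per-price comparison. The Python sketch below uses relative prices derived from the figures in this section (4070 = 1.00, 4080 Super = 1.67, 4090 = 2.67). These are ratios for illustration, not actual retail prices, and the function names are hypothetical.

```python
# 14B Q4 speeds (t/s) from this guide.
SPEED_14B = {"RTX 4070": 48, "RTX 4080 Super": 65, "RTX 4090": 90}

# Relative price implied by "67% more" and "167% more" than the RTX 4070.
# These are ratios for illustration, not real prices.
RELATIVE_PRICE = {"RTX 4070": 1.00, "RTX 4080 Super": 1.67, "RTX 4090": 2.67}


def percent_faster(card: str, baseline: str = "RTX 4070") -> float:
    """Speed advantage over the baseline card, in percent."""
    return (SPEED_14B[card] / SPEED_14B[baseline] - 1) * 100


def speed_per_price(card: str) -> float:
    """Tokens per second per unit of relative price (higher is better value)."""
    return SPEED_14B[card] / RELATIVE_PRICE[card]


if __name__ == "__main__":
    for card in SPEED_14B:
        print(
            f"{card}: {percent_faster(card):+.0f}% speed, "
            f"{speed_per_price(card):.1f} t/s per price unit"
        )
```

Replacing the relative prices with current listings gives a value ranking for the day you buy.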

Who Should Buy Which GPU

Buy the RTX 4070 12GB if...

✓ You are on a tight budget
✓ You primarily run 7B-14B models (Llama 3.1 8B, Qwen3 14B, Phi-4 14B)
✓ Q4 quality is acceptable — you do not need Q8 near-lossless on 14B
✓ You want low power draw (~200W) and a quiet, efficient build
✓ You are new to local LLMs and want to try without heavy investment

Avoid if: you want Qwen3 32B, DeepSeek-R1-Distill-32B, or Gemma 3 27B — none fit in 12 GB.

Buy the RTX 4080 Super 16GB if...

✓ You want 14B models at Q8 (near-lossless quality)
✓ You run Mistral Small 22B or similar 20B-class models at Q4
✓ You want noticeably faster generation than the 4070 (~35% faster)
✓ You are an enthusiast who wants the best mid-range option
✓ You do not specifically need 32B models

Avoid if: you need Qwen3 32B or DeepSeek-R1-32B at Q4 — they need 18.5-19 GB, just above 16 GB.

Buy the RTX 4090 24GB if...

✓ You want to run Qwen3 32B, DeepSeek-R1-Distill-32B, or Gemma 3 27B
✓ You want the fastest token generation of the three cards
✓ You run multiple users or applications against the same local GPU
✓ You do heavy long-context inference where speed compounds over generations
✓ Budget is secondary to capability

Avoid if: your only goal is 70B models — the 4090 still cannot run them. Budget goes further on Apple Silicon for 70B.

Running LLMs on Any of These Cards: Ollama Setup

All three cards are fully supported by Ollama and LM Studio. NVIDIA GPU detection is automatic — no special configuration needed.

Install Ollama

Windows and macOS

Download from ollama.com. Installs as a background service, auto-detects NVIDIA GPUs.

Linux

curl -fsSL https://ollama.com/install.sh | sh

RTX 4070 12GB — recommended models

Qwen3 14B Q4_K_M (~9 GB — fits with 3 GB headroom)

ollama run qwen3:14b

Llama 3.1 8B Q8_0 (~8.5 GB — high quality 8B)

ollama run llama3.1:8b-instruct-q8_0

Phi-4 14B Q4_K_M (~9.2 GB)

ollama run phi4:14b

RTX 4080 Super 16GB — recommended models

Qwen3 14B Q8_0 — near-lossless quality (~14 GB)

ollama run qwen3:14b-q8_0

Mistral Small 22B Q4_K_M (~14 GB)

ollama run mistral-small:22b

Phi-4 14B Q8_0 (~14.5 GB)

ollama run phi4:14b-q8_0

RTX 4090 24GB — recommended models

Qwen3 32B Q4_K_M (~18.5 GB) — unlocked by 24 GB

ollama run qwen3:32b

DeepSeek-R1-Distill-32B Q4_K_M

ollama run deepseek-r1:32b

Qwen3 14B Q8_0 with headroom (~14 GB)

ollama run qwen3:14b-q8_0

Verify GPU load after starting a model

ollama ps

The PROCESSOR column should read 100% GPU when the model sits entirely in VRAM. A split such as 48%/52% CPU/GPU means some layers are running on the CPU and generation will be significantly slower. If the model is spilling to CPU, reduce the context length or switch to a smaller quantization.
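
The same check can be scripted. The Python sketch below assumes a local Ollama server on its default port (11434) and that its `/api/ps` endpoint reports `size` and `size_vram` in bytes for each loaded model. Verify both against the API documentation for your Ollama version. The function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_PS_URL = "http://localhost:11434/api/ps"  # assumed default address


def vram_share(model: dict) -> float:
    """Fraction of a loaded model's memory that sits in VRAM (1.0 = fully on GPU)."""
    size = model.get("size", 0)
    if size <= 0:
        return 0.0
    return model.get("size_vram", 0) / size


def loaded_models(url: str = OLLAMA_PS_URL) -> list[dict]:
    """Fetch the list of currently loaded models from a running Ollama server."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response).get("models", [])


def report(models: list[dict]) -> list[str]:
    """One line per model, flagging any that spill out of VRAM."""
    lines = []
    for m in models:
        share = vram_share(m)
        status = "fully in VRAM" if share >= 0.999 else "partly on CPU"
        lines.append(f"{m.get('name', '?')}: {share:.0%} in VRAM ({status})")
    return lines
```

With a model running, `print("\n".join(report(loaded_models())))` prints one status line per loaded model.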

Frequently Asked Questions

Which NVIDIA GPU is best for running LLMs locally?

The RTX 4090 delivers maximum capability with 24 GB VRAM and 1008 GB/s bandwidth — it runs 32B models at Q4 and generates tokens fastest across all sizes. The RTX 4080 Super 16GB is the best value for most users, covering 14B at Q8 and 20B-class models. The RTX 4070 12GB is the budget pick for 7B-14B use. The right choice depends on which models you want to run — if you need Qwen3 32B or DeepSeek-R1-32B, only the 4090 works.

Can the RTX 4070 run Llama 3 70B?

No. Llama 3.3 70B at Q4_K_M needs approximately 37 GB of VRAM. The RTX 4070 has 12 GB. Even the RTX 4090 at 24 GB cannot run 70B in VRAM — none of these three cards can. For 70B models you need a Mac mini M4 Pro with 48 GB unified memory or a Mac Studio M4 Max with 64 GB.

Is the RTX 4090 worth it over the RTX 4080 Super?

Yes if you want 32B models — the 4090's 24 GB unlocks Qwen3 32B and DeepSeek-R1-Distill-32B at Q4. The 4090 also generates tokens 37% faster than the 4080 Super across all model sizes. If you only run 14B models, the RTX 4080 Super costs less and delivers similar quality at roughly 70% the speed — still very fast at ~65 t/s. The 4090 is worth it for power users and researchers who need the 32B tier or absolute fastest generation.

What is the cheapest GPU that runs Qwen3 14B well?

The RTX 4070 12GB runs Qwen3 14B at Q4_K_M (~9 GB) comfortably with headroom. For near-lossless Q8 quality (~14 GB), you need 16 GB of VRAM — the RTX 4060 Ti 16GB is the cheapest option for Q8, though slower than the RTX 4080 Super 16GB.

Is the RTX 4080 Super faster than the RTX 4090 for 7B models?

No. The RTX 4090's 1008 GB/s memory bandwidth is approximately 37% higher than the RTX 4080 Super's 736 GB/s. Since LLM token generation is memory-bandwidth-limited, the 4090 generates tokens faster across all model sizes — roughly 145 t/s vs 110 t/s on a 7B Q4 model. The 4080 Super is never faster than the 4090 at inference.

Sources & methodology

VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide I leaned on:

Hardware Corner GPU ranking. Tokens per second for the 4070, 4080 and 4090 at 2k and 8k context.
XiongjieDai GPU-Benchmarks-on-LLM-Inference. Cross-validated llama-bench runs for all three Ada cards on the same models.
Home GPU LLM Leaderboard. VRAM tier mapping (12/16/24 GB) that drives the 'which one for which model' calls.

Spot a number that does not match the linked source? Get in touch and I will update the guide.