Llama 3.1 Hardware Requirements: 8B, 70B, and 405B GPU Guide
AI helped pull the Llama 3.1 size matrix; the "2026 status" note and the per-size hardware picks were written and edited by hand against the live Meta model cards.
Updated May 2026 · All Llama 3.1 model sizes · VRAM tables · Dual-GPU split guide
2026 status: Llama 3.1 70B has been largely superseded by Llama 3.3 70B for the same use cases (same VRAM footprint, better benchmarks per Meta's release notes). This page is kept for users still on Llama 3.1 8B, anyone evaluating the 405B for research, and historical context.
Meta released Llama 3.1 in July 2024, introducing three model sizes (8B, 70B, 405B) with a shared 128K context window. The 8B is the most accessible of the three and runs on consumer GPUs with 6 GB+ VRAM. The 70B requires about 48 GB of memory at Q4_K_M, putting it in Mac mini M4 Pro or used server GPU territory. The 405B is effectively cloud-only for most users. This guide covers VRAM requirements per quantization, which hardware runs each size, and how to get started.
Bottom Line
- Llama 3.1 8B Q4_K_M (~4.7 GB): Runs on any GPU with 6 GB+ VRAM. RTX 3060 12GB is the sweet spot. ~30 tok/s on RTX 4070, ~40 tok/s on RTX 4090.
- Llama 3.1 70B Q4_K_M (~42 GB): Needs 48 GB of unified memory or combined VRAM. Best options: Mac mini M4 Pro 48GB or dual RTX 3090 with llama.cpp split. ~20 tok/s on Mac mini M4 Pro.
- Llama 3.1 405B (~243 GB Q4): Not practical for consumer hardware. See the 405B section below for the full reality check.
- 128K context window: All three sizes share the same 128K context. Long contexts consume significant additional VRAM beyond the base model weights.
VRAM Requirements for All Llama 3.1 Model Sizes
Llama 3.1 spans a wide range: from 4.7 GB for the 8B Q4 file to 243 GB for 405B Q4. The 128K context window adds memory on top of these base weights proportional to context length. For practical consumer use, the 8B and 70B are the relevant choices.
| Model | VRAM | Recommended Hardware | Speed (t/s) | Notes |
|---|---|---|---|---|
| Llama 3.1 8B Q4_K_M | ~4.7 GB | 6GB+ GPU (RTX 3060 12GB is the sweet spot) · 16GB RAM for CPU | 30 t/s @ RTX 4070 | Sweet spot for 8-12GB GPUs |
| Llama 3.1 8B Q8 | ~8.5 GB | 10-12GB GPU (RTX 3080 10GB) | 25 t/s @ RTX 4070 | Full quality on 12GB GPUs |
| Llama 3.1 70B Q4_K_M | ~42 GB | Mac mini M4 Pro 48GB · A6000 48GB · Dual RTX 3090 | 20 t/s @ Mac M4 Pro 48GB | Minimum 48GB VRAM |
| Llama 3.1 70B Q3_K_M | ~35 GB | 48GB GPU with headroom | 22 t/s @ Mac M4 Pro 48GB | Fits 48GB with 13GB spare |
| Llama 3.1 405B Q4 | ~243 GB | 8x A100 80GB · Enterprise only | N/A consumer | Not practical for consumers |
| Llama 3.1 405B Q2_K | ~120 GB | Multiple high-end GPUs required | Very slow | Significant quality loss |
Source: VRAM math derived from Modal's VRAM formula applied to Meta's official Llama 3.1 model cards; tokens-per-second ranges cross-referenced with XiongjieDai community llama-bench runs. VRAM figures are for model weights only. Context (KV cache) adds memory on top, especially at 128K. Use the VRAM Calculator for context-adjusted estimates.
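The weights-only figures above follow from simple arithmetic: parameter count times average bits per weight, divided by eight. Here is a minimal Python sketch of that estimate. The function name `weights_gb` is hypothetical, and the 4.8 bits-per-weight value for Q4_K_M is an assumption back-derived from this page's numbers (42 GB for 70B), not a figure taken from the Modal formula or the GGUF spec.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Estimate weights-only size in GB (10^9 bytes).

    params_billion: nominal parameter count in billions (8, 70, 405).
    bits_per_weight: average bits stored per weight for the quantization.
    """
    return params_billion * bits_per_weight / 8


# Assumed average bits per weight; back-derived from this page, not official.
Q4_K_M_BITS = 4.8

for size in (8, 70, 405):
    print(f"{size}B Q4_K_M: ~{weights_gb(size, Q4_K_M_BITS):.1f} GB")
```

Real GGUF files differ from this estimate by a few percent, because K-quants mix precisions across tensors and the nominal parameter counts are rounded.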
Llama 3.1 8B: Hardware Guide
The 8B model is where most users start. At Q4_K_M quantization it weighs ~4.7 GB, leaving several GB for context on any 8-12 GB GPU. The full 128K window needs far more memory than that, so plan on shorter contexts on these cards. It handles general conversation, coding assistance, summarization, and writing well. For instruction following and chat, use the instruct variant (default in Ollama).
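To see how much memory context adds on the 8B, the KV cache can be sized from the model's attention shape. The sketch below assumes the published Llama 3.1 8B configuration (32 layers, 8 key-value heads, head dimension 128) and an unquantized fp16 cache. `kv_cache_bytes` is a hypothetical helper name, and inference runtimes add their own buffers on top.

```python
def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for a given context length.

    Defaults are assumed Llama 3.1 8B values with fp16 (2-byte) entries.
    The leading factor of 2 covers one key and one value vector per head.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * tokens


GIB = 1024 ** 3
print(kv_cache_bytes(8_192) / GIB)    # 1.0
print(kv_cache_bytes(131_072) / GIB)  # 16.0
```

Under these assumptions the full 128K window costs about 16 GiB for the cache alone, so on 8-12 GB cards a shorter context or a quantized KV cache is the practical route.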
Minimum
RTX 3060 12GB
12 GB VRAM · ~15 tok/s
Fits Q4 and Q8. Slower but fully functional. Good budget pick.
Recommended
RTX 4070 12GB
12 GB VRAM · ~30 tok/s
Runs Q4 at 30 tok/s and Q8 comfortably. Best price-to-performance for 8B.
Best Performance
RTX 4090 24GB
24 GB VRAM · ~40 tok/s
40 tok/s on Q4. 15+ GB headroom for very long context sessions.
Running 8B on CPU (no GPU)
Llama 3.1 8B Q4_K_M runs on CPU-only with 16 GB+ system RAM. Expect ~3-5 tok/s on a modern desktop CPU, which is slow but usable for non-real-time tasks. With 32 GB RAM you can run Q8 on CPU. Use llama.cpp directly for best CPU performance — Ollama also supports CPU inference automatically when no GPU is detected.
Llama 3.1 70B: 48GB Minimum and Dual-GPU Split
The 70B model is a significant step up in both capability and hardware requirements. At Q4_K_M, it needs ~42 GB of VRAM — more than any single consumer GPU. Your options are a 48 GB unified memory Mac, a 48 GB server GPU (A6000), or splitting across two GPUs with llama.cpp.
Path 1: Single 48GB device (recommended)
Mac mini M4 Pro 48GB
~20 tok/s on 70B Q4
Best value for 70B inference. Per the figures in this guide, Apple Metal acceleration and ~273 GB/s of unified memory bandwidth put it ahead of comparably priced server GPUs for single-user use.
NVIDIA A6000 48GB
~12 tok/s on 70B Q4
Server-class GPU. Better for batch workloads and multi-user deployments. Higher cost than Mac mini for single-user desktop use.
Path 2: Dual GPU split with llama.cpp
Two RTX 3090s (2 x 24 GB = 48 GB combined) can run Llama 3.1 70B Q4_K_M using llama.cpp's tensor split feature. The model is split evenly across both GPUs, with each holding ~21 GB of the 42 GB total. This works but is slower than a single 48 GB device due to PCIe bandwidth between GPUs.
./llama-cli -m llama-3.1-70b-q4_k_m.gguf --tensor-split 1,1 -ngl 999
Expected speed: ~5 tok/s — significantly slower than Mac mini M4 Pro at ~20 tok/s. If you already own two RTX 3090s, this is a viable path. For a new purchase, Mac mini M4 Pro 48GB delivers 4x the speed at a lower cost.
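The `--tensor-split` values are proportions, not gigabytes, so `1,1` means an even split. A small Python sketch of how a ratio list maps to per-GPU weight sizes follows. `split_weights` is a hypothetical helper, and it ignores the KV cache and per-device buffers that also land on the GPUs.

```python
def split_weights(total_gb: float, ratios: list[float]) -> list[float]:
    """Divide a model's weight size across GPUs in proportion to `ratios`.

    Mirrors the idea behind a proportional split such as `--tensor-split 1,1`;
    this is an illustration, not llama.cpp's actual allocation logic.
    """
    total_ratio = sum(ratios)
    if total_ratio <= 0:
        raise ValueError("ratios must sum to a positive number")
    return [total_gb * r / total_ratio for r in ratios]


print(split_weights(42, [1, 1]))  # [21.0, 21.0]
print(split_weights(42, [3, 1]))  # [31.5, 10.5]
```

An uneven ratio is useful when the two cards have different VRAM, or when one of them also drives a display and has less free memory.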
See the full 70B model local guide for a step-by-step setup walkthrough, including Ollama and llama.cpp configurations for both single-GPU and multi-GPU setups.
Llama 3.1 405B: Reality Check
Not practical for consumer hardware
Llama 3.1 405B at Q4 quantization requires approximately 243 GB of VRAM for the weights. That exceeds three NVIDIA A100 80GB cards (240 GB), so it takes at least four, and servers of this class are commonly configured as 8-GPU nodes. Even the most aggressive Q2_K quantization only brings it down to ~120 GB (with significant quality degradation), still requiring multiple high-end server GPUs.
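The card count can be checked with a ceiling division. This is a rough sketch that counts weights only. `gpus_needed` and its `reserve_per_gpu_gb` parameter are hypothetical, and the 20 GB reserve in the last call is an arbitrary illustration of headroom for KV cache and runtime buffers. The 8x A100 figure quoted in this guide presumably reflects a standard 8-GPU server node rather than the bare minimum for weights.

```python
import math


def gpus_needed(weights_gb: float, vram_per_gpu_gb: float,
                reserve_per_gpu_gb: float = 0.0) -> int:
    """Smallest number of identical GPUs whose usable VRAM holds the weights."""
    usable = vram_per_gpu_gb - reserve_per_gpu_gb
    if usable <= 0:
        raise ValueError("reserve must be smaller than per-GPU VRAM")
    return math.ceil(weights_gb / usable)


print(gpus_needed(243, 80))                         # 4  (405B Q4 on A100 80GB)
print(gpus_needed(120, 80))                         # 2  (405B Q2_K)
print(gpus_needed(243, 80, reserve_per_gpu_gb=20))  # 5  (with assumed headroom)
```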
Q4 requirement
~243 GB VRAM
At least 4x A100 80GB (8x node typical)
Q2_K requirement
~120 GB VRAM
Quality heavily degraded
Enterprise minimum
8x A100 or H100
Not consumer hardware
Better alternative
Llama 3.3 70B
Comparable to 405B on many benchmarks at 43 GB
For nearly all use cases, Llama 3.3 70B is the better choice: it is comparable to the 405B on many benchmarks, requires only ~43 GB at Q4_K_M (fits a Mac mini M4 Pro 48GB), and runs at practical inference speeds. The 405B is only worth pursuing for specific research applications that require maximum model size at any cost.
Can My Hardware Run Llama 3.1?
Any GPU with 6 GB+ VRAM can run Llama 3.1 8B. The 70B requires 48 GB total — either a single 48 GB device or dual 24 GB GPUs with tensor split.
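The per-device verdicts in this section come down to subtracting the weight size from the available memory. As a rough illustration, the Python sketch below applies that check using the weights-only sizes from this guide. `MODELS_GB`, `fit_report`, and `runnable` are hypothetical names, and a positive result still leaves the KV cache to fit in the spare memory.

```python
# Weights-only sizes in GB, copied from the table in this guide.
MODELS_GB = {
    "8B Q4_K_M": 4.7,
    "8B Q8": 8.5,
    "70B Q3_K_M": 35.0,
    "70B Q4_K_M": 42.0,
    "405B Q4": 243.0,
}


def fit_report(vram_gb: float) -> dict[str, float]:
    """Spare memory in GB per model; negative means the weights do not fit."""
    return {name: round(vram_gb - size, 1) for name, size in MODELS_GB.items()}


def runnable(vram_gb: float) -> list[str]:
    """Names of the models whose weights fit in `vram_gb`."""
    return [name for name, spare in fit_report(vram_gb).items() if spare >= 0]


print(runnable(12))                    # ['8B Q4_K_M', '8B Q8']
print(fit_report(48)["70B Q4_K_M"])    # 6.0
print(fit_report(24)["70B Q4_K_M"])    # -18.0
```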
RTX 3060 12GB
Yes (8B Q4)
Runs:
- Llama 3.1 8B Q4_K_M (~4.7 GB, 7.3 GB spare)
- Llama 3.1 8B Q8 (~8.5 GB, 3.5 GB spare)
- Long-context sessions at Q4 within the ~7 GB left for KV cache
Does not fit:
- Llama 3.1 70B at any quantization (needs 48GB)
- Llama 3.1 405B (enterprise only)
The RTX 3060 12GB is the minimum comfortable GPU for Llama 3.1 8B. The 4.7 GB Q4_K_M file leaves about 7.3 GB for context, which covers long sessions but not the full 128K window with an fp16 KV cache (roughly 16 GiB on its own for the 8B). The Q8 (~8.5 GB) also fits with 3.5 GB spare, giving you near-full-quality 8B on a budget GPU.
RTX 4070 12GB
Yes (8B Q8)
Runs:
- Llama 3.1 8B Q4_K_M (~4.7 GB, comfortable)
- Llama 3.1 8B Q8 (~8.5 GB, 3.5 GB spare)
- Long-context sessions within the remaining VRAM
Does not fit:
- Llama 3.1 70B at any quantization
- Llama 3.1 405B
The RTX 4070 is an excellent 8B runner. At ~30 tok/s on Q4_K_M and ~25 tok/s on Q8, it delivers fast and responsive chat. The 128K context window fills up memory quickly in long conversations — Q4_K_M gives the best balance of quality and context headroom on 12 GB.
RTX 4090 24GB
Yes (8B Q8, fast)
Runs:
- Llama 3.1 8B Q8 (~8.5 GB, 15.5 GB spare for context)
- Llama 3.1 8B Q4_K_M at ~40 tok/s
- Very long context sessions with 8B
Does not fit:
- Llama 3.1 70B Q4_K_M (~42 GB, 18 GB over limit)
- Llama 3.1 405B
The RTX 4090 is the fastest single consumer GPU for Llama 3.1 8B at ~40 tok/s on Q4_K_M. The 15+ GB headroom after loading Q8 enables very long 128K context sessions. It cannot touch the 70B model — that requires 48 GB minimum. For 70B, look at Mac mini M4 Pro 48GB instead.
Mac mini M4 Pro 48GB
Yes (70B Q4)
Runs:
- Llama 3.1 70B Q4_K_M (~42 GB, 6 GB spare)
- Llama 3.1 70B Q3_K_M (~35 GB, 13 GB spare)
- Llama 3.1 8B at any quantization
Does not fit:
- Llama 3.1 405B at any useful quantization
- Llama 3.1 70B Q8 (~79 GB, over 48 GB)
The Mac mini M4 Pro 48GB is the best value hardware for Llama 3.1 70B. At ~20 tok/s on Q4_K_M, it beats much more expensive server GPUs on price-per-token. The 6 GB headroom after loading Q4_K_M is tight — keep context length moderate. Q3_K_M at ~35 GB gives a better experience with 13 GB to spare.
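To put a number on "moderate", the spare memory can be converted into an upper bound on context length. The sketch below assumes the published Llama 3.1 70B attention shape (80 layers, 8 key-value heads, head dimension 128), an fp16 KV cache, and GB meaning 10^9 bytes. `max_context_tokens` is a hypothetical helper, and real limits are lower because the OS and runtime buffers also draw on unified memory.

```python
def max_context_tokens(spare_gb: float, layers: int = 80, kv_heads: int = 8,
                       head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Upper bound on tokens whose KV cache fits in `spare_gb` of memory.

    Defaults are assumed Llama 3.1 70B values with fp16 (2-byte) entries.
    """
    # One key and one value vector per KV head, per layer, per token.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return int(spare_gb * 1e9) // per_token


print(max_context_tokens(6))   # 18310  (Q4_K_M, 6 GB spare)
print(max_context_tokens(13))  # 39672  (Q3_K_M, 13 GB spare)
```

By this estimate neither quantization leaves room for the full 128K window on a 48 GB machine, which is why Q3_K_M is the more comfortable option for longer sessions.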
NVIDIA A6000 48GB
Yes (70B Q4)
Runs:
- Llama 3.1 70B Q4_K_M (~42 GB, 6 GB spare)
- Llama 3.1 70B Q3_K_M (~35 GB, 13 GB spare)
- Llama 3.1 8B all quantizations
Does not fit:
- Llama 3.1 405B
- Llama 3.1 70B Q8 (~79 GB)
The NVIDIA A6000 48GB delivers ~12 tok/s on Llama 3.1 70B Q4_K_M, slower than the ~20 tok/s this guide lists for the Mac mini M4 Pro. Memory bandwidth does not explain the gap: the A6000's 768 GB/s is well above the M4 Pro's ~273 GB/s. This guide attributes the difference to Apple's Metal optimization for unified memory. The A6000 is better for batch processing and server deployments. For a single-user desktop, the Mac mini M4 Pro wins on price and, per these figures, speed.
Dual RTX 3090 (2x24GB)
Yes (70B split)
Runs:
- Llama 3.1 70B Q4_K_M split across both GPUs (~21 GB each)
- Llama 3.1 8B on single GPU
Does not fit:
- Llama 3.1 70B on a single RTX 3090 (24 GB, too small)
- Llama 3.1 405B
Two RTX 3090s give 48 GB combined and can run Llama 3.1 70B Q4_K_M using llama.cpp tensor splitting. The result is ~5 tok/s — significantly slower than Mac mini M4 Pro due to PCIe inter-GPU bandwidth limits. This setup makes sense if you already own two RTX 3090s; buying new, Mac mini M4 Pro 48GB is faster and cheaper.
Llama 3.1 vs 3.2 vs 3.3: Which Should You Use?
Three Llama 3 point releases are actively used in 2026. They serve different purposes, and only the 70B has a direct successor (Llama 3.3 70B replaces Llama 3.1 70B).
| | Llama 3.1 | Llama 3.2 | Llama 3.3 |
|---|---|---|---|
| Release date | July 2024 | September 2024 | December 2024 |
| Available sizes | 8B, 70B, 405B | 1B, 3B, 11B, 90B | 70B only |
| Vision models | No | Yes (11B, 90B) | No |
| 8B VRAM (Q4) | ~4.7 GB | No 8B model (3B: ~2.2 GB) | N/A |
| 70B VRAM (Q4) | ~42 GB | N/A | ~43 GB |
| Context window | 128K | 128K | 128K |
| Best 70B choice | Skip (use 3.3) | N/A | Yes (use this) |
| Best 8B choice | Yes (solid choice) | No 8B model (use 3B for tiny VRAM) | N/A |
Use Llama 3.1 8B when:
- You want a solid text model on 6-12 GB VRAM
- You need 128K context without a vision requirement
- Running on CPU with 16 GB+ RAM
Use Llama 3.2 when:
- You need vision / image understanding (11B or 90B)
- You have very tight VRAM (1B at 1.3 GB, 3B at 2.2 GB)
- Running on edge devices or CPU with minimal RAM
Use Llama 3.3 70B when:
- You have 48 GB VRAM and want the best text quality
- You were considering Llama 3.1 405B (3.3 70B is comparable on many benchmarks)
- You want the highest-quality locally-runnable Llama model
Running Llama 3.1 with Ollama
Llama 3.1 8B (default)
ollama run llama3.1
Pulls ~4.7 GB (Q4_K_M default). Runs on any GPU with 6 GB+ VRAM or CPU with 16 GB+ RAM. The go-to command for general chat, coding help, and writing on 8-24 GB GPUs.
Llama 3.1 8B Q8
ollama run llama3.1:8b-instruct-q8_0
Pulls ~8.5 GB. Near-full-quality weights. Requires a 10-12 GB GPU (RTX 3080 10GB or RTX 3060 12GB). Noticeably sharper on nuanced tasks than Q4, at somewhat lower speed (~25 vs ~30 tok/s on an RTX 4070 per the table above).
Llama 3.1 70B
ollama run llama3.1:70b
Pulls ~42 GB. Requires 48 GB unified or discrete VRAM. Mac mini M4 Pro 48GB is the recommended hardware. Note: consider llama3.3:70b instead, which needs about the same VRAM and delivers better quality.
Llama 3.1 405B
ollama run llama3.1:405b
Pulls ~243 GB. Not practical for consumer hardware: it requires enterprise GPU clusters. If you want the best locally-runnable Llama, use Llama 3.3 70B instead.
Tip: Use Llama 3.3 for the 70B slot
If you have hardware capable of running Llama 3.1 70B, run ollama run llama3.3:70b instead. Llama 3.3 70B needs about the same memory (~43 GB at Q4_K_M) and delivers better benchmark results than Llama 3.1 70B on instruction following and reasoning tasks.
For a full local setup guide including model management and Open WebUI, see how to run Llama locally.
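Once a model is pulled, Ollama also serves a local HTTP API, by default on port 11434. The Python sketch below builds a request for the `/api/generate` endpoint using only the standard library. The helper names and the 8192-token `num_ctx` value are illustrative assumptions, and nothing is sent until `generate` is called against a running Ollama server.

```python
import json
import urllib.request


def build_generate_request(model: str, prompt: str, num_ctx: int = 8192,
                           host: str = "http://localhost:11434") -> urllib.request.Request:
    """Build (but do not send) a non-streaming /api/generate request."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # one JSON object instead of a stream
        "options": {"num_ctx": num_ctx},  # context length; larger costs more VRAM
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, **kwargs) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_generate_request(model, prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("llama3.1", "Explain KV cache in one sentence.")` would return the model's reply as a string. Keeping `num_ctx` modest is the simplest way to stay inside the VRAM budgets discussed in this guide.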
Related Guides
Llama 3.2 Hardware Requirements
Vision models (11B, 90B) and compact models (1B, 3B) — what hardware each needs
How to Run Llama Locally
Step-by-step Ollama and llama.cpp setup for all Llama models
How to Run a 70B Model Locally
Complete guide to 70B inference on Mac mini M4 Pro and dual-GPU setups
Best LLMs for 8GB VRAM
Every model that fits on an RTX 4060 — top picks with benchmarks
Best LLMs for 48GB VRAM
Mac mini M4 Pro and A6000 recommendations including Llama 3.1 70B
Mac mini M4 Pro vs RTX 4090
Why 48GB unified memory beats a faster GPU for 70B models
Frequently Asked Questions
How much VRAM does Llama 3.1 8B need?
Llama 3.1 8B needs ~4.7 GB VRAM at Q4_K_M quantization. Any GPU with 6 GB+ VRAM can load it, and the RTX 3060 12GB is the comfortable minimum. You can also run it on CPU with 16 GB+ system RAM at ~3-5 tok/s. At Q8 it needs ~8.5 GB, so a 10-12 GB GPU is required.
Can I run Llama 3.1 70B on a single consumer GPU?
No single consumer GPU has enough VRAM. The 70B at Q4_K_M needs ~42 GB. Your options are: a Mac mini M4 Pro 48GB, an NVIDIA A6000 48GB, or dual RTX 3090s (2x24GB = 48GB) using llama.cpp tensor split. The dual-GPU split works but runs slower (~5 tok/s) due to PCIe bandwidth.
What is the difference between Llama 3.1 and Llama 3.2?
Llama 3.1 and 3.2 share the same 128K context window but cover different sizes. Llama 3.2 has no 8B model: it added compact text models (1B, 3B) and vision models (11B Vision, 90B Vision) that can analyze images, while Llama 3.1 is text-only at 8B, 70B, and 405B. The 70B and 405B text sizes have no Llama 3.2 equivalent.
Should I use Llama 3.1 70B or Llama 3.3 70B?
Use Llama 3.3 70B. It is the improved version and outperforms Llama 3.1 70B on most benchmarks with about the same VRAM requirement (~43 GB at Q4_K_M). Llama 3.3 70B is also comparable to Llama 3.1 405B on many tasks. There is no reason to choose 3.1 70B over 3.3 70B if your hardware can run either.
Can I run Llama 3.1 405B locally?
Only in enterprise or heavily quantized setups. At Q4, the 405B needs ~243 GB of VRAM, which means a multi-GPU server: at least four A100 80GB cards for the weights alone, and more commonly an 8-GPU node. The most aggressive Q2 quantization reduces it to ~120 GB, but quality degrades significantly. For consumer hardware, Llama 3.3 70B offers comparable performance at a fraction of the cost and VRAM.
How fast is Llama 3.1 8B on an RTX 4070?
Llama 3.1 8B Q4_K_M runs at approximately 30 tok/s on an RTX 4070. On an RTX 4090 you get ~40 tok/s. The RTX 3060 12GB delivers around 15 tok/s. For an 8B model, 30 tok/s feels fast and real-time in chat.
Sources & methodology
VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide I leaned on:
- Hugging Face Hub. Meta's official Llama 3.1 model card (8B, 70B, 405B parameter and context numbers).
- Modal: How much VRAM do I need for LLM inference. VRAM formula used for the per-quant memory tables on this page.
- XiongjieDai GPU-Benchmarks-on-LLM-Inference. Independent llama-bench runs for Llama 3.1 quants across consumer GPUs.
Spot a number that does not match the linked source? Email [email protected] and I will update the guide.