Llama 3.1 Hardware Requirements: 8B, 70B, and 405B GPU Guide

AI helped pull the Llama 3.1 size matrix; the "2026 status" note and the per-size hardware picks were written and edited by hand against the live Meta model cards.

Updated May 2026 · All Llama 3.1 model sizes · VRAM tables · Dual-GPU split guide

2026 status: Llama 3.1 70B has been largely superseded by Llama 3.3 70B for the same use cases (same VRAM footprint, better benchmarks per Meta's release notes). This page is kept for users still on Llama 3.1 8B, anyone evaluating the 405B for research, and historical context.

Meta released Llama 3.1 in July 2024, introducing three model sizes (8B, 70B, 405B) with a shared 128K context window. The 8B is the most accessible of the three and runs on consumer GPUs with 6 GB+ VRAM. The 70B requires 48 GB of memory, putting it in Mac mini M4 Pro or used server GPU territory. The 405B is effectively cloud-only for most users. This guide covers VRAM requirements for each quantization, which hardware runs each size, and how to get started.

Bottom Line

Llama 3.1 8B Q4_K_M (~4.7 GB): Runs on any GPU with 6 GB+ VRAM. RTX 3060 12GB is the sweet spot. ~30 tok/s on RTX 4070, ~40 tok/s on RTX 4090.
Llama 3.1 70B Q4_K_M (~42 GB): Needs 48 GB of unified memory or combined VRAM. Best options: Mac mini M4 Pro 48GB or dual RTX 3090 with llama.cpp split. ~20 tok/s on Mac mini M4 Pro.
Llama 3.1 405B (~243 GB Q4): Not practical for consumer hardware. See the 405B section below for the full reality check.
128K context window: All three sizes share the same 128K context. Long contexts consume significant additional VRAM beyond the base model weights.

VRAM Requirements for All Llama 3.1 Model Sizes

Llama 3.1 spans a wide range: from 4.7 GB for the 8B Q4 file to 243 GB for 405B Q4. The 128K context window adds memory on top of these base weights proportional to context length. For practical consumer use, the 8B and 70B are the relevant choices.

Model	VRAM	Recommended Hardware	Speed (t/s)	Notes
Llama 3.1 8B Q4_K_M	~4.7 GB	6GB+ GPU (RTX 3060) · 16GB RAM for CPU	30 t/s @ RTX 4070	Sweet spot for 8-12GB GPUs
Llama 3.1 8B Q8	~8.5 GB	10-12GB GPU (RTX 3080 10GB)	25 t/s @ RTX 4070	Full quality on 12GB GPUs
Llama 3.1 70B Q4_K_M	~42 GB	Mac mini M4 Pro 48GB · A6000 48GB · Dual RTX 3090	20 t/s @ Mac M4 Pro 48GB	Minimum 48GB VRAM
Llama 3.1 70B Q3_K_M	~35 GB	48GB GPU with headroom	22 t/s @ Mac M4 Pro 48GB	Fits 48GB with 13GB spare
Llama 3.1 405B Q4	~243 GB	8x A100 80GB · Enterprise only	N/A consumer	Not practical for consumers
Llama 3.1 405B Q2_K	~120 GB	Multiple high-end GPUs required	Very slow	Significant quality loss

Source: VRAM math derived from Modal's VRAM formula applied to Meta's official Llama 3.1 model cards; tokens-per-second ranges cross-referenced with XiongjieDai community llama-bench runs. VRAM figures are for model weights only. Context (KV cache) adds memory on top, especially at 128K. Use the VRAM Calculator for context-adjusted estimates.
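
The weight-only figures in the table can be roughly reproduced from parameter count and bits per weight. The sketch below is an illustration, not the exact formula used for this page: the function name and the effective bits-per-weight values (4.8 for Q4_K_M, 8.5 for Q8) are assumptions chosen to approximate the table, and real GGUF files vary by a few percent.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

# Assumed effective bits per weight; these are illustrative, not official values.
ASSUMED_BPW = {"Q4_K_M": 4.8, "Q8": 8.5}

for label, params in [("8B", 8), ("70B", 70), ("405B", 405)]:
    size = weights_gb(params, ASSUMED_BPW["Q4_K_M"])
    print(f"Llama 3.1 {label} Q4_K_M: ~{size:.1f} GB")
```

Under these assumptions the 70B and 405B land on ~42 GB and ~243 GB, and the 8B comes out within about 0.1 GB of the table value. KV cache and runtime buffers are not included.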

Llama 3.1 8B: Hardware Guide

The 8B model is where most users start. At Q4_K_M quantization it weighs ~4.7 GB, leaving several gigabytes for context on any 8-12 GB GPU. That covers long conversations, though the full 128K window at 16-bit cache precision needs more memory than a 12 GB card has left. It handles general conversation, coding assistance, summarization, and writing well. For instruction following and chat, use the instruct variant (default in Ollama).

Minimum

RTX 3060 12GB

12 GB VRAM · ~15 tok/s

Fits Q4 and Q8. Slower but fully functional. Good budget pick.

Recommended

RTX 4070 12GB

12 GB VRAM · ~30 tok/s

Runs Q4 at 30 tok/s and Q8 comfortably. Best price-to-performance for 8B.

Best Performance

RTX 4090 24GB

24 GB VRAM · ~40 tok/s

40 tok/s on Q4. 15+ GB headroom for very long context sessions.

Running 8B on CPU (no GPU)

Llama 3.1 8B Q4_K_M runs on CPU-only with 16 GB+ system RAM. Expect ~3-5 tok/s on a modern desktop CPU, which is slow but usable for non-real-time tasks. With 32 GB RAM you can run Q8 on CPU. Use llama.cpp directly for best CPU performance — Ollama also supports CPU inference automatically when no GPU is detected.

Llama 3.1 70B: 48GB Minimum and Dual-GPU Split

The 70B model is a significant step up in both capability and hardware requirements. At Q4_K_M, it needs ~42 GB of VRAM — more than any single consumer GPU. Your options are a 48 GB unified memory Mac, a 48 GB server GPU (A6000), or splitting across two GPUs with llama.cpp.

Path 1: Single 48GB device (recommended)

Mac mini M4 Pro 48GB

~20 tok/s on 70B Q4

Best value for 70B inference. Apple Metal acceleration and ~273 GB/s of unified memory bandwidth make it faster than comparably-priced server GPUs for single-user use.

NVIDIA A6000 48GB

~12 tok/s on 70B Q4

Server-class GPU. Better for batch workloads and multi-user deployments. Higher cost than Mac mini for single-user desktop use.

Path 2: Dual GPU split with llama.cpp

Two RTX 3090s (2 x 24 GB = 48 GB combined) can run Llama 3.1 70B Q4_K_M using llama.cpp's tensor split feature. The model is split evenly across both GPUs, with each holding ~21 GB of the 42 GB total. This works but is slower than a single 48 GB device due to PCIe bandwidth between GPUs.

./llama-cli -m llama-3.1-70b-q4_k_m.gguf --tensor-split 1,1 -ngl 999

Expected speed: ~5 tok/s — significantly slower than Mac mini M4 Pro at ~20 tok/s. If you already own two RTX 3090s, this is a viable path. For a new purchase, Mac mini M4 Pro 48GB delivers 4x the speed at a lower cost.
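
The even split described above is simple proportional arithmetic. As a rough sketch (the function names are hypothetical, and the calculation covers weights only, not KV cache or compute buffers), the ratios passed to --tensor-split can be turned into a per-GPU load and checked against each card's VRAM:

```python
def split_weights(total_gb: float, ratios: list[float]) -> list[float]:
    """Divide model weights across GPUs in proportion to the given ratios."""
    total_ratio = sum(ratios)
    return [total_gb * r / total_ratio for r in ratios]

def fits(shares_gb: list[float], vram_gb: list[float]) -> bool:
    """True if every GPU's share of the weights is within its VRAM."""
    return all(share <= vram for share, vram in zip(shares_gb, vram_gb))

shares = split_weights(42.0, [1, 1])   # 70B Q4_K_M with --tensor-split 1,1
print(shares)                          # [21.0, 21.0]
print(fits(shares, [24.0, 24.0]))      # True: ~3 GB left per RTX 3090
```

With mismatched cards, the same helper shows whether an uneven ratio keeps each share under its card's limit before you load the model.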

See the full 70B model local guide for a step-by-step setup walkthrough, including Ollama and llama.cpp configurations for both single-GPU and multi-GPU setups.

Llama 3.1 405B: Reality Check

Not practical for consumer hardware

Llama 3.1 405B at Q4 quantization requires approximately 243 GB of VRAM. That is more than three NVIDIA A100 80GB cards can hold for the weights alone, and in practice it means a full multi-GPU server node such as 8x A100 80GB. Even the most aggressive Q2_K quantization only brings it down to ~120 GB (with significant quality degradation), still requiring multiple high-end server GPUs.
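
For a sense of scale, the weights-only lower bound on GPU count is a ceiling division. This is a minimal sketch with a hypothetical helper name. It ignores KV cache, activations, and interconnect topology, which is why real deployments are usually sized as full 8-GPU nodes rather than at this minimum.

```python
import math

def min_gpus_for_weights(model_gb: float, gpu_vram_gb: float) -> int:
    """Smallest number of GPUs whose combined VRAM holds the weights alone."""
    return math.ceil(model_gb / gpu_vram_gb)

print(min_gpus_for_weights(243, 80))  # 405B Q4 on 80 GB cards -> 4
print(min_gpus_for_weights(120, 80))  # 405B Q2_K on 80 GB cards -> 2
print(min_gpus_for_weights(243, 24))  # 405B Q4 on 24 GB consumer cards -> 11
```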

Q4 requirement

~243 GB VRAM

8x A100 80GB minimum

Q2_K requirement

~120 GB VRAM

Quality heavily degraded

Enterprise minimum

8x A100 or H100

Not consumer hardware

Better alternative

Llama 3.3 70B

Comparable to 405B on many benchmarks at ~43 GB

For nearly all use cases, Llama 3.3 70B is the better choice: it comes close to 405B quality on many benchmarks, requires only ~43 GB at Q4_K_M (fits a Mac mini M4 Pro 48GB), and runs at practical inference speeds. The 405B is only worth pursuing for specific research applications that require maximum model size at any cost.

Can My Hardware Run Llama 3.1?

Any GPU with 6 GB+ VRAM can run Llama 3.1 8B. The 70B requires 48 GB total — either a single 48 GB device or dual 24 GB GPUs with tensor split.
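
The per-device verdicts in this section follow from comparing the weight sizes in the VRAM table against available memory. A small sketch of that check is below. The sizes are copied from this guide's table, while the function name and the 1 GB reserve for runtime buffers are assumptions made for illustration.

```python
# Weight sizes in GB, copied from the VRAM table in this guide.
MODELS_GB = {
    "Llama 3.1 8B Q4_K_M": 4.7,
    "Llama 3.1 8B Q8": 8.5,
    "Llama 3.1 70B Q3_K_M": 35.0,
    "Llama 3.1 70B Q4_K_M": 42.0,
    "Llama 3.1 405B Q2_K": 120.0,
    "Llama 3.1 405B Q4": 243.0,
}

def models_that_fit(total_vram_gb: float, reserve_gb: float = 1.0):
    """Return (model, spare GB) pairs whose weights fit with reserve_gb left free."""
    budget = total_vram_gb - reserve_gb
    return [
        (name, round(total_vram_gb - size, 1))
        for name, size in MODELS_GB.items()
        if size <= budget
    ]

for vram in (12, 24, 48):
    print(f"{vram} GB:", models_that_fit(vram))
```

The spare figure is what remains for context, so a model that only just fits will be limited to short prompts.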

RTX 3060 12GB

Yes (8B Q4)

Runs:

+Llama 3.1 8B Q4_K_M (~4.7 GB, 7.3 GB spare)
+Llama 3.1 8B Q8 (~8.5 GB, 3.5 GB spare)
+Long contexts at Q4 (tens of thousands of tokens in the 7.3 GB spare)

Does not fit:

-Llama 3.1 70B at any quantization (needs 48GB)
-Llama 3.1 405B (enterprise only)

The RTX 3060 12GB is the minimum comfortable GPU for Llama 3.1 8B. The 4.7 GB Q4_K_M file leaves about 7.3 GB for context, enough for long sessions though not for the full 128K window at 16-bit cache precision. The Q8 (~8.5 GB) also fits with 3.5 GB spare, giving you full-quality 8B on a budget GPU.

RTX 4070 12GB

Yes (8B Q8)

Runs:

+Llama 3.1 8B Q4_K_M (~4.7 GB, comfortable)
+Llama 3.1 8B Q8 (~8.5 GB, 3.5 GB spare)
+Strong 128K context performance

Does not fit:

-Llama 3.1 70B at any quantization
-Llama 3.1 405B

The RTX 4070 is an excellent 8B runner. At ~30 tok/s on Q4_K_M and ~25 tok/s on Q8, it delivers fast and responsive chat. The 128K context window fills up memory quickly in long conversations — Q4_K_M gives the best balance of quality and context headroom on 12 GB.
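
How quickly context eats memory can be estimated from the KV cache formula: two tensors (keys and values) per layer, per KV head, per token. The sketch below assumes the Llama 3.1 8B shape from its published config (32 layers, 8 KV heads, head dimension 128) and a 16-bit cache. The function name is hypothetical, and runtimes add their own buffers on top.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB: keys plus values for every layer and token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 2**30

# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads, head_dim 128, fp16 cache.
for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: {kv_cache_gib(32, 8, 128, ctx):.1f} GiB")
```

Under these assumptions the cache costs about 1 GiB per 8K tokens, so the full 128K window needs roughly 16 GiB for the cache alone. On a 12 GB card that means a shorter context or a quantized KV cache.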

RTX 4090 24GB

Yes (8B Q8, fast)

Runs:

+Llama 3.1 8B Q8 (~8.5 GB, 15.5 GB spare for context)
+Llama 3.1 8B Q4_K_M at ~40 tok/s
+Very long context sessions with 8B

Does not fit:

-Llama 3.1 70B Q4_K_M (~42 GB — 18 GB over limit)
-Llama 3.1 405B

The RTX 4090 is the fastest single consumer GPU for Llama 3.1 8B at ~40 tok/s on Q4_K_M. The 15+ GB headroom after loading Q8 enables very long 128K context sessions. It cannot touch the 70B model — that requires 48 GB minimum. For 70B, look at Mac mini M4 Pro 48GB instead.

Mac mini M4 Pro 48GB

Yes (70B Q4)

Runs:

+Llama 3.1 70B Q4_K_M (~42 GB, 6 GB spare)
+Llama 3.1 70B Q3_K_M (~35 GB, 13 GB spare)
+Llama 3.1 8B at any quantization

Does not fit:

-Llama 3.1 405B at any useful quantization
-Llama 3.1 70B Q8 (~79 GB — over 48 GB)

The Mac mini M4 Pro 48GB is the best value hardware for Llama 3.1 70B. At ~20 tok/s on Q4_K_M, it beats much more expensive server GPUs on price-per-token. The 6 GB headroom after loading Q4_K_M is tight — keep context length moderate. Q3_K_M at ~35 GB gives a better experience with 13 GB to spare.
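
To put "keep context length moderate" into numbers, the KV cache formula can be inverted to estimate how many tokens fit in the spare memory. This is an upper-bound sketch under stated assumptions: the Llama 3.1 70B shape from its published config (80 layers, 8 KV heads, head dimension 128), a 16-bit cache, and no allowance for memory used by macOS or compute buffers. The function name is hypothetical.

```python
def max_context_tokens(headroom_gb: float, n_layers: int, n_kv_heads: int,
                       head_dim: int, bytes_per_value: int = 2) -> int:
    """Largest token count whose KV cache fits in headroom_gb (1 GB = 1e9 bytes)."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return int(headroom_gb * 1e9 // per_token_bytes)

# Assumed Llama 3.1 70B shape: 80 layers, 8 KV heads, head_dim 128, fp16 cache.
print(max_context_tokens(6, 80, 8, 128))   # Q4_K_M, 6 GB spare  -> 18310
print(max_context_tokens(13, 80, 8, 128))  # Q3_K_M, 13 GB spare -> 39672
```

Real limits will be lower once the operating system and runtime take their share, but the ratio between the two quantizations holds.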

NVIDIA A6000 48GB

Yes (70B Q4)

Runs:

+Llama 3.1 70B Q4_K_M (~42 GB, 6 GB spare)
+Llama 3.1 70B Q3_K_M (~35 GB, 13 GB spare)
+Llama 3.1 8B all quantizations

Does not fit:

-Llama 3.1 405B
-Llama 3.1 70B Q8 (~79 GB)

The NVIDIA A6000 48GB delivers ~12 tok/s on Llama 3.1 70B Q4_K_M. That is slower than the Mac mini M4 Pro despite the A6000's higher raw memory bandwidth (768 GB/s vs ~273 GB/s), a gap attributed here to Apple's Metal optimizations for unified memory. The A6000 is better for batch processing and server deployments. For a single-user desktop, the Mac mini M4 Pro wins on price and speed.

Dual RTX 3090 (2x24GB)

Yes (70B split)

Runs:

+Llama 3.1 70B Q4_K_M split across both GPUs (~21 GB each)
+Llama 3.1 8B on single GPU

Does not fit:

-Llama 3.1 70B on a single RTX 3090 (24 GB — too small)
-Llama 3.1 405B

Two RTX 3090s give 48 GB combined and can run Llama 3.1 70B Q4_K_M using llama.cpp tensor splitting. The result is ~5 tok/s — significantly slower than Mac mini M4 Pro due to PCIe inter-GPU bandwidth limits. This setup makes sense if you already own two RTX 3090s; buying new, Mac mini M4 Pro 48GB is faster and cheaper.

Llama 3.1 vs 3.2 vs 3.3: Which Should You Use?

Three Llama 3 releases are still in active use in 2026. They cover different sizes and purposes, and only Llama 3.3 70B is a direct replacement (for Llama 3.1 70B).

	Llama 3.1	Llama 3.2	Llama 3.3
Release date	July 2024	September 2024	December 2024
Available sizes	8B, 70B, 405B	1B, 3B, 11B, 90B	70B only
Vision models	No	Yes (11B, 90B)	No
8B VRAM (Q4)	~4.7 GB	N/A (3B: ~2.2 GB)	N/A
70B VRAM (Q4)	~42 GB	N/A	~43 GB
Context window	128K	128K	128K
Best 70B choice	Skip — use 3.3	N/A	Yes — use this
Best 8B choice	Yes, solid choice	N/A (use 3B for tiny VRAM)	N/A

Use Llama 3.1 8B when:

+You want a solid text model on 6-12 GB VRAM
+You need 128K context without a vision requirement
+Running on CPU with 16 GB+ RAM

Use Llama 3.2 when:

+You need vision / image understanding (11B or 90B)
+You have very tight VRAM (1B at 1.3 GB, 3B at 2.2 GB)
+Running on edge devices or CPU with minimal RAM

Use Llama 3.3 70B when:

+You have 48 GB VRAM and want the best text quality
+You were considering Llama 3.1 405B: 3.3 70B gets close on many benchmarks at a fraction of the VRAM
+You want the highest-quality locally-runnable Llama model

Running Llama 3.1 with Ollama

Llama 3.1 8B (default)

ollama run llama3.1

Pulls ~4.7 GB (Q4_K_M default). Runs on any GPU with 6 GB+ VRAM or CPU with 16 GB+ RAM. The go-to command for general chat, coding help, and writing on 8-24 GB GPUs.

Llama 3.1 8B Q8

ollama run llama3.1:8b-instruct-q8_0

Pulls ~8.5 GB. Full quality weights. Requires a 10-12 GB GPU (RTX 3080 10GB or RTX 3060 12GB). Noticeably sharper on nuanced tasks than Q4, at a modest speed cost (~25 vs ~30 tok/s on an RTX 4070).

Llama 3.1 70B

ollama run llama3.1:70b

Pulls ~42 GB. Requires 48 GB unified or discrete VRAM. Mac mini M4 Pro 48GB is the recommended hardware. Note: consider llama3.3:70b instead — same VRAM, better quality.

Llama 3.1 405B

ollama run llama3.1:405b

Pulls ~243 GB. Not practical for consumer hardware — requires enterprise GPU clusters. If you want the best locally-runnable Llama, use Llama 3.3 70B instead.

Tip: Use Llama 3.3 for the 70B slot

If you have hardware capable of running Llama 3.1 70B, run ollama run llama3.3:70b instead. Llama 3.3 70B needs about the same memory (~43 GB at Q4_K_M) and delivers better benchmark results than Llama 3.1 70B on instruction following and reasoning tasks.

For a full local setup guide including model management and Open WebUI, see how to run Llama locally.
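
To check the tokens-per-second figures on your own hardware, you can call Ollama's local HTTP API and read the timing fields it returns. The sketch below assumes a default Ollama install listening on localhost:11434 with the model already pulled. The helper names and the 8192-token num_ctx value are illustrative choices, not recommendations from this guide.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(model: str, prompt: str, num_ctx: int = 8192) -> dict:
    """Request body for a single non-streaming generation."""
    return {"model": model, "prompt": prompt, "stream": False,
            "options": {"num_ctx": num_ctx}}

def generate(model: str, prompt: str, num_ctx: int = 8192) -> dict:
    """Send the request to a running Ollama server and return the parsed JSON."""
    data = json.dumps(build_payload(model, prompt, num_ctx)).encode("utf-8")
    req = urllib.request.Request(OLLAMA_URL, data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def tokens_per_second(result: dict) -> float:
    """Generation speed; Ollama reports eval_duration in nanoseconds."""
    return result["eval_count"] / result["eval_duration"] * 1e9

# Usage (requires a running Ollama server):
# result = generate("llama3.1", "Explain KV caching in two sentences.")
# print(f"{tokens_per_second(result):.1f} tok/s")
```

Raising num_ctx increases the memory reserved for context, so it is worth measuring speed at the context length you actually plan to use.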

Related Guides

Llama 3.2 Hardware Requirements

Vision models (11B, 90B) and compact models (1B, 3B) — what hardware each needs

How to Run Llama Locally

Step-by-step Ollama and llama.cpp setup for all Llama models

How to Run a 70B Model Locally

Complete guide to 70B inference on Mac mini M4 Pro and dual-GPU setups

Best LLMs for 8GB VRAM

Every model that fits on an RTX 4060 — top picks with benchmarks

Best LLMs for 48GB VRAM

Mac mini M4 Pro and A6000 recommendations including Llama 3.1 70B

Mac mini M4 Pro vs RTX 4090

Why 48GB unified memory beats a faster GPU for 70B models

Frequently Asked Questions

How much VRAM does Llama 3.1 8B need?

Llama 3.1 8B needs ~4.7 GB VRAM at Q4_K_M quantization. Any GPU with 6 GB+ VRAM works, and the RTX 3060 12GB is the comfortable minimum. You can also run it on CPU with 16 GB+ system RAM at ~3-5 tok/s. At Q8 full quality, it needs ~8.5 GB, so a 10-12 GB GPU is required.

Can I run Llama 3.1 70B on a single consumer GPU?

No single consumer GPU has enough VRAM. The 70B at Q4_K_M needs ~42 GB. Your options are: a Mac mini M4 Pro 48GB, an NVIDIA A6000 48GB, or dual RTX 3090s (2x24GB = 48GB) using llama.cpp tensor split. The dual-GPU split works but runs slower (~5 tok/s) due to PCIe bandwidth.

What is the difference between Llama 3.1 and Llama 3.2?

Llama 3.1 and 3.2 share the same 128K context window but cover different sizes. Llama 3.1 is text-only and ships in 8B, 70B, and 405B sizes. Llama 3.2 has no 8B model: it ships compact 1B and 3B text models plus vision models (11B Vision, 90B Vision) that can analyze images. The Llama 3.1 70B and 405B sizes have no text-only Llama 3.2 equivalent.

Should I use Llama 3.1 70B or Llama 3.3 70B?

Use Llama 3.3 70B. It is the improved version and outperforms Llama 3.1 70B on most benchmarks with essentially the same VRAM requirement (~43 GB at Q4_K_M). Llama 3.3 70B also comes close to Llama 3.1 405B on many tasks. There is no reason to choose 3.1 70B over 3.3 70B if your hardware can run either.

Can I run Llama 3.1 405B locally?

Only in enterprise or heavily quantized setups. At Q4, the 405B needs ~243 GB of VRAM, requiring 8x A100 80GB or similar. The most aggressive Q2 quantization reduces it to ~120 GB but quality degrades significantly. For consumer hardware, Llama 3.3 70B offers comparable performance at a fraction of the cost and VRAM.

How fast is Llama 3.1 8B on an RTX 4070?

Llama 3.1 8B Q4_K_M runs at approximately 30 tok/s on an RTX 4070. On an RTX 4090 you get ~40 tok/s. The RTX 3060 12GB delivers around 15 tok/s. At 30 tok/s, text appears faster than most people read, so chat feels real-time.

Sources & methodology

VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide I leaned on:

Hugging Face Hub. Meta's official Llama 3.1 model card (8B, 70B, 405B parameter and context numbers).
Modal: How much VRAM do I need for LLM inference. VRAM formula used for the per-quant memory tables on this page.
XiongjieDai GPU-Benchmarks-on-LLM-Inference. Independent llama-bench runs for Llama 3.1 quants across consumer GPUs.

Spot a number that does not match the linked source? Email me and I will update the guide.