System Requirements for Running LLMs Locally: CPU, RAM, PSU, Storage (2026)
Editorial: AI built the first system-requirements skeleton. The CPU, RAM, storage, and PSU notes that most guides skip were added from hand-on builds.
GPU VRAM is the bottleneck — but here is what else you need for a build that actually works.
Quick answer: the minimum and recommended specs
System RAM
Min: 16 GB
Rec: 32 GB
CPU
Min: Core i5 / Ryzen 5
Rec: Any modern 6-core+
PSU
Min: 550W (RTX 4060)
Rec: GPU tier + 150W buffer
Storage
Min: 50 GB free SSD
Rec: 500 GB+ NVMe
GPU VRAM is still the primary constraint — use the VRAM Calculator to check model fit. This guide covers everything else.
System RAM Requirements
When a model fits fully in GPU VRAM, the GPU does all the work and system RAM is barely touched. You still need enough RAM that the OS and other applications do not starve the GPU of PCIe bandwidth — 16 GB is the floor, 32 GB is comfortable for any GPU tier.
The situation changes when you want to run a model that is too large for your GPU. llama.cpp and Ollama both support CPU offloading, where model layers that do not fit in VRAM are offloaded to system RAM. The math is straightforward:
CPU offload RAM requirement
RAM needed = model size − GPU VRAM
Example: 70B Q4 model = 42 GB. RTX 4090 VRAM = 24 GB. You need at least 18 GB free system RAM for offload, plus OS overhead — so 32 GB total system RAM minimum, 64 GB comfortable.
Example: 32B Q4 model = 20 GB. RTX 4070 Ti Super VRAM = 16 GB. You need 4 GB free for offload — 16 GB system RAM is technically sufficient, but 32 GB leaves breathing room.
CPU offload is slow — expect 1 to 5 tokens per second. It is useful occasionally, but not for daily interactive use. If you find yourself offloading regularly, the right fix is more VRAM, not more RAM.
Apple Silicon: unified memory serves as both system RAM and GPU VRAM. A 24 GB M4 Mac Mini has 24 GB available to the GPU — no offloading needed for 14B models at any quantization. This is the key architectural difference between Mac and PC.
16 GB RAM
Minimum. Fine for GPU-only inference on any GPU. Tight if you want any CPU offloading.
32 GB RAM
Recommended. Comfortable for all GPU tiers. Supports modest CPU offloading for models slightly over VRAM.
64 GB RAM
For heavy users. Enables meaningful offloading of 70B models on RTX 4090. Required for CPU-only 70B inference.
CPU Requirements
GPU inference (model fits in VRAM)
The GPU does the matrix math. The CPU only handles model loading, tokenization, context management, and sampling — lightweight tasks that any modern CPU handles without breaking a sweat.
A Core i5-12400 or Ryzen 5 5600 is fully sufficient. You do not need a high-end CPU for GPU-accelerated LLM inference.
CPU-only inference (no GPU)
CPU specs matter a lot here. llama.cpp requires AVX2 support (all CPUs since 2013 have it). AVX-512 on Intel gives a 20 to 30% throughput boost. More cores directly improve speed — 8+ cores recommended.
Typical speeds: Ryzen 9 9950X (16-core) reaches 8 to 12 t/s on Qwen3 8B Q4. Core i9-14900K reaches 6 to 10 t/s on the same model.
CPU instruction set support
| Instruction Set | Required For | Speed Benefit | Who Has It |
|---|---|---|---|
| AVX2 | llama.cpp CPU inference | Baseline | All CPUs since ~2013 |
| AVX-512 | Faster CPU inference | +20–30% on Intel | Intel Ice Lake+ (some SKUs), Xeon |
| NEON / AMX | Apple Silicon | Built into M-series | All Apple M1–M4 chips |
PSU Requirements
LLM inference is a sustained full-GPU-load workload. Unlike gaming, which has natural pauses between frames, inference runs the GPU at near 100% for minutes or hours continuously. This means your PSU is under constant rated load, not peak-burst load. Size accordingly.
The formula: GPU TDP + CPU TDP + 100W overhead. Use a unit rated 80+ Gold or better — they run more efficiently at sustained load and produce less heat.
| GPU | GPU TDP | Minimum PSU | Recommended PSU |
|---|---|---|---|
| RTX 4060 / RTX 5060 Ti | 115–180W | 550W | 650W |
| RTX 4070 / RTX 5070 | 200–250W | 650W | 750W |
| RTX 4080 / RTX 5080 | 320–360W | 750W | 850W |
| RTX 4090 / RTX 5090 | 450–575W | 850W | 1000W |
Mac Mini / Mac Studio
Apple Silicon machines have a built-in power supply — no PSU choice required. The M4 Mac Mini draws around 30W at idle and up to 120W under sustained LLM inference load. Mac Studio M4 Max peaks around 180W. Power efficiency is one of the strongest Mac arguments for LLM workloads.
Storage Requirements
Model files are large. A 7B Q4 model is about 4.5 GB; a 14B Q4 is 9 GB; a 32B Q4 is 20 GB; a 70B Q4 is 42 GB. You will want several models on disk at once — plan for at least 50 GB free, ideally 500 GB or more.
More importantly, storage speed directly determines how long you wait staring at a loading screen before the model is ready. NVMe SSD is not optional at the high end.
| Storage Type | Sequential Read | 7B Model Load | 70B Model Load | Verdict |
|---|---|---|---|---|
| HDD | 100–150 MB/s | 30–120 s | 5–10 min | Avoid |
| SATA SSD | 500–550 MB/s | 5–15 s | 60–90 s | Acceptable |
| NVMe SSD (Gen 3) | 3,000–3,500 MB/s | 2–5 s | 15–25 s | Recommended |
| NVMe SSD (Gen 4/5) | 5,000–14,000 MB/s | 1–2 s | 8–12 s | Best |
Minimum storage: 50 GB free
Enough for 3 to 4 small to mid-size models. A 7B, 14B, and 32B Q4 model together take about 34 GB. Leave 15+ GB for model cache, context files, and future downloads.
Recommended: 500 GB NVMe
Comfortable for a full model library — several 7B, 14B, and 32B variants plus one or two 70B models. Gen 4 NVMe is available at commodity prices and loads even a 70B model in under 15 seconds.
Cooling
Most GPUs are tested and specified for gaming workloads, which spike to full power but average much lower. LLM inference is different: it holds the GPU at or near 100% for as long as you are generating tokens. A 10-minute conversation could mean 10 minutes of sustained maximum GPU load.
GPU temperatures should stay below 85°C for sustained inference. Above 90°C the GPU may throttle, reducing token generation speed. Above 95°C some GPUs shut down as a protection measure.
Case airflow
At least two intake fans and one exhaust. Positive pressure (more intake than exhaust) keeps dust out. Avoid cases with mesh-blocked intakes.
GPU temperature target
Below 85°C during sustained inference. Check with GPU-Z (Windows) or nvidia-smi -l 1 (Linux/terminal). Throttling starts around 90°C on most cards.
Thermal paste
New GPUs ship with adequate paste. If you have a used card or are seeing 95°C+, reapplying quality paste can drop temps by 10-15°C.
Apple Silicon machines are passively or near-passively cooled and handle sustained inference loads without thermal issues — another practical Mac advantage for 24/7 inference use.
Complete Build Specs by Tier
Buy on AmazonBudget
GPU
RTX 4060 (8 GB)
VRAM
8 GB
System RAM
16 GB
CPU
4-core (Core i5 / Ryzen 5)
PSU
650W
Storage
256 GB SSD
Can run: Qwen3 4B–7B fully in VRAM
Mid-Range
GPU
RTX 4070 / RTX 5070 (12–16 GB)
VRAM
12–16 GB
System RAM
32 GB
CPU
6-core (Core i5 / Ryzen 5)
PSU
750W
Storage
500 GB NVMe
Can run: Qwen3 14B at Q4, 7B at Q8
Pro
GPU
RTX 4090 (24 GB)
VRAM
24 GB
System RAM
32–64 GB
CPU
8-core (Core i7 / Ryzen 7)
PSU
1000W
Storage
1 TB NVMe
Can run: Qwen3 32B at Q4, 14B at Q8
Mac
GPU
M4 Mac Mini 24 GB or M4 Pro 48 GB
VRAM
24–48 GB unified
System RAM
Unified (same pool)
CPU
Built-in (no choice)
PSU
Built-in
Storage
512 GB–1 TB internal
Can run: 24 GB: up to 32B Q4 | 48 GB: 70B Q4
Frequently Asked Questions
How much system RAM do I need to run LLMs locally?
Minimum 16 GB, but 32 GB is recommended for comfortable use. When a model fits fully in GPU VRAM, system RAM is not the bottleneck — you just need enough that the OS does not starve the GPU. If you want to offload layers of a model that is too large for your GPU, you need system RAM equal to the model size minus your GPU VRAM. Apple Silicon uses unified memory for both RAM and GPU, so more memory directly expands which models you can run.
Does CPU matter for running LLMs?
For GPU inference, not much. Any modern mid-range CPU (Core i5 / Ryzen 5 or better) handles model loading, tokenization, and sampling without becoming a bottleneck. CPU matters a lot for CPU-only inference: AVX2 support is required for llama.cpp, AVX-512 gives a 20-30% speedup on Intel chips, and more cores help (8+ recommended). A Ryzen 9 9950X reaches roughly 8-12 t/s on Qwen3 8B Q4.
What PSU wattage do I need for an LLM build?
LLM inference is a sustained full-load workload, unlike gaming. Size your PSU to GPU TDP plus CPU TDP plus 100W headroom: RTX 4060/5060 Ti needs at least 550W, RTX 4070/5070 needs 650W, RTX 4080/5080 needs 750W, and RTX 4090/5090 needs at least 850W (1000W recommended). Undersized PSUs cause random shutdowns under sustained inference load.
Do I need an SSD or NVMe for running LLMs?
NVMe SSD is strongly recommended. An HDD loads a 7B model in 30-120 seconds, a SATA SSD in 5-15 seconds, and a modern NVMe SSD in 2-5 seconds. For large models (32B, 70B) this gap is even more painful. Plan for at least 50 GB free storage to keep 3-4 models on disk; 500 GB NVMe is the comfortable recommendation.
Is CPU offloading good enough for daily LLM use?
CPU offloading is slow — typically 1-5 tokens per second — and is not recommended for daily interactive use. It is useful for occasionally running a model that is slightly too large for your GPU, but the experience is noticeably worse than a model that fits fully in VRAM. For daily use, choose a model that fits in your GPU VRAM at your preferred quantization.
Related Guides
Best PC Build for AI and LLMs
Three complete build tiers with full part lists
Best GPU for LLMs — Full Guide
All budget tiers ranked by VRAM value
CPU-Only LLM Inference
Running LLMs without a GPU — speeds, model limits, and best CPUs
What Can I Run?
Model size limits for every GPU — from 8 GB to 48 GB VRAM
LLM Quantization Explained
Q4_K_M vs Q8 vs FP16 — when quality tradeoffs matter
Not sure which model fits your hardware? Use the VRAM Calculator or browse all hardware.
Sources & methodology
VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide I leaned on:
- Modal: How much VRAM do I need for LLM inference. The VRAM-needed-for-inference formula the requirement tables here lean on.
- Home GPU LLM Leaderboard. VRAM tiers (8/12/16/24/48 GB) mapped to what each tier can host.
- Hardware Corner GPU ranking. Tokens per second by GPU class, used for the 'is it fast enough?' guidance.
Spot a number that does not match the linked source? Email [email protected] and I will update the guide.