System Requirements for Running LLMs Locally: CPU, RAM, PSU, Storage (2026)

Editorial: AI built the first system-requirements skeleton. The CPU, RAM, storage, and PSU notes that most guides skip were added from hands-on builds.

GPU VRAM is the bottleneck — but here is what else you need for a build that actually works.

Quick answer: the minimum and recommended specs

| Component | Minimum | Recommended |
| --- | --- | --- |
| System RAM | 16 GB | 32 GB |
| CPU | Core i5 / Ryzen 5 | Any modern 6-core+ |
| PSU | 550W (RTX 4060) | GPU tier + 150W buffer |
| Storage | 50 GB free SSD | 500 GB+ NVMe |

GPU VRAM is still the primary constraint — use the VRAM Calculator to check model fit. This guide covers everything else.

System RAM Requirements

When a model fits fully in GPU VRAM, the GPU does all the work and system RAM is barely touched. You still need enough RAM that the OS and other applications are not pushed into swap while the model loads: 16 GB is the floor, 32 GB is comfortable for any GPU tier.

The situation changes when you want to run a model that is too large for your GPU. llama.cpp and Ollama both support CPU offloading, where model layers that do not fit in VRAM are offloaded to system RAM. The math is straightforward:

CPU offload RAM requirement

RAM needed = model size − GPU VRAM

Example: 70B Q4 model = 42 GB. RTX 4090 VRAM = 24 GB. You need at least 18 GB free system RAM for offload, plus OS overhead — so 32 GB total system RAM minimum, 64 GB comfortable.

Example: 32B Q4 model = 20 GB. RTX 4070 Ti Super VRAM = 16 GB. You need 4 GB free for offload — 16 GB system RAM is technically sufficient, but 32 GB leaves breathing room.
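
The subtraction above can be sketched in a few lines of Python. The function names and the `os_overhead_gb` parameter are illustrative assumptions, not part of llama.cpp or Ollama, and the 8 GB default for OS overhead is a placeholder rather than a measured figure.

```python
def offload_ram_gb(model_gb: float, vram_gb: float) -> float:
    """RAM needed for offloaded layers: model size minus GPU VRAM, never negative."""
    return max(0.0, model_gb - vram_gb)


def total_ram_gb(model_gb: float, vram_gb: float, os_overhead_gb: float = 8.0) -> float:
    """Offload RAM plus an assumed allowance for the OS and other applications."""
    return offload_ram_gb(model_gb, vram_gb) + os_overhead_gb


if __name__ == "__main__":
    # The two worked examples from the text
    print(offload_ram_gb(42, 24))  # 70B Q4 on an RTX 4090
    print(offload_ram_gb(20, 16))  # 32B Q4 on an RTX 4070 Ti Super
```

A model that already fits in VRAM returns zero, which is the case where system RAM stops being a sizing concern.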

CPU offload is slow — expect 1 to 5 tokens per second. It is useful occasionally, but not for daily interactive use. If you find yourself offloading regularly, the right fix is more VRAM, not more RAM.

Apple Silicon: unified memory serves as both system RAM and GPU VRAM. A 24 GB M4 Mac Mini makes most of that 24 GB available to the GPU, so 14B models at Q8 or lower run with no offloading. This is the key architectural difference between Mac and PC.

16 GB RAM

Minimum. Fine for GPU-only inference on any GPU. Tight if you want any CPU offloading.

32 GB RAM

Recommended. Comfortable for all GPU tiers. Supports modest CPU offloading for models slightly over VRAM.

64 GB RAM

For heavy users. Enables meaningful offloading of 70B models on RTX 4090. Required for CPU-only 70B inference.

CPU Requirements

GPU inference (model fits in VRAM)

The GPU does the matrix math. The CPU only handles model loading, tokenization, context management, and sampling — lightweight tasks that any modern CPU handles without breaking a sweat.

A Core i5-12400 or Ryzen 5 5600 is fully sufficient. You do not need a high-end CPU for GPU-accelerated LLM inference.

CPU-only inference (no GPU)

CPU specs matter a lot here. llama.cpp needs AVX2 for usable CPU speed, and nearly all desktop CPUs since about 2013 have it. AVX-512 gives a 20 to 30% throughput boost on Intel chips that support it. More cores directly improve speed; 8+ cores are recommended.

Typical speeds: Ryzen 9 9950X (16-core) reaches 8 to 12 t/s on Qwen3 8B Q4. Core i9-14900K reaches 6 to 10 t/s on the same model.

CPU instruction set support

| Instruction Set | Required For | Speed Benefit | Who Has It |
| --- | --- | --- | --- |
| AVX2 | llama.cpp CPU inference | Baseline | All CPUs since ~2013 |
| AVX-512 | Faster CPU inference | +20–30% on Intel | Intel Ice Lake+ (some SKUs), Xeon |
| NEON / AMX | Apple Silicon | Built into M-series | All Apple M1–M4 chips |
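
To check which of these instruction sets a given x86 Linux machine reports, a minimal sketch is to read the `flags` lines in `/proc/cpuinfo`. The helper name `simd_support` is made up for this example. It does not cover Windows, macOS, or ARM Linux, where the feature list is exposed differently.

```python
from pathlib import Path


def simd_support(cpuinfo_text: str) -> dict:
    """Report AVX2 / AVX-512 support from /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        # x86 Linux lists CPU features on lines that start with "flags"
        if line.startswith("flags") and ":" in line:
            flags.update(line.split(":", 1)[1].split())
    return {
        "avx2": "avx2" in flags,
        # avx512f is the AVX-512 foundation flag; other avx512* flags build on it
        "avx512": "avx512f" in flags,
    }


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print(simd_support(cpuinfo.read_text()))
    else:
        print("No /proc/cpuinfo here; this sketch only covers Linux.")
```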

PSU Requirements

LLM inference is a sustained full-GPU-load workload. Unlike gaming, which has natural pauses between frames, inference runs the GPU at near 100% for minutes or hours continuously. This means your PSU is under constant rated load, not peak-burst load. Size accordingly.

The formula: GPU TDP + CPU TDP + 100W overhead. Use a unit rated 80 Plus Gold or better; these units run more efficiently at sustained load and produce less heat.

| GPU | GPU TDP | Minimum PSU | Recommended PSU |
| --- | --- | --- | --- |
| RTX 4060 / RTX 5060 Ti | 115–180W | 550W | 650W |
| RTX 4070 / RTX 5070 | 200–250W | 650W | 750W |
| RTX 4080 / RTX 5080 | 320–360W | 750W | 850W |
| RTX 4090 / RTX 5090 | 450–575W | 850W | 1000W |
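
As a rough illustration of the sizing formula, the sketch below adds the three terms. The function name `psu_floor_watts` and the 125W CPU TDP in the example are assumptions for illustration. Substitute the TDP figures for your own parts.

```python
def psu_floor_watts(gpu_tdp_w: int, cpu_tdp_w: int, overhead_w: int = 100) -> int:
    """Sustained-load estimate: GPU TDP + CPU TDP + fixed overhead."""
    return gpu_tdp_w + cpu_tdp_w + overhead_w


if __name__ == "__main__":
    # RTX 4090 at 450W TDP (from the table) with an assumed 125W CPU
    print(psu_floor_watts(450, 125))
```

The result is a floor, not a shopping target. The minimum and recommended wattages in the table sit above it, which leaves the unit running below its rated limit during long inference sessions.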

Mac Mini / Mac Studio

Apple Silicon machines have a built-in power supply — no PSU choice required. The M4 Mac Mini draws around 30W at idle and up to 120W under sustained LLM inference load. Mac Studio M4 Max peaks around 180W. Power efficiency is one of the strongest Mac arguments for LLM workloads.

Storage Requirements

Model files are large. A 7B Q4 model is about 4.5 GB; a 14B Q4 is 9 GB; a 32B Q4 is 20 GB; a 70B Q4 is 42 GB. You will want several models on disk at once — plan for at least 50 GB free, ideally 500 GB or more.

More importantly, storage speed directly determines how long you wait staring at a loading screen before the model is ready. NVMe SSD is not optional at the high end.

| Storage Type | Sequential Read | 7B Model Load | 70B Model Load | Verdict |
| --- | --- | --- | --- | --- |
| HDD | 100–150 MB/s | 30–120 s | 5–10 min | Avoid |
| SATA SSD | 500–550 MB/s | 5–15 s | 60–90 s | Acceptable |
| NVMe SSD (Gen 3) | 3,000–3,500 MB/s | 2–5 s | 15–25 s | Recommended |
| NVMe SSD (Gen 4/5) | 5,000–14,000 MB/s | 1–2 s | 8–12 s | Best |
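
A best-case load time can be estimated by dividing model size by sequential read speed. The sketch below does only that. The function name `load_time_seconds` is illustrative, and the calculation ignores everything except the raw read, so treat its output as a lower bound rather than a prediction.

```python
def load_time_seconds(model_gb: float, read_mb_per_s: float) -> float:
    """Best-case load time: model size divided by sequential read speed."""
    # 1 GB is treated as 1000 MB, matching how drive speeds are usually quoted
    return model_gb * 1000 / read_mb_per_s


if __name__ == "__main__":
    # 42 GB is the 70B Q4 size used throughout this guide
    for label, speed in [("HDD", 150), ("SATA SSD", 550), ("NVMe Gen 3", 3500)]:
        print(f"{label}: 70B Q4 in about {load_time_seconds(42, speed):.0f} s")
```

The ranges in the table are generally longer than this raw division suggests, presumably because a real load involves more than the sequential read alone.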

Minimum storage: 50 GB free

Enough for 3 to 4 small to mid-size models. A 7B, 14B, and 32B Q4 model together take about 34 GB. Leave 15+ GB for model cache, context files, and future downloads.

Recommended: 500 GB NVMe

Comfortable for a full model library — several 7B, 14B, and 32B variants plus one or two 70B models. Gen 4 NVMe is available at commodity prices and loads even a 70B model in under 15 seconds.

Cooling

Most GPUs are tested and specified for gaming workloads, which spike to full power but average much lower. LLM inference is different: it holds the GPU at or near 100% for as long as you are generating tokens. A 10-minute conversation could mean 10 minutes of sustained maximum GPU load.

GPU temperatures should stay below 85°C for sustained inference. Above 90°C the GPU may throttle, reducing token generation speed. Above 95°C some GPUs shut down as a protection measure.

Case airflow

At least two intake fans and one exhaust. Positive pressure (more intake than exhaust) keeps dust out. Avoid cases whose intakes are blocked by restrictive mesh or solid front panels.

GPU temperature target

Below 85°C during sustained inference. Check with GPU-Z (Windows) or nvidia-smi -l 1 (Linux/terminal). Throttling starts around 90°C on most cards.
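
For a scripted version of the nvidia-smi check, the sketch below queries the temperature once and labels it using the thresholds from this section. The function names and band labels are made up for this example. It assumes an NVIDIA driver with `nvidia-smi` on the PATH and falls back to a message if the tool is missing or fails.

```python
import subprocess


def parse_temps(smi_output: str) -> list[int]:
    """Parse one temperature per line, as printed by the query below."""
    return [int(line) for line in smi_output.splitlines() if line.strip()]


def classify(temp_c: int) -> str:
    """Map a temperature to the bands described in this guide."""
    if temp_c > 95:
        return "shutdown risk"
    if temp_c > 90:
        return "throttle risk"
    if temp_c >= 85:
        return "warm"
    return "ok"


def read_gpu_temps() -> list[int]:
    """Query current GPU temperatures in degrees Celsius via nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temps(result.stdout)


if __name__ == "__main__":
    try:
        for index, temp in enumerate(read_gpu_temps()):
            print(f"GPU {index}: {temp} C ({classify(temp)})")
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi is not available on this machine.")
```

Running this in a loop during a long generation gives a quick read on whether the card settles below the 85°C target.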

Thermal paste

New GPUs ship with adequate paste. If you have a used card or are seeing 95°C+, reapplying quality paste can drop temps by 10-15°C.

Apple Silicon machines are passively or near-passively cooled and handle sustained inference loads without thermal issues — another practical Mac advantage for 24/7 inference use.

Complete Build Specs by Tier

| Tier | GPU | VRAM | System RAM | CPU | PSU | Storage | Can run |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Budget | RTX 4060 (8 GB) | 8 GB | 16 GB | 4-core (Core i5 / Ryzen 5) | 650W | 256 GB SSD | Qwen3 4B–8B fully in VRAM |
| Mid-Range | RTX 4070 / RTX 5070 (12–16 GB) | 12–16 GB | 32 GB | 6-core (Core i5 / Ryzen 5) | 750W | 500 GB NVMe | Qwen3 14B at Q4, 8B at Q8 |
| Pro | RTX 4090 (24 GB) | 24 GB | 32–64 GB | 8-core (Core i7 / Ryzen 7) | 1000W | 1 TB NVMe | Qwen3 32B at Q4, 14B at Q8 |
| Mac | M4 Mac Mini 24 GB or M4 Pro 48 GB | 24–48 GB unified | Unified (same pool) | Built-in (no choice) | Built-in | 512 GB–1 TB internal | 24 GB: up to 32B Q4; 48 GB: 70B Q4 |
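
The "Can run" entries follow from comparing Q4 file sizes against available VRAM. A minimal sketch of that comparison is below, using the approximate sizes quoted in the Storage Requirements section. The function name `largest_q4_fit` and the 1 GB `headroom_gb` default are assumptions for illustration. Real headroom depends on context length and runtime, so use the VRAM Calculator for an actual fit check.

```python
# Approximate Q4 file sizes in GB, taken from the Storage Requirements section
Q4_SIZES_GB = {"7B": 4.5, "14B": 9.0, "32B": 20.0, "70B": 42.0}


def largest_q4_fit(vram_gb: float, headroom_gb: float = 1.0):
    """Return the largest listed Q4 model whose file fits in VRAM, or None."""
    # headroom_gb is an assumed allowance for context (KV cache) and buffers
    budget = vram_gb - headroom_gb
    fitting = [name for name, size in Q4_SIZES_GB.items() if size <= budget]
    # The dict is ordered smallest to largest, so the last fit is the biggest
    return fitting[-1] if fitting else None


if __name__ == "__main__":
    for vram in (8, 12, 24, 48):
        print(f"{vram} GB: {largest_q4_fit(vram)}")
```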

Frequently Asked Questions

How much system RAM do I need to run LLMs locally?

Minimum 16 GB, but 32 GB is recommended for comfortable use. When a model fits fully in GPU VRAM, system RAM is not the bottleneck; you just need enough that the OS and other applications are not pushed into swap. If you want to offload layers of a model that is too large for your GPU, you need system RAM equal to the model size minus your GPU VRAM. Apple Silicon uses unified memory for both RAM and GPU, so more memory directly expands which models you can run.

Does CPU matter for running LLMs?

For GPU inference, not much. Any modern mid-range CPU (Core i5 / Ryzen 5 or better) handles model loading, tokenization, and sampling without becoming a bottleneck. CPU matters a lot for CPU-only inference: AVX2 support is required for llama.cpp, AVX-512 gives a 20-30% speedup on Intel chips, and more cores help (8+ recommended). A Ryzen 9 9950X reaches roughly 8-12 t/s on Qwen3 8B Q4.

What PSU wattage do I need for an LLM build?

LLM inference is a sustained full-load workload, unlike gaming. Size your PSU to GPU TDP plus CPU TDP plus 100W headroom: RTX 4060/5060 Ti needs at least 550W, RTX 4070/5070 needs 650W, RTX 4080/5080 needs 750W, and RTX 4090/5090 needs at least 850W (1000W recommended). Undersized PSUs cause random shutdowns under sustained inference load.

Do I need an SSD or NVMe for running LLMs?

NVMe SSD is strongly recommended. An HDD loads a 7B model in 30-120 seconds, a SATA SSD in 5-15 seconds, and a modern NVMe SSD in 2-5 seconds. For large models (32B, 70B) this gap is even more painful. Plan for at least 50 GB free storage to keep 3-4 models on disk; 500 GB NVMe is the comfortable recommendation.

Is CPU offloading good enough for daily LLM use?

CPU offloading is slow — typically 1-5 tokens per second — and is not recommended for daily interactive use. It is useful for occasionally running a model that is slightly too large for your GPU, but the experience is noticeably worse than a model that fits fully in VRAM. For daily use, choose a model that fits in your GPU VRAM at your preferred quantization.

Related Guides

Not sure which model fits your hardware? Use the VRAM Calculator or browse all hardware.

Sources & methodology

VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide I leaned on:

Spot a number that does not match the linked source? Email [email protected] and I will update the guide.