What Hardware Do You Need for Microsoft Phi-4?
AI drafted the Phi-4 size matrix from Microsoft's model cards. Every VRAM number here was reconciled with the methodology page formula and the cited llama-bench runs.
Updated May 2026 · Phi-4 14B & Phi-4-mini 3.8B · VRAM requirements · Consumer GPU guide · Ollama setup
Phi-4, released by Microsoft in December 2024, is a 14B model that outperforms most 70B models on reasoning and coding benchmarks — while fitting in just 9 GB of VRAM at Q4_K_M. An 8 GB GPU can run it with slight offloading. For a comfortable fit, you need a 12 GB GPU: the Intel Arc B580 12GB is the sweet spot. Phi-4-mini (3.8B) runs on anything — even CPU-only.
What is Phi-4?
Phi-4 is Microsoft's flagship small language model, trained on high-quality synthetic "textbook" data rather than raw web text. This gives it exceptional reasoning and coding performance far beyond what its 14B parameter count would suggest.
- Two sizes Phi-4 (14B) and Phi-4-mini (3.8B) — both instruction-tuned and available on Ollama and Hugging Face.
- Benchmark leader Phi-4 14B matches or beats many 70B models on MMLU, HumanEval, and GSM8K — the best reasoning-per-VRAM ratio in its class.
- Trade-off Weaker multilingual and creative writing compared to Qwen3 14B. Phi-4's edge is specifically on logic, math, and code.
- HuggingFace IDs microsoft/phi-4 (14B), microsoft/Phi-4-mini-instruct (3.8B)
Phi-4 VRAM Requirements by Model Size
Buy on AmazonVRAM is estimated using ceil(params × bytes_per_param) + 1.5 GB overhead. Q4_K_M uses ~0.5 bytes/param, Q8 uses ~1.0 bytes/param, FP16 uses ~2.0 bytes/param. Add extra headroom for KV cache when using long context windows.
| Model | Params | Q4_K_M VRAM | Q8 VRAM | FP16 VRAM | Min GPU |
|---|---|---|---|---|---|
| Phi-4-mini | 3.8B | ~2.5 GB | ~4 GB | ~8 GB | Any GPU |
| Phi-4 | 14B | ~9 GB | ~14 GB | ~28 GB | 8 GB GPU (tight) |
Phi-4 14B at Q4_K_M needs ~9 GB — 1 GB over 8GB GPUs. Use the VRAM Calculator for context-length-adjusted estimates and KV cache projections.
Why Phi-4 Punches Above Its Weight
Most 14B models land firmly behind 70B models on hard benchmarks. Phi-4 is the exception. Here is why it performs so far above its size:
Synthetic textbook data
Microsoft trained Phi-4 on carefully curated synthetic data modeled on textbooks and structured problem sets — not raw web crawls. This means fewer low-quality examples and more reasoning signal per token.
Benchmark results
Phi-4 14B scores higher than Llama 3.1 70B on HumanEval (coding) and MATH benchmarks. It matches or exceeds Llama 3.3 70B on many reasoning tasks — at one-fifth of the VRAM cost.
Coding and math focus
If your use case is code generation, debugging, mathematical reasoning, or structured logic tasks, Phi-4 is one of the best models available regardless of size class.
Trade-off: multilingual
Phi-4 is primarily English-focused. Multilingual performance is noticeably weaker than Qwen3 14B. For multilingual use cases, Qwen3 is the better pick at the same VRAM tier.
What Phi-4 Can You Run on Your GPU?
Find your GPU or Mac below. Each card shows which Phi-4 variants fit, and what does not.
RTX 4060 8GB
Runs:
- +Phi-4-mini (all quants, plenty of headroom)
- +Phi-4 14B (Q4_K_M with slight offloading — ~1 GB overflow to RAM)
Does not fit:
- -Phi-4 14B Q8 (needs ~14 GB)
- -Phi-4 14B FP16 (needs ~28 GB)
Phi-4 14B at Q4_K_M overflows the 8 GB limit by ~1 GB. Ollama and llama.cpp handle this via CPU offloading — the model runs but is slower (expect 5–8 tok/s instead of 12–15). Phi-4-mini runs at full speed with room to spare.
Intel Arc B580 12GB
Runs:
- +Phi-4-mini (all quants)
- +Phi-4 14B (Q4_K_M, ~3 GB headroom — comfortable)
Does not fit:
- -Phi-4 14B Q8 (needs ~14 GB)
- -Phi-4 14B FP16 (needs ~28 GB)
Best value for Phi-4 14B. 12 GB gives ~3 GB headroom at Q4_K_M — enough for solid context lengths. Verify Ollama oneAPI/ROCm compatibility before purchasing. It typically undercuts NVIDIA cards in the same VRAM tier.
RTX 4070 12GB
Runs:
- +Phi-4-mini (all quants)
- +Phi-4 14B (Q4_K_M, ~3 GB headroom)
Does not fit:
- -Phi-4 14B Q8 (needs ~14 GB)
- -Phi-4 14B FP16 (needs ~28 GB)
Same 12 GB VRAM as the Arc B580 but ~504 GB/s bandwidth means faster generation (~30 tok/s on Phi-4 Q4). NVIDIA ecosystem advantage: ROCm issues are irrelevant here. For Phi-4 specifically the Arc B580 is better value.
RTX 4060 Ti 16GB
Runs:
- +Phi-4-mini (all quants)
- +Phi-4 14B (Q4_K_M, ~7 GB headroom)
- +Phi-4 14B (Q8, ~2 GB headroom — fits)
Does not fit:
- -Phi-4 14B FP16 (needs ~28 GB)
Excellent GPU for Phi-4. 16 GB means Phi-4 14B at Q8 (~14 GB) fits with ~2 GB of headroom — near-full-quality inference. At Q4 there is 7 GB of headroom for long context windows. The best mid-range single-GPU Phi-4 setup.
RTX 4070 Ti Super 16GB
Runs:
- +Phi-4-mini (all quants)
- +Phi-4 14B (Q4 and Q8, comfortable)
- +Phi-4 14B Q8 (~14 GB, ~2 GB headroom)
Does not fit:
- -Phi-4 14B FP16 (needs ~28 GB)
Same VRAM ceiling as the 4060 Ti 16GB but 2.3x faster bandwidth (~672 GB/s). Phi-4 14B at Q4 generates ~45 tok/s — noticeably faster than the 4060 Ti. Pay for speed, not extra VRAM capacity, at this tier.
RTX 4080 16GB
Runs:
- +Phi-4-mini (all quants)
- +Phi-4 14B (Q4 and Q8, comfortable)
Does not fit:
- -Phi-4 14B FP16 (needs ~28 GB)
720 GB/s bandwidth makes Phi-4 14B generation fast (~50 tok/s at Q4). Same 16 GB VRAM cap as the 4060 Ti — same model range. The premium is for throughput, not capacity. For Phi-4 alone, the 4060 Ti 16GB is better value.
RTX 3090 24GB (used)
Runs:
- +Phi-4-mini and Phi-4 14B at all quants (comfortable)
- +Phi-4 14B Q8 (~14 GB, ~10 GB headroom)
Does not fit:
- -Phi-4 14B FP16 (needs ~28 GB, exceeds 24 GB)
24 GB comfortably fits Phi-4 14B at Q8 with generous headroom for long contexts. Older PCIe 3.0 and ~936 GB/s bandwidth, but excellent VRAM-per-dollar on the used market for Phi-4 Q8 inference.
RTX 4090 24GB
Runs:
- +All Phi-4 variants at Q4 and Q8 (comfortable)
- +Phi-4 14B Q8 (~14 GB, ~10 GB headroom)
Does not fit:
- -Phi-4 14B FP16 (needs ~28 GB, exceeds 24 GB)
Best single consumer GPU for Phi-4. 1,008 GB/s bandwidth gives ~65 tok/s at Q4 — fast, responsive inference. 10 GB of headroom above Phi-4 Q8 allows very long context windows. Does not fit FP16.
RTX 5090 32GB
Runs:
- +All Phi-4 variants at Q4, Q8, and FP16
- +Phi-4 14B FP16 (~28 GB, ~4 GB headroom)
Does not fit:
- -Nothing in the Phi-4 family is out of reach
The only consumer GPU that fits Phi-4 14B at FP16. At 1,792 GB/s it is the fastest option for Phi-4 inference. Overkill for a 14B model — a 4060 Ti 16GB runs Phi-4 Q8 at near-identical quality for a fraction of the cost.
Mac mini M4 16GB
Runs:
- +Phi-4-mini (all quants)
- +Phi-4 14B (Q4_K_M, ~9 GB — fits with ~7 GB headroom)
Does not fit:
- -Phi-4 14B Q8 (needs ~14 GB — fits but leaves only 2 GB; tight at 16 GB total)
Solid Phi-4 Mac setup. Unified memory means all 16 GB is available. Phi-4 14B at Q4 leaves ~7 GB for OS and KV cache. Q8 at ~14 GB technically fits but is snug — keep context short if running Q8.
Mac mini M4 24GB
Runs:
- +All Phi-4 variants at Q4 and Q8 (comfortable)
- +Phi-4 14B Q8 (~14 GB, ~10 GB headroom)
Does not fit:
- -Phi-4 14B FP16 (needs ~28 GB, exceeds 24 GB)
The sweet spot for Phi-4 on Mac. 10 GB of headroom above Phi-4 Q8 supports long context windows. Silent, efficient, and handles every practical Phi-4 workload. FP16 does not fit.
Mac mini M4 Pro 48GB
Runs:
- +All Phi-4 variants at all quants including FP16
- +Phi-4 14B FP16 (~28 GB, ~20 GB headroom)
Does not fit:
- -Nothing — all Phi-4 variants fit comfortably
48 GB comfortably fits Phi-4 14B at FP16 with generous headroom. ~273 GB/s bandwidth is slower than discrete GPUs but the silence, power efficiency, and model range are compelling. Significant overkill for Phi-4 alone.
Inference Speed by Hardware
Token generation speed is bottlenecked by memory bandwidth. The table below shows estimated Q4_K_M token speeds at low batch size. Real-world results vary by driver version, context length, and system load.
| Hardware | Bandwidth | Phi-4-mini tok/s | Phi-4 14B Q4 tok/s | Phi-4 14B Q8 tok/s |
|---|---|---|---|---|
| RTX 5090 32GB | 1,792 GB/s | ~400 t/s | ~95 t/s | ~60 t/s |
| RTX 4090 24GB | 1,008 GB/s | ~225 t/s | ~55 t/s | ~34 t/s |
| RTX 4070 Ti Super 16GB | 672 GB/s | ~150 t/s | ~37 t/s | ~22 t/s |
| RTX 4080 16GB | 720 GB/s | ~160 t/s | ~39 t/s | ~24 t/s |
| RTX 4060 Ti 16GB | 288 GB/s | ~64 t/s | ~16 t/s | ~10 t/s |
| Intel Arc B580 12GB | 456 GB/s | ~102 t/s | ~25 t/s | — |
| RTX 4060 8GB | 272 GB/s | ~61 t/s | ~7 t/s* | — |
| Mac mini M4 Pro 48GB | ~273 GB/s | ~61 t/s | ~15 t/s | ~9 t/s |
| Mac mini M4 24GB | ~120 GB/s | ~27 t/s | ~7 t/s | ~4 t/s |
| Mac mini M4 16GB | ~120 GB/s | ~27 t/s | ~7 t/s | ~3 t/s |
Speed estimates: tokens/sec ≈ bandwidth (GB/s) / model size in memory (GB). * RTX 4060 Phi-4 Q4 speed is reduced due to CPU offloading. Dash (—) means the model does not fit at that VRAM tier without offloading.
How to Run Phi-4 Locally
Ollama
ollama run phi4 Easiest option. For Phi-4 14B: ollama run phi4. For Phi-4-mini: ollama run phi4-mini. GPU is auto-detected on NVIDIA, AMD (ROCm), and Apple Silicon. Ollama defaults to Q4_K_M. On an 8GB GPU it will automatically offload overflow layers to CPU RAM.
LM Studio
Search "Phi-4" in Discover GUI-based model browser and chat interface. Search for "Phi-4" in the Discover tab to find Microsoft's official GGUF variants. Lets you select Q4, Q8, or other quantizations from a dropdown. Best for non-technical users on Windows, Mac, or Linux.
Hugging Face + llama.cpp
microsoft/phi-4 Download GGUF files from microsoft/phi-4 or community repos (bartowski, unsloth) on Hugging Face. Run with llama.cpp for maximum control over quantization, GPU layer count, and context length. Use --n-gpu-layers to tune how many layers load onto VRAM.
For step-by-step installation instructions, see the how to run LLMs locally guide. For a comparison of Ollama vs LM Studio, see the Ollama vs LM Studio guide.
Running Phi-4 CPU-Only
No GPU? Phi-4-mini (3.8B) is one of the best CPU-only models available. At Q4_K_M it uses just ~3 GB of RAM and generates 4–8 tokens per second on a modern CPU — fast enough for practical use.
System RAM needed
Phi-4-mini at Q4_K_M: ~3 GB RAM. Phi-4 14B at Q4_K_M: ~9 GB RAM. For CPU inference, ensure your total system RAM exceeds the model size by at least 2–4 GB to leave room for the OS and context cache.
Speed on CPU
Phi-4-mini: 4–8 tok/s on a modern CPU (Ryzen 7, Core i7). Phi-4 14B on CPU: 1–3 tok/s — usable but slow. For 14B CPU inference, consider 32+ GB RAM and a fast CPU with many cores. Phi-4-mini is the practical CPU choice.
How to run CPU-only
With Ollama: ollama run phi4-mini — it auto-detects if no GPU is available and falls back to CPU. With llama.cpp: use --n-gpu-layers 0 to force CPU-only mode. LM Studio has a CPU mode toggle in settings.
For a detailed guide to CPU-only inference, see the CPU-only LLM inference guide.
Phi-4 14B vs Qwen3 14B vs Gemma 3 12B
All three target the same ~8–12 GB VRAM tier. Here is how they compare for local inference:
| Phi-4 14B | Qwen3 14B | Gemma 3 12B | |
|---|---|---|---|
| VRAM at Q4 | ~9 GB | ~9 GB | ~8 GB |
| Thinking mode | No | Yes (built-in) | No |
| Coding | Excellent | Very good | Good |
| Reasoning/math | Excellent | Very good | Good |
| Multilingual | Poor | Excellent | Good |
| Creative writing | Average | Good | Good |
| Best for | Coding/math | General + multilingual | Balanced use |
Choose Phi-4 if...
- +Your primary use case is coding or math
- +You want the best reasoning-per-VRAM ratio
- +English-only tasks are your focus
- +You prefer Microsoft's training approach
Choose Qwen3 14B if...
- +You need built-in chain-of-thought reasoning
- +Multilingual tasks are important
- +You want a general-purpose model
- +You want thinking mode without extra setup
Choose Gemma 3 12B if...
- +You want Google-backed architecture
- +You need balanced multilingual + reasoning
- +Your GPU has exactly 8 GB (12B fits tighter)
- +You prioritize model diversity across sizes (1B–27B)
Which Hardware Should You Buy for Phi-4?
RTX 4060 8GB
Runs Phi-4 14B at Q4_K_M with ~1 GB of CPU offloading — functional but ~50% slower than a 12 GB GPU. Phi-4-mini runs at full speed. Good entry point if you already own this card.
Intel Arc B580 12GB
The sweet spot for Phi-4 14B. 12 GB gives ~3 GB headroom at Q4 — no offloading, full GPU speed. Cheaper than any NVIDIA 12 GB card. Verify Ollama compatibility before buying.
RTX 4060 Ti 16GB
Runs Phi-4 14B at Q8 with 2 GB headroom — near-full-quality inference. The best mid-range NVIDIA option for Phi-4. Also future-proofs you for larger models up to ~14 GB.
RTX 3090 24GB
Phi-4 14B at Q8 with ~10 GB headroom for generous context lengths. Excellent VRAM-per-dollar on the used market. Older architecture but more than fast enough for Phi-4.
RTX 4090 24GB
Best single consumer GPU for Phi-4. Phi-4 14B at Q8 leaves 10 GB headroom and generates ~34 tok/s. Significant overkill for a 14B model but the overall best local inference setup.
Mac mini M4 16GB
Phi-4 14B at Q4 fits with ~7 GB headroom on unified memory. Silent, efficient, and reliable. For Q8 quality, step up to the Mac mini M4 24GB. Phi-4-mini runs on any Mac.
For a full cross-budget GPU comparison, see the best GPU for LLMs guide.
Related Resources
How to Run Phi-4 Locally
Step-by-step Ollama and LM Studio setup for Phi-4 14B
What LLMs Can I Run?
Enter your GPU — see every model that fits
Best LLMs to Run Locally
Top picks across every GPU tier in 2026
Best GPU for LLMs — Full Guide
All budget tiers, entry to high-end
How to Run LLMs Locally
Step-by-step Ollama, LM Studio, llama.cpp setup
Ollama vs LM Studio
Which tool to use for running Phi-4 locally
CPU-Only LLM Inference
Run Phi-4-mini without a GPU — setup and tips
Best LLM for Coding Locally
Phi-4 14B is a top coding pick — see the full comparison by GPU tier
Frequently Asked Questions
Is Phi-4 better than Llama 3.1 8B?
Yes, on most benchmarks. Phi-4 (14B) significantly outperforms Llama 3.1 8B on reasoning, math, and coding tasks — and matches or beats many 70B models. Phi-4's "textbook data" training gives it exceptional reasoning per parameter. The trade-off is higher VRAM (~9 GB vs ~5 GB at Q4_K_M), but the quality jump is substantial.
Can an 8GB GPU run Phi-4?
Yes, with slight CPU offloading. Phi-4 at Q4_K_M requires ~9 GB — about 1 GB over the limit of an RTX 4060 8GB. Ollama and llama.cpp handle the overflow automatically by keeping some layers in system RAM. The model runs but is ~50% slower than a 12 GB GPU. For a comfortable fit, the Intel Arc B580 12GB is the best-value option.
How does Phi-4 compare to Qwen3 14B?
Both need ~9 GB VRAM at Q4_K_M. Phi-4 leads on coding, math, and logical reasoning. Qwen3 14B has built-in thinking mode and is much better at multilingual tasks. For English coding and math work, Phi-4 is the better choice. For general-purpose or multilingual use, Qwen3 14B has the edge.
Can I run Phi-4 on a Mac?
Yes. Phi-4 runs well on Apple Silicon via Ollama or LM Studio. The Mac mini M4 16GB fits Phi-4 14B at Q4_K_M with ~7 GB of headroom. The Mac mini M4 24GB adds Phi-4 Q8 support with ~10 GB of headroom. Phi-4-mini runs on any Mac, including base M-series machines with 8 GB.
How much VRAM does Phi-4 need?
Phi-4 (14B) at Q4_K_M requires approximately 9 GB of VRAM. At Q8 it needs approximately 14 GB. At FP16 it needs approximately 28 GB. Phi-4-mini (3.8B) at Q4_K_M requires only about 2.5 GB — it fits in any GPU or runs CPU-only. For the 14B model, 12 GB is the comfortable minimum; 8 GB works with offloading.
What is Phi-4-mini and how does it differ from Phi-4?
Phi-4-mini is a 3.8B parameter model that runs in just 2.5 GB of VRAM at Q4_K_M. It is designed for edge devices and CPU-only inference. While smaller than Phi-4 14B, it still outperforms many older 7B models on coding and reasoning. Use Phi-4-mini for constrained hardware or CPU-only setups. Use Phi-4 14B when you have 12 GB+ VRAM and want maximum quality.
Check VRAM requirements for Phi-4, or compare hardware options.
Related Guides
LLM RAM Requirements
How much RAM and VRAM you need for different model sizes.
LLM Quantization Guide
Use Q4, Q8, and other formats to fit larger models in less VRAM.
Best GPUs for LLMs
Top GPU picks for running local AI models in 2026.
How to Run LLMs Locally
Step-by-step guide to getting your first local model running.
Sources & methodology
VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide I leaned on:
- Hugging Face Hub. Microsoft's Phi-4 model card (14B parameters, 16k context) used for the size math.
- Modal: How much VRAM do I need for LLM inference. VRAM-budget formula applied to each Phi-4 quant in the requirements table.
- Ollama. Phi-4 GGUF quants in the Ollama library, the ones we link from the install steps.
Spot a number that does not match the linked source? Email [email protected] and I will update the guide.