What Hardware Do You Need for Microsoft Phi-4?

AI drafted the Phi-4 size matrix from Microsoft's model cards. Every VRAM number here was reconciled with the methodology page formula and the cited llama-bench runs.

Updated May 2026 · Phi-4 14B & Phi-4-mini 3.8B · VRAM requirements · Consumer GPU guide · Ollama setup

Phi-4, released by Microsoft in December 2024, is a 14B model that outperforms most 70B models on reasoning and coding benchmarks — while fitting in just 9 GB of VRAM at Q4_K_M. An 8 GB GPU can run it with slight offloading. For a comfortable fit, you need a 12 GB GPU: the Intel Arc B580 12GB is the sweet spot. Phi-4-mini (3.8B) runs on anything — even CPU-only.

What is Phi-4?

Phi-4 is Microsoft's flagship small language model, trained on high-quality synthetic "textbook" data rather than raw web text. This gives it exceptional reasoning and coding performance far beyond what its 14B parameter count would suggest.

Phi-4 VRAM Requirements by Model Size

VRAM is estimated using ceil(params × bytes_per_param) + 1.5 GB overhead. Q4_K_M uses ~0.5 bytes/param, Q8 uses ~1.0 bytes/param, FP16 uses ~2.0 bytes/param. Add extra headroom for KV cache when using long context windows.
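The formula above can be written as a few lines of Python. This is a minimal sketch: the function and constant names are hypothetical, and the bytes-per-parameter values are the approximations quoted above. For Phi-4 14B at Q4_K_M it returns 8.5 GB, which rounds to the ~9 GB used on this page. The Q8 and FP16 figures in the table appear to sit closer to the weights-only number, so treat the overhead term as a rough allowance rather than a fixed cost.

```python
import math

# Approximate bytes per parameter for each format, as quoted in the text.
BYTES_PER_PARAM = {"Q4_K_M": 0.5, "Q8": 1.0, "FP16": 2.0}
OVERHEAD_GB = 1.5  # fixed overhead term from the formula


def estimate_vram_gb(params_billions: float, quant: str) -> float:
    """Return ceil(params x bytes_per_param) + 1.5 GB for one model/format pair."""
    weights_gb = math.ceil(params_billions * BYTES_PER_PARAM[quant])
    return weights_gb + OVERHEAD_GB


for name, params in [("Phi-4-mini", 3.8), ("Phi-4", 14.0)]:
    for quant in BYTES_PER_PARAM:
        print(f"{name} {quant}: ~{estimate_vram_gb(params, quant):.1f} GB")
```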

| Model | Params | Q4_K_M VRAM | Q8 VRAM | FP16 VRAM | Min GPU |
| --- | --- | --- | --- | --- | --- |
| Phi-4-mini | 3.8B | ~2.5 GB | ~4 GB | ~8 GB | Any GPU |
| Phi-4 | 14B | ~9 GB | ~14 GB | ~28 GB | 8 GB GPU (tight) |

Phi-4 14B at Q4_K_M needs ~9 GB — 1 GB over 8GB GPUs. Use the VRAM Calculator for context-length-adjusted estimates and KV cache projections.
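To see why long contexts need extra headroom, the KV cache size can be approximated as 2 (keys and values) × layers × KV heads × head dimension × bytes per value, per token. The sketch below assumes Phi-4 14B uses 40 layers, 10 KV heads, and a head dimension of 128 with an FP16 cache. Those values are assumptions for illustration, so check them against the model's config.json before relying on the output. The function name is hypothetical.

```python
def kv_cache_gib(context_tokens: int,
                 n_layers: int = 40,       # assumed Phi-4 14B layer count
                 n_kv_heads: int = 10,     # assumed number of KV heads
                 head_dim: int = 128,      # assumed per-head dimension
                 bytes_per_value: int = 2  # FP16 cache entries
                 ) -> float:
    """Size of the key + value cache in GiB for a given context length."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 1024**3


for ctx in (4096, 8192, 16384):
    print(f"{ctx:>6} tokens: ~{kv_cache_gib(ctx):.2f} GiB KV cache")
```

Under these assumptions a 16K-token context adds roughly 3 GiB on top of the weights, which is about the headroom a 12 GB card has left at Q4_K_M.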

Why Phi-4 Punches Above Its Weight

Most 14B models land firmly behind 70B models on hard benchmarks. Phi-4 is the exception. Here is why it performs so far above its size:

Synthetic textbook data

Microsoft trained Phi-4 on carefully curated synthetic data modeled on textbooks and structured problem sets — not raw web crawls. This means fewer low-quality examples and more reasoning signal per token.

Benchmark results

Phi-4 14B scores higher than Llama 3.1 70B on HumanEval (coding) and MATH benchmarks. It matches or exceeds Llama 3.3 70B on many reasoning tasks — at one-fifth of the VRAM cost.

Coding and math focus

If your use case is code generation, debugging, mathematical reasoning, or structured logic tasks, Phi-4 is one of the best models available regardless of size class.

Trade-off: multilingual

Phi-4 is primarily English-focused. Multilingual performance is noticeably weaker than Qwen3 14B. For multilingual use cases, Qwen3 is the better pick at the same VRAM tier.

What Phi-4 Can You Run on Your GPU?

Find your GPU or Mac below. Each card shows which Phi-4 variants fit, and what does not.

RTX 4060 8GB

Runs:

  • +Phi-4-mini (all quants, plenty of headroom)
  • +Phi-4 14B (Q4_K_M with slight offloading — ~1 GB overflow to RAM)

Does not fit:

  • -Phi-4 14B Q8 (needs ~14 GB)
  • -Phi-4 14B FP16 (needs ~28 GB)

Phi-4 14B at Q4_K_M overflows the 8 GB limit by ~1 GB. Ollama and llama.cpp handle this via CPU offloading — the model runs but is slower (expect 5–8 tok/s instead of 12–15). Phi-4-mini runs at full speed with room to spare.

Intel Arc B580 12GB

Runs:

  • +Phi-4-mini (all quants)
  • +Phi-4 14B (Q4_K_M, ~3 GB headroom — comfortable)

Does not fit:

  • -Phi-4 14B Q8 (needs ~14 GB)
  • -Phi-4 14B FP16 (needs ~28 GB)

Best value for Phi-4 14B. 12 GB gives ~3 GB headroom at Q4_K_M, enough for solid context lengths. Verify that your Ollama build supports Intel Arc before purchasing: Intel GPUs rely on oneAPI/SYCL or Vulkan backends, not ROCm (which is AMD-only). The B580 typically undercuts NVIDIA cards in the same VRAM tier.

RTX 4070 12GB

Runs:

  • +Phi-4-mini (all quants)
  • +Phi-4 14B (Q4_K_M, ~3 GB headroom)

Does not fit:

  • -Phi-4 14B Q8 (needs ~14 GB)
  • -Phi-4 14B FP16 (needs ~28 GB)

Same 12 GB VRAM as the Arc B580, but ~504 GB/s bandwidth means faster generation (~30 tok/s on Phi-4 Q4). NVIDIA ecosystem advantage: CUDA works out of the box, so the Intel backend compatibility caveats do not apply. On price alone, the Arc B580 is still the better value for Phi-4.

RTX 4060 Ti 16GB

Runs:

  • +Phi-4-mini (all quants)
  • +Phi-4 14B (Q4_K_M, ~7 GB headroom)
  • +Phi-4 14B (Q8, ~2 GB headroom — fits)

Does not fit:

  • -Phi-4 14B FP16 (needs ~28 GB)

Excellent GPU for Phi-4. 16 GB means Phi-4 14B at Q8 (~14 GB) fits with ~2 GB of headroom — near-full-quality inference. At Q4 there is 7 GB of headroom for long context windows. The best mid-range single-GPU Phi-4 setup.

RTX 4070 Ti Super 16GB

Runs:

  • +Phi-4-mini (all quants)
  • +Phi-4 14B (Q4_K_M, ~7 GB headroom)
  • +Phi-4 14B (Q8, ~14 GB, ~2 GB headroom)

Does not fit:

  • -Phi-4 14B FP16 (needs ~28 GB)

Same VRAM ceiling as the 4060 Ti 16GB but about 2.3x the memory bandwidth (~672 GB/s). Phi-4 14B at Q4 generates ~37 tok/s, more than twice as fast as the 4060 Ti. Pay for speed, not extra VRAM capacity, at this tier.

RTX 4080 16GB

Runs:

  • +Phi-4-mini (all quants)
  • +Phi-4 14B (Q4_K_M with ~7 GB headroom; Q8 with ~2 GB headroom)

Does not fit:

  • -Phi-4 14B FP16 (needs ~28 GB)

~720 GB/s bandwidth makes Phi-4 14B generation fast (~39 tok/s at Q4). Same 16 GB VRAM cap as the 4060 Ti, so the same model range. The premium is for throughput, not capacity. For Phi-4 alone, the 4060 Ti 16GB is better value.

RTX 3090 24GB (used)

Runs:

  • +Phi-4-mini (all quants) and Phi-4 14B at Q4 and Q8 (comfortable)
  • +Phi-4 14B Q8 (~14 GB, ~10 GB headroom)

Does not fit:

  • -Phi-4 14B FP16 (needs ~28 GB, exceeds 24 GB)

24 GB comfortably fits Phi-4 14B at Q8 with generous headroom for long contexts. It is an older Ampere card (PCIe 4.0, ~936 GB/s bandwidth), but it offers excellent VRAM-per-dollar on the used market for Phi-4 Q8 inference.

RTX 4090 24GB

Runs:

  • +All Phi-4 variants at Q4 and Q8 (comfortable)
  • +Phi-4 14B Q8 (~14 GB, ~10 GB headroom)

Does not fit:

  • -Phi-4 14B FP16 (needs ~28 GB, exceeds 24 GB)

Best single consumer GPU for Phi-4. 1,008 GB/s bandwidth gives ~55 tok/s at Q4, which makes for fast, responsive inference. 10 GB of headroom above Phi-4 Q8 allows very long context windows. Does not fit FP16.

RTX 5090 32GB

Runs:

  • +All Phi-4 variants at Q4, Q8, and FP16
  • +Phi-4 14B FP16 (~28 GB, ~4 GB headroom)

Does not fit:

  • -Nothing in the Phi-4 family is out of reach

The only consumer GPU that fits Phi-4 14B at FP16. At 1,792 GB/s it is the fastest option for Phi-4 inference. Overkill for a 14B model — a 4060 Ti 16GB runs Phi-4 Q8 at near-identical quality for a fraction of the cost.

Mac mini M4 16GB

Runs:

  • +Phi-4-mini (all quants)
  • +Phi-4 14B (Q4_K_M, ~9 GB — fits with ~7 GB headroom)

Does not fit:

  • -Phi-4 14B Q8 (~14 GB loads, but leaves only ~2 GB of the 16 GB total; not recommended)
  • -Phi-4 14B FP16 (needs ~28 GB)

Solid Phi-4 Mac setup. Unified memory means the GPU shares the full 16 GB pool with macOS. Phi-4 14B at Q4 leaves ~7 GB for the OS and KV cache. Q8 at ~14 GB technically loads but is snug; keep context short if running Q8.

Mac mini M4 24GB

Runs:

  • +All Phi-4 variants at Q4 and Q8 (comfortable)
  • +Phi-4 14B Q8 (~14 GB, ~10 GB headroom)

Does not fit:

  • -Phi-4 14B FP16 (needs ~28 GB, exceeds 24 GB)

The sweet spot for Phi-4 on Mac. 10 GB of headroom above Phi-4 Q8 supports long context windows. Silent, efficient, and handles every practical Phi-4 workload. FP16 does not fit.

Mac mini M4 Pro 48GB

Runs:

  • +All Phi-4 variants at all quants including FP16
  • +Phi-4 14B FP16 (~28 GB, ~20 GB headroom)

Does not fit:

  • -Nothing — all Phi-4 variants fit comfortably

48 GB comfortably fits Phi-4 14B at FP16 with generous headroom. ~273 GB/s bandwidth is slower than discrete GPUs but the silence, power efficiency, and model range are compelling. Significant overkill for Phi-4 alone.
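The fit checks in the cards above reduce to subtracting each variant's requirement from the available memory. The sketch below copies the approximate figures from the size table on this page. The function names are hypothetical, and KV cache growth at long contexts is not modeled.

```python
# Approximate VRAM needs in GB, copied from the size table on this page.
REQUIREMENTS_GB = {
    ("Phi-4-mini", "Q4_K_M"): 2.5,
    ("Phi-4-mini", "Q8"): 4.0,
    ("Phi-4-mini", "FP16"): 8.0,
    ("Phi-4 14B", "Q4_K_M"): 9.0,
    ("Phi-4 14B", "Q8"): 14.0,
    ("Phi-4 14B", "FP16"): 28.0,
}


def headroom_by_variant(memory_gb: float) -> dict:
    """Map each (model, format) pair to leftover GB; negative means overflow."""
    return {variant: memory_gb - need for variant, need in REQUIREMENTS_GB.items()}


def fits(memory_gb: float) -> list:
    """List the variants that load fully into the given amount of memory."""
    return [v for v, spare in headroom_by_variant(memory_gb).items() if spare >= 0]


for memory_gb in (8, 12, 16, 24, 32):
    names = ", ".join(f"{model} {quant}" for model, quant in fits(memory_gb))
    print(f"{memory_gb} GB: {names}")
```

A negative headroom value, such as the -1 GB for Phi-4 14B Q4_K_M on an 8 GB card, corresponds to the layers that spill over to system RAM.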

Inference Speed by Hardware

Token generation speed is bottlenecked by memory bandwidth. The table below shows estimated Q4_K_M token speeds at low batch size. Real-world results vary by driver version, context length, and system load.

| Hardware | Bandwidth | Phi-4-mini tok/s | Phi-4 14B Q4 tok/s | Phi-4 14B Q8 tok/s |
| --- | --- | --- | --- | --- |
| RTX 5090 32GB | 1,792 GB/s | ~400 | ~95 | ~60 |
| RTX 4090 24GB | 1,008 GB/s | ~225 | ~55 | ~34 |
| RTX 4070 Ti Super 16GB | 672 GB/s | ~150 | ~37 | ~22 |
| RTX 4080 16GB | 720 GB/s | ~160 | ~39 | ~24 |
| RTX 4060 Ti 16GB | 288 GB/s | ~64 | ~16 | ~10 |
| Intel Arc B580 12GB | 456 GB/s | ~102 | ~25 | — |
| RTX 4060 8GB | 272 GB/s | ~61 | ~7* | — |
| Mac mini M4 Pro 48GB | ~273 GB/s | ~61 | ~15 | ~9 |
| Mac mini M4 24GB | ~120 GB/s | ~27 | ~7 | ~4 |
| Mac mini M4 16GB | ~120 GB/s | ~27 | ~7 | ~3 |

Speed estimates start from tokens/sec ≈ bandwidth (GB/s) / model size in memory (GB), which is a theoretical ceiling; the table figures come out at roughly half of that ceiling. * RTX 4060 Phi-4 Q4 speed is reduced due to CPU offloading. Dash (—) means the model does not fit at that VRAM tier without offloading.
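The bandwidth-bound estimate can be sketched as follows. The `efficiency` default of 0.5 is an assumption, chosen only because it lands close to the table's figures. It is not a measured constant, and the function name is hypothetical.

```python
def tokens_per_second(bandwidth_gb_s: float, model_gb: float,
                      efficiency: float = 0.5) -> float:
    """Bandwidth-bound decode speed: each token reads the whole model once."""
    return efficiency * bandwidth_gb_s / model_gb


# Bandwidth figures from the table above; 9 GB is Phi-4 14B at Q4_K_M.
for gpu, bandwidth in [("RTX 4090", 1008), ("RTX 4060 Ti 16GB", 288),
                       ("Mac mini M4 Pro", 273)]:
    print(f"{gpu}: ~{tokens_per_second(bandwidth, 9.0):.0f} tok/s")
```

Setting `efficiency=1.0` gives the theoretical ceiling; real results depend on drivers, context length, and system load.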

How to Run Phi-4 Locally

Ollama

ollama run phi4

Easiest option. For Phi-4 14B: ollama run phi4. For Phi-4-mini: ollama run phi4-mini. GPU is auto-detected on NVIDIA, AMD (ROCm), and Apple Silicon. Ollama defaults to Q4_K_M. On an 8GB GPU it will automatically offload overflow layers to CPU RAM.
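Once the model is pulled, it can also be called from a script through Ollama's local REST API. The sketch below uses only the Python standard library and assumes the server is listening on its default address, http://localhost:11434. The helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_payload(prompt: str, model: str = "phi4") -> bytes:
    """Encode a non-streaming generate request for the given model tag."""
    body = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(body).encode("utf-8")


def generate(prompt: str, model: str = "phi4") -> str:
    """Send the prompt to a running Ollama server and return the reply text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `generate("Write a binary search in Python.")` requires the Ollama server to be running with the model already downloaded. Pass `model="phi4-mini"` to target the smaller variant.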

LM Studio

Search "Phi-4" in Discover

GUI-based model browser and chat interface. Search for "Phi-4" in the Discover tab to find Microsoft's official GGUF variants. Lets you select Q4, Q8, or other quantizations from a dropdown. Best for non-technical users on Windows, Mac, or Linux.

Hugging Face + llama.cpp

microsoft/phi-4

Download GGUF files from microsoft/phi-4 or community repos (bartowski, unsloth) on Hugging Face. Run with llama.cpp for maximum control over quantization, GPU layer count, and context length. Use --n-gpu-layers to tune how many layers load onto VRAM.
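A starting value for --n-gpu-layers can be estimated by dividing the available VRAM by the average size of one layer. The sketch below assumes Phi-4 14B has 40 transformer layers of roughly equal size and uses this page's ~9 GB Q4_K_M footprint. Both are simplifying assumptions, and the function name is hypothetical, so treat the result as a first guess to refine by watching actual VRAM usage.

```python
import math


def gpu_layers_that_fit(vram_gb: float, model_gb: float,
                        n_layers: int = 40,      # assumed Phi-4 14B layer count
                        reserve_gb: float = 0.0  # extra room to hold back for context
                        ) -> int:
    """Estimate a --n-gpu-layers value, treating all layers as equal in size."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


print(gpu_layers_that_fit(8, 9.0))   # 8 GB card, partial offload
print(gpu_layers_that_fit(12, 9.0))  # 12 GB card, every layer fits
```

On an 8 GB card this suggests about 35 of the assumed 40 layers on the GPU. Raising `reserve_gb` leaves more room for the KV cache at longer contexts.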

For step-by-step installation instructions, see the how to run LLMs locally guide. For a comparison of Ollama vs LM Studio, see the Ollama vs LM Studio guide.

Running Phi-4 CPU-Only

No GPU? Phi-4-mini (3.8B) is one of the best CPU-only models available. At Q4_K_M it uses just ~3 GB of RAM and generates 4–8 tokens per second on a modern CPU — fast enough for practical use.

System RAM needed

Phi-4-mini at Q4_K_M: ~3 GB RAM. Phi-4 14B at Q4_K_M: ~9 GB RAM. For CPU inference, ensure your total system RAM exceeds the model size by at least 2–4 GB to leave room for the OS and context cache.
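The RAM rule above is simple enough to check in code. This sketch uses the conservative 4 GB end of the 2–4 GB margin by default; the function names are hypothetical.

```python
def ram_needed_gb(model_gb: float, margin_gb: float = 4.0) -> float:
    """Model footprint plus a margin for the OS and context cache."""
    return model_gb + margin_gb


def fits_in_ram(total_ram_gb: float, model_gb: float, margin_gb: float = 4.0) -> bool:
    """True when total system RAM covers the model plus the margin."""
    return total_ram_gb >= ram_needed_gb(model_gb, margin_gb)


print(fits_in_ram(8, 3.0))    # Phi-4-mini Q4_K_M on an 8 GB machine
print(fits_in_ram(8, 9.0))    # Phi-4 14B Q4_K_M on an 8 GB machine
print(fits_in_ram(16, 9.0))   # Phi-4 14B Q4_K_M on a 16 GB machine
```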

Speed on CPU

Phi-4-mini: 4–8 tok/s on a modern CPU (Ryzen 7, Core i7). Phi-4 14B on CPU: 1–3 tok/s — usable but slow. For 14B CPU inference, consider 32+ GB RAM and a fast CPU with many cores. Phi-4-mini is the practical CPU choice.

How to run CPU-only

With Ollama: ollama run phi4-mini — it auto-detects if no GPU is available and falls back to CPU. With llama.cpp: use --n-gpu-layers 0 to force CPU-only mode. LM Studio has a CPU mode toggle in settings.

For a detailed guide to CPU-only inference, see the CPU-only LLM inference guide.

Phi-4 14B vs Qwen3 14B vs Gemma 3 12B

All three target the same ~8–12 GB VRAM tier. Here is how they compare for local inference:

| | Phi-4 14B | Qwen3 14B | Gemma 3 12B |
| --- | --- | --- | --- |
| VRAM at Q4 | ~9 GB | ~9 GB | ~8 GB |
| Thinking mode | No | Yes (built-in) | No |
| Coding | Excellent | Very good | Good |
| Reasoning/math | Excellent | Very good | Good |
| Multilingual | Poor | Excellent | Good |
| Creative writing | Average | Good | Good |
| Best for | Coding/math | General + multilingual | Balanced use |

Choose Phi-4 if...

  • +Your primary use case is coding or math
  • +You want the best reasoning-per-VRAM ratio
  • +English-only tasks are your focus
  • +You prefer Microsoft's training approach

Choose Qwen3 14B if...

  • +You need built-in chain-of-thought reasoning
  • +Multilingual tasks are important
  • +You want a general-purpose model
  • +You want thinking mode without extra setup

Choose Gemma 3 12B if...

  • +You want Google-backed architecture
  • +You need balanced multilingual + reasoning
  • +Your GPU has exactly 8 GB (at ~8 GB, the 12B model is a closer fit than a 14B at ~9 GB)
  • +You prioritize model diversity across sizes (1B–27B)

Which Hardware Should You Buy for Phi-4?

Entry tier

RTX 4060 8GB

Runs Phi-4 14B at Q4_K_M with ~1 GB of CPU offloading: functional, but much slower than a 12 GB GPU (~7 tok/s versus ~25 tok/s on the Arc B580). Phi-4-mini runs at full speed. Good entry point if you already own this card.

12 GB — best value

Intel Arc B580 12GB

The sweet spot for Phi-4 14B. 12 GB gives ~3 GB headroom at Q4 — no offloading, full GPU speed. Cheaper than any NVIDIA 12 GB card. Verify Ollama compatibility before buying.

Mid tier

RTX 4060 Ti 16GB

Runs Phi-4 14B at Q8 with 2 GB headroom — near-full-quality inference. The best mid-range NVIDIA option for Phi-4. Also future-proofs you for larger models up to ~14 GB.

Used 24 GB

RTX 3090 24GB

Phi-4 14B at Q8 with ~10 GB headroom for generous context lengths. Excellent VRAM-per-dollar on the used market. Older architecture but more than fast enough for Phi-4.

High end

RTX 4090 24GB

Best single consumer GPU for Phi-4. Phi-4 14B at Q8 leaves 10 GB headroom and generates ~34 tok/s. Significant overkill for a 14B model but the overall best local inference setup.

Mac ecosystem

Mac mini M4 16GB

Phi-4 14B at Q4 fits with ~7 GB headroom on unified memory. Silent, efficient, and reliable. For Q8 quality, step up to the Mac mini M4 24GB. Phi-4-mini runs on any Mac.

For a full cross-budget GPU comparison, see the best GPU for LLMs guide.

Frequently Asked Questions

Is Phi-4 better than Llama 3.1 8B?

Yes, on most benchmarks. Phi-4 (14B) significantly outperforms Llama 3.1 8B on reasoning, math, and coding tasks — and matches or beats many 70B models. Phi-4's "textbook data" training gives it exceptional reasoning per parameter. The trade-off is higher VRAM (~9 GB vs ~5 GB at Q4_K_M), but the quality jump is substantial.

Can an 8GB GPU run Phi-4?

Yes, with slight CPU offloading. Phi-4 at Q4_K_M requires ~9 GB, about 1 GB over the limit of an RTX 4060 8GB. Ollama and llama.cpp handle the overflow automatically by keeping some layers in system RAM. The model runs, but much slower than on a 12 GB GPU (~7 tok/s versus ~25 tok/s on the Arc B580). For a comfortable fit, the Intel Arc B580 12GB is the best-value option.

How does Phi-4 compare to Qwen3 14B?

Both need ~9 GB VRAM at Q4_K_M. Phi-4 leads on coding, math, and logical reasoning. Qwen3 14B has built-in thinking mode and is much better at multilingual tasks. For English coding and math work, Phi-4 is the better choice. For general-purpose or multilingual use, Qwen3 14B has the edge.

Can I run Phi-4 on a Mac?

Yes. Phi-4 runs well on Apple Silicon via Ollama or LM Studio. The Mac mini M4 16GB fits Phi-4 14B at Q4_K_M with ~7 GB of headroom. The Mac mini M4 24GB adds Phi-4 Q8 support with ~10 GB of headroom. Phi-4-mini runs on any Mac, including base M-series machines with 8 GB.

How much VRAM does Phi-4 need?

Phi-4 (14B) at Q4_K_M requires approximately 9 GB of VRAM. At Q8 it needs approximately 14 GB. At FP16 it needs approximately 28 GB. Phi-4-mini (3.8B) at Q4_K_M requires only about 2.5 GB — it fits in any GPU or runs CPU-only. For the 14B model, 12 GB is the comfortable minimum; 8 GB works with offloading.

What is Phi-4-mini and how does it differ from Phi-4?

Phi-4-mini is a 3.8B parameter model that runs in just 2.5 GB of VRAM at Q4_K_M. It is designed for edge devices and CPU-only inference. While smaller than Phi-4 14B, it still outperforms many older 7B models on coding and reasoning. Use Phi-4-mini for constrained hardware or CPU-only setups. Use Phi-4 14B when you have 12 GB+ VRAM and want maximum quality.

Check VRAM requirements for Phi-4, or compare hardware options.

Sources & methodology

VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide I leaned on:

Spot a number that does not match the linked source? Email [email protected] and I will update the guide.