What Hardware Do You Need for Microsoft Phi-4?

Q: Can an 8GB GPU run Phi-4?

Yes, with slight offloading. Phi-4 at Q4_K_M requires approximately 9 GB of VRAM — just over the 8 GB limit of cards like the RTX 4060. Tools like Ollama and llama.cpp can offload the overflow to system RAM, which allows the model to run at reduced speed. For a comfortable fit with no offloading, you need a 12 GB GPU such as the Intel Arc B580 or RTX 4070 12GB. The Arc B580 12GB is the best-value option for running Phi-4 comfortably.

Q: How does Phi-4 compare to Qwen3 14B?

Phi-4 and Qwen3 14B target the same VRAM tier (~9 GB at Q4_K_M) but have different strengths. Phi-4 leads on coding, math, and logical reasoning benchmarks — its "textbook data" training shows on structured tasks. Qwen3 14B has built-in chain-of-thought thinking mode and is significantly better at multilingual tasks. For English-only coding and math work, Phi-4 is the better pick. For general-purpose use or multilingual tasks, Qwen3 14B has the edge.

Q: Can I run Phi-4 on a Mac?

Yes. Phi-4 runs well on Apple Silicon Macs via Ollama or LM Studio. Because Mac unified memory is shared between CPU and GPU, most of the installed RAM is available to the model, with macOS keeping a share for itself. The Mac mini M4 16GB fits Phi-4 at Q4_K_M (~9 GB) with ~7 GB left over, a comfortable setup. The Mac mini M4 24GB gives ample headroom and can run Phi-4 at Q8 (~14 GB). Phi-4-mini runs on any Mac, including base 8GB M-series machines.

Q: What is Phi-4-mini and how does it differ from Phi-4?

Phi-4-mini is a 3.8B parameter model from Microsoft that runs in just 2.5 GB of VRAM at Q4_K_M. It is designed for edge devices, CPU-only inference, and low-memory GPUs. While smaller than Phi-4 (14B), Phi-4-mini still punches above its weight on coding and reasoning for a 3.8B model — it outperforms many older 7B models. Use Phi-4-mini when you have a 4 GB GPU, run CPU-only, or need fast inference on a constrained device. Use Phi-4 14B when you have a 12 GB+ GPU and want maximum quality.

AI drafted the Phi-4 size matrix from Microsoft's model cards. Every VRAM number here was reconciled with the methodology page formula and the cited llama-bench runs.

Updated May 2026 · Phi-4 14B & Phi-4-mini 3.8B · VRAM requirements · Consumer GPU guide · Ollama setup

Phi-4, released by Microsoft in December 2024, is a 14B model that matches or beats many 70B models on reasoning and coding benchmarks while fitting in roughly 9 GB of VRAM at Q4_K_M. An 8 GB GPU can run it with slight offloading. For a comfortable fit, you need a 12 GB GPU: the Intel Arc B580 12GB is the sweet spot. Phi-4-mini (3.8B) runs on almost anything, even CPU-only.

What is Phi-4?

Phi-4 is Microsoft's flagship small language model, trained on high-quality synthetic "textbook" data rather than raw web text. This gives it exceptional reasoning and coding performance far beyond what its 14B parameter count would suggest.

Two sizes: Phi-4 (14B) and Phi-4-mini (3.8B), both instruction-tuned and available on Ollama and Hugging Face.
Benchmark leader: Phi-4 14B matches or beats many 70B models on MMLU, HumanEval, and GSM8K, giving it the best reasoning-per-VRAM ratio in its class.
Trade-off: Weaker multilingual and creative writing compared to Qwen3 14B. Phi-4's edge is specifically on logic, math, and code.
Hugging Face IDs: microsoft/phi-4 (14B), microsoft/Phi-4-mini-instruct (3.8B)

Phi-4 VRAM Requirements by Model Size

VRAM is estimated using ceil(params × bytes_per_param) + 1.5 GB overhead. Q4_K_M uses ~0.5 bytes/param, Q8 uses ~1.0 bytes/param, FP16 uses ~2.0 bytes/param. Add extra headroom for KV cache when using long context windows.
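
As a rough illustration, the formula above can be written as a few lines of Python. This is a minimal sketch that takes the formula literally; the function and constant names are my own, not part of any tool. Its output will not match every rounded table figure exactly (for Phi-4 at Q4_K_M it returns 8.5 GB where the table lists ~9 GB), so treat both as estimates.

```python
import math

# Approximate bytes per parameter for each format, as quoted in the text.
BYTES_PER_PARAM = {"Q4_K_M": 0.5, "Q8": 1.0, "FP16": 2.0}
OVERHEAD_GB = 1.5  # fixed overhead term from the formula


def estimate_vram_gb(params_billions: float, quant: str) -> float:
    """ceil(params x bytes_per_param) + 1.5 GB, before any KV cache."""
    weights_gb = math.ceil(params_billions * BYTES_PER_PARAM[quant])
    return weights_gb + OVERHEAD_GB


# Print the estimate for both model sizes and all three formats.
for name, params in (("Phi-4-mini", 3.8), ("Phi-4", 14.0)):
    for quant in BYTES_PER_PARAM:
        print(f"{name} {quant}: ~{estimate_vram_gb(params, quant)} GB")
```

The result excludes KV cache, so long context windows need extra room on top.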

Model	Params	Q4_K_M VRAM	Q8 VRAM	FP16 VRAM	Min GPU
Phi-4-mini	3.8B	~2.5 GB	~4 GB	~8 GB	Any GPU
Phi-4	14B	~9 GB	~14 GB	~28 GB	8 GB GPU (tight)

Phi-4 14B at Q4_K_M needs ~9 GB — 1 GB over 8GB GPUs. Use the VRAM Calculator for context-length-adjusted estimates and KV cache projections.
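
To get a feel for how much extra memory context length adds, here is a hedged sketch of the standard KV cache size calculation. The architecture defaults (40 layers, 10 key/value heads, head dimension 128) are my assumptions for Phi-4 14B and should be checked against the model's config.json; the sketch also assumes an unquantized FP16 cache.

```python
def kv_cache_gb(
    context_tokens: int,
    n_layers: int = 40,        # assumed layer count for Phi-4 14B
    n_kv_heads: int = 10,      # assumed number of key/value heads
    head_dim: int = 128,       # assumed dimension per head
    bytes_per_value: int = 2,  # FP16 cache entries
) -> float:
    """Approximate KV cache size in GiB for a given context length."""
    # Factor 2: one key tensor and one value tensor are stored per layer.
    total_bytes = (
        2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    )
    return total_bytes / 1024**3


print(kv_cache_gb(4096))   # short chat-style context
print(kv_cache_gb(16384))  # the full 16k context listed on the model card
```

Under these assumptions the full 16k context costs about 3 GB on top of the weights, which is why a card with only ~1–2 GB of headroom calls for shorter contexts.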

Why Phi-4 Punches Above Its Weight

Most 14B models land firmly behind 70B models on hard benchmarks. Phi-4 is the exception. Here is why it performs so far above its size:

Synthetic textbook data

Microsoft trained Phi-4 on carefully curated synthetic data modeled on textbooks and structured problem sets — not raw web crawls. This means fewer low-quality examples and more reasoning signal per token.

Benchmark results

Phi-4 14B scores higher than Llama 3.1 70B on HumanEval (coding) and MATH benchmarks. It matches or exceeds Llama 3.3 70B on many reasoning tasks — at one-fifth of the VRAM cost.

Coding and math focus

If your use case is code generation, debugging, mathematical reasoning, or structured logic tasks, Phi-4 is one of the best models available regardless of size class.

Trade-off: multilingual

Phi-4 is primarily English-focused. Multilingual performance is noticeably weaker than Qwen3 14B. For multilingual use cases, Qwen3 is the better pick at the same VRAM tier.

What Phi-4 Can You Run on Your GPU?

Find your GPU or Mac below. Each card shows which Phi-4 variants fit, and what does not.

RTX 4060 8GB

Runs:

+Phi-4-mini (all quants, plenty of headroom)
+Phi-4 14B (Q4_K_M with slight offloading — ~1 GB overflow to RAM)

Does not fit:

-Phi-4 14B Q8 (needs ~14 GB)
-Phi-4 14B FP16 (needs ~28 GB)

Phi-4 14B at Q4_K_M overflows the 8 GB limit by ~1 GB. Ollama and llama.cpp handle this via CPU offloading — the model runs but is slower (expect 5–8 tok/s instead of 12–15). Phi-4-mini runs at full speed with room to spare.

Intel Arc B580 12GB

Runs:

+Phi-4-mini (all quants)
+Phi-4 14B (Q4_K_M, ~3 GB headroom — comfortable)

Does not fit:

-Phi-4 14B Q8 (needs ~14 GB)
-Phi-4 14B FP16 (needs ~28 GB)

Best value for Phi-4 14B. 12 GB gives ~3 GB headroom at Q4_K_M, enough for solid context lengths. The card typically undercuts NVIDIA cards in the same VRAM tier. Verify Ollama's Intel GPU support before purchasing: Arc cards use Intel's oneAPI/SYCL or Vulkan backends rather than CUDA or ROCm.

RTX 4070 12GB

Runs:

+Phi-4-mini (all quants)
+Phi-4 14B (Q4_K_M, ~3 GB headroom)

Does not fit:

-Phi-4 14B Q8 (needs ~14 GB)
-Phi-4 14B FP16 (needs ~28 GB)

Same 12 GB VRAM as the Arc B580, but ~504 GB/s bandwidth means faster generation (~30 tok/s on Phi-4 Q4). NVIDIA ecosystem advantage: CUDA works out of the box with Ollama and llama.cpp, so the Intel compatibility caveats do not apply. For Phi-4 specifically, the Arc B580 is better value.

RTX 4060 Ti 16GB

Runs:

+Phi-4-mini (all quants)
+Phi-4 14B (Q4_K_M, ~7 GB headroom)
+Phi-4 14B (Q8, ~2 GB headroom — fits)

Does not fit:

-Phi-4 14B FP16 (needs ~28 GB)

Excellent GPU for Phi-4. 16 GB means Phi-4 14B at Q8 (~14 GB) fits with ~2 GB of headroom — near-full-quality inference. At Q4 there is 7 GB of headroom for long context windows. The best mid-range single-GPU Phi-4 setup.

RTX 4070 Ti Super 16GB

Runs:

+Phi-4-mini (all quants)
+Phi-4 14B (Q4_K_M, ~7 GB headroom)
+Phi-4 14B (Q8, ~14 GB, ~2 GB headroom; fits)

Does not fit:

-Phi-4 14B FP16 (needs ~28 GB)

Same VRAM ceiling as the 4060 Ti 16GB but 2.3x the bandwidth (~672 GB/s). Phi-4 14B at Q4 generates ~37 tok/s, noticeably faster than the 4060 Ti's ~16 tok/s. Pay for speed, not extra VRAM capacity, at this tier.

RTX 4080 16GB

Runs:

+Phi-4-mini (all quants)
+Phi-4 14B (Q4_K_M, ~7 GB headroom)
+Phi-4 14B (Q8, ~14 GB, ~2 GB headroom; fits)

Does not fit:

-Phi-4 14B FP16 (needs ~28 GB)

720 GB/s bandwidth makes Phi-4 14B generation fast (~39 tok/s at Q4). Same 16 GB VRAM cap as the 4060 Ti, so the same model range. The premium is for throughput, not capacity. For Phi-4 alone, the 4060 Ti 16GB is better value.

RTX 3090 24GB (used)

Runs:

+Phi-4-mini (all quants) and Phi-4 14B at Q4 and Q8 (comfortable)
+Phi-4 14B Q8 (~14 GB, ~10 GB headroom)

Does not fit:

-Phi-4 14B FP16 (needs ~28 GB, exceeds 24 GB)

24 GB comfortably fits Phi-4 14B at Q8 with generous headroom for long contexts. It is an older Ampere-generation card, but ~936 GB/s of bandwidth and excellent VRAM-per-dollar on the used market make it a strong pick for Phi-4 Q8 inference.

RTX 4090 24GB

Runs:

+All Phi-4 variants at Q4 and Q8 (comfortable)
+Phi-4 14B Q8 (~14 GB, ~10 GB headroom)

Does not fit:

-Phi-4 14B FP16 (needs ~28 GB, exceeds 24 GB)

Best single consumer GPU for Phi-4. 1,008 GB/s bandwidth gives ~55 tok/s at Q4, which makes for fast, responsive inference. 10 GB of headroom above Phi-4 Q8 allows very long context windows. Does not fit FP16.

RTX 5090 32GB

Runs:

+All Phi-4 variants at Q4, Q8, and FP16
+Phi-4 14B FP16 (~28 GB, ~4 GB headroom)

Does not fit:

-Nothing in the Phi-4 family is out of reach

The only consumer GPU that fits Phi-4 14B at FP16. At 1,792 GB/s it is the fastest option for Phi-4 inference. Overkill for a 14B model — a 4060 Ti 16GB runs Phi-4 Q8 at near-identical quality for a fraction of the cost.

Mac mini M4 16GB

Runs:

+Phi-4-mini (all quants)
+Phi-4 14B (Q4_K_M, ~9 GB — fits with ~7 GB headroom)

Does not fit:

-Phi-4 14B Q8 (needs ~14 GB; it loads, but leaves only ~2 GB for macOS and the KV cache)
-Phi-4 14B FP16 (needs ~28 GB)

Solid Phi-4 Mac setup. Unified memory lets the GPU use most of the 16 GB. Phi-4 14B at Q4 leaves ~7 GB for macOS and KV cache. Q8 at ~14 GB technically loads but leaves too little for the system, so treat it as impractical; if you try it, keep context short.

Mac mini M4 24GB

Runs:

+All Phi-4 variants at Q4 and Q8 (comfortable)
+Phi-4 14B Q8 (~14 GB, ~10 GB headroom)

Does not fit:

-Phi-4 14B FP16 (needs ~28 GB, exceeds 24 GB)

The sweet spot for Phi-4 on Mac. 10 GB of headroom above Phi-4 Q8 supports long context windows. Silent, efficient, and handles every practical Phi-4 workload. FP16 does not fit.

Mac mini M4 Pro 48GB

Runs:

+All Phi-4 variants at all quants including FP16
+Phi-4 14B FP16 (~28 GB, ~20 GB headroom)

Does not fit:

-Nothing — all Phi-4 variants fit comfortably

48 GB comfortably fits Phi-4 14B at FP16 with generous headroom. ~273 GB/s bandwidth is on par with an RTX 4060 and well below higher-end discrete GPUs, but the silence, power efficiency, and model range are compelling. Significant overkill for Phi-4 alone.
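
The fit, headroom, and overflow arithmetic behind the cards above can be sketched in a few lines. This is an illustration only: the requirement figures are copied from the requirements table, while the function name and the 1 GB offloading threshold are my own assumptions rather than a rule used by Ollama or llama.cpp.

```python
# Approximate requirements (GB) for Phi-4 14B, from the requirements table.
PHI4_14B_GB = {"Q4_K_M": 9, "Q8": 14, "FP16": 28}


def classify_fit(memory_gb: float, required_gb: float,
                 max_overflow_gb: float = 1.0) -> str:
    """Label a model/GPU pairing the way the cards do.

    max_overflow_gb is an assumed cutoff: overflow up to this size is
    treated as workable via offloading, anything larger as "does not fit".
    """
    headroom = memory_gb - required_gb
    if headroom >= 0:
        return f"fits, ~{headroom:g} GB headroom"
    if -headroom <= max_overflow_gb:
        return f"runs with ~{-headroom:g} GB offloaded to system RAM"
    return f"does not fit (short by ~{-headroom:g} GB)"


for quant, need in PHI4_14B_GB.items():
    print(f"12 GB GPU, {quant}: {classify_fit(12, need)}")
```

Headroom here is raw memory minus weights; the KV cache and, on a Mac, the operating system all draw from it.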

Inference Speed by Hardware

Token generation speed is bottlenecked by memory bandwidth. The table below shows estimated Q4_K_M token speeds at low batch size. Real-world results vary by driver version, context length, and system load.

Hardware	Bandwidth	Phi-4-mini tok/s	Phi-4 14B Q4 tok/s	Phi-4 14B Q8 tok/s
RTX 5090 32GB	1,792 GB/s	~400 t/s	~95 t/s	~60 t/s
RTX 4090 24GB	1,008 GB/s	~225 t/s	~55 t/s	~34 t/s
RTX 4070 Ti Super 16GB	672 GB/s	~150 t/s	~37 t/s	~22 t/s
RTX 4080 16GB	720 GB/s	~160 t/s	~39 t/s	~24 t/s
RTX 4060 Ti 16GB	288 GB/s	~64 t/s	~16 t/s	~10 t/s
Intel Arc B580 12GB	456 GB/s	~102 t/s	~25 t/s	—
RTX 4060 8GB	272 GB/s	~61 t/s	~7 t/s*	—
Mac mini M4 Pro 48GB	~273 GB/s	~61 t/s	~15 t/s	~9 t/s
Mac mini M4 24GB	~120 GB/s	~27 t/s	~7 t/s	~4 t/s
Mac mini M4 16GB	~120 GB/s	~27 t/s	~7 t/s	~3 t/s

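Speed estimates: bandwidth (GB/s) / model size in memory (GB) gives the theoretical ceiling in tokens/sec; the table figures work out to roughly half of that ceiling, which allows for real-world overhead. * RTX 4060 Phi-4 Q4 speed is reduced due to CPU offloading. Dash (—) means the model does not fit at that VRAM tier without offloading.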
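
A minimal sketch of the bandwidth-based estimate follows. The `efficiency` parameter is a hypothetical knob of my own: 1.0 gives the raw bandwidth / size ceiling, and a value near 0.5 lands close to the table's figures. Neither value is a measured constant.

```python
def estimate_tok_per_s(bandwidth_gb_s: float, model_gb: float,
                       efficiency: float = 1.0) -> float:
    """Memory-bound decode estimate: each generated token reads the whole model once."""
    return bandwidth_gb_s / model_gb * efficiency


# RTX 4090 (1,008 GB/s) with Phi-4 14B at Q4_K_M (~9 GB in memory)
print(estimate_tok_per_s(1008, 9))                  # theoretical ceiling
print(estimate_tok_per_s(1008, 9, efficiency=0.5))  # closer to the table
```

The estimate only applies when the whole model sits in GPU memory; once layers spill to system RAM, the slower link dominates.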

How to Run Phi-4 Locally

Ollama

ollama run phi4

Easiest option. For Phi-4 14B: ollama run phi4. For Phi-4-mini: ollama run phi4-mini. GPU is auto-detected on NVIDIA, AMD (ROCm), and Apple Silicon. Ollama defaults to Q4_K_M. On an 8GB GPU it will automatically offload overflow layers to CPU RAM.
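
Once the model is pulled, Ollama also serves a local HTTP API, which is handy for scripting. The sketch below uses only the Python standard library and assumes a default install listening on localhost port 11434; the helper names are mine.

```python
import json
import urllib.request

# Default local endpoint of a stock Ollama install (assumed, adjust if changed).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "phi4") -> urllib.request.Request:
    """Build a non-streaming generate request for the given model tag."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "phi4", timeout: float = 300) -> str:
    """Send the prompt to the local Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example (requires a running Ollama server with the model pulled):
# print(generate("Write a Python function that reverses a string."))
```

Pass model="phi4-mini" to target the smaller model instead.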

LM Studio

Search "Phi-4" in Discover

GUI-based model browser and chat interface. Search for "Phi-4" in the Discover tab to find Microsoft's official GGUF variants. Lets you select Q4, Q8, or other quantizations from a dropdown. Best for non-technical users on Windows, Mac, or Linux.

Hugging Face + llama.cpp

microsoft/phi-4

Download GGUF files from Hugging Face, either Microsoft's own GGUF release or community repos (bartowski, unsloth); the microsoft/phi-4 repository itself hosts the original full-precision weights. Run with llama.cpp for maximum control over quantization, GPU layer count, and context length. Use --n-gpu-layers to tune how many layers load onto VRAM.
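
Choosing a value for --n-gpu-layers on a card that is slightly too small is mostly trial and error, but a first guess can be computed. The sketch below is a heuristic under loud assumptions: it treats all layers as equal in size, assumes 40 layers for Phi-4 14B (check the GGUF metadata), reserves 1 GB for KV cache and buffers, and uses a hypothetical file name. It only builds the argument list for llama.cpp's llama-cli; it does not launch anything.

```python
import math


def gpu_layers_that_fit(vram_gb: float, model_gb: float,
                        total_layers: int = 40, reserve_gb: float = 1.0) -> int:
    """First-guess layer count: share of the model that fits in usable VRAM."""
    usable_gb = max(vram_gb - reserve_gb, 0.0)  # keep room for KV cache/buffers
    fraction = min(usable_gb / model_gb, 1.0)   # never exceed the whole model
    return math.floor(total_layers * fraction)


def llama_cli_args(model_path: str, n_gpu_layers: int,
                   ctx_size: int = 4096) -> list[str]:
    """Argument list for llama-cli with an explicit GPU layer count."""
    return [
        "llama-cli",
        "-m", model_path,
        "--n-gpu-layers", str(n_gpu_layers),
        "--ctx-size", str(ctx_size),
    ]


# 8 GB card, ~9 GB Q4_K_M model; "phi-4-Q4_K_M.gguf" is a placeholder name.
layers = gpu_layers_that_fit(vram_gb=8, model_gb=9)
print(" ".join(llama_cli_args("phi-4-Q4_K_M.gguf", layers)))
```

If the model fails to load or runs out of memory, lower the layer count by a few and try again.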

For step-by-step installation instructions, see the how to run LLMs locally guide. For a comparison of Ollama vs LM Studio, see the Ollama vs LM Studio guide.

Running Phi-4 CPU-Only

No GPU? Phi-4-mini (3.8B) is one of the best CPU-only models available. At Q4_K_M it uses just ~3 GB of RAM and generates 4–8 tokens per second on a modern CPU — fast enough for practical use.

System RAM needed

Phi-4-mini at Q4_K_M: ~3 GB RAM. Phi-4 14B at Q4_K_M: ~9 GB RAM. For CPU inference, ensure your total system RAM exceeds the model size by at least 2–4 GB to leave room for the OS and context cache.
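
The RAM rule of thumb above is easy to encode as a quick check. This sketch uses the upper end of the 2–4 GB margin by default; the function name is my own.

```python
def has_enough_ram(total_ram_gb: float, model_gb: float,
                   margin_gb: float = 4.0) -> bool:
    """True if system RAM covers the model plus room for the OS and context cache."""
    return total_ram_gb >= model_gb + margin_gb


print(has_enough_ram(16, 9))  # Phi-4 14B Q4_K_M on a 16 GB machine
print(has_enough_ram(8, 9))   # Phi-4 14B Q4_K_M on an 8 GB machine
print(has_enough_ram(8, 3))   # Phi-4-mini Q4_K_M on an 8 GB machine
```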

Speed on CPU

Phi-4-mini: 4–8 tok/s on a modern CPU (Ryzen 7, Core i7). Phi-4 14B on CPU: 1–3 tok/s — usable but slow. For 14B CPU inference, consider 32+ GB RAM and a fast CPU with many cores. Phi-4-mini is the practical CPU choice.

How to run CPU-only

With Ollama: ollama run phi4-mini — it auto-detects if no GPU is available and falls back to CPU. With llama.cpp: use --n-gpu-layers 0 to force CPU-only mode. LM Studio has a CPU mode toggle in settings.

For a detailed guide to CPU-only inference, see the CPU-only LLM inference guide.

Phi-4 14B vs Qwen3 14B vs Gemma 3 12B

All three target the same ~8–12 GB VRAM tier. Here is how they compare for local inference:

	Phi-4 14B	Qwen3 14B	Gemma 3 12B
VRAM at Q4	~9 GB	~9 GB	~8 GB
Thinking mode	No	Yes (built-in)	No
Coding	Excellent	Very good	Good
Reasoning/math	Excellent	Very good	Good
Multilingual	Poor	Excellent	Good
Creative writing	Average	Good	Good
Best for	Coding/math	General + multilingual	Balanced use

Choose Phi-4 if...

+Your primary use case is coding or math
+You want the best reasoning-per-VRAM ratio
+English-only tasks are your focus
+You prefer Microsoft's training approach

Choose Qwen3 14B if...

+You need built-in chain-of-thought reasoning
+Multilingual tasks are important
+You want a general-purpose model
+You want thinking mode without extra setup

Choose Gemma 3 12B if...

+You want Google-backed architecture
+You need balanced multilingual + reasoning
+Your GPU has exactly 8 GB (at ~8 GB for Q4, the 12B model needs less offloading than a 14B)
+You prioritize model diversity across sizes (1B–27B)

Which Hardware Should You Buy for Phi-4?

Entry tier

RTX 4060 8GB

Runs Phi-4 14B at Q4_K_M with ~1 GB of CPU offloading — functional but ~50% slower than a 12 GB GPU. Phi-4-mini runs at full speed. Good entry point if you already own this card.

12 GB — best value

Intel Arc B580 12GB

The sweet spot for Phi-4 14B. 12 GB gives ~3 GB headroom at Q4, so there is no offloading and the model runs at full GPU speed. Typically cheaper than NVIDIA 12 GB cards. Verify Ollama compatibility before buying.

Mid tier

RTX 4060 Ti 16GB

Runs Phi-4 14B at Q8 with 2 GB headroom — near-full-quality inference. The best mid-range NVIDIA option for Phi-4. Also future-proofs you for larger models up to ~14 GB.

Used 24 GB

RTX 3090 24GB

Phi-4 14B at Q8 with ~10 GB headroom for generous context lengths. Excellent VRAM-per-dollar on the used market. Older architecture but more than fast enough for Phi-4.

High end

RTX 4090 24GB

Best single consumer GPU for Phi-4. Phi-4 14B at Q8 leaves 10 GB headroom and generates ~34 tok/s. Significant overkill for a 14B model but the overall best local inference setup.

Mac ecosystem

Mac mini M4 16GB

Phi-4 14B at Q4 fits with ~7 GB headroom on unified memory. Silent, efficient, and reliable. For Q8 quality, step up to the Mac mini M4 24GB. Phi-4-mini runs on any Mac.

For a full cross-budget GPU comparison, see the best GPU for LLMs guide.

Related Resources

How to Run Phi-4 Locally

Step-by-step Ollama and LM Studio setup for Phi-4 14B

What LLMs Can I Run?

Enter your GPU — see every model that fits

Best LLMs to Run Locally

Top picks across every GPU tier in 2026

Best GPU for LLMs — Full Guide

All budget tiers, entry to high-end

How to Run LLMs Locally

Step-by-step Ollama, LM Studio, llama.cpp setup

Ollama vs LM Studio

Which tool to use for running Phi-4 locally

CPU-Only LLM Inference

Run Phi-4-mini without a GPU — setup and tips

Best LLM for Coding Locally

Phi-4 14B is a top coding pick — see the full comparison by GPU tier

Frequently Asked Questions

Is Phi-4 better than Llama 3.1 8B?

Yes, on most benchmarks. Phi-4 (14B) significantly outperforms Llama 3.1 8B on reasoning, math, and coding tasks — and matches or beats many 70B models. Phi-4's "textbook data" training gives it exceptional reasoning per parameter. The trade-off is higher VRAM (~9 GB vs ~5 GB at Q4_K_M), but the quality jump is substantial.

Can an 8GB GPU run Phi-4?

Yes, with slight CPU offloading. Phi-4 at Q4_K_M requires ~9 GB — about 1 GB over the limit of an RTX 4060 8GB. Ollama and llama.cpp handle the overflow automatically by keeping some layers in system RAM. The model runs but is ~50% slower than a 12 GB GPU. For a comfortable fit, the Intel Arc B580 12GB is the best-value option.

How does Phi-4 compare to Qwen3 14B?

Both need ~9 GB VRAM at Q4_K_M. Phi-4 leads on coding, math, and logical reasoning. Qwen3 14B has built-in thinking mode and is much better at multilingual tasks. For English coding and math work, Phi-4 is the better choice. For general-purpose or multilingual use, Qwen3 14B has the edge.

Can I run Phi-4 on a Mac?

Yes. Phi-4 runs well on Apple Silicon via Ollama or LM Studio. The Mac mini M4 16GB fits Phi-4 14B at Q4_K_M with ~7 GB of headroom. The Mac mini M4 24GB adds Phi-4 Q8 support with ~10 GB of headroom. Phi-4-mini runs on any Mac, including base M-series machines with 8 GB.

How much VRAM does Phi-4 need?

Phi-4 (14B) at Q4_K_M requires approximately 9 GB of VRAM. At Q8 it needs approximately 14 GB. At FP16 it needs approximately 28 GB. Phi-4-mini (3.8B) at Q4_K_M requires only about 2.5 GB — it fits in any GPU or runs CPU-only. For the 14B model, 12 GB is the comfortable minimum; 8 GB works with offloading.

What is Phi-4-mini and how does it differ from Phi-4?

Phi-4-mini is a 3.8B parameter model that runs in just 2.5 GB of VRAM at Q4_K_M. It is designed for edge devices and CPU-only inference. While smaller than Phi-4 14B, it still outperforms many older 7B models on coding and reasoning. Use Phi-4-mini for constrained hardware or CPU-only setups. Use Phi-4 14B when you have 12 GB+ VRAM and want maximum quality.

Sources & methodology

VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide I leaned on:

Hugging Face Hub. Microsoft's Phi-4 model card (14B parameters, 16k context) used for the size math.
Modal: How much VRAM do I need for LLM inference. VRAM-budget formula applied to each Phi-4 quant in the requirements table.
Ollama. Phi-4 GGUF quants in the Ollama library, the ones we link from the install steps.

Spot a number that does not match the linked source? Email [email protected] and I will update the guide.