Run Local AI on Mac — M1 through M4, Intel too
Editorial: AI drafted the step-by-step; the Apple-specific gotchas (Metal toggles, MLX vs Ollama tradeoffs) were added by hand from real Mac sessions.
Updated May 2026 · Covers Ollama, Open WebUI, MLX · All Apple Silicon chips + Intel
Any Mac made since 2020 can run large language models locally — no cloud subscription, no data leaving your machine. Apple Silicon Macs use Metal GPU acceleration automatically; Intel Macs fall back to CPU. This guide gets you running in under 5 minutes and covers every Mac configuration from an 8 GB M1 to a 64 GB M4 Max.
Quick Start: 3 Commands to Run Your First Model
Requires macOS 12 Monterey or later and Homebrew (install from brew.sh if needed). The whole sequence takes about 2 minutes plus model download time.
Step 1 — Install Ollama
brew install ollama
Alternative: curl -fsSL https://ollama.com/install.sh | sh or download from ollama.com
Step 2 — Start the Ollama server
ollama serve
Leave this running in one Terminal tab. The server listens on http://localhost:11434
Step 3 — Download and run a model
ollama run qwen3:8b
Qwen3 8B Q4_K_M is ~5 GB. Works on any Mac with 8 GB+ RAM. Replace with phi4:14b for 16 GB+ Macs or llama3.3:70b for 48 GB+ Macs.
On Apple Silicon: Metal GPU is automatic
Ollama detects your M-series chip and uses Metal acceleration with no extra flags. All unified memory is available as "VRAM" — there is no separate pool.
Mac Compatibility Table — What Can Each Mac Run?
Token speeds are measured with Ollama running Q4_K_M quantization. Apple Silicon uses Metal; Intel uses CPU only.
| Mac | Unified Memory | Qwen3 7B | Phi-4 14B | Llama 3.3 70B | Notes |
|---|---|---|---|---|---|
| Any Intel Mac | System RAM | ~3-8 t/s (CPU) | ✗ (slow) | ✗ | CPU-only inference — slow but works for small models |
| M1/M2/M3 8 GB | 8 GB unified | ~25 t/s | ✗ OOM | ✗ | OS takes 3-4 GB; only 4-5 GB free for models |
| M1/M2/M3/M4 16 GB | 16 GB unified | ~30 t/s | ~18 t/s (tight) | ✗ | Good for 7B; 14B is tight |
| M4 Mac Mini 24 GB | 24 GB unified | ~38 t/s | ~25 t/s | ✗ | Recommended entry point |
| M4 Pro 24 GB | 24 GB unified | ~65 t/s | ~45 t/s | ✗ | 2.3x faster — worth the upgrade |
| M4 Pro 48 GB | 48 GB unified | ~70 t/s | ~50 t/s | ~22 t/s | Best Mac Mini for 70B |
| M4 Max 64 GB | 64 GB unified | ~90 t/s | ~65 t/s | ~30 t/s | Mac Studio class |
✗ = model will not fit in available memory at Q4_K_M. Token speeds are approximate and vary with prompt length and context size.
Tool Comparison: Ollama, LM Studio, Open WebUI, MLX
Four tools cover most Mac LLM use cases. Ollama is the foundation most others build on — even if you plan to use a GUI, install Ollama first.
| Tool | Type | Install | Best For |
|---|---|---|---|
| Ollama | CLI + API server | brew install ollama | Developers, scripting, API access |
| LM Studio | GUI app | lmstudio.ai download | Beginners, model browsing |
| Open WebUI | Browser UI (requires Ollama) | pip install open-webui | ChatGPT-like interface |
| MLX Community | Python framework | pip install mlx-lm | Developers, Apple Silicon native speed |
Adding a ChatGPT-style interface
Open WebUI runs in your browser and connects to the Ollama server. Install it with pip, then open localhost:8080.
pip install open-webui open-webui serve
MLX for maximum Apple Silicon speed
MLX is Apple's own ML framework. For some models it outperforms Ollama/llama.cpp on Apple Silicon by leveraging the Neural Engine.
pip install mlx-lm mlx_lm.generate --model mlx-community/Qwen3-8B-4bit \ --prompt "Explain unified memory"
Apple Silicon Tips
Metal GPU acceleration is automatic
Ollama detects Apple Silicon at startup and uses Metal for all GPU operations. No flags, environment variables, or configuration needed. You can confirm GPU usage in Activity Monitor: open it, go to Window > GPU History, and watch GPU utilization spike to 80-100% while a model generates.
All unified memory is available — no separate VRAM limit
Unlike a PC with a discrete GPU (where only the GPU's dedicated VRAM pool is usable for inference), Apple Silicon Macs use all system RAM as "VRAM." An M4 Max with 64 GB has 64 GB available for model weights, minus what macOS and running apps consume (~4-6 GB at idle). There is no PCIe bottleneck — memory bandwidth to the GPU is the same as total system memory bandwidth.
Memory pressure and swap
If you load a model that is too large for available RAM, macOS will swap to SSD. This tanks inference speed dramatically (10x or more slowdown). Watch memory pressure in Activity Monitor: if it turns orange or red while a model loads, you need a smaller model or higher quantization. The 8B model at Q4_K_M uses roughly 5 GB — safe on any 8 GB Mac. The 14B at Q4_K_M uses about 8.5 GB — safe on 16 GB Macs.
Check memory pressure: Activity Monitor > Memory tab > Memory Pressure (bottom of window).
Choosing quantization on Apple Silicon
Q4_K_M is the standard starting point — it balances size and quality well. On 16 GB Macs, stick to Q4_K_M for 7B models. On 24 GB, you can run 7B at Q8 (near-lossless) or 14B at Q4_K_M. The M4 Pro chip at 48 GB can run 13B at Q8 and 34B at Q4_K_M comfortably. Higher quantization (Q8, FP16) improves output quality at the cost of more memory.
Intel Mac: Slower, but Functional
Ollama runs on Intel Macs (macOS 12+) using CPU-only inference. Intel integrated graphics do not have Metal support that Ollama uses for LLM acceleration, so all computation falls on the CPU cores. Speeds are typically 3-8 tokens per second for a 7B model — readable, but noticeably slow compared to Apple Silicon.
What works on Intel Mac
- 7B models at Q4_K_M (5 GB, ~3-8 t/s)
- Ollama with full model management
- Open WebUI over Ollama
- LM Studio (has its own CPU backend)
- Offline, private inference — no API key needed
What to avoid on Intel Mac
- 13B+ models — too slow for practical use
- Long context windows — every token costs more CPU time
- MLX — Apple Silicon only
- Anything requiring real-time response
Recommendation for Intel Mac users
Install Ollama and run ollama run qwen3:8b to test. If the speed is acceptable for your use case, great. If you want faster inference, the Mac mini M4 is a 4-8x speed improvement and runs on only 30 W.
Common Errors and Fixes
Error: model requires more system memory than available
The model's Q4_K_M weights exceed your available RAM minus what macOS needs. Fix: run a smaller model or use a more aggressive quantization.
# Instead of phi4:14b on an 8 GB Mac, use: ollama run qwen3:8b # Or pull the 4-bit version explicitly: ollama pull qwen3:8b:q4_K_M
Model loads but generation is extremely slow (1-2 t/s)
macOS is swapping model weights to the SSD. The model is too large for available RAM. Open Activity Monitor, check Memory Pressure — it is likely orange or red. Fix: close other applications, restart Ollama, or switch to a smaller model.
# Check how much RAM the model needs: ollama show qwen3:8b --modelinfo | grep size # Stop all models to free memory: ollama stop
ollama: command not found after brew install
Homebrew's bin directory is not in your PATH. Run the path fix or use the direct installer instead.
# Add Homebrew to PATH (add to your ~/.zshrc too): export PATH="/opt/homebrew/bin:$PATH" # Or use the direct installer: curl -fsSL https://ollama.com/install.sh | sh
Model stuck at "pulling manifest" or download stalls
Network issue or Ollama registry timeout. Press Ctrl+C to cancel, then retry. If it keeps stalling, try pulling a different model first to confirm the registry is reachable, or check ollama.com status. Large models (70B) can take 30+ minutes on a typical home connection — the progress bar may appear frozen but is still downloading.
Frequently Asked Questions
How do I install Ollama on Mac? +
Three options: (1) Run "brew install ollama" in Terminal if you have Homebrew. (2) Run "curl -fsSL https://ollama.com/install.sh | sh" to use the official install script. (3) Download the macOS app directly from ollama.com. All three produce the same result. After installing, run "ollama serve" to start the server and "ollama run qwen3:8b" to download and run your first model. macOS 12 Monterey or later is required.
Does Ollama use the GPU on Apple Silicon Mac? +
Yes — Ollama uses Apple's Metal API for GPU acceleration on Apple Silicon (M1, M2, M3, M4) automatically. No configuration is needed. When you run a model, Ollama detects the Metal-capable GPU and offloads all layers to it. You can verify this by watching Activity Monitor > GPU History while a model is generating — you will see GPU utilisation spike to 80–100%. On Intel Macs, Ollama falls back to CPU-only inference, which is significantly slower.
How much unified memory do I need for local LLMs on Mac? +
8 GB is only sufficient for very small models (1B–3B) — macOS takes 3–4 GB at idle, leaving little room. 16 GB runs 7B models at Q4_K_M comfortably (Qwen3 8B uses about 5 GB). 24 GB handles 7B to 14B models well. 48 GB is required for 30B–70B models; Llama 3.3 70B at Q4_K_M uses roughly 43 GB. 64 GB gives comfortable headroom for 70B models and runs 34B at Q8.
Can I run LLMs on an Intel Mac? +
Yes, but with significant limitations. Ollama runs on Intel Macs using CPU inference only — there is no Metal GPU acceleration for Intel integrated graphics in the LLM context. A 7B model at Q4_K_M will generate 3–8 tokens per second on a recent Intel Mac, compared to 25–40 t/s on even a base M1 chip. Intel Macs are usable for experimentation with small models (7B and below), but Apple Silicon is strongly recommended for any regular use.
What is the fastest way to run LLMs on Mac? +
For raw token throughput on Apple Silicon, MLX (Apple's ML framework) is often faster than Ollama/llama.cpp for certain workloads because it is optimised specifically for the Apple Silicon architecture, including the Neural Engine. Install it with "pip install mlx-lm" and run models from the MLX Community on Hugging Face. For convenience and ease of use, Ollama with Metal acceleration is the fastest to get running and handles most use cases at near-MLX speeds.
Related Guides
Popular hardware for local LLMs
Sources & methodology
Behaviour, file-format and runtime details on this page are pulled from primary upstream docs and community benchmark threads. The full sitewide methodology lives on the methodology page. For this guide I relied most on:
- Apple MLX. Apple's MLX framework, the fastest path for native Metal inference on M-series.
- Ollama. The macOS installer most readers will use to pull and run their first model.
- llama.cpp llama-bench discussion. M-series benchmark thread that backs the tokens-per-second ranges in this guide.
Spot a number that does not match the linked source? Email [email protected] and I will update the guide.