Run Local AI on Mac — M1 through M4, Intel too

Editorial: AI drafted the step-by-step; the Apple-specific gotchas (Metal toggles, MLX vs Ollama tradeoffs) were added by hand from real Mac sessions.

Updated May 2026 · Covers Ollama, Open WebUI, MLX · All Apple Silicon chips + Intel

Any Mac made since 2020 can run large language models locally — no cloud subscription, no data leaving your machine. Apple Silicon Macs use Metal GPU acceleration automatically; Intel Macs fall back to CPU. This guide gets you running in under 5 minutes and covers every Mac configuration from an 8 GB M1 to a 64 GB M4 Max.

Quick Start: 3 Commands to Run Your First Model

Requires macOS 12 Monterey or later and Homebrew (install from brew.sh if needed). The whole sequence takes about 2 minutes plus model download time.

Step 1 — Install Ollama

brew install ollama

Alternative: curl -fsSL https://ollama.com/install.sh | sh or download from ollama.com

Step 2 — Start the Ollama server

ollama serve

Leave this running in one Terminal tab. The server listens on http://localhost:11434

Step 3 — Download and run a model

ollama run qwen3:8b

Qwen3 8B Q4_K_M is ~5 GB. Works on any Mac with 8 GB+ RAM. Replace with phi4:14b for 16 GB+ Macs or llama3.3:70b for 48 GB+ Macs.

On Apple Silicon: Metal GPU is automatic

Ollama detects your M-series chip and uses Metal acceleration with no extra flags. All unified memory is available as "VRAM" — there is no separate pool.

Mac Compatibility Table — What Can Each Mac Run?

Token speeds are measured with Ollama running Q4_K_M quantization. Apple Silicon uses Metal; Intel uses CPU only.

Mac	Unified Memory	Qwen3 7B	Phi-4 14B	Llama 3.3 70B	Notes
Any Intel Mac	System RAM	~3-8 t/s (CPU)	✗ (slow)	✗	CPU-only inference — slow but works for small models
M1/M2/M3 8 GB	8 GB unified	~25 t/s	✗ OOM	✗	OS takes 3-4 GB; only 4-5 GB free for models
M1/M2/M3/M4 16 GB	16 GB unified	~30 t/s	~18 t/s (tight)	✗	Good for 7B; 14B is tight
M4 Mac Mini 24 GB	24 GB unified	~38 t/s	~25 t/s	✗	Recommended entry point
M4 Pro 24 GB	24 GB unified	~65 t/s	~45 t/s	✗	2.3x faster — worth the upgrade
M4 Pro 48 GB	48 GB unified	~70 t/s	~50 t/s	~22 t/s	Best Mac Mini for 70B
M4 Max 64 GB	64 GB unified	~90 t/s	~65 t/s	~30 t/s	Mac Studio class

✗ = model will not fit in available memory at Q4_K_M. Token speeds are approximate and vary with prompt length and context size.

Tool Comparison: Ollama, LM Studio, Open WebUI, MLX

Four tools cover most Mac LLM use cases. Ollama is the foundation most others build on — even if you plan to use a GUI, install Ollama first.

Tool	Type	Install	Best For
Ollama	CLI + API server	brew install ollama	Developers, scripting, API access
LM Studio	GUI app	lmstudio.ai download	Beginners, model browsing
Open WebUI	Browser UI (requires Ollama)	pip install open-webui	ChatGPT-like interface
MLX Community	Python framework	pip install mlx-lm	Developers, Apple Silicon native speed

Adding a ChatGPT-style interface

Open WebUI runs in your browser and connects to the Ollama server. Install it with pip, then open localhost:8080.

pip install open-webui
open-webui serve

MLX for maximum Apple Silicon speed

MLX is Apple's own ML framework. For some models it outperforms Ollama/llama.cpp on Apple Silicon by leveraging the Neural Engine.

pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen3-8B-4bit \
  --prompt "Explain unified memory"

Apple Silicon Tips

Metal GPU acceleration is automatic

Ollama detects Apple Silicon at startup and uses Metal for all GPU operations. No flags, environment variables, or configuration needed. You can confirm GPU usage in Activity Monitor: open it, go to Window > GPU History, and watch GPU utilization spike to 80-100% while a model generates.

All unified memory is available — no separate VRAM limit

Unlike a PC with a discrete GPU (where only the GPU's dedicated VRAM pool is usable for inference), Apple Silicon Macs use all system RAM as "VRAM." An M4 Max with 64 GB has 64 GB available for model weights, minus what macOS and running apps consume (~4-6 GB at idle). There is no PCIe bottleneck — memory bandwidth to the GPU is the same as total system memory bandwidth.

Memory pressure and swap

If you load a model that is too large for available RAM, macOS will swap to SSD. This tanks inference speed dramatically (10x or more slowdown). Watch memory pressure in Activity Monitor: if it turns orange or red while a model loads, you need a smaller model or higher quantization. The 8B model at Q4_K_M uses roughly 5 GB — safe on any 8 GB Mac. The 14B at Q4_K_M uses about 8.5 GB — safe on 16 GB Macs.

Check memory pressure: Activity Monitor > Memory tab > Memory Pressure (bottom of window).

Choosing quantization on Apple Silicon

Q4_K_M is the standard starting point — it balances size and quality well. On 16 GB Macs, stick to Q4_K_M for 7B models. On 24 GB, you can run 7B at Q8 (near-lossless) or 14B at Q4_K_M. The M4 Pro chip at 48 GB can run 13B at Q8 and 34B at Q4_K_M comfortably. Higher quantization (Q8, FP16) improves output quality at the cost of more memory.

Intel Mac: Slower, but Functional

Ollama runs on Intel Macs (macOS 12+) using CPU-only inference. Intel integrated graphics do not have Metal support that Ollama uses for LLM acceleration, so all computation falls on the CPU cores. Speeds are typically 3-8 tokens per second for a 7B model — readable, but noticeably slow compared to Apple Silicon.

What works on Intel Mac

7B models at Q4_K_M (5 GB, ~3-8 t/s)
Ollama with full model management
Open WebUI over Ollama
LM Studio (has its own CPU backend)
Offline, private inference — no API key needed

What to avoid on Intel Mac

13B+ models — too slow for practical use
Long context windows — every token costs more CPU time
MLX — Apple Silicon only
Anything requiring real-time response

Recommendation for Intel Mac users

Install Ollama and run ollama run qwen3:8b to test. If the speed is acceptable for your use case, great. If you want faster inference, the Mac mini M4 is a 4-8x speed improvement and runs on only 30 W.

Common Errors and Fixes

Error: model requires more system memory than available

The model's Q4_K_M weights exceed your available RAM minus what macOS needs. Fix: run a smaller model or use a more aggressive quantization.

# Instead of phi4:14b on an 8 GB Mac, use:
ollama run qwen3:8b

# Or pull the 4-bit version explicitly:
ollama pull qwen3:8b:q4_K_M

Model loads but generation is extremely slow (1-2 t/s)

macOS is swapping model weights to the SSD. The model is too large for available RAM. Open Activity Monitor, check Memory Pressure — it is likely orange or red. Fix: close other applications, restart Ollama, or switch to a smaller model.

# Check how much RAM the model needs:
ollama show qwen3:8b --modelinfo | grep size

# Stop all models to free memory:
ollama stop

ollama: command not found after brew install

Homebrew's bin directory is not in your PATH. Run the path fix or use the direct installer instead.

# Add Homebrew to PATH (add to your ~/.zshrc too):
export PATH="/opt/homebrew/bin:$PATH"

# Or use the direct installer:
curl -fsSL https://ollama.com/install.sh | sh

Model stuck at "pulling manifest" or download stalls

Network issue or Ollama registry timeout. Press Ctrl+C to cancel, then retry. If it keeps stalling, try pulling a different model first to confirm the registry is reachable, or check ollama.com status. Large models (70B) can take 30+ minutes on a typical home connection — the progress bar may appear frozen but is still downloading.

Frequently Asked Questions

How do I install Ollama on Mac? +

Three options: (1) Run "brew install ollama" in Terminal if you have Homebrew. (2) Run "curl -fsSL https://ollama.com/install.sh | sh" to use the official install script. (3) Download the macOS app directly from ollama.com. All three produce the same result. After installing, run "ollama serve" to start the server and "ollama run qwen3:8b" to download and run your first model. macOS 12 Monterey or later is required.

Does Ollama use the GPU on Apple Silicon Mac? +

Yes — Ollama uses Apple's Metal API for GPU acceleration on Apple Silicon (M1, M2, M3, M4) automatically. No configuration is needed. When you run a model, Ollama detects the Metal-capable GPU and offloads all layers to it. You can verify this by watching Activity Monitor > GPU History while a model is generating — you will see GPU utilisation spike to 80–100%. On Intel Macs, Ollama falls back to CPU-only inference, which is significantly slower.

How much unified memory do I need for local LLMs on Mac? +

8 GB is only sufficient for very small models (1B–3B) — macOS takes 3–4 GB at idle, leaving little room. 16 GB runs 7B models at Q4_K_M comfortably (Qwen3 8B uses about 5 GB). 24 GB handles 7B to 14B models well. 48 GB is required for 30B–70B models; Llama 3.3 70B at Q4_K_M uses roughly 43 GB. 64 GB gives comfortable headroom for 70B models and runs 34B at Q8.

Can I run LLMs on an Intel Mac? +

Yes, but with significant limitations. Ollama runs on Intel Macs using CPU inference only — there is no Metal GPU acceleration for Intel integrated graphics in the LLM context. A 7B model at Q4_K_M will generate 3–8 tokens per second on a recent Intel Mac, compared to 25–40 t/s on even a base M1 chip. Intel Macs are usable for experimentation with small models (7B and below), but Apple Silicon is strongly recommended for any regular use.

What is the fastest way to run LLMs on Mac? +

For raw token throughput on Apple Silicon, MLX (Apple's ML framework) is often faster than Ollama/llama.cpp for certain workloads because it is optimised specifically for the Apple Silicon architecture, including the Neural Engine. Install it with "pip install mlx-lm" and run models from the MLX Community on Hugging Face. For convenience and ease of use, Ollama with Metal acceleration is the fastest to get running and handles most use cases at near-MLX speeds.

Related Guides

Popular hardware for local LLMs

RTX 4060 (8 GB)

Budget pick. Runs 7B-8B models at 25-35 tok/s.

Buy on Amazon

RTX 4060 Ti 16 GB

Sweet spot. Runs 13B-14B at full speed. Best value.

Buy on Amazon

RTX 4090 (24 GB)

Top consumer GPU. Runs 70B models with offloading.

Buy on Amazon

Sources & methodology

Behaviour, file-format and runtime details on this page are pulled from primary upstream docs and community benchmark threads. The full sitewide methodology lives on the methodology page. For this guide I relied most on:

Apple MLX. Apple's MLX framework, the fastest path for native Metal inference on M-series.
Ollama. The macOS installer most readers will use to pull and run their first model.
llama.cpp llama-bench discussion. M-series benchmark thread that backs the tokens-per-second ranges in this guide.

Spot a number that does not match the linked source? Email [email protected] and I will update the guide.