What Hardware Do You Need to Run Mistral Locally?

AI pulled the Mistral / Mixtral size table. The MoE VRAM nuance and the per-size hardware picks were edited against the Mistral docs and the cited community benchmarks.

Updated May 2026 · Mistral 7B to 22B · VRAM requirements · Consumer GPU guide · Ollama setup

Mistral 7B is one of the most popular open-weight 7B models, and at ~4.5 GB at Q4_K_M it runs on nearly any GPU, including budget 6 GB cards. For higher quality, Mistral Nemo 12B on the Intel Arc B580 is the sweet spot in 2026. Mistral Small 22B steps up to the 16 GB GPU tier. All three run via Ollama with a single command.

What is Mistral AI?

Mistral AI is a French AI company founded in 2023 that releases open-weight models known for punching above their weight class. Mistral 7B was the first model to significantly challenge Llama 2 at the same parameter count.

Mistral VRAM Requirements by Model Size

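VRAM is estimated as ceil(params × bytes_per_param) + 1.5 GB overhead, with params counted in billions. Q4_K_M uses ~0.5 bytes/param, Q8 uses ~1.0 bytes/param, and FP16 uses ~2.0 bytes/param. Treat the result as a rough guide: the table's figures are rounded and do not always match the formula exactly. Add extra headroom for the KV cache when using long context windows.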
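As a rough sketch, the formula can be written in a few lines of Python. The function name and the dictionary are illustrative, not part of any tool; the bytes-per-parameter values are the approximations quoted above, and parameters are counted in billions.

```python
import math

# Approximate bytes per parameter, as quoted in the text above.
BYTES_PER_PARAM = {"Q4_K_M": 0.5, "Q8_0": 1.0, "FP16": 2.0}

def estimate_vram_gb(params_billions: float, quant: str, overhead_gb: float = 1.5) -> float:
    """ceil(params x bytes_per_param) plus a fixed overhead, in GB."""
    weights_gb = math.ceil(params_billions * BYTES_PER_PARAM[quant])
    return weights_gb + overhead_gb

print(estimate_vram_gb(12, "Q4_K_M"))  # 7.5
print(estimate_vram_gb(22, "Q4_K_M"))  # 12.5
```

The result is a planning estimate only; measured usage depends on the runtime, the exact quantization mix, and the context length.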

| Model | Params | Q4_K_M VRAM | Q8 VRAM | FP16 VRAM | Min GPU |
| --- | --- | --- | --- | --- | --- |
| Mistral 7B v0.3 | 7B | ~4.5 GB | ~8 GB | ~15 GB | 6–8 GB GPU |
| Mistral Nemo 12B | 12B | ~7.5 GB | ~13 GB | ~26 GB | 8–12 GB GPU |
| Mistral Small 22B | 22B | ~13 GB | ~24 GB | ~48 GB | 16 GB GPU |
| Mistral Large 2 | 123B | ~70 GB | ~130 GB | N/A | Mac Studio 128GB |

Mistral 7B at Q4_K_M fits on 6 GB GPUs, making it one of the few 7B models with this much headroom. Use the VRAM Calculator for context-length-adjusted estimates.
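For a sense of how much headroom context needs, here is a back-of-the-envelope KV cache estimate. It is a sketch under stated assumptions: a 16-bit cache, one sequence, and Mistral 7B's published architecture (32 layers, 8 key/value heads, head dimension 128). The function name is illustrative, and runtimes that quantize the cache will use less.

```python
def kv_cache_gib(context_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Size of the key/value cache in GiB for one sequence."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 1024**3

print(kv_cache_gib(8192))   # 1.0
print(kv_cache_gib(32768))  # 4.0
```

Under these assumptions, filling the full 32K window costs about as much memory as the Q4 weights themselves, which is why short contexts are recommended on tight cards.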

Which Quantization Should You Use?

Quantization trades a small amount of output quality for a large reduction in VRAM usage. For most users, Q4_K_M is the right default.

Q4_K_M — recommended

Cuts VRAM to roughly a third of FP16 (Mistral 7B: ~4.5 GB vs ~15 GB) with minimal quality loss. The Ollama default for all Mistral models. Mistral 7B at Q4 runs at ~40 tok/s on an RTX 4060, fast enough for real-time conversation. Start here.

Q8_0 — best quality

Approximately doubles VRAM vs Q4 but preserves near-FP16 output quality. Mistral 7B at Q8 needs ~8 GB — fits on an RTX 4060 8GB with minimal headroom. Use Q8 when you have extra VRAM and want maximum output fidelity.

FP16 — reference only

Unquantized 16-bit weights, no quality loss. Mistral 7B at FP16 needs ~15 GB and requires a 16 GB GPU. Mainly used for fine-tuning or benchmarking, not daily inference. Only practical for 7B on consumer GPUs: Nemo 12B at FP16 needs ~26 GB and Small 22B needs ~48 GB.

Q2/Q3 — constrained devices

Aggressive quantizations for very limited hardware. Quality degrades noticeably — particularly on instruction following and code. Only consider if your device cannot fit Q4. Mistral 7B at Q2 can run on 2–3 GB but output quality suffers.
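One way to turn this guidance into a rule of thumb is sketched below: pick the highest-quality quantization that still leaves a safety margin free. The VRAM figures are this page's own estimates. The function name and the 1 GB reserve are assumptions for illustration, and FP16 is left out because it is treated here as reference-only.

```python
# Approximate VRAM needs in GB, copied from the table on this page.
VRAM_GB = {
    "Mistral 7B": {"Q4_K_M": 4.5, "Q8_0": 8.0},
    "Mistral Nemo 12B": {"Q4_K_M": 7.5, "Q8_0": 13.0},
    "Mistral Small 22B": {"Q4_K_M": 13.0, "Q8_0": 24.0},
}

# Highest fidelity first.
QUALITY_ORDER = ["Q8_0", "Q4_K_M"]

def best_quant(model: str, gpu_vram_gb: float, reserve_gb: float = 1.0):
    """Return the best quant that leaves reserve_gb free, or None if nothing fits."""
    for quant in QUALITY_ORDER:
        if VRAM_GB[model][quant] + reserve_gb <= gpu_vram_gb:
            return quant
    return None

print(best_quant("Mistral Nemo 12B", 16))   # Q8_0
print(best_quant("Mistral Small 22B", 16))  # Q4_K_M
print(best_quant("Mistral Small 22B", 12))  # None
```

Raising `reserve_gb` is a simple way to account for longer context windows.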

What Mistral Models Can You Run on Your GPU?

Find your GPU or Mac below. Each card shows which Mistral models fit, and what does not.

RTX 4060 8GB

Runs:

  • +Mistral 7B (Q4 with plenty of headroom; Q8 fits with minimal headroom)
  • +Mistral Nemo 12B (Q4 only, tight — ~7.5 GB used)

Does not fit:

  • -Mistral Nemo 12B Q8 (needs ~13 GB)
  • -Mistral Small 22B (needs ~13 GB at Q4)

The best budget entry point for Mistral. Mistral 7B at Q8 uses ~8 GB and leaves minimal headroom — keep context moderate. The 7B at Q4 runs very fast (~40 tok/s) with plenty of headroom. Nemo 12B at Q4 fits but is tight — short context recommended.

Intel Arc B580 12GB

Runs:

  • +Mistral 7B (all quants, comfortable)
  • +Mistral Nemo 12B (Q4, ~7.5 GB — ~4.5 GB headroom)

Does not fit:

  • -Mistral Nemo 12B Q8 (needs ~13 GB)
  • -Mistral Small 22B (needs ~13 GB at Q4)

Best value for Mistral Nemo 12B. 12 GB gives comfortable headroom at Q4 for Nemo. Verify Ollama's Intel Arc (oneAPI) compatibility before purchasing: ROCm is AMD's stack and does not apply to Arc, and Arc driver support is good but lags NVIDIA in maturity.

RTX 4060 Ti 16GB

Runs:

  • +Mistral 7B and Nemo 12B (all quants)
  • +Mistral Small 22B (Q4, ~13 GB — ~3 GB headroom)

Does not fit:

  • -Mistral Small 22B Q8 (needs ~24 GB)
  • -Mistral Large 2 (needs ~70 GB)

The sweet spot for Mistral Small 22B at Q4. About 3 GB of headroom, so keep context windows moderate. Mistral 7B and Nemo 12B fit at both Q4 and Q8; Small 22B fits at Q4 only. Best value for the full Mistral 7B–22B family.

RTX 4070 12GB

Runs:

  • +Mistral 7B (all quants)
  • +Mistral Nemo 12B (Q4, ~7.5 GB, ~4.5 GB headroom)

Does not fit:

  • -Mistral Nemo 12B Q8 (needs ~13 GB)
  • -Mistral Small 22B (needs ~13 GB at Q4, just over 12 GB; ~24 GB at Q8)

12 GB matches the Arc B580 for Mistral capacity but with faster ~504 GB/s bandwidth. Mistral Small 22B at Q4 needs ~13 GB — technically exceeds 12 GB, so this GPU cannot run it. For 22B support, step up to the RTX 4060 Ti 16GB.

RTX 4070 Ti Super 16GB

Runs:

  • +Mistral 7B and Nemo 12B (all quants)
  • +Mistral Small 22B (Q4, with ~3 GB headroom)

Does not fit:

  • -Mistral Small 22B Q8 (needs ~24 GB)
  • -Mistral Large 2

Same VRAM ceiling as the 4060 Ti 16GB but 2.3x faster memory bandwidth (~672 GB/s). Mistral Small 22B generates noticeably faster. The better choice if Mistral Small generation speed matters.

RTX 4090 24GB

Runs:

  • +Mistral 7B, Nemo 12B, Small 22B (all quants)
  • +Mistral Small 22B Q8 (~24 GB — very tight, ~0 GB headroom)

Does not fit:

  • -Mistral Large 2 (needs ~70 GB)
  • -Mistral Small Q8 with long context (no headroom)

Best single consumer GPU for Mistral 7B–22B. Small 22B at Q8 technically fits but leaves no headroom — use Q4 for practical daily use. At 1,008 GB/s bandwidth, all Mistral models generate fast. Cannot touch Mistral Large 2.

RTX 5090 32GB

Runs:

  • +All Mistral models up to 22B at any quant
  • +Mistral Small 22B Q8 (~24 GB) with 8 GB headroom

Does not fit:

  • -Mistral Large 2 (needs ~70 GB)

32 GB gives Mistral Small 22B Q8 comfortable headroom for long context. At 1,792 GB/s it is the fastest single-GPU for Mistral inference. Still cannot run Mistral Large 2 — that requires multi-GPU or a high-memory Mac.

Mac mini M4 16GB

Runs:

  • +Mistral 7B (all quants)
  • +Mistral Nemo 12B (Q4, ~7.5 GB — comfortable)

Does not fit:

  • -Mistral Nemo 12B Q8 (needs ~13 GB)
  • -Mistral Small 22B (needs ~13 GB at Q4)

Unified memory lets the GPU use the same 16 GB pool as the CPU, though macOS and running apps take a share of it. Mistral 7B and Nemo 12B at Q4 run smoothly. Silent and energy-efficient. For Mistral Small 22B, step up to the 24 GB model.

Mac mini M4 24GB

Runs:

  • +Mistral 7B and Nemo 12B (all quants)
  • +Mistral Small 22B (Q4, ~13 GB — ~11 GB headroom)

Does not fit:

  • -Mistral Small 22B Q8 (needs ~24 GB, equals full RAM)
  • -Mistral Large 2

The best Mac value for the full Mistral 7B–22B family. Small 22B at Q4 leaves ~11 GB for OS and KV cache — comfortable context lengths. Q8 for the 22B would need the full 24 GB with no headroom — avoid.

Mac mini M4 Pro 48GB

Runs:

  • +Mistral 7B, Nemo 12B, Small 22B (all quants including Q8)
  • +Mistral Small 22B Q8 (~24 GB, ~24 GB headroom)

Does not fit:

  • -Mistral Large 2 (needs ~70 GB)

48 GB gives Mistral Small 22B Q8 generous headroom. Excellent for high-quality Mistral Small inference. Still cannot run Mistral Large 2 at any usable quantization.

Mac Studio M4 Max 128GB

Runs:

  • +Mistral 7B, Nemo 12B, Small 22B (all quants)
  • +Mistral Large 2 (Q4, ~70 GB — ~58 GB headroom)

Does not fit:

  • -Mistral Large 2 Q8 (needs ~130 GB)

The only practical consumer option for Mistral Large 2. At Q4 the 123B model uses ~70 GB of the 128 GB, leaving ~58 GB for OS and context. Bandwidth is ~546 GB/s on the 128 GB M4 Max configuration, which is reasonably fast for a 123B model. Q8 does not fit.

Inference Speed by Hardware

Token generation speed is bottlenecked by memory bandwidth. The table below shows estimated Q4_K_M token speeds at low batch size. Real-world results vary by driver version, context length, and system load.

| Hardware | Bandwidth | 7B Q4 tok/s | Nemo 12B tok/s | Small 22B tok/s |
| --- | --- | --- | --- | --- |
| RTX 5090 32GB | 1,792 GB/s | ~200 t/s | ~120 t/s | ~65 t/s |
| RTX 4090 24GB | 1,008 GB/s | ~112 t/s | ~67 t/s | ~36 t/s |
| RTX 4070 Ti Super 16GB | 672 GB/s | ~75 t/s | ~45 t/s | ~24 t/s |
| RTX 4080 16GB | 720 GB/s | ~80 t/s | ~48 t/s | ~26 t/s |
| RTX 4060 Ti 16GB | 288 GB/s | ~32 t/s | ~19 t/s | ~10 t/s |
| Intel Arc B580 12GB | 456 GB/s | ~51 t/s | ~30 t/s | - |
| RTX 4060 8GB | 272 GB/s | ~40 t/s | ~18 t/s | - |
| Mac Studio M4 Max 128GB | ~546 GB/s | ~61 t/s | ~36 t/s | ~20 t/s |
| Mac mini M4 Pro 48GB | ~273 GB/s | ~30 t/s | ~18 t/s | ~10 t/s |
| Mac mini M4 24GB | ~120 GB/s | ~13 t/s | ~8 t/s | ~4 t/s |

Speed estimates start from the bandwidth ceiling, tokens/sec ≈ bandwidth (GB/s) / model size in memory (GB); the table's figures come out at roughly half that ceiling. Mac bandwidth figures are approximate, and the Mac Studio row uses the ~546 GB/s of the 128 GB M4 Max configuration. A dash (-) means the model does not fit at that VRAM tier.
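The bandwidth rule can be sketched as a one-line function. The `efficiency` factor is an assumption, not something the sources state: a value of 0.5 happens to reproduce most of the table's rows, since real decoding rarely reaches the theoretical bandwidth ceiling.

```python
def estimate_tok_per_s(bandwidth_gb_s: float, model_gb: float, efficiency: float = 0.5) -> float:
    """Bandwidth-bound decode speed: each generated token reads the weights once."""
    return efficiency * bandwidth_gb_s / model_gb

print(round(estimate_tok_per_s(1008, 4.5)))  # 112  (RTX 4090, Mistral 7B Q4)
print(round(estimate_tok_per_s(1008, 7.5)))  # 67   (RTX 4090, Nemo 12B Q4)
```

Setting `efficiency=1.0` gives the theoretical upper bound for a given card and model size.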

How to Run Mistral Locally

Ollama

ollama run mistral-nemo

Easiest option. With Ollama installed, one command downloads and runs the model. Tags: mistral (7B), mistral-nemo (12B), mistral-small (22B). GPU is auto-detected on NVIDIA, AMD (ROCm), and Apple Silicon. Defaults to Q4_K_M quantization.
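Once a model is pulled, Ollama also serves a local HTTP API, which makes it easy to measure generation speed on your own hardware. The sketch below assumes a server running on Ollama's default port (11434) and uses the `eval_count` and `eval_duration` fields that the `/api/generate` endpoint returns; the helper names and the prompt are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def tokens_per_second(reply: dict) -> float:
    """Decode speed: eval_count tokens over eval_duration (nanoseconds)."""
    return reply["eval_count"] / (reply["eval_duration"] / 1e9)

def generate(model: str, prompt: str) -> dict:
    """Send one non-streaming request to a locally running Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Example (needs the Ollama server running and the model already pulled):
# reply = generate("mistral-nemo", "Explain quantization in one sentence.")
# print(reply["response"])
# print(f"{tokens_per_second(reply):.1f} tok/s")
```

Comparing the measured figure against the estimates on this page is a quick sanity check that the model is actually running on the GPU rather than spilling to CPU.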

LM Studio

Search "Mistral" in Discover

GUI-based model browser and chat interface. Download GGUF quantizations directly within the app. Best for non-technical users. Runs on Windows, Mac, and Linux. Lets you pick Q4, Q8, or other quants from a dropdown.

Hugging Face + llama.cpp

mistralai/Mistral-Nemo-Instruct-2407-GGUF

Download GGUF files from the official mistralai org or community repos (bartowski, unsloth) on Hugging Face. Run with llama.cpp for maximum control over quantization, context length, and GPU layer offloading.
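Layer offloading can be approximated with simple arithmetic. The sketch below assumes every layer is the same size (real models are not perfectly uniform). The file name, the 40-layer count, and the helper names are placeholders for illustration. The `-m`, `-ngl`, and `-c` flags are llama.cpp's options for model path, GPU layer count, and context size.

```python
import math

def gpu_layers_that_fit(model_gb: float, n_layers: int, free_vram_gb: float) -> int:
    """Number of layers that fit in free VRAM, assuming equal-sized layers."""
    per_layer_gb = model_gb / n_layers
    return min(n_layers, math.floor(free_vram_gb / per_layer_gb))

def llama_cli_args(gguf_path: str, n_gpu_layers: int, context_tokens: int) -> list[str]:
    """Build an argument list for llama.cpp's llama-cli binary."""
    return ["llama-cli", "-m", gguf_path,
            "-ngl", str(n_gpu_layers), "-c", str(context_tokens)]

# Hypothetical case: a 7.5 GB, 40-layer model with 6 GB of VRAM to spare.
layers = gpu_layers_that_fit(7.5, 40, 6.0)
print(layers)  # 32
print(llama_cli_args("mistral-nemo-q4_k_m.gguf", layers, 8192))
```

Layers that do not fit stay in system RAM and run on the CPU, so generation slows down as the offloaded share shrinks.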

For step-by-step installation instructions, see the how to run LLMs locally guide. For a comparison of Ollama vs LM Studio, see the Ollama vs LM Studio guide.

Mistral 7B vs Llama 3.1 8B vs Qwen3 8B

All three are strong open-weight 7–8B models that run on an 8 GB GPU. Here is how they compare for local use:

| | Mistral 7B | Llama 3.1 8B | Qwen3 8B |
| --- | --- | --- | --- |
| VRAM (Q4) | ~4.5 GB | ~5 GB | ~5 GB |
| Coding | Good | Good | Excellent |
| Multilingual | Excellent | Good | Excellent |
| Speed (RTX 4060) | ~40 tok/s | ~35 tok/s | ~35 tok/s |
| Context window | 32K | 128K | 32K |
| Best use case | Multilingual, fast general use | Long context, general | Coding, reasoning |

Choose Mistral 7B if...

  • +You need multilingual support (French, Spanish, German, etc.)
  • +You want the fastest 7B generation speed
  • +You have a 6 GB GPU and need maximum model quality
  • +You want a battle-tested, widely-used open model

Choose Llama 3.1 8B if...

  • +You work with long documents (128K context)
  • +You want the strongest overall benchmarks at 8B
  • +You want the broadest fine-tune and tool ecosystem
  • +You need Meta's official weights with commercial-friendly license

Choose Qwen3 8B if...

  • +You primarily use the model for coding tasks
  • +You want built-in chain-of-thought reasoning (thinking mode)
  • +You work in Chinese alongside English
  • +You want the best reasoning at the 8B parameter count

Which Hardware Should You Buy for Mistral?

Entry level

RTX 4060 8GB

Runs Mistral 7B at Q8 with minimal headroom and Nemo 12B at Q4 (tight). The 7B model at Q4 is fast, capable, and leaves plenty of headroom for context. Best budget entry for Mistral.

Best value

Intel Arc B580 12GB

The cheapest path to a comfortable Mistral Nemo 12B experience. 12 GB gives ~4.5 GB of headroom at Q4 for Nemo — enough for reasonable context lengths. Verify Ollama compatibility before buying.

Sweet spot

RTX 4060 Ti 16GB

Best-value GPU for the full Mistral 7B–22B family. Mistral Small 22B at Q4 fits with ~3 GB headroom. Runs all three consumer Mistral sizes. If Mistral Small is your target, this is the card to buy.

Mid-range

AMD RX 7900 XTX 24GB

Runs Mistral Small 22B at Q8 with nearly no headroom, or at Q4 with generous headroom. Works with Ollama via ROCm on Linux. 24 GB at a lower price than the RTX 4090.

High end

RTX 4090 24GB

Best single consumer GPU for Mistral 7B–22B. Mistral Small at Q4 runs fast with headroom. 1,008 GB/s bandwidth means fast generation across all sizes. Cannot run Mistral Large 2.

Mac ecosystem

Mac mini M4 24GB

Best Mac value for Mistral Small 22B — ~11 GB of headroom at Q4. Runs Mistral 7B and Nemo 12B at any quant. For Mistral Large 2, you need a Mac Studio M4 Max 128GB. Silent and energy-efficient.

For a full cross-budget GPU comparison, see the best GPU for LLMs guide.

Frequently Asked Questions

Mistral 7B vs Llama 3.1 8B — which is better for local use?

Both are excellent 7–8B models for local use. Mistral 7B uses slightly less VRAM (~4.5 GB at Q4 vs ~5 GB for Llama 3.1 8B) and generates faster at roughly 40 tok/s on an RTX 4060 vs ~35 tok/s. Llama 3.1 8B has a longer context window (128K vs 32K), stronger overall benchmark scores, and a broader fine-tune ecosystem. Mistral 7B is the better choice if you prioritize multilingual quality or raw speed on a 6–8 GB GPU. Llama 3.1 8B is better for long-document tasks and general-purpose use.

Can an 8GB GPU run Mistral Nemo 12B?

Technically yes, but it is tight. Mistral Nemo 12B at Q4_K_M requires approximately 7.5 GB of VRAM. An RTX 4060 8GB has about 0.5 GB of headroom — enough to run but keep context windows short to avoid out-of-memory errors. At Q8, Mistral Nemo needs ~13 GB and will not fit on 8 GB. For a comfortable Nemo experience, the Intel Arc B580 12GB gives ~4.5 GB of headroom at Q4.

Can I run Mistral on a Mac?

Yes. All Mistral models run on Apple Silicon Macs via Ollama or LM Studio. Unified memory lets the GPU use the same RAM pool as the CPU, though macOS and running apps take a share of it. A Mac mini M4 16GB runs Mistral 7B and Nemo 12B comfortably. A Mac mini M4 24GB adds Mistral Small 22B at Q4 with ~11 GB of headroom. For Mistral Large 2 (123B), you need a Mac Studio M4 Max 128GB.

What is the best GPU for Mistral 7B?

Mistral 7B at Q4_K_M needs ~4.5 GB of VRAM, so almost any GPU with 6 GB or more will run it. The RTX 4060 8GB is the best value: it runs Mistral 7B at Q4 with plenty of headroom at around 40 tok/s, and fits Q8 (~8 GB) with minimal headroom. Even older 6 GB cards like the RTX 2060 can run Mistral 7B at Q4. Prioritize memory bandwidth over raw VRAM when choosing a GPU for the 7B model.

Mistral Nemo vs Phi-4 — which should I run locally?

Mistral Nemo 12B and Phi-4 14B target similar hardware. Phi-4 has stronger STEM and reasoning benchmarks in its size class. Mistral Nemo has better multilingual support and a longer context window (128K tokens). For coding, math, and science tasks in English, Phi-4 is the better choice. For multilingual tasks or long-context use, Mistral Nemo is the pick. Both run well on the Intel Arc B580 12GB at Q4.

How do I run Mistral with Ollama?

Run: ollama run mistral for the 7B model, ollama run mistral-nemo for the 12B model, or ollama run mistral-small for the 22B model. Ollama auto-detects your GPU and downloads Q4_K_M quantizations by default. On an 8 GB GPU, stick with mistral or mistral-nemo. On a 16 GB GPU, mistral-small runs comfortably. Ollama works on NVIDIA, AMD (via ROCm on Linux), and Apple Silicon without configuration changes.

Check VRAM requirements for Mistral models, or compare hardware options.

Sources & methodology

VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide I leaned on:

Spot a number that does not match the linked source? Email [email protected] and I will update the guide.