What Hardware Do You Need for Gemma 3?
The initial figures were pulled by an AI tool from Google's Gemma 3 model cards. Every VRAM and tokens-per-second figure here was then cross-checked against the methodology page formula and the linked community sources.
Updated May 2026 · Gemma 3 1B–27B · VRAM requirements · Consumer GPU guide · Ollama & LM Studio
Gemma 3, released by Google DeepMind in March 2025, is a family of open-weight models in 1B, 4B, 12B, and 27B sizes. The 27B flagship benchmarks competitively with much larger closed models and, at Q4_K_M, fits in roughly 16 GB of VRAM, which makes it a strong argument for a 16 GB card such as the RTX 4060 Ti 16GB or RTX 4080. The 12B model fits on 8 GB at Q4_K_M, though tightly. All four sizes run in Ollama and LM Studio with no code required.
What is Gemma 3?
Gemma 3 is an open-weight model family from Google DeepMind. Released in March 2025, it comes in four sizes with instruction-tuned variants available for each.
- Four sizes: 1B, 4B, 12B, and 27B. All are dense models where every parameter is loaded and active during inference, with no MoE routing complexity.
- Strong reasoning: The 27B model benchmarks competitively with much larger closed models on math, coding, and multi-step reasoning tasks.
- Multilingual: Excellent performance across English, Spanish, French, German, Japanese, Korean, and many other languages, one of Gemma 3's key design goals.
- Hugging Face IDs: google/gemma-3-1b-it, google/gemma-3-4b-it, google/gemma-3-12b-it, google/gemma-3-27b-it
Gemma 3 VRAM Requirements by Model Size
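VRAM is estimated as ceil(params × bytes_per_param) + 1.5 GB of overhead, with params in billions so that the product is in GB. Q4_K_M uses ~0.5 bytes/param, Q8 uses ~1.0 bytes/param, and FP16 uses ~2.0 bytes/param. The 0.5 figure is a simplification: Q4_K_M averages slightly more than 4 bits per weight, so real GGUF files run somewhat larger than the estimate. Add extra headroom for KV cache when using long context windows.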
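As a minimal sketch, the formula above can be written as a small Python function. The function name and the dictionary of quantization labels are illustrative choices, not part of any tool mentioned in this guide, and the bytes-per-parameter values are the approximate ones quoted above.

```python
import math

# Approximate bytes per parameter, as quoted in the text (not exact file sizes).
BYTES_PER_PARAM = {"Q4_K_M": 0.5, "Q8_0": 1.0, "FP16": 2.0}
OVERHEAD_GB = 1.5  # fixed runtime overhead assumed by the formula


def estimate_vram_gb(params_billion: float, quant: str) -> float:
    """ceil(params x bytes_per_param) + 1.5 GB, with params in billions."""
    weights_gb = math.ceil(params_billion * BYTES_PER_PARAM[quant])
    return weights_gb + OVERHEAD_GB


for size in (12, 27):
    row = {quant: estimate_vram_gb(size, quant) for quant in BYTES_PER_PARAM}
    print(f"Gemma 3 {size}B: {row}")
```

For the 12B model this gives 7.5, 13.5, and 25.5 GB, and for the 27B model 15.5, 28.5, and 55.5 GB, before any KV cache is added.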
| Model | Params | Q4_K_M VRAM | Q8 VRAM | FP16 VRAM | Min GPU |
|---|---|---|---|---|---|
| Gemma 3 1B | 1B | ~1 GB | ~1.5 GB | ~2 GB | Any GPU |
| Gemma 3 4B | 4B | ~3 GB | ~5 GB | ~8 GB | 8 GB GPU |
| Gemma 3 12B | 12B | ~7.5 GB | ~13 GB | ~25 GB | 8 GB GPU |
| Gemma 3 27B | 27B | ~16 GB | ~28 GB | ~55 GB | 16 GB GPU |
Gemma 3 27B at Q4_K_M needs about 16 GB (15.5 GB by the formula above), which is a tight fit on 16 GB GPUs and leaves little room for KV cache. Use the VRAM Calculator for context-length-adjusted estimates and KV cache projections.
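To see why context length eats into headroom, here is a rough KV cache sketch using the generic transformer formula (keys plus values, per layer, per token). The layer count, KV-head count, and head dimension below are placeholder assumptions for a 27B-class model, not figures taken from this guide, and the sketch assumes full attention in every layer with a 16-bit cache. Models that use sliding-window attention on some layers, as Gemma 3 is reported to, should need less than this.

```python
def kv_cache_gib(tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> float:
    """Upper-bound KV cache size in GiB, assuming full attention in every layer."""
    # Factor 2 covers one key vector and one value vector per token.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return tokens * bytes_per_token / 1024**3


# Hypothetical architecture values for illustration only; check the model's
# config.json for the real numbers before relying on the result.
ASSUMED_ARCH = dict(n_layers=62, n_kv_heads=16, head_dim=128)

for context in (2048, 8192, 32768):
    size = kv_cache_gib(context, **ASSUMED_ARCH)
    print(f"{context:>6} tokens -> {size:.2f} GiB of KV cache")
```

Under these assumptions the cache grows linearly with context, which is why a model that barely fits at short context can run out of memory at long context.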
Which Quantization Should You Use?
Quantization trades a small amount of output quality for a large reduction in VRAM usage. For most users, Q4_K_M is the right default.
Q4_K_M — recommended
Cuts VRAM to roughly a quarter to a third of FP16 (and about half of Q8) with minimal quality loss on most tasks. The standard quantization for consumer hardware, and the default for Ollama's Gemma 3 tags. This is what to use unless you have excess VRAM.
Q8_0 — best quality
Approximately doubles VRAM vs Q4 but preserves near-FP16 quality. Gemma 3 27B at Q8 needs ~28 GB — an RTX 5090 32GB or Mac mini M4 Pro 48GB. Use Q8 when you have the headroom and want maximum output fidelity.
FP16 — reference only
Unquantized 16-bit weights, used as the quality reference. Gemma 3 27B at FP16 needs ~55 GB, well beyond any single consumer GPU. FP16 is mainly used for fine-tuning or benchmarking, not daily inference.
Q2/Q3 — small devices
Aggressive quantizations for very constrained hardware. Quality degrades noticeably — especially on reasoning tasks. Only consider Q2/Q3 if your device cannot fit Q4 and you need the model for lightweight tasks.
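The choice between these quantizations can be sketched as "take the highest-quality option that fits". The helper below is an illustrative assumption, not a feature of Ollama or LM Studio. It reuses this guide's VRAM formula and leaves out Q2/Q3 because the guide gives no bytes-per-parameter figure for them.

```python
import math
from typing import Optional

# Ordered from highest quality to lowest; values are the guide's approximations.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8_0": 1.0, "Q4_K_M": 0.5}
OVERHEAD_GB = 1.5


def best_quant(params_billion: float, vram_gb: float,
               headroom_gb: float = 0.0) -> Optional[str]:
    """Return the highest-quality quantization that fits, or None if none does."""
    for quant, bytes_per_param in BYTES_PER_PARAM.items():
        needed = math.ceil(params_billion * bytes_per_param) + OVERHEAD_GB
        if needed + headroom_gb <= vram_gb:
            return quant
    return None


print(best_quant(27, 16))  # 16 GB card, no reserved headroom
print(best_quant(27, 32))  # 32 GB card
print(best_quant(12, 8, headroom_gb=2.0))  # 8 GB card, 2 GB reserved for context
```

Passing a non-zero `headroom_gb` is a simple way to reserve space for KV cache, which is why the last call finds no 12B option on an 8 GB card once 2 GB is set aside.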
What Gemma 3 Can You Run on Your GPU?
Find your GPU or Mac below. Each card shows which Gemma 3 models fit, and what does not.
RTX 4060 8GB
Runs:
- +Gemma 3 1B (all quants)
- +Gemma 3 4B (Q4 & Q8)
- +Gemma 3 12B (Q4 only, tight — ~7.5 GB)
Does not fit:
- -Gemma 3 12B Q8 (needs ~13 GB)
- -Gemma 3 27B (needs ~16 GB)
Best budget entry point for Gemma 3. The 12B at Q4 is a tight fit — keep context windows short. The 4B model runs at full Q8 quality with plenty of headroom, making it a great daily-use setup.
Intel Arc B580 12GB
Runs:
- +Gemma 3 1B through 4B (all quants)
- +Gemma 3 12B (Q4 comfortably, ~7.5 GB)
Does not fit:
- -Gemma 3 12B Q8 (needs ~13 GB)
- -Gemma 3 27B (needs ~16 GB)
Best value for Gemma 3 12B. 12 GB gives ~4.5 GB headroom at Q4 on the 12B model, enough for reasonable context lengths. Verify Ollama's Intel Arc (oneAPI) compatibility before purchasing; ROCm is AMD's stack and does not apply to this card.
RTX 4060 Ti 16GB
Runs:
- +Gemma 3 1B through 12B (all quants)
- +Gemma 3 27B (Q4_K_M, ~16 GB — tight)
Does not fit:
- -Gemma 3 27B Q8 (needs ~28 GB)
- -Gemma 3 27B FP16 (needs ~55 GB)
The best-value GPU for Gemma 3 27B. At Q4_K_M the 27B model uses nearly the full 16 GB, so reduce context length to avoid out-of-memory (OOM) errors. A sensible Gemma 3 27B setup for budget-conscious users.
RTX 4070 12GB
Runs:
- +Gemma 3 1B through 4B (all quants)
- +Gemma 3 12B (Q4, comfortable)
Does not fit:
- -Gemma 3 12B Q8 (needs ~13 GB)
- -Gemma 3 27B (needs ~16 GB)
12 GB matches the Arc B580 for Gemma 3 capacity. Faster bandwidth (~504 GB/s) than B580. For Gemma 3 use cases the RTX 4060 Ti 16GB is the better buy — the extra 4 GB unlocks the 27B model.
RTX 4070 Ti Super 16GB
Runs:
- +Gemma 3 1B through 12B (all quants)
- +Gemma 3 27B (Q4_K_M, tight)
Does not fit:
- -Gemma 3 27B Q8 (needs ~28 GB)
Same VRAM ceiling as the 4060 Ti 16GB but 2.3x faster memory bandwidth (~672 GB/s). Gemma 3 27B tokens generate noticeably faster. Still a tight fit at Q4 — same context length caveats apply.
RTX 4080 16GB
Runs:
- +Gemma 3 1B through 12B (all quants)
- +Gemma 3 27B (Q4_K_M, tight)
Does not fit:
- -Gemma 3 27B Q8 (needs ~28 GB)
Fast bandwidth (~720 GB/s) makes Gemma 3 27B generation snappy. The 16 GB ceiling is the same as the 4060 Ti — same VRAM limitation. Pay for speed, not extra capacity, at this tier.
RTX 3090 24GB (used)
Runs:
- +Gemma 3 1B through 27B at Q4_K_M (all with headroom)
- +Gemma 3 12B Q8 (~13 GB, comfortable)
Does not fit:
- -Gemma 3 27B Q8 (needs ~28 GB, exceeds 24 GB)
24 GB unlocks Gemma 3 27B at Q4 with ~8 GB of headroom, enough for comfortable context lengths. Cannot fit 27B Q8. An older Ampere-generation card, but ~936 GB/s bandwidth and excellent used-market value make it a strong choice for running 27B.
AMD RX 7900 XTX 24GB
Runs:
- +Gemma 3 1B through 27B at Q4_K_M
- +Gemma 3 12B Q8 (~13 GB, comfortable)
Does not fit:
- -Gemma 3 27B Q8 (needs ~28 GB)
24 GB at ~960 GB/s bandwidth. Works with Ollama via ROCm on Linux. Same Gemma 3 model ceiling as the RTX 4090 at a lower price. See the AMD ROCm guide for setup instructions.
RTX 4090 24GB
Runs:
- +Gemma 3 1B through 27B at Q4_K_M (comfortable)
- +Gemma 3 12B Q8
Does not fit:
- -Gemma 3 27B Q8 (needs ~28 GB)
Best single consumer GPU for Gemma 3. 1,008 GB/s bandwidth means fast generation on the 27B model. At Q4 the 27B uses ~16 GB leaving 8 GB headroom — generous for long context. 27B Q8 does not fit.
RTX 5090 32GB
Runs:
- +All Gemma 3 sizes at Q4 and Q8
- +Gemma 3 27B Q8 (~28 GB, fits with 4 GB headroom)
Does not fit:
- -Gemma 3 27B FP16 (~55 GB)
Only consumer GPU with enough VRAM for Gemma 3 27B at Q8. At 1,792 GB/s it delivers the fastest single-GPU generation speeds. The 4 GB headroom at 27B Q8 is adequate for moderate context lengths.
Mac mini M4 16GB
Runs:
- +Gemma 3 1B through 4B (all quants)
- +Gemma 3 12B (Q4, ~7.5 GB — fits with headroom)
Does not fit:
- -Gemma 3 27B (~16 GB, equals full RAM — no headroom)
- -Gemma 3 12B Q8 (needs ~13 GB)
Unified memory lets the GPU draw on the same 16 GB pool as the CPU, although macOS and other apps share it. Gemma 3 12B at Q4 uses ~7.5 GB, leaving ~8.5 GB for the OS and KV cache. Silent and efficient. The 27B model at ~16 GB would consume the entire pool with no headroom, so avoid it.
Mac mini M4 24GB
Runs:
- +Gemma 3 1B through 12B (all quants)
- +Gemma 3 27B (Q4, ~16 GB — fits with ~8 GB headroom)
Does not fit:
- -Gemma 3 27B Q8 (needs ~28 GB)
The sweet spot for Gemma 3 27B on Mac. 8 GB of headroom above the model weights gives comfortable context lengths. Gemma 3 12B at Q8 uses ~13 GB, leaving ~11 GB headroom — excellent quality.
Mac mini M4 Pro 48GB
Runs:
- +All Gemma 3 sizes at Q4 and Q8
- +Gemma 3 27B Q8 (~28 GB, ~20 GB headroom)
Does not fit:
- -Gemma 3 27B FP16 (~55 GB, exceeds 48 GB)
48 GB comfortably fits Gemma 3 27B at Q8 with generous headroom for long context. ~273 GB/s bandwidth is slower than discrete GPUs but the silence, power efficiency, and price are compelling.
Mac Studio M4 Max 64GB
Runs:
- +All Gemma 3 sizes at Q4 and Q8
- +Gemma 3 27B Q8 with large context windows
Does not fit:
- -Gemma 3 27B FP16 (~55 GB: loads in 64 GB but leaves only ~9 GB for macOS and KV cache, so not practical)
~546 GB/s bandwidth makes Gemma 3 27B Q8 generation fast. Realistically the best Mac for maximum-quality Gemma 3 27B inference. FP16 technically fits but context is very limited, so stick with Q8.
Inference Speed by Hardware
Token generation speed is bottlenecked by memory bandwidth. The table below shows estimated Q4_K_M token speeds at low batch size. Real-world results vary by driver version, context length, and system load.
| Hardware | Bandwidth | 4B Q4 tok/s | 12B Q4 tok/s | 27B Q4 tok/s |
|---|---|---|---|---|
| RTX 5090 32GB | 1,792 GB/s | ~240 t/s | ~96 t/s | ~45 t/s |
| RTX 4090 24GB | 1,008 GB/s | ~135 t/s | ~54 t/s | ~25 t/s |
| RTX 4070 Ti Super 16GB | 672 GB/s | ~90 t/s | ~36 t/s | ~17 t/s |
| RTX 4080 16GB | 720 GB/s | ~96 t/s | ~38 t/s | ~18 t/s |
| RTX 4060 Ti 16GB | 288 GB/s | ~38 t/s | ~15 t/s | ~7 t/s |
| Intel Arc B580 12GB | 456 GB/s | ~61 t/s | ~24 t/s | n/a |
| RTX 4060 8GB | 272 GB/s | ~36 t/s | ~14 t/s | n/a |
| Mac Studio M4 Max 64GB | ~546 GB/s | ~73 t/s | ~29 t/s | ~14 t/s |
| Mac mini M4 Pro 48GB | ~273 GB/s | ~36 t/s | ~14 t/s | ~7 t/s |
| Mac mini M4 24GB | ~120 GB/s | ~16 t/s | ~6 t/s | ~3 t/s |
Speed estimates scale with bandwidth (GB/s) divided by model size in memory (GB). That quotient is a theoretical ceiling (for example, 1,008 / 16 ≈ 63 t/s for the RTX 4090 on 27B Q4), and the figures listed here sit below it. Mac bandwidth figures are approximate. An entry of n/a means the model does not fit at that VRAM tier.
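The bandwidth rule of thumb can be sketched in a few lines. This treats the quotient as an upper bound on single-stream generation speed, on the assumption that every weight is read once per generated token. Real throughput will be lower, and the function name is illustrative.

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Bandwidth-bound ceiling: each token requires reading the whole model once."""
    return bandwidth_gb_s / model_gb


# Bandwidth figures from the table above; 16 GB is the guide's 27B Q4_K_M size.
BANDWIDTH_GB_S = {"RTX 4090 24GB": 1008, "RTX 4060 Ti 16GB": 288}
MODEL_GB = 16

for gpu, bandwidth in BANDWIDTH_GB_S.items():
    ceiling = max_tokens_per_sec(bandwidth, MODEL_GB)
    print(f"{gpu}: at most ~{ceiling:.0f} tokens/s on Gemma 3 27B Q4_K_M")
```

Comparing two cards this way is most useful as a ratio: the same model on a card with 3.5x the bandwidth should generate roughly 3.5x faster, whatever the absolute numbers turn out to be.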
How to Run Gemma 3 Locally
Ollama
Run: ollama run gemma3:27b
Easiest option. One command downloads and runs the model once Ollama is installed. Available tags: gemma3:1b, gemma3:4b, gemma3:12b, gemma3:27b. GPU is auto-detected on NVIDIA, AMD (ROCm), and Apple Silicon. Ollama defaults to Q4_K_M.
LM Studio
Search "Gemma 3" in Discover. A GUI-based model browser and chat interface. Download GGUF quantizations directly within the app. Best for non-technical users. Runs on Windows, Mac, and Linux. Lets you pick Q4, Q8, or other quants from a dropdown.
Hugging Face + llama.cpp
Repo: google/gemma-3-27b-it-GGUF
Download GGUF files from the official google org or community repos (bartowski, unsloth) on Hugging Face. Run with llama.cpp for maximum control over quantization, context length, and GPU layer offloading.
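As a rough sketch of what "GPU layer offloading" means in practice, the snippet below estimates how many layers fit in a VRAM budget and assembles a llama.cpp command line. The flags `-m`, `-ngl`, `-c`, and `-p` are llama.cpp's model, GPU-layer, context-size, and prompt options. The GGUF filename, the 62-layer count, and the equal-sized-layers assumption are hypothetical and only for illustration.

```python
import math
import shlex


def gpu_layers_that_fit(model_gb: float, n_layers: int, vram_budget_gb: float) -> int:
    """Rough layer count that fits in VRAM, assuming all layers are the same size."""
    per_layer_gb = model_gb / n_layers
    return min(n_layers, math.floor(vram_budget_gb / per_layer_gb))


def llama_cli_command(model_path: str, n_gpu_layers: int,
                      ctx_size: int, prompt: str) -> list:
    """Build an argv list for llama.cpp's llama-cli binary (not executed here)."""
    return ["llama-cli", "-m", model_path, "-ngl", str(n_gpu_layers),
            "-c", str(ctx_size), "-p", prompt]


# Hypothetical values: 14 GB of weights, 62 layers, 12 GB of VRAM set aside.
layers = gpu_layers_that_fit(model_gb=14.0, n_layers=62, vram_budget_gb=12.0)
command = llama_cli_command("gemma-3-27b-it-Q4_K_M.gguf", layers, 4096, "Hello")
print(shlex.join(command))
```

Layers that do not fit stay in system RAM and run on the CPU, so generation slows down as the offloaded count drops below the total.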
For step-by-step installation instructions, see the how to run LLMs locally guide. For a comparison of Ollama vs LM Studio, see the Ollama vs LM Studio guide.
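Once a model is running under Ollama, it can also be called from a script through Ollama's local REST API. The sketch below assumes a default install listening on localhost port 11434 and uses the `num_ctx` option to cap the context window, which helps on tight-VRAM cards. The helper names are illustrative, and the final call is left commented out because it needs a running Ollama server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_request(model: str, prompt: str, num_ctx: int = 4096) -> urllib.request.Request:
    """Build a non-streaming generate request with a capped context window."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, num_ctx: int = 4096) -> str:
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt, num_ctx)) as response:
        return json.loads(response.read())["response"]


# Requires `ollama run gemma3:4b` (or `ollama pull gemma3:4b`) to have been run first:
# print(generate("gemma3:4b", "Say hello in one sentence.", num_ctx=2048))
```

Lowering `num_ctx` trades maximum conversation length for a smaller KV cache, which is the practical lever on cards where the model weights already fill most of the VRAM.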
Gemma 3 vs Qwen3 vs Llama 3.3
All three are leading open-weight model families. Here is how they differ for local hardware use:
| Model family | Largest dense size | VRAM for largest | Min GPU for largest | Thinking mode | Architecture |
|---|---|---|---|---|---|
| Gemma 3 | 27B | ~16 GB Q4 | RTX 4060 Ti 16GB | No | Dense |
| Qwen3 | 32B | ~17.5 GB Q4 | RTX 3090 24GB | Yes (built-in) | Dense + MoE |
| Llama 3.3 | 70B | ~40 GB Q4 | Dual 24GB GPUs | No | Dense |
Choose Gemma 3 if...
- +You want the best 16GB GPU model
- +You prioritize multilingual quality
- +You want Google-backed architecture with strong reasoning
- +You prefer a simpler dense model (no MoE complexity)
Choose Qwen3 if...
- +You want built-in chain-of-thought reasoning
- +You want more size options (0.6B to 32B dense)
- +You have a 24GB GPU and want the 32B model
- +You want the MoE 30B-A3B for fast inference with large weights
Choose Llama 3.3 if...
- +You have a large multi-GPU setup (70B needs 40+ GB)
- +You want Meta's official weights
- +You need broad ecosystem support and fine-tune availability
- +The 70B size covers your use case (Llama 3.3 ships only as 70B; the 8B model is Llama 3.1)
Which Hardware Should You Buy for Gemma 3?
RTX 4060 8GB
Runs Gemma 3 4B at Q8 and 12B at Q4 (tight). The 4B model is fast and capable for everyday chat and coding tasks. Best entry point for Gemma 3 on a budget.
Intel Arc B580 12GB
Cheapest path to a comfortable Gemma 3 12B experience. 12 GB gives ~4.5 GB headroom at Q4. Verify Ollama compatibility before buying — Arc driver support is good but lags NVIDIA.
RTX 4060 Ti 16GB
The best-value GPU for Gemma 3 27B. The 27B at Q4_K_M fits exactly in 16 GB — this is the minimum GPU that runs the flagship model. If the 27B is your target, this is the card to buy.
RTX 3090 24GB
Runs Gemma 3 27B at Q4 with 8 GB headroom for generous context lengths. Cannot fit 27B Q8 (needs 28 GB). Older architecture, but excellent VRAM-per-dollar on the used market.
RTX 4090 24GB
Best single consumer GPU for Gemma 3. 27B at Q4 with 8 GB headroom and 1,008 GB/s bandwidth means fast generation. The go-to choice if budget is flexible.
Mac mini M4 24GB
Best Mac value for Gemma 3 27B. Unified memory gives 8 GB headroom above the model. For Q8 quality on 27B, step up to Mac mini M4 Pro 48GB. Silent, efficient, and runs all four Gemma 3 sizes.
For a full cross-budget GPU comparison, see the best GPU for LLMs guide.
Related Resources
What LLMs Can I Run?
Enter your GPU — see every model that fits
Qwen3 Hardware Requirements
Compare Gemma 3 vs Qwen3 for your GPU
Best GPU for LLMs — Full Guide
All budget tiers from entry to workstation
How to Run Gemma 3 Locally
Step-by-step Ollama setup for Gemma 3 1B through 27B — vision included
How to Run LLMs Locally
Step-by-step Ollama, LM Studio, llama.cpp setup
Ollama vs LM Studio
Which tool to use for running Gemma 3 locally
LLM Quantization Explained
Q4 vs Q8 vs FP16 — when trade-offs matter
Apple Silicon for LLMs
M4, M4 Pro, M4 Max — which Mac for Gemma 3?
Best LLM for Coding Locally
Gemma 3 12B and Qwen3 are top coding picks — comparison by GPU tier
VRAM Calculator
Calculate exact VRAM at your context length
Frequently Asked Questions
What GPU do I need to run Gemma 3 27B?
Gemma 3 27B at Q4_K_M requires approximately 16 GB of VRAM. The smallest GPUs that fit it are the 16 GB cards: RTX 4060 Ti 16GB, RTX 4070 Ti Super 16GB, and RTX 4080 16GB. The 24 GB cards (RTX 3090, RTX 4090, and AMD RX 7900 XTX) run it with about 8 GB of headroom. On Mac, the Mac mini M4 24GB runs it comfortably. The RTX 4060 Ti 16GB is the best-value option, provided you keep context lengths moderate to avoid OOM.
Can an 8GB GPU run Gemma 3 12B?
Yes, but tightly. Gemma 3 12B at Q4_K_M needs approximately 7.5 GB, which fits on an RTX 4060 8GB with about 0.5 GB of headroom. Keep context windows short to stay stable. At Q8 the 12B model needs ~13 GB and will not fit on 8GB. For a more comfortable 12B experience, the Intel Arc B580 12GB gives ~4.5 GB of headroom at Q4.
How does Gemma 3 compare to Qwen3 quality?
Both are strong 2025 model families. Gemma 3 27B is a strong pick for reasoning and multilingual tasks and fits in 16 GB of VRAM at Q4_K_M. Qwen3 offers a built-in chain-of-thought thinking mode and a wider size range (0.6B to 32B dense). Qwen3-32B at Q4 edges ahead on complex reasoning but requires a 24GB GPU. For 16GB GPUs, Gemma 3 27B is the flagship choice.
How do I run Gemma 3 with Ollama?
Run: ollama run gemma3:27b — replace "27b" with your target size (1b, 4b, 12b, or 27b). Ollama auto-detects NVIDIA, AMD, and Apple Silicon GPUs and downloads Q4_K_M by default. For an 8GB GPU, use gemma3:4b or gemma3:12b with a short context window.
Can I run Gemma 3 on a Mac?
Yes. All Gemma 3 models run on Apple Silicon Macs via Ollama or LM Studio. Unified memory lets the GPU use the same pool as the CPU, shared with macOS. Mac mini M4 16GB runs Gemma 3 1B, 4B, and 12B. Mac mini M4 24GB adds Gemma 3 27B at Q4 with ~8 GB headroom. Mac mini M4 Pro 48GB runs the 27B at Q8 quality.
What is the best quantization for Gemma 3?
Q4_K_M is the recommended default: it cuts VRAM to roughly a quarter to a third of FP16 with minimal quality loss. Q8_0 offers near-FP16 quality but roughly doubles VRAM requirements relative to Q4. For the 27B model, Q4 fits in 16 GB and Q8 needs ~28 GB. Use Q4 unless you have substantial VRAM headroom and want maximum output quality.
Sources & methodology
VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide I leaned on:
- Hugging Face Hub. Google's official Gemma 3 model cards for 1B, 4B, 12B and 27B parameter counts.
- Ollama. Tested Gemma 3 GGUF quants (Q4_K_M, Q6_K, Q8_0) pulled from the Ollama library.
- Modal: How much VRAM do I need for LLM inference. VRAM formula used to size each Gemma variant against consumer GPUs.
Spot a number that does not match the linked source? Email me and I will update the guide.