What Hardware Do You Need for Gemma 3?

Q: Can an 8GB GPU run Gemma 3 12B?

Yes — just barely. Gemma 3 12B at Q4_K_M needs approximately 7.5 GB of VRAM, which fits on an 8GB GPU like the RTX 4060 with very little headroom (~0.5 GB). In practice you should reduce context length to stay stable. At Q8, the 12B model needs ~13 GB and will not fit on 8GB. For a comfortable 12B experience, the Intel Arc B580 12GB is the better choice — it gives ~4.5 GB of headroom at Q4.

Q: How does Gemma 3 compare to Qwen3 quality?

Gemma 3 and Qwen3 are both strong 2025 open-weight model families, but they have different strengths. Gemma 3 27B delivers excellent reasoning, coding, and multilingual performance in a dense architecture — making it one of the best 27B models available. Qwen3 has the advantage of built-in thinking mode (chain-of-thought reasoning) and MoE variants that offer more model diversity. For general-purpose use, Gemma 3 27B and Qwen3-14B are roughly comparable. Qwen3-32B edges ahead on complex reasoning at the cost of needing a 24GB GPU. Gemma 3 12B and Qwen3-8B target the same 8GB GPU tier and perform similarly on most benchmarks.

Q: How do I run Gemma 3 with Ollama?

Run: ollama run gemma3:4b, replacing "4b" with your target size (1b, 4b, 12b, or 27b). Ollama auto-detects NVIDIA, AMD (via ROCm), and Apple Silicon GPUs and downloads the tag's default GGUF quantization, Q4_K_M. For the 27B model, run: ollama run gemma3:27b. At Q4_K_M the 27B model requires about 16 GB of VRAM. If you have an 8GB GPU, stick with gemma3:4b, or use gemma3:12b with a short context window.

Q: What is the best quantization for Gemma 3?

Q4_K_M is the recommended quantization for most users. It cuts VRAM roughly in half compared to FP16 with minimal quality loss. Q8_0 offers better output quality — closer to the original model — but requires approximately twice the VRAM of Q4. For the 27B model: Q4_K_M fits in 16 GB, Q8 needs ~28 GB. Use Q4_K_M unless you have excess VRAM headroom and want the best possible quality.

AI was used for the initial data pull from Google's Gemma 3 model cards. Every VRAM and tokens-per-second figure here was cross-checked against the methodology page formula and the linked community sources.

Updated May 2026 · Gemma 3 1B–27B · VRAM requirements · Consumer GPU guide · Ollama & LM Studio

Gemma 3, released by Google DeepMind in March 2025, is a family of open-weight models in 1B, 4B, 12B, and 27B sizes. The 27B flagship benchmarks competitively with much larger closed models and fits in about 16 GB of VRAM at Q4_K_M, making it a strong argument for owning an RTX 4060 Ti 16GB or RTX 4080. The 12B model fits on 8 GB at Q4_K_M. All four sizes run on Ollama and LM Studio with no code required.

What is Gemma 3?

Gemma 3 is an open-weight model family from Google DeepMind. Released in March 2025, it comes in four sizes with instruction-tuned variants available for each.

Four sizes: 1B, 4B, 12B, and 27B. All are dense models where every parameter is loaded and active during inference, with no MoE routing complexity.
Strong reasoning: The 27B model benchmarks competitively with much larger closed models on math, coding, and multi-step reasoning tasks.
Multilingual: Excellent performance across English, Spanish, French, German, Japanese, Korean, and many other languages, one of Gemma 3's key design goals.
Hugging Face IDs: google/gemma-3-1b-it, google/gemma-3-4b-it, google/gemma-3-12b-it, google/gemma-3-27b-it

Gemma 3 VRAM Requirements by Model Size

VRAM is estimated using ceil(params × bytes_per_param) + 1.5 GB overhead. Q4_K_M uses ~0.5 bytes/param, Q8 uses ~1.0 bytes/param, FP16 uses ~2.0 bytes/param. Add extra headroom for KV cache when using long context windows.

Model	Params	Q4_K_M VRAM	Q8 VRAM	FP16 VRAM	Min GPU
Gemma 3 1B	1B	~1 GB	~1.5 GB	~2 GB	Any GPU
Gemma 3 4B	4B	~3 GB	~5 GB	~8 GB	8 GB GPU
Gemma 3 12B	12B	~7.5 GB	~13 GB	~25 GB	8 GB GPU
Gemma 3 27B	27B	~16 GB	~28 GB	~55 GB	16 GB GPU

Gemma 3 27B at Q4_K_M fits exactly in 16 GB — tight on 16GB GPUs. Use the VRAM Calculator for context-length-adjusted estimates and KV cache projections.
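
The estimation formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming the bytes-per-parameter values and the 1.5 GB overhead quoted in this section, with the parameter count expressed in billions. The function and constant names are hypothetical, and the table rounds its entries, so small differences from the computed values are expected.

```python
import math

# Approximate bytes per parameter quoted in this guide (assumed values).
BYTES_PER_PARAM = {"Q4_K_M": 0.5, "Q8_0": 1.0, "FP16": 2.0}
OVERHEAD_GB = 1.5  # fixed overhead assumed by the guide's formula


def estimate_vram_gb(params_billion: float, quant: str) -> float:
    """Estimate VRAM in GB as ceil(params x bytes_per_param) + overhead."""
    weights_gb = math.ceil(params_billion * BYTES_PER_PARAM[quant])
    return weights_gb + OVERHEAD_GB


# Print the estimates for the two largest Gemma 3 sizes.
for size in (12, 27):
    for quant in BYTES_PER_PARAM:
        print(f"Gemma 3 {size}B {quant}: ~{estimate_vram_gb(size, quant)} GB")
```

For the 12B model at Q4_K_M this gives 7.5 GB, and for the 27B model 15.5 GB, which the table rounds to ~16 GB. The estimate does not include KV cache, so long context windows need extra room on top.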

Which Quantization Should You Use?

Quantization trades a small amount of output quality for a large reduction in VRAM usage. For most users, Q4_K_M is the right default.

Q4_K_M — recommended

Cuts VRAM roughly in half versus FP16 with minimal quality loss on most tasks. The standard quantization for consumer hardware. All Ollama defaults use Q4_K_M. This is what to use unless you have excess VRAM.

Q8_0 — best quality

Approximately doubles VRAM vs Q4 but preserves near-FP16 quality. Gemma 3 27B at Q8 needs ~28 GB, which calls for an RTX 5090 32GB or a Mac mini M4 Pro 48GB. Use Q8 when you have the headroom and want maximum output fidelity.

FP16 — reference only

Unquantized 16-bit weights, so no quantization loss. Gemma 3 27B at FP16 needs ~55 GB, well beyond any single consumer GPU. FP16 is mainly used for fine-tuning or benchmarking, not daily inference.

Q2/Q3 — small devices

Aggressive quantizations for very constrained hardware. Quality degrades noticeably — especially on reasoning tasks. Only consider Q2/Q3 if your device cannot fit Q4 and you need the model for lightweight tasks.
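
The choice between these options can be expressed as a small selection routine: try the highest-fidelity format first and fall back until one fits. The sketch below is illustrative only. It reuses this guide's VRAM formula with assumed bytes-per-parameter values, the function name is hypothetical, and it reserves no extra room for KV cache.

```python
import math
from typing import Optional

# Formats ordered from highest to lowest fidelity, with the approximate
# bytes per parameter used in this guide (assumed values).
QUANTS_BY_QUALITY = [("FP16", 2.0), ("Q8_0", 1.0), ("Q4_K_M", 0.5)]
OVERHEAD_GB = 1.5


def best_quant_that_fits(params_billion: float, vram_gb: float) -> Optional[str]:
    """Return the highest-fidelity format whose estimate fits in vram_gb."""
    for name, bytes_per_param in QUANTS_BY_QUALITY:
        needed_gb = math.ceil(params_billion * bytes_per_param) + OVERHEAD_GB
        if needed_gb <= vram_gb:
            return name
    # Nothing at Q4 or above fits: consider Q2/Q3 or a smaller model.
    return None


print(best_quant_that_fits(27, 16))  # 27B on a 16 GB card
print(best_quant_that_fits(27, 32))  # 27B on a 32 GB card
```

Because the routine ignores context length, treat a result that only just fits (such as the 27B model on 16 GB) as a signal to keep context windows short.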

What Gemma 3 Can You Run on Your GPU?

Find your GPU or Mac below. Each card shows which Gemma 3 models fit, and what does not.

RTX 4060 8GB

Runs:

+Gemma 3 1B (all quants)
+Gemma 3 4B (Q4 & Q8)
+Gemma 3 12B (Q4 only, tight — ~7.5 GB)

Does not fit:

-Gemma 3 12B Q8 (needs ~13 GB)
-Gemma 3 27B (needs ~16 GB)

Best budget entry point for Gemma 3. The 12B at Q4 is a tight fit — keep context windows short. The 4B model runs at full Q8 quality with plenty of headroom, making it a great daily-use setup.

Intel Arc B580 12GB

Runs:

+Gemma 3 1B through 4B (all quants)
+Gemma 3 12B (Q4 comfortably, ~7.5 GB)

Does not fit:

-Gemma 3 12B Q8 (needs ~13 GB)
-Gemma 3 27B (needs ~16 GB)

Best value for Gemma 3 12B. 12 GB gives ~4.5 GB headroom at Q4 on the 12B model, enough for reasonable context lengths. Verify Ollama and oneAPI compatibility before purchasing (ROCm applies only to AMD GPUs, not Intel Arc).

RTX 4060 Ti 16GB

Runs:

+Gemma 3 1B through 12B (all quants)
+Gemma 3 27B (Q4_K_M, ~16 GB — tight)

Does not fit:

-Gemma 3 27B Q8 (needs ~28 GB)
-Gemma 3 27B FP16 (needs ~55 GB)

The best-value GPU for Gemma 3 27B. At Q4_K_M the 27B model uses the full 16 GB — reduce context length to avoid OOM. This is the most popular Gemma 3 27B setup for budget-conscious users.

RTX 4070 12GB

Runs:

+Gemma 3 1B through 4B (all quants)
+Gemma 3 12B (Q4, comfortable)

Does not fit:

-Gemma 3 12B Q8 (needs ~13 GB)
-Gemma 3 27B (needs ~16 GB)

12 GB matches the Arc B580 for Gemma 3 capacity. Faster bandwidth (~504 GB/s) than B580. For Gemma 3 use cases the RTX 4060 Ti 16GB is the better buy — the extra 4 GB unlocks the 27B model.

RTX 4070 Ti Super 16GB

Runs:

+Gemma 3 1B through 12B (all quants)
+Gemma 3 27B (Q4_K_M, tight)

Does not fit:

-Gemma 3 27B Q8 (needs ~28 GB)

Same VRAM ceiling as the 4060 Ti 16GB but 2.3x faster memory bandwidth (~672 GB/s). Gemma 3 27B tokens generate noticeably faster. Still a tight fit at Q4 — same context length caveats apply.

RTX 4080 16GB

Runs:

+Gemma 3 1B through 12B (all quants)
+Gemma 3 27B (Q4_K_M, tight)

Does not fit:

-Gemma 3 27B Q8 (needs ~28 GB)

Fast bandwidth (~720 GB/s) makes Gemma 3 27B generation snappy. The 16 GB ceiling is the same as the 4060 Ti — same VRAM limitation. Pay for speed, not extra capacity, at this tier.

RTX 3090 24GB (used)

Runs:

+Gemma 3 1B through 27B at Q4_K_M (all with headroom)
+Gemma 3 12B Q8 (~13 GB, comfortable)

Does not fit:

-Gemma 3 27B Q8 (needs ~28 GB, exceeds 24 GB)

24 GB unlocks Gemma 3 27B at Q4 with ~8 GB of headroom, enough for comfortable context lengths. Cannot fit 27B Q8. It is an older Ampere-generation card (PCIe 4.0), but ~936 GB/s bandwidth makes it excellent used-market value for running 27B.

AMD RX 7900 XTX 24GB

Runs:

+Gemma 3 1B through 27B at Q4_K_M
+Gemma 3 12B Q8 (~13 GB, comfortable)

Does not fit:

-Gemma 3 27B Q8 (needs ~28 GB)

24 GB at ~960 GB/s bandwidth. Works with Ollama via ROCm on Linux. Same Gemma 3 model ceiling as the RTX 4090 at a lower price. See the AMD ROCm guide for setup instructions.

RTX 4090 24GB

Runs:

+Gemma 3 1B through 27B at Q4_K_M (comfortable)
+Gemma 3 12B Q8

Does not fit:

-Gemma 3 27B Q8 (needs ~28 GB)

Best single consumer GPU for Gemma 3. 1,008 GB/s bandwidth means fast generation on the 27B model. At Q4 the 27B uses ~16 GB leaving 8 GB headroom — generous for long context. 27B Q8 does not fit.

RTX 5090 32GB

Runs:

+All Gemma 3 sizes at Q4 and Q8
+Gemma 3 27B Q8 (~28 GB, fits with 4 GB headroom)

Does not fit:

-Gemma 3 27B FP16 (~55 GB)

Only consumer GPU with enough VRAM for Gemma 3 27B at Q8. At 1,792 GB/s it delivers the fastest single-GPU generation speeds. The 4 GB headroom at 27B Q8 is adequate for moderate context lengths.

Mac mini M4 16GB

Runs:

+Gemma 3 1B through 4B (all quants)
+Gemma 3 12B (Q4, ~7.5 GB — fits with headroom)

Does not fit:

-Gemma 3 27B (~16 GB, equals full RAM — no headroom)
-Gemma 3 12B Q8 (needs ~13 GB)

Unified memory means the GPU shares the 16 GB pool with the OS rather than having dedicated VRAM. Gemma 3 12B at Q4 uses ~7.5 GB, leaving ~8.5 GB for the OS and KV cache. Silent and efficient. The 27B model nominally matches 16 GB but leaves no headroom, so avoid it.

Mac mini M4 24GB

Runs:

+Gemma 3 1B through 12B (all quants)
+Gemma 3 27B (Q4, ~16 GB — fits with ~8 GB headroom)

Does not fit:

-Gemma 3 27B Q8 (needs ~28 GB)

The sweet spot for Gemma 3 27B on Mac. 8 GB of headroom above the model weights gives comfortable context lengths. Gemma 3 12B at Q8 uses ~13 GB, leaving ~11 GB headroom — excellent quality.

Mac mini M4 Pro 48GB

Runs:

+All Gemma 3 sizes at Q4 and Q8
+Gemma 3 27B Q8 (~28 GB, ~20 GB headroom)

Does not fit:

-Gemma 3 27B FP16 (~55 GB, exceeds 48 GB)

48 GB comfortably fits Gemma 3 27B at Q8 with generous headroom for long context. ~273 GB/s bandwidth is slower than discrete GPUs but the silence, power efficiency, and price are compelling.

Mac Studio M4 Max 64GB

Runs:

+All Gemma 3 sizes at Q4 and Q8
+Gemma 3 27B Q8 with large context windows

Not practical:

-Gemma 3 27B FP16 (~55 GB: loads, but leaves only ~9 GB for the OS and context)

~600 GB/s bandwidth makes Gemma 3 27B Q8 generation fast. Realistically the best Mac for maximum quality Gemma 3 27B inference. FP16 technically fits but context is very limited — stick with Q8.
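
The headroom figures quoted on the cards above are simple subtractions, and it can help to reproduce them for your own device. The sketch below assumes the approximate model footprints used in this guide. The dictionary and function names are hypothetical, and on a Mac the result also has to cover the OS, since memory is shared.

```python
# Approximate model footprints in GB, taken from the figures quoted in this guide.
MODEL_GB = {
    "12B Q4_K_M": 7.5,
    "12B Q8_0": 13.0,
    "27B Q4_K_M": 16.0,
    "27B Q8_0": 28.0,
}


def headroom_gb(device_memory_gb: float, model: str) -> float:
    """Memory left after loading the model; a negative value means no fit."""
    return device_memory_gb - MODEL_GB[model]


examples = [
    ("RTX 4060 8GB", 8, "12B Q4_K_M"),
    ("Intel Arc B580 12GB", 12, "12B Q4_K_M"),
    ("RTX 3090 24GB", 24, "27B Q4_K_M"),
    ("RTX 4090 24GB", 24, "27B Q8_0"),
    ("RTX 5090 32GB", 32, "27B Q8_0"),
]
for device, memory_gb, model in examples:
    left = headroom_gb(memory_gb, model)
    status = "fits" if left >= 0 else "does not fit"
    print(f"{device}: {model} -> {left:+.1f} GB ({status})")
```

Whatever headroom remains is what the KV cache has to live in, which is why the tight configurations come with a recommendation to shorten the context window.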

Inference Speed by Hardware

Token generation speed is bottlenecked by memory bandwidth. The table below shows estimated Q4_K_M token speeds at low batch size. Real-world results vary by driver version, context length, and system load.

Hardware	Bandwidth	4B Q4 tok/s	12B Q4 tok/s	27B Q4 tok/s
RTX 5090 32GB	1,792 GB/s	~240 t/s	~96 t/s	~45 t/s
RTX 4090 24GB	1,008 GB/s	~135 t/s	~54 t/s	~25 t/s
RTX 4070 Ti Super 16GB	672 GB/s	~90 t/s	~36 t/s	~17 t/s
RTX 4080 16GB	720 GB/s	~96 t/s	~38 t/s	~18 t/s
RTX 4060 Ti 16GB	288 GB/s	~38 t/s	~15 t/s	~7 t/s
Intel Arc B580 12GB	456 GB/s	~61 t/s	~24 t/s	—
RTX 4060 8GB	272 GB/s	~36 t/s	~14 t/s	—
Mac Studio M4 Max 64GB	~600 GB/s	~80 t/s	~32 t/s	~15 t/s
Mac mini M4 Pro 48GB	~273 GB/s	~36 t/s	~14 t/s	~7 t/s
Mac mini M4 24GB	~120 GB/s	~16 t/s	~6 t/s	~3 t/s

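Speed estimates: tokens/sec ≈ bandwidth (GB/s) / model size in memory (GB), scaled by an efficiency factor. The table's figures are consistent with a factor of roughly 0.4 applied to Q4_K_M sizes of about 3 GB (4B), 7.5 GB (12B), and 16 GB (27B), with the largest values rounded. Mac bandwidth figures are approximate. A dash in a cell means the model does not fit at that VRAM tier.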
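
The speed table can be approximated with a short calculation. The sketch below assumes an efficiency factor of 0.4 on top of the bandwidth-over-size ratio. That factor is an inference, chosen here because it reproduces the table's entries to within rounding when combined with Q4_K_M sizes of about 3, 7.5, and 16 GB. Treat it as an illustration of the bandwidth-bound estimate, not as a benchmark; the names are hypothetical.

```python
# Assumed efficiency factor, inferred from the table in this guide.
EFFICIENCY = 0.4

# Approximate Q4_K_M sizes in GB used elsewhere in this guide.
Q4_SIZE_GB = {"4B": 3.0, "12B": 7.5, "27B": 16.0}


def estimate_tokens_per_sec(bandwidth_gb_s: float, model: str) -> float:
    """Bandwidth-bound generation speed estimate in tokens per second."""
    return EFFICIENCY * bandwidth_gb_s / Q4_SIZE_GB[model]


for gpu, bandwidth in [("RTX 4090 24GB", 1008), ("RTX 4060 Ti 16GB", 288)]:
    speeds = {m: round(estimate_tokens_per_sec(bandwidth, m)) for m in Q4_SIZE_GB}
    print(gpu, speeds)
```

The estimate scales linearly with bandwidth, which is why the RTX 4060 Ti 16GB and RTX 4080 16GB share a VRAM ceiling but differ so much in generation speed.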

How to Run Gemma 3 Locally

Ollama

ollama run gemma3:27b

Easiest option. Once Ollama is installed, one command downloads and runs the model. Available tags: gemma3:1b, gemma3:4b, gemma3:12b, gemma3:27b. GPU is auto-detected on NVIDIA, AMD (ROCm), and Apple Silicon. Ollama defaults to Q4_K_M.
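
If you would rather call Ollama from a script than from the terminal, the sketch below posts to Ollama's local HTTP API using only the Python standard library. It assumes the Ollama server is running on its default port (11434) and that the model tag has already been pulled. The helper names and the 2048-token default for num_ctx are illustrative choices, not values from this guide.

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str, num_ctx: int = 2048) -> urllib.request.Request:
    """Build a non-streaming generate request with a capped context window."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                   # return one JSON object, not a stream
        "options": {"num_ctx": num_ctx},   # smaller context means a smaller KV cache
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, num_ctx: int = 2048) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    request = build_generate_request(model, prompt, num_ctx)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

A call such as generate("gemma3:12b", "Explain quantization in one sentence.") would then return the model's reply as a string. Lowering num_ctx is one way to apply the short-context advice for tight VRAM fits.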

LM Studio

Search "Gemma 3" in Discover

GUI-based model browser and chat interface. Download GGUF quantizations directly within the app. Best for non-technical users. Runs on Windows, Mac, and Linux. Lets you pick Q4, Q8, or other quants from a dropdown.

Hugging Face + llama.cpp

google/gemma-3-27b-it-GGUF

Download GGUF files from the official google org or community repos (bartowski, unsloth) on Hugging Face. Run with llama.cpp for maximum control over quantization, context length, and GPU layer offloading.
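
As a rough illustration of that control, the sketch below assembles a command line for llama.cpp's llama-cli binary, setting the model file, context length, and number of GPU-offloaded layers. The GGUF file name is hypothetical, the helper name is made up for this example, and the default values are arbitrary starting points. It assumes llama-cli is on your PATH.

```python
import shlex
from typing import List


def llama_cli_command(model_path: str, prompt: str,
                      ctx_size: int = 4096, gpu_layers: int = 99) -> List[str]:
    """Build an argument list for llama.cpp's llama-cli binary."""
    return [
        "llama-cli",
        "-m", model_path,         # path to the downloaded GGUF file
        "-c", str(ctx_size),      # context length in tokens
        "-ngl", str(gpu_layers),  # layers to offload to the GPU
        "-p", prompt,             # prompt text
    ]


# Hypothetical file name; use whichever GGUF file you downloaded.
cmd = llama_cli_command("gemma-3-27b-it-Q4_K_M.gguf", "Hello")
print(shlex.join(cmd))
```

Lowering gpu_layers keeps part of the model in system RAM, which trades generation speed for the ability to load a model that does not fully fit in VRAM. The list can be passed to subprocess.run to launch the process.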

For step-by-step installation instructions, see the how to run LLMs locally guide. For a comparison of Ollama vs LM Studio, see the Ollama vs LM Studio guide.

Gemma 3 vs Qwen3 vs Llama 3.3

All three are leading open-weight model families. Here is how they differ for local hardware use:

Model family	Largest dense size	VRAM for largest	Min GPU for largest	Thinking mode	Architecture
Gemma 3	27B	~16 GB Q4	RTX 4060 Ti 16GB	No	Dense
Qwen3	32B	~17.5 GB Q4	24 GB GPU (e.g. RTX 3090)	Yes (built-in)	Dense + MoE
Llama 3.3	70B	~40 GB Q4	Dual 24GB GPUs	No	Dense

Choose Gemma 3 if...

+You want the best 16GB GPU model
+You prioritize multilingual quality
+You want Google-backed architecture with strong reasoning
+You prefer a simpler dense model (no MoE complexity)

Choose Qwen3 if...

+You want built-in chain-of-thought reasoning
+You want more size options (0.6B to 32B dense)
+You have a 24GB GPU and want the 32B model
+You want the MoE 30B-A3B for fast inference with large weights

Choose Llama 3.3 if...

+You have a large multi-GPU setup (70B needs 40+ GB)
+You want Meta's official weights
+You need broad ecosystem support and fine-tune availability
+The 70B size covers your use case (Llama 3.3 ships only as 70B; the 8B model belongs to Llama 3.1)

Which Hardware Should You Buy for Gemma 3?

Entry budget

RTX 4060 8GB

Runs Gemma 3 4B at Q8 and 12B at Q4 (tight). The 4B model is fast and capable for everyday chat and coding tasks. Best entry point for Gemma 3 on a budget.

Best value

Intel Arc B580 12GB

Cheapest path to a comfortable Gemma 3 12B experience. 12 GB gives ~4.5 GB headroom at Q4. Verify Ollama compatibility before buying — Arc driver support is good but lags NVIDIA.

Mid-range sweet spot

RTX 4060 Ti 16GB

The best-value GPU for Gemma 3 27B. The 27B at Q4_K_M fits exactly in 16 GB — this is the minimum GPU that runs the flagship model. If the 27B is your target, this is the card to buy.

Used market

RTX 3090 24GB

Runs Gemma 3 27B at Q4 with 8 GB headroom for generous context lengths. Cannot fit 27B Q8 (needs 28 GB). Older architecture, but excellent VRAM-per-dollar on the used market.

High end

RTX 4090 24GB

Best single consumer GPU for Gemma 3. 27B at Q4 with 8 GB headroom and 1,008 GB/s bandwidth means fast generation. The go-to choice if budget is flexible.

Mac ecosystem

Mac mini M4 24GB

Best Mac value for Gemma 3 27B. Unified memory gives 8 GB headroom above the model. For Q8 quality on 27B, step up to Mac mini M4 Pro 48GB. Silent, efficient, and runs all four Gemma 3 sizes.

For a full cross-budget GPU comparison, see the best GPU for LLMs guide.

Related Resources

What LLMs Can I Run?

Enter your GPU — see every model that fits

Qwen3 Hardware Requirements

Compare Gemma 3 vs Qwen3 for your GPU

Best GPU for LLMs — Full Guide

All budget tiers from entry to workstation

How to Run Gemma 3 Locally

Step-by-step Ollama setup for Gemma 3 1B through 27B — vision included

How to Run LLMs Locally

Step-by-step Ollama, LM Studio, llama.cpp setup

Ollama vs LM Studio

Which tool to use for running Gemma 3 locally

LLM Quantization Explained

Q4 vs Q8 vs FP16 — when trade-offs matter

Apple Silicon for LLMs

M4, M4 Pro, M4 Max — which Mac for Gemma 3?

Best LLM for Coding Locally

Gemma 3 12B and Qwen3 are top coding picks — comparison by GPU tier

VRAM Calculator

Calculate exact VRAM at your context length

Frequently Asked Questions

What GPU do I need to run Gemma 3 27B?

Gemma 3 27B at Q4_K_M requires approximately 16 GB of VRAM, so 16 GB is the minimum. GPUs that fit it include the RTX 4060 Ti 16GB, RTX 4070 Ti Super 16GB, RTX 4080 16GB, RTX 3090 24GB, RTX 4090 24GB, and AMD RX 7900 XTX 24GB. On Mac, the Mac mini M4 24GB runs it comfortably. The RTX 4060 Ti 16GB is the best-value option, but keep context lengths moderate to avoid OOM.

Can an 8GB GPU run Gemma 3 12B?

Yes, but tightly. Gemma 3 12B at Q4_K_M needs approximately 7.5 GB, which fits on an RTX 4060 8GB with about 0.5 GB of headroom. Keep context windows short to stay stable. At Q8 the 12B model needs ~13 GB and will not fit on 8GB. For a more comfortable 12B experience, the Intel Arc B580 12GB gives ~4.5 GB of headroom at Q4.

How does Gemma 3 compare to Qwen3 quality?

Both are strong 2025 model families. Gemma 3 27B is one of the best 27B models for reasoning and multilingual tasks, fitting in 16 GB VRAM. Qwen3 offers built-in chain-of-thought thinking mode and a wider size range (0.6B to 32B). Qwen3-32B at Q4 edges ahead on complex reasoning but requires a 24GB GPU. For 16GB GPUs, Gemma 3 27B is the flagship choice.

How do I run Gemma 3 with Ollama?

Run: ollama run gemma3:27b — replace "27b" with your target size (1b, 4b, 12b, or 27b). Ollama auto-detects NVIDIA, AMD, and Apple Silicon GPUs and downloads Q4_K_M by default. For an 8GB GPU, use gemma3:4b or gemma3:12b with a short context window.

Can I run Gemma 3 on a Mac?

Yes. All Gemma 3 models run on Apple Silicon Macs via Ollama or LM Studio. Unified memory lets the GPU draw on system RAM, minus what the OS needs. Mac mini M4 16GB runs Gemma 3 1B, 4B, and 12B. Mac mini M4 24GB adds Gemma 3 27B at Q4 with ~8 GB headroom. Mac mini M4 Pro 48GB runs the 27B at Q8 quality.

What is the best quantization for Gemma 3?

Q4_K_M is the recommended default — it halves VRAM usage versus FP16 with minimal quality loss. Q8_0 offers near-FP16 quality but doubles VRAM requirements. For the 27B model: Q4 fits in 16 GB, Q8 needs ~28 GB. Use Q4 unless you have substantial VRAM headroom and want maximum output quality.

Sources & methodology

VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide I leaned on:

Hugging Face Hub. Google's official Gemma 3 model cards for 1B, 4B, 12B and 27B parameter counts.
Ollama. Tested Gemma 3 GGUF quants (Q4_K_M, Q6_K, Q8_0) pulled from the Ollama library.
Modal: How much VRAM do I need for LLM inference. VRAM formula used to size each Gemma variant against consumer GPUs.

Spot a number that does not match the linked source? Email [email protected] and I will update the guide.