LLM Quantization Explained: Q4, Q8, FP16 and What Each Costs in VRAM (2026)
AI helped explain the quant math in plain English. Every perplexity claim and byte-per-parameter number was cross-checked against the K-quants discussion thread and the llama.cpp source.
Updated May 2026 · Applies to Ollama, llama.cpp, and all GGUF-format models
Quantization is the single most important concept for running LLMs on consumer hardware. It determines how much VRAM a model uses, how fast it runs, and how much quality you trade away. This guide explains every quantization level, how to calculate VRAM requirements, and which format to choose based on your GPU.
Quick reference
- F16: 2 bytes per parameter, full quality, baseline for most models
- Q8_0: 1 byte per parameter, near-identical quality to F16, half the VRAM
- Q4_K_M: ~0.5 bytes per parameter, 3-5% quality loss, the most popular choice
- VRAM formula: params (billions) x bytes per param x 1.2 (overhead)
- File format: GGUF (.gguf) is the standard for quantized models
What Is Quantization?
Every LLM is made up of billions of numerical weights. At training time these weights are stored as 32-bit or 16-bit floating point numbers, which are highly precise but also large. Quantization compresses these weights into smaller data types, reducing the memory required to store and run the model.
A 7B model in F16 stores 7 billion numbers at 2 bytes each, totalling 14 GB of raw weight data. The same model quantized to Q4 stores those same 7 billion numbers at roughly 0.5 bytes each, totalling about 3.5 GB. The model fits in a quarter of the VRAM.
The tradeoff is precision loss. Compressing a 16-bit float into 4 bits means rounding, which introduces small errors throughout the network. Modern quantization methods like K-quants (the K_M and K_S variants) use smarter rounding strategies that minimize this error, which is why Q4_K_M quality is much better than naive 4-bit quantization.
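The rounding described above can be made concrete with a toy example. The sketch below is a simplified symmetric 4-bit block quantizer written for illustration only; it is not the K-quant scheme used by llama.cpp, and the function names and sample weights are assumptions. It shows the basic idea that a block of weights shares one scale, and that each weight is stored as a small integer code.

```python
def quantize_block(values, bits=4):
    """Map a block of floats to signed integer codes that share one scale.

    Toy symmetric scheme (an assumption for illustration), not llama.cpp's K-quants.
    """
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit codes
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0    # one scale per block
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the integer codes."""
    return [scale * c for c in codes]


# Made-up weights standing in for one block of a real tensor.
weights = [0.12, -0.53, 0.31, 0.07, -0.88, 0.44, -0.02, 0.65]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("codes:", codes)
print(f"scale: {scale:.4f}, largest rounding error: {max_error:.4f}")
```

The largest error is bounded by half the scale, which is why schemes that use smaller blocks or better-chosen scales lose less quality at the same bit width. The stored scale also explains why real quantized files come out somewhat larger than the nominal bits per weight would suggest.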
Key insight
Quantization reduces VRAM usage but does not change the model's architecture or parameter count. A quantized 7B model still has 7B parameters, performs inference the same way, and is the same model. Only the precision of the stored numbers changes.
Quantization Levels Explained
Here is every quantization level you will encounter, from highest precision to lowest:
| Format | Bits | Size per Param | VRAM vs F16 | Quality | Best For |
|---|---|---|---|---|---|
| F32 | 32-bit | 4 bytes | 2x | Lossless | Training only |
| F16 | 16-bit | 2 bytes | 1x | Lossless baseline | Fine-tuning, serving baseline |
| Q8_0 | 8-bit | 1 byte | 0.5x | Near-lossless | Coding, reasoning, precision tasks |
| Q4_K_M | 4-bit | ~0.5 bytes | ~0.25x | ~3-5% quality loss | Daily use, chat, writing (recommended) |
| Q4_K_S | 4-bit | ~0.45 bytes | ~0.23x | Slightly below K_M | When K_M barely does not fit |
| Q2_K | 2-bit | ~0.25 bytes | ~0.13x | Significant degradation | Extreme low VRAM, last resort |
F32 (32-bit float)
Full precision, 4 bytes per parameter. Used during training to maintain numerical stability. Almost never used for inference because it doubles VRAM compared to F16 with no quality benefit at inference time. You will rarely encounter F32 models for download.
F16 (16-bit float, half precision)
The standard baseline, 2 bytes per parameter. Most models are released or converted to F16 as the starting point. F16 is lossless relative to the trained model and is required for fine-tuning. For pure inference it is often unnecessary unless you are benchmarking quality or plan to fine-tune the model afterward.
Q8_0 (8-bit integer)
One byte per parameter, nearly identical quality to F16. The perplexity difference is statistically negligible for most models. Q8 is the right choice when you have enough VRAM and want maximum quality without going to F16. Particularly valuable for coding, math, and reasoning tasks where precision matters. Use Q8 for 3B models on 8 GB GPUs, or 7B models on 12 GB+ GPUs.
Q4_K_M (4-bit, medium K-quant)
The most widely used quantization for consumer hardware. Roughly 0.5 bytes per parameter, about 3-5% worse than F16 on perplexity benchmarks. In practice this usually translates to very minor wording differences rather than factual or reasoning errors. The K_M variant quantizes weights in small blocks that each carry their own scale, and it keeps some of the more sensitive tensors at higher precision, delivering significantly better quality than a simple 4-bit approach. This is the default recommendation for almost everyone.
Q4_K_S (4-bit, small K-quant)
Slightly smaller than Q4_K_M with marginally lower quality. The difference between K_S and K_M is small but measurable. Only choose K_S if Q4_K_M barely misses fitting in your VRAM, since the quality tradeoff is rarely worth it otherwise.
IQ variants (IQ4_XS, IQ3_M, IQ2_M)
Importance-based quantization. These variants rely on importance scores for the model's weights, typically estimated by running calibration text through the model, to allocate more precision where it matters most. An IQ4_XS model is smaller than Q4_K_S but has comparable or better quality thanks to the non-uniform bit allocation. If you see IQ variants on Hugging Face, prefer them over standard Q4_K_S when VRAM is tight.
Q2_K (2-bit)
Severe quality degradation. Only useful when a model absolutely cannot fit at Q4 and you still want to try loading it. Coherence and factual accuracy both drop noticeably. Treat Q2 as a last resort, not a routine choice. At this level you are often better off switching to a smaller model at Q4 instead.
VRAM Calculation Formula
Calculating how much VRAM a model needs is straightforward once you know the bytes-per-parameter for each quantization format.
Formula
VRAM = params_billions x bytes_per_param x 1.2
The 1.2 multiplier accounts for the KV cache, activation memory, and runtime overhead. For long context windows or batch inference, add more buffer.
| Model | Quantization | Calculation | VRAM Needed |
|---|---|---|---|
| Llama 3.1 8B | F16 | 8B x 2 bytes x 1.2 | ~19 GB |
| Llama 3.1 8B | Q8_0 | 8B x 1 byte x 1.2 | ~10 GB |
| Llama 3.1 8B | Q4_K_M | 8B x 0.5 bytes x 1.2 | ~5 GB |
| Llama 3.1 70B | Q4_K_M | 70B x 0.5 bytes x 1.2 | ~42 GB |
| Mistral 7B | Q4_K_M | 7B x 0.5 bytes x 1.2 | ~4 GB |
| Gemma 2 27B | Q4_K_M | 27B x 0.5 bytes x 1.2 | ~16 GB |
These are estimates. Actual VRAM use varies by model architecture, context length, and inference backend. Add 15-20% buffer when you are near the limit of your card.
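The formula above can be sketched as a small helper. This is a minimal illustration using the rounded bytes-per-parameter figures from this guide; the function and dictionary names are assumptions, and the result is an estimate rather than a measurement.

```python
# Rounded bytes-per-parameter values taken from this guide's tables.
BYTES_PER_PARAM = {
    "F32": 4.0,
    "F16": 2.0,
    "Q8_0": 1.0,
    "Q4_K_M": 0.5,
    "Q4_K_S": 0.45,
    "Q2_K": 0.25,
}


def estimate_vram_gb(params_billions, quant, overhead=1.2):
    """Estimate VRAM in GB: params (billions) x bytes per param x overhead."""
    return params_billions * BYTES_PER_PARAM[quant] * overhead


for quant in ("F16", "Q8_0", "Q4_K_M"):
    print(f"8B at {quant}: ~{estimate_vram_gb(8, quant):.1f} GB")
```

Raising the `overhead` argument is a simple way to model longer context windows or batch inference, where the KV cache takes a larger share of memory.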
Quality vs Size Tradeoff
Perplexity is the standard metric for measuring quantization quality loss. Lower perplexity means better language modeling. The numbers below are representative percentages based on published benchmarks for typical 7B-13B models. Your specific model may vary.
| Format | Perplexity vs F16 | Practical Impact | Verdict |
|---|---|---|---|
| F16 | Baseline (0%) | Reference quality | Use for fine-tuning |
| Q8_0 | +0.1-0.5% | Imperceptible in practice | Best quality for inference |
| Q4_K_M | +3-5% | Very minor wording differences | Recommended for most users |
| Q4_K_S | +4-6% | Slightly below K_M | Use only when K_M does not fit |
| IQ4_XS | +3-4% | Better than Q4_K_S at same size | Best small 4-bit option |
| Q2_K | +15-25% | Noticeable coherence loss | Last resort only |
Perplexity differences above 5% become noticeable in open-ended generation tasks. Below 5% the differences are usually undetectable without side-by-side comparison.
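For readers who want to see where percentages like these come from, here is a minimal sketch of the perplexity calculation. Perplexity is the exponential of the negative mean log-probability the model assigns to each token of a test text. The log-probability values below are made up for illustration and do not come from any real model; the function names are assumptions.

```python
import math


def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_increase_pct(ppl_quant, ppl_base):
    """Percentage by which the quantized perplexity exceeds the baseline."""
    return (ppl_quant - ppl_base) / ppl_base * 100.0


# Hypothetical per-token log-probabilities for the same text.
f16_logprobs = [-1.20, -0.35, -2.10, -0.80]
q4_logprobs = [-1.25, -0.37, -2.16, -0.83]

ppl_f16 = perplexity(f16_logprobs)
ppl_q4 = perplexity(q4_logprobs)
print(f"F16 perplexity: {ppl_f16:.3f}")
print(f"Q4 perplexity:  {ppl_q4:.3f}")
print(f"increase: {relative_increase_pct(ppl_q4, ppl_f16):.1f}%")
```

A real comparison would average over many thousands of tokens from the same evaluation text for both files, since a handful of tokens is far too noisy to rank quantization levels.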
Which Quantization Should You Use?
For daily chat and writing, use Q4_K_M: it gives the best VRAM efficiency with imperceptible quality loss. For coding and reasoning tasks, use Q8_0 for near-lossless precision. For fine-tuning, use F16. When in doubt, Q4_K_M is the right default for almost every use case.
The right quantization also depends on how much VRAM your GPU has and which model size you want to run. Use this table as your starting point:
| GPU VRAM | Q4_K_M fits | Q8_0 fits | F16 fits | Notes |
|---|---|---|---|---|
| 4 GB | Up to 3B | 3B (tight) | Under 2B only | Very limited, Q4 only for 3B models |
| 8 GB | 7B models | 3B models | Under 4B | RTX 3070, RTX 4060, many laptops |
| 12 GB | 13B models | 7B models | Under 5B | RTX 3060 12 GB, Intel Arc B580 |
| 24 GB | 34B models | 13B models | Under 10B | RTX 3090, RTX 4090 |
| 48 GB | 70B models | 34B models | Under 20B | RTX 6000 Ada 48 GB, dual 24 GB |
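The lookup in the table above can also be sketched in code. The helper below is one possible selection policy for inference, not an official rule: it walks from higher to lower precision and returns the first format whose estimated footprint fits. The preference order, names, and rounded bytes-per-parameter values are assumptions drawn from this guide.

```python
# Rounded bytes-per-parameter values from this guide (inference formats only).
BYTES_PER_PARAM = {"Q8_0": 1.0, "Q4_K_M": 0.5, "Q4_K_S": 0.45, "Q2_K": 0.25}

# Highest precision first; Q2_K is kept as the last resort.
PREFERENCE = ["Q8_0", "Q4_K_M", "Q4_K_S", "Q2_K"]


def pick_quant(params_billions, vram_gb, overhead=1.2):
    """Return (format, estimated GB) for the first format that fits, else None."""
    for quant in PREFERENCE:
        needed = params_billions * BYTES_PER_PARAM[quant] * overhead
        if needed <= vram_gb:
            return quant, needed
    return None


for size, vram in [(7, 8), (7, 12), (13, 12), (70, 24)]:
    print(f"{size}B on {vram} GB:", pick_quant(size, vram))
```

When the helper only returns Q2_K, the advice from earlier in this guide applies: a smaller model at Q4_K_M is often the better choice.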
- Daily chat and writing: Q4_K_M. Best VRAM efficiency with imperceptible quality loss for conversational tasks.
- Coding and reasoning: Q8_0. Near-lossless precision helps with code generation, math, and multi-step reasoning.
- Fine-tuning: F16. Training needs higher-precision weights so that rounding errors in the gradients do not accumulate across steps.
GGUF Format Explained
GGUF (GPT-Generated Unified Format) is the standard container format for quantized LLM weights. It replaced the older GGML format and is supported natively by both Ollama and llama.cpp. When you download a quantized model from Hugging Face, it will almost always be a .gguf file.
- Self-contained: a single .gguf file includes weights, tokenizer vocabulary, model metadata, and quantization parameters. No separate config files needed.
- Cross-platform: the same .gguf file runs on Windows, Linux, and macOS without conversion. Works with CUDA, Metal, ROCm, and CPU backends.
- Fast loading: GGUF supports memory-mapped loading, meaning the OS pages in only the parts of the model actively used. Large models start faster than with other formats.
- Versioned format: GGUF includes a format version number in its header. Newer quantization methods like IQ4_XS are added as new tensor types rather than as a new container format, so existing files keep loading.
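To illustrate the versioned, self-describing header, the sketch below reads the first fields of a GGUF file: the 4-byte magic `GGUF`, a 32-bit version, and two 64-bit counts for tensors and metadata entries. It assumes a little-endian file with format version 2 or later, and it writes a tiny synthetic header to a temporary file so the example runs without downloading a model. The function name is an assumption.

```python
import os
import struct
import tempfile


def read_gguf_header(path):
    """Read magic, version, tensor count and metadata count from a GGUF file.

    Assumes a little-endian file with format version 2 or later.
    """
    with open(path, "rb") as f:
        raw = f.read(24)
    if len(raw) < 24 or raw[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:24])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Demo on a synthetic header (made-up counts), not a real model file.
fd, demo_path = tempfile.mkstemp(suffix=".gguf")
with os.fdopen(fd, "wb") as f:
    f.write(b"GGUF" + struct.pack("<IQQ", 3, 291, 24))
demo_header = read_gguf_header(demo_path)
os.remove(demo_path)
print(demo_header)
```

Pointing `read_gguf_header` at a downloaded .gguf file gives a quick sanity check that the download is a GGUF container before loading it in Ollama or llama.cpp.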
GGUF files are named using a consistent convention: the model name, parameter count, quantization type, and sometimes a version suffix. For example, llama-3.1-8b-instruct.Q4_K_M.gguf is the Llama 3.1 8B Instruct model at Q4_K_M quantization.
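Because the quantization type sits just before the extension, it can be pulled out of a filename with a short pattern. This is a rough sketch: uploaders do not all follow the convention exactly, so the regular expression and the function name are assumptions that cover only the common `.Q4_K_M.gguf` and `-q8_0.gguf` style endings.

```python
import re

# Matches a quant tag such as Q4_K_M, Q8_0, IQ4_XS, F16 right before ".gguf".
QUANT_PATTERN = re.compile(
    r"[.-]((?:IQ|Q)\d+(?:_[A-Z0-9]+)*|BF16|F16|F32)\.gguf$",
    re.IGNORECASE,
)


def quant_from_filename(filename):
    """Return the quantization tag in upper case, or None if none is found."""
    match = QUANT_PATTERN.search(filename)
    return match.group(1).upper() if match else None


for name in [
    "llama-3.1-8b-instruct.Q4_K_M.gguf",
    "model-q8_0.gguf",
    "model-f16.gguf",
    "notes.txt",
]:
    print(name, "->", quant_from_filename(name))
```

For a reliable answer, the metadata stored inside the file is the better source than the filename.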
Using Quantized Models with Ollama
Ollama publishes each model under several tags, and each tag maps to one specific quantization. The default tag points to a quantization chosen for that library entry (typically a 4-bit variant); it is not selected based on your available VRAM. To use a different quantization level, specify it explicitly in the tag.
Default tag (quantization fixed by the library entry):
ollama run llama3.1:8b
Request Q4 quantization explicitly:
ollama run llama3.1:8b-instruct-q4_0
Request Q8 quantization explicitly:
ollama run llama3.1:8b-instruct-q8_0
List the models and tags already downloaded to your machine:
ollama list
Ollama's model library at ollama.com/library shows all available tags and their quantization levels. Not all models have all quantization variants available.
Using Quantized Models with llama.cpp
llama.cpp gives you more direct control over quantization and GPU offloading. You download GGUF files manually from Hugging Face and load them with the llama-cli or llama-server binaries.
Run a GGUF model with all layers offloaded to GPU:
./llama-cli --model llama-3.1-8b.Q4_K_M.gguf -ngl 99 -p "Hello"
Run with partial GPU offload (24 layers on GPU, rest on CPU):
./llama-cli --model llama-3.1-8b.Q4_K_M.gguf -ngl 24 -p "Hello"
Start llama-server for API access:
./llama-server --model llama-3.1-8b.Q4_K_M.gguf -ngl 99 --port 8080
Quantize a model yourself (from F16 to Q4_K_M):
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
The -ngl flag controls how many transformer layers are offloaded to the GPU. Setting it to 99 offloads all layers. Reduce it if you run out of VRAM, letting the remaining layers run on CPU at reduced speed.
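A starting value for -ngl can be estimated before trial and error. The sketch below is a rough heuristic and not something llama.cpp computes for you: it assumes every layer takes an equal share of the file size, ignores the embedding and output tensors, and reserves a fixed amount of VRAM for the KV cache and buffers. The function name and the reserve figure are assumptions; the 32-layer count matches Llama 3.1 8B.

```python
def layers_that_fit(model_size_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Estimate how many layers fit on the GPU.

    Heuristic only: treats all layers as equally sized and keeps
    reserve_gb of VRAM free for the KV cache and runtime buffers.
    """
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Example: a 4 GB file with 32 layers on GPUs of different sizes.
for vram in (4.0, 6.0, 8.0):
    ngl = layers_that_fit(4.0, 32, vram, reserve_gb=1.0)
    print(f"{vram:.0f} GB VRAM -> try -ngl {ngl}")
```

Treat the result as a first guess: if loading fails or VRAM usage is close to the limit, lower the value by a few layers and try again.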
Frequently Asked Questions
What is the best quantization for everyday LLM use?
Q4_K_M is the best quantization for everyday use. It needs roughly a quarter of the VRAM of F16 with only a 3-5% increase in perplexity on most benchmarks. It is the default choice for chat, writing, and general tasks on consumer hardware.
How much VRAM does a 7B model need at Q4?
A 7B model at Q4_K_M needs approximately 4-5 GB of VRAM. The formula is: parameters in billions times 0.5 bytes per parameter, plus about 20% overhead. So 7B x 0.5 = 3.5 GB plus overhead gives roughly 4-5 GB total.
Is Q8 noticeably better than Q4?
For most tasks, the difference is subtle. Q8_0 is nearly identical to F16 in quality. Q4_K_M is about 3-5% worse on perplexity benchmarks, which translates to occasional wording differences rather than factual errors. For coding and reasoning tasks where precision matters, Q8 is worth the extra VRAM if you have it.
What is a GGUF file?
GGUF (GPT-Generated Unified Format) is the standard file format for quantized LLM weights. It replaced the older GGML format and is used by both Ollama and llama.cpp. GGUF files bundle the model weights, tokenizer, and metadata into a single portable file. Most quantized models on Hugging Face are distributed as GGUF files.
Can I run a 70B model on a single GPU?
Yes, but only at Q4 quantization on very high-VRAM cards. A 70B model at Q4_K_M needs roughly 35 GB for the weights alone and about 42 GB once the 20% overhead is included. That fits on a single NVIDIA A100 80 GB or H100 80 GB, or on one 48 GB GPU like the RTX 6000 Ada. With consumer cards, you need two 24 GB GPUs (like two RTX 3090s or 4090s).
Sources & methodology
Behaviour, file-format and runtime details on this page are pulled from primary upstream docs and community benchmark threads. The full sitewide methodology lives on the methodology page. For this guide I relied most on:
- llama.cpp. Source-of-truth definitions for Q4_K_M, Q5_K_M, Q6_K, Q8_0 and the IQ variants.
- Hugging Face Hub. Hub examples of identical models in multiple quants, used for the file-size comparisons.
- Modal: How much VRAM do I need for LLM inference. VRAM math that shows how each quant shrinks the model footprint in practice.
Spot a number that does not match the linked source? Email [email protected] and I will update the guide.