LLM Quantization Explained: Q4, Q8, FP16 and What Each Costs in VRAM (2026)

AI helped explain the quant math in plain English. Every perplexity claim and byte-per-parameter number was cross-checked against the K-quants discussion thread and the llama.cpp source.

Updated May 2026 · Applies to Ollama, llama.cpp, and all GGUF-format models

Quantization is the single most important concept for running LLMs on consumer hardware. It determines how much VRAM a model uses, how fast it runs, and how much quality you trade away. This guide explains every quantization level, how to calculate VRAM requirements, and which format to choose based on your GPU.

What Is Quantization?

Every LLM is made up of billions of numerical weights. At training time these weights are stored as 32-bit or 16-bit floating point numbers, which are highly precise but also large. Quantization compresses these weights into smaller data types, reducing the memory required to store and run the model.

A 7B model in F16 stores 7 billion numbers at 2 bytes each, totalling 14 GB of raw weight data. The same model quantized to Q4 stores those same 7 billion numbers at roughly 0.5 bytes each, totalling about 3.5 GB. The model fits in a quarter of the VRAM.

The tradeoff is precision loss. Compressing a 16-bit float into 4 bits means rounding, which introduces small errors throughout the network. Modern quantization methods like K-quants (the K_M and K_S variants) use smarter rounding strategies that minimize this error, which is why Q4_K_M quality is much better than naive 4-bit quantization.
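To make the rounding error concrete, here is a minimal sketch of the naive approach that K-quants improve on: one scale per block, round each weight to the nearest 4-bit integer. The block of eight weights and the function names are made up for illustration. This is not how llama.cpp implements any of its formats.

```python
def quantize_block_4bit(weights):
    """Symmetric round-to-nearest 4-bit quantization of one block of floats."""
    # Map the largest magnitude in the block to +/-7; fall back to 1.0 for an all-zero block.
    scale = max(abs(w) for w in weights) / 7 or 1.0
    # Signed 4-bit integers cover -8..7, so clamp after rounding.
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the stored scale and 4-bit codes."""
    return [scale * c for c in codes]


# Hypothetical weights, chosen only to show the effect.
block = [0.12, -0.53, 0.07, 0.91, -0.33, 0.002, 0.48, -0.76]
scale, codes = quantize_block_4bit(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print("codes:", codes)
print(f"scale: {scale:.4f}, max rounding error: {max_error:.4f}")
```

The worst-case error per weight is half the scale, so a single outlier in a block makes every other weight in that block coarser. Smarter grouping and scale selection exist largely to limit that effect.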

Key insight

Quantization reduces VRAM usage but does not change the model's architecture or parameter count. A quantized 7B model still has 7B parameters, performs inference the same way, and is the same model. Only the precision of the stored numbers changes.

Quantization Levels Explained

Here is every quantization level you will encounter, from highest precision to lowest:

| Format | Bits | Size per param | VRAM vs F16 | Quality | Best for |
| --- | --- | --- | --- | --- | --- |
| F32 | 32-bit | 4 bytes | 2x | Lossless | Training only |
| F16 | 16-bit | 2 bytes | 1x (baseline) | Lossless baseline | Fine-tuning, serving baseline |
| Q8_0 | 8-bit | 1 byte | 0.5x | Near-lossless | Coding, reasoning, precision tasks |
| Q4_K_M | 4-bit | ~0.5 bytes | ~0.25x | ~3-5% quality loss | Daily use, chat, writing (recommended) |
| Q4_K_S | 4-bit | ~0.45 bytes | ~0.23x | Slightly below K_M | When K_M barely does not fit |
| Q2_K | 2-bit | ~0.25 bytes | ~0.13x | Significant degradation | Extreme low VRAM, last resort |

F32 (32-bit float)

Full precision, 4 bytes per parameter. Used during training to maintain numerical stability. Almost never used for inference because it doubles VRAM compared to F16 with no quality benefit at inference time. You will rarely encounter F32 models for download.

F16 (16-bit float, half precision)

The standard baseline, 2 bytes per parameter. Most models are released or converted to F16 as the starting point. F16 is lossless relative to the trained model and is required for fine-tuning. For pure inference it is often unnecessary unless you are benchmarking quality or plan to fine-tune the model afterward.

Q8_0 (8-bit integer)

One byte per parameter, nearly identical quality to F16. The perplexity difference is statistically negligible for most models. Q8 is the right choice when you have enough VRAM and want maximum quality without going to F16. Particularly valuable for coding, math, and reasoning tasks where precision matters. Use Q8 for 3B models on 8 GB GPUs, or 7B models on 12 GB+ GPUs.

Q4_K_M (4-bit, medium K-quant)

The most widely used quantization for consumer hardware. Roughly 0.5 bytes per parameter, about 3-5% worse than F16 on perplexity benchmarks. In practice this translates to very minor wording differences rather than factual or reasoning errors. The K_M variant uses a smarter quantization grouping that targets the most sensitive weight clusters, delivering significantly better quality than a simple 4-bit approach. This is the default recommendation for almost everyone.

Q4_K_S (4-bit, small K-quant)

Slightly smaller than Q4_K_M with marginally lower quality. The difference between K_S and K_M is small but measurable. Only choose K_S if Q4_K_M barely misses fitting in your VRAM, since the quality tradeoff is rarely worth it otherwise.

IQ variants (IQ4_XS, IQ3_M, IQ2_M)

Often called i-quants. These files are typically built with an importance matrix computed from calibration text, which tells the quantizer which weights matter most so that precision is spent where it counts. An IQ4_XS model is smaller than Q4_K_S but has comparable or better quality. If you see IQ variants on Hugging Face, prefer them over standard Q4_K_S when VRAM is tight.

Q2_K (2-bit)

Severe quality degradation. Only useful when a model absolutely cannot fit at Q4 and you still want to try loading it. Coherence and factual accuracy both drop noticeably. Treat Q2 as a last resort, not a routine choice. At this level you are often better off switching to a smaller model at Q4 instead.

VRAM Calculation Formula

Calculating how much VRAM a model needs is straightforward once you know the bytes-per-parameter for each quantization format.

Formula

VRAM = params_billions x bytes_per_param x 1.2

The 1.2 multiplier accounts for the KV cache, activation memory, and runtime overhead. For long context windows or batch inference, add more buffer.
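The formula is easy to turn into a small helper. The sketch below uses the approximate bytes-per-parameter figures from this guide; the function name and the dictionary are illustrative, not part of any tool.

```python
# Approximate bytes per parameter, taken from the table in this guide.
BYTES_PER_PARAM = {
    "F32": 4.0,
    "F16": 2.0,
    "Q8_0": 1.0,
    "Q4_K_M": 0.5,
    "Q4_K_S": 0.45,
    "Q2_K": 0.25,
}


def estimate_vram_gb(params_billions, quant, overhead=1.2):
    """VRAM estimate in GB: parameters (billions) x bytes per parameter x overhead."""
    return params_billions * BYTES_PER_PARAM[quant] * overhead


for quant in ("F16", "Q8_0", "Q4_K_M"):
    print(f"8B at {quant}: ~{estimate_vram_gb(8, quant):.1f} GB")
```

Raise `overhead` above 1.2 if you plan to use long context windows, since the KV cache grows with context length.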

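| Model | Quantization | Weights only | VRAM needed (weights x 1.2) |
| --- | --- | --- | --- |
| Llama 3.1 8B | F16 | 8B x 2 bytes = 16 GB | ~19 GB |
| Llama 3.1 8B | Q8_0 | 8B x 1 byte = 8 GB | ~10 GB |
| Llama 3.1 8B | Q4_K_M | 8B x 0.5 bytes = 4 GB | ~5 GB |
| Llama 3.1 70B | Q4_K_M | 70B x 0.5 bytes = 35 GB | ~42 GB |
| Mistral 7B | Q4_K_M | 7B x 0.5 bytes = 3.5 GB | ~4 GB |
| Gemma 2 27B | Q4_K_M | 27B x 0.5 bytes = 13.5 GB | ~16 GB |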

These are estimates. Actual VRAM use varies by model architecture, context length, and inference backend. Add 15-20% buffer when you are near the limit of your card.

Quality vs Size Tradeoff

Perplexity is the standard metric for measuring quantization quality loss. Lower perplexity means better language modeling. The numbers below are representative percentages based on published benchmarks for typical 7B-13B models. Your specific model may vary.

| Format | Perplexity vs F16 | Practical impact | Verdict |
| --- | --- | --- | --- |
| F16 | Baseline (0%) | Reference quality | Use for fine-tuning |
| Q8_0 | +0.1-0.5% | Imperceptible in practice | Best quality for inference |
| Q4_K_M | +3-5% | Very minor wording differences | Recommended for most users |
| Q4_K_S | +4-6% | Slightly below K_M | Use only when K_M does not fit |
| IQ4_XS | +3-4% | Better than Q4_K_S at same size | Best small 4-bit option |
| Q2_K | +15-25% | Noticeable coherence loss | Last resort only |

Perplexity differences above 5% become noticeable in open-ended generation tasks. Below 5% the differences are usually undetectable without side-by-side comparison.
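For readers who want to see where a percentage like "+4%" comes from, here is a minimal sketch. Perplexity is the exponential of the average negative log-likelihood per token. The log-probability lists below are invented for illustration; in practice they come from running both model files over the same evaluation text.

```python
import math


def perplexity(token_logprobs):
    """exp(mean negative log-likelihood) over natural-log token probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def percent_increase(ppl_quant, ppl_base):
    """Relative perplexity increase of a quantized model over its baseline, in percent."""
    return 100.0 * (ppl_quant - ppl_base) / ppl_base


# Hypothetical per-token log-probabilities for the same text under two model files.
f16_logprobs = [-1.20, -0.40, -2.10, -0.90]
q4_logprobs = [-1.25, -0.42, -2.20, -0.93]

ppl_f16 = perplexity(f16_logprobs)
ppl_q4 = perplexity(q4_logprobs)
print(f"F16: {ppl_f16:.3f}  Q4: {ppl_q4:.3f}  change: {percent_increase(ppl_q4, ppl_f16):+.1f}%")
```

Both models must be scored on the same text with the same context length, otherwise the percentages are not comparable.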

Which Quantization Should You Use?

For daily chat and writing, use Q4_K_M: it gives the best VRAM efficiency, and the quality loss is hard to notice in conversation. For coding and reasoning tasks, use Q8_0 for near-lossless precision. For fine-tuning, use F16. When in doubt, Q4_K_M is the right default for almost every use case.

The right quantization also depends on how much VRAM your GPU has and which model size you want to run. Use this table as your starting point:

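| GPU VRAM | Q4_K_M fits | Q8_0 fits | F16 fits | Notes |
| --- | --- | --- | --- | --- |
| 4 GB | Up to 3B | 3B (tight) | Under 2B only | Very limited; prefer Q4 for 3B models |
| 8 GB | 7B models | 3B models | Up to 3B | RTX 3070, RTX 4060, many laptops |
| 12 GB | 13B models | 7B models | Up to ~5B | RTX 3060 12 GB, Intel Arc B580 |
| 24 GB | 34B models | 13B models | Up to ~10B | RTX 3090, RTX 4090 |
| 48 GB | 70B models | 34B models | Up to ~20B | RTX 6000 Ada, dual 24 GB cards |

The F16 column follows the formula above: at 2 bytes per parameter plus 20% overhead, a 7B model needs about 17 GB and a 13B model about 31 GB.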
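The same lookup can be done in code. The sketch below walks the formats from highest to lowest precision and returns the first one whose estimate fits, using this guide's approximate bytes-per-parameter values and the 1.2 overhead factor. The function name is illustrative, and the result is only as good as the estimate behind it.

```python
# Ordered from highest to lowest precision; bytes per parameter are approximate.
QUANTS = [
    ("F16", 2.0),
    ("Q8_0", 1.0),
    ("Q4_K_M", 0.5),
    ("Q4_K_S", 0.45),
    ("Q2_K", 0.25),
]


def best_quant_that_fits(params_billions, vram_gb, overhead=1.2):
    """Return the highest-precision format whose estimated VRAM fits, or None."""
    for name, bytes_per_param in QUANTS:
        if params_billions * bytes_per_param * overhead <= vram_gb:
            return name
    return None


for params, vram in [(7, 8), (7, 12), (70, 24), (70, 16)]:
    print(f"{params}B on {vram} GB: {best_quant_that_fits(params, vram)}")
```

A result of Q2_K or None is usually a signal to pick a smaller model rather than to accept the degradation.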

Daily chat and writing: Q4_K_M. Best VRAM efficiency with imperceptible quality loss for conversational tasks.

Coding and reasoning: Q8_0. Near-lossless precision helps with code generation, math, and multi-step reasoning.

Fine-tuning: F16. Training needs unquantized weights so that gradient errors do not accumulate across steps.

GGUF Format Explained

GGUF (GPT-Generated Unified Format) is the standard container format for quantized LLM weights. It replaced the older GGML format and is supported natively by both Ollama and llama.cpp. When you download a quantized model from Hugging Face, it will almost always be a .gguf file.

Self-contained

A single .gguf file includes weights, tokenizer vocabulary, model metadata, and quantization parameters. No separate config files needed.

Cross-platform

The same .gguf file runs on Windows, Linux, and macOS without conversion. Works with CUDA, Metal, ROCm, and CPU backends.

Fast loading

GGUF supports memory-mapped loading, meaning the OS pages in only the parts of the model actively used. Large models start faster than with other formats.

Versioned format

GGUF stores a format version number in its header. Newer quantization methods like IQ4_XS are added as new tensor types rather than as breaking changes to the container, so existing files keep loading.
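You can check the header yourself with a few lines of Python. This sketch assumes the version 2 or later layout, where the file begins with the 4-byte magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata counts, all little-endian. The function name is illustrative, and the header bytes are synthetic so the example runs without a model file.

```python
import struct

# magic (4 bytes), version (uint32), tensor count (uint64), metadata key count (uint64)
GGUF_HEADER = struct.Struct("<4sIQQ")


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size start of a GGUF file (version 2+ layout assumed)."""
    if len(data) < GGUF_HEADER.size:
        raise ValueError(f"need at least {GGUF_HEADER.size} bytes")
    magic, version, tensor_count, kv_count = GGUF_HEADER.unpack_from(data)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": tensor_count, "metadata_keys": kv_count}


# Synthetic header with made-up counts, standing in for the first 24 bytes of a real file.
fake_header = GGUF_HEADER.pack(b"GGUF", 3, 291, 24)
print(read_gguf_header(fake_header))
```

For a real file, open it in binary mode and pass the first 24 bytes, for example `read_gguf_header(open(path, "rb").read(24))` with `path` pointing at your own .gguf file.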

GGUF files are named using a consistent convention: the model name, parameter count, quantization type, and sometimes a version suffix. For example, llama-3.1-8b-instruct.Q4_K_M.gguf is the Llama 3.1 8B Instruct model at Q4_K_M quantization.
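Because the convention is fairly regular, the quantization type can usually be pulled out of a file name with a regular expression. The pattern below is a best-effort sketch covering the formats named in this guide. Uploaders do not all follow the convention, so treat a `None` result as "unknown" rather than an error.

```python
import re

# Matches a quantization tag right before ".gguf", e.g. ".Q4_K_M.gguf" or "-IQ4_XS.gguf".
QUANT_PATTERN = re.compile(
    r"[.\-_]((?:IQ|Q)\d+(?:_[A-Z0-9]+)*|BF16|F16|F32)\.gguf$",
    re.IGNORECASE,
)


def quant_from_filename(filename):
    """Return the quantization tag in upper case, or None if no tag is recognised."""
    match = QUANT_PATTERN.search(filename)
    return match.group(1).upper() if match else None


for name in (
    "llama-3.1-8b-instruct.Q4_K_M.gguf",
    "model-IQ4_XS.gguf",
    "model-f16.gguf",
    "llama-3.1-8b.gguf",
):
    print(name, "->", quant_from_filename(name))
```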

Using Quantized Models with Ollama

Ollama does not choose a quantization based on your VRAM. Each model's default tag points to one fixed quantization picked by the library maintainers, usually a 4-bit variant such as Q4_K_M. To get a different level, specify it explicitly using tags.

Default tag (a fixed quantization, usually 4-bit)

ollama run llama3.1:8b

Request Q4 quantization explicitly

ollama run llama3.1:8b-instruct-q4_0

Request Q8 quantization explicitly

ollama run llama3.1:8b-instruct-q8_0

List models you have already downloaded (the tags available for a model are shown on its library page)

ollama list

Ollama's model library at ollama.com/library shows all available tags and their quantization levels. Not all models have all quantization variants available.

Using Quantized Models with llama.cpp

llama.cpp gives you more direct control over quantization and GPU offloading. You download GGUF files manually from Hugging Face and load them with the llama-cli or llama-server binaries.

Run a GGUF model with all layers offloaded to GPU

./llama-cli --model llama-3.1-8b.Q4_K_M.gguf -ngl 99 -p "Hello"

Run with partial GPU offload (24 layers on GPU, rest on CPU)

./llama-cli --model llama-3.1-8b.Q4_K_M.gguf -ngl 24 -p "Hello"

Start llama-server for API access

./llama-server --model llama-3.1-8b.Q4_K_M.gguf -ngl 99 --port 8080

Quantize a model yourself (from F16 to Q4_K_M)

./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

The -ngl flag controls how many transformer layers are offloaded to the GPU. Any value at or above the model's layer count offloads everything, which is why 99 is a common shorthand. Reduce it if you run out of VRAM; the remaining layers then run on the CPU at reduced speed.
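A rough starting value for -ngl can be estimated by dividing the VRAM you can spare by the size of one layer. The sketch below assumes all layers are the same size and sets aside a fixed reserve for the KV cache and runtime buffers. Both are simplifications, and the function and parameter names are illustrative, so treat the result as a first guess to adjust by trial.

```python
import math


def layers_that_fit(model_size_gb, n_layers, free_vram_gb, reserve_gb=1.5):
    """Estimate how many equally sized layers fit in VRAM after a fixed reserve."""
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(0.0, free_vram_gb - reserve_gb)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Example: a ~4 GB Q4 file with 32 layers and 4 GB of free VRAM, keeping 1 GB in reserve.
ngl = layers_that_fit(model_size_gb=4.0, n_layers=32, free_vram_gb=4.0, reserve_gb=1.0)
print(f"suggested starting point: -ngl {ngl}")
```

If loading still fails with an out-of-memory error, lower the value by a few layers and try again. Long contexts need a larger reserve.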

Frequently Asked Questions

What is the best quantization for everyday LLM use?

Q4_K_M is the best quantization for everyday use. It needs roughly a quarter of the VRAM of F16 with only a 3-5% quality reduction on most benchmarks. It is the default choice for chat, writing, and general tasks on consumer hardware.

How much VRAM does a 7B model need at Q4?

A 7B model at Q4_K_M needs approximately 4-5 GB of VRAM. The formula is: parameters in billions times 0.5 bytes per parameter, plus about 20% overhead. So 7B x 0.5 = 3.5 GB plus overhead gives roughly 4-5 GB total.

Is Q8 noticeably better than Q4?

For most tasks, the difference is subtle. Q8_0 is nearly identical to F16 in quality. Q4_K_M is about 3-5% worse on perplexity benchmarks, which translates to occasional wording differences rather than factual errors. For coding and reasoning tasks where precision matters, Q8 is worth the extra VRAM if you have it.

What is a GGUF file?

GGUF (GPT-Generated Unified Format) is the standard file format for quantized LLM weights. It replaced the older GGML format and is used by both Ollama and llama.cpp. GGUF files bundle the model weights, tokenizer, and metadata into a single portable file. Most quantized models on Hugging Face are distributed as GGUF files.

Can I run a 70B model on a single GPU?

Yes, but only at around 4-bit quantization and on high-VRAM hardware. A 70B model at Q4_K_M needs roughly 35-42 GB of VRAM (35 GB of weights plus overhead). That fits on a single NVIDIA A100 80 GB or H100 80 GB, or on one 48 GB workstation card like the RTX 6000 Ada. With consumer cards, you need two 24 GB GPUs (like two RTX 3090s or 4090s).

Sources & methodology

Behaviour, file-format and runtime details on this page are pulled from primary upstream docs and community benchmark threads. The full sitewide methodology lives on the methodology page. For this guide I relied most on:

Spot a number that does not match the linked source? Email [email protected] and I will update the guide.