How Recommendations Work

Last updated: May 24, 2026 by Billy G.R.

Data sources

Model data is fetched daily from the Hugging Face Hub API. We pull the top models by download count across all major pipeline types (text generation, code generation, multimodal, image generation, video generation), plus the latest models from the 15 most active open-source providers. Because the fetch runs once a day, new releases like Qwen3, Llama 4 Scout, DeepSeek-V3.2, and gpt-oss appear on the next daily refresh after they are published.

Hardware specs are hand-curated from manufacturer and retailer listings and updated periodically. We do not publish prices. They change too quickly to keep accurate, so each product link sends you to Amazon for the live price.

VRAM calculation

Required VRAM is estimated from the model's parameter count, the quantization format, and a runtime overhead factor. The formula we use is:

VRAM (GB) = parameters_billions × bytes_per_param × overhead_factor + os_reserve

Where:

  • parameters_billions: total parameter count in billions (e.g. 7 for a 7B model).
  • bytes_per_param: bytes per weight at the chosen quantization (table below).
  • overhead_factor: 1.2 to cover KV cache, activations, and framework overhead at modest (4K to 8K) context lengths.
  • os_reserve: 2 GB held back for the OS and the inference engine itself (llama.cpp, Ollama, vLLM, LM Studio).

bytes_per_param by quantization:

Quantization   Bytes/param   Notes
Q4_K_M         0.5           4-bit GGUF, the common quality/size sweet spot
Q5_K_M         0.625         5-bit GGUF, modestly better quality than Q4
Q8             1.0           8-bit, near-lossless against FP16
FP16           2.0           Full 16-bit precision

Worked example: Llama 3 8B at Q4_K_M

VRAM = 8 × 0.5 × 1.2 + 2
     = 4.8 + 2
     = 6.8 GB

That is consistent with what an 8B Q4_K_M GGUF consumes on a single GPU in llama.cpp with a 4K context window, cross-checked against the runs in the XiongjieDai/GPU-Benchmarks-on-LLM-Inference repo cited below.
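The formula and worked example above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual code: the function and constant names are our own, and the values are copied from the definitions and table above.

```python
# Bytes per weight for each quantization, copied from the table above.
BYTES_PER_PARAM = {
    "Q4_K_M": 0.5,
    "Q5_K_M": 0.625,
    "Q8": 1.0,
    "FP16": 2.0,
}

OVERHEAD_FACTOR = 1.2  # KV cache, activations, framework overhead at 4K to 8K context
OS_RESERVE_GB = 2.0    # held back for the OS and the inference engine


def estimate_vram_gb(parameters_billions: float, quantization: str) -> float:
    """Estimate required VRAM in GB for a short-context workload."""
    bytes_per_param = BYTES_PER_PARAM[quantization]
    weights_gb = parameters_billions * bytes_per_param
    return weights_gb * OVERHEAD_FACTOR + OS_RESERVE_GB


if __name__ == "__main__":
    # Print the estimate for an 8B model at every quantization in the table.
    for quant in BYTES_PER_PARAM:
        print(f"8B at {quant}: {estimate_vram_gb(8, quant):.1f} GB")
```

Calling estimate_vram_gb(8, "Q4_K_M") reproduces the 6.8 GB figure from the worked example.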

KV cache scales with context length

The 1.2 overhead factor is only honest at short contexts. The KV cache itself scales linearly with context length:

kv_bytes_per_token = 2 × layers × kv_heads × head_dim × bytes_per_element

The leading 2 covers separate keys and values. bytes_per_element is 2 for an FP16/BF16 KV cache and 1 for an INT8-quantized KV cache.

For a Llama 3 8B model (32 layers, 8 KV heads, head_dim 128, FP16 KV cache) that works out to 2 × 32 × 8 × 128 × 2 = 131,072 bytes, or roughly 131 KB per token, so a 32K context adds about 4 GB on top of the weights. For long-context workloads we use that figure directly instead of the overhead factor.
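To make the scaling concrete, here is a small Python sketch of the per-token formula above. The function names are illustrative, and the conversion to GB assumes decimal gigabytes (1 GB = 10^9 bytes), which is our assumption rather than something the text specifies.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_element: int = 2) -> int:
    """Bytes of KV cache added per token; the leading 2 is keys plus values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element


def kv_cache_gb(context_tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_element: int = 2) -> float:
    """Total KV cache size for a full context window, in decimal GB (assumed)."""
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_element)
    return context_tokens * per_token / 1e9


if __name__ == "__main__":
    # Llama 3 8B shape from the text: 32 layers, 8 KV heads, head_dim 128, FP16 cache.
    per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
    print(f"{per_token} bytes per token")
    for context in (4096, 8192, 32768):
        print(f"{context} tokens: {kv_cache_gb(context, 32, 8, 128):.2f} GB")
```

Doubling the context doubles the cache, which is why a fixed overhead factor stops being a good approximation at long contexts.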

Hardware tiers

Each model gets three hardware recommendations:

Minimum

The smallest GPU or chip that can load the model in Q4_K_M quantization at a 4K context window. Expect slower inference and limited context.

Comfortable

Fits the model in Q5_K_M or Q8 quantization with an 8K to 16K context. Suitable for everyday use and longer conversations.

Headroom

Enough VRAM for FP16 or Q8 at 32K+ context, or for the next generation of models as they grow in parameter count.

Hardware is matched to tiers by comparing the estimated VRAM requirement at each quantization level against the GPU or unified-memory capacity. The algorithm finds the cheapest hardware that satisfies each tier's VRAM threshold.
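The matching step can be sketched as a filter followed by a minimum. This is an illustration under stated assumptions, not the site's implementation: the Hardware class, the catalogue entries, and the price field are hypothetical placeholders (the numbers are relative cost ranks, not real prices), and the per-tier requirements are passed in as example inputs.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hardware:
    name: str         # hypothetical label, not a real product
    memory_gb: float  # VRAM or unified-memory capacity
    price: float      # hypothetical relative cost rank, used only for ordering


def cheapest_fit(required_gb: float, catalogue: list[Hardware]) -> Optional[Hardware]:
    """Return the lowest-priced entry whose memory meets the requirement, or None."""
    candidates = [hw for hw in catalogue if hw.memory_gb >= required_gb]
    return min(candidates, key=lambda hw: hw.price, default=None)


def recommend(tier_requirements: dict[str, float],
              catalogue: list[Hardware]) -> dict[str, Optional[Hardware]]:
    """Pick the cheapest fitting hardware independently for each tier."""
    return {tier: cheapest_fit(req, catalogue) for tier, req in tier_requirements.items()}


if __name__ == "__main__":
    # Placeholder catalogue for illustration only.
    catalogue = [
        Hardware("example-8gb", 8, 1),
        Hardware("example-12gb", 12, 2),
        Hardware("example-16gb", 16, 3),
        Hardware("example-24gb", 24, 4),
    ]
    # Example requirements in GB for one model (illustrative inputs).
    tiers = {"minimum": 6.8, "comfortable": 11.6, "headroom": 21.2}
    for tier, hw in recommend(tiers, catalogue).items():
        print(tier, "->", hw.name if hw else "no match")
```

Returning None when nothing in the catalogue is large enough keeps the "no suitable hardware" case explicit instead of silently recommending an undersized card.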

Where our numbers come from

Tokens-per-second, watts, and "fits on this card" claims are not numbers we generate. We synthesise them from a small set of open community benchmarks, listed here. Every numeric claim on the site can be traced back to one of these sources.

llama.cpp llama-bench discussion (Apple Silicon results)

Apple Silicon tokens-per-second baselines for M-series chips, published by the llama.cpp project itself.

XiongjieDai/GPU-Benchmarks-on-LLM-Inference

Community llama-bench results across NVIDIA consumer and datacenter cards and Apple Silicon, with model, quant, and context length recorded per run.

Home GPU LLM Leaderboard

VRAM-tier-organised open leaderboard with tokens/sec per GPU; used to cross-check our GPU tier ranking.

Hardware-Corner GPU ranking for local LLM

Consumer and pro GPU tokens/sec at multiple context lengths; used to spot-check our context-length sensitivity claims.

Modal — How much VRAM do I need for LLM inference?

Independent VRAM-formula breakdown (weights + KV cache + activation + framework overhead); used to anchor our formula and overhead factor.

Hugging Face Hub API

Model parameter counts, tags, and downloads. Pulled daily for the model catalogue.

Parameter detection

Parameter counts are extracted from the model ID and tags using regex patterns that match common naming conventions (for example 7B, 70b, 0.5B). When a model does not encode its size in its name or tags, no VRAM estimate is shown.
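As a rough illustration of this kind of extraction, the Python sketch below looks for a number followed by "B" or "b" in the model ID and then in the tags. The pattern, the function name, and the example model IDs are our own assumptions; the site's actual regex patterns are not published here and may differ.

```python
import re
from typing import Iterable, Optional

# A number (optionally with a decimal part) followed by B or b, not glued to
# other letters or digits on either side, so "4bit" or "v2b3" do not match.
SIZE_PATTERN = re.compile(r"(?<![A-Za-z0-9.])(\d+(?:\.\d+)?)[bB](?![A-Za-z0-9])")


def detect_parameters_billions(model_id: str,
                               tags: Iterable[str] = ()) -> Optional[float]:
    """Return the parameter count in billions, or None if no size is encoded."""
    for text in (model_id, *tags):
        match = SIZE_PATTERN.search(text)
        if match:
            return float(match.group(1))
    return None


if __name__ == "__main__":
    # Hypothetical model IDs, chosen to exercise the naming conventions above.
    for model_id in ("example-org/example-7B-instruct",
                     "example-org/example-3.1-70b",
                     "example-org/example2.5-0.5B",
                     "example-org/unsized-model"):
        print(model_id, "->", detect_parameters_billions(model_id))
```

When the function returns None, the caller would skip the VRAM estimate, matching the behaviour described above.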

What we do NOT claim

  • We do not run a full in-house benchmark rig. Tokens/sec and watt figures on this site are aggregated and cross-checked from the cited open sources above, not measured here from scratch.
  • We do not publish original tokens-per-second numbers for hardware we have not personally tested. Where the editor has hands-on access to a card, results are spot-checked; where access is missing, the source is cited.
  • We do not run controlled studies on model output quality. For evals we link to the community benchmarks that do.
  • We are not a hardware vendor and we do not get review units. Affiliate links on this site are disclosed; they do not change which hardware gets recommended.
  • If a number on this site cannot be traced to one of the cited sources, it is wrong. Please email [email protected] with the page URL.

Cross-checks and corrections

Every numeric claim on the site is checked against at least one of the open sources above before publishing. Reader-flagged errors are logged on the corrections page and the original guide is updated with a dated note. Send corrections to [email protected] with the page URL and the source you would prefer we cite.

Limitations

  • Estimates assume a single GPU or unified memory pool. Multi-GPU setups, CPU offloading, and model parallelism are not modeled.
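  • Mixture-of-Experts (MoE) models (DeepSeek, Mixtral) activate only a fraction of parameters per token, but every expert still has to be resident in memory unless some are offloaded. Our formula uses total parameter count, so it reflects the fully loaded footprint and overestimates VRAM only for setups that offload inactive experts to CPU, which we do not model.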
  • Apple Silicon unified memory is shared between CPU and GPU. Available VRAM depends on system RAM allocation and other running applications.
  • Context length significantly affects KV cache size. The 1.2 overhead factor is calibrated for 4K to 8K context; 32K+ context needs the explicit KV-cache formula above.
  • Quantization quality varies by implementation. GGUF Q4_K_M and GPTQ Q4 are not equivalent even at the same bit width.

Disclaimer

All recommendations are for informational purposes only. Verify requirements before purchasing. See our full disclaimer.