Billy G.R.

Editor at LLMHardware.io. Synthesises open community benchmarks for running local LLMs on consumer hardware.

Email: [email protected]

LinkedIn: linkedin.com/in/billy-g-r-2a7739404

About me

I am a working software engineer who got curious about running language models locally when API rate limits started biting on a side project. That curiosity is what this site is built on.

I do not maintain a full multi-GPU benchmark rig, and I do not get review units from hardware vendors. What I do is synthesise published community benchmarks from four sources: the llama.cpp project's own llama-bench runs, the XiongjieDai GPU-Benchmarks-on-LLM-Inference repository, the Home GPU LLM Leaderboard, and Hardware Corner's GPU ranking. For memory sizing I rely on the VRAM-formula breakdown from Modal's engineering blog. Where I have hands-on access to a card, I cross-check the numbers myself. Where I do not, I cite the source so you can follow the chain back to the original run.

The full methodology, including the VRAM formula, the worked example, and the source list, is on the methodology page.

What I cover

  • Consumer GPU reviews for local inference (NVIDIA RTX, AMD Radeon, Intel Arc)
  • Apple Silicon for LLMs (M3 and M4 series, unified memory tradeoffs)
  • Multi-GPU setups for larger models, including NVLink and CPU offloading
  • Home lab power and thermals when running models around the clock
  • Quantization in practice: GGUF formats, Q4 vs Q5 vs Q8 quality drops on real prompts
  • Setup notes for Ollama, llama.cpp, LM Studio, vLLM, and Open WebUI on consumer machines

How I cite

Every numeric claim on this site, whether tokens per second, watts, or "fits in N GB of VRAM", links to the openly published source it came from. The full source list and the VRAM formula live on the methodology page. The short version: llama.cpp's own llama-bench discussion thread for Apple Silicon baselines, the XiongjieDai community repo for NVIDIA and Apple Silicon llama-bench runs, the Home GPU LLM Leaderboard for VRAM-tier comparisons, and Hardware Corner for context-length sensitivity.

If I cannot point at a source for a number, the number does not go in a guide. If you spot one that I missed, I want to hear about it.

What I will not claim to be

  • A benchmark publisher. The numbers on this site are aggregated and cross-checked from published community benchmarks. If a claim cannot be traced to a cited source, treat it as an error and email [email protected].
  • An academic researcher. I do not publish papers and do not run controlled studies on model quality.
  • An expert on training or fine-tuning at scale. I have run small LoRAs on a single GPU but defer to people who train production models for a living.
  • A datacenter engineer. Recommendations on this site are for home labs, small offices, and individuals, not racks of H100s.
  • A model evaluation specialist. When I write about which model is good for a task, I say what worked for me and link to community evals from people who do this rigorously.

Contact and corrections

Spot something wrong in a guide? Email [email protected] with the page URL and the correction. Confirmed errors are logged on the corrections page, and the original guide is updated with a dated note. You can also reach me via LinkedIn.