How to Run LLMs on Windows: Complete Setup Guide (2026)

AI sketched the install order. The driver / WSL2 gotchas come from manual testing on actual Windows boxes — those are the parts the AI consistently got wrong on first pass.

Updated May 2026 · Ollama and LM Studio · NVIDIA, AMD, Intel Arc · Windows 10 and 11

Running large language models locally on Windows is straightforward in 2026. Ollama and LM Studio both have native Windows installers, auto-detect your GPU, and can have you chatting with Qwen3 or Llama 3 in under five minutes. This guide covers both tools, which models to run for your VRAM tier, how to fix common Windows-specific issues, and optional extras like Open WebUI for a ChatGPT-style interface.

Quick Start: Running Your First Model in 2 Minutes

The fastest path to a working local LLM on Windows is Ollama. It installs as a background service, auto-detects NVIDIA and AMD GPUs, and adds the ollama command to your PATH automatically.

  1. 1.

    Download and install Ollama for Windows

    Go to ollama.com/download/windows and run the installer. Ollama installs as a system service that starts automatically with Windows. No Python, no virtual environments, no configuration files needed. After installation, open a new terminal window — the ollama command is immediately available.

  2. 2.

    Run your first model

    Open PowerShell or Command Prompt and run one of these commands. Ollama downloads the model on first run and caches it locally.

    6–8 GB VRAM — Qwen3 8B (fast, highly capable)

    ollama run qwen3:8b

    10–12 GB VRAM — Qwen3 14B (stronger reasoning)

    ollama run qwen3:14b

    No GPU / small VRAM — Phi-4-mini 3.8B (CPU-friendly)

    ollama run phi4-mini
  3. 3.

    Verify GPU is being used

    In a second terminal, run the command below. The GPU% column should show 100% for full GPU inference. Open Task Manager and click the GPU tab to see compute utilization spike during generation.

    ollama ps

System Requirements

Both Ollama and LM Studio run on Windows 10 (version 1903 or later) and Windows 11. A discrete GPU is strongly recommended for any model larger than 3B parameters — CPU-only inference is functional but slow. Here are the minimum practical specs by use case.

Minimum (CPU only)

  • Windows 10/11
  • 16 GB RAM
  • No discrete GPU required
  • Models: Phi-4-mini, Qwen3 1.7B
  • Speed: 2–8 t/s

Recommended (6–8 GB GPU)

  • Windows 10/11
  • 16 GB RAM
  • NVIDIA RTX 3060/4060 or AMD RX 7600
  • Models: Qwen3 8B, Mistral 7B
  • Speed: 40–70 t/s

Ideal (12–24 GB GPU)

  • Windows 10/11
  • 32 GB RAM
  • RTX 4070 12GB, RTX 4090, or RX 7900 XTX
  • Models: Qwen3 14B–70B, Phi-4 14B
  • Speed: 20–50 t/s

GPU Driver Requirements by Brand

GPU BrandRequired DriverDownloadNotes
NVIDIA (GTX 10xx+) Game Ready or Studio Driver nvidia.com/drivers CUDA 12 required. Any recent driver works.
AMD (RX 5000+) Radeon Software 24.0+ amd.com/support Enables ROCm in Ollama. Keep updated.
Intel Arc (A/B series) Intel Arc Control (latest) intel.com/arc Oneapi/SYCL backend. Update frequently.
Integrated (Intel/AMD) Latest available Device Manager Very slow; small models only.

Keep GPU drivers updated. Ollama and LM Studio both release updates that may require newer driver features. When in doubt, update to the latest available driver before troubleshooting GPU detection issues.

Recommended Models by VRAM Tier

Match your model to your VRAM. Running a model that exceeds your VRAM causes CPU offloading — generation slows to 1 to 5 t/s. Always pick the largest quantization that fits fully in VRAM.

VRAMExample GPUsBest ModelsOllama CommandSpeed
6–8 GB RTX 4060 8GB, RTX 3060, RX 7600 Qwen3 8B Q4_K_M, Mistral 7B ollama run qwen3:8b ~40–70 t/s (RTX 4060)
10–12 GB RTX 3080 10GB, Arc B580 12GB, RTX 4070 12GB Qwen3 14B Q4_K_M, Phi-4 14B ollama run qwen3:14b ~35–50 t/s
16 GB RTX 4060 Ti 16GB, RTX 4070 Ti Super Qwen3 14B Q8_0, Phi-4 Q8 ollama run qwen3:14b:q8_0 ~25–40 t/s
24 GB+ RTX 4090, RTX 3090, RX 7900 XTX Llama 3.3 70B Q4_K_M ollama run llama3.3:70b ~20–35 t/s
CPU only 16+ GB RAM, no discrete GPU Qwen3 1.7B, Phi-4-mini 3.8B ollama run phi4-mini ~2–8 t/s

Speed estimates are for Q4_K_M quantization with 100% GPU inference. Use the VRAM Calculator to check exact fit at your context length.

Ollama on Windows: Full Setup Details

Ollama's Windows installer handles everything: it registers a background service, adds the CLI to your system PATH, and configures GPU detection automatically. Here is what you need to know beyond the quick start.

GPU detection: NVIDIA vs AMD vs Intel Arc

NVIDIA GPUs are detected via CUDA — any modern Game Ready or Studio driver works. AMD GPU support on Windows uses DirectML; install the latest Radeon Software (Adrenalin Edition 24.0+). Intel Arc GPUs use an OpenCL/DirectML path — install the latest Intel Arc Control app and keep it updated. After installing drivers, no additional Ollama configuration is needed. Run "ollama run phi4-mini" and check Task Manager GPU tab to confirm acceleration.

Model storage location on Windows

By default, Ollama stores models in %USERPROFILE%\.ollama\models — typically C:\Users\YourName\.ollama\models. Models can be several gigabytes each. To move storage to a different drive, set the OLLAMA_MODELS environment variable before starting Ollama: set OLLAMA_MODELS=D:\ollama\models in a persistent environment variable via System Properties.

Running Ollama as a background API server

Ollama starts automatically as a Windows service after installation. It listens on http://localhost:11434 with an OpenAI-compatible API. This means any tool that supports the OpenAI API — VS Code extensions, Continue.dev, Cursor, Open WebUI — can connect to Ollama without any extra configuration. Just point the tool at http://localhost:11434 and it works.

Updating Ollama on Windows

Ollama on Windows auto-updates by default. When a new version is available, the system tray icon shows an update prompt. You can also download the latest installer from ollama.com and run it over the existing installation — it upgrades cleanly without removing your downloaded models.

LM Studio on Windows: GUI Alternative

LM Studio is a graphical desktop application for running local LLMs without any command-line work. It includes a model browser, built-in chat interface, and a GPU layers slider for fine-tuning how much VRAM to use. Ideal for beginners or anyone who prefers a visual interface.

1. Download LM Studio

Go to lmstudio.ai and download the Windows installer. Run it and follow the prompts — LM Studio installs like any standard Windows application. No dependencies required.

2. Find and download a model

Open the Discover tab and search for a model. For Qwen3 8B, search "qwen3" and look for "bartowski/Qwen3-8B-GGUF" — bartowski's quantizations are well-tested. LM Studio shows file size and recommended VRAM next to each variant.

3. Configure GPU layers

When loading a model, LM Studio shows a GPU Layers slider. Drag it to the maximum value your VRAM supports. Higher layers = more work on GPU = faster generation. LM Studio estimates VRAM usage live as you adjust the slider.

4. Load a GGUF file from disk

Already have a GGUF model downloaded? Use My Models tab and click the import button to point LM Studio at any .gguf file on your system. Useful for models downloaded from Hugging Face or converted from other formats.

Ollama vs LM Studio: which should you use on Windows?

Use LM Studio if you want a visual interface, easy model browsing, and a built-in chat UI without touching the terminal. Use Ollama if you want a persistent API server for integrating with editors, scripts, or Open WebUI, or if you prefer working from the command line. Both run the same GGUF models and support the same GPUs. Many users install both — LM Studio for exploration, Ollama for daily use.

Fixing Common Windows Issues

CUDA not detected — model runs on CPU despite having an NVIDIA GPU

Usually a driver issue. Open Device Manager and verify your GPU is listed under Display Adapters without a warning icon. Update to the latest NVIDIA Game Ready or Studio Driver from nvidia.com. After updating, restart your machine and run "ollama run phi4-mini" — check Task Manager GPU tab during inference to confirm CUDA acceleration. If the GPU is still not detected, run "nvidia-smi" in a terminal; if that command fails, the driver is not installed correctly.

Model runs but generation is very slow (1–5 t/s)

This is CPU offloading — the model does not fully fit in VRAM and Ollama is running layers on system RAM. Run "ollama ps" and check the GPU% column. Fix: switch to a smaller model or a more aggressive quantization. For an 8 GB GPU running Qwen3 14B, switch to Qwen3 8B which fits comfortably. Alternatively, run "ollama run qwen3:8b:q5_k_m" for a higher-quality quantization that still fits in 8 GB.

Port 11434 already in use — Ollama fails to start

Another process is using port 11434. Open PowerShell as Administrator and run: netstat -ano | findstr :11434 — note the PID in the last column, then run: taskkill /PID <pid> /F to terminate it. If Ollama is already running as a service but not responding, open Task Manager, find OllamaService in the Services tab, right-click and restart it.

Windows Defender quarantines the Ollama installer or model files

This is a false positive. Ollama and its model files are safe. To resolve: open Windows Security, go to Virus and Threat Protection, click Protection History, find the quarantined item, and click Allow. For the installer, right-click the downloaded .exe, select Properties, and check "Unblock" at the bottom before running. You can also add the Ollama install directory and model storage path to Windows Defender exclusions.

AMD GPU not detected by Ollama on Windows

Ollama uses DirectML for AMD on Windows, which requires Radeon Software Adrenalin Edition 24.0 or later. Download it from amd.com/support. If your card is older (RX 500 series or earlier), DirectML support may be limited — LM Studio with the Vulkan backend is often more reliable for older AMD cards on Windows. In LM Studio, go to Settings and enable the Vulkan GPU backend.

Optional: WSL2 for Advanced Users

The native Windows Ollama binary works well for most users. However, running Ollama inside WSL2 (Windows Subsystem for Linux) can offer slightly better performance and gives you access to the Linux Ollama ecosystem including ROCm for AMD GPUs and faster model loading in some configurations.

WSL2 Ollama setup (Ubuntu)

1. Enable WSL2 in PowerShell (run as Administrator)

wsl --install

2. Inside Ubuntu terminal, install Ollama for Linux

curl -fsSL https://ollama.com/install.sh | sh

3. Run your model

ollama run qwen3:8b

WSL2 has access to your Windows GPU via CUDA passthrough — NVIDIA GPUs work well. AMD GPU support in WSL2 requires ROCm which has limited Windows passthrough support as of 2026. For most Windows users, the native Ollama installer is simpler and performs nearly identically. WSL2 is worth trying if you encounter driver issues with the native install or need Linux-specific tooling.

Open WebUI: ChatGPT-Style Interface on Windows

Ollama's terminal interface is functional but minimal. Open WebUI adds a full browser-based chat interface with conversation history, model switching, file uploads, and web search — all running locally. It connects to your local Ollama server and requires Docker Desktop on Windows.

1. Install Docker Desktop for Windows

Download from docker.com/products/docker-desktop. Run the installer and enable the WSL2 backend when prompted. Restart your machine after installation.

2. Start Open WebUI with Docker

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway ghcr.io/open-webui/open-webui:main

3. Open your browser

Navigate to http://localhost:3000. Create an account on the first-run screen (this is local only — no internet account). Select your Ollama model from the dropdown at the top of the chat.

For a full Open WebUI setup walkthrough including configuration options, see the Open WebUI setup guide.

Frequently Asked Questions

Does Ollama work on Windows without a GPU?

Yes, Ollama runs on Windows without a GPU using CPU-only mode. It automatically falls back to CPU inference if no supported GPU is detected. CPU mode works fine but is significantly slower — expect 2 to 8 tokens per second on a modern CPU compared to 30 to 80 t/s on a mid-range GPU. For CPU-only use, stick to small models like Qwen3 1.7B or Phi-4-mini 3.8B which are fast enough to be usable.

Which GPU drivers do I need for LLMs on Windows?

For NVIDIA: install the latest Game Ready or Studio Driver from nvidia.com — any recent driver supports CUDA 12 which Ollama requires. For AMD: install Radeon Software Adrenalin Edition 24.0 or later from amd.com. For Intel Arc: install Intel Arc Control from intel.com and keep it updated. All three brands are detected automatically after driver installation.

How do I check if Ollama is using my GPU on Windows?

Two ways: run "ollama ps" in a terminal while a model is loaded — the GPU% column shows 100% for full GPU inference. Also open Task Manager (Ctrl+Shift+Esc), click Performance, select your GPU — compute utilization should spike during inference. If GPU stays at 0%, the model is on CPU, likely because it exceeds your VRAM.

Can I run LLMs on Windows with integrated graphics?

Technically yes, but performance is limited. Intel Iris Xe and UHD integrated graphics share system RAM, effectively giving 2 to 4 GB for model weights. Only very small models like Qwen3 1.7B or Phi-4-mini fit. AMD Radeon integrated graphics on Ryzen APUs performs similarly. Generation speed is 3 to 10 t/s — usable for testing but slow for regular use. A discrete GPU with 8 GB or more VRAM is strongly recommended for anything beyond small models.

What's the difference between Ollama and LM Studio on Windows?

Ollama is a CLI tool and headless background server. It exposes an OpenAI-compatible API at localhost:11434 and is ideal for developer integrations — VS Code extensions, scripts, Open WebUI. LM Studio is a graphical desktop application with a built-in chat UI, model browser, and a visual GPU layers slider for tuning VRAM usage. LM Studio is easier for beginners; Ollama is better for developers who want a persistent API server. Both support the same GGUF models and NVIDIA, AMD, Intel Arc GPUs on Windows.

Related Guides

Popular hardware for local LLMs

RTX 4060 (8 GB)
Budget pick. Runs 7B-8B models at 25-35 tok/s.
Buy on Amazon
RTX 4060 Ti 16 GB
Sweet spot. Runs 13B-14B at full speed. Best value.
Buy on Amazon
RTX 4090 (24 GB)
Top consumer GPU. Runs 70B models with offloading.
Buy on Amazon

Check what models fit your Windows GPU or find the right hardware.

Sources & methodology

Behaviour, file-format and runtime details on this page are pulled from primary upstream docs and community benchmark threads. The full sitewide methodology lives on the methodology page. For this guide I relied most on:

Spot a number that does not match the linked source? Email [email protected] and I will update the guide.