How to Run LLMs on Windows: Complete Setup Guide (2026)

Q: Which GPU drivers do I need for LLMs on Windows?

Driver requirements depend on your GPU brand. For NVIDIA: install the latest Game Ready or Studio Driver from nvidia.com — any driver from the past year supports CUDA 12 which Ollama requires. For AMD: install Radeon Software (Adrenalin Edition) version 24.0 or later from amd.com, which enables ROCm-based acceleration in Ollama. For Intel Arc: install Intel Arc Control (formerly Intel Graphics Command Center) from intel.com, keeping it up to date for best Ollama compatibility. All three brands detect automatically after driver installation — no manual configuration needed.

Q: How do I check if Ollama is using my GPU on Windows?

Two ways to verify GPU usage. First, run "ollama ps" in a terminal while a model is loaded — the output shows a GPU% column. 100% means fully on VRAM and running at full speed; anything lower means partial CPU offloading and slower generation. Second, open Task Manager (Ctrl+Shift+Esc), click Performance, then select your GPU — you will see GPU compute utilization spike during inference if Ollama is using it. If GPU utilization stays at 0% during generation, your model is running on CPU, likely because it does not fit in VRAM.

Q: Can I run LLMs on Windows with integrated graphics?

Technically yes, but expect limited performance. Intel Iris Xe and Intel UHD integrated graphics share system RAM rather than having dedicated VRAM, giving you 2 to 4 GB effectively available for model weights. Only very small models like Qwen3 1.7B or Phi-4-mini fit without heavy CPU offloading. Generation speed is typically 3 to 10 t/s — slow but usable for testing. AMD Radeon integrated graphics (Ryzen APUs) performs similarly. For anything beyond small models or light experimentation, a discrete GPU with 8 GB or more VRAM is strongly recommended.

Q: What's the difference between Ollama and LM Studio on Windows?

Ollama is a command-line tool and background server. It exposes an OpenAI-compatible API at localhost:11434, runs headlessly, and is ideal for integrating LLMs into other tools like VS Code extensions, scripts, or Open WebUI. LM Studio is a graphical desktop application with a built-in chat interface, model browser, and visual GPU layers slider for tuning VRAM usage. LM Studio is easier for beginners and for experimenting with different models; Ollama is better for developers who want a persistent API server and programmatic access. Both can run the same GGUF models and both support NVIDIA, AMD, and Intel Arc GPUs on Windows.

AI sketched the install order. The driver / WSL2 gotchas come from manual testing on actual Windows boxes — those are the parts the AI consistently got wrong on first pass.

Updated May 2026 · Ollama and LM Studio · NVIDIA, AMD, Intel Arc · Windows 10 and 11

Running large language models locally on Windows is straightforward in 2026. Ollama and LM Studio both have native Windows installers, auto-detect your GPU, and can have you chatting with Qwen3 or Llama 3 in under five minutes. This guide covers both tools, which models to run for your VRAM tier, how to fix common Windows-specific issues, and optional extras like Open WebUI for a ChatGPT-style interface.

Quick Start: Running Your First Model in 2 Minutes

The fastest path to a working local LLM on Windows is Ollama. It installs as a background service, auto-detects NVIDIA and AMD GPUs, and adds the ollama command to your PATH automatically.

1.

Download and install Ollama for Windows

Go to ollama.com/download/windows and run the installer. Ollama installs as a system service that starts automatically with Windows. No Python, no virtual environments, no configuration files needed. After installation, open a new terminal window — the ollama command is immediately available.
2.
Run your first model

Open PowerShell or Command Prompt and run one of these commands. Ollama downloads the model on first run and caches it locally.
6–8 GB VRAM — Qwen3 8B (fast, highly capable)
```
ollama run qwen3:8b
```
10–12 GB VRAM — Qwen3 14B (stronger reasoning)
```
ollama run qwen3:14b
```
No GPU / small VRAM — Phi-4-mini 3.8B (CPU-friendly)
```
ollama run phi4-mini
```
3.
Verify GPU is being used

In a second terminal, run the command below. The GPU% column should show 100% for full GPU inference. Open Task Manager and click the GPU tab to see compute utilization spike during generation.
```
ollama ps
```

System Requirements

Both Ollama and LM Studio run on Windows 10 (version 1903 or later) and Windows 11. A discrete GPU is strongly recommended for any model larger than 3B parameters — CPU-only inference is functional but slow. Here are the minimum practical specs by use case.

Minimum (CPU only)

Windows 10/11
16 GB RAM
No discrete GPU required
Models: Phi-4-mini, Qwen3 1.7B
Speed: 2–8 t/s

Recommended (6–8 GB GPU)

Windows 10/11
16 GB RAM
NVIDIA RTX 3060/4060 or AMD RX 7600
Models: Qwen3 8B, Mistral 7B
Speed: 40–70 t/s

Ideal (12–24 GB GPU)

Windows 10/11
32 GB RAM
RTX 4070 12GB, RTX 4090, or RX 7900 XTX
Models: Qwen3 14B–70B, Phi-4 14B
Speed: 20–50 t/s

GPU Driver Requirements by Brand

GPU Brand	Required Driver	Download	Notes
NVIDIA (GTX 10xx+)	Game Ready or Studio Driver	nvidia.com/drivers	CUDA 12 required. Any recent driver works.
AMD (RX 5000+)	Radeon Software 24.0+	amd.com/support	Enables ROCm in Ollama. Keep updated.
Intel Arc (A/B series)	Intel Arc Control (latest)	intel.com/arc	Oneapi/SYCL backend. Update frequently.
Integrated (Intel/AMD)	Latest available	Device Manager	Very slow; small models only.

Keep GPU drivers updated. Ollama and LM Studio both release updates that may require newer driver features. When in doubt, update to the latest available driver before troubleshooting GPU detection issues.

Recommended Models by VRAM Tier

Match your model to your VRAM. Running a model that exceeds your VRAM causes CPU offloading — generation slows to 1 to 5 t/s. Always pick the largest quantization that fits fully in VRAM.

VRAM	Example GPUs	Best Models	Ollama Command	Speed
6–8 GB	RTX 4060 8GB, RTX 3060, RX 7600	Qwen3 8B Q4_K_M, Mistral 7B	ollama run qwen3:8b	~40–70 t/s (RTX 4060)
10–12 GB	RTX 3080 10GB, Arc B580 12GB, RTX 4070 12GB	Qwen3 14B Q4_K_M, Phi-4 14B	ollama run qwen3:14b	~35–50 t/s
16 GB	RTX 4060 Ti 16GB, RTX 4070 Ti Super	Qwen3 14B Q8_0, Phi-4 Q8	ollama run qwen3:14b:q8_0	~25–40 t/s
24 GB+	RTX 4090, RTX 3090, RX 7900 XTX	Llama 3.3 70B Q4_K_M	ollama run llama3.3:70b	~20–35 t/s
CPU only	16+ GB RAM, no discrete GPU	Qwen3 1.7B, Phi-4-mini 3.8B	ollama run phi4-mini	~2–8 t/s

Speed estimates are for Q4_K_M quantization with 100% GPU inference. Use the VRAM Calculator to check exact fit at your context length.

Ollama on Windows: Full Setup Details

Ollama's Windows installer handles everything: it registers a background service, adds the CLI to your system PATH, and configures GPU detection automatically. Here is what you need to know beyond the quick start.

GPU detection: NVIDIA vs AMD vs Intel Arc

NVIDIA GPUs are detected via CUDA — any modern Game Ready or Studio driver works. AMD GPU support on Windows uses DirectML; install the latest Radeon Software (Adrenalin Edition 24.0+). Intel Arc GPUs use an OpenCL/DirectML path — install the latest Intel Arc Control app and keep it updated. After installing drivers, no additional Ollama configuration is needed. Run "ollama run phi4-mini" and check Task Manager GPU tab to confirm acceleration.

Model storage location on Windows

By default, Ollama stores models in %USERPROFILE%\.ollama\models — typically C:\Users\YourName\.ollama\models. Models can be several gigabytes each. To move storage to a different drive, set the OLLAMA_MODELS environment variable before starting Ollama: set OLLAMA_MODELS=D:\ollama\models in a persistent environment variable via System Properties.

Running Ollama as a background API server

Ollama starts automatically as a Windows service after installation. It listens on http://localhost:11434 with an OpenAI-compatible API. This means any tool that supports the OpenAI API — VS Code extensions, Continue.dev, Cursor, Open WebUI — can connect to Ollama without any extra configuration. Just point the tool at http://localhost:11434 and it works.

Updating Ollama on Windows

Ollama on Windows auto-updates by default. When a new version is available, the system tray icon shows an update prompt. You can also download the latest installer from ollama.com and run it over the existing installation — it upgrades cleanly without removing your downloaded models.

LM Studio on Windows: GUI Alternative

LM Studio is a graphical desktop application for running local LLMs without any command-line work. It includes a model browser, built-in chat interface, and a GPU layers slider for fine-tuning how much VRAM to use. Ideal for beginners or anyone who prefers a visual interface.

1. Download LM Studio

Go to lmstudio.ai and download the Windows installer. Run it and follow the prompts — LM Studio installs like any standard Windows application. No dependencies required.

2. Find and download a model

Open the Discover tab and search for a model. For Qwen3 8B, search "qwen3" and look for "bartowski/Qwen3-8B-GGUF" — bartowski's quantizations are well-tested. LM Studio shows file size and recommended VRAM next to each variant.

3. Configure GPU layers

When loading a model, LM Studio shows a GPU Layers slider. Drag it to the maximum value your VRAM supports. Higher layers = more work on GPU = faster generation. LM Studio estimates VRAM usage live as you adjust the slider.

4. Load a GGUF file from disk

Already have a GGUF model downloaded? Use My Models tab and click the import button to point LM Studio at any .gguf file on your system. Useful for models downloaded from Hugging Face or converted from other formats.

Ollama vs LM Studio: which should you use on Windows?

Use LM Studio if you want a visual interface, easy model browsing, and a built-in chat UI without touching the terminal. Use Ollama if you want a persistent API server for integrating with editors, scripts, or Open WebUI, or if you prefer working from the command line. Both run the same GGUF models and support the same GPUs. Many users install both — LM Studio for exploration, Ollama for daily use.

Fixing Common Windows Issues

CUDA not detected — model runs on CPU despite having an NVIDIA GPU

Usually a driver issue. Open Device Manager and verify your GPU is listed under Display Adapters without a warning icon. Update to the latest NVIDIA Game Ready or Studio Driver from nvidia.com. After updating, restart your machine and run "ollama run phi4-mini" — check Task Manager GPU tab during inference to confirm CUDA acceleration. If the GPU is still not detected, run "nvidia-smi" in a terminal; if that command fails, the driver is not installed correctly.

Model runs but generation is very slow (1–5 t/s)

This is CPU offloading — the model does not fully fit in VRAM and Ollama is running layers on system RAM. Run "ollama ps" and check the GPU% column. Fix: switch to a smaller model or a more aggressive quantization. For an 8 GB GPU running Qwen3 14B, switch to Qwen3 8B which fits comfortably. Alternatively, run "ollama run qwen3:8b:q5_k_m" for a higher-quality quantization that still fits in 8 GB.

Port 11434 already in use — Ollama fails to start

Another process is using port 11434. Open PowerShell as Administrator and run: netstat -ano | findstr :11434 — note the PID in the last column, then run: taskkill /PID <pid> /F to terminate it. If Ollama is already running as a service but not responding, open Task Manager, find OllamaService in the Services tab, right-click and restart it.

Windows Defender quarantines the Ollama installer or model files

This is a false positive. Ollama and its model files are safe. To resolve: open Windows Security, go to Virus and Threat Protection, click Protection History, find the quarantined item, and click Allow. For the installer, right-click the downloaded .exe, select Properties, and check "Unblock" at the bottom before running. You can also add the Ollama install directory and model storage path to Windows Defender exclusions.

AMD GPU not detected by Ollama on Windows

Ollama uses DirectML for AMD on Windows, which requires Radeon Software Adrenalin Edition 24.0 or later. Download it from amd.com/support. If your card is older (RX 500 series or earlier), DirectML support may be limited — LM Studio with the Vulkan backend is often more reliable for older AMD cards on Windows. In LM Studio, go to Settings and enable the Vulkan GPU backend.

Optional: WSL2 for Advanced Users

The native Windows Ollama binary works well for most users. However, running Ollama inside WSL2 (Windows Subsystem for Linux) can offer slightly better performance and gives you access to the Linux Ollama ecosystem including ROCm for AMD GPUs and faster model loading in some configurations.

WSL2 Ollama setup (Ubuntu)

1. Enable WSL2 in PowerShell (run as Administrator)

wsl --install

2. Inside Ubuntu terminal, install Ollama for Linux

curl -fsSL https://ollama.com/install.sh | sh

3. Run your model

ollama run qwen3:8b

WSL2 has access to your Windows GPU via CUDA passthrough — NVIDIA GPUs work well. AMD GPU support in WSL2 requires ROCm which has limited Windows passthrough support as of 2026. For most Windows users, the native Ollama installer is simpler and performs nearly identically. WSL2 is worth trying if you encounter driver issues with the native install or need Linux-specific tooling.

Open WebUI: ChatGPT-Style Interface on Windows

Ollama's terminal interface is functional but minimal. Open WebUI adds a full browser-based chat interface with conversation history, model switching, file uploads, and web search — all running locally. It connects to your local Ollama server and requires Docker Desktop on Windows.

1. Install Docker Desktop for Windows

Download from docker.com/products/docker-desktop. Run the installer and enable the WSL2 backend when prompted. Restart your machine after installation.

2. Start Open WebUI with Docker

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway ghcr.io/open-webui/open-webui:main

3. Open your browser

Navigate to http://localhost:3000. Create an account on the first-run screen (this is local only — no internet account). Select your Ollama model from the dropdown at the top of the chat.

For a full Open WebUI setup walkthrough including configuration options, see the Open WebUI setup guide.

Frequently Asked Questions

Does Ollama work on Windows without a GPU?

Yes, Ollama runs on Windows without a GPU using CPU-only mode. It automatically falls back to CPU inference if no supported GPU is detected. CPU mode works fine but is significantly slower — expect 2 to 8 tokens per second on a modern CPU compared to 30 to 80 t/s on a mid-range GPU. For CPU-only use, stick to small models like Qwen3 1.7B or Phi-4-mini 3.8B which are fast enough to be usable.

Which GPU drivers do I need for LLMs on Windows?

For NVIDIA: install the latest Game Ready or Studio Driver from nvidia.com — any recent driver supports CUDA 12 which Ollama requires. For AMD: install Radeon Software Adrenalin Edition 24.0 or later from amd.com. For Intel Arc: install Intel Arc Control from intel.com and keep it updated. All three brands are detected automatically after driver installation.

How do I check if Ollama is using my GPU on Windows?

Two ways: run "ollama ps" in a terminal while a model is loaded — the GPU% column shows 100% for full GPU inference. Also open Task Manager (Ctrl+Shift+Esc), click Performance, select your GPU — compute utilization should spike during inference. If GPU stays at 0%, the model is on CPU, likely because it exceeds your VRAM.

Can I run LLMs on Windows with integrated graphics?

Technically yes, but performance is limited. Intel Iris Xe and UHD integrated graphics share system RAM, effectively giving 2 to 4 GB for model weights. Only very small models like Qwen3 1.7B or Phi-4-mini fit. AMD Radeon integrated graphics on Ryzen APUs performs similarly. Generation speed is 3 to 10 t/s — usable for testing but slow for regular use. A discrete GPU with 8 GB or more VRAM is strongly recommended for anything beyond small models.

What's the difference between Ollama and LM Studio on Windows?

Ollama is a CLI tool and headless background server. It exposes an OpenAI-compatible API at localhost:11434 and is ideal for developer integrations — VS Code extensions, scripts, Open WebUI. LM Studio is a graphical desktop application with a built-in chat UI, model browser, and a visual GPU layers slider for tuning VRAM usage. LM Studio is easier for beginners; Ollama is better for developers who want a persistent API server. Both support the same GGUF models and NVIDIA, AMD, Intel Arc GPUs on Windows.

Related Guides

How to Run LLMs Locally

General intro to local LLM setup

Open WebUI Setup

ChatGPT-like browser UI for Ollama

Local AI Coding Assistant

VS Code + Continue.dev setup

What Can I Run?

Find models that fit your GPU

RTX 4060 LLM Guide

Popular Windows budget GPU deep dive

RTX 4070 LLM Guide

Popular Windows mid-range GPU guide

Best LLMs to Run Locally

Model picks by VRAM tier

AI on Your Gaming PC

GPU tier table for gamers — RTX 4060 to 4090

Private Offline AI

Run AI with zero cloud — fully air-gap capable after setup

WSL2 LLM Setup Guide

Get AMD ROCm or Linux tools on Windows

Popular hardware for local LLMs

RTX 4060 (8 GB)

Budget pick. Runs 7B-8B models at 25-35 tok/s.

Buy on Amazon

RTX 4060 Ti 16 GB

Sweet spot. Runs 13B-14B at full speed. Best value.

Buy on Amazon

RTX 4090 (24 GB)

Top consumer GPU. Runs 70B models with offloading.

Buy on Amazon

Check what models fit your Windows GPU or find the right hardware.

VRAM Calculator What Can I Run? All Guides

Sources & methodology

Behaviour, file-format and runtime details on this page are pulled from primary upstream docs and community benchmark threads. The full sitewide methodology lives on the methodology page. For this guide I relied most on:

LM Studio. Windows GUI used in this guide's step-by-step screenshots.
Ollama. Alternative Windows runtime referenced for CLI-first users.
Hardware Corner GPU ranking. Tokens per second numbers for the Windows GPUs we name (3060, 4070, 4090).

Spot a number that does not match the linked source? Email [email protected] and I will update the guide.