Ollama Commands Cheat Sheet: Complete CLI Reference (2026)
Editorial: AI handled the first sweep through `ollama --help`. The flag annotations are mine — the ones I genuinely forget and re-look up.
Updated May 2026 · Ollama CLI · API · Modelfile · Environment Variables
Every Ollama command in one place. Covers the full CLI, REST API usage with curl examples, Modelfile syntax for custom models, environment variable reference, and useful one-liners. Bookmark this page as your daily driver reference for running local LLMs with Ollama.
Quick Reference: All Ollama Commands
| Command | What It Does | Example |
|---|---|---|
| ollama run <model> | Chat with a model (downloads if needed) | ollama run qwen3:8b |
| ollama pull <model> | Download a model without running it | ollama pull llama3.3:70b |
| ollama list | List all installed models | ollama list |
| ollama rm <model> | Remove a model and free disk space | ollama rm qwen3:14b |
| ollama show <model> | Show model info, parameters, Modelfile | ollama show qwen3:8b |
| ollama ps | Show running models and GPU usage | ollama ps |
| ollama stop <model> | Unload a model from memory | ollama stop qwen3:8b |
| ollama create <name> -f <file> | Create a custom model from a Modelfile | ollama create mybot -f Modelfile |
| ollama cp <src> <dst> | Copy a model to a new name | ollama cp qwen3:8b myqwen |
| ollama push <model> | Push a model to Ollama registry | ollama push myuser/mymodel |
Model Download and Management
Ollama stores models in ~/.ollama/models by default. Models are downloaded automatically on first ollama run, or manually with ollama pull. Use ollama rm to free disk space.
Download a model without running it
ollama pull qwen3:8b
List all installed models (name, size, modified date)
ollama list
Remove a model to free disk space
ollama rm qwen3:14b
Show model details, parameters, and Modelfile
ollama show qwen3:8b
Print the model's Modelfile, which includes its template and system prompt
ollama show qwen3:8b --modelfile
Copy a model under a new local name
ollama cp qwen3:8b my-qwen
Check which models are loaded in VRAM right now
ollama ps
Unload a model from VRAM immediately
ollama stop qwen3:8b
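Find all available models at ollama.com/library. Append a tag for a specific size or quantization, e.g. qwen3:8b, qwen3:14b, or a quantization-specific tag such as qwen3:8b-q8_0. Size and quantization are joined with a hyphen inside a single tag, not with a second colon. Check the model's Tags page for the exact names available.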
Popular Models and VRAM Requirements
All commands use default Q4_K_M quantization unless a tag specifies otherwise. VRAM figures are approximate for Q4_K_M at 4K context.
| Ollama Command | VRAM Needed | Type | Notes |
|---|---|---|---|
| ollama run qwen3:8b | 5.2 GB | Chat + thinking | Best 8B model, fast on 8GB VRAM |
| ollama run qwen3:14b | 9.0 GB | Chat + thinking | Strong reasoning, fits 12GB VRAM |
| ollama run qwen3:32b | 20 GB | Chat + thinking | Needs 24GB VRAM (RTX 4090, 3090) |
| ollama run qwen3:30b-a3b | 20 GB | MoE (sparse) | All 30B weights must fit in memory (about 20GB), but only ~3B parameters are active per token, so it generates at roughly 3B-model speed |
| ollama run phi4:14b | 9.1 GB | Chat | Microsoft's Phi-4, strong reasoning |
| ollama run gemma3:12b | 8.1 GB | Chat | Google's Gemma 3 12B |
| ollama run llama3.3:70b | 42 GB | Chat | Needs dual-GPU or 48GB VRAM |
| ollama run deepseek-r1:14b | 9.0 GB | Thinking/reasoning | DeepSeek R1 14B distill |
Use the VRAM Calculator to check if a model fits your GPU at a given context length.
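As a rough offline alternative, the table above can be turned into a lookup. This is a minimal sketch: the figures are copied from the table (Q4_K_M at about 4K context), and the 1 GB headroom reserved for the desktop and other processes is my own assumption, not an Ollama setting.

```python
# Approximate VRAM needs in GB, copied from the table above.
VRAM_GB = {
    "qwen3:8b": 5.2,
    "qwen3:14b": 9.0,
    "qwen3:32b": 20.0,
    "qwen3:30b-a3b": 20.0,
    "phi4:14b": 9.1,
    "gemma3:12b": 8.1,
    "llama3.3:70b": 42.0,
    "deepseek-r1:14b": 9.0,
}


def models_that_fit(vram_gb, headroom_gb=1.0):
    """Return the table's models that fit in vram_gb, largest first.

    headroom_gb is an assumed safety margin, not a value Ollama defines.
    """
    budget = vram_gb - headroom_gb
    fits = [(need, name) for name, need in VRAM_GB.items() if need <= budget]
    return [name for need, name in sorted(fits, reverse=True)]


print(models_that_fit(12))
```

Longer context windows raise the real requirement, so treat the result as a first filter rather than a guarantee.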
Chat and Inference Commands
The ollama run command opens an interactive prompt. Inside the chat session, several slash commands control behavior. You can also pipe input directly for non-interactive use.
Start interactive chat (downloads model if not present)
ollama run qwen3:8b
Run with verbose output: shows token speed and timing
ollama run qwen3:8b --verbose
Pipe a single question non-interactively
echo "explain quantum computing in 3 sentences" | ollama run qwen3:8b
Pass a file as context
cat myfile.txt | ollama run qwen3:8b "summarize this"
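The same piping pattern can be driven from a script. The sketch below wraps `ollama run` with Python's standard `subprocess` module; the helper names `build_run_command` and `run_with_stdin` are hypothetical, and it assumes the `ollama` binary is on your PATH.

```python
import subprocess


def build_run_command(model, prompt=None):
    """Build the argv list for `ollama run <model> [prompt]`."""
    cmd = ["ollama", "run", model]
    if prompt is not None:
        cmd.append(prompt)
    return cmd


def run_with_stdin(model, prompt, stdin_text, timeout=300):
    """Pipe stdin_text into `ollama run` and return the model's stdout."""
    result = subprocess.run(
        build_run_command(model, prompt),
        input=stdin_text,      # same role as `cat myfile.txt |`
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,            # raise if ollama exits with an error
    )
    return result.stdout.strip()


print(build_run_command("qwen3:8b", "summarize this"))
```

Calling `run_with_stdin("qwen3:8b", "summarize this", open("myfile.txt").read())` would mirror the shell pipeline above. Passing the arguments as a list avoids shell quoting problems with prompts that contain quotes.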
In-Session Slash Commands
| Command | What It Does |
|---|---|
| /bye | Exit the chat session |
| /clear | Clear the conversation context (start fresh) |
| /? | Show all available slash commands and help |
| /set parameter temperature 0.5 | Change temperature mid-session |
| /set parameter num_ctx 8192 | Change context window mid-session |
| /show info | Show model info and current parameters |
| /show modelfile | Show the full Modelfile for the running model |
| """ | Begin and end a multi-line message (wrap the text in triple quotes) |
| Ctrl+D | Exit the chat session (same as /bye) |
| Ctrl+C | Interrupt the current generation |
API Usage: curl Examples
Ollama runs a local REST server at http://localhost:11434. Two main endpoints handle generation. Ollama also supports an OpenAI-compatible API, so you can drop it into any app that uses the OpenAI SDK.
/api/generate — single-turn completion
curl http://localhost:11434/api/generate \
-d '{"model":"qwen3:8b","prompt":"What is the capital of France?","stream":false}'
/api/chat — multi-turn conversation with messages array
curl http://localhost:11434/api/chat \
-d '{
"model": "qwen3:8b",
"messages": [
{"role": "user", "content": "Why is the sky blue?"}
],
"stream": false
}'
OpenAI-compatible endpoint — drop-in replacement for OpenAI API
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3:8b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Python using the OpenAI library pointed at Ollama
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
List available models via API
curl http://localhost:11434/api/tags
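The /api/tags response is a JSON object with a `models` array, so pulling out just the names takes a few lines. A minimal sketch using only the standard library; `model_names` and `fetch_model_names` are hypothetical helper names, and the sample payload is trimmed to the one field used here.

```python
import json
import urllib.request


def model_names(tags_payload):
    """Extract model names from a decoded /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]


def fetch_model_names(base_url="http://localhost:11434", timeout=5):
    """Query a running Ollama server and return its installed model names."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))


# Offline demonstration with a trimmed sample payload.
sample = {"models": [{"name": "qwen3:8b"}, {"name": "phi4:14b"}]}
print(model_names(sample))
```

`fetch_model_names()` needs the server to be running; the parsing helper works on any saved response.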
Stream responses token-by-token (stream defaults to true)
curl http://localhost:11434/api/generate -d '{"model":"qwen3:8b","prompt":"Tell me a joke"}'
Expose Ollama on the network
By default, Ollama only listens on localhost. To allow connections from other devices on your network, set OLLAMA_HOST=0.0.0.0:11434 before starting the service. Then connect from other machines using your host's IP address.
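Once the server is reachable, a client on another machine only needs the host's address. The sketch below builds the /api/generate URL for a given host and reassembles a streamed reply, which arrives as one JSON object per line with a `response` fragment and a final `done: true`. The helper names are hypothetical, and `192.168.1.50` is a placeholder address.

```python
import json
import urllib.request


def generate_url(host="localhost", port=11434):
    """Build the /api/generate URL for a local or remote Ollama host."""
    return f"http://{host}:{port}/api/generate"


def join_stream(lines):
    """Concatenate 'response' fragments from NDJSON lines until done is true."""
    parts = []
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def stream_generate(model, prompt, host="localhost", port=11434):
    """Send a streaming request (stream defaults to true) and return the full text."""
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    req = urllib.request.Request(
        generate_url(host, port),
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return join_stream(resp)  # the response object iterates line by line


print(generate_url("192.168.1.50"))
```

To display tokens as they arrive instead of collecting them, print each fragment inside the loop rather than appending it.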
Modelfile Reference: Create Custom Models
A Modelfile lets you create a named custom model with a specific system prompt, temperature, and context size. Once created, it appears in ollama list and runs with ollama run mymodel.
Example Modelfile — coding assistant with 8K context
FROM qwen3:8b
SYSTEM """
You are an expert software engineer. Write clean, well-commented code.
Always explain what the code does and why you chose this approach.
"""
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
PARAMETER num_predict 2048
Create and run the custom model
ollama create coding-assistant -f Modelfile
ollama run coding-assistant
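If you maintain several variants of a custom model, generating the Modelfile from a script keeps them consistent. A minimal sketch; `render_modelfile` is a hypothetical helper that covers only the FROM, SYSTEM, and PARAMETER directives, and it assumes the system prompt does not itself contain triple quotes.

```python
def render_modelfile(base, system=None, parameters=None):
    """Render Modelfile text from a base model, system prompt, and parameters."""
    lines = [f"FROM {base}"]
    if system:
        # Triple quotes allow multi-line prompts; the prompt must not contain them.
        lines.append(f'SYSTEM """{system}"""')
    for key, value in (parameters or {}).items():
        lines.append(f"PARAMETER {key} {value}")
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "qwen3:8b",
    system="You are an expert software engineer.",
    parameters={"temperature": 0.3, "num_ctx": 8192},
)
print(text)
```

Write the result to a file named Modelfile and pass it to `ollama create coding-assistant -f Modelfile` as shown above.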
Modelfile Directives Reference
| Directive | Example | Description |
|---|---|---|
| FROM | FROM qwen3:8b | Base model to build on (required) |
| SYSTEM | SYSTEM "You are a helpful assistant" | System prompt injected at the start of every conversation |
| PARAMETER temperature | PARAMETER temperature 0.7 | Randomness: 0 = deterministic, 1 = creative (default: 0.8) |
| PARAMETER num_ctx | PARAMETER num_ctx 8192 | Context window size in tokens (default: 2048 in older releases; recent releases use a larger default, so check your version) |
| PARAMETER num_predict | PARAMETER num_predict 1024 | Max tokens to generate per response (-1 = unlimited) |
| PARAMETER num_gpu | PARAMETER num_gpu 99 | GPU layers to use. 99 forces all layers to GPU. |
| PARAMETER top_p | PARAMETER top_p 0.9 | Nucleus sampling threshold (default: 0.9) |
| PARAMETER top_k | PARAMETER top_k 40 | Top-K sampling (default: 40) |
| PARAMETER repeat_penalty | PARAMETER repeat_penalty 1.1 | Penalize repeated tokens to reduce loops |
| PARAMETER stop | PARAMETER stop "\n\n" | Stop generation at this token sequence |
| TEMPLATE | TEMPLATE """{{ .Prompt }}""" | Override the prompt template (advanced) |
| MESSAGE | MESSAGE user "Hello" | Pre-populate conversation history |
Environment Variables Reference
Set these before starting the Ollama process or in the systemd service file on Linux. On macOS and Windows, set them as system or user environment variables before launching the Ollama app.
| Variable | Default | Example Value | Description |
|---|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | 0.0.0.0:11434 | Address Ollama listens on. Set to 0.0.0.0 to allow network access from other devices. |
| OLLAMA_MODELS | ~/.ollama/models | /mnt/storage/ollama | Custom path for model storage. Useful for moving models to a larger drive. |
| OLLAMA_NUM_PARALLEL | 1 | 2 | Number of requests each loaded model serves in parallel. Higher values use more memory, because context is allocated for every parallel slot. |
| OLLAMA_MAX_LOADED_MODELS | 3 per GPU (3 on CPU) | 2 | Maximum models kept loaded simultaneously, provided they fit in available memory. |
| OLLAMA_KEEP_ALIVE | 5m | -1 | How long to keep a model loaded after last use. 0 = unload immediately after each request. A negative value such as -1 = keep loaded indefinitely. |
| OLLAMA_FLASH_ATTENTION | 0 | 1 | Enable Flash Attention for faster inference and lower VRAM. Requires compatible GPU. |
| OLLAMA_GPU_OVERHEAD | 0 | 524288000 | Reserve VRAM (bytes) for OS/other use. Helps prevent OOM errors on tight VRAM budgets. |
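OLLAMA_GPU_OVERHEAD takes a raw byte count, which is easy to get wrong by hand. The example value 524288000 is 500 × 1024 × 1024, that is 500 MiB. A small sketch of the conversion, with hypothetical helper names:

```python
def mib_to_bytes(mib):
    """Convert mebibytes to bytes (1 MiB = 1024 * 1024 bytes)."""
    return int(mib * 1024 * 1024)


def gpu_overhead_setting(mib):
    """Format an OLLAMA_GPU_OVERHEAD assignment for a reserve given in MiB."""
    return f"OLLAMA_GPU_OVERHEAD={mib_to_bytes(mib)}"


print(gpu_overhead_setting(500))
print(gpu_overhead_setting(1024))
```

The first line reproduces the table's example value; the second shows what a 1 GiB reserve would look like.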
Setting Environment Variables on Linux (systemd)
Edit the Ollama systemd service override
sudo systemctl edit ollama
Add your variables inside the editor under a [Service] section:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/mnt/storage/ollama"
Environment="OLLAMA_KEEP_ALIVE=-1"
Reload systemd and restart Ollama to apply changes
sudo systemctl daemon-reload && sudo systemctl restart ollama
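When the same settings go onto several machines, the override text can be generated rather than typed. A minimal sketch; `render_override` is a hypothetical helper that only produces the text to paste into the `systemctl edit` editor and does not touch systemd itself.

```python
def render_override(env):
    """Render a systemd drop-in [Service] section from a dict of variables."""
    lines = ["[Service]"]
    for name, value in env.items():
        # systemd expects one quoted NAME=value pair per Environment= line.
        lines.append(f'Environment="{name}={value}"')
    return "\n".join(lines) + "\n"


override = render_override({
    "OLLAMA_HOST": "0.0.0.0:11434",
    "OLLAMA_MODELS": "/mnt/storage/ollama",
})
print(override)
```

After pasting the output, the daemon-reload and restart step above still applies.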
Useful One-Liners
Practical commands for common tasks. Combine with pipes and shell scripting to automate workflows.
Benchmark token speed for a model
ollama run qwen3:8b --verbose "write a 200 word essay about AI" 2>&1 | grep "eval rate"
Ask a one-shot question and get a plain text answer
ollama run qwen3:8b "what is 17 * 43?" --nowordwrap
Summarize a local file
cat report.txt | ollama run qwen3:8b "summarize this in 3 bullet points"
Translate a file to Spanish
cat document.txt | ollama run qwen3:8b "translate this to Spanish"
Check if Ollama API is up
curl -s http://localhost:11434/ | grep -i ollama
Pull multiple models in sequence
for m in qwen3:8b phi4:14b gemma3:12b; do ollama pull $m; done
Remove all models at once (careful: deletes everything)
ollama list | awk 'NR>1 {print $1}' | xargs -I{} ollama rm {}
Keep Ollama model loaded indefinitely (no idle unload)
OLLAMA_KEEP_ALIVE=-1 ollama serve
Force a model to use GPU layers (in Modelfile)
PARAMETER num_gpu 99
Run Ollama with a custom host and port
OLLAMA_HOST=0.0.0.0:11435 ollama serve
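The benchmark one-liner above greps the eval rate from verbose output. The same figure can be derived from a non-streaming API response, which reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds). A minimal sketch; the sample numbers are made up for illustration.

```python
def tokens_per_second(response):
    """Compute generation speed from an Ollama API response dict.

    eval_count is the number of generated tokens and eval_duration
    is the generation time in nanoseconds.
    """
    duration_ns = response.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return response.get("eval_count", 0) * 1e9 / duration_ns


# Illustrative values only: 120 tokens generated in 3 seconds.
sample = {"eval_count": 120, "eval_duration": 3_000_000_000}
print(round(tokens_per_second(sample), 1))
```

Prompt processing speed can be computed the same way from `prompt_eval_count` and `prompt_eval_duration`.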
Troubleshooting Commands
When Ollama is not working as expected, these commands help diagnose the issue. Start with ollama ps and the relevant GPU tool.
Check if Ollama service is running
systemctl status ollama
View Ollama logs (Linux)
journalctl -u ollama -n 50 --no-pager
View Ollama logs (macOS)
tail -n 50 ~/.ollama/logs/server.log
Restart Ollama service (Linux)
sudo systemctl restart ollama
Check NVIDIA GPU and VRAM
nvidia-smi
Check AMD GPU and ROCm
rocm-smi
Confirm GPU is used (check the PROCESSOR column)
ollama ps
Check Ollama version
ollama --version
Run model with verbose token speed output
ollama run qwen3:8b --verbose
Common Problems and Fixes
Generation is very slow (1-5 t/s)
The model is probably offloading to CPU because it does not fit in VRAM. Run "ollama ps" and check the PROCESSOR column. If it shows anything other than 100% GPU, part of the model is running on the CPU, so switch to a smaller model. For 8GB VRAM, use qwen3:8b instead of qwen3:14b. Check available VRAM with "nvidia-smi" or "rocm-smi".
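`ollama ps` reports placement in its PROCESSOR column with values such as `100% GPU`, `100% CPU`, or a split like `48%/52% CPU/GPU`. The sketch below turns such a value into a single GPU percentage so a script can flag partial offloading. `gpu_share` is a hypothetical helper, and the accepted formats are an assumption based on those three shapes.

```python
import re


def gpu_share(processor):
    """Return the GPU percentage from an `ollama ps` PROCESSOR value."""
    text = processor.strip()
    # Split placement, e.g. "48%/52% CPU/GPU": the second number is the GPU share.
    split = re.fullmatch(r"(\d+)%/(\d+)%\s+CPU/GPU", text)
    if split:
        return int(split.group(2))
    # Single placement, e.g. "100% GPU" or "100% CPU".
    single = re.fullmatch(r"(\d+)%\s+(CPU|GPU)", text)
    if single:
        return int(single.group(1)) if single.group(2) == "GPU" else 0
    raise ValueError(f"unrecognized PROCESSOR value: {processor!r}")


for value in ["100% GPU", "48%/52% CPU/GPU", "100% CPU"]:
    print(value, "->", gpu_share(value))
```

Anything below 100 means some layers run on the CPU, which is the usual cause of low token rates.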
Model not using GPU at all
Check that GPU drivers are installed (nvidia-smi or rocm-smi should return output). Restart the Ollama service after driver installation: sudo systemctl restart ollama on Linux. On AMD, confirm your user is in the render and video groups: groups $USER.
ollama: command not found
Ollama is installed but not in your PATH. On Linux it installs to /usr/local/bin. On macOS it may be in /usr/local/bin or /opt/homebrew/bin. Run "which ollama" or add the directory to your PATH in ~/.bashrc or ~/.zshrc.
API returns connection refused on localhost:11434
The Ollama service is not running. On Linux: sudo systemctl start ollama. On macOS: open the Ollama app from Applications or run "ollama serve" in a terminal to start the server manually.
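A script can make the same check before sending requests. The root endpoint answers with the plain text "Ollama is running", so a minimal probe looks like the sketch below; `is_ollama_up` and `looks_like_ollama` are hypothetical helper names, and the 2 second timeout is an arbitrary choice.

```python
import urllib.error
import urllib.request


def looks_like_ollama(body):
    """Check whether a root-endpoint response body is Ollama's status text."""
    return "ollama is running" in body.lower()


def is_ollama_up(base_url="http://localhost:11434", timeout=2):
    """Return True if an Ollama server answers at base_url, False otherwise."""
    try:
        with urllib.request.urlopen(base_url + "/", timeout=timeout) as resp:
            return looks_like_ollama(resp.read().decode("utf-8", "replace"))
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or DNS failure: treat as not running.
        return False


print(looks_like_ollama("Ollama is running"))
```

If `is_ollama_up()` returns False, start the service as described above and retry.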
Out of memory / model crashes on load
The model is too large for your VRAM. Use a smaller quantization (switch from q8_0 to q4_K_M) or a smaller model variant. Set OLLAMA_GPU_OVERHEAD to reserve some VRAM for the OS: export OLLAMA_GPU_OVERHEAD=524288000 (500 MiB).
Frequently Asked Questions
What is the ollama run command?
The "ollama run <model>" command starts an interactive chat session. If the model is not already downloaded, Ollama pulls it automatically. For example, "ollama run qwen3:8b" downloads the 8B Qwen3 model if needed, then opens a chat prompt. Type /bye or press Ctrl+D to exit; Ctrl+C only interrupts the current generation. You can also pipe input: echo "your question" | ollama run qwen3:8b for non-interactive use.
How do I list all downloaded Ollama models?
Run "ollama list" to see all models installed on your system, with size and last modified date. To see which models are currently loaded in VRAM, run "ollama ps". To remove a model and free disk space, run "ollama rm <model-name>".
How do I use Ollama as an API?
Ollama exposes a REST API on localhost:11434. Use /api/generate for single-turn completions or /api/chat for multi-turn conversations. Ollama also has an OpenAI-compatible endpoint at /v1/chat/completions, so you can use it with any OpenAI client library by setting base_url to http://localhost:11434/v1.
What is an Ollama Modelfile?
A Modelfile is a configuration file that defines a custom model. It uses FROM to set the base model, SYSTEM to set a system prompt, and PARAMETER directives to set temperature, context window, and GPU layers. Run "ollama create mymodel -f Modelfile" to register it, then "ollama run mymodel" to use it.
How do I keep an Ollama model loaded in memory?
Set the OLLAMA_KEEP_ALIVE environment variable. Use OLLAMA_KEEP_ALIVE=-1 to keep the model loaded indefinitely. Use OLLAMA_KEEP_ALIVE=0 to unload immediately after each request. The default is 5m (5 minutes). Set this before starting the Ollama service for it to take effect.
How do I check what Ollama models are running and using GPU?
Run "ollama ps" to see all currently loaded models. The output shows the model name, memory size, processor used, and time until unload. A PROCESSOR value of 100% GPU means the whole model runs on the GPU. A split such as 48%/52% CPU/GPU means partial CPU offloading, which significantly reduces generation speed.
Sources & methodology
Behaviour, file-format and runtime details on this page are pulled from primary upstream docs and community benchmark threads. The full sitewide methodology lives on the methodology page. For this guide I relied most on:
- Ollama. Ollama repo - canonical command list, Modelfile syntax and library contents.
- llama.cpp. Underlying inference engine the cheat-sheet's flags ultimately map to.
- Hugging Face Hub. Source for every base model the Ollama library re-packages as GGUF.
Spot a number that does not match the linked source? Email [email protected] and I will update the guide.