Ollama Commands Cheat Sheet: Complete CLI Reference (2026)

Editorial: AI handled the first sweep through `ollama --help`. The flag annotations are mine — the ones I genuinely forget and re-look up.

Updated May 2026 · Ollama CLI · API · Modelfile · Environment Variables

Every Ollama command in one place. Covers the full CLI, REST API usage with curl examples, Modelfile syntax for custom models, environment variable reference, and useful one-liners. Bookmark this page as your daily driver reference for running local LLMs with Ollama.

Quick Reference: All Ollama Commands

Command	What It Does	Example
ollama run <model>	Chat with a model (downloads if needed)	ollama run qwen3:8b
ollama pull <model>	Download a model without running it	ollama pull llama3.3:70b
ollama list	List all installed models	ollama list
ollama rm <model>	Remove a model and free disk space	ollama rm qwen3:14b
ollama show <model>	Show model info, parameters, Modelfile	ollama show qwen3:8b
ollama ps	Show running models and GPU usage	ollama ps
ollama stop <model>	Unload a model from memory	ollama stop qwen3:8b
ollama create <name> -f <file>	Create a custom model from a Modelfile	ollama create mybot -f Modelfile
ollama cp <src> <dst>	Copy a model to a new name	ollama cp qwen3:8b myqwen
ollama push <model>	Push a model to Ollama registry	ollama push myuser/mymodel

Model Download and Management

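Ollama stores models in ~/.ollama/models by default. On Linux installs that run Ollama as a systemd service, that directory belongs to the ollama user, typically /usr/share/ollama/.ollama/models. Models are downloaded automatically on the first ollama run, or manually with ollama pull. Use ollama rm to free disk space.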

Download a model without running it

ollama pull qwen3:8b

List all installed models (name, size, modified date)

ollama list

Remove a model to free disk space

ollama rm qwen3:14b

Show model details, parameters, and Modelfile

ollama show qwen3:8b

Show the full Modelfile for a model (includes its template and system prompt)

ollama show qwen3:8b --modelfile

Copy a model under a new local name

ollama cp qwen3:8b my-qwen

Check which models are loaded in VRAM right now

ollama ps

Unload a model from VRAM immediately

ollama stop qwen3:8b

Find all available models at ollama.com/library. Append a tag for a specific size or quantization, e.g. qwen3:8b, qwen3:14b, qwen3:8b-q8_0. A reference takes a single colon; size and quantization are joined with a hyphen inside the tag.
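If you script around model names, a small parser catches malformed references before they reach ollama pull. This is a minimal sketch that assumes short library names (no registry host or namespace); parse_model_ref is a hypothetical helper of mine, not part of Ollama.

```python
def parse_model_ref(ref: str) -> tuple[str, str]:
    """Split a short library reference such as 'qwen3:8b' into (name, tag)."""
    name, _, tag = ref.partition(":")
    if not name:
        raise ValueError(f"missing model name in {ref!r}")
    if ":" in tag:
        # Assumption: short references carry at most one colon.
        raise ValueError(f"more than one ':' in {ref!r}; join size and quantization with '-'")
    # Ollama falls back to the 'latest' tag when none is given.
    return (name, tag or "latest")


# Usage:
#   parse_model_ref("qwen3:8b")  returns ("qwen3", "8b")
#   parse_model_ref("qwen3")     returns ("qwen3", "latest")
```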

Popular Models and VRAM Requirements

All commands use default Q4_K_M quantization unless a tag specifies otherwise. VRAM figures are approximate for Q4_K_M at 4K context.

Ollama Command	VRAM Needed	Type	Notes
ollama run qwen3:8b	5.2 GB	Chat + thinking	Best 8B model, fast on 8GB VRAM
ollama run qwen3:14b	9.0 GB	Chat + thinking	Strong reasoning, fits 12GB VRAM
ollama run qwen3:32b	20 GB	Chat + thinking	Needs 24GB VRAM (RTX 4090, 3090)
ollama run qwen3:30b-a3b	20 GB	MoE (sparse)	All weights must fit in VRAM (~20 GB), but only ~3B parameters are active per token, so it generates at roughly 3B-model speed
ollama run phi4:14b	9.1 GB	Chat	Microsoft's Phi-4, strong reasoning
ollama run gemma3:12b	8.1 GB	Chat	Google's Gemma 3 12B
ollama run llama3.3:70b	42 GB	Chat	Needs dual-GPU or 48GB VRAM
ollama run deepseek-r1:14b	9.0 GB	Thinking/reasoning	DeepSeek R1 14B distill
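To turn the table into a quick "what fits my card" check, something like the following works. It is only a sketch: the figures are copied from the table above (Q4_K_M at 4K context), and the 1 GB headroom default is my own assumption for desktop and driver overhead, not an Ollama rule.

```python
# Approximate VRAM needs in GB, copied from the table above.
VRAM_GB = {
    "qwen3:8b": 5.2,
    "qwen3:14b": 9.0,
    "qwen3:32b": 20.0,
    "qwen3:30b-a3b": 20.0,
    "phi4:14b": 9.1,
    "gemma3:12b": 8.1,
    "llama3.3:70b": 42.0,
    "deepseek-r1:14b": 9.0,
}


def models_that_fit(vram_gb: float, headroom_gb: float = 1.0) -> list[str]:
    """Return table models whose approximate need fits in vram_gb minus headroom."""
    budget = vram_gb - headroom_gb  # headroom_gb is an assumed safety margin
    return sorted(name for name, need in VRAM_GB.items() if need <= budget)


# Usage: models_that_fit(8) leaves a 7 GB budget, so only qwen3:8b qualifies.
```

Longer context windows raise the real requirement, so treat the result as a first filter rather than a guarantee.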

Use the VRAM Calculator to check if a model fits your GPU at a given context length.

Chat and Inference Commands

The ollama run command opens an interactive prompt. Inside the chat session, several slash commands control behavior. You can also pipe input directly for non-interactive use.

Start interactive chat (downloads model if not present)

ollama run qwen3:8b

Run with verbose output: shows token speed and timing

ollama run qwen3:8b --verbose

Pipe a single question non-interactively

echo "explain quantum computing in 3 sentences" | ollama run qwen3:8b

Pass a file as context

cat myfile.txt | ollama run qwen3:8b "summarize this"
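The same pipe pattern can be driven from a script. Here is a rough Python equivalent using subprocess; it assumes the ollama binary is on your PATH, and the helper names and the 300 second timeout are mine.

```python
import subprocess


def build_run_command(model: str, prompt: str) -> list[str]:
    """Build the argv for a one-shot, non-interactive `ollama run`."""
    return ["ollama", "run", model, prompt]


def ask_about_text(model: str, prompt: str, text: str, timeout: float = 300) -> str:
    """Send `text` on stdin, like `cat file | ollama run <model> "<prompt>"`."""
    result = subprocess.run(
        build_run_command(model, prompt),
        input=text,
        capture_output=True,
        text=True,
        timeout=timeout,  # assumed limit; large models can take a while to load
        check=True,       # raise CalledProcessError if ollama exits non-zero
    )
    return result.stdout.strip()


# Usage (needs a local Ollama install with the model pulled):
#   with open("myfile.txt") as f:
#       print(ask_about_text("qwen3:8b", "summarize this", f.read()))
```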

In-Session Slash Commands

Command	What It Does
/bye	Exit the chat session
/clear	Clear the conversation context (start fresh)
/?	Show all available slash commands and help
/set parameter temperature 0.5	Change temperature mid-session
/set parameter num_ctx 8192	Change context window mid-session
/show info	Show model info and current parameters
/show modelfile	Show the full Modelfile for the running model
"""	Begin and end a multi-line message (wrap the text in triple quotes)
Ctrl+D	Exit the chat session (same as /bye)
Ctrl+C	Interrupt the current generation

API Usage: curl Examples

Ollama runs a local REST server at http://localhost:11434. Two main endpoints handle generation. Ollama also supports an OpenAI-compatible API, so you can drop it into any app that uses the OpenAI SDK.

/api/generate — single-turn completion

curl http://localhost:11434/api/generate \
  -d '{"model":"qwen3:8b","prompt":"What is the capital of France?","stream":false}'

/api/chat — multi-turn conversation with messages array

curl http://localhost:11434/api/chat \
  -d '{
    "model": "qwen3:8b",
    "messages": [
      {"role": "user", "content": "Why is the sky blue?"}
    ],
    "stream": false
  }'
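The /api/chat call above maps directly onto the Python standard library, so no extra packages are needed. This is a sketch assuming the default address and a non-streaming reply; build_chat_request and chat are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(model: str, messages: list[dict],
                       base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Prepare a non-streaming POST to /api/chat."""
    body = json.dumps({"model": model, "messages": messages, "stream": False})
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, messages: list[dict]) -> str:
    """Send the conversation and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(model, messages), timeout=120) as resp:
        reply = json.load(resp)
    return reply["message"]["content"]


# Multi-turn usage (needs a running Ollama server):
#   history = [{"role": "user", "content": "Why is the sky blue?"}]
#   answer = chat("qwen3:8b", history)
#   history.append({"role": "assistant", "content": answer})
#   history.append({"role": "user", "content": "Explain it to a child."})
#   print(chat("qwen3:8b", history))
```

The server keeps no conversation state between calls, which is why the whole history is sent each time.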

OpenAI-compatible endpoint — drop-in replacement for OpenAI API

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Python using the OpenAI library pointed at Ollama

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

List available models via API

curl http://localhost:11434/api/tags
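The /api/tags reply is JSON with a "models" array, each entry carrying a "name" field among others. A small sketch for pulling out just the names, assuming that shape and the default address:

```python
import json
import urllib.request


def model_names(tags_response: dict) -> list[str]:
    """Extract model names from a decoded /api/tags response."""
    return [entry["name"] for entry in tags_response.get("models", [])]


def list_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Fetch /api/tags from a running server and return the model names."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return model_names(json.load(resp))


# Usage (needs a running Ollama server):
#   print(list_models())
```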

Stream responses token-by-token (stream defaults to true)

curl http://localhost:11434/api/generate -d '{"model":"qwen3:8b","prompt":"Tell me a joke"}'
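With streaming on, /api/generate sends one JSON object per line; each carries a "response" fragment, and the last one has "done" set to true. A minimal sketch of stitching those fragments together (join_stream is a hypothetical helper):

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[bytes]) -> str:
    """Concatenate the 'response' fragments of a streamed /api/generate reply."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines, if any
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # final object; it holds timing stats rather than more text
    return "".join(parts)


# Usage (needs a running Ollama server); an HTTP response iterates line by line:
#   import urllib.request
#   body = json.dumps({"model": "qwen3:8b", "prompt": "Tell me a joke"}).encode()
#   req = urllib.request.Request("http://localhost:11434/api/generate", data=body)
#   with urllib.request.urlopen(req) as resp:
#       print(join_stream(resp))
```

To print tokens as they arrive instead of collecting them, print each fragment inside the loop with flush=True.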

Expose Ollama on the network

By default, Ollama only listens on localhost. To allow connections from other devices on your network, set OLLAMA_HOST=0.0.0.0:11434 before starting the service. Then connect from other machines using your host's IP address.

Modelfile Reference: Create Custom Models

A Modelfile lets you create a named custom model with a specific system prompt, temperature, and context size. Once created, it appears in ollama list and runs with ollama run mymodel.

Example Modelfile — coding assistant with 8K context

FROM qwen3:8b

SYSTEM """
You are an expert software engineer. Write clean, well-commented code.
Always explain what the code does and why you chose this approach.
"""

PARAMETER temperature 0.3
PARAMETER num_ctx 8192
PARAMETER num_predict 2048

Create and run the custom model

ollama create coding-assistant -f Modelfile

ollama run coding-assistant

Modelfile Directives Reference

Directive	Example	Description
FROM	FROM qwen3:8b	Base model to build on (required)
SYSTEM	SYSTEM "You are a helpful assistant"	System prompt injected at the start of every conversation
PARAMETER temperature	PARAMETER temperature 0.7	Randomness: 0 = deterministic, 1 = creative (default: 0.8)
PARAMETER num_ctx	PARAMETER num_ctx 8192	Context window size in tokens (default depends on the Ollama version: 2048 in older releases, larger in recent ones)
PARAMETER num_predict	PARAMETER num_predict 1024	Max tokens to generate per response (-1 = unlimited)
PARAMETER num_gpu	PARAMETER num_gpu 99	GPU layers to use. 99 forces all layers to GPU.
PARAMETER top_p	PARAMETER top_p 0.9	Nucleus sampling threshold (default: 0.9)
PARAMETER top_k	PARAMETER top_k 40	Top-K sampling (default: 40)
PARAMETER repeat_penalty	PARAMETER repeat_penalty 1.1	Penalize repeated tokens to reduce loops
PARAMETER stop	PARAMETER stop "\n\n"	Stop generation at this token sequence
TEMPLATE	TEMPLATE """{{ .Prompt }}"""	Override the prompt template (advanced)
MESSAGE	MESSAGE user "Hello"	Pre-populate conversation history
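When I keep several variants of the same base model, I find it easier to generate the Modelfile than to edit it by hand. A minimal sketch covering only FROM, SYSTEM, and PARAMETER; render_modelfile is a hypothetical helper, and it assumes the system prompt contains no triple quotes.

```python
from typing import Optional


def render_modelfile(base: str, system: str = "", params: Optional[dict] = None) -> str:
    """Render a Modelfile with FROM, an optional SYSTEM, and PARAMETER lines."""
    lines = [f"FROM {base}"]
    if system:
        # Triple quotes allow multi-line prompts, as in the example above.
        lines.append(f'SYSTEM """{system}"""')
    for key, value in (params or {}).items():
        lines.append(f"PARAMETER {key} {value}")
    return "\n".join(lines) + "\n"


# Usage:
#   text = render_modelfile("qwen3:8b", "You are an expert software engineer.",
#                           {"temperature": 0.3, "num_ctx": 8192})
#   with open("Modelfile", "w") as f:
#       f.write(text)
#   then: ollama create coding-assistant -f Modelfile
```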

Environment Variables Reference

Set these before starting the Ollama process or in the systemd service file on Linux. On macOS and Windows, set them as system or user environment variables before launching the Ollama app.

Variable	Default	Example Value	Description
OLLAMA_HOST	127.0.0.1:11434	0.0.0.0:11434	Address Ollama listens on. Set to 0.0.0.0 to allow network access from other devices.
OLLAMA_MODELS	~/.ollama/models	/mnt/storage/ollama	Custom path for model storage. Useful for moving models to a larger drive.
OLLAMA_NUM_PARALLEL	1	2	Number of parallel inference requests. Higher values share VRAM across requests.
OLLAMA_MAX_LOADED_MODELS	3 per GPU (3 on CPU) in recent releases	2	Maximum models kept loaded simultaneously, provided they fit in available memory.
OLLAMA_KEEP_ALIVE	5m	-1	How long to keep a model loaded after last use. -1 (any negative value) = never unload. 0 = unload immediately after each request. Durations such as 30m or 2h also work.
OLLAMA_FLASH_ATTENTION	0	1	Enable Flash Attention for faster inference and lower VRAM. Requires compatible GPU.
OLLAMA_GPU_OVERHEAD	0	524288000	Reserve VRAM (bytes) for OS/other use. Helps prevent OOM errors on tight VRAM budgets.
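OLLAMA_GPU_OVERHEAD takes raw bytes, which are easy to get wrong by a factor. A tiny conversion sketch using binary units (1 MiB = 1024 * 1024 bytes); the helper names are mine:

```python
MIB = 1024 * 1024  # bytes per mebibyte


def mib_to_bytes(mib: int) -> int:
    """Convert mebibytes to the byte count OLLAMA_GPU_OVERHEAD expects."""
    return mib * MIB


def bytes_to_mib(n_bytes: int) -> float:
    """Convert a byte count back to mebibytes for a sanity check."""
    return n_bytes / MIB


# The table's example value is exactly 500 MiB:
#   mib_to_bytes(500)  returns 524288000
#   mib_to_bytes(512)  returns 536870912
```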

Setting Environment Variables on Linux (systemd)

Edit the Ollama systemd service override

sudo systemctl edit ollama

Add your variables inside the editor under a [Service] section:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/mnt/storage/ollama"
Environment="OLLAMA_KEEP_ALIVE=-1"

Reload systemd and restart Ollama to apply changes

sudo systemctl daemon-reload && sudo systemctl restart ollama
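If you manage several machines, the override can be rendered from a dict rather than typed into the editor. A sketch only: render_override is a hypothetical helper, and the drop-in path in the usage note is what systemctl edit normally creates, so verify it on your distro.

```python
def render_override(env: dict) -> str:
    """Render a systemd drop-in with one Environment= line per variable."""
    lines = ["[Service]"]
    for name, value in env.items():
        lines.append(f'Environment="{name}={value}"')
    return "\n".join(lines) + "\n"


# Usage:
#   print(render_override({
#       "OLLAMA_HOST": "0.0.0.0:11434",
#       "OLLAMA_MODELS": "/mnt/storage/ollama",
#   }))
# Assumed target file: /etc/systemd/system/ollama.service.d/override.conf
```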

Useful One-Liners

Practical commands for common tasks. Combine with pipes and shell scripting to automate workflows.

Benchmark token speed for a model

ollama run qwen3:8b --verbose "write a 200 word essay about AI" 2>&1 | grep "eval rate"

Ask a one-shot question and get a plain text answer

ollama run qwen3:8b "what is 17 * 43?" --nowordwrap

Summarize a local file

cat report.txt | ollama run qwen3:8b "summarize this in 3 bullet points"

Translate a file to Spanish

cat document.txt | ollama run qwen3:8b "translate this to Spanish"

Check if Ollama API is up

curl -s http://localhost:11434/ | grep -i ollama

Pull multiple models in sequence

for m in qwen3:8b phi4:14b gemma3:12b; do ollama pull $m; done

Remove all models at once (careful: deletes everything)

ollama list | awk 'NR>1 {print $1}' | xargs -I{} ollama rm {}

Keep Ollama model loaded indefinitely (no idle unload)

OLLAMA_KEEP_ALIVE=-1 ollama serve

Force a model to use GPU layers (in Modelfile)

PARAMETER num_gpu 99

Run Ollama with a custom host and port

OLLAMA_HOST=0.0.0.0:11435 ollama serve
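The "remove all models" one-liner above works by skipping the header row of ollama list and taking the first column. The same parsing in Python makes it easy to filter before deleting anything. A sketch; the sample layout is illustrative and the IDs in it are placeholders.

```python
def installed_models(list_output: str) -> list[str]:
    """Return model names from `ollama list` text: first column, header skipped."""
    rows = list_output.strip().splitlines()[1:]  # drop the NAME/ID/SIZE/MODIFIED header
    return [row.split()[0] for row in rows if row.strip()]


# Illustrative input (placeholder IDs, not real digests):
SAMPLE = """NAME          ID              SIZE      MODIFIED
qwen3:8b      aaaaaaaaaaaa    5.2 GB    2 days ago
phi4:14b      bbbbbbbbbbbb    9.1 GB    3 weeks ago
"""

# Usage: keep qwen3 models and list everything else as removal candidates.
#   to_remove = [m for m in installed_models(SAMPLE) if not m.startswith("qwen3")]
```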

Troubleshooting Commands

When Ollama is not working as expected, these commands help diagnose the issue. Start with ollama ps and the relevant GPU tool.

Check if Ollama service is running

systemctl status ollama

View Ollama logs (Linux)

journalctl -u ollama -n 50 --no-pager

View Ollama logs (macOS)

tail -n 50 ~/.ollama/logs/server.log

Restart Ollama service (Linux)

sudo systemctl restart ollama

Check NVIDIA GPU and VRAM

nvidia-smi

Check AMD GPU and ROCm

rocm-smi

Confirm GPU is used (check the PROCESSOR column)

ollama ps

Check Ollama version

ollama --version

Run model with verbose token speed output

ollama run qwen3:8b --verbose

Common Problems and Fixes

Generation is very slow (1-5 t/s)

The model is offloading to CPU because it does not fit in VRAM. Run "ollama ps" and check the PROCESSOR column. If it shows anything other than "100% GPU" (for example "48%/52% CPU/GPU"), switch to a smaller model or a smaller context window. For 8GB VRAM, use qwen3:8b instead of qwen3:14b. Check available VRAM with "nvidia-smi" or "rocm-smi".
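For monitoring scripts, the CPU/GPU split that ollama ps prints can be reduced to a single number. This sketch assumes PROCESSOR values of the forms "100% GPU", "100% CPU", or "48%/52% CPU/GPU" as seen in recent releases; gpu_share is a hypothetical helper.

```python
def gpu_share(processor: str) -> int:
    """Return the GPU percentage from an `ollama ps` PROCESSOR value."""
    shares, _, kinds = processor.strip().partition(" ")
    percents = [int(part.rstrip("%")) for part in shares.split("/")]
    labels = kinds.split("/")
    # Pair each label with its percentage; a CPU-only model has no GPU entry.
    return dict(zip(labels, percents)).get("GPU", 0)


# Usage:
#   gpu_share("100% GPU")         returns 100
#   gpu_share("48%/52% CPU/GPU")  returns 52
#   gpu_share("100% CPU")         returns 0
```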

Model not using GPU at all

Check that GPU drivers are installed (nvidia-smi or rocm-smi should return output). Restart the Ollama service after driver installation: sudo systemctl restart ollama on Linux. On AMD, confirm your user is in the render and video groups: groups $USER.

ollama: command not found

Ollama is installed but not in your PATH. On Linux it installs to /usr/local/bin. On macOS it may be in /usr/local/bin or /opt/homebrew/bin. Run "which ollama" or add the directory to your PATH in ~/.bashrc or ~/.zshrc.

API returns connection refused on localhost:11434

The Ollama service is not running. On Linux: sudo systemctl start ollama. On macOS: open the Ollama app from Applications or run "ollama serve" in a terminal to start the server manually.
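A script can run the same check as the curl health one-liner before sending real requests. This sketch assumes the default address and that the root URL answers with a short text mentioning Ollama; the helper names and the 2 second timeout are mine.

```python
import urllib.error
import urllib.request


def looks_like_ollama(body: str) -> bool:
    """Mirror `grep -i ollama`: a case-insensitive substring check."""
    return "ollama" in body.lower()


def is_ollama_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if the server answers at base_url and identifies as Ollama."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return looks_like_ollama(resp.read().decode("utf-8", "replace"))
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure: treat all of them as "down".
        return False


# Usage:
#   if not is_ollama_up():
#       print("Start it with: sudo systemctl start ollama (or: ollama serve)")
```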

Out of memory / model crashes on load

The model is too large for your VRAM. Use a smaller quantization (switch from q8_0 to q4_K_M) or a smaller model variant. Set OLLAMA_GPU_OVERHEAD to reserve some VRAM for the OS: export OLLAMA_GPU_OVERHEAD=524288000 (500 MiB; the value is in bytes).

Frequently Asked Questions

What is the ollama run command?

The "ollama run <model>" command starts an interactive chat session. If the model is not already downloaded, Ollama pulls it automatically. For example, "ollama run qwen3:8b" downloads the 8B Qwen3 model if needed, then opens a chat prompt. Type /bye or press Ctrl+D to exit; Ctrl+C only interrupts the current generation. For non-interactive use, pipe input instead: echo "your question" | ollama run qwen3:8b.

How do I list all downloaded Ollama models?

Run "ollama list" to see all models installed on your system, with size and last modified date. To see which models are currently loaded in VRAM, run "ollama ps". To remove a model and free disk space, run "ollama rm <model-name>".

How do I use Ollama as an API?

Ollama exposes a REST API on localhost:11434. Use /api/generate for single-turn completions or /api/chat for multi-turn conversations. Ollama also has an OpenAI-compatible endpoint at /v1/chat/completions, so you can use it with any OpenAI client library by setting base_url to http://localhost:11434/v1.

What is an Ollama Modelfile?

A Modelfile is a configuration file that defines a custom model. It uses FROM to set the base model, SYSTEM to set a system prompt, and PARAMETER directives to set temperature, context window, and GPU layers. Run "ollama create mymodel -f Modelfile" to register it, then "ollama run mymodel" to use it.

How do I keep an Ollama model loaded in memory?

Set the OLLAMA_KEEP_ALIVE environment variable. Use OLLAMA_KEEP_ALIVE=-1 (or any negative value) to keep the model loaded indefinitely. Use OLLAMA_KEEP_ALIVE=0 to unload immediately after each request, freeing VRAM right away. You can also set a duration such as 30m or 2h. The default is 5m (5 minutes). Set this before starting the Ollama service for it to take effect.

How do I check what Ollama models are running and using GPU?

Run "ollama ps" to see all currently loaded models. The output shows the model name, ID, memory size, processor split, and time until unload. "100% GPU" in the PROCESSOR column means full GPU inference at maximum speed. A split such as "48%/52% CPU/GPU" means partial CPU offloading, which significantly reduces generation speed.

Sources & methodology

Behaviour, file-format and runtime details on this page are pulled from primary upstream docs and community benchmark threads. The full sitewide methodology lives on the methodology page. For this guide I relied most on:

Ollama. Ollama repo - canonical command list, Modelfile syntax and library contents.
llama.cpp. Underlying inference engine the cheat-sheet's flags ultimately map to.
Hugging Face Hub. Source for every base model the Ollama library re-packages as GGUF.

Spot a number that does not match the linked source? Email [email protected] and I will update the guide.