Ollama Commands Cheat Sheet: Complete CLI Reference (2026)

Editorial: AI handled the first sweep through `ollama --help`. The flag annotations are mine — the ones I genuinely forget and re-look up.

Updated May 2026 · Ollama CLI · API · Modelfile · Environment Variables

Every Ollama command in one place. Covers the full CLI, REST API usage with curl examples, Modelfile syntax for custom models, environment variable reference, and useful one-liners. Bookmark this page as your daily driver reference for running local LLMs with Ollama.

Quick Reference: All Ollama Commands

| Command | What It Does | Example |
| --- | --- | --- |
| ollama run <model> | Chat with a model (downloads if needed) | ollama run qwen3:8b |
| ollama pull <model> | Download a model without running it | ollama pull llama3.3:70b |
| ollama list | List all installed models | ollama list |
| ollama rm <model> | Remove a model and free disk space | ollama rm qwen3:14b |
| ollama show <model> | Show model info, parameters, Modelfile | ollama show qwen3:8b |
| ollama ps | Show running models and GPU usage | ollama ps |
| ollama stop <model> | Unload a model from memory | ollama stop qwen3:8b |
| ollama create <name> -f <file> | Create a custom model from a Modelfile | ollama create mybot -f Modelfile |
| ollama cp <src> <dst> | Copy a model to a new name | ollama cp qwen3:8b myqwen |
| ollama push <model> | Push a model to the Ollama registry | ollama push myuser/mymodel |

Model Download and Management

Ollama stores models in ~/.ollama/models by default (the Linux systemd install uses /usr/share/ollama/.ollama/models instead). Models are downloaded automatically on the first ollama run, or manually with ollama pull. Use ollama rm to free disk space.

Download a model without running it

ollama pull qwen3:8b

List all installed models (name, size, modified date)

ollama list

Remove a model to free disk space

ollama rm qwen3:14b

Show model details, parameters, and Modelfile

ollama show qwen3:8b

Show the full Modelfile for a model (use --template or --system to print only those parts)

ollama show qwen3:8b --modelfile

Copy a model under a new local name

ollama cp qwen3:8b my-qwen

Check which models are loaded in VRAM right now

ollama ps

Unload a model from VRAM immediately

ollama stop qwen3:8b

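Find all available models at ollama.com/library. Append a tag for a specific size or quantization, e.g. qwen3:8b, qwen3:14b, qwen3:8b-q8_0. A model reference takes exactly one colon: the quantization is part of the tag, joined with a hyphen.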
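
To make the name:tag convention concrete, here is a small sketch of how a client script might split a model reference. The function name `split_model_ref` is my own, and the rule that a missing tag falls back to `latest` mirrors how Ollama resolves untagged names. It is a simplification, not Ollama's actual parser.

```python
def split_model_ref(ref: str) -> tuple[str, str]:
    """Split 'name:tag' into (name, tag); a missing tag becomes 'latest'."""
    # Only the part after the last '/' can carry a tag, so a namespace
    # such as myuser/mymodel is kept as part of the name.
    head, sep, last = ref.rpartition("/")
    name, colon, tag = last.partition(":")
    if not name:
        raise ValueError(f"empty model name in {ref!r}")
    if ":" in tag:
        # A second colon (name:size:quant) is not a valid reference.
        raise ValueError(f"more than one ':' in {ref!r}")
    return head + sep + name, tag if colon and tag else "latest"
```

For example, `split_model_ref("qwen3:8b-q8_0")` yields the name `qwen3` and the tag `8b-q8_0`, while a bare `qwen3` resolves to the tag `latest`.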

Popular Models and VRAM Requirements

All commands use default Q4_K_M quantization unless a tag specifies otherwise. VRAM figures are approximate for Q4_K_M at 4K context.

| Ollama Command | VRAM Needed | Type | Notes |
| --- | --- | --- | --- |
| ollama run qwen3:8b | 5.2 GB | Chat + thinking | Strong 8B model, fast on 8GB VRAM |
| ollama run qwen3:14b | 9.0 GB | Chat + thinking | Strong reasoning, fits 12GB VRAM |
| ollama run qwen3:32b | 20 GB | Chat + thinking | Needs 24GB VRAM (RTX 4090, 3090) |
| ollama run qwen3:30b-a3b | 20 GB | MoE (sparse) | All weights sit in VRAM (20GB), but only ~3B parameters are active per token, so it runs at roughly 3B-class speed |
| ollama run phi4:14b | 9.1 GB | Chat | Microsoft's Phi-4, strong reasoning |
| ollama run gemma3:12b | 8.1 GB | Chat | Google's Gemma 3 12B |
| ollama run llama3.3:70b | 42 GB | Chat | Needs dual-GPU or 48GB VRAM |
| ollama run deepseek-r1:14b | 9.0 GB | Thinking/reasoning | DeepSeek R1 14B distill |

Use the VRAM Calculator to check if a model fits your GPU at a given context length.
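
As a quick offline check, the figures in the table can be turned into a lookup. This is only a sketch: the numbers are the approximate Q4_K_M values listed above, and the 1 GB default headroom for the desktop and context growth is my own assumption, not an Ollama rule.

```python
# Approximate VRAM needs in GB, copied from the table (Q4_K_M, 4K context).
VRAM_GB = {
    "qwen3:8b": 5.2,
    "qwen3:14b": 9.0,
    "qwen3:32b": 20.0,
    "qwen3:30b-a3b": 20.0,
    "phi4:14b": 9.1,
    "gemma3:12b": 8.1,
    "llama3.3:70b": 42.0,
    "deepseek-r1:14b": 9.0,
}

def models_that_fit(gpu_vram_gb: float, headroom_gb: float = 1.0) -> list[str]:
    """Return models that fit in gpu_vram_gb minus headroom, largest first."""
    budget = gpu_vram_gb - headroom_gb  # headroom_gb is an assumed safety margin
    fitting = [(need, name) for name, need in VRAM_GB.items() if need <= budget]
    # Sort by size descending, then by name so ties come out in a stable order.
    return [name for need, name in sorted(fitting, key=lambda p: (-p[0], p[1]))]
```

Calling `models_that_fit(12)` lists the 14B and 12B entries plus qwen3:8b. Lower `headroom_gb` if the GPU is not driving a display.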

Chat and Inference Commands

The ollama run command opens an interactive prompt. Inside the chat session, several slash commands control behavior. You can also pipe input directly for non-interactive use.

Start interactive chat (downloads model if not present)

ollama run qwen3:8b

Run with verbose output: shows token speed and timing

ollama run qwen3:8b --verbose

Pipe a single question non-interactively

echo "explain quantum computing in 3 sentences" | ollama run qwen3:8b

Pass a file as context

cat myfile.txt | ollama run qwen3:8b "summarize this"

In-Session Slash Commands

| Command | What It Does |
| --- | --- |
| /bye | Exit the chat session |
| /clear | Clear the conversation context (start fresh) |
| /? | Show all available slash commands and help |
| /set parameter temperature 0.5 | Change temperature mid-session |
| /set parameter num_ctx 8192 | Change context window mid-session |
| /show info | Show model info and current parameters |
| /show modelfile | Show the full Modelfile for the running model |
| """ | Start and end a multi-line message (wrap the block in triple quotes) |
| Ctrl+D | Exit the session (same as /bye) |
| Ctrl+C | Interrupt the current generation |

API Usage: curl Examples

Ollama runs a local REST server at http://localhost:11434. Two main endpoints handle generation. Ollama also supports an OpenAI-compatible API, so you can drop it into any app that uses the OpenAI SDK.

/api/generate — single-turn completion

curl http://localhost:11434/api/generate \
  -d '{"model":"qwen3:8b","prompt":"What is the capital of France?","stream":false}'

/api/chat — multi-turn conversation with messages array

curl http://localhost:11434/api/chat \
  -d '{
    "model": "qwen3:8b",
    "messages": [
      {"role": "user", "content": "Why is the sky blue?"}
    ],
    "stream": false
  }'
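
The same /api/chat call can be made from Python with only the standard library. The sketch below keeps the conversation history in a list and appends each reply, which is what makes the endpoint multi-turn. The helper names `build_chat_request` and `chat_turn` are mine. The `message` field of the non-streaming response is the documented reply object.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used throughout this guide

def build_chat_request(model: str, messages: list[dict]) -> urllib.request.Request:
    """Build a non-streaming POST to /api/chat without sending it."""
    body = json.dumps({"model": model, "messages": messages, "stream": False})
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat_turn(model: str, history: list[dict], user_text: str) -> list[dict]:
    """Send one user turn and return the history extended with the reply."""
    messages = history + [{"role": "user", "content": user_text}]
    with urllib.request.urlopen(build_chat_request(model, messages)) as resp:
        reply = json.load(resp)["message"]  # {"role": "assistant", "content": ...}
    return messages + [reply]
```

Start with `history = chat_turn("qwen3:8b", [], "Why is the sky blue?")` and pass the returned list back in for each follow-up question.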

OpenAI-compatible endpoint — drop-in replacement for OpenAI API

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Python using the OpenAI library pointed at Ollama

from openai import OpenAI

# The OpenAI client requires an api_key value, but Ollama ignores it; any placeholder works.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

List available models via API

curl http://localhost:11434/api/tags
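
The /api/tags response is a JSON object with a `models` array, where each entry carries a `name` and a `size` in bytes. A small sketch for turning that body into a size-sorted list (the function name is mine, and I use decimal GB for the conversion):

```python
import json

def summarize_tags(raw_json: str) -> list[tuple[str, float]]:
    """Return (model name, size in GB) pairs, largest first, from an /api/tags body."""
    models = json.loads(raw_json).get("models", [])
    rows = [(m["name"], round(m["size"] / 1e9, 1)) for m in models]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Feed it the text returned by the curl command above, for example by saving the output to a file and reading it back in.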

Stream responses token-by-token (stream defaults to true)

curl http://localhost:11434/api/generate -d '{"model":"qwen3:8b","prompt":"Tell me a joke"}'
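
When streaming, /api/generate sends one JSON object per line. Each object has a `response` fragment, and the last one has `done` set to true. Here is a minimal sketch of reassembling the text. The function name `collect_stream` is my own, and it works on any iterable of lines so it can be tried without a server.

```python
import json
from typing import Iterable

def collect_stream(lines: Iterable[bytes]) -> str:
    """Join the 'response' fragments of a streamed /api/generate reply."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines, if any
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of the reply
    return "".join(parts)
```

An HTTP response object from `urllib.request.urlopen` iterates line by line, so it can be passed straight to `collect_stream`. To print tokens as they arrive, print each fragment inside the loop instead of joining them at the end.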

Expose Ollama on the network

By default, Ollama only listens on localhost. To allow connections from other devices on your network, set OLLAMA_HOST=0.0.0.0:11434 before starting the service. Then connect from other machines using your host's IP address.

Modelfile Reference: Create Custom Models

A Modelfile lets you create a named custom model with a specific system prompt, temperature, and context size. Once created, it appears in ollama list and runs with ollama run mymodel.

Example Modelfile — coding assistant with 8K context

FROM qwen3:8b

SYSTEM """
You are an expert software engineer. Write clean, well-commented code.
Always explain what the code does and why you chose this approach.
"""

PARAMETER temperature 0.3
PARAMETER num_ctx 8192
PARAMETER num_predict 2048

Create and run the custom model

ollama create coding-assistant -f Modelfile
ollama run coding-assistant

Modelfile Directives Reference

| Directive | Example | Description |
| --- | --- | --- |
| FROM | FROM qwen3:8b | Base model to build on (required) |
| SYSTEM | SYSTEM "You are a helpful assistant" | System prompt injected at the start of every conversation |
| PARAMETER temperature | PARAMETER temperature 0.7 | Randomness: 0 = deterministic, 1 = creative (default: 0.8) |
| PARAMETER num_ctx | PARAMETER num_ctx 8192 | Context window size in tokens (default depends on the Ollama version: 2048 in older releases, 4096 in more recent ones) |
| PARAMETER num_predict | PARAMETER num_predict 1024 | Max tokens to generate per response (-1 = unlimited) |
| PARAMETER num_gpu | PARAMETER num_gpu 99 | GPU layers to use. 99 forces all layers to GPU. |
| PARAMETER top_p | PARAMETER top_p 0.9 | Nucleus sampling threshold (default: 0.9) |
| PARAMETER top_k | PARAMETER top_k 40 | Top-K sampling (default: 40) |
| PARAMETER repeat_penalty | PARAMETER repeat_penalty 1.1 | Penalize repeated tokens to reduce loops |
| PARAMETER stop | PARAMETER stop "\n\n" | Stop generation at this token sequence |
| TEMPLATE | TEMPLATE """{{ .Prompt }}""" | Override the prompt template (advanced) |
| MESSAGE | MESSAGE user "Hello" | Pre-populate conversation history |
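
If you maintain several custom models, generating the Modelfile from a script keeps them consistent. The sketch below renders only the FROM, SYSTEM and PARAMETER directives from the table. The function name `render_modelfile` is hypothetical, and it assumes the system prompt contains no triple quotes.

```python
from typing import Optional

def render_modelfile(base: str, system: str = "", params: Optional[dict] = None) -> str:
    """Render FROM, SYSTEM and PARAMETER lines as Modelfile text."""
    lines = [f"FROM {base}"]
    if system:
        # Triple quotes allow a multi-line system prompt.
        lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    for key, value in (params or {}).items():
        lines.append(f"PARAMETER {key} {value}")
    return "\n".join(lines) + "\n"
```

Write the returned text to a file named Modelfile, then register it with ollama create as shown earlier.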

Environment Variables Reference

Set these before starting the Ollama process or in the systemd service file on Linux. On macOS and Windows, set them as system or user environment variables before launching the Ollama app.

| Variable | Default | Example Value | Description |
| --- | --- | --- | --- |
| OLLAMA_HOST | 127.0.0.1:11434 | 0.0.0.0:11434 | Address Ollama listens on. Set to 0.0.0.0 to allow network access from other devices. |
| OLLAMA_MODELS | ~/.ollama/models | /mnt/storage/ollama | Custom path for model storage. Useful for moving models to a larger drive. |
| OLLAMA_NUM_PARALLEL | 1 | 2 | Number of parallel inference requests per model. Each parallel slot needs its own context memory, so higher values raise VRAM use. |
| OLLAMA_MAX_LOADED_MODELS | 3 × GPU count (3 for CPU-only) | 2 | Maximum models kept loaded simultaneously, provided they fit in memory. |
| OLLAMA_KEEP_ALIVE | 5m | -1 | How long to keep a model loaded after last use. Accepts a duration such as 10m or 24h. -1 (any negative value) = never unload. 0 = unload immediately. |
| OLLAMA_FLASH_ATTENTION | 0 | 1 | Enable Flash Attention for faster inference and lower VRAM. Requires compatible GPU. |
| OLLAMA_GPU_OVERHEAD | 0 | 524288000 | Reserve VRAM (bytes) for OS/other use; the example value is 500 MiB. Helps prevent OOM errors on tight VRAM budgets. |

Setting Environment Variables on Linux (systemd)

Edit the Ollama systemd service override

sudo systemctl edit ollama

Add your variables inside the editor under a [Service] section:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/mnt/storage/ollama"
Environment="OLLAMA_KEEP_ALIVE=-1"

Reload systemd and restart Ollama to apply changes

sudo systemctl daemon-reload && sudo systemctl restart ollama

Useful One-Liners

Practical commands for common tasks. Combine with pipes and shell scripting to automate workflows.

Benchmark token speed for a model

ollama run qwen3:8b --verbose "write a 200 word essay about AI" 2>&1 | grep "eval rate"
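
The same eval rate can be computed from the API. The final object of an /api/generate or /api/chat response includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds). A minimal sketch, with a function name of my own choosing:

```python
def tokens_per_second(stats: dict) -> float:
    """Generation speed from the final /api/generate or /api/chat object."""
    # eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    if not stats.get("eval_duration"):
        raise ValueError("response has no eval_duration (was it the final object?)")
    seconds = stats["eval_duration"] / 1e9
    return stats["eval_count"] / seconds
```

Pass in the parsed JSON of a non-streaming response, or the last object of a streamed one. The `prompt_eval_count` and `prompt_eval_duration` fields give the prompt-processing rate in the same way.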

Ask a one-shot question and get a plain text answer

ollama run qwen3:8b "what is 17 * 43?" --nowordwrap

Summarize a local file

cat report.txt | ollama run qwen3:8b "summarize this in 3 bullet points"

Translate a file to Spanish

cat document.txt | ollama run qwen3:8b "translate this to Spanish"

Check if Ollama API is up

curl -s http://localhost:11434/ | grep -i ollama

Pull multiple models in sequence

for m in qwen3:8b phi4:14b gemma3:12b; do ollama pull $m; done

Remove all models at once (careful: deletes everything)

ollama list | awk 'NR>1 {print $1}' | xargs -I{} ollama rm {}

Keep Ollama model loaded indefinitely (no idle unload)

OLLAMA_KEEP_ALIVE=-1 ollama serve

Force a model to use GPU layers (in Modelfile)

PARAMETER num_gpu 99

Run Ollama with a custom host and port

OLLAMA_HOST=0.0.0.0:11435 ollama serve

Troubleshooting Commands

When Ollama is not working as expected, these commands help diagnose the issue. Start with ollama ps and the relevant GPU tool.

Check if Ollama service is running

systemctl status ollama

View Ollama logs (Linux)

journalctl -u ollama -n 50 --no-pager

View Ollama logs (macOS)

tail -n 50 ~/.ollama/logs/server.log

Restart Ollama service (Linux)

sudo systemctl restart ollama

Check NVIDIA GPU and VRAM

nvidia-smi

Check AMD GPU and ROCm

rocm-smi

Confirm GPU is used (check the PROCESSOR column for "100% GPU")

ollama ps

Check Ollama version

ollama --version

Run model with verbose token speed output

ollama run qwen3:8b --verbose

Common Problems and Fixes

Generation is very slow (1-5 t/s)

The model is offloading to CPU because it does not fit in VRAM. Run "ollama ps" and check the PROCESSOR column. If it shows anything other than "100% GPU" (for example a CPU/GPU split), switch to a smaller model or a smaller quantization. For 8GB VRAM, use qwen3:8b instead of qwen3:14b. Check available VRAM with "nvidia-smi" or "rocm-smi".

Model not using GPU at all

Check that GPU drivers are installed (nvidia-smi or rocm-smi should return output). Restart the Ollama service after driver installation: sudo systemctl restart ollama on Linux. On AMD, confirm your user is in the render and video groups: groups $USER.

ollama: command not found

Ollama is installed but not in your PATH. On Linux it installs to /usr/local/bin. On macOS it may be in /usr/local/bin or /opt/homebrew/bin. Run "which ollama" to check; if it prints nothing, add the install directory to your PATH in ~/.bashrc or ~/.zshrc.

API returns connection refused on localhost:11434

The Ollama service is not running. On Linux: sudo systemctl start ollama. On macOS: open the Ollama app from Applications or run "ollama serve" in a terminal to start the server manually.

Out of memory / model crashes on load

The model is too large for your VRAM. Use a smaller quantization (switch from q8_0 to q4_K_M) or a smaller model variant. Set OLLAMA_GPU_OVERHEAD to reserve some VRAM for the OS: export OLLAMA_GPU_OVERHEAD=524288000 (500 MiB).

Frequently Asked Questions

What is the ollama run command?

The "ollama run <model>" command starts an interactive chat session. If the model is not already downloaded, Ollama pulls it automatically. For example, "ollama run qwen3:8b" downloads the 8B Qwen3 model if needed, then opens a chat prompt. Type /bye or press Ctrl+D to exit; Ctrl+C only interrupts the current generation. You can also pipe input: echo "your question" | ollama run qwen3:8b for non-interactive use.

How do I list all downloaded Ollama models?

Run "ollama list" to see all models installed on your system, with size and last modified date. To see which models are currently loaded in VRAM, run "ollama ps". To remove a model and free disk space, run "ollama rm <model-name>".

How do I use Ollama as an API?

Ollama exposes a REST API on localhost:11434. Use /api/generate for single-turn completions or /api/chat for multi-turn conversations. Ollama also has an OpenAI-compatible endpoint at /v1/chat/completions, so you can use it with any OpenAI client library by setting base_url to http://localhost:11434/v1.

What is an Ollama Modelfile?

A Modelfile is a configuration file that defines a custom model. It uses FROM to set the base model, SYSTEM to set a system prompt, and PARAMETER directives to set temperature, context window, and GPU layers. Run "ollama create mymodel -f Modelfile" to register it, then "ollama run mymodel" to use it.

How do I keep an Ollama model loaded in memory?

Set the OLLAMA_KEEP_ALIVE environment variable. Use OLLAMA_KEEP_ALIVE=-1 (any negative value) to keep the model loaded indefinitely. Use OLLAMA_KEEP_ALIVE=0 to unload immediately after each request. The default is 5m (5 minutes). Set this before starting the Ollama service for it to take effect.
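
Keep-alive can also be set per request: the /api/generate and /api/chat endpoints accept a `keep_alive` field, where -1 keeps the model loaded, 0 unloads it right away, and a string such as "10m" sets a duration. Sending a request with no prompt only loads (or unloads) the model. A sketch that builds such a request, with a helper name of my own:

```python
import json
import urllib.request

def build_keep_alive_request(model: str, keep_alive,
                             base_url: str = "http://localhost:11434"):
    """Build a prompt-less /api/generate call that only changes how long a model stays loaded."""
    # keep_alive: -1 = keep loaded, 0 = unload now, or a duration string like "10m".
    body = json.dumps({"model": model, "keep_alive": keep_alive})
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` with `keep_alive=-1` preloads the model and pins it; with `keep_alive=0` it frees the memory, much like ollama stop.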

How do I check what Ollama models are running and using GPU?

Run "ollama ps" to see all currently loaded models. The output shows the model name, memory size, processor used, and time until unload. A PROCESSOR value of "100% GPU" means the model runs entirely on the GPU. A CPU/GPU split means part of the model is offloaded to the CPU, which significantly reduces generation speed.
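
For scripts, the same information is available from the /api/ps endpoint, which reports each loaded model's total `size` and the portion held in VRAM as `size_vram` (both in bytes). A sketch of computing the GPU share per model; the function name is mine:

```python
import json

def gpu_fractions(raw_json: str) -> dict[str, float]:
    """Map each loaded model to the share of its memory held in VRAM (0.0 to 1.0)."""
    result = {}
    for m in json.loads(raw_json).get("models", []):
        total = m.get("size", 0)
        # Guard against a zero or missing size to avoid dividing by zero.
        result[m["name"]] = m.get("size_vram", 0) / total if total else 0.0
    return result
```

A value of 1.0 means the model is fully on the GPU. Anything lower means part of it is running on the CPU.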


Sources & methodology

Behaviour, file-format and runtime details on this page are pulled from primary upstream docs and community benchmark threads. The full sitewide methodology lives on the methodology page. For this guide I relied most on:

Spot a number that does not match the linked source? Email [email protected] and I will update the guide.