Best LLMs for Writing Locally: Creative, Content, and Editing (2026)

AI drafted, a human edited and pruned. The "feel" judgements are mine after running the models on my own prompts; the size + speed numbers come from the cited community benchmarks.

Updated May 2026 · Covers Llama 3.3, Mistral Small, Qwen3, Phi-4, Gemma 3

The best local LLM for writing depends on your task and your GPU. For creative writing: Llama 3.3 70B (48 GB) or Mistral Small 22B (16 GB). For content writing: Qwen3 14B Q8 (16 GB) or Phi-4 14B (12 GB). For editing: Qwen3 14B Q4_K_M (9 GB) is the sweet spot. This guide covers VRAM tiers, temperature settings, system prompt examples, and Open WebUI setup.

Not sure what your hardware can run? Use the VRAM Calculator, or read the best local LLMs guide for general-purpose picks.

TL;DR

Quick Picks by Use Case

Creative Writing

Llama 3.3 70B Q4

~40 GB VRAM

Best open-weights model for fiction, narrative, and roleplay. Rich character voice, strong pacing. Needs 48 GB.

Budget pick: Mistral Small 22B Q4 (13.5 GB) for 16 GB VRAM

ollama run llama3.3:70b

Content Writing

Qwen3 14B Q8

15 GB VRAM

Excellent instruction following for blog posts, essays, and marketing copy. Premium quality at 16 GB.

Budget pick: Phi-4 14B Q4 (8.8 GB) for structured content at 12 GB

ollama run qwen3:14b-q8_0

Editing

Qwen3 14B Q4_K_M

9 GB VRAM

Fast and precise at 9 GB. Strong instruction following for applying edits while preserving author voice.

Budget pick: Any 14B+ model works well for editing tasks

ollama run qwen3:14b-q4_K_M

Best Writing Models by VRAM Tier

VRAM | Model | Quant | VRAM Used | Best For | GPU Example
8 GB | Qwen3 8B | Q4_K_M | 4.9 GB | Fast drafting | RTX 4060
8 GB | Llama 3.1 8B | Q4_K_M | 5.2 GB | Compact drafting | RTX 4060
12 GB | Qwen3 14B | Q4_K_M | 9 GB | Best at 12 GB | RTX 4070
12 GB | Mistral 7B | Q8_0 | 7.7 GB | Quality at 12 GB | RTX 4070
16 GB | Qwen3 14B | Q8_0 | 15 GB | Premium content (16 GB) | RTX 4060 Ti 16GB
16 GB | Mistral Small 22B | Q4_K_M | 13.5 GB | Best creative at 16 GB | RTX 4060 Ti 16GB
24 GB | Qwen3 32B | Q4_K_M | 20 GB | Creative + long-form | RTX 4090
24 GB | DeepSeek R1 Distill 32B | Q4_K_M | 20 GB | Long-form reasoning | RTX 4090
48 GB | Llama 3.3 70B | Q4_K_M | ~40 GB | Best creative writing | Mac M4 Pro

VRAM figures include KV cache headroom at default context size. Use the VRAM Calculator for exact sizes with larger context.
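To see why larger context costs extra VRAM, here is a rough sketch of the KV cache arithmetic. It assumes the published Llama 3.1 8B architecture (32 layers, 8 KV heads, head dimension 128) and a 16-bit cache; the function name is mine, and real runtimes add some overhead on top of this estimate.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size: one key and one value vector per layer per token."""
    # The leading 2 covers the separate key and value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens


# Assumed architecture for Llama 3.1 8B: 32 layers, 8 KV heads, head dim 128.
for ctx in (4096, 8192, 16384):
    gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
    print(f"{ctx:>6} tokens -> {gib:.2f} GiB of KV cache")
```

Under these assumptions the cache doubles each time the context doubles, so a model that barely fits at the default context may not fit at 16384.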

Model Details and Recommendations

Llama 3.3 70B

48 GB · ~40 GB used · Q4_K_M · 14-20 tok/s on Mac M4 Pro

The best open-weights model for creative writing. Runs well on Mac M4 Pro / M4 Max (48-128 GB unified memory) or high-VRAM workstation GPUs.

Strengths

  • + Best creative writing quality locally
  • + Rich vocabulary and character voice
  • + Strong narrative pacing and structure
  • + Excellent for long-form fiction and roleplay

Limitations

  • - Requires 48 GB VRAM or unified memory
  • - Slow on consumer GPUs: at ~40 GB it does not fit on a single 24 GB card, so it has to be split across GPUs or partly offloaded to CPU

Mistral Small 22B

16 GB GPU · 13.5 GB used · Q4_K_M · 20-30 tok/s on RTX 4060 Ti 16GB

The best creative writing model for 16 GB GPUs. A significant step up from 14B models for fiction and storytelling.

Strengths

  • + Best creative writing model at 16 GB VRAM
  • + Excellent prose quality for its size
  • + Strong at dialogue and character voice
  • + Good instruction following for style direction

Limitations

  • - Needs a 16 GB GPU to fit comfortably
  • - Not as strong as 70B for long-form narrative

Qwen3 14B Q8

16 GB GPU · 15 GB used · Q8_0 · 20-30 tok/s on RTX 4060 Ti 16GB

The best content writing model at 16 GB. Choose Qwen3 14B Q8 for blog posts, marketing copy, and editing. Choose Mistral Small 22B for fiction.

Strengths

  • + Premium instruction-following quality
  • + Excellent for structured content (blogs, essays)
  • + Strong at editing with style preservation
  • + Just fits in 16 GB VRAM (15 GB used)

Limitations

  • - Stronger at content writing than at purely creative work
  • - Less raw creative imagination than Mistral Small 22B

Qwen3 32B

24 GB GPU · ~20 GB used · Q4_K_M · 25-35 tok/s on RTX 4090

A solid all-round writing model for 24 GB GPUs. Good for both creative and content writing with a large context window.

Strengths

  • + Good balance of creativity and instruction following
  • + Handles long-form documents well
  • + Strong context coherence across long pieces
  • + Better than 14B for complex narrative

Limitations

  • - Needs a 24 GB GPU (RTX 4090)
  • - Not as creatively strong as Llama 3.3 70B

Phi-4 14B

12 GB GPU · 8.8 GB used · Q4_K_M · 25-35 tok/s on RTX 4070

Punches above its weight for content writing and editing. Best 12 GB option if you prioritize structure and instruction following over raw creative quality.

Strengths

  • + Surprisingly strong for structured content at 8.8 GB
  • + Excellent instruction following
  • + Good at outlines, drafts, and rewrites
  • + Fast enough for interactive drafting

Limitations

  • - Less creatively expressive than Mistral models
  • - Limited by 14B parameter ceiling for complex prose

Llama 3.1 8B

8 GB GPU · 5.2 GB used · Q4_K_M · 35-50 tok/s on RTX 4060

The best option for 8 GB GPUs. Fast and capable enough for drafting, short content, and brainstorming. Upgrade to 14B+ for final quality.

Strengths

  • + Very fast for quick drafts
  • + Fits in 8 GB VRAM comfortably
  • + Good for brainstorming and short pieces
  • + Fast iteration for revision cycles

Limitations

  • - Limited depth for long-form writing
  • - Noticeably weaker prose quality than 14B+

Temperature Guide for Writing

Temperature controls how predictable vs. varied the model's output is. For writing tasks, this has a significant effect on quality. Set temperature in Ollama with PARAMETER temperature 0.8 in your Modelfile, or adjust it in Open WebUI's model settings.

Task | Temperature | Why
Fiction / stories | 0.8 – 0.9 | High variety, unexpected twists, vivid prose
Roleplay / dialogue | 0.7 – 0.85 | Natural character voice with some spontaneity
Blog posts / essays | 0.5 – 0.7 | Balanced between structure and natural flow
Marketing copy | 0.4 – 0.6 | On-message, predictable structure
Editing / proofreading | 0.3 – 0.5 | Precise, consistent, minimal hallucination
Summarization | 0.2 – 0.4 | Faithful to source, no creative additions
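For intuition about what these numbers do, here is a small sketch of temperature-scaled softmax, the standard way samplers turn token scores into probabilities. The logits are made-up values for four candidate tokens, not output from any real model, and the function name is mine.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for four candidate next tokens.
logits = [2.0, 1.0, 0.5, 0.0]

for t in (0.3, 0.6, 0.9):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

At low temperature nearly all the probability lands on the top-scoring token, which suits editing and summarization; at higher temperature the lower-ranked tokens get a real chance of being picked, which is where the variety in fiction comes from.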

System Prompt Examples and Modelfiles

A good system prompt often has a larger effect on writing quality than a step up in model size. These Modelfiles work with any model: change the FROM line to the model you run, then build the variant with ollama create my-model -f Modelfile.

Creative Fiction Writer

Best with Llama 3.3 70B or Mistral Small 22B. Temperature 0.8.

FROM llama3.3:70b

PARAMETER temperature 0.8
PARAMETER num_ctx 8192

SYSTEM """
You are a skilled fiction author. Write vivid, engaging prose
with strong character voice and sensory detail. Show rather than
tell. Vary sentence rhythm. Use specific, concrete details over
vague generalities. When asked to continue a scene, match the
established tone and style exactly.
"""

Professional Editor

Works well with any 14B+ model. Temperature 0.4.

FROM qwen3:14b

PARAMETER temperature 0.4
PARAMETER num_ctx 8192

SYSTEM """
You are a professional editor. When given text, improve clarity,
flow, and style while strictly preserving the author's voice and
intent. Do not add new content or ideas. Fix grammar, remove
redundancy, strengthen weak verbs, and tighten sentences.
Return only the revised text unless asked for comments.
"""

Web Content Writer

Best with Qwen3 14B Q8 or Phi-4 14B. Temperature 0.6.

FROM qwen3:14b

PARAMETER temperature 0.6
PARAMETER num_ctx 8192

SYSTEM """
You are an expert content writer. Write engaging, well-structured
content optimized for the web. Use clear headings, short paragraphs,
and active voice. Lead with the most important information. Include
specific facts and examples. Avoid filler phrases and jargon.
"""
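If you keep several personas, generating the Modelfiles from one template avoids copy-paste drift. The sketch below is only string formatting: render_modelfile is a hypothetical helper of mine, not part of Ollama, and it emits the same FROM / PARAMETER / SYSTEM layout as the three examples above.

```python
def render_modelfile(base_model: str, temperature: float,
                     num_ctx: int, system_prompt: str) -> str:
    """Build Modelfile text in the same layout as the examples in this guide."""
    if '"""' in system_prompt:
        # A triple quote inside the prompt would end the SYSTEM block early.
        raise ValueError('system prompt must not contain """')
    return (
        f"FROM {base_model}\n\n"
        f"PARAMETER temperature {temperature}\n"
        f"PARAMETER num_ctx {num_ctx}\n\n"
        f'SYSTEM """\n{system_prompt.strip()}\n"""\n'
    )


editor = render_modelfile(
    "qwen3:14b", 0.4, 8192,
    "You are a professional editor. Return only the revised text "
    "unless asked for comments.",
)
print(editor)
```

Save the printed text to a file named Modelfile, then build it with ollama create my-editor -f Modelfile as described above.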

Setting Up Open WebUI for Writing

Open WebUI is the best interface for writing with local LLMs. It supports long conversation history, system prompt presets, and per-model settings — all useful for writing workflows.

1. Set context size for long documents

Go to Admin Panel → Models → Edit your model. Set Context Length to 8192 or 16384 for long documents. This lets the model see more of your existing text when continuing a story or essay.

2. Create writing personas

In Settings → System Prompt, create presets for your writing modes (fiction, editor, content). Save each as a separate model variant so you can switch with one click.

3. Adjust temperature per session

In the chat interface, click the model name to access Advanced Parameters. Adjust temperature on the fly: raise it to 0.9 when brainstorming, drop it to 0.4 when editing.

4. Use conversation mode for continuity

Keep writing sessions in a single long conversation rather than starting new chats. This gives the model full context of previous exchanges and maintains consistency across a document.

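Install Open WebUI with Docker: docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main. The -v volume keeps your chats and settings when the container is recreated. Full guide: Ollama + Open WebUI Setup.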

Frequently Asked Questions

What is the best local LLM for creative writing in 2026?

For creative writing, Llama 3.3 70B is the best available locally. It produces rich narratives and strong character voice, but requires 48 GB of VRAM or unified memory (Mac M4 Pro / M4 Max). For 16 GB VRAM, Mistral Small 22B Q4 at 13.5 GB is the best creative writing model that fits; Qwen3 14B Q8 at 15 GB also fits and is excellent for structured content. For 12 GB, use Qwen3 14B Q4 (9 GB) for creative drafting. Set temperature to 0.7-0.9 for more variety and use a detailed system prompt for best results.

What temperature should I use for writing with a local LLM?

For creative writing (fiction, stories, roleplay), use temperature 0.7-0.9. Higher values produce more variety and surprise in prose but can sometimes drift off-topic. For editing and proofreading, use temperature 0.3-0.5 — lower values make the model more precise and consistent when applying corrections. For content writing (blog posts, essays), 0.5-0.7 is a good middle ground. In Ollama, set temperature in the Modelfile: PARAMETER temperature 0.8
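Temperature can also be set per request instead of per Modelfile. The sketch below builds a payload for Ollama's /api/generate endpoint with an options object; the preset values are my picks from inside the ranges above, and it assumes Ollama is listening on its default port 11434.

```python
import json
import urllib.request

# One value from inside each range in this guide (my choice, adjust freely).
TASK_TEMPERATURE = {"fiction": 0.8, "blog": 0.6, "editing": 0.4, "summary": 0.3}


def build_payload(model: str, task: str, prompt: str, num_ctx: int = 8192) -> dict:
    """Assemble a non-streaming generate request with per-task sampling options."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": TASK_TEMPERATURE[task], "num_ctx": num_ctx},
    }


def generate(payload: dict, host: str = "http://localhost:11434") -> str:
    """POST the payload to a running Ollama server and return the generated text."""
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


payload = build_payload("qwen3:14b", "editing", "Tighten this sentence: ...")
print(json.dumps(payload, indent=2))
# With Ollama running and the model pulled: print(generate(payload))
```

This keeps one base model on disk and switches behavior per task, rather than creating a separate Modelfile variant for each temperature.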

Can a local LLM replace ChatGPT for writing?

For most writing tasks, yes, especially with a 14B or larger model. Qwen3 14B Q8 and Mistral Small 22B are competitive with GPT-3.5 for content writing and editing. Llama 3.3 70B approaches GPT-4 quality for creative writing. The main advantages of local models are privacy (your writing never leaves your machine), no usage limits, and the ability to customize system prompts via Modelfile. The main limitation is speed: a 14B model on a 16 GB GPU runs at 20-35 tok/s, while cloud APIs feel near-instant.

What VRAM do I need for a writing LLM?

For basic writing assistance, 8 GB VRAM runs Qwen3 8B Q4 (4.9 GB) or Llama 3.1 8B Q4 (5.2 GB), both fast and decent for drafting. For better quality, 12-16 GB VRAM is the sweet spot: Qwen3 14B Q4 (9 GB) fits a 12 GB card, while Qwen3 14B Q8 (15 GB) and Mistral Small 22B Q4 (13.5 GB) need 16 GB and produce noticeably richer output. For the best creative writing locally, 48 GB unified memory (Mac M4 Pro) runs Llama 3.3 70B, which is the top open-weights model for fiction and narrative.
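The same lookup can be written as a few lines of code. This is a sketch using a handful of the VRAM figures quoted in this guide; the headroom_gb parameter is my addition for people who plan to raise the context size beyond the default.

```python
# (model, quant, VRAM used in GB), figures as quoted in this guide.
MODELS = [
    ("Llama 3.1 8B", "Q4_K_M", 5.2),
    ("Phi-4 14B", "Q4_K_M", 8.8),
    ("Qwen3 14B", "Q4_K_M", 9.0),
    ("Mistral Small 22B", "Q4_K_M", 13.5),
    ("Qwen3 14B", "Q8_0", 15.0),
    ("Qwen3 32B", "Q4_K_M", 20.0),
    ("Llama 3.3 70B", "Q4_K_M", 40.0),
]


def models_that_fit(vram_gb: float, headroom_gb: float = 0.0):
    """Return every listed model whose footprint plus extra headroom fits."""
    return [(name, quant, used) for name, quant, used in MODELS
            if used + headroom_gb <= vram_gb]


for vram in (8, 12, 16, 24, 48):
    names = [f"{name} {quant}" for name, quant, _ in models_that_fit(vram)]
    print(f"{vram} GB: {', '.join(names)}")
```

The largest model that fits is usually the one to try first; pass headroom_gb=2 or so if you intend to run 16384 tokens of context or more.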

How do I set up a system prompt for writing in Ollama?

Create a Modelfile with the FROM and SYSTEM directives. For example: FROM llama3.3:70b and then SYSTEM followed by your prompt. Run "ollama create my-writer -f Modelfile" to create the custom model, then "ollama run my-writer" to use it. You can also set system prompts directly in Open WebUI under Model Settings, which lets you switch between writing personas without creating separate Modelfiles.

Is Llama 3.3 70B better than Mistral Small 22B for creative writing?

Yes, Llama 3.3 70B is significantly better for creative writing — richer vocabulary, stronger character voice, better pacing and narrative structure. But it requires 48 GB of VRAM or unified memory. Mistral Small 22B Q4 at 13.5 GB is the best creative writing model that fits in 16 GB VRAM, and it is genuinely good — the gap is real but not enormous for shorter pieces. If you have a Mac M4 Pro or M4 Max with 48 GB+ RAM, Llama 3.3 70B is the clear choice.

What context size should I use for writing long documents?

For long documents (novels, long essays, extended stories), set context size to at least 8192 tokens. In llama.cpp use --ctx-size 8192. In Ollama Modelfile use PARAMETER num_ctx 8192. Larger context (16384 or 32768) helps the model maintain consistency across long documents but requires more VRAM. As a rule of thumb: 8192 tokens covers about 6000 words, 16384 covers about 12000 words. For short pieces, the default 2048-4096 is fine.
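The rule of thumb above works out to roughly 0.75 words per token. Here is a small sketch of that arithmetic; the ratio is an assumed average for English prose and varies by tokenizer and language, and both function names are mine.

```python
import math

WORDS_PER_TOKEN = 0.75  # assumed average for English prose; varies by tokenizer


def words_that_fit(num_ctx: int, reserved_for_output: int = 0) -> int:
    """Approximate how many words of existing text fit in a context window."""
    return int((num_ctx - reserved_for_output) * WORDS_PER_TOKEN)


def smallest_context_for(words: int, sizes=(2048, 4096, 8192, 16384, 32768)):
    """Pick the smallest listed context size that can hold the given word count."""
    needed_tokens = math.ceil(words / WORDS_PER_TOKEN)
    for size in sizes:
        if size >= needed_tokens:
            return size
    return None  # the text is longer than the largest size considered


print(words_that_fit(8192))        # 6144
print(words_that_fit(16384))       # 12288
print(smallest_context_for(5000))  # 8192
```

Remember that the model's reply also has to fit in the window, so pass reserved_for_output (for example 1024) when the draft is close to the limit.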


Sources & methodology

Model parameter counts, context lengths and the VRAM estimates above come from a mix of official model cards and open benchmarks. The full sitewide methodology is documented on the methodology page. The three sources that did most of the work for this guide:

Spot a number that does not match the linked source? Email [email protected] and I will update the guide.