Best LLMs for Writing Locally: Creative, Content, and Editing (2026)
AI drafted, a human edited and pruned. The "feel" judgements are mine after running the models on my own prompts; the size + speed numbers come from the cited community benchmarks.
Updated May 2026 · Covers Llama 3.3, Mistral Small, Qwen3, Phi-4, Gemma 3
The best local LLM for writing depends on your task and your GPU. For creative writing: Llama 3.3 70B (48 GB) or Mistral Small 22B (16 GB). For content writing: Qwen3 14B Q8 (16 GB) or Phi-4 14B (12 GB). For editing: Qwen3 14B Q4_K_M (9 GB) is the sweet spot. This guide covers VRAM tiers, temperature settings, system prompt examples, and Open WebUI setup.
Not sure what your hardware can run? Use the VRAM Calculator, or read the best local LLMs guide for general-purpose picks.
TL;DR
- Best creative writing: Llama 3.3 70B Q4 (48 GB) — Mistral Small 22B Q4 (13.5 GB) for 16 GB GPUs
- Best content writing: Qwen3 14B Q8 (15 GB, 16 GB GPU) — Phi-4 14B Q4 (8.8 GB, 12 GB GPU)
- Best editing: Qwen3 14B Q4_K_M (9 GB) — fast, precise, preserves voice
- 8 GB GPU: Qwen3 7B Q4 (4.9 GB) or Llama 3.1 8B Q4 (5.2 GB) for quick drafts
- Temperature: 0.7-0.9 for creative, 0.3-0.5 for editing — system prompt matters more than model size
Quick Picks by Use Case
Creative Writing
Llama 3.3 70B Q4
~40 GB VRAM
Best open-weights model for fiction, narrative, and roleplay. Rich character voice, strong pacing. Needs 48 GB.
Budget pick: Mistral Small 22B Q4 (13.5 GB) for 16 GB VRAM
ollama run llama3.3:70b Content Writing
Qwen3 14B Q8
15 GB VRAM
Excellent instruction following for blog posts, essays, and marketing copy. Premium quality at 16 GB.
Budget pick: Phi-4 14B Q4 (8.8 GB) for structured content at 12 GB
ollama run qwen3:14b Editing
Qwen3 14B Q4_K_M
9 GB VRAM
Fast and precise at 9 GB. Strong instruction following for applying edits while preserving author voice.
Budget pick: Any 14B+ model works well for editing tasks
ollama run qwen3:14b:q4_k_m Best Writing Models by VRAM Tier
| VRAM | Model | Quant | VRAM Used | Best For | GPU Example |
|---|---|---|---|---|---|
| 8 GB | Qwen3 7B | Q4_K_M | 4.9 GB | Fast drafting | RTX 4060 |
| 8 GB | Llama 3.1 8B | Q4_K_M | 5.2 GB | Compact drafting | RTX 4060 |
| 12 GB | Qwen3 14B | Q4_K_M | 9 GB | Best at 12 GB | RTX 4070 |
| 12 GB | Mistral 7B | Q8_0 | 7.7 GB | Quality at 12 GB | RTX 4070 |
| 16 GB | Qwen3 14B | Q8_0 | 15 GB | Premium content (16 GB) | RTX 4060 Ti 16GB |
| 16 GB | Mistral Small 22B | Q4_K_M | 13.5 GB | Best creative at 16 GB | RTX 4060 Ti 16GB |
| 24 GB | Qwen3 32B | Q4_K_M | 20 GB | Creative + long-form | RTX 4090 |
| 24 GB | DeepSeek R1 Distill 32B | Q4_K_M | 20 GB | Long-form reasoning | RTX 4090 |
| 48 GB | Llama 3.3 70B | Q4_K_M | ~40 GB | Best creative writing | Mac M4 Pro |
VRAM figures include KV cache headroom at default context size. Use the VRAM Calculator for exact sizes with larger context.
Model Details and Recommendations
Llama 3.3 70B
48 GB ~40 GB Q4_K_M 14-20 tok/s on Mac M4 ProThe best open-weights model for creative writing. Runs well on Mac M4 Pro / M4 Max (48-128 GB unified memory) or high-VRAM workstation GPUs.
Strengths
- + Best creative writing quality locally
- + Rich vocabulary and character voice
- + Strong narrative pacing and structure
- + Excellent for long-form fiction and roleplay
Limitations
- - Requires 48 GB VRAM or unified memory
- - Slow on consumer GPU hardware without NVLink
Mistral Small 22B
16 GB GPU 13.5 GB Q4_K_M 20-30 tok/s on RTX 4060 Ti 16GBThe best creative writing model for 16 GB GPUs. A significant step up from 14B models for fiction and storytelling.
Strengths
- + Best creative writing model at 16 GB VRAM
- + Excellent prose quality for its size
- + Strong at dialogue and character voice
- + Good instruction following for style direction
Limitations
- - Needs a 16 GB GPU to fit comfortably
- - Not as strong as 70B for long-form narrative
Qwen3 14B Q8
16 GB GPU 15 GB Q8_0 20-30 tok/s on RTX 4060 Ti 16GBThe best content writing model at 16 GB. Choose Qwen3 14B Q8 for blog posts, marketing copy, and editing. Choose Mistral Small 22B for fiction.
Strengths
- + Premium instruction-following quality
- + Excellent for structured content (blogs, essays)
- + Strong at editing with style preservation
- + Fits precisely at 16 GB VRAM
Limitations
- - Content writing strength over pure creative
- - Less raw creative imagination than Mistral Small 22B
Qwen3 32B
24 GB GPU ~20 GB Q4_K_M 25-35 tok/s on RTX 4090A solid all-round writing model for 24 GB GPUs. Good for both creative and content writing with a large context window.
Strengths
- + Good balance of creativity and instruction following
- + Handles long-form documents well
- + Strong context coherence across long pieces
- + Better than 14B for complex narrative
Limitations
- - Needs a 24 GB GPU (RTX 4090)
- - Not as creatively strong as Llama 3.3 70B
Phi-4 14B
12 GB GPU 8.8 GB Q4_K_M 25-35 tok/s on RTX 4070Punches above its weight for content writing and editing. Best 12 GB option if you prioritize structure and instruction following over raw creative quality.
Strengths
- + Surprisingly strong for structured content at 8.8 GB
- + Excellent instruction following
- + Good at outlines, drafts, and rewrites
- + Fast enough for interactive drafting
Limitations
- - Less creatively expressive than Mistral models
- - Limited by 14B parameter ceiling for complex prose
Llama 3.1 8B
8 GB GPU 5.2 GB Q4_K_M 35-50 tok/s on RTX 4060The best option for 8 GB GPUs. Fast and capable enough for drafting, short content, and brainstorming. Upgrade to 14B+ for final quality.
Strengths
- + Very fast for quick drafts
- + Fits in 8 GB VRAM comfortably
- + Good for brainstorming and short pieces
- + Fast iteration for revision cycles
Limitations
- - Limited depth for long-form writing
- - Noticeably weaker prose quality than 14B+
Temperature Guide for Writing
Temperature controls how predictable vs. varied the model's output is. For writing tasks, this has a significant effect on quality. Set temperature in Ollama with PARAMETER temperature 0.8 in your Modelfile, or adjust it in Open WebUI's model settings.
| Task | Temperature | Why |
|---|---|---|
| Fiction / stories | 0.8 – 0.9 | High variety, unexpected twists, vivid prose |
| Roleplay / dialogue | 0.7 – 0.85 | Natural character voice with some spontaneity |
| Blog posts / essays | 0.5 – 0.7 | Balanced between structure and natural flow |
| Marketing copy | 0.4 – 0.6 | On-message, predictable structure |
| Editing / proofreading | 0.3 – 0.5 | Precise, consistent, minimal hallucination |
| Summarization | 0.2 – 0.4 | Faithful to source, no creative additions |
System Prompt Examples and Modelfiles
A good system prompt has a larger effect on writing quality than model size. These Modelfiles can be used with any model — create them with ollama create my-model -f Modelfile.
Creative Fiction Writer
Best with Llama 3.3 70B or Mistral Small 22B. Temperature 0.8.
FROM llama3.3:70b PARAMETER temperature 0.8 PARAMETER num_ctx 8192 SYSTEM """ You are a skilled fiction author. Write vivid, engaging prose with strong character voice and sensory detail. Show rather than tell. Vary sentence rhythm. Use specific, concrete details over vague generalities. When asked to continue a scene, match the established tone and style exactly. """
Professional Editor
Works well with any 14B+ model. Temperature 0.4.
FROM qwen3:14b PARAMETER temperature 0.4 PARAMETER num_ctx 8192 SYSTEM """ You are a professional editor. When given text, improve clarity, flow, and style while strictly preserving the author's voice and intent. Do not add new content or ideas. Fix grammar, remove redundancy, strengthen weak verbs, and tighten sentences. Return only the revised text unless asked for comments. """
Web Content Writer
Best with Qwen3 14B Q8 or Phi-4 14B. Temperature 0.6.
FROM qwen3:14b PARAMETER temperature 0.6 PARAMETER num_ctx 8192 SYSTEM """ You are an expert content writer. Write engaging, well-structured content optimized for the web. Use clear headings, short paragraphs, and active voice. Lead with the most important information. Include specific facts and examples. Avoid filler phrases and jargon. """
Setting Up Open WebUI for Writing
Open WebUI is the best interface for writing with local LLMs. It supports long conversation history, system prompt presets, and per-model settings — all useful for writing workflows.
1. Set context size for long documents
Go to Admin Panel → Models → Edit your model. Set Context Length to 8192 or 16384 for long documents. This lets the model see more of your existing text when continuing a story or essay.
2. Create writing personas
In Settings → System Prompt, create presets for your writing modes (fiction, editor, content). Save each as a separate model variant so you can switch with one click.
3. Adjust temperature per session
In the chat interface, click the model name to access Advanced Parameters. Adjust temperature on-the-fly: bump it up to 0.9 when brainstorming, drop to 0.4 when editing.
4. Use conversation mode for continuity
Keep writing sessions in a single long conversation rather than starting new chats. This gives the model full context of previous exchanges and maintains consistency across a document.
Install Open WebUI with Docker: docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway ghcr.io/open-webui/open-webui:main. Full guide: Ollama + Open WebUI Setup.
Frequently Asked Questions
What is the best local LLM for creative writing in 2026?
For creative writing, Llama 3.3 70B is the best available locally — it produces rich narratives and strong character voice, but requires 48 GB of VRAM or unified memory (Mac M4 Pro / M4 Max). For 16 GB VRAM, Mistral Small 22B Q4 at 13.5 GB is the best creative writing model that fits. For 12 GB, Qwen3 14B Q8 at 15 GB is excellent for structured content, or use Qwen3 14B Q4 (9 GB) for creative drafting. Set temperature 0.7-0.9 for more variety and a detailed system prompt for best results.
What temperature should I use for writing with a local LLM?
For creative writing (fiction, stories, roleplay), use temperature 0.7-0.9. Higher values produce more variety and surprise in prose but can sometimes drift off-topic. For editing and proofreading, use temperature 0.3-0.5 — lower values make the model more precise and consistent when applying corrections. For content writing (blog posts, essays), 0.5-0.7 is a good middle ground. In Ollama, set temperature in the Modelfile: PARAMETER temperature 0.8
Can a local LLM replace ChatGPT for writing?
For most writing tasks, yes — especially with a 14B or larger model. Qwen3 14B Q8 and Mistral Small 22B are competitive with GPT-3.5 for content writing and editing. Llama 3.3 70B approaches GPT-4 quality for creative writing. The main advantages of local models are privacy (your writing never leaves your machine), no usage limits, and the ability to fine-tune system prompts via Modelfile. The main limitation is speed — a 14B model on a 16 GB GPU runs at 20-35 tok/s vs near-instant cloud APIs.
What VRAM do I need for a writing LLM?
For basic writing assistance, 8 GB VRAM runs Qwen3 7B Q4 (4.9 GB) or Llama 3.1 8B Q4 (5.2 GB) — fast and decent for drafting. For better quality, 12-16 GB VRAM is the sweet spot: Qwen3 14B Q8 (15 GB) or Mistral Small 22B Q4 (13.5 GB) produce noticeably richer output. For the best creative writing locally, 48 GB unified memory (Mac M4 Pro) runs Llama 3.3 70B, which is the top open-weights model for fiction and narrative.
How do I set up a system prompt for writing in Ollama?
Create a Modelfile with the FROM and SYSTEM directives. For example: FROM llama3.3:70b and then SYSTEM followed by your prompt. Run "ollama create my-writer -f Modelfile" to create the custom model, then "ollama run my-writer" to use it. You can also set system prompts directly in Open WebUI under Model Settings, which lets you switch between writing personas without creating separate Modelfiles.
Is Llama 3.3 70B better than Mistral Small 22B for creative writing?
Yes, Llama 3.3 70B is significantly better for creative writing — richer vocabulary, stronger character voice, better pacing and narrative structure. But it requires 48 GB of VRAM or unified memory. Mistral Small 22B Q4 at 13.5 GB is the best creative writing model that fits in 16 GB VRAM, and it is genuinely good — the gap is real but not enormous for shorter pieces. If you have a Mac M4 Pro or M4 Max with 48 GB+ RAM, Llama 3.3 70B is the clear choice.
What context size should I use for writing long documents?
For long documents (novels, long essays, extended stories), set context size to at least 8192 tokens. In llama.cpp use --ctx-size 8192. In Ollama Modelfile use PARAMETER num_ctx 8192. Larger context (16384 or 32768) helps the model maintain consistency across long documents but requires more VRAM. As a rule of thumb: 8192 tokens covers about 6000 words, 16384 covers about 12000 words. For short pieces, the default 2048-4096 is fine.
Related Guides
Popular hardware for local LLMs
Know which model you want? Check exact VRAM requirements or find the right GPU.
Sources & methodology
Model parameter counts, context lengths and the VRAM estimates above come from a mix of official model cards and open benchmarks. The full sitewide methodology is documented on the methodology page. The three sources that did most of the work for this guide:
- Hugging Face Hub. Source model cards for the creative-writing models (Mistral, Qwen, Llama, Gemma) we compare.
- LM Studio. The frontend most writers use, plus its built-in quant picker we reference.
- Modal: How much VRAM do I need for LLM inference. VRAM-per-parameter math behind the 'fits in 8 / 12 / 16 GB' calls.
Spot a number that does not match the linked source? Email [email protected] and I will update the guide.