Creative writing is the workload where model choice matters most and is measured least. There are real differences in how these three models handle prose rhythm, voice retention, and the small mechanical tells that mark a paragraph as AI-generated — and almost none of them show up on public leaderboards. The short version: Claude Opus 4.7 is the default pick for prose quality and voice retention. GPT-5 is the right call for structured long-form — anything organized into sections, character arcs, or scene-by-scene plotting, where adherence to an outline matters more than sentence-level texture. Gemini 2.5 Pro wins where its 2M-token window lets you fit an entire novel and its style bible into a single shot.
How We Evaluated
Creative writing has fewer benchmark anchors than reasoning, coding, or math, and most of the public ones measure things that don't correlate well with what working writers actually need. There is no SWE-Bench for "does this feel like a human wrote it." There is no AIME for "did the model maintain a consistent narrative distance for forty pages."
So this comparison is built on capability dimensions rather than numbers: we don't invent benchmark figures, and we don't lean on leaderboards that don't measure the work. The seven dimensions in the matrix are:
- Context window — how much source material (style samples, story bible, prior chapters, character notes) the model can hold in a single conversation. Factual.
- Prose quality and rhythm — sentence-level texture: variation in length, control of cadence, willingness to write a fragment, restraint with adjectives. Qualitative.
- Voice retention over long outputs — whether the narrative voice in paragraph forty matches the voice in paragraph one, or drifts toward the model's house style. Qualitative.
- Adherence to user-supplied style — how closely the model imitates a sample you provide versus reverting to its default prose patterns. Qualitative.
- Avoidance of LLM tells — the small mechanical signatures that mark AI prose: em-dash overuse, recycled transitions, "tapestry of," "in conclusion," tidy bow endings, the rhetorical pivot every paragraph. Qualitative.
- Receptiveness to editorial feedback — whether telling the model "tighten this scene, cut the third beat, raise the stakes in the middle" produces a meaningfully different draft or a cosmetic rewrite. Qualitative.
- Cost tier — pricing context relative to the other two. Premium or Mid.
Capability columns are rated Best-in-class, Strong, Adequate, or Trailing. No invented percentages, no citations to leaderboards that don't exist.
The Decision Matrix
| Dimension | Claude Opus 4.7 | GPT-5 | Gemini 2.5 Pro |
|---|---|---|---|
| Context window | 1M tokens | 1M tokens | 2M tokens |
| Prose quality and rhythm | Best-in-class | Strong | Adequate |
| Voice retention over long outputs | Best-in-class | Strong | Strong |
| Adherence to user-supplied style | Best-in-class | Strong | Strong |
| Avoidance of LLM tells | Strong | Adequate | Adequate |
| Receptiveness to editorial feedback | Best-in-class | Strong | Strong |
| Cost tier | Premium | Premium | Mid |
Claude Opus 4.7: When It's the Right Call
Opus 4.7 is the default for any task where the sentence is the unit of quality. Short fiction, literary essay, narrative journalism, voice-driven marketing copy, dialogue-heavy scenes — the things where a reader can tell within two paragraphs whether the writer has an ear or doesn't. On those tasks it wins because of three behaviors:
It varies sentence length without being asked. Most AI prose has a characteristic rhythm — moderate-length sentences, similar clause structures, periodic three-item lists. Opus 4.7 will write a four-word sentence and then a thirty-word one. It will use a fragment. It will let a beat sit.
It holds a voice across long outputs. Tell it to write in a clipped, hard-boiled register and it stays in that register for four thousand words. Tell it to write loose, digressive, first-person essayistic prose and it doesn't drift back toward neutral expository voice by the third section. Voice retention is the dimension where the gap between Opus 4.7 and everything else is widest.
It takes editorial direction. You can tell Opus 4.7 things like "the second beat in this scene is doing the same work as the fourth — cut one" or "the dialogue between A and B sounds like the same person talking to themselves — give B a different verbal habit" and get a rewrite that actually addresses the note. The other models tend to do cosmetic rewrites where surface words change and structural problems persist.
What Opus 4.7 is most prone to: em-dash overuse — it loves them — and occasional "in a world where" openings if you don't head them off. It's also slightly biased toward writing in present tense when given ambiguous instructions. None of these are dealbreakers. They are correctable with a single line in the prompt.
What it isn't ideal for: workloads where you have a rigid outline and need the model to execute it exactly. Opus 4.7 will sometimes push back on outline beats it thinks are weaker — which is a feature if you want a collaborator and a friction point if you want a typist. For that you want GPT-5.
One more behavior worth flagging: Opus 4.7 handles narrative distance well. If you tell it to write in close third with shallow access to the POV character's interiority, it will. If you tell it to pull back to a more omniscient register for an interlude chapter and then return to close third in the next chapter, it will track that shift across thousands of words without drifting. This is the kind of craft variable that almost no benchmark measures and that working novelists notice immediately.
GPT-5: When It's the Right Call
GPT-5 is the structural workhorse. If you have a novel outline with thirty chapters and a beat sheet per chapter and you want the model to execute it without drifting, GPT-5 is the call. It tracks outlines closely, holds character arcs across long generations, and is meaningfully better than Opus 4.7 at hitting an instructed structure without renegotiating it.
The trade-off is texture. GPT-5 prose is competent — it scans clean, it doesn't make grammar mistakes, the structure works — but it has a more uniform rhythm than Opus 4.7. Sentences cluster around a similar length unless explicitly prompted otherwise. Transitions repeat across paragraphs. The model has favored cadences that surface across drafts. Working writers will recognize the texture. Casual readers usually won't.
Where this matters most: marketing long-form, structured business narrative, screenplay treatments where the structure is doing more work than the line, episodic content where consistency between episodes matters more than per-episode lyricism. GPT-5 is also stronger than Opus 4.7 at certain genre conventions — procedural pacing in thriller, beat structure in romance, the specific architecture of a five-act commercial screenplay. If you're writing inside a recognizable commercial form and you want the model to respect the form rather than reinvent it, GPT-5.
GPT-5's most common LLM tells: "in conclusion" and its variants ("ultimately," "in the end," "to sum up") at section boundaries, even in fiction; transition phrases like "but more than that," "what's more"; and a tendency to land every scene on a neat thematic beat instead of letting the scene end with a question. Heading these off requires explicit prompting.
It is also a stronger choice than Opus 4.7 when you need the model to integrate research: historical fiction with period-accurate detail, fiction set in a specific industry or subculture. GPT-5 anchors slightly more tightly in retrieved or supplied factual context, and the texture trade-off matters less when factual grounding matters more.
A useful way to think about the Opus-versus-GPT-5 split: Opus 4.7 is closer to a literary collaborator and GPT-5 is closer to a professional ghostwriter. The collaborator will surprise you with a turn of phrase and occasionally argue with your outline. The ghostwriter will execute the brief, hit the beats, and produce a usable draft you can polish. Neither role is universally better. For a novelist trying to find a voice, the collaborator. For a content team filling a publication calendar, the ghostwriter.
Gemini 2.5 Pro: When It's the Right Call
Gemini 2.5 Pro is the right call when context size is the binding constraint. Its 2M-token window is twice what either Opus 4.7 or GPT-5 offers, and for novel-length work that's a real difference. You can drop an entire 90,000-word draft, a 15,000-word style bible, character sheets, and a chapter outline into a single conversation and ask for a revision pass with all of that in scope.
This matters more than it sounds. Most "long novel" workflows with the other two models involve juggling — summarizing prior chapters into a running synopsis, then feeding the synopsis instead of the chapters themselves. The model never sees the actual prose it's continuing. Gemini 2.5 Pro can. That changes the quality of continuity passes, voice-matching against the existing draft, and any task where the model needs to know what's already on the page.
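A rough token budget shows why the whole-draft workflow is viable. The sketch below assumes the common heuristic of ~1.33 tokens per English word (actual counts vary by tokenizer and text), and the character-sheet and outline word counts are illustrative assumptions, not figures from this comparison:

```python
# Rough token budget for a whole-novel revision pass.
# Assumes ~1.33 tokens per English word (a heuristic; real
# tokenizer counts vary by model and text).
TOKENS_PER_WORD = 1.33

materials = {
    "full draft": 90_000,       # words, per the workflow above
    "style bible": 15_000,
    "character sheets": 8_000,  # illustrative assumption
    "chapter outline": 5_000,   # illustrative assumption
}

total_tokens = sum(words * TOKENS_PER_WORD for words in materials.values())
print(f"Estimated prompt size: {total_tokens:,.0f} tokens")
print(f"Headroom in a 2M window: {2_000_000 - total_tokens:,.0f} tokens")
```

The headroom is the point: after loading everything, most of the window is still free for conversation turns and revised output across a long editing session.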
The trade-off is prose texture. Gemini 2.5 Pro's default voice is the most neutral of the three — adequate, technically correct, but less rhythmically varied than Opus 4.7 and less structurally distinctive than GPT-5. It is also Adequate (not Strong) on avoidance of LLM tells: it has favored transitions and recycled openings that surface across drafts. With a strong style sample in the prompt, voice adherence is Strong — it will follow what you give it. Without one, it reverts to a generic register faster than Opus 4.7 does.
The Mid cost tier matters here too. For revision passes across long documents — where you're running the same long context through many iterations — the per-token economics of Gemini 2.5 Pro are noticeably better than the two premium options. For a novelist doing dozens of full-manuscript passes, that compounds.
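To see how that compounds, here is a back-of-envelope sketch. The per-token prices are hypothetical placeholders, not real rates for any of these models; substitute current published pricing before relying on the numbers:

```python
# How per-token pricing compounds over repeated full-manuscript passes.
# Prices are HYPOTHETICAL placeholders, not actual rates for any model.
PREMIUM_INPUT_PER_M = 15.00  # $ per 1M input tokens (hypothetical)
MID_INPUT_PER_M = 3.00       # $ per 1M input tokens (hypothetical)

manuscript_tokens = 120_000  # ~90,000-word draft plus notes
passes = 30                  # dozens of full-manuscript revision passes

def input_cost(price_per_m: float) -> float:
    """Total input cost across all passes at a given per-million rate."""
    return manuscript_tokens / 1_000_000 * price_per_m * passes

print(f"Premium tier: ${input_cost(PREMIUM_INPUT_PER_M):.2f}")
print(f"Mid tier:     ${input_cost(MID_INPUT_PER_M):.2f}")
```

Whatever the real rates are, the structure of the math holds: the cost of a revision workflow scales with context size times pass count, so the tier discount multiplies across the whole project.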
Where Gemini 2.5 Pro is not the right call: anything short. The 2M window is dead weight on a 1,500-word short story, and the prose-texture deficit shows more in short forms where every paragraph carries more weight. Use it for the workloads its window unlocks. Use Opus 4.7 for everything else.
One workflow that exploits Gemini 2.5 Pro's window particularly well: tone audits. Load the entire manuscript, then ask the model to flag chapters or scenes where the narrative voice drifts away from the established register, where a character's dialogue starts to sound like a different character's, or where the pacing breaks from the rhythm of the surrounding work. Because the model has the whole book in scope, the flags it produces are anchored in actual textual comparison rather than guesses against a summary. That kind of pass is the single highest-value use of the 2M window for novelists.
Which to Pick by Sub-Segment
Short fiction and flash
Claude Opus 4.7. Short forms reward sentence-level care, and Opus 4.7's prose rhythm and voice retention are its widest-margin advantages at flash and short-story length. GPT-5 is acceptable if you have a rigid structural conceit (a story built around a specific formal constraint), but the default is Opus 4.7.
Novel-length narrative continuity
Split the workload. Use Gemini 2.5 Pro for revision and continuity passes that need the entire draft in context — voice consistency audits, timeline checks, "does the foreshadowing in chapter four pay off in chapter twenty-eight" passes. Use Claude Opus 4.7 for generating new prose chapter by chapter. Use GPT-5 for outline-execution passes where you're filling in a planned structure.
Voice-matching to a sample
Claude Opus 4.7, by a wide margin. Give it 1,000–2,000 words of an author's prose (your own or someone else's, public-domain or otherwise) tagged in XML, and tell it to write new content in that voice. Opus 4.7's pattern-following on stylistic features — sentence length distribution, vocabulary register, syntactic habits, punctuation rhythm — is the strongest of the three, and meaningfully better than describing the style in instructions. See voice prompting for the technique.
Marketing and brand long-form
GPT-5 if the brand voice is structural — built around recognizable formats (case studies, listicles, before-after narratives). Claude Opus 4.7 if the brand voice is textural — built around a recognizable rhythm or register that won't survive structural compression. For most B2B SaaS marketing the answer is GPT-5; for founder-voice newsletters and editorial brand work, Opus 4.7.
Screenplay and dialogue
GPT-5 for screenplay structure — beat sheets, three-act and five-act architecture, treatment-to-script expansion. Claude Opus 4.7 for dialogue itself, especially scenes where character voice differentiation matters. The two-model split is real here: GPT-5 plans, Opus 4.7 writes the lines, and a human edits the seams.
Poetry and constrained-form
Claude Opus 4.7. Poetry and forms with prosodic constraints (sonnets, villanelles, formal verse) reward the same sentence-level attention that Opus 4.7 is strongest at, plus the model is more willing to write a strange line and let it stand. GPT-5 is more likely to smooth a strange line out toward conventional phrasing. Neither model is a substitute for a poet; both are useful drafting tools.
Worldbuilding documents and series bibles
Gemini 2.5 Pro for the document itself — long, internally cross-referential, the kind of artifact where holding the whole thing in context during edits is the binding constraint. Claude Opus 4.7 for the prose excerpts inside the bible (sample passages of in-world text, character voice samples, fragments from in-world documents) that need to read like real prose rather than reference material. GPT-5 for the structural scaffolding (timeline grids, faction relationship charts, magic-system rule tables) that benefits from rigid consistency.
Translation and prose adaptation
Claude Opus 4.7 for literary translation where preserving voice across languages matters more than literal fidelity, and for adapting prose between registers (translating contemporary literary fiction into a more accessible register, or vice versa). GPT-5 for technical or commercial translation where fidelity to source structure is the priority.
Sample Prompt for the Recommended Winner
The clearest demonstration of Claude Opus 4.7's voice-matching strength is a prompt that gives it a tagged style sample and asks for new content in that voice.
```
<task>
Write a 600-word short scene in the voice of the style sample below.
The scene: [one-sentence scene description, e.g., "a divorced
father picks his son up from school on a rainy Tuesday."]
</task>

<style_sample>
[Paste 1,000–2,000 words of prose in the target voice here.
Can be your own writing, a public-domain author, or
licensed sample material. The longer and more representative,
the better the match.]
</style_sample>

<instructions>
- Match the sentence-length distribution of the style sample.
  If it averages short, write short. If it varies wildly, vary wildly.
- Match the vocabulary register (formal/colloquial/period/technical)
  rather than picking words you think "sound literary."
- Match punctuation habits — if the sample avoids em-dashes,
  avoid them; if it uses sentence fragments, use them.
- Do not summarize or analyze the style. Just write the scene.
- Do not end on a tidy thematic beat. Let the scene end where
  it ends.
</instructions>
```
Two things explain why this prompt works on Opus 4.7. First, voice imitation works better when the model has prose to pattern-match against than when it has a verbal description of style. "Write in the voice of a hard-boiled noir narrator" is an instruction. A thousand words of actual hard-boiled prose is a target. Opus 4.7 is best-in-class at the second formulation. Second, the explicit anti-tells in the instructions block (don't end on a tidy beat, follow the sample's punctuation rather than your defaults) head off the specific failure modes Opus 4.7 is most prone to. That combination of a strong target plus pointed anti-tells is what unlocks the voice-matching the model is capable of.
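If you're running this template against many scenes, it helps to assemble it programmatically. A minimal sketch: the helper name, the stand-in sample text, and the 600-word default are all illustrative, and in practice the sample string would be 1,000–2,000 words of real prose loaded from a file.

```python
# Assemble the tagged voice-matching prompt from a style sample
# and a one-sentence scene brief. Names and defaults are illustrative.
def build_voice_prompt(style_sample: str, scene: str, words: int = 600) -> str:
    """Wrap a style sample and scene brief in the tagged prompt structure."""
    return f"""<task>
Write a {words}-word short scene in the voice of the style sample below.
The scene: {scene}
</task>

<style_sample>
{style_sample.strip()}
</style_sample>

<instructions>
- Match the sentence-length distribution of the style sample.
- Match the vocabulary register rather than picking words that "sound literary."
- Match punctuation habits, including fragments, if the sample uses them.
- Do not summarize or analyze the style. Just write the scene.
- Do not end on a tidy thematic beat. Let the scene end where it ends.
</instructions>"""

# Stand-in sample; a real run would load 1,000-2,000 words from disk.
sample = "The rain came in sideways. He waited. Nobody ever waited with him."
prompt = build_voice_prompt(
    sample,
    "a divorced father picks his son up from school on a rainy Tuesday.",
)
```

Keeping the instructions block inside the function means every scene generation carries the same anti-tell guardrails, so the only thing that varies between runs is the sample and the brief.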
Closing
If you only remember three things from this comparison: Claude Opus 4.7 is the default for prose quality and voice. GPT-5 is the right call when structure carries more weight than texture. Gemini 2.5 Pro is the pick when you need the 2M-token window to hold a whole novel and its style bible in one shot.
SurePrompts has voice-matching templates, scene generators, and long-form structure prompts tuned for each of these three models. Build a prompt with the right model selected and the right structural scaffolding and you skip most of the trial-and-error.
Related reading:
- The AI Model Selection Guide — the umbrella framework this post sits inside.
- Best Claude Opus 4.7 Prompts for 2026 — specific prompt patterns for the recommended winner.
- Voice Prompting — the technique behind the sample prompt above.