
AI Reasoning Models: The Complete 2026 Prompting Guide

The canonical 2026 guide to prompting reasoning models — what they actually are, the model landscape (o3, Claude extended thinking, Gemini Deep Think, DeepSeek R1), the universal anatomy of a strong reasoning prompt, per-model dialects, when not to use a reasoning model, and honest evaluation.

SurePrompts Team
April 22, 2026
35 min read

TL;DR

A strong 2026 reasoning-model prompt is not chain-of-thought hand-holding — it is a clear brief that directs an already-thinking model toward the right problem at the right depth. This pillar consolidates the SurePrompts reasoning cluster: what reasoning models actually are, the 2026 model landscape, the universal prompt anatomy, per-model dialects, thinking-budget control, and honest evaluation.

Key takeaways:

  • The reasoning-model market split into four useful shapes in 2026: general high-reasoning (o3, o4-mini), long-context dense work (Claude Opus 4.7 with extended thinking, Sonnet 4.6 thinking), parallel exploration (Gemini 2.5 Pro Deep Think and Flash Thinking), and open-weights cost-efficient (DeepSeek R1, Qwen QwQ). The universal prompt anatomy is shared across all of them; the dialects and the budget knobs are not.
  • The 2023 chain-of-thought playbook actively backfires on these models. "Think step by step," elaborate persona stacking, take-a-deep-breath primers, and few-shot examples on pure reasoning tasks all add noise without unlocking anything — the model is already doing the work you are trying to prompt into existence.
  • A strong reasoning prompt fills six slots: goal (not procedure), constraints, context, audience and output shape, reasoning budget, evaluation criteria. Forgetting any of them means the model picks a generic default, and the default is almost always too shallow or too verbose.
  • Reasoning depth is set via API parameters, not prose. o3's reasoning_effort, Claude's budget_tokens and effort level, Gemini Deep Think's toggle, DeepSeek's streamed chain. "Think harder" in the user message does nothing that the dial does not already do.
  • Most tasks do not need a reasoning model. Direct recall, simple classification, format conversion, latency-sensitive chat — standard models win on cost, speed, and sometimes quality. The practical heuristic: if you cannot name the steps the model should think through, the task probably does not need a reasoning model.
  • Coherent prose is not correct content. A reasoning model that confidently arrives at the wrong answer is the most dangerous failure mode in this category. Evaluate slot-by-slot against the brief, layer self-critique loops on high-stakes work, and use llm-as-judge rubrics for outputs that ship.
  • Reasoning models compose with everything else. They are the planner inside an agentic loop, the synthesizer at the end of a RAG pipeline, and the deep-deliberation step inside a hybrid stack where a fast standard model handles the rest. Do not treat them as a replacement for the rest of the toolkit — treat them as the part of it that thinks before answering.

Two years ago, "let's think step by step" was the single most reliable trick in prompt engineering. You added it to a prompt, the model wrote out its reasoning, and accuracy went up. It was magic — until the architecture changed underneath it. In 2026 that same phrase is either useless or actively counterproductive on the frontier reasoning models. The reasoning is happening anyway, in dedicated hidden tokens you may never see; what you write in the prompt now directs deliberation that already exists, rather than coaxing reasoning into being.

This pillar consolidates the SurePrompts reasoning-models cluster into a canonical entry point. Each section links out to the deep-dive post for the model or technique it references. Use this page to pick the right reasoning tier for a problem, learn the shared six-slot anatomy, understand the per-model dialects, and know when not to reach for a reasoning model at all. For the broader discipline this sits inside, see context engineering — the 2026 replacement for prompt engineering as a generic label. For the two sister pillars in this Phase 3 series, see AI image prompting and AI video prompting; the structural moves are the same, the modality and the dialects are not.

What Reasoning Models Actually Are in 2026

A reasoning model is a language model that performs a dedicated deliberation pass before producing its visible answer. The deliberation lives in a separate phase from the response — sometimes in fully hidden tokens (o3, o4-mini), sometimes in a separately-budgeted but inspectable thinking block (Claude extended thinking), sometimes streamed inline as a transparent chain (DeepSeek R1), sometimes in a parallel exploration of multiple hypotheses (Gemini Deep Think). The shared property is that the model has spent meaningful test-time compute on the problem before it commits to an answer.

Mechanically this is built-in chain-of-thought. The model was already going to reason; the architectural change gives that reasoning a budgeted place to live and tells the sampler not to force an answer until that place is used. The economic change is that you now pay for the thinking tokens as input on top of the visible output. The behavioral change — the one that matters for prompting — is that the reasoning is no longer something you elicit. It is something you direct.

Five things shift when you move from a standard model to a thinking one.

Reasoning location. In a standard model, the only reasoning is in the visible output. That is why chain-of-thought prompting worked — you were literally giving the model space to reason by asking it to "think step by step." In a thinking model, the reasoning happens in a phase you do not write into. Your job shifts from eliciting reasoning to directing it.

Token economics. Thinking tokens are billed. On a simple task they are pure waste; on a complex one they are the difference between a wrong answer and a right one. The cost shape of a reasoning-model app is determined by which routes enable thinking and what budget they use.

Latency. Reasoning calls are slower than standard calls by 3-10x at high effort. For interactive applications, that latency is a UX cost. For batch or background work, it is invisible.

Visibility. The thinking trace is, depending on the model, fully hidden, separately inspectable, or streamed inline. Inspectable traces help debugging and audit. Hidden traces force you to evaluate the answer alone. Streamed traces (DeepSeek R1) are sometimes user-facing in product UI.

Failure modes. Standard models fail by being shallow or hallucinating confidently. Reasoning models fail by overthinking trivial tasks, deliberating against the wrong target when the brief is fuzzy, and most dangerously by arriving at coherent-sounding wrong answers after long deliberation. The evaluation discipline has to match.

For the deeper category framing, see the reasoning-model glossary entry, the test-time compute glossary entry, the extended-thinking glossary entry, and the thinking-model glossary entry. The mental model worth carrying: standard model prompting is getting the model to reason at all; reasoning model prompting is directing a model that already reasons toward the right problem at the right depth.

The 2026 Model Landscape

The reasoning-model market in 2026 is not a one-horse race. Each major model has a distinct personality, a distinct control surface, and a distinct cost shape. Picking the right model per task is half the work.

| Model | Reasoning depth control | Visible thinking | Strong at | Cost shape | Commercial terms |
|---|---|---|---|---|---|
| OpenAI o3 | reasoning_effort: low / medium / high | Hidden | General hard reasoning, math, code, multi-step problems | Premium per-token, high at high effort | Per OpenAI terms |
| OpenAI o4-mini | reasoning_effort: low / medium / high | Hidden | STEM, code, structured-data extraction at o3-class quality for less | Substantially cheaper than o3, often matches on STEM | Per OpenAI terms |
| Claude Opus 4.7 (extended thinking) | budget_tokens cap + effort level low/med/high/max | Inspectable trace | Long-context dense work, code review, legal, multi-file debugging, planning | Premium per-token; caching matters | Per Anthropic terms |
| Claude Sonnet 4.6 (thinking) | budget_tokens cap + effort level | Inspectable trace | Daily-driver tier; most analysis and writing where Opus is overkill | Mid-tier; the routing default for most workflows | Per Anthropic terms |
| Gemini 2.5 Pro Deep Think | Toggle on/off (Gemini app, API) | Limited visibility | Genuinely open problem spaces, parallel exploration, multimodal reasoning | Premium tier (Google AI Ultra) | Per Google terms |
| Gemini 2.5 Flash Thinking | Toggle | Limited visibility | Cost-efficient reasoning, multimodal-aware fast iteration | Mid-tier | Per Google terms |
| DeepSeek R1 | Streamed chain (hosted limits or local compute) | Fully visible inline | Math, formal reasoning, transparent traces, cost-sensitive workloads | Order-of-magnitude cheaper API; self-hostable | Open weights; check license |
| Qwen QwQ | Streamed chain | Fully visible inline | Open-weights local reasoning, fine-tuneable for domain work | Open-weights cost; depends on infrastructure | Open weights; check license |

A few threads worth pulling on.

OpenAI o3 is the general-purpose hard reasoner. Its reasoning_effort dial — low, medium, high — is the cleanest depth control in the market, and at high effort it sits at the top of every major reasoning benchmark on multi-step math, code, and analysis. The thinking is hidden, but structured outputs (JSON Schema) let you separate reasoning from response shape entirely. o4-mini is not a smaller o3 — it is optimized for STEM and code at o3-class quality for substantially less, often matching o3 on math, science, and code generation at 5-10x lower cost. The full o3 and o4-mini patterns live in advanced prompt engineering for Claude, GPT-5, and Gemini.

Claude Opus 4.7 with extended thinking is the workhorse for long-context dense work where the brief is rich and the success criteria are specific. Its 1M-token context, prompt caching, and inspectable thinking trace combine into a stack that rewards heavy briefs — multi-file code review, contract analysis with interacting clauses, large-document synthesis, debugging across many functions. The thinking is governed by budget_tokens plus an effort level (low, medium, high, max). Sonnet 4.6 sits one tier below as the daily-driver. The complete Opus 4.7 playbook is in the Claude Opus 4.7 prompting guide; broader Claude patterns are in the Claude 4 prompting guide; the extended-thinking specifics are in extended thinking prompts for Claude.

Gemini 2.5 Pro Deep Think is the most architecturally different of the four families. Instead of a single reasoning chain with adjustable depth, Deep Think runs parallel reasoning — multiple hypotheses generated and considered simultaneously, revised or combined before a single answer is committed. You are not asking a single reasoner to think harder; you are asking a committee of parallel reasoners to explore a space. Prompts that open the space with multiple framings get dramatically better results. Deep Think also pairs with Gemini's multimodal strength in a way no other reasoning model matches. Flash Thinking is the cost-efficient sibling. The Deep Think patterns sit alongside the o3 and Claude patterns in advanced prompt engineering for Claude, GPT-5, and Gemini.

DeepSeek R1 is the open-weights reasoning option. It delivers near-frontier reasoning at API pricing roughly 5-10x lower than the closed competitors, with downloadable weights for self-hosting. Its thinking trace is fully visible inline — the chain is the product, not a byproduct — useful when you want users or auditors to see the reasoning. R1 rewards explicit step-by-step requests, structured problem formats, and verification gates; the DeepSeek vs ChatGPT comparison covers strategic positioning, and the 40 best DeepSeek prompts post is the template library tuned for R1's strengths.

Qwen QwQ and other open-weights reasoners sit a step below R1 on raw quality but expand the local-deployment surface. They are the right tool when self-hosting is non-negotiable (data sovereignty, regulated workloads, sensitive proprietary data). Output ceiling lower than the frontier; pipeline ceiling higher because you own everything.

For cost-sensitive routing across this landscape, see the model-cascade glossary entry and the hybrid-workflows section below — most production reasoning workloads use a fast standard model for the parts that do not need deliberation and route only the genuinely hard turns to a reasoning model.

The Universal Reasoning-Prompt Anatomy

Every strong reasoning prompt — regardless of model — covers six slots. You can omit a slot on purpose. You cannot forget the slot exists. When a slot is missing, the model fills it with a plausible default, and the default is almost always too shallow or too verbose.

1. Goal — what success looks like. State the outcome, not the procedure. "Identify the security vulnerabilities that could lead to data exposure or unauthorized access, ranked by severity" is a goal. "First check for SQL injection, then check for XSS, then check for CSRF" is a procedure. The first lets the model apply its full reasoning capacity, including categories you would not have thought to list. The second caps quality at the level of your enumeration. Reasoning models are path-finders; give them the destination, not the route.

2. Constraints — the hard rules and binding requirements. Constraints define the boundaries of acceptable output. "Keep the response under 500 words." "Only use information from the provided documents." "All code must be Python 3.12 compatible." "Do not recommend solutions that cost more than $10,000 a month." These are different from procedures — constraints scope the answer, procedures script the path. Reasoning models reward tight constraints because their thinking phase uses them as steering signals.

3. Context — what the model needs to know that is not on the public internet. Project-specific facts, prior decisions, codebase conventions, the team's runway, the migration history, the customer's domain. Without context the model deliberates against an imagined generic situation; with context it reasons against your actual one. For long-context Claude work, this is the wrapped reference block that benefits from prompt caching. For agentic runs, this is the session memory the model carries across turns. For one-shot prompts, this is a "key facts" block right before the task.

4. Audience and output shape. Who reads this and in what format. "JSON matching the schema below" is an output shape. "A 200-word executive summary followed by a numbered action list" is also an output shape. "Make it good" is not. Audience changes the answer in ways that matter — a code review for a junior engineer versus a senior reviewer is a different artifact even from the same diff. Name both.

5. Reasoning budget. This is not a prose slot — it is an API parameter. o3's reasoning_effort, Claude's budget_tokens plus effort level, Gemini Deep Think's toggle, DeepSeek's hosted-service or self-hosted limits. Set it deliberately; do not try to coerce more thinking through phrases like "really consider this" or "think very carefully." The dial does what the prose pretends to do, more reliably and without bloating the input. The right budget is "enough to not truncate mid-reason on the hardest case in your eval," not "the maximum allowed."

6. Evaluation criteria — what a correct answer looks like. Name the standards the model can use to self-check inside the thinking phase. "Verify your answer against the original constraints and test it with edge cases." "After choosing an algorithm, confirm the time complexity matches the stated performance requirement." "Before finalizing, check that every cited claim appears in the supplied document." Reasoning models can self-verify when you give them something to verify against; without an evaluation slot they ship the first plausible answer and call it done.

A worked example. The weak version: "Analyze this dataset by first calculating the mean, then the median, then the standard deviation, then identifying outliers using the IQR method, then summarizing trends." The strong version names the goal (find patterns and anomalies that affect a specific business decision), the context (90 days of SaaS transaction data, three pricing tiers, an open question about a mid-tier), the constraints (use only the supplied data, flag claims that depend on outside data), the output shape (markdown report with executive summary, ranked patterns, anomalies, recommendation), the audience (pricing team, quantitative-comfortable), and the evaluation (verify each cited statistic against the data, confirm the recommendation follows from the patterns). Reasoning effort is set on the API call, not in the prose. The strong version is longer because it fills slots, not because it is more ornate — every phrase is doing work. This is what 2026-native reasoning prompts look like.
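The six slots are mechanical enough to template. Here is a minimal sketch of a brief builder; the slot names and layout are this post's convention, not any vendor's API, and the example values echo the worked SaaS-pricing brief above.

```python
def build_reasoning_brief(goal, constraints, context, audience, output_shape,
                          evaluation):
    """Assemble the six slots into one brief.

    The reasoning budget is deliberately absent: it belongs on the API
    request (reasoning_effort, budget_tokens), not in the prose.
    """
    sections = [
        ("Goal", goal),
        ("Constraints", "\n".join(f"- {c}" for c in constraints)),
        ("Context", context),
        ("Audience", audience),
        ("Output shape", output_shape),
        ("Evaluation criteria", "\n".join(f"- {e}" for e in evaluation)),
    ]
    return "\n\n".join(f"## {name}\n{body}" for name, body in sections)

brief = build_reasoning_brief(
    goal="Find patterns and anomalies that affect the mid-tier pricing decision.",
    constraints=["Use only the supplied data.",
                 "Flag any claim that depends on outside data."],
    context="90 days of SaaS transaction data across three pricing tiers.",
    audience="Pricing team, quantitative-comfortable.",
    output_shape="Markdown report: executive summary, ranked patterns, "
                 "anomalies, recommendation.",
    evaluation=["Verify each cited statistic against the data.",
                "Confirm the recommendation follows from the patterns."],
)
```

Note what the builder cannot produce: a "think harder" line. The depth slot has no prose representation on purpose.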

Why the 2023 Chain-of-Thought Playbook Backfires

This is the central counter-intuitive insight of the category. The techniques that made you effective with GPT-4 and Claude 3 can actively hurt your results with reasoning models. The Wharton 2025 Prompting Science Report's finding — that chain-of-thought prompting adds negligible benefit on models that already think step-by-step — is the same principle that explains every item in this section: the model is doing the work you are trying to prompt into existence.

Redundant reasoning narration. Asking a reasoning model to "think step by step" in the visible response either duplicates the work — once in hidden thinking tokens, once in narrated output, doubling cost without improving quality — or causes the visible narration to drift from the hidden chain because it improvises beyond what the deliberation produced. Both are worse than trusting the dedicated thinking phase. If you need the reasoning shown, ask for a "brief justification" after the answer.

Over-specified procedures. A prompt that reads like a procedure manual replaces the model's reasoning with yours, and the model's reasoning is usually better than the script you would write, because it can explore approaches you would not have thought of — race conditions in a security audit, alternative algorithms in a code review. Procedures cap quality at your enumeration. Constraints scope the answer without scripting the path.

Anchoring few-shot examples on reasoning tasks. Few-shot prompting still wins for pattern-matching, format-following, and classification. On reasoning tasks where you want the model to think from scratch, examples backfire — the model anchors on your specific solution path and reduces solution diversity. The same applies to the alternative reasoning patterns the field developed for older models: step-back prompting, least-to-most prompting, and tree-of-thought all encoded reasoning structure into the prompt. On a thinking model, that structure is already happening; encoding it externally either constrains the internal version or wastes tokens reproducing it.

"Think step by step" as wasted tokens. The canonical anti-pattern. Drop it from reasoning-model prompts; keep it in standard-model prompts where it still works.

Persona stacking, take-a-deep-breath primers, confidence-eliciting phrases. "You are a senior X with 15 years of experience" was a 2023 trick that did real work on small instruction-tuned models. On a frontier reasoning model, detailed personas underperform direct task framing, emotional primers do nothing measurable, and "if you're unsure, say so" is now a tax because reasoning models self-flag uncertainty inside the thinking phase when it matters. State the task, provide the context, set the evaluation criteria — skip the costume.

The shift to internalize: in 2023, prompt engineering was about coaxing reasoning out of models that did not want to reason. In 2026 on reasoning models it is about directing models that are already reasoning toward the right problem at the right depth. What worked then often fails now, not because the models got worse but because the failure mode changed.

Per-Model Dialects

Six slots are portable. How you express them shifts by platform.

OpenAI o3 and o4-mini

o3 rewards clean prose plus structured outputs for the response shape. The reasoning_effort parameter does the depth-control work; the user message focuses on the brief. Use structured outputs (JSON Schema) to separate reasoning from output format entirely. Use system prompts for persona and persistent rules; keep evaluation criteria and constraints in the user message where they sit next to the problem context. Pick o3 vs o4-mini deliberately — o4-mini is optimized for STEM, code, and structured extraction at o3-class quality for substantially less; o3 wins on open-ended analysis and tasks that need the highest reasoning ceiling.

A short o3 prompt for an architectural decision states the current stack, the team size, the expected scale, the optimization criteria, and the explicit failure-mode flag, with reasoning_effort: high on the request. No "think step by step," no persona stacking — the constraints and criteria are explicit, and the model will consider options you have not listed and weight them against your actual situation. The full o3, GPT-5.4, and Gemini patterns are in advanced prompt engineering for Claude, GPT-5, and Gemini.
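As a sketch of the depth-is-a-parameter rule, here is what an o3-style request payload might look like. The field layout loosely follows OpenAI's Responses API; the model id and prompt text are placeholder assumptions, so check the current API reference before relying on the shape.

```python
def o3_request(brief: str, effort: str = "medium") -> dict:
    """Build a request dict where depth lives in a parameter, not the prose."""
    if effort not in {"low", "medium", "high"}:
        raise ValueError("reasoning_effort must be low, medium, or high")
    return {
        "model": "o3",                    # assumed model id
        "reasoning": {"effort": effort},  # the depth dial
        "input": brief,                   # the brief carries no reasoning coaching
    }

req = o3_request(
    "Recommend a queueing architecture for our stack: 4-person team, "
    "50k msgs/sec expected, optimize for operational simplicity. "
    "Flag the failure mode most likely to bite us first.",
    effort="high",
)
```

The brief states stack, scale, criteria, and the failure-mode flag; the dial does the rest.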

Claude with Extended Thinking

Claude rewards XML-tagged content and explicit format pinning. Wrap reference material in tags (<code>, <spec>, <constraints>, <review_criteria>) so it reads as data, not meta-instructions. Pin the output shape at the tail — the last thing the model reads before emerging from thinking is the most reliable place to keep binding output requirements salient. Set the budget via API, not prose: Claude's budget_tokens cap and effort level (low, medium, high, max) are the depth knobs. Start low, raise only when the trace truncates on a real task. Enable per-route, not per-app — extended thinking on every turn overpays classification routes.

A short Claude prompt for a code review wraps the diff in <code_to_review> tags, lists the operational facts in <constraints>, names the bar in <review_criteria>, pins a strict JSON output schema at the tail, and sets budget_tokens and effort: high on the request. The system message defines the role; the user message wraps content in tags; the criteria are named; the budget is on the request — not in the prose. The Opus 4.7 specifics — 1M-context structuring, prompt caching at Opus pricing, tool-use patterns — live in the Claude Opus 4.7 prompting guide. The broader Claude patterns are in the Claude 4 prompting guide. The extended-thinking-specific patterns — when to enable, when it hurts, how to size the budget — are in extended thinking prompts for Claude.
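The Claude dialect described above can be sketched the same way: XML-wrapped content, format pinned at the tail, budget on the request. The thinking-block shape loosely follows Anthropic's extended-thinking API; the model id, tag names, and default budget are placeholder assumptions.

```python
def claude_review_request(diff: str, constraints: str, criteria: str,
                          schema: str, budget_tokens: int = 10_000) -> dict:
    """Build a code-review request in the Claude dialect."""
    user = (
        f"<code_to_review>\n{diff}\n</code_to_review>\n"
        f"<constraints>\n{constraints}\n</constraints>\n"
        f"<review_criteria>\n{criteria}\n</review_criteria>\n"
        # Pin the output shape at the tail, where it stays most salient.
        f"Respond with JSON matching this schema, and nothing else:\n{schema}"
    )
    return {
        "model": "claude-opus-latest",  # placeholder id
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": user}],
    }

req = claude_review_request(
    diff="--- a/auth.py\n+++ b/auth.py\n...",
    constraints="Runs on Python 3.12; no new dependencies.",
    criteria="Security issues ranked by severity; note anything blocking merge.",
    schema='{"findings": [{"severity": "", "file": "", "issue": ""}]}',
)
```

Note the budget is a request field, not a sentence in the prompt.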

Gemini Deep Think

Gemini 2.5 Pro Deep Think runs parallel reasoning — multiple hypotheses generated and considered simultaneously before a single answer is committed. Open the problem space, do not narrow it: instead of "what is the best approach to X?", ask "explore at least three distinct approaches to X, compare their tradeoffs, then recommend one." Outputs become more honest — Deep Think is more likely to surface the runner-up and explain why it lost. Stack constraints freely; Deep Think handles layered constraints well because it weighs them in parallel. Pair with multimodal input — Gemini reasons across diagrams, charts, photographs, and recorded media in a way no other reasoning model matches. Combine with Google Search grounding for questions that need both current data and deep analysis.

A short Deep Think prompt for a strategic analysis: "Analyze this quarter's performance. Explore at least three narratives that explain the Q3 revenue dip, using evidence from the attached PDF, the earnings call transcript, and the roadmap timeline. For each narrative, list the strongest supporting evidence and the strongest counter-evidence. Then recommend which narrative leadership should adopt in the public messaging." What is not in this prompt: no "think step by step," no "you are a financial analyst." Deep Think is already going to think carefully — your job is to frame the exploration, not coach the reasoning.

DeepSeek R1

DeepSeek R1's distinguishing feature is its fully visible streamed thinking trace — the chain is the product, not the byproduct. Request explicit reasoning chains: "think step by step, show every step of your reasoning." This is the opposite of what works on o3 or Claude, and works on R1 specifically because R1's architecture treats the chain as a first-class output. Use structured problem formats (Given/Find/Solution); R1 handles formal structures more reliably than conversational requests. Be explicit about verification — "verify your answer against the original constraints" — R1's reasoning capability makes self-verification actually useful and naming the step produces a tangible quality lift on math and logic tasks.

A short R1 prompt for a logic problem opens with the explicit step-by-step request, then names a six-step procedure (restate, identify given information, plan, execute showing work, verify against constraints, state the final answer), and closes with "if you are uncertain about any step, flag it and explain why." The structure that backfires on o3 and Claude — explicit step-by-step procedure — is the structure R1 rewards, because the chain is meant to be visible and the structure is what readers and verifiers use. The strategic positioning of DeepSeek versus the closed competitors is in DeepSeek vs ChatGPT; the full template library — 40 prompts across reasoning, math, coding, writing, research, business, and creative work — is in the 40 best DeepSeek prompts for 2026.
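That R1 scaffold is easy to keep as a reusable template. This is a sketch of the Given/Find/Solution structure described above; the example problem is illustrative.

```python
# R1 dialect template: explicit step-by-step request, formal scaffold,
# verification gate, uncertainty flag. This structure is the point on R1,
# and the same structure that backfires on o3 and Claude.
R1_TEMPLATE = """Think step by step and show every step of your reasoning.

Given: {given}
Find: {find}

Solution steps:
1. Restate the problem in your own words.
2. Identify the given information.
3. Plan your approach.
4. Execute the plan, showing all work.
5. Verify the answer against the original constraints.
6. State the final answer.

If you are uncertain about any step, flag it and explain why."""

prompt = R1_TEMPLATE.format(
    given="Three switches outside a room; exactly one controls the bulb inside.",
    find="A procedure that identifies the correct switch with one room entry.",
)
```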

Qwen QwQ and Other Open-Weights Reasoners

Qwen QwQ sits in the same shape as DeepSeek R1 — open weights, streamed visible thinking, strong on math and code, downloadable for self-hosting. Quality trails R1 by a step on raw benchmarks; the deployment surface is similar. The dialect transfers: explicit step-by-step requests, structured problem formats, verification gates. Other open-weights reasoners (small Llama-based fine-tunes, research releases) follow the same general shape but with more variability.

These models matter most when self-hosting is non-negotiable — data sovereignty, regulated workloads, sensitive proprietary data — or when cost at extreme scale makes even DeepSeek's hosted API uneconomical. The pipeline ceiling is high (full control, fine-tuning, custom inference stacks); the output ceiling is lower than the frontier. Treat them the way the image pillar treats Stable Diffusion — the option when you need ownership more than you need the absolute top of the quality curve.

Reasoning Tier Selection: When NOT to Use a Reasoning Model

Counterweight section. Most tasks do not need a reasoning model. Using one for simple work is like using a scanning electron microscope to check if your plants need watering — technically it works, but you are wasting time, money, and latency for no quality gain.

When standard models win. Direct recall is a single forward pass; reasoning is overhead. Simple classification, sentiment, and labeling are pattern-matching, not reasoning. Format conversion is transformation. Short creative outputs reflect taste, not deliberation. Single-paragraph extraction is grounded in supplied text. Latency-sensitive turns — chat, in-product assistants, support — pay a multi-second tax users notice and the task does not need. Reasoning models on simple tasks sometimes produce worse output than standard models because they second-guess obvious responses. The cost is bidirectional — money and quality both.

The model-cascade pattern. Most production reasoning workloads should not be all-reasoning-model traffic. The model-cascade pattern routes requests by complexity: a fast standard model (GPT-4o, Claude Haiku, Gemini Flash) handles obvious cases, and only the genuinely hard turns escalate. The router can be a small model, a heuristic on the input, or a confidence threshold from the standard model's first attempt. Done well, cascades cut cost and latency by an order of magnitude while preserving the quality lift on the cases that need it.
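A minimal cascade router can be almost embarrassingly simple. The trigger keywords, length threshold, and tier names below are illustrative assumptions; in production the router might be a small classifier model or a confidence threshold on the standard model's first attempt instead.

```python
# Words that suggest multi-factor reasoning rather than lookup or labeling.
REASONING_TRIGGERS = ("why", "trade-off", "tradeoff", "design", "debug",
                      "prove", "compare", "plan")

def route(request: str) -> str:
    """Return the tier a request should run on: cheap default, escalate rarely."""
    text = request.lower()
    hard = (any(t in text for t in REASONING_TRIGGERS)
            or len(text.split()) > 120)   # long briefs usually mean hard tasks
    return "reasoning-model" if hard else "fast-standard-model"

route("Classify this ticket as billing or technical")        # stays on the fast tier
route("Design a migration plan and compare the trade-offs")  # escalates
```

Even a crude heuristic like this captures most of the cost win, because the traffic distribution is the asset: the obvious cases vastly outnumber the hard ones.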

Hybrid workflows where reasoning plans and a fast model executes. Use a reasoning model once at the start of a workflow to produce a plan, then hand each step to a fast model for execution. The reasoning model spends its budget on the hard part (the plan); the fast model spends its speed on the volume part. For agentic loops, the same pattern applies — reasoning model as planner-and-reflector, standard model as per-tool-call worker. The full architecture is in the agentic prompt stack; the working example is in the research-agent walkthrough.

The practical heuristic. If you could solve the task yourself in under 30 seconds with full context, a standard model is probably sufficient. If the task requires multiple interacting factors, tradeoff weighing, or chained logic, reach for a reasoning model. If you cannot name the steps you would expect the model to think through, the task probably does not need a reasoning model.

Controlling Reasoning Depth

Depth is a dial, not a prompt trick. Each model exposes the dial differently.

OpenAI o3 and o4-mini. Three levels: low, medium, high. Low for straightforward reasoning — clear-spec coding, synthesis Q&A. Medium is the default for analysis, multi-step problems, writing that needs planning. High for complex math, formal proofs, multi-file code generation, problems with many interacting constraints. Every step up costs latency and tokens; default lower and raise only when accuracy matters more than speed or cost.

Claude extended thinking. Two knobs: budget_tokens (the cap) and effort level (low, medium, high, max in 4.6+). The budget governs how much room the model has; the effort level governs how aggressively it uses that room. Typical production range: 10K tokens for simple analysis, up to 100K for complex multi-step problems. Setting the budget too low truncates mid-reason; too high wastes money. Start low and raise only when an eval shows truncation on real tasks.

Gemini Deep Think. A toggle, not a granular dial — on or off at the app level for Google AI Ultra subscribers, at the model-tier level via API. Flash Thinking is the cost-efficient sibling. Binary decision: does the task warrant parallel hypothesis exploration? Yes — open problem space, multimodal input, layered constraints. No — Flash Thinking or 2.5 Pro without Deep Think.

DeepSeek R1. Streamed chain bounded by hosted-service limits or self-hosted compute. Depth is more emergent than dialed — R1 reasons until it converges or hits the limit. The control surface is less granular than o3 or Claude, but the fully visible trace makes debugging and audit easier than with hidden-chain models.

Cost implications. Thinking tokens are billed as input tokens. Routine problems produce short traces; hard ones use the full budget. Higher budget is not linear quality — the returns curve is steep then flat; you want enough, not maximum. Cache the static prefix (system prompts, stable context) where supported. Enable per-task, not per-app. Measure lift, not just cost.
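The billing shape is worth making concrete. A back-of-envelope cost model, using placeholder per-million-token rates rather than any vendor's actual prices, shows why the thinking budget dominates cost on hard routes:

```python
def call_cost(prompt_tokens: int, thinking_tokens: int, output_tokens: int,
              input_rate: float = 2.0, output_rate: float = 8.0) -> float:
    """Dollar cost of one call, with thinking tokens billed at the input rate.

    Rates are hypothetical $/1M-token figures; the shape, not the numbers,
    is the point: thinking is billed on top of the visible output.
    """
    billed_input = prompt_tokens + thinking_tokens
    return (billed_input * input_rate + output_tokens * output_rate) / 1_000_000

cheap = call_cost(1_000, 0, 500)        # thinking disabled
deep = call_cost(1_000, 20_000, 500)    # same task, generous budget
```

The same prompt with a 20K-token thinking budget costs many times the thinking-disabled call, which is why enabling per-route rather than per-app matters.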

The general rule across all four families: set depth via API parameter, not prompt text. Phrases like "really consider this carefully" do not move the dial — they consume tokens that could be carrying your actual task.

Hybrid Workflows

Reasoning models compose with the rest of the toolkit. Three patterns dominate production.

Reasoning model + tool use. The thinking phase is most useful when the model can act on the world between thoughts. Claude's interleaved thinking — built into Opus 4.7 and Sonnet 4.6 — lets the model think between tool calls, not just before the first response. For multi-step tasks where each tool result changes what to do next, this is a different shape of capability than single-shot reasoning. Give the model clean tool definitions and a clear objective; the model plans, acts, observes, replans. Your prompt does not script the algorithm — it gives the model what it needs to choose one. The full pattern is in the agentic prompt stack; the deeper agent-design treatment is in the AI agents prompting guide.

Reasoning model + RAG. Retrieval-Augmented Generation benefits from reasoning at two points. At the synthesis step, a reasoning model is meaningfully better at multi-document synthesis than a standard model — the synthesis is genuinely a reasoning task (which sources support which claims, where do they conflict, what does the combined evidence imply). At the routing step in agentic RAG, a reasoning model's deliberation produces better routing decisions (which tool, which index, whether to query at all) than a standard model's pattern match. The complete walkthrough is in the agentic RAG walkthrough.

Reasoning model as planner in an agentic loop. A common production architecture: use a reasoning model once at the top to produce a plan, then iterate with a fast model handling per-step execution and a reasoning model handling per-step reflection only when execution fails or surfaces ambiguity. This concentrates reasoning cost on the parts that benefit and keeps steady-state per-step cost low. For the working example, see the agentic prompt stack research-agent walkthrough.
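The plan-once, execute-fast, reflect-on-failure shape can be sketched in a few lines. Here `reasoning_model` and `fast_model` are hypothetical stubs standing in for real API clients — the architecture is the point, not the calls:

```python
# Planner/executor loop: one expensive reasoning call up front, cheap fast-model
# execution per step, and a reasoning call only when a step fails.

def reasoning_model(prompt: str) -> str:
    return f"[deliberate] {prompt}"          # stub for an o3/Claude-class call

def fast_model(step: str) -> tuple[str, bool]:
    ok = "ambiguous" not in step             # stub: fail on ambiguous steps
    return f"[fast] {step}", ok

def run(task: str, steps: list[str]) -> list[str]:
    plan = reasoning_model(f"plan: {task}")  # reasoning cost concentrated here
    results = [plan]
    for step in steps:
        out, ok = fast_model(step)           # cheap steady-state execution
        if not ok:                           # escalate only on failure
            out = reasoning_model(f"reflect and retry: {step}")
        results.append(out)
    return results

trace = run("research report", ["gather sources", "ambiguous synthesis", "draft"])
print(len(trace))  # plan + 3 steps -> 4
```

In this toy run, only the plan and the one failing step pay the reasoning tax; the other steps stay on the fast path.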

The general shape: reasoning concentrated on the parts that benefit from deliberation, and the rest of the workflow on cheaper and faster components. The right stack is heterogeneous; the reasoning model is a specialized component, not a universal one.

Honest Evaluation

"It sounds right" and "it is right" are different standards on reasoning-model output. The most dangerous failure mode in this category is the beautiful-sounding wrong answer — coherent prose, confident tone, internally consistent argument, externally false conclusion. Standard models fail by being shallow or hallucinating obviously; reasoning models fail by being deeply, articulately, persuasively wrong. Evaluation has to catch this.

A practical checklist:

  • Goal faithfulness — did the response solve the goal as stated, or solve an adjacent problem? "Recommend a real-time notification architecture optimized for time-to-ship" is not the same goal as "compare WebSockets to SSE in detail."
  • Constraint compliance — walk every constraint and verify; constraint violations are the most common silent failure because the answer otherwise looks competent.
  • Context use — did the model use the supplied context or hallucinate around it? On long-context Claude work this matters most for facts buried in the middle of a reference block.
  • Output shape — JSON parses, schema satisfied, sections present, length within bounds.
  • Audience match — a code review for a junior engineer that reads like an internal post-mortem missed the audience slot, even if the technical content is correct.
  • Reasoning soundness on the hard parts — for high-stakes work, walk the chain on the parts of the answer that mattered most; reasoning models are confident, and that confidence is uncorrelated with correctness on the cases where they are wrong.
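The mechanical slots of the checklist can run as code before any human or judge looks at the response. A minimal sketch — the predicates here are trivial stand-ins; real ones would be task-specific (schema validation, constraint greps, section detection):

```python
# Slot-by-slot pre-screen: each check is a predicate over the raw response.
# Only the mechanically checkable slots belong here; goal faithfulness and
# reasoning soundness still need a judge or a human.
import json

def _parses(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

def evaluate(response: str, required_terms: list[str], max_len: int) -> dict:
    return {
        "output_parses": _parses(response),
        "constraints_mentioned": all(t.lower() in response.lower() for t in required_terms),
        "within_length": len(response) <= max_len,
    }

report = evaluate('{"answer": "use SSE", "latency_ms": 50}', ["latency"], 200)
print(report)  # all three checks pass on this toy response
```

Anything that fails here never reaches the expensive evaluation stages; anything that passes still needs the judgment-based slots checked.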

Coherence is not correctness. The single most important discipline in evaluating reasoning-model output is not letting fluent prose substitute for actual verification. A confidently stated wrong answer is worse than an obviously shallow one because it bypasses the alarm.

Two patterns formalize this evaluation for production work. LLM-as-judge rubrics pass the response back to a different model with an explicit rubric ("score 1-5 on goal faithfulness, constraint compliance, context use, output shape; flag any unsupported claim"). LLM-as-judge inherits some of the same failure modes as the model it judges, but catches a meaningful fraction of beautiful-sounding wrong answers that human reviewers miss at scale. The SurePrompts Quality Rubric is the rubric we use for the prompts themselves; the same shape works for evaluating outputs. Self-critique and self-refine loops generate, critique, revise. The thinking phase makes a single turn better; self-refine makes a sequence reliable. They compose: extended-thinking generate, extended-thinking critique, extended-thinking revise. For high-stakes outputs the marginal cost of a critique-and-revise pass is small relative to the cost of shipping a wrong answer.
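The LLM-as-judge scaffolding around such a rubric is mostly string plumbing. A sketch under two assumptions: the four axes echo the rubric quoted above, and the trailing "SCORES:" line is a convention we impose on the judge in the prompt, not a provider feature:

```python
# Build the rubric prompt for the judge model, then parse its scores back.
AXES = ["goal_faithfulness", "constraint_compliance", "context_use", "output_shape"]

def rubric_prompt(response: str) -> str:
    axes = ", ".join(AXES)
    return (f"Score the response 1-5 on each axis: {axes}. "
            f"Flag any unsupported claim. End with a line 'SCORES: a,b,c,d'.\n\n"
            f"RESPONSE:\n{response}")

def parse_scores(judge_output: str) -> dict:
    line = [l for l in judge_output.splitlines() if l.startswith("SCORES:")][-1]
    values = [int(v) for v in line.removeprefix("SCORES:").split(",")]
    return dict(zip(AXES, values))

verdict = parse_scores("Solid synthesis, one weak citation.\nSCORES: 5,4,5,3")
print(verdict["output_shape"])  # -> 3: route this one to a human
```

A structured score per axis — rather than a free-text verdict — is what lets you threshold, aggregate, and track judge output at scale.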

The deeper category framing lives in the extended thinking glossary entry and extended thinking prompts for Claude. The second-order trap to internalize: reasoning models are good enough at sounding right that the evaluation discipline matters more than it did with shallower models, not less. The tool got better; catching its failures got harder.

Failure Modes

Six anti-patterns that quietly wreck reasoning-model work.

  • Treating a reasoning prompt like a chain-of-thought prompt. Adding "think step by step" to o3 or Claude. The model is already thinking; the phrase is noise at best, narration-bait at worst. Cure: drop the phrase from reasoning-model prompts; keep it in standard-model prompts where it still works.
  • Over-specifying the procedure. Writing a five-step script for a problem the model would have solved better in three steps it chose itself. Cure: state the goal and constraints; let the model find the path.
  • Setting the depth via prose instead of via API. "Please think very carefully about this." The dial does what the phrase pretends to. Cure: set reasoning_effort, budget_tokens, or the equivalent on the request, and keep the prompt focused on the brief.
  • Reasoning on tasks that do not benefit. Classification, format conversion, short Q&A, simple recall — every one of these is overhead on a reasoning model and waste on the bill. Cure: route by task complexity; default to standard models and escalate only when the task warrants deliberation.
  • Treating coherence as correctness. Accepting beautiful-sounding wrong answers because the prose reads competent. Cure: evaluate slot-by-slot against the brief, run high-stakes outputs through llm-as-judge rubrics, layer self-critique loops on shipping work.
  • Ignoring the cost shape. Enabling extended thinking on every route, leaving stable system prompts uncached on Opus, running max-effort traffic through routes that do not need it. The bill grows faster than the quality. Cure: enable per-route, cache stable prefixes, measure lift not just cost, and route to the cheapest tier that meets the bar on each task class.

Our Position

Six opinionated stances we hold on 2026 reasoning-model prompting.

  • Pick the right reasoning tier per task, not per project. o3 for general hard reasoning. o4-mini for STEM and code at o3-class quality for less. Claude Opus 4.7 for long-context dense work. Sonnet 4.6 for the daily-driver tier below it. Gemini Deep Think for parallel exploration and multimodal reasoning. DeepSeek R1 for cost-sensitive transparent-trace work. Qwen QwQ and other open-weights for self-hosted requirements. Project-level single-model choices leave quality and cost on the table.
  • State the goal, not the procedure. Always. The most reliable single move you can make on reasoning prompts. Constraints scope the answer; procedures script the path. Reasoning models reward the first and underperform on the second.
  • Set depth via API parameter, not via prose. reasoning_effort, budget_tokens, effort level, Deep Think toggle. The dial is the dial. Phrases that pretend to move the dial just consume the input budget.
  • Most tasks should not use a reasoning model. Default to standard models. Escalate only when the task warrants deliberation. Cascade by complexity. The cost of routing is much lower than the cost of routing wrong in either direction.
  • Coherent prose is not correct content. Evaluate against the brief, not the vibe. Run high-stakes outputs through llm-as-judge rubrics. Layer self-critique loops on shipping work. The most dangerous failure mode in this category is the beautiful-sounding wrong answer; the evaluation discipline has to be sharper than it was with shallower models, not looser.
  • Reasoning models are components, not replacements. They fit inside agentic loops, on top of RAG pipelines, behind cascade routers — as the part of the stack that thinks before answering. Treating them as a drop-in replacement for everything else pays the cost without capturing the structural benefit.

What's Next: From Reasoning Models to Reasoning Agents

The frontier is moving from single-call reasoning to multi-call reasoning agents. Claude's interleaved thinking — reasoning between tool calls, not just before the first response — is the early version of what becomes default behavior. o3 and o4-mini are increasingly used as the planner and reflector inside agentic loops where most of the per-step work is handled by faster components. Gemini Deep Think paired with search grounding is the early version of an agent that researches, deliberates, and answers in one continuous flow. DeepSeek R1 self-hosted as the reasoning core of a custom agent stack is increasingly common in cost-sensitive production deployments. The single-shot reasoning prompt is becoming the inside of a loop, not the whole interaction.

The skill that compounds: clean reasoning prompts at the inner level make agentic stacks work; messy reasoning prompts compound failures across every iteration of the loop. The discipline scales — what you learn from writing a strong six-slot brief for a single Claude extended-thinking call is what you reuse, ten times, inside an agent that calls Claude ten times across a multi-step workflow. For the agent-side architecture, see the AI agents prompting guide, the agentic prompt stack, the agentic prompt stack research-agent walkthrough, and the agentic RAG walkthrough. For the broader discipline this all sits inside, the context engineering pillar and the Context Engineering Maturity Model. For the two sister pillars in this Phase 3 series, the AI image prompting guide and the AI video prompting guide.


Reasoning-model prompting in 2026 is a brief-writing discipline with a depth-control dial on the side and an evaluation discipline at the end. Pick the right tier for the task. State the goal, not the procedure. Frontload constraints, context, and success criteria. Translate into the model's dialect. Set the budget on the request. Evaluate the answer against the brief, not the vibe. The single-shot reasoning prompt that gets you lucky on the first try is memorable. The repeatable reasoning workflow that ships a correct answer the third time, every time, on the cases you actually need to solve — that is what scales.

Try it yourself

Build expert-level prompts from plain English with SurePrompts — 350+ templates with real-time preview.

Open Prompt Builder

Get ready-made Claude prompts

Browse our curated Claude prompt library — tested templates you can use right away, no prompt engineering required.

Browse Claude Prompts