
AI Reasoning Models: The Complete 2026 Prompting Guide

The canonical 2026 guide to prompting reasoning models — what they actually are, the model landscape (o3, Claude extended thinking, Gemini Deep Think, DeepSeek R1), the universal anatomy of a strong reasoning prompt, per-model dialects, when not to use a reasoning model, and honest evaluation.

SurePrompts Team
April 22, 2026
35 min read

TL;DR

A strong 2026 reasoning-model prompt is not chain-of-thought hand-holding — it is a clear brief that directs an already-thinking model toward the right problem at the right depth. This pillar consolidates the SurePrompts reasoning cluster: what reasoning models actually are, the 2026 model landscape, the universal prompt anatomy, per-model dialects, thinking-budget control, and honest evaluation.

Key takeaways:

  • The reasoning-model market split into four useful shapes in 2026: general high-reasoning (o3, o4-mini), long-context dense work (Claude Opus 4.7 with extended thinking, Sonnet 4.6 thinking), parallel exploration (Gemini 2.5 Pro Deep Think and Flash Thinking), and open-weights cost-efficient (DeepSeek R1, Qwen QwQ). The universal prompt anatomy is shared across all of them; the dialects and the budget knobs are not.
  • The 2023 chain-of-thought playbook actively backfires on these models. "Think step by step," elaborate persona stacking, take-a-deep-breath primers, and few-shot examples on pure reasoning tasks all add noise without unlocking anything — the model is already doing the work you are trying to prompt into existence.
  • A strong reasoning prompt fills six slots: goal (not procedure), constraints, context, audience and output shape, reasoning budget, evaluation criteria. Forgetting any of them means the model picks a generic default, and the default is almost always too shallow or too verbose.
  • Reasoning depth is set via API parameters, not prose. o3's reasoning_effort, Claude's budget_tokens and effort level, Gemini Deep Think's toggle, DeepSeek's streamed chain. "Think harder" in the user message does nothing that the dial does not already do.
  • Most tasks do not need a reasoning model. Direct recall, simple classification, format conversion, latency-sensitive chat — standard models win on cost, speed, and sometimes quality. The practical heuristic: if you cannot name the steps the model should think through, the task probably does not need a reasoning model.
  • Coherent prose is not correct content. A reasoning model that confidently arrives at the wrong answer is the most dangerous failure mode in this category. Evaluate slot-by-slot against the brief, layer self-critique loops on high-stakes work, and use llm-as-judge rubrics for outputs that ship.
  • Reasoning models compose with everything else. They are the planner inside an agentic loop, the synthesizer at the end of a RAG pipeline, and the deep-deliberation step inside a hybrid stack where a fast standard model handles the rest. Do not treat them as a replacement for the rest of the toolkit — treat them as the part of it that thinks before answering.

Two years ago, "let's think step by step" was the single most reliable trick in prompt engineering. You added it to a prompt, the model wrote out its reasoning, and accuracy went up. It was magic — until the architecture changed underneath it. In 2026 that same phrase is either useless or actively counterproductive on the frontier reasoning models. The reasoning is happening anyway, in dedicated hidden tokens you may never see; what you write in the prompt now directs deliberation that already exists, rather than coaxing reasoning into being.

This pillar consolidates the SurePrompts reasoning-models cluster into a canonical entry point. Each section links out to the deep-dive post for the model or technique it references. Use this page to pick the right reasoning tier for a problem, learn the shared six-slot anatomy, understand the per-model dialects, and know when not to reach for a reasoning model at all. For the broader discipline this sits inside, see context engineering — the 2026 replacement for prompt engineering as a generic label. For the two sister pillars in this Phase 3 series, see AI image prompting and AI video prompting; the structural moves are the same, the modality and the dialects are not.

What Reasoning Models Actually Are in 2026

A reasoning model is a language model that performs a dedicated deliberation pass before producing its visible answer. The deliberation lives in a separate phase from the response — sometimes in fully hidden tokens (o3, o4-mini), sometimes in a separately-budgeted but inspectable thinking block (Claude extended thinking), sometimes streamed inline as a transparent chain (DeepSeek R1), sometimes in a parallel exploration of multiple hypotheses (Gemini Deep Think). The shared property is that the model has spent meaningful test-time compute on the problem before it commits to an answer.

Mechanically this is built-in chain-of-thought. The model was already going to reason; the architectural change gives that reasoning a budgeted place to live and tells the sampler not to force an answer until that place is used. The economic change is that you now pay for the thinking tokens as input on top of the visible output. The behavioral change — the one that matters for prompting — is that the reasoning is no longer something you elicit. It is something you direct.

Five things shift when you move from a standard model to a thinking one.

Reasoning location. In a standard model, the only reasoning is in the visible output. That is why chain-of-thought prompting worked — you were literally giving the model space to reason by asking it to "think step by step." In a thinking model, the reasoning happens in a phase you do not write into. Your job shifts from eliciting reasoning to directing it.

Token economics. Thinking tokens are billed. On a simple task they are pure waste; on a complex one they are the difference between a wrong answer and a right one. The cost shape of a reasoning-model app is determined by which routes enable thinking and what budget they use.

Latency. Reasoning calls are slower than standard calls by 3-10x at high effort. For interactive applications, that latency is a UX cost. For batch or background work, it is invisible.

Visibility. The thinking trace is, depending on the model, fully hidden, separately inspectable, or streamed inline. Inspectable traces help debugging and audit. Hidden traces force you to evaluate the answer alone. Streamed traces (DeepSeek R1) are sometimes user-facing in product UI.

Failure modes. Standard models fail by being shallow or hallucinating confidently. Reasoning models fail by overthinking trivial tasks, deliberating against the wrong target when the brief is fuzzy, and most dangerously by arriving at coherent-sounding wrong answers after long deliberation. The evaluation discipline has to match.

For the deeper category framing, see the reasoning-model glossary entry, the test-time compute glossary entry, the extended-thinking glossary entry, and the thinking-model glossary entry. The mental model worth carrying: standard model prompting is getting the model to reason at all; reasoning model prompting is directing a model that already reasons toward the right problem at the right depth.

The 2026 Model Landscape

The reasoning-model market in 2026 is not a one-horse race. Each major model has a distinct personality, a distinct control surface, and a distinct cost shape. Picking the right model per task is half the work.

| Model | Reasoning depth control | Visible thinking | Strong at | Cost shape | Commercial terms |
|---|---|---|---|---|---|
| OpenAI o3 | reasoning_effort: low / medium / high | Hidden | General hard reasoning, math, code, multi-step problems | Premium per-token, high at high effort | Per OpenAI terms |
| OpenAI o4-mini | reasoning_effort: low / medium / high | Hidden | STEM, code, structured-data extraction at o3-class quality for less | Substantially cheaper than o3, often matches on STEM | Per OpenAI terms |
| Claude Opus 4.7 (extended thinking) | budget_tokens cap + effort level low/med/high/max | Inspectable trace | Long-context dense work, code review, legal, multi-file debugging, planning | Premium per-token; caching matters | Per Anthropic terms |
| Claude Sonnet 4.6 (thinking) | budget_tokens cap + effort level | Inspectable trace | Daily-driver tier; most analysis and writing where Opus is overkill | Mid-tier; the routing default for most workflows | Per Anthropic terms |
| Gemini 2.5 Pro Deep Think | Toggle on/off (Gemini app, API) | Limited visibility | Genuinely open problem spaces, parallel exploration, multimodal reasoning | Premium tier (Google AI Ultra) | Per Google terms |
| Gemini 2.5 Flash Thinking | Toggle | Limited visibility | Cost-efficient reasoning, multimodal-aware fast iteration | Mid-tier | Per Google terms |
| DeepSeek R1 | Streamed chain (hosted limits or local compute) | Fully visible inline | Math, formal reasoning, transparent traces, cost-sensitive workloads | Order-of-magnitude cheaper API; self-hostable | Open weights; check license |
| Qwen QwQ | Streamed chain | Fully visible inline | Open-weights local reasoning, fine-tuneable for domain work | Open-weights cost; depends on infrastructure | Open weights; check license |

A few threads worth pulling on.

OpenAI o3 is the general-purpose hard reasoner. Its reasoning_effort dial — low, medium, high — is the cleanest depth control in the market, and at high effort it sits at the top of every major reasoning benchmark on multi-step math, code, and analysis. The thinking is hidden, but structured outputs (JSON Schema) let you separate reasoning from response shape entirely. o4-mini is not a smaller o3 — it is optimized for STEM and code at o3-class quality for substantially less, often matching o3 on math, science, and code generation at 5-10x lower cost. The full o3 and o4-mini patterns live in advanced prompt engineering for Claude, GPT-5, and Gemini.

Claude Opus 4.7 with extended thinking is the workhorse for long-context dense work where the brief is rich and the success criteria are specific. Its 1M-token context, prompt caching, and inspectable thinking trace combine into a stack that rewards heavy briefs — multi-file code review, contract analysis with interacting clauses, large-document synthesis, debugging across many functions. The thinking is governed by budget_tokens plus an effort level (low, medium, high, max). Sonnet 4.6 sits one tier below as the daily-driver. The complete Opus 4.7 playbook is in the Claude Opus 4.7 prompting guide; broader Claude patterns are in the Claude 4 prompting guide; the extended-thinking specifics are in extended thinking prompts for Claude.

Gemini 2.5 Pro Deep Think is the most architecturally different of the four families. Instead of a single reasoning chain with adjustable depth, Deep Think runs parallel reasoning — multiple hypotheses generated and considered simultaneously, revised or combined before a single answer is committed. You are not asking a single reasoner to think harder; you are asking a committee of parallel reasoners to explore a space. Prompts that open the space with multiple framings get dramatically better results. Deep Think also pairs with Gemini's multimodal strength in a way no other reasoning model matches. Flash Thinking is the cost-efficient sibling. The Deep Think patterns sit alongside the o3 and Claude patterns in advanced prompt engineering for Claude, GPT-5, and Gemini.

DeepSeek R1 is the open-weights reasoning option. It delivers near-frontier reasoning at API pricing roughly 5-10x lower than the closed competitors, with downloadable weights for self-hosting. Its thinking trace is fully visible inline — the chain is the product, not a byproduct — useful when you want users or auditors to see the reasoning. R1 rewards explicit step-by-step requests, structured problem formats, and verification gates; the DeepSeek vs ChatGPT comparison covers strategic positioning, and the 40 best DeepSeek prompts post is the template library tuned for R1's strengths.

Qwen QwQ and other open-weights reasoners sit a step below R1 on raw quality but expand the local-deployment surface. They are the right tool when self-hosting is non-negotiable (data sovereignty, regulated workloads, sensitive proprietary data). Output ceiling lower than the frontier; pipeline ceiling higher because you own everything.

For cost-sensitive routing across this landscape, see the model-cascade glossary entry and the hybrid-workflows section below — most production reasoning workloads use a fast standard model for the parts that do not need deliberation and route only the genuinely hard turns to a reasoning model.

The Universal Reasoning-Prompt Anatomy

Every strong reasoning prompt — regardless of model — covers six slots. You can omit a slot on purpose. You cannot forget the slot exists. When a slot is missing, the model fills it with a plausible default, and the default is almost always too shallow or too verbose.

1. Goal — what success looks like. State the outcome, not the procedure. "Identify the security vulnerabilities that could lead to data exposure or unauthorized access, ranked by severity" is a goal. "First check for SQL injection, then check for XSS, then check for CSRF" is a procedure. The first lets the model apply its full reasoning capacity, including categories you would not have thought to list. The second caps quality at the level of your enumeration. Reasoning models are path-finders; give them the destination, not the route.

2. Constraints — the hard rules and binding requirements. Constraints define the boundaries of acceptable output. "Keep the response under 500 words." "Only use information from the provided documents." "All code must be Python 3.12 compatible." "Do not recommend solutions that cost more than $10,000 a month." These are different from procedures — constraints scope the answer, procedures script the path. Reasoning models reward tight constraints because their thinking phase uses them as steering signals.

3. Context — what the model needs to know that is not on the public internet. Project-specific facts, prior decisions, codebase conventions, the team's runway, the migration history, the customer's domain. Without context the model deliberates against an imagined generic situation; with context it reasons against your actual one. For long-context Claude work, this is the wrapped reference block that benefits from prompt caching. For agentic runs, this is the session memory the model carries across turns. For one-shot prompts, this is a "key facts" block right before the task.

4. Audience and output shape. Who reads this and in what format. "JSON matching the schema below" is an output shape. "A 200-word executive summary followed by a numbered action list" is also an output shape. "Make it good" is not. Audience changes the answer in ways that matter — a code review for a junior engineer versus a senior reviewer is a different artifact even from the same diff. Name both.

5. Reasoning budget. This is not a prose slot — it is an API parameter. o3's reasoning_effort, Claude's budget_tokens plus effort level, Gemini Deep Think's toggle, DeepSeek's hosted-service or self-hosted limits. Set it deliberately; do not try to coerce more thinking through phrases like "really consider this" or "think very carefully." The dial does what the prose pretends to do, more reliably and without bloating the input. The right budget is "enough to not truncate mid-reason on the hardest case in your eval," not "the maximum allowed."

6. Evaluation criteria — what a correct answer looks like. Name the standards the model can use to self-check inside the thinking phase. "Verify your answer against the original constraints and test it with edge cases." "After choosing an algorithm, confirm the time complexity matches the stated performance requirement." "Before finalizing, check that every cited claim appears in the supplied document." Reasoning models can self-verify when you give them something to verify against; without an evaluation slot they ship the first plausible answer and call it done.

A worked example. The weak version: "Analyze this dataset by first calculating the mean, then the median, then the standard deviation, then identifying outliers using the IQR method, then summarizing trends." The strong version names the goal (find patterns and anomalies that affect a specific business decision), the context (90 days of SaaS transaction data, three pricing tiers, an open question about a mid-tier), the constraints (use only the supplied data, flag claims that depend on outside data), the output shape (markdown report with executive summary, ranked patterns, anomalies, recommendation), the audience (pricing team, quantitative-comfortable), and the evaluation (verify each cited statistic against the data, confirm the recommendation follows from the patterns). Reasoning effort is set on the API call, not in the prose. The strong version is longer because it fills slots, not because it is more ornate — every phrase is doing work. This is what 2026-native reasoning prompts look like.
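The six slots are mechanical enough to template. Here is a minimal sketch of a brief builder; the slot names and layout are this post's convention, not any vendor's API, and the example values echo the worked SaaS-pricing brief above.

```python
def build_reasoning_brief(goal, constraints, context, audience, output_shape,
                          evaluation):
    """Assemble the six slots into one brief.

    The reasoning budget is deliberately absent: it belongs on the API
    request (reasoning_effort, budget_tokens), not in the prose.
    """
    sections = [
        ("Goal", goal),
        ("Constraints", "\n".join(f"- {c}" for c in constraints)),
        ("Context", context),
        ("Audience", audience),
        ("Output shape", output_shape),
        ("Evaluation criteria", "\n".join(f"- {e}" for e in evaluation)),
    ]
    return "\n\n".join(f"## {name}\n{body}" for name, body in sections)

brief = build_reasoning_brief(
    goal="Find patterns and anomalies that affect the mid-tier pricing decision.",
    constraints=["Use only the supplied data.",
                 "Flag any claim that depends on outside data."],
    context="90 days of SaaS transaction data across three pricing tiers.",
    audience="Pricing team, quantitative-comfortable.",
    output_shape="Markdown report: executive summary, ranked patterns, "
                 "anomalies, recommendation.",
    evaluation=["Verify each cited statistic against the data.",
                "Confirm the recommendation follows from the patterns."],
)
```

Note what the builder cannot produce: a "think harder" line. The depth slot has no prose representation on purpose.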

Why the 2023 Chain-of-Thought Playbook Backfires

This is the central counter-intuitive insight of the category. The techniques that made you effective with GPT-4 and Claude 3 can actively hurt your results with reasoning models. The Wharton 2025 Prompting Science Report's finding — that chain-of-thought prompting adds negligible benefit on models that already think step-by-step — is the same principle that explains every item in this section: the model is doing the work you are trying to prompt into existence.

Redundant reasoning narration. Asking a reasoning model to "think step by step" in the visible response either duplicates the work — once in hidden thinking tokens, once in narrated output, doubling cost without improving quality — or causes the visible narration to drift from the hidden chain because it improvises beyond what the deliberation produced. Both are worse than trusting the dedicated thinking phase. If you need the reasoning shown, ask for a "brief justification" after the answer.

Over-specified procedures. A prompt that reads like a procedure manual replaces the model's reasoning with yours, and the model's reasoning is usually better than the script you would write, because it can explore approaches you would not have thought of — race conditions in a security audit, alternative algorithms in a code review. Procedures cap quality at your enumeration. Constraints scope the answer without scripting the path.

Anchoring few-shot examples on reasoning tasks. Few-shot prompting still wins for pattern-matching, format-following, and classification. On reasoning tasks where you want the model to think from scratch, examples backfire — the model anchors on your specific solution path and reduces solution diversity. The same applies to the alternative reasoning patterns the field developed for older models: step-back prompting, least-to-most prompting, and tree-of-thought all encoded reasoning structure into the prompt. On a thinking model, that structure is already happening; encoding it externally either constrains the internal version or wastes tokens reproducing it.

"Think step by step" as wasted tokens. The canonical anti-pattern. Drop it from reasoning-model prompts; keep it in standard-model prompts where it still works.

Persona stacking, take-a-deep-breath primers, confidence-eliciting phrases. "You are a senior X with 15 years of experience" was a 2023 trick that did real work on small instruction-tuned models. On a frontier reasoning model, detailed personas underperform direct task framing, emotional primers do nothing measurable, and "if you're unsure, say so" is now a tax because reasoning models self-flag uncertainty inside the thinking phase when it matters. State the task, provide the context, set the evaluation criteria — skip the costume.

The shift to internalize: in 2023, prompt engineering was about coaxing reasoning out of models that did not want to reason. In 2026 on reasoning models it is about directing models that are already reasoning toward the right problem at the right depth. What worked then often fails now, not because the models got worse but because the failure mode changed.

Per-Model Dialects

Six slots are portable. How you express them shifts by platform.

OpenAI o3 and o4-mini

o3 rewards clean prose plus structured outputs for the response shape. The reasoning_effort parameter does the depth-control work; the user message focuses on the brief. Use structured outputs (JSON Schema) to separate reasoning from output format entirely. Use system prompts for persona and persistent rules; keep evaluation criteria and constraints in the user message where they sit next to the problem context. Pick o3 vs o4-mini deliberately — o4-mini is optimized for STEM, code, and structured extraction at o3-class quality for substantially less; o3 wins on open-ended analysis and tasks that need the highest reasoning ceiling.

A short o3 prompt for an architectural decision states the current stack, the team size, the expected scale, the optimization criteria, and the explicit failure-mode flag, with reasoning_effort: high on the request. No "think step by step," no persona stacking — the constraints and criteria are explicit, and the model will consider options you have not listed and weight them against your actual situation. The full o3, GPT-5.4, and Gemini patterns are in advanced prompt engineering for Claude, GPT-5, and Gemini.
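As a sketch of the depth-is-a-parameter rule, here is what an o3-style request payload might look like. The field layout loosely follows OpenAI's Responses API; the model id and prompt text are placeholder assumptions, so check the current API reference before relying on the shape.

```python
def o3_request(brief: str, effort: str = "medium") -> dict:
    """Build a request dict where depth lives in a parameter, not the prose."""
    if effort not in {"low", "medium", "high"}:
        raise ValueError("reasoning_effort must be low, medium, or high")
    return {
        "model": "o3",                    # assumed model id
        "reasoning": {"effort": effort},  # the depth dial
        "input": brief,                   # the brief carries no reasoning coaching
    }

req = o3_request(
    "Recommend a queueing architecture for our stack: 4-person team, "
    "50k msgs/sec expected, optimize for operational simplicity. "
    "Flag the failure mode most likely to bite us first.",
    effort="high",
)
```

The brief states stack, scale, criteria, and the failure-mode flag; the dial does the rest.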

Claude with Extended Thinking

Claude rewards XML-tagged content and explicit format pinning. Wrap reference material in tags (<code>, <spec>, <constraints>, <review_criteria>) so it reads as data, not meta-instructions. Pin the output shape at the tail — the last thing the model reads before emerging from thinking is the most reliable place to keep binding output requirements salient. Set the budget via API, not prose: Claude's budget_tokens cap and effort level (low, medium, high, max) are the depth knobs. Start low, raise only when the trace truncates on a real task. Enable per-route, not per-app — extended thinking on every turn overpays classification routes.

A short Claude prompt for a code review wraps the diff in <code_to_review> tags, lists the operational facts in <constraints>, names the bar in <review_criteria>, pins a strict JSON output schema at the tail, and sets budget_tokens and effort: high on the request. The system message defines the role; the user message wraps content in tags; the criteria are named; the budget is on the request — not in the prose. The Opus 4.7 specifics — 1M-context structuring, prompt caching at Opus pricing, tool-use patterns — live in the Claude Opus 4.7 prompting guide. The broader Claude patterns are in the Claude 4 prompting guide. The extended-thinking-specific patterns — when to enable, when it hurts, how to size the budget — are in extended thinking prompts for Claude.
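The Claude dialect described above can be sketched the same way: XML-wrapped content, format pinned at the tail, budget on the request. The thinking-block shape loosely follows Anthropic's extended-thinking API; the model id, tag names, and default budget are placeholder assumptions.

```python
def claude_review_request(diff: str, constraints: str, criteria: str,
                          schema: str, budget_tokens: int = 10_000) -> dict:
    """Build a code-review request in the Claude dialect."""
    user = (
        f"<code_to_review>\n{diff}\n</code_to_review>\n"
        f"<constraints>\n{constraints}\n</constraints>\n"
        f"<review_criteria>\n{criteria}\n</review_criteria>\n"
        # Pin the output shape at the tail, where it stays most salient.
        f"Respond with JSON matching this schema, and nothing else:\n{schema}"
    )
    return {
        "model": "claude-opus-latest",  # placeholder id
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": user}],
    }

req = claude_review_request(
    diff="--- a/auth.py\n+++ b/auth.py\n...",
    constraints="Runs on Python 3.12; no new dependencies.",
    criteria="Security issues ranked by severity; note anything blocking merge.",
    schema='{"findings": [{"severity": "", "file": "", "issue": ""}]}',
)
```

Note the budget is a request field, not a sentence in the prompt.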

Gemini Deep Think

Gemini 2.5 Pro Deep Think runs parallel reasoning — multiple hypotheses generated and considered simultaneously before a single answer is committed. Open the problem space, do not narrow it: instead of "what is the best approach to X?", ask "explore at least three distinct approaches to X, compare their tradeoffs, then recommend one." Outputs become more honest — Deep Think is more likely to surface the runner-up and explain why it lost. Stack constraints freely; Deep Think handles layered constraints well because it weighs them in parallel. Pair with multimodal input — Gemini reasons across diagrams, charts, photographs, and recorded media in a way no other reasoning model matches. Combine with Google Search grounding for questions that need both current data and deep analysis.

A short Deep Think prompt for a strategic analysis: "Analyze this quarter's performance. Explore at least three narratives that explain the Q3 revenue dip, using evidence from the attached PDF, the earnings call transcript, and the roadmap timeline. For each narrative, list the strongest supporting evidence and the strongest counter-evidence. Then recommend which narrative leadership should adopt in the public messaging." What is not in this prompt: no "think step by step," no "you are a financial analyst." Deep Think is already going to think carefully — your job is to frame the exploration, not coach the reasoning.

DeepSeek R1

DeepSeek R1's distinguishing feature is its fully visible streamed thinking trace — the chain is the product, not the byproduct. Request explicit reasoning chains: "think step by step, show every step of your reasoning." This is the opposite of what works on o3 or Claude, and works on R1 specifically because R1's architecture treats the chain as a first-class output. Use structured problem formats (Given/Find/Solution); R1 handles formal structures more reliably than conversational requests. Be explicit about verification — "verify your answer against the original constraints" — R1's reasoning capability makes self-verification actually useful and naming the step produces a tangible quality lift on math and logic tasks.

A short R1 prompt for a logic problem opens with the explicit step-by-step request, then names a six-step procedure (restate, identify given information, plan, execute showing work, verify against constraints, state the final answer), and closes with "if you are uncertain about any step, flag it and explain why." The structure that backfires on o3 and Claude — explicit step-by-step procedure — is the structure R1 rewards, because the chain is meant to be visible and the structure is what readers and verifiers use. The strategic positioning of DeepSeek versus the closed competitors is in DeepSeek vs ChatGPT; the full template library — 40 prompts across reasoning, math, coding, writing, research, business, and creative work — is in the 40 best DeepSeek prompts for 2026.
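That R1 scaffold is easy to keep as a reusable template. This is a sketch of the Given/Find/Solution structure described above; the example problem is illustrative.

```python
# R1 dialect template: explicit step-by-step request, formal scaffold,
# verification gate, uncertainty flag. This structure is the point on R1,
# and the same structure that backfires on o3 and Claude.
R1_TEMPLATE = """Think step by step and show every step of your reasoning.

Given: {given}
Find: {find}

Solution steps:
1. Restate the problem in your own words.
2. Identify the given information.
3. Plan your approach.
4. Execute the plan, showing all work.
5. Verify the answer against the original constraints.
6. State the final answer.

If you are uncertain about any step, flag it and explain why."""

prompt = R1_TEMPLATE.format(
    given="Three switches outside a room; exactly one controls the bulb inside.",
    find="A procedure that identifies the correct switch with one room entry.",
)
```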

Qwen QwQ and Other Open-Weights Reasoners

Qwen QwQ sits in the same shape as DeepSeek R1 — open weights, streamed visible thinking, strong on math and code, downloadable for self-hosting. Quality trails R1 by a step on raw benchmarks; the deployment surface is similar. The dialect transfers: explicit step-by-step requests, structured problem formats, verification gates. Other open-weights reasoners (small Llama-based fine-tunes, research releases) follow the same general shape but with more variability.

These models matter most when self-hosting is non-negotiable — data sovereignty, regulated workloads, sensitive proprietary data — or when cost at extreme scale makes even DeepSeek's hosted API uneconomical. The pipeline ceiling is high (full control, fine-tuning, custom inference stacks); the output ceiling is lower than the frontier. Treat them the way the image pillar treats Stable Diffusion — the option when you need ownership more than you need the absolute top of the quality curve.

Reasoning Tier Selection: When NOT to Use a Reasoning Model

Counterweight section. Most tasks do not need a reasoning model. Using one for simple work is like using a scanning electron microscope to check if your plants need watering — technically it works, but you are wasting time, money, and latency for no quality gain.

When standard models win. Direct recall is a single forward pass; reasoning is overhead. Simple classification, sentiment, and labeling are pattern-matching, not reasoning. Format conversion is transformation. Short creative outputs reflect taste, not deliberation. Single-paragraph extraction is grounded in supplied text. Latency-sensitive turns — chat, in-product assistants, support — pay a multi-second tax users notice and the task does not need. Reasoning models on simple tasks sometimes produce worse output than standard models because they second-guess obvious responses. The cost is bidirectional — money and quality both.

The model-cascade pattern. Most production reasoning workloads should not be all-reasoning-model traffic. The model-cascade pattern routes requests by complexity: a fast standard model (GPT-4o, Claude Haiku, Gemini Flash) handles obvious cases, and only the genuinely hard turns escalate. The router can be a small model, a heuristic on the input, or a confidence threshold from the standard model's first attempt. Done well, cascades cut cost and latency by an order of magnitude while preserving the quality lift on the cases that need it.
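A minimal cascade router can be almost embarrassingly simple. The trigger keywords, length threshold, and tier names below are illustrative assumptions; in production the router might be a small classifier model or a confidence threshold on the standard model's first attempt instead.

```python
# Words that suggest multi-factor reasoning rather than lookup or labeling.
REASONING_TRIGGERS = ("why", "trade-off", "tradeoff", "design", "debug",
                      "prove", "compare", "plan")

def route(request: str) -> str:
    """Return the tier a request should run on: cheap default, escalate rarely."""
    text = request.lower()
    hard = (any(t in text for t in REASONING_TRIGGERS)
            or len(text.split()) > 120)   # long briefs usually mean hard tasks
    return "reasoning-model" if hard else "fast-standard-model"

route("Classify this ticket as billing or technical")        # stays on the fast tier
route("Design a migration plan and compare the trade-offs")  # escalates
```

Even a crude heuristic like this captures most of the cost win, because the traffic distribution is the asset: the obvious cases vastly outnumber the hard ones.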

Hybrid workflows where reasoning plans and a fast model executes. Use a reasoning model once at the start of a workflow to produce a plan, then hand each step to a fast model for execution. The reasoning model spends its budget on the hard part (the plan); the fast model spends its speed on the volume part. For agentic loops, the same pattern applies — reasoning model as planner-and-reflector, standard model as per-tool-call worker. The full architecture is in the agentic prompt stack; the working example is in the research-agent walkthrough.

The practical heuristic. If you could solve the task yourself in under 30 seconds with full context, a standard model is probably sufficient. If the task requires multiple interacting factors, tradeoff weighing, or chained logic, reach for a reasoning model. If you cannot name the steps you would expect the model to think through, the task probably does not need a reasoning model.

Controlling Reasoning Depth

Depth is a dial, not a prompt trick. Each model exposes the dial differently.

OpenAI o3 and o4-mini. Three levels: low, medium, high. Low for straightforward reasoning — clear-spec coding, synthesis Q&A. Medium is the default for analysis, multi-step problems, writing that needs planning. High for complex math, formal proofs, multi-file code generation, problems with many interacting constraints. Every step up costs latency and tokens; default lower and raise only when accuracy matters more than speed or cost.

Claude extended thinking. Two knobs: budget_tokens (the cap) and effort level (low, medium, high, max in 4.6+). The budget governs how much room the model has; the effort level governs how aggressively it uses that room. Typical production range: 10K tokens for simple analysis, up to 100K for complex multi-step problems. Setting the budget too low truncates mid-reason; too high wastes money. Start low and raise only when an eval shows truncation on real tasks.

Gemini Deep Think. A toggle, not a granular dial — on or off at the app level for Google AI Ultra subscribers, at the model-tier level via API. Flash Thinking is the cost-efficient sibling. Binary decision: does the task warrant parallel hypothesis exploration? Yes — open problem space, multimodal input, layered constraints. No — Flash Thinking or 2.5 Pro without Deep Think.

DeepSeek R1. Streamed chain bounded by hosted-service limits or self-hosted compute. Depth is more emergent than dialed — R1 reasons until it converges or hits the limit. The control surface is less granular than o3 or Claude, but the fully visible trace makes debugging and audit easier than with hidden-chain models.

Cost implications. Thinking tokens are billed as input tokens. Routine problems produce short traces; hard ones use the full budget. Higher budget is not linear quality — the returns curve is steep then flat; you want enough, not maximum. Cache the static prefix (system prompts, stable context) where supported. Enable per-task, not per-app. Measure lift, not just cost.
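The billing shape is worth making concrete. A back-of-envelope cost model, using placeholder per-million-token rates rather than any vendor's actual prices, shows why the thinking budget dominates cost on hard routes:

```python
def call_cost(prompt_tokens: int, thinking_tokens: int, output_tokens: int,
              input_rate: float = 2.0, output_rate: float = 8.0) -> float:
    """Dollar cost of one call, with thinking tokens billed at the input rate.

    Rates are hypothetical $/1M-token figures; the shape, not the numbers,
    is the point: thinking is billed on top of the visible output.
    """
    billed_input = prompt_tokens + thinking_tokens
    return (billed_input * input_rate + output_tokens * output_rate) / 1_000_000

cheap = call_cost(1_000, 0, 500)        # thinking disabled
deep = call_cost(1_000, 20_000, 500)    # same task, generous budget
```

The same prompt with a 20K-token thinking budget costs many times the thinking-disabled call, which is why enabling per-route rather than per-app matters.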

The general rule across all four families: set depth via API parameter, not prompt text. Phrases like "really consider this carefully" do not move the dial — they consume tokens that could be carrying your actual task.

Hybrid Workflows

Reasoning models compose with the rest of the toolkit. Three patterns dominate production.

Reasoning model + tool use. The thinking phase is most useful when the model can act on the world between thoughts. Claude's interleaved thinking — built into Opus 4.7 and Sonnet 4.6 — lets the model think between tool calls, not just before the first response. For multi-step tasks where each tool result changes what to do next, this is a different shape of capability than single-shot reasoning. Give the model clean tool definitions and a clear objective; the model plans, acts, observes, replans. Your prompt does not script the algorithm — it gives the model what it needs to choose one. The full pattern is in the agentic prompt stack; the deeper agent-design treatment is in the AI agents prompting guide.

Reasoning model + RAG. Retrieval-Augmented Generation benefits from reasoning at two points. At the synthesis step, a reasoning model is meaningfully better at multi-document synthesis than a standard model — the synthesis is genuinely a reasoning task (which sources support which claims, where do they conflict, what does the combined evidence imply). At the routing step in agentic RAG, a reasoning model's deliberation produces better routing decisions (which tool, which index, whether to query at all) than a standard model's pattern match. The complete walkthrough is in the agentic RAG walkthrough.

Reasoning model as planner in an agentic loop. A common production architecture: use a reasoning model once at the top to produce a plan, then iterate with a fast model handling per-step execution and a reasoning model handling per-step reflection only when execution fails or surfaces ambiguity. This concentrates reasoning cost on the parts that benefit and keeps steady-state per-step cost low. For the working example, see the agentic prompt stack research-agent walkthrough.
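The plan-once, execute-fast, reflect-on-failure shape can be sketched in a few lines. Here `reasoning_model` and `fast_model` are hypothetical stubs standing in for real API clients — the architecture is the point, not the calls:

```python
# Planner/executor loop: one expensive reasoning call up front, cheap fast-model
# execution per step, and a reasoning call only when a step fails.

def reasoning_model(prompt: str) -> str:
    return f"[deliberate] {prompt}"          # stub for an o3/Claude-class call

def fast_model(step: str) -> tuple[str, bool]:
    ok = "ambiguous" not in step             # stub: fail on ambiguous steps
    return f"[fast] {step}", ok

def run(task: str, steps: list[str]) -> list[str]:
    plan = reasoning_model(f"plan: {task}")  # reasoning cost concentrated here
    results = [plan]
    for step in steps:
        out, ok = fast_model(step)           # cheap steady-state execution
        if not ok:                           # escalate only on failure
            out = reasoning_model(f"reflect and retry: {step}")
        results.append(out)
    return results

trace = run("research report", ["gather sources", "ambiguous synthesis", "draft"])
print(len(trace))  # plan + 3 steps -> 4
```

In this toy run, only the plan and the one failing step pay the reasoning tax; the other steps stay on the fast path.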

The general shape: reasoning concentrated on the parts that benefit from deliberation, and the rest of the workflow on cheaper and faster components. The right stack is heterogeneous; the reasoning model is a specialized component, not a universal one.

Honest Evaluation

"It sounds right" and "it is right" are different standards on reasoning-model output. The most dangerous failure mode in this category is the beautiful-sounding wrong answer — coherent prose, confident tone, internally consistent argument, externally false conclusion. Standard models fail by being shallow or hallucinating obviously; reasoning models fail by being deeply, articulately, persuasively wrong. Evaluation has to catch this.

A practical checklist:

  • Goal faithfulness — did the response solve the goal as stated, or solve an adjacent problem? "Recommend a real-time notification architecture optimized for time-to-ship" is not the same goal as "compare WebSockets to SSE in detail."
  • Constraint compliance — walk every constraint and verify; constraint violations are the most common silent failure because the answer otherwise looks competent.
  • Context use — did the model use the supplied context or hallucinate around it? On long-context Claude work this matters most for facts buried in the middle of a reference block.
  • Output shape — JSON parses, schema satisfied, sections present, length within bounds.
  • Audience match — a code review for a junior engineer that reads like an internal post-mortem missed the audience slot, even if the technical content is correct.
  • Reasoning soundness on the hard parts — for high-stakes work, walk the chain on the parts of the answer that mattered most; reasoning models are confident, and that confidence is uncorrelated with correctness on the cases where they are wrong.
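The mechanical slots of the checklist can run as code before any human or judge looks at the response. A minimal sketch — the predicates here are trivial stand-ins; real ones would be task-specific (schema validation, constraint greps, section detection):

```python
# Slot-by-slot pre-screen: each check is a predicate over the raw response.
# Only the mechanically checkable slots belong here; goal faithfulness and
# reasoning soundness still need a judge or a human.
import json

def _parses(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

def evaluate(response: str, required_terms: list[str], max_len: int) -> dict:
    return {
        "output_parses": _parses(response),
        "constraints_mentioned": all(t.lower() in response.lower() for t in required_terms),
        "within_length": len(response) <= max_len,
    }

report = evaluate('{"answer": "use SSE", "latency_ms": 50}', ["latency"], 200)
print(report)  # all three checks pass on this toy response
```

Anything that fails here never reaches the expensive evaluation stages; anything that passes still needs the judgment-based slots checked.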

Coherence is not correctness. The single most important discipline in evaluating reasoning-model output is not letting fluent prose substitute for actual verification. A confidently stated wrong answer is worse than an obviously shallow one because it bypasses the alarm.

Two patterns formalize this evaluation for production work. LLM-as-judge rubrics pass the response back to a different model with an explicit rubric ("score 1-5 on goal faithfulness, constraint compliance, context use, output shape; flag any unsupported claim"). LLM-as-judge inherits some of the same failure modes as the model it judges, but catches a meaningful fraction of beautiful-sounding wrong answers that human reviewers miss at scale. The SurePrompts Quality Rubric is the rubric we use for the prompts themselves; the same shape works for evaluating outputs. Self-critique and self-refine loops generate, critique, revise. The thinking phase makes a single turn better; self-refine makes a sequence reliable. They compose: extended-thinking generate, extended-thinking critique, extended-thinking revise. For high-stakes outputs the marginal cost of a critique-and-revise pass is small relative to the cost of shipping a wrong answer.
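The LLM-as-judge scaffolding around such a rubric is mostly string plumbing. A sketch under two assumptions: the four axes echo the rubric quoted above, and the trailing "SCORES:" line is a convention we impose on the judge in the prompt, not a provider feature:

```python
# Build the rubric prompt for the judge model, then parse its scores back.
AXES = ["goal_faithfulness", "constraint_compliance", "context_use", "output_shape"]

def rubric_prompt(response: str) -> str:
    axes = ", ".join(AXES)
    return (f"Score the response 1-5 on each axis: {axes}. "
            f"Flag any unsupported claim. End with a line 'SCORES: a,b,c,d'.\n\n"
            f"RESPONSE:\n{response}")

def parse_scores(judge_output: str) -> dict:
    line = [l for l in judge_output.splitlines() if l.startswith("SCORES:")][-1]
    values = [int(v) for v in line.removeprefix("SCORES:").split(",")]
    return dict(zip(AXES, values))

verdict = parse_scores("Solid synthesis, one weak citation.\nSCORES: 5,4,5,3")
print(verdict["output_shape"])  # -> 3: route this one to a human
```

A structured score per axis — rather than a free-text verdict — is what lets you threshold, aggregate, and track judge output at scale.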

The deeper category framing lives in the extended thinking glossary entry and extended thinking prompts for Claude. The second-order trap to internalize: reasoning models are good enough at sounding right that the evaluation discipline matters more than it did with shallower models, not less. The tool got better; catching its failures got harder.

Failure Modes

Six anti-patterns that quietly wreck reasoning-model work.

  • Treating a reasoning prompt like a chain-of-thought prompt. Adding "think step by step" to o3 or Claude. The model is already thinking; the phrase is noise at best, narration-bait at worst. Cure: drop the phrase from reasoning-model prompts; keep it in standard-model prompts where it still works.
  • Over-specifying the procedure. Writing a five-step script for a problem the model would have solved better in three steps it chose itself. Cure: state the goal and constraints; let the model find the path.
  • Setting the depth via prose instead of via API. "Please think very carefully about this." The dial does what the phrase pretends to. Cure: set reasoning_effort, budget_tokens, or the equivalent on the request, and keep the prompt focused on the brief.
  • Reasoning on tasks that do not benefit. Classification, format conversion, short Q&A, simple recall — every one of these is overhead on a reasoning model and waste on the bill. Cure: route by task complexity; default to standard models and escalate only when the task warrants deliberation.
  • Treating coherence as correctness. Accepting beautiful-sounding wrong answers because the prose reads competent. Cure: evaluate slot-by-slot against the brief, run high-stakes outputs through llm-as-judge rubrics, layer self-critique loops on shipping work.
  • Ignoring the cost shape. Enabling extended thinking on every route, leaving stable system prompts uncached on Opus, running max-effort traffic through routes that do not need it. The bill grows faster than the quality. Cure: enable per-route, cache stable prefixes, measure lift not just cost, and route to the cheapest tier that meets the bar on each task class.

Our Position

Six opinionated stances we hold on 2026 reasoning-model prompting.

  • Pick the right reasoning tier per task, not per project. o3 for general hard reasoning. o4-mini for STEM and code at o3-class quality for less. Claude Opus 4.7 for long-context dense work. Sonnet 4.6 for the daily-driver tier below it. Gemini Deep Think for parallel exploration and multimodal reasoning. DeepSeek R1 for cost-sensitive transparent-trace work. Qwen QwQ and other open-weights for self-hosted requirements. Project-level single-model choices leave quality and cost on the table.
  • State the goal, not the procedure. Always. The most reliable single move you can make on reasoning prompts. Constraints scope the answer; procedures script the path. Reasoning models reward the first and underperform on the second.
  • Set depth via API parameter, not via prose. reasoning_effort, budget_tokens, effort level, Deep Think toggle. The dial is the dial. Phrases that pretend to move the dial just consume the input budget.
  • Most tasks should not use a reasoning model. Default to standard models. Escalate only when the task warrants deliberation. Cascade by complexity. The cost of routing is much lower than the cost of routing wrong in either direction.
  • Coherent prose is not correct content. Evaluate against the brief, not the vibe. Run high-stakes outputs through llm-as-judge rubrics. Layer self-critique loops on shipping work. The most dangerous failure mode in this category is the beautiful-sounding wrong answer; the evaluation discipline has to be sharper than it was with shallower models, not looser.
  • Reasoning models are components, not replacements. They fit inside agentic loops, on top of RAG pipelines, behind cascade routers — as the part of the stack that thinks before answering. Treating them as a drop-in replacement for everything else pays the cost without capturing the structural benefit.

What's Next: From Reasoning Models to Reasoning Agents

The frontier is moving from single-call reasoning to multi-call reasoning agents. Claude's interleaved thinking — reasoning between tool calls, not just before the first response — is the early version of what becomes default behavior. o3 and o4-mini are increasingly used as the planner and reflector inside agentic loops where most of the per-step work is handled by faster components. Gemini Deep Think paired with search grounding is the early version of an agent that researches, deliberates, and answers in one continuous flow. DeepSeek R1 self-hosted as the reasoning core of a custom agent stack is increasingly common in cost-sensitive production deployments. The single-shot reasoning prompt is becoming the inside of a loop, not the whole interaction.

The skill that compounds: clean reasoning prompts at the inner level make agentic stacks work; messy reasoning prompts compound failures across every iteration of the loop. The discipline scales — what you learn from writing a strong six-slot brief for a single Claude extended-thinking call is what you reuse, ten times, inside an agent that calls Claude ten times across a multi-step workflow. For the agent-side architecture, see the AI agents prompting guide, the agentic prompt stack, the agentic prompt stack research-agent walkthrough, and the agentic RAG walkthrough. For the broader discipline this all sits inside, the context engineering pillar and the Context Engineering Maturity Model. For the two sister pillars in this Phase 3 series, the AI image prompting guide and the AI video prompting guide.


Reasoning-model prompting in 2026 is a brief-writing discipline with a depth-control dial on the side and an evaluation discipline at the end. Pick the right tier for the task. State the goal, not the procedure. Frontload constraints, context, and success criteria. Translate into the model's dialect. Set the budget on the request. Evaluate the answer against the brief, not the vibe. The single-shot reasoning prompt that gets you lucky on the first try is memorable. The repeatable reasoning workflow that ships a correct answer the third time, every time, on the cases you actually need to solve — that is what scales.

Try it yourself

Build expert-level prompts from plain English with SurePrompts — 350+ templates with real-time preview.

Open Prompt Builder

Get ready-made Claude prompts

Browse our curated Claude prompt library — tested templates you can use right away, no prompt engineering required.

Browse Claude Prompts