Extended thinking changed how Claude prompts should be written. The model now has an explicit, separately-budgeted reasoning phase that runs before the visible output — you can influence it with prompt structure, and you pay for it by the token. Used well, it lifts quality on problems that genuinely need deliberation. Used carelessly, it burns budget on tasks a single forward pass would nail. This guide, under the context engineering pillar, covers what extended thinking is, when to reach for it, and how to shape prompts so the reasoning budget earns its keep.
What Extended Thinking Is
Extended thinking is a mode on recent Claude models where the model does a dedicated reasoning pass before producing its visible response. Two things matter about its design.
Separate token budget. When extended thinking is enabled, Claude generates a block of reasoning tokens — its "thinking trace" — and then a regular response. The thinking budget is a dial set on the request; the output budget is separate. That's different from older chain-of-thought prompting, where reasoning and answer came out of the same response budget and had to be rationed against each other.
A visible trace. Depending on API settings, you can read the thinking block, log it, and use it for debugging. The response is still the authoritative output — but the trace is an honest artifact you can inspect.
Mechanically, extended thinking is built-in chain-of-thought. The model was already going to reason; extended thinking gives it a budgeted place to do it and tells the sampler not to force an answer until that place is used.
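The two design points above reduce to a request parameter and a response block type. Here is a minimal sketch in Python, assuming the Anthropic Messages API shape; the model name, budget values, and mocked response content are illustrative, and no network call is made.

```python
# Sketch of a request with extended thinking enabled, plus a helper that
# separates the thinking trace from the visible response. Request shape
# follows the Anthropic Messages API; names and numbers are illustrative.

def build_request(system: str, user: str, thinking_budget: int,
                  max_tokens: int) -> dict:
    """Assemble request params. The thinking budget is its own dial,
    set alongside max_tokens on the request."""
    return {
        "model": "claude-sonnet-4-5",  # illustrative model name
        "max_tokens": max_tokens,
        "thinking": {
            "type": "enabled",
            "budget_tokens": thinking_budget,  # cap on the reasoning phase
        },
        "system": system,
        "messages": [{"role": "user", "content": user}],
    }

def split_blocks(content: list[dict]) -> tuple[str, str]:
    """Separate the thinking trace (a debugging artifact) from the
    authoritative text response."""
    trace = "".join(b.get("thinking", "")
                    for b in content if b["type"] == "thinking")
    text = "".join(b.get("text", "")
                   for b in content if b["type"] == "text")
    return trace, text

# A mocked response body, shaped like the API's content array:
mock_content = [
    {"type": "thinking", "thinking": "Check the edge cases first..."},
    {"type": "text", "text": '{"answer": 42}'},
]
trace, answer = split_blocks(mock_content)
```

The trace goes to logs; only `answer` reaches the user.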
When Extended Thinking Helps
Extended thinking pays off when the task genuinely needs deliberation — when a single forward pass is likely to skip a step the model would catch if given room to check itself.
- Math and symbolic reasoning. Multi-step algebra, probability, combinatorics — anything where intermediate results compound.
- Complex code and debugging. Following control flow through several functions, choosing between algorithmic approaches, reconciling type constraints across files.
- Multi-step planning. Task decomposition, project plans, migration sequences where early decisions constrain later ones.
- Careful analysis. Legal or policy review, paper evaluation, trade-off comparisons where the model must weigh several considerations before concluding.
- Structured outputs with internal constraints. Long JSON with totals that must sum or references that must resolve.
Common thread: accuracy scales with thought, and a worse-but-faster answer is genuinely worse.
When It Doesn't
Extended thinking is overhead when the task doesn't need it. The model will still produce a trace if you force one — short, generic, or padded — and you'll pay for the tokens.
- Direct recall. "Capital of Peru?" A single pass is right.
- Simple classification. Sentiment, category, boolean labels from short inputs.
- Format conversion. The model isn't reasoning; it's transforming.
- Short creative outputs. A headline, a subject line. Taste isn't deliberation.
- Extraction from short context. Three fields from a one-paragraph email.
- Latency-sensitive turns. Support, in-product assistants — time-to-first-visible-token matters.
Useful reflex: if you can't name the steps you'd expect the model to think through, it probably doesn't need extended thinking.
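That reflex can be turned into a crude per-route gate. A sketch, with hypothetical task labels and an illustrative budget; a real router would key off your own task taxonomy.

```python
# A crude per-route gate for extended thinking: default off, enabled
# only for routes where deliberation pays. Task labels and the budget
# value are hypothetical.

DELIBERATIVE_TASKS = {"math", "code_review", "planning", "analysis"}
FAST_TASKS = {"recall", "classification", "format_conversion", "extraction"}

def thinking_config(task_type: str, latency_sensitive: bool = False) -> dict:
    """Return the thinking setting for a request."""
    if latency_sensitive or task_type in FAST_TASKS:
        return {"type": "disabled"}
    if task_type in DELIBERATIVE_TASKS:
        return {"type": "enabled", "budget_tokens": 8000}  # illustrative
    return {"type": "disabled"}  # unknown routes default off
```

Note the default: an unrecognized route gets no thinking budget until measurement says otherwise.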
Prompt Structure That Uses Extended Thinking Well
Wording still matters — extended thinking doesn't replace prompt engineering, it changes what the prompt should look like.
Frame the problem as multi-step. Not "what's the answer" but "work through this problem, then answer." The model already reasons in the thinking phase; framing reinforces the expectation and makes the trace more useful.
Name the decision points. If the task has places where choices compound — pick an algorithm, decide between schemas, sequence steps — say so. "Consider at least two approaches before choosing." Naming the joints focuses the budget on what matters.
Leave the reasoning approach open. Don't hand-hold the thinking phase. "First think about X, then Y, then Z" turns the budget into a rigid script. Let the model find its own path; constrain only the output.
Constrain the output tightly. The visible response is where structure belongs: a schema, a format, a word count, an explicit list of fields. The thinking phase is free-form; the answer is not.
Put binding requirements at the end. The tail is load-bearing. Restating "respond only in the JSON schema above" keeps the constraint salient when the model emerges from thinking.
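The five rules above are mostly about ordering, which makes them easy to encode. A sketch of a prompt builder that frames the task, names the decision points, and restates the binding constraint last; all field names and strings here are illustrative.

```python
# Assemble a user prompt in the recommended order: task framing first,
# named decision points next, payload, and the binding output
# constraint restated at the tail. Strings are illustrative.

def build_user_prompt(task: str, decision_points: list[str],
                      payload: str, output_constraint: str) -> str:
    parts = [task]
    if decision_points:
        parts.append("Before answering, consider:")
        parts.extend(f"- {p}" for p in decision_points)
    parts.append(payload)
    # The tail is load-bearing: the binding requirement goes last.
    parts.append(output_constraint)
    return "\n".join(parts)

prompt = build_user_prompt(
    task="Review the following diff for correctness and concurrency safety.",
    decision_points=[
        "Are there race conditions?",
        "Performance implications under load?",
    ],
    payload="--- DIFF START ---\n[diff content here]\n--- DIFF END ---",
    output_constraint=("Output strictly as JSON matching the schema above. "
                       "No prose outside JSON."),
)
```

Note what the builder does not do: it never scripts the thinking phase itself.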
Prompt Structure That Wastes Extended Thinking
- Over-prescribing the steps. Dictating "think about step 1, then step 2, then step 3" converts the thinking phase into transcription. The model stops reasoning and starts narrating.
- Asking for simple recall. "Using extended thinking, tell me the year World War II ended" wastes the budget on a lookup.
- Trivial problems. Extended thinking on "generate three tweet ideas" produces three tweet ideas with a filler trace — and you still pay.
- Nested rubrics in the prompt body. Long embedded rubrics force the thinking phase to reproduce prompt structure rather than reason about content. Move rubrics to the system message.
- Treating the trace as an explanation. "Explain your reasoning" in the output is fine, but the thinking phase isn't an explanation generator. It's deliberation — not user-facing.
- Tight budgets on hard problems. Capping thinking too low truncates mid-reason and quality drops. Either give the budget room or don't enable thinking.
Token Economics: Budget vs. Quality
Extended thinking tokens are billed as output tokens: the trace is metered like the response, not like the prompt. That matters for the cost shape of your app.
- Thinking tokens scale with problem hardness. Routine problems produce short traces; hard ones use the full budget.
- Higher budget isn't linear quality. Beyond enough, extra budget yields little. The returns curve is steep then flat — you want enough, not maximum.
- Cache the static prefix. System prompts and stable context should be cached; once the fixed prefix is cheap, the thinking budget becomes the main variable cost worth tuning.
- Enable per-task, not per-app. Default off, flip on by route for tasks that benefit. Extended thinking on every turn overpays classification routes.
- Measure lift, not just cost. A 2x-cost turn that's meaningfully more accurate is worth it; one that matches baseline quality is not.
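The cost shape is easiest to see with back-of-envelope arithmetic. A sketch with placeholder prices, not current rates; the thinking rate is a parameter so you can match whatever your provider actually bills.

```python
# Back-of-envelope turn cost with and without extended thinking.
# Prices are placeholders (USD per million tokens), not current rates;
# set THINKING_PRICE to the rate your provider bills thinking at.

INPUT_PRICE = 3.00      # $/M prompt tokens (placeholder)
OUTPUT_PRICE = 15.00    # $/M visible-output tokens (placeholder)
THINKING_PRICE = 15.00  # $/M thinking tokens (placeholder)

def turn_cost(prompt_toks: int, thinking_toks: int, output_toks: int) -> float:
    """Total dollar cost of one turn."""
    return (prompt_toks * INPUT_PRICE
            + thinking_toks * THINKING_PRICE
            + output_toks * OUTPUT_PRICE) / 1_000_000

baseline = turn_cost(2_000, 0, 500)           # thinking off
with_thinking = turn_cost(2_000, 8_000, 500)  # same turn, 8k-token trace
```

Under these placeholder rates an 8k-token trace multiplies the turn cost severalfold, which is exactly why the lift has to be measured rather than assumed.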
For allocating spend across prompt components more broadly, see the token economics guide.
Extended Thinking vs. Plain Chain-of-Thought
| Dimension | Plain chain-of-thought prompt | Extended thinking mode |
|---|---|---|
| Where reasoning lives | In the response, before the answer | In a separate thinking block |
| Budget | Shared with output tokens | Separate, set per-request |
| User visibility | Mixed into output | Kept in a separate block; response stays clean |
| How it's triggered | "Let's think step by step" phrasing | Built-in; prompt doesn't have to induce it |
| Best use | Older models, quick deliberation | Problems that need real deliberation time |
The practical implication: on models with extended thinking, you don't need "let's think step by step" — the model already reasons in the thinking phase. Your prompt can focus on framing the problem and constraining the answer.
Relation to Self-Refine and Chain-of-Thought
Extended thinking sits next to — not instead of — the other reasoning patterns.
- Chain-of-thought is a prompting trick for older models. Extended thinking subsumes it by making the reasoning phase first-class.
- Self-refine is a generate-critique-revise loop across turns. Extended thinking happens within a single turn's reasoning phase; self-refine sits outside it. They compose: extended-thinking generate, extended-thinking critique, extended-thinking revise.
- Reflexion and ReAct are agentic patterns across multiple calls and tool uses. Extended thinking improves each call's deliberation; it doesn't replace the outer loop.
Mental model: extended thinking is a better single turn. Self-refine and reflexion are better sequences of turns.
For how prompt engineering and context engineering relate more broadly, see context engineering vs prompt engineering.
Example: Prompt That Uses Extended Thinking Well
Hypothetical but representative — a code-review prompt that benefits from deliberation.
[SYSTEM]
You are a senior backend engineer reviewing a pull request for correctness,
concurrency safety, and performance regressions. Respond only with JSON
matching this schema:
{
  "issues": [
    {
      "severity": "low"|"med"|"high",
      "category": "correctness"|"concurrency"|"performance"|"style",
      "file": string,
      "line": number,
      "summary": string,
      "suggested_fix": string
    }
  ],
  "overall_risk": "low"|"med"|"high",
  "merge_recommendation": "approve"|"request_changes"|"block"
}
[USER]
Review the following diff. The service handles thousands of concurrent
connections and uses a shared cache across goroutines.
Before answering, consider:
- Are there race conditions introduced by this change?
- Does the error handling match existing patterns in the codebase?
- Are there performance implications under the stated concurrency?
Do not be exhaustive — only flag issues a senior reviewer would raise.
If the change is safe and clean, return an empty "issues" array and
"merge_recommendation": "approve".
Output strictly as JSON matching the schema at the top. No prose outside JSON.
--- DIFF START ---
[diff content here]
--- DIFF END ---
The shape: system defines output format; the user message frames the problem as needing deliberation, names the decision points, leaves the reasoning approach open, and pins the output to a strict schema at the tail. The thinking budget is set on the request; the prompt doesn't mention it.
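The thinking phase doesn't enforce structure, so the response still needs a mechanical check. A sketch validator for the review schema above; it checks shape only, and the enum values are copied from the example, not from any real API.

```python
import json

# Minimal shape validator for the review JSON in the example above.
# Enum values mirror the example schema; content is not judged.

SEVERITIES = {"low", "med", "high"}
RECOMMENDATIONS = {"approve", "request_changes", "block"}

def validate_review(raw: str) -> bool:
    """Return True if raw parses and matches the expected shape."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if doc.get("overall_risk") not in SEVERITIES:
        return False
    if doc.get("merge_recommendation") not in RECOMMENDATIONS:
        return False
    issues = doc.get("issues")
    if not isinstance(issues, list):
        return False
    return all(isinstance(i, dict) and i.get("severity") in SEVERITIES
               for i in issues)

clean = '{"issues": [], "overall_risk": "low", "merge_recommendation": "approve"}'
```

A failed check is a retry signal, which is cheaper than shipping malformed output downstream.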
Common Anti-Patterns
- Extended thinking on every route. Classification and lookup routes don't need it. Pay per-use.
- Over-scripting the thinking phase. The budget is worth more when the model chooses its own sequence.
- No output schema. Extended thinking doesn't enforce structure; the output format does.
- Treating the trace as user-facing. The trace is for debugging; the response is for the user.
- Budget ceiling too low for the problem. Truncated thinking is worse than no thinking — the model emerges mid-reason and improvises.
- No measurement. Enabling extended thinking without A/B-ing quality and cost is shipping on vibes.
FAQ
When should I enable extended thinking?
On tasks where a worse-but-faster answer is genuinely worse — multi-step math, complex code, multi-step plans, careful analysis. Leave it off for recall, simple classification, format conversion, and latency-sensitive turns.
Do I still need "let's think step by step"?
No. On models with extended thinking, the model already reasons in the thinking phase. Induction phrases are redundant at best and can distort the thinking shape at worst. Focus the prompt on framing and output constraints.
Should I show the thinking trace to users?
Generally no. The trace is an internal artifact for debugging and logging; the response is what users should see. Mixing the trace into the UI invites misreading a deliberation step as the final answer.
How do I pick a thinking budget?
Start with the default for the task class, measure output quality, then tune. Hard problems benefit from more; light analysis saturates quickly. The right budget is "enough to not truncate mid-reason," not "the maximum allowed."
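"Enough, not maximum" can be made mechanical once you have eval numbers per budget. A sketch: pick the smallest budget whose measured quality sits within a tolerance of the best seen. The (budget, quality) pairs are fabricated for illustration.

```python
# Pick the smallest thinking budget whose eval quality is within
# `tolerance` of the best measured quality. Measurements here are
# fabricated for illustration.

def pick_budget(measurements: dict[int, float], tolerance: float = 0.01) -> int:
    """measurements maps budget_tokens -> eval quality in [0, 1]."""
    best = max(measurements.values())
    good_enough = [b for b, q in measurements.items() if q >= best - tolerance]
    return min(good_enough)  # enough, not maximum

measured = {2_000: 0.71, 4_000: 0.84, 8_000: 0.90, 16_000: 0.905, 32_000: 0.906}
budget = pick_budget(measured)  # picks 8_000 here: quality has saturated
```

The saturation point is the signal; budgets past it buy tokens, not accuracy.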
Does extended thinking replace self-refine or reflexion?
No. Extended thinking makes each turn's reasoning better. Self-refine and reflexion are loops across turns. Use extended thinking inside each turn of a self-refine loop when the task calls for deliberation in both generation and critique.
Wrap-Up
Extended thinking is a better single turn for problems that need one. It separates reasoning from response, gives deliberation its own budget, and makes "let's think step by step" unnecessary. The prompt skill shifts: frame as multi-step, name the decision points, leave the reasoning open, constrain the output tightly, restate binding requirements at the tail. Enable per-route. Measure the lift. The thinking trace is for you, not the user.
For the broader frame, the context engineering pillar. For how prompt engineering and context engineering relate, context engineering vs prompt engineering. For budgeting across prompt components, the token economics guide. For the outer loops that extended thinking fits inside, self-refine prompting. For the reasoning pattern extended thinking automates, chain of thought prompting.