
Self-Ask Prompting: A Guide to Decomposing Multi-Hop Questions

Self-Ask prompting makes the model ask and answer its own sub-questions before producing the final answer. Demonstrated on multi-hop reasoning and research-assistant tasks with concrete prompt templates.

SurePrompts Team
April 22, 2026
10 min read

TL;DR

Self-Ask prompting has the model generate sub-questions, answer each, then compose the final answer. Particularly effective for multi-hop questions where the answer depends on intermediate facts.

Self-Ask is a prompting pattern where the model explicitly writes sub-questions and answers them before producing the final response. It was introduced by Press et al. in the 2022 paper "Measuring and Narrowing the Compositionality Gap in Language Models," which observed that models often know each atomic fact required by a multi-hop question but fail to compose them when asked directly. The fix is a scaffold that forces decomposition. The pattern is simple, but it closes much of the gap on compositional reasoning.

Key Takeaways

  • Self-Ask is the explicit, structured cousin of chain-of-thought — sub-questions instead of a free-form trace.
  • It wins on multi-hop, compositional questions and adds little on atomic ones.
  • Wiring each sub-question to a search or retrieval tool turns Self-Ask into a minimal research agent.
  • Reasoning models already decompose internally, so Self-Ask is mostly for non-reasoning chat models and for pipelines that need auditable sub-questions as artifacts.
  • The common failure is compositional error — every sub-answer correct, but the final composition wrong — so evaluate the final step separately.

The Problem Self-Ask Solves

Ask a plain chat model: "Who was the president of the country that won the 1998 FIFA World Cup?" Two hops — which country won in 1998, and who was its president at that time. The model might answer correctly. It might also collapse the hops into a single confident guess and name the wrong person. The failure is not ignorance; Press et al. showed that models often know each fact in isolation and still fail to compose them.

This is the compositionality gap. The knowledge is there, but nothing in a naive prompt forces the model to lay out the intermediate steps. Chain-of-thought helps — "let's think step by step" gets you part of the way — but the trace is still free-form, and the model can skip from premise to conclusion without ever writing down the intermediate fact.

Self-Ask forces the decomposition. The scaffold makes each hop an explicit sub-question; the final answer cannot appear until every sub-question has an intermediate answer. That structural requirement closes much of the gap.

The Pattern

The literal scaffold is short. Drop this above the user's question:

code
Question: {question}
Are follow up questions needed here: Yes/No

If yes, format each step as:
Follow up: {sub-question}
Intermediate answer: {answer}

Repeat until you have enough to answer the original question.
Then write:
So the final answer is: {answer}

The model decides whether the question is compositional. If it is, it generates a sub-question, answers it, and either continues with another or stops and composes the final answer. The scaffold is deliberately thin — it does not prescribe the content of the sub-questions, just the shape of the trace.

Press et al. used a few-shot variant, seeding the model with two or three worked examples. In 2026 most frontier chat models follow the zero-shot version reliably, but if you see drift — the model skipping straight to "So the final answer is" without any follow-ups — adding one worked example usually fixes it. What you are encoding is: before you answer, ask what you would need to know, then answer that first.
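Wiring the scaffold into a prompt is a string-templating exercise. A minimal sketch, assuming a caller who supplies the question and, optionally, worked few-shot examples as plain strings (the function name is ours, not from the paper):

```python
# The literal Self-Ask scaffold. Inner braces are doubled so that
# str.format only substitutes {question} and leaves the model-facing
# placeholders ({sub-question}, {answer}) intact.
SELF_ASK_SCAFFOLD = """Question: {question}
Are follow up questions needed here: Yes/No

If yes, format each step as:
Follow up: {{sub-question}}
Intermediate answer: {{answer}}

Repeat until you have enough to answer the original question.
Then write:
So the final answer is: {{answer}}"""


def build_self_ask_prompt(question, examples=None):
    """Prepend optional few-shot worked examples, then the scaffold."""
    parts = list(examples or [])
    parts.append(SELF_ASK_SCAFFOLD.format(question=question))
    return "\n\n".join(parts)
```

To move from zero-shot to few-shot, pass one or two complete worked traces in `examples`; the scaffold itself stays unchanged.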

Worked Example — Pure Self-Ask

Here is an illustrative trace on a hypothetical business question. The scenario, data, and names are made up; the point is to show the shape of the pattern.

code
Question: Which of our top 3 2025 customers by revenue
has a renewal coming up before June 2026?

Are follow up questions needed here: Yes.

Follow up: Who were our top 3 customers by revenue in 2025?
Intermediate answer: Acme Logistics, Brightwave Media, and Cerulean
Health were the top three customers by 2025 revenue.

Follow up: When is each of those three customers' contracts up
for renewal?
Intermediate answer: Acme Logistics renews in September 2026.
Brightwave Media renews in April 2026. Cerulean Health renews in
August 2026.

Follow up: Which of those renewal dates falls before June 2026?
Intermediate answer: Only Brightwave Media — April 2026 is before
June 2026. The other two renew later in 2026.

So the final answer is: Brightwave Media, with renewal in April 2026.

Three things to notice. First, the sub-questions are narrower than the original — each asks for one piece of information. Second, the final answer is a composition step, not a lookup. Third, the trace is auditable. If Brightwave's renewal is actually in July 2026, you can see exactly which sub-answer was wrong and fix that hop rather than re-running the whole prompt. A single-prompt version would force the model to hold three customer names, three renewal dates, and a comparison rule in one pass; on a noisy day it collapses hops and names the wrong customer with full confidence.
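Because the trace has a fixed shape, the audit can be mechanical. A sketch of a parser that splits a trace into (sub-question, intermediate answer) hops plus the final answer — the regexes assume the model followed the scaffold's exact phrasing:

```python
import re

# Each hop runs from "Follow up:" through its "Intermediate answer:",
# ending where the next hop or the final-answer line begins.
HOP_RE = re.compile(
    r"Follow up:\s*(.*?)\s*Intermediate answer:\s*(.*?)\s*"
    r"(?=Follow up:|So the final answer is:)",
    flags=re.DOTALL,
)
FINAL_RE = re.compile(r"So the final answer is:\s*(.*)", flags=re.DOTALL)


def parse_self_ask_trace(trace):
    """Return ([(sub_question, intermediate_answer), ...], final_answer)."""
    hops = HOP_RE.findall(trace)
    final = FINAL_RE.search(trace)
    return hops, (final.group(1).strip() if final else None)
```

With the trace parsed, "fix that hop" becomes concrete: locate the bad (sub-question, answer) pair, re-run only that step, and re-compose.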

Worked Example — Self-Ask Plus a Search Tool

The pattern is most useful when each sub-question can trigger retrieval. The scaffold stays the same; you add a rule that says every Follow up goes through the search tool before the Intermediate answer is written.

code
Question: Which author of a New York Times bestseller in March 2026
previously won a National Book Award?

Are follow up questions needed here: Yes.

Follow up: Which books were on the NYT bestseller list in March 2026?
[search("NYT bestseller list March 2026")]
Intermediate answer: The top five titles in March 2026 were
[hypothetical list of titles and authors].

Follow up: Of the authors on that list, have any won the
National Book Award?
[search("National Book Award winners authors")]
Intermediate answer: [Hypothetical Author X] won the National Book
Award in 2014 for [Hypothetical Title].

So the final answer is: [Hypothetical Author X], whose March 2026
bestseller was [Hypothetical Title] and who previously won the
National Book Award in 2014.

This is a degenerate research agent — less flexible than a full ReAct loop, but easier to audit and cheaper to run. The model cannot improvise new tools or re-plan mid-trace; it can only ask the next sub-question and search for the answer. That rigidity is a feature on well-scoped research tasks.

The same shape generalises to RAG pipelines. Each Follow up becomes a retrieval query; each Intermediate answer is written from the retrieved passages. Our RAG prompt engineering guide covers the retrieval side; Self-Ask supplies the reasoning structure that decides what to retrieve.
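The search-augmented loop is small enough to sketch end to end. In this sketch, `generate` (the model call) and `search` (the retrieval tool) are caller-supplied placeholders, not a real API, and — as in Press et al.'s search variant — the search result is inserted directly as the intermediate answer:

```python
import re


def self_ask_with_search(question, generate, search, max_hops=5):
    """Run a Self-Ask loop where every sub-question goes through `search`.

    `generate(prompt)` returns the model's next scaffold line(s);
    `search(query)` returns a text snippet. Both are assumptions here.
    """
    prompt = f"Question: {question}\nAre follow up questions needed here: Yes.\n"
    for _ in range(max_hops):
        step = generate(prompt)
        final = re.search(r"So the final answer is:\s*(.*)", step)
        if final:
            return final.group(1).strip()
        sub_q = re.search(r"Follow up:\s*(.*)", step)
        if not sub_q:
            break  # model drifted off the scaffold; bail out
        evidence = search(sub_q.group(1).strip())
        # Feed the retrieved snippet back as the intermediate answer,
        # then let the model decide: another hop, or compose.
        prompt += f"{step}\nIntermediate answer: {evidence}\n"
    return None
```

The `max_hops` cap is the whole control flow; there is no planner and no re-planning, which is exactly the rigidity-as-feature point above.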

When to Use Self-Ask vs. ReAct vs. Chain-of-Thought

All three are reasoning scaffolds; they differ in structure and in what they assume about the environment.

| Pattern | Best for | Structure | Tool use |
| --- | --- | --- | --- |
| Chain-of-thought | Reasoning that does not decompose into discrete sub-questions | Free-form step-by-step trace | None |
| Self-Ask | Compositional, multi-hop questions with clear sub-questions | Explicit Follow up / Intermediate answer pairs | Optional per sub-question |
| ReAct | Open-ended agentic tasks where the environment surprises you | Interleaved Thought / Action / Observation | Core to the pattern |

Use chain-of-thought when the question needs reasoning but does not break cleanly into sub-questions — most math word problems, logic puzzles, "explain the tradeoff" prompts. Use Self-Ask when the question is compositional and you can imagine the sub-questions you would ask on paper. Use ReAct when the path is not knowable up front and each observation changes what you do next.

Self-Ask and ReAct look similar when both are paired with search, but the shape differs. Self-Ask decides sub-questions from the original question; ReAct decides each next action from the last observation. Self-Ask is planning-flavoured, ReAct is reactive. For predictable sub-questions (product comparisons, fact-checking, structured lookups), Self-Ask is lighter. Prompt chaining can wrap a Self-Ask step inside a larger pipeline — one stage decomposes, the next scores the sub-answers, the next composes the final response. See the agentic prompt stack for how these scaffolds layer.

Failure Modes

The model refuses to decompose. It answers "Are follow up questions needed here: No" on a question that obviously needs them, then produces a confident single-hop guess. Fix: add one or two few-shot examples showing decomposition, or tighten the instruction to "Assume follow up questions are needed unless the question is a single atomic fact."

Sub-questions drift from the original. The first follow-up is on topic, the second veers into tangential territory, the third answers a different question entirely. Fix: include the original question in the scaffold for every hop, and instruct the model to state how each sub-question relates back.

Compositional error — right hops, wrong composition. Every intermediate answer is correct, and the final "So the final answer is" line names something that does not follow. The sneakiest failure because the trace looks clean. Fix: add a penultimate hop that restates the intermediate answers and the composition rule before writing the final answer.

Over-decomposition on atomic questions. "What is the capital of France?" emits three pointless sub-questions before Paris. Fix: do not use Self-Ask when you know the question is atomic; or trust the Yes/No gate to route simple questions directly.
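Several of these failures can be caught mechanically, before any LLM-as-Judge step. A sketch of cheap structural checks on a trace, assuming the scaffold's exact phrasing:

```python
import re


def check_trace(trace, min_hops=1):
    """Return a list of structural problems found in a Self-Ask trace."""
    problems = []
    hops = re.findall(r"Follow up:", trace)
    answers = re.findall(r"Intermediate answer:", trace)
    if len(hops) < min_hops:
        # Refusal to decompose: model skipped straight to an answer.
        problems.append("no decomposition")
    if len(hops) != len(answers):
        # A Follow up with no Intermediate answer (or vice versa).
        problems.append("unpaired hop")
    if "So the final answer is:" not in trace:
        # Trace never reached the composition step.
        problems.append("missing final answer")
    return problems
```

These checks gate the expensive evaluation: only traces that pass go to a judge model for factuality and composition scoring.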

Score each sub-question and intermediate answer against the original, then score the final composition as a separate criterion. The SurePrompts Quality Rubric covers factuality, relevance of sub-questions, and whether the final answer follows from the trace. LLM-as-Judge works well here because the decomposed trace is easier to grade than a free-form CoT.

Our Position

Self-Ask is underrated for non-agentic pipelines. Most 2026 discussion of multi-step reasoning jumps straight to ReAct or full agents, which are heavier than many tasks need. For predictable hops with no mid-flight adaptation, Self-Ask is a tenth of the code and gives you an auditable trace.

Do not use Self-Ask on reasoning models. Claude's extended thinking, o-series, and Gemini thinking already decompose; stacking Self-Ask on top is redundant. Save it for non-reasoning chat models and for pipelines where you want the sub-questions as artifacts.

Pair Self-Ask with retrieval before reaching for an agent. Much of what teams build agents for — "look up these three facts and compose an answer" — is a Self-Ask trace with a search tool. Start narrow, graduate to an agent only when the narrow pattern breaks.

Grade composition separately from hops. "Final answer correct: yes/no" misses the class of failures where the trace is right and the composition is wrong. Keep the scaffold thin — resist "Follow up category," "Confidence score," "Source citation" on every step until you have evidence the plain version is underperforming.

For the reasoning patterns that sit next to Self-Ask: chain-of-thought prompting (the free-form cousin), ReAct prompting guide (the agentic cousin), and prompt chaining guide (how to compose Self-Ask steps into larger pipelines). For retrieval-flavoured Self-Ask see the RAG prompt engineering guide. For layering patterns into production systems see the agentic prompt stack and advanced prompt engineering techniques. For evaluating Self-Ask outputs use the SurePrompts Quality Rubric. If you are new to the broader space, start with prompt engineering basics 2026. And browse the glossary for compact definitions of self-ask prompting, chain-of-thought, ReAct prompting, prompt chaining, chain-of-verification, and RAG.

Try it yourself

Build expert-level prompts from plain English with SurePrompts — 350+ templates with real-time preview.

Open Prompt Builder
