
Least-to-Most Prompting: A Worked Example for Compositional Tasks

Least-to-Most decomposes a hard problem into easier sub-problems, solves them in order, and uses each result as input to the next. This tutorial walks through it end to end on a compositional reasoning task.

SurePrompts Team
April 22, 2026
10 min read

TL;DR

Least-to-Most prompting breaks a complex task into a sequence of easier sub-problems, solved in order with each result feeding the next — particularly effective on compositional tasks where the solution has clear prerequisite structure.

Least-to-Most decomposes a hard problem into an ordered sequence of easier sub-problems, then solves them one at a time with earlier answers feeding the later ones. Zhou et al. introduced it in 2022 in "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models." The paper's motivating observation: chain-of-thought sometimes generalises badly from short reasoning chains to long ones. Force the composition to be explicit — plan first, solve in prerequisite order — and the composition gap narrows.

Tip

Least-to-Most is for tasks with visible prerequisite structure. If you can sketch the sub-problems on paper before you start solving, Least-to-Most will usually beat a single-shot prompt or a free-form CoT trace. If you cannot sketch them, reach for self-ask prompting or a ReAct-style loop instead.

Key Takeaways

  • Two phases: decompose into ordered sub-problems, then solve in order with prior answers as context for later steps.
  • The decomposition itself is a prompt output — grade it, edit it, retry it before any solving happens.
  • Single-prompt Least-to-Most is convenient; multi-prompt Least-to-Most is auditable and where most of the reliability comes from.
  • Best fit: compositional tasks with clear prerequisite ordering — math word problems, migrations, multi-step code refactors, ordered research syntheses.
  • Main failure mode is compounding error: a wrong sub-answer early poisons every later step because later steps trust it.
  • On reasoning models, Least-to-Most is usually redundant; keep it for non-reasoning chat models or when you want the sub-problem list as a reviewable artifact.

Why Least-to-Most Exists

Compositional tasks punish single-shot prompts. Ask a model to refactor a pipeline with four ordered migrations and the common failure is not ignorance of any one migration — it is a prompt that collapses the dependencies. The model guesses at migration three before migration two has been written; the guess is internally consistent but inconsistent with what the previous step was supposed to produce.

Chain-of-thought helps but not always enough. A free-form trace lets the model reason step by step, but there is no structural guarantee that step n precedes step n+1 — the model can skip, re-order, or bundle steps. Least-to-Most adds that guarantee by making the step list an explicit artifact produced before any solving happens. See chain-of-thought prompting for the weaker, unstructured cousin.

The Pattern

Two phases. They do not overlap.

Phase 1 — Decomposition. Prompt the model to list the sub-problems in prerequisite order, smallest first. The output is just the list; no solving yet.

```text
Problem: {problem}

List the sub-problems needed to solve this, in prerequisite order.
Start with the smallest/earliest sub-problem; each later sub-problem
may use answers from earlier sub-problems but not the reverse.
Output only the ordered list.
```

Phase 2 — Solving. For each sub-problem in the list, solve it with the answers to all previous sub-problems available as context. In a single-prompt version this is one long response. In a multi-prompt version each sub-problem is its own call.

```text
Problem: {problem}
Sub-problem plan:
1. {sub_1}
2. {sub_2}
...

Answers so far:
Sub-problem 1 answer: {a_1}
...
Sub-problem {k-1} answer: {a_{k-1}}

Solve sub-problem {k}:
```

The scaffold is thin on purpose. What carries the pattern is the ordering: each solving call faces a strictly smaller task than a single-shot prompt, because the full problem is replaced by one sub-problem plus the answers that come before it. The trade: more calls, smaller calls, an explicit prerequisite chain.
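The two templates can be wired into a multi-prompt pipeline. A minimal sketch in Python, where `complete` is a placeholder for any prompt-in, text-out model call (the list-parsing logic is an illustrative assumption; adapt it to your model's output format):

```python
from typing import Callable

def least_to_most(problem: str, complete: Callable[[str], str]) -> list[str]:
    """Run a two-phase Least-to-Most pipeline: decompose, then solve in order."""
    # Phase 1: decomposition. The output is the plan, nothing else.
    plan_text = complete(
        f"Problem: {problem}\n\n"
        "List the sub-problems needed to solve this, in prerequisite order.\n"
        "Start with the smallest/earliest sub-problem. Output only the "
        "ordered list, one numbered sub-problem per line."
    )
    # Parse numbered lines ("1. ...", "2. ...") into a list of sub-problems.
    subproblems = [
        line.split(".", 1)[1].strip()
        for line in plan_text.splitlines()
        if line.strip() and line.strip()[0].isdigit()
    ]

    # Phase 2: solve in order, feeding every earlier answer forward.
    answers: list[str] = []
    for k, sub in enumerate(subproblems, start=1):
        context = "\n".join(
            f"Sub-problem {i} answer: {a}" for i, a in enumerate(answers, 1)
        )
        answers.append(complete(
            f"Problem: {problem}\n"
            f"Sub-problem plan:\n{plan_text}\n\n"
            f"Answers so far:\n{context or '(none yet)'}\n\n"
            f"Solve sub-problem {k}: {sub}"
        ))
    return answers
```

Reading the plan between the two phases (and editing or retrying it) is the reliability lever the rest of this post argues for.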

Worked Example

A hypothetical legacy data pipeline refactor. A team has an overnight ETL loading CSV files into a MySQL reporting database. They want to migrate it to a streaming Kafka-to-Postgres pipeline with the warehouse reports working throughout. Names and details are illustrative.

Phase 1 — Decomposition output

```text
Problem: Migrate the overnight CSV->MySQL ETL to a streaming
Kafka->Postgres pipeline, keeping all warehouse reports working
throughout the migration. The reports are business-critical and
cannot be down for more than 15 minutes in any window.

Sub-problem plan (ordered, smallest first):
1. Inventory every table the current ETL writes, and every report
   that reads from each table.
2. Pick the target Postgres schema — data types, partition strategy,
   primary keys, and the mapping from current MySQL columns.
3. Dual-write: extend the existing ETL to write each row into both
   MySQL and the new Postgres schema, so both stores stay in sync.
4. Backfill Postgres from MySQL for the historical window the
   reports depend on, and verify row counts match.
5. Migrate one low-risk report to read from Postgres and run it in
   shadow mode alongside the MySQL version for one week.
6. Migrate the remaining reports table-by-table as shadow runs agree;
   keep dual-write active until the last report is migrated.
7. Replace the batch CSV ingestion with the Kafka stream, pointed
   at the Postgres writer only; decommission MySQL.
```

Notice the prerequisite structure. Sub-problem 4 cannot run until 3 is live because backfill assumes both stores exist. Sub-problem 5 requires 4 because shadowing needs Postgres populated. Sub-problem 7 is last because it is irreversible. The decomposition encodes the "no report down for more than 15 minutes" constraint as an ordering rule: every switch is shadowed before it is primary.

Phase 2 — Solving, in order

At this point each sub-problem becomes its own call with the earlier answers in context. For the tutorial, we show two of the seven calls to illustrate the shape.

```text
Sub-problem 1: Inventory every table the ETL writes and every
report that reads from it.

Answer 1: Eight tables — orders, order_items, customers, products,
inventory, shipments, returns, sessions. Five reports: Daily Sales
(orders, order_items, products), Customer Cohorts (customers,
sessions, orders), Inventory Aging (inventory, products), Returns
Analysis (returns, orders, order_items), Fulfillment SLA
(shipments, orders). Overlapping reads on orders and order_items.
```

```text
Sub-problem 3: Dual-write the ETL to both MySQL and Postgres.

Given answer 1 (8 tables, 5 reports) and answer 2 (Postgres
schema with partitioning on orders and shipments, BIGINT primary
keys, timestamptz for time fields):

Answer 3: Extend the ETL with a second writer targeting Postgres
using the mapping from sub-problem 2. Wrap both writes in one
transactional boundary per batch; on Postgres failure, log and
alert but do not roll back MySQL — MySQL stays authoritative
until sub-problem 6 completes. Add a checksum table with per-batch
row counts and a hash of the primary-key list, written to both
stores; a nightly reconciliation job compares and flags divergence.
Start dual-write on the two smallest tables (returns, sessions);
add the rest on a rolling schedule if reconciliation stays clean.
```

Each later sub-problem quotes earlier answers verbatim. Sub-problem 3 uses the table count and schema mapping from 1 and 2; sub-problem 4 uses the dual-write contract from 3; sub-problem 5 uses the row-count guarantee from 4. The context window for each call is the original problem, the ordered sub-problem list, and answers to prior sub-problems only.

Assembled result

The final output is the decomposition plus the seven sub-answers, plus a short synthesis naming the exit criteria: all five reports migrated, reconciliation clean for two weeks, MySQL writers disabled, old CSV job decommissioned. Because each sub-answer was produced with earlier answers in context, the synthesis is mostly mechanical — every ordering decision is already encoded.

Scoring with the quality rubric

Run the assembled result through the SurePrompts Quality Rubric. The rubric surfaces three things Least-to-Most should get right: correctness of the decomposition (ordering respects prerequisites), faithfulness of each sub-answer to its prior context, and composition (the final result follows from the sub-answers). Decomposition is the hardest to grade — the seven steps must cover the full problem and respect the no-downtime constraint. A weak plan like "1. Stand up Postgres, 2. Move all reports, 3. Delete MySQL" would score low on prerequisite ordering and never recover. Grade the list first; do not waste tokens solving a bad plan.
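Holding the three scores separately makes the "grade the list first" rule enforceable in code. A hypothetical structure for illustration (this is not the actual SurePrompts rubric, just the three-axis shape it implies):

```python
from dataclasses import dataclass

@dataclass
class LtmScore:
    decomposition: float  # prerequisite ordering, granularity, coverage
    faithfulness: float   # each sub-answer vs. the prior context it was given
    composition: float    # whether the final result follows from the sub-answers

    def gate(self, threshold: float = 0.7) -> bool:
        # Gate on the plan alone: a bad decomposition makes the other
        # two scores meaningless, so check it before any solving spend.
        return self.decomposition >= threshold
```

A run with `decomposition=0.4` fails the gate no matter how well the sub-answers score, which is exactly the "do not waste tokens solving a bad plan" rule.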

Least-to-Most vs. Chain-of-Thought vs. Self-Ask vs. Plan-and-Execute

Four reasoning scaffolds that look similar on paper and diverge in practice.

| Pattern | Shape | Best for | Artifact to grade |
| --- | --- | --- | --- |
| Chain-of-thought | Free-form step-by-step trace, one pass | Reasoning that does not decompose cleanly | The final trace |
| Least-to-Most | Ordered sub-problems, then solve in order | Compositional tasks with prerequisite structure | The sub-problem list, then each solution |
| Self-Ask | Incremental follow-ups answered as they arise | Multi-hop Q&A, questions with unknown hop count | Each follow-up and its answer |
| Plan-and-execute | Plan once, execute each step with optional tools | Agentic tasks where planning is cheaper than reacting | The plan, then each execution step |

Least-to-Most and Plan-and-Execute are siblings — both commit to a plan upfront. The difference: Least-to-Most sub-problems are usually pure reasoning, plan-and-execute steps are often tool-using actions. Self-Ask is reactive and does not commit to a list before it starts. Chain-of-thought is the weakest structural commitment — no list, no ordering. Use Least-to-Most when you can see the sub-problems from the top; Self-Ask when you cannot. See the agentic prompt stack for how these layer.

Failure Modes

Wrong decomposition order. The plan lists B before A even though B depends on A. Solving fails outright or silently assumes inputs that do not yet exist. Fix: ask the model to state each sub-problem's prerequisites, then verify the ordering respects them.
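That ordering check can be mechanical. A minimal sketch, assuming you have parsed the model's stated prerequisites into a mapping from each step's position in the plan to the positions it depends on (the mapping itself is a hypothetical intermediate, not something the pattern prescribes):

```python
def ordering_violations(prereqs: dict[int, list[int]]) -> list[tuple[int, int]]:
    """Return every (step, prerequisite) pair where the prerequisite
    appears at the same position or later than the step that needs it.
    Positions are 1-based indices into the ordered sub-problem plan."""
    return [
        (step, dep)
        for step, deps in prereqs.items()
        for dep in deps
        if dep >= step  # a prerequisite must appear strictly earlier
    ]
```

An empty result means the plan respects its own declared prerequisites; any violation is a reason to regenerate the decomposition before solving.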

Sub-task drift. The first three sub-answers stay on-problem; by sub-answer five the model is solving something adjacent. Fix: include the original problem statement and the current sub-problem's exact wording in every solving call. Do not rely on the model to remember the plan across six calls.

Compounding error. Sub-problem 2 is 90% right — one number off. Sub-problem 3 uses sub-answer 2 verbatim; by sub-answer 7 the error has travelled through five steps and looks authoritative. Fix: add a verification pass between phases that checks each sub-answer against the original problem before it enters the next step's context.
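One way to wire that verification pass, again with `complete` standing in for any prompt-in, text-out model call (the PASS/FAIL protocol is an illustrative convention, not a fixed API):

```python
from typing import Callable

def solve_with_verification(
    problem: str,
    subproblems: list[str],
    complete: Callable[[str], str],
    max_retries: int = 1,
) -> list[str]:
    """Solve sub-problems in order, gating each answer with a separate
    verification call against the ORIGINAL problem before it enters
    the context of any later step."""
    answers: list[str] = []
    for k, sub in enumerate(subproblems, start=1):
        context = "\n".join(
            f"Answer {i}: {a}" for i, a in enumerate(answers, 1)
        )
        for attempt in range(max_retries + 1):
            answer = complete(
                f"Problem: {problem}\nAnswers so far:\n{context}\n"
                f"Solve sub-problem {k}: {sub}"
            )
            verdict = complete(
                f"Problem: {problem}\nProposed answer to '{sub}': {answer}\n"
                "Is this answer consistent with the problem statement? "
                "Reply PASS or FAIL with a one-line reason."
            )
            if verdict.strip().upper().startswith("PASS"):
                break
        answers.append(answer)  # best attempt proceeds either way
    return answers
```

The verifier call sees only the original problem and one proposed answer, so a wrong sub-answer is checked against ground it cannot have poisoned yet.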

Model skipping dependencies. On a single-prompt run the model solves sub-problem 5 before 4, or bundles 4 and 5. Fix: split into a multi-prompt pipeline where each sub-problem is a separate call. This is where prompt chaining earns its keep — a chained Least-to-Most run is strictly more reliable, at the cost of tokens and latency.

Our Position

Decompose before you solve. Most of the lift is in the decomposition pass, not the solving pass. Running decomposition as its own call, reading the output, and correcting a bad plan before any solving happens is where the reliability compounds. A bad plan solved perfectly is still a bad answer.

Prefer multi-prompt Least-to-Most on anything that matters. Single-prompt is fine for demos; per-step calls are the whole point of the pattern. When a run fails, you want to know which hop failed, not scan a 2,000-token response for the break.

Skip Least-to-Most on reasoning models. Claude's extended thinking, o-series, and Gemini thinking already decompose internally. Layering on top burns tokens on a structure the model has already produced invisibly. Keep it for non-reasoning chat models and pipelines where the sub-problem list must exist as a reviewable artifact.

Grade the plan separately from the solutions. Score decomposition on prerequisite correctness, granularity, and coverage; solutions on factuality given prior context; composition on whether it follows. Three scores, not one. The same logic applies to self-ask prompting.

Do not reach for Least-to-Most when the task is not compositional. Atomic questions and open-ended exploration do not benefit — forcing a plan on them produces a plan that is brittle by step two. The pattern earns its keep on problems with visible prerequisite structure.

Neighbouring scaffolds: chain-of-thought prompting (free-form cousin), self-ask prompting (reactive cousin), plan-and-execute prompting (agentic cousin), and prompt chaining guide for running Least-to-Most as a pipeline. For reasoning-model trade-offs see prompting reasoning models. For production layering see the agentic prompt stack and advanced prompt engineering techniques. For evaluation apply the SurePrompts Quality Rubric. Glossary: least-to-most prompting, chain-of-thought, self-ask prompting, prompt chaining, plan-and-execute, reasoning model.

Try it yourself

Build expert-level prompts from plain English with SurePrompts — 350+ templates with real-time preview.

Open Prompt Builder
