Tip
TL;DR: Program-of-Thoughts (PoT) prompting asks the model to write a Python program that solves a numerical problem, then executes that program in an interpreter. The model reasons in code, the interpreter computes. On arithmetic-heavy tasks, this removes the LLM-as-calculator failure mode that even strong Chain-of-Thought prompts cannot fully fix. This post walks through a revenue-forecast example end to end.
Key takeaways:
- LLMs are unreliable calculators. PoT stops asking them to be calculators and starts asking them to be program authors — which they are good at.
- PoT without code execution is not PoT. If nothing runs the program, the model predicts the output and you have regressed below Chain-of-Thought.
- The core split: natural language for understanding the problem, code for computing the answer, natural language again for presenting the result.
- PoT dominates arithmetic, compound growth, iteration, and tabular math. It does not help conceptual reasoning where there is nothing to compute.
- Pair PoT with structured output for the final report — the code produces numbers, the schema produces a readable artifact.
- Score the whole flow with the SurePrompts Quality Rubric; the Output Validation dimension is where PoT earns its keep.
Why Program-of-Thoughts exists
Chain-of-thought prompting is a large step forward for reasoning tasks because it forces the model to decompose a problem instead of jumping to an answer. But CoT has a well-known ceiling on arithmetic: when the final step requires multiplying three decimals together, or compounding a growth rate over four periods, the model is still doing next-token prediction over digits. It gets the setup right, narrates the steps right, and then prints a wrong number.
That failure is structural. An LLM generating "4.7 × 1.23 × 1.31 =" does not execute a multiplication — it predicts the string of digits most likely to come next. Most of the time it gets close. Sometimes it does not. And when the numbers are large, decimal, or compounded, "close" is not "correct."
The paper "Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks" (Chen et al., 2022) made the disentanglement explicit. Have the model produce the code that represents the computation, then hand the code to an interpreter and read the interpreter's output. The model does what it is good at (translating a word problem into executable steps); the interpreter does what it is good at (computing exactly).
The paper and subsequent follow-up work found that PoT outperformed CoT on arithmetic-heavy benchmarks across multiple model families. The direction of the result is what matters for practitioners: if arithmetic is load-bearing, delegate it.
The pattern
The PoT prompt skeleton is simple:
You will be given a quantitative problem.
Do NOT attempt to solve it in natural language.
Instead, write a self-contained Python program that:
- Defines all inputs as named variables
- Computes the answer step by step
- Prints the final result, labeled
I will execute the program and return the stdout to you.
Then you will interpret the output and answer the original question.
Four moves: state the problem, generate code, execute, interpret. The model writes the program on the first turn; your runtime executes it; the model reads the execution output and produces the final answer on the second turn. (Single-turn variants exist where the model calls a code interpreter tool mid-generation — same shape, different plumbing.)
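A minimal sketch of that two-turn loop, assuming a hypothetical `call_model` function standing in for your LLM client (everything except `subprocess` here is illustrative, not a specific library's API):

```python
import subprocess

def execute(code: str, timeout: int = 5) -> str:
    """Run model-generated code in a fresh interpreter and return its stdout."""
    result = subprocess.run(
        ["python3", "-"], input=code, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client; wire in your provider SDK here."""
    raise NotImplementedError

def pot_round_trip(problem: str) -> str:
    # Turn 1: the model writes a program instead of an answer.
    code = call_model(f"Write a self-contained Python program that solves:\n{problem}")
    # The interpreter, not the model, does the arithmetic.
    stdout = execute(code)
    # Turn 2: the model reads real stdout and produces the final answer.
    return call_model(f"Program stdout:\n{stdout}\nNow answer the original question.")
```

The important property is that `stdout` in the second prompt comes from a real process, never from the model's own prediction of what the program would print.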
The worked example
The starting problem
Calculate 2026 projected revenue for a company that grew 23% YoY in 2024 and 31% YoY in 2025, on a 2023 base of $4.7M. Assume a 10% churn rate applied against the 2025 book for 2026, and assume 2026 will grow 25% YoY on the post-churn base.
This is a realistic-looking forecast question. Hypothetical — not a specific company. The math is not hard, but it is fiddly. Compound a few rates, apply churn, compound again, round appropriately. The kind of problem where CoT gets one step right and another step a little wrong.
The CoT attempt (hypothetical, with an arithmetic error)
A typical CoT response might read:
Starting base (2023): $4.7M.
After 23% growth in 2024: 4.7 × 1.23 = $5.78M.
After 31% growth in 2025: 5.78 × 1.31 = $7.57M.
Churn 10% on the 2025 book: 7.57 × 0.90 = $6.81M.
Growth 25% on the post-churn base in 2026: 6.81 × 1.25 = $8.52M.
Projected 2026 revenue: $8.52M.
This is illustrative — a hypothetical CoT trace that plausibly contains one of the failure modes we are describing. The structure is correct. But watch the first multiplication: 4.7 × 1.23 is 5.781, which rounds to 5.78 — fine. The second: 5.78 × 1.31 is 7.5718, which rounds to 7.57 — fine. The third: 7.57 × 0.90 is 6.813, which rounds to 6.81 — fine. The fourth: 6.81 × 1.25 is 8.5125, which rounds to 8.51, not 8.52.
A $0.01M rounding error on the last step. Small, but wrong. And this is the easy version of the failure mode — when the growth rates are less friendly numbers, the CoT drift is larger and less visible.
More importantly: intermediate rounding has compounded the error. Carrying full precision through all four steps gives a different final number than rounding at each intermediate.
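The compounding effect is easy to demonstrate. A quick sketch, reproducing the four steps of the trace above with and without intermediate rounding (values in $M):

```python
# Per-step rounding, as in the hand-worked CoT trace.
v = round(4.7 * 1.23, 2)       # 5.78
v = round(v * 1.31, 2)         # 7.57
v = round(v * 0.90, 2)         # 6.81
stepwise = round(v * 1.25, 2)  # 8.51

# Full precision carried through, rounded once at the end.
exact = round(4.7 * 1.23 * 1.31 * 0.90 * 1.25, 2)  # 8.52

print(stepwise, exact)  # the two disagree by $0.01M
```

Four multiplications are enough for the two approaches to diverge; longer chains or less friendly rates widen the gap.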
The PoT attempt
The same question, PoT-style. The model's first turn produces a program:
base_2023 = 4_700_000
growth_2024 = 0.23
growth_2025 = 0.31
churn_2026 = 0.10
growth_2026 = 0.25
revenue_2024 = base_2023 * (1 + growth_2024)
revenue_2025 = revenue_2024 * (1 + growth_2025)
post_churn_base_2026 = revenue_2025 * (1 - churn_2026)
revenue_2026 = post_churn_base_2026 * (1 + growth_2026)
print(f"2024 revenue: ${revenue_2024:,.2f}")
print(f"2025 revenue: ${revenue_2025:,.2f}")
print(f"2026 post-churn base: ${post_churn_base_2026:,.2f}")
print(f"2026 projected revenue: ${revenue_2026:,.2f}")
The runtime executes this and returns stdout:
2024 revenue: $5,781,000.00
2025 revenue: $7,573,110.00
2026 post-churn base: $6,815,799.00
2026 projected revenue: $8,519,748.75
On the second turn, the model synthesizes:
2026 projected revenue is $8,519,748.75 (≈ $8.52M), built from a 2023 base of $4.70M growing 23% in 2024 and 31% in 2025, then losing 10% to churn before growing 25% in 2026. Intermediate values are printed above for audit.
Rounded to $8.52M — the same final number the CoT happened to land on, but for different reasons. The CoT got there partly through rounding drift; the PoT got there through full-precision computation. Change any input by a non-round amount and the CoT will drift visibly while the PoT stays correct.
Scoring with the SurePrompts Quality Rubric
Scored against the SurePrompts Quality Rubric, the PoT version improves one dimension dramatically:
- Output validation: 2 → 5. Stdout is a deterministic artifact you can diff against expected values. Regression tests become trivial: given these inputs, stdout must match this string. Validation went from "human reads answer" to "CI compares strings."
- Instruction specificity: 3 → 5. The prompt does not ask for "an answer"; it asks for a program that defines inputs, computes intermediates, and prints labeled outputs. That is the specification, unambiguous.
- Constraint tightness: 2 → 4. "Do NOT solve in natural language" removes the dominant failure mode (hallucinated arithmetic). Other constraints (no external libraries, no network calls, print-only output) can be added to reach 5.
The other dimensions (role, context, format, examples) are unchanged in this framing — they are orthogonal to whether the computation happens in tokens or in a Python process.
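To make the "CI compares strings" claim concrete, a regression check over the worked example can be sketched as follows. The program text and expected string below are fixtures derived from the run in this post; the test harness shape is an assumption, not a prescribed framework:

```python
import subprocess

# The generated program from the worked example, stored as a test fixture.
PROGRAM = """
base_2023 = 4_700_000
revenue_2024 = base_2023 * 1.23
revenue_2025 = revenue_2024 * 1.31
post_churn = revenue_2025 * 0.90
revenue_2026 = post_churn * 1.25
print(f"2026 projected revenue: ${revenue_2026:,.2f}")
"""

EXPECTED = "2026 projected revenue: $8,519,748.75\n"

def test_forecast_program():
    stdout = subprocess.run(
        ["python3", "-"], input=PROGRAM, capture_output=True, text=True, timeout=5
    ).stdout
    # CI compares strings: any drift in the computation fails the build.
    assert stdout == EXPECTED
```

Nothing in this check requires an LLM at all — which is exactly the point of scoring Output Validation at 5.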
What the interpreter setup looks like
"Execute the program" is a loaded phrase in production. Four common realizations:
- Hosted code-interpreter tool. Both Anthropic and OpenAI ship a first-party code execution tool the model can call mid-generation. You do not run a sandbox yourself; the provider does. Simplest to set up, highest per-call cost, tightly coupled to the provider's environment.
- Your own subprocess sandbox. A `subprocess.run(["python3", "-"], input=code, timeout=5)` in a restricted environment — no network, no filesystem beyond a temp dir, resource limits. Full control, real engineering work to harden, cheap per-call once built.
- Hosted notebook execution. Jupyter kernel, Modal, E2B, Piston — services whose whole product is "run this code safely and return stdout." Middle ground on cost and operational surface.
- Pre-approved function catalog. Instead of arbitrary code, expose a fixed set of Python functions as function-calling targets. The model composes them. More constrained than raw PoT; safer for sensitive domains.
The common pattern in all four: the model's output is code or a tool call, a runtime executes it, and the runtime's output is fed back as the next turn's context. See Tool use prompting patterns for the broader shape of this loop.
Failure modes
PoT is not magic. The main failure modes we see:
Wrong formula in correct-looking code. The model writes syntactically valid Python that computes the wrong thing — applying growth to the wrong base, using simple interest where compound is needed, mixing up percentage point changes with percentage changes. The interpreter returns a number; the number is wrong because the program was wrong. Mitigation: scored test cases, unit tests on the generated function, or an LLM-as-judge that re-reads the code and flags formula suspicion.
Silent code failure that gets misinterpreted. The program raises an exception or prints nothing; the model, on its second turn, fabricates a plausible answer instead of surfacing the failure. Mitigation: the interpreter's return contract should make failure unmissable (e.g., "EXECUTION_FAILED: ..." prefix). The prompt should instruct the model to return a specific "computation failed" response if the stdout starts with the failure marker.
Hallucinated execution when not actually wired. The model writes code, then "runs" it in natural language — producing a fake stdout block that looks like real output. This is the single most dangerous failure because it is invisible. Mitigation: never infer execution from conversational cues; always route through an actual interpreter, and log the runtime's exit code alongside the output.
Dependency drift. The program imports libraries that are not in the sandbox. Mitigation: whitelist libraries in the prompt ("use only the Python standard library and numpy") and enforce at the sandbox layer.
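The first two mitigations can be built into the execution wrapper itself. A sketch, assuming the `EXECUTION_FAILED:` marker named above is the contract your prompt tells the model to watch for:

```python
import subprocess

FAILURE_MARKER = "EXECUTION_FAILED:"

def run_pot(code: str, timeout: int = 5) -> str:
    """Execute generated code; make every failure mode unmissable in the return value."""
    try:
        result = subprocess.run(
            ["python3", "-"], input=code, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return f"{FAILURE_MARKER} timed out after {timeout}s"
    if result.returncode != 0:
        # Exceptions become an explicit marker, not an empty string the model can paper over.
        return f"{FAILURE_MARKER} exit code {result.returncode}\n{result.stderr}"
    if not result.stdout.strip():
        return f"{FAILURE_MARKER} program produced no output"
    return result.stdout
```

With this contract, the second-turn prompt can say "if the output starts with `EXECUTION_FAILED:`, report a computation failure" and have something deterministic to key on.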
Variants and extensions
PoT generalizes beyond the simple compound-growth case:
- Financial modeling — DCF calculations, sensitivity tables, scenario analysis. The code expresses the model; the interpreter runs the scenarios.
- Data analysis — aggregations, joins, quantiles, statistical tests over a dataset the model has access to via the interpreter. See AI prompts for data analysis for task shapes that benefit.
- Physics and engineering — unit conversions, differential equations solved numerically, tolerance checks. Any domain where a human would reach for a calculator anyway.
- Combinatorics and search — counting problems, brute-force enumerations the model can express but not accurately predict.
- Reasoning models — even reasoning models benefit from PoT on arithmetic. Their internal reasoning is stronger than a non-reasoning model's, but it is still tokens predicting tokens. See the prompting reasoning models guide for how extended thinking and PoT stack.
Where plain Chain-of-Thought is still the right call: conceptual reasoning, argumentation, planning, summarization, any task where the answer is argued rather than computed. If there is nothing to calculate, there is no interpreter advantage.
Our position
- PoT is the default for numerical tasks in production. CoT is fine for drafts and exploration; anything that ships with a number in its output should be computing that number in an interpreter, not predicting it in tokens.
- A Program-of-Thoughts prompt without execution is a bug. Either wire the code interpreter or remove the code-generation framing. The middle state — code shown, output hallucinated — is the worst of both worlds and the easiest to miss in review.
- Log stdout with every PoT trace. Two reasons. First, it lets you replay and diff the computation. Second, it makes the Rubric's Output Validation scoring automatable — stdout is a checkable artifact.
- Pair PoT with structured output. Let the interpreter produce numbers; let a schema format the report around them. Keeps computation deterministic and presentation readable. See our guide to structured output prompting for the pairing.
- Do not apply PoT where there is no arithmetic. It adds latency, cost, and surface area to a model that was already fine at the qualitative task. PoT is a targeted tool, not a universal upgrade.
Related reading
- Chain-of-Thought Prompting: The Secret to Complex Problem Solving — the technique PoT is a targeted improvement over.
- The SurePrompts Quality Rubric — seven dimensions to score the full PoT flow, not just the code.
- Tool use prompting patterns — the broader loop PoT plugs into.
- Prompting reasoning models guide — how PoT stacks with extended thinking.
- Prompt engineering for developers — patterns for wiring prompts into real runtimes.
- AI prompts for data analysis — task shapes where PoT is the obvious fit.
- Advanced prompt engineering techniques — where PoT sits in the broader toolkit.
- RCAF prompt structure — the drafting skeleton that pairs with PoT for the wrapper prompt.