Tip
TL;DR: Program-of-Thoughts (PoT) prompting asks the model to write a Python program that solves a numerical problem, then executes that program in an interpreter. The model reasons in code, the interpreter computes. On arithmetic-heavy tasks, this removes the LLM-as-calculator failure mode that even strong Chain-of-Thought prompts cannot fully fix. This post walks through a revenue-forecast example end to end.
Key takeaways:
- LLMs are unreliable calculators. PoT stops asking them to be calculators and starts asking them to be program authors — which they are good at.
- PoT without code execution is not PoT. If nothing runs the program, the model predicts the output and you have regressed below Chain-of-Thought.
- The core split: natural language for understanding the problem, code for computing the answer, natural language again for presenting the result.
- PoT dominates arithmetic, compound growth, iteration, and tabular math. It does not help conceptual reasoning where there is nothing to compute.
- Pair PoT with structured output for the final report — the code produces numbers, the schema produces a readable artifact.
- Score the whole flow with the SurePrompts Quality Rubric; the Output Validation dimension is where PoT earns its keep.
Why Program-of-Thoughts exists
Chain-of-thought prompting is a large step forward for reasoning tasks because it forces the model to decompose a problem instead of jumping to an answer. But CoT has a well-known ceiling on arithmetic: when the final step requires multiplying three decimals together, or compounding a growth rate over four periods, the model is still doing next-token prediction over digits. It gets the setup right, narrates the steps right, and then prints a wrong number.
That failure is structural. An LLM generating "4.7 × 1.23 × 1.31 =" does not execute a multiplication — it predicts the string of digits most likely to come next. Most of the time it gets close. Sometimes it does not. And when the numbers are large, decimal, or compounded, "close" is not "correct."
The paper "Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks" (Chen et al., 2022) made the disentanglement explicit. Have the model produce the code that represents the computation, then hand the code to an interpreter and read the interpreter's output. The model does what it is good at (translating a word problem into executable steps); the interpreter does what it is good at (computing exactly).
The paper and subsequent follow-up work found that PoT outperformed CoT on arithmetic-heavy benchmarks across multiple model families. The direction of the result is what matters for practitioners: if arithmetic is load-bearing, delegate it.
The pattern
The PoT prompt skeleton is simple:
You will be given a quantitative problem.
Do NOT attempt to solve it in natural language.
Instead, write a self-contained Python program that:
- Defines all inputs as named variables
- Computes the answer step by step
- Prints the final result, labeled
I will execute the program and return the stdout to you.
Then you will interpret the output and answer the original question.
Four moves: state the problem, generate code, execute, interpret. The model writes the program on the first turn; your runtime executes it; the model reads the execution output and produces the final answer on the second turn. (Single-turn variants exist where the model calls a code interpreter tool mid-generation — same shape, different plumbing.)
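A minimal sketch of that two-turn loop, assuming a hypothetical `call_model` function standing in for your LLM client (everything except `subprocess` here is illustrative, not a specific library's API):

```python
import subprocess

def execute(code: str, timeout: int = 5) -> str:
    """Run model-generated code in a fresh interpreter and return its stdout."""
    result = subprocess.run(
        ["python3", "-"], input=code, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client; wire in your provider SDK here."""
    raise NotImplementedError

def pot_round_trip(problem: str) -> str:
    # Turn 1: the model writes a program instead of an answer.
    code = call_model(f"Write a self-contained Python program that solves:\n{problem}")
    # The interpreter, not the model, does the arithmetic.
    stdout = execute(code)
    # Turn 2: the model reads real stdout and produces the final answer.
    return call_model(f"Program stdout:\n{stdout}\nNow answer the original question.")
```

The important property is that `stdout` in the second prompt comes from a real process, never from the model's own prediction of what the program would print.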
The worked example
The starting problem
Calculate 2026 projected revenue for a company that grew 23% YoY in 2024 and 31% YoY in 2025, on a 2023 base of $4.7M. Assume a 10% churn rate applied against the 2025 book for 2026, and assume 2026 will grow 25% YoY on the post-churn base.
This is a realistic-looking forecast question. Hypothetical — not a specific company. The math is not hard, but it is fiddly. Compound a few rates, apply churn, compound again, round appropriately. The kind of problem where CoT gets one step right and another step a little wrong.
The CoT attempt (hypothetical, with an arithmetic error)
A typical CoT response might read:
Starting base (2023): $4.7M.
After 23% growth in 2024: 4.7 × 1.23 = $5.78M.
After 31% growth in 2025: 5.78 × 1.31 = $7.57M.
Churn 10% on the 2025 book: 7.57 × 0.90 = $6.81M.
Growth 25% on the post-churn base in 2026: 6.81 × 1.25 = $8.52M.
Projected 2026 revenue: $8.52M.
This is illustrative — a hypothetical CoT trace that plausibly contains one of the failure modes we are describing. The structure is correct. But watch the first multiplication: 4.7 × 1.23 is 5.781, which rounds to 5.78 — fine. The second: 5.78 × 1.31 is 7.5718, which rounds to 7.57 — fine. The third: 7.57 × 0.90 is 6.813, which rounds to 6.81 — fine. The fourth: 6.81 × 1.25 is 8.5125, which rounds to 8.51, not 8.52.
A $0.01M rounding error on the last step. Small, but wrong. And this is the easy version of the failure mode — when the growth rates are less friendly numbers, the CoT drift is larger and less visible.
More importantly: intermediate rounding has compounded the error. Carrying full precision through all four steps gives a different final number than rounding at each intermediate.
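The compounding effect is easy to demonstrate. A quick sketch, reproducing the four steps of the trace above with and without intermediate rounding (values in $M):

```python
# Per-step rounding, as in the hand-worked CoT trace.
v = round(4.7 * 1.23, 2)       # 5.78
v = round(v * 1.31, 2)         # 7.57
v = round(v * 0.90, 2)         # 6.81
stepwise = round(v * 1.25, 2)  # 8.51

# Full precision carried through, rounded once at the end.
exact = round(4.7 * 1.23 * 1.31 * 0.90 * 1.25, 2)  # 8.52

print(stepwise, exact)  # the two disagree by $0.01M
```

Four multiplications are enough for the two approaches to diverge; longer chains or less friendly rates widen the gap.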
The PoT attempt
The same question, PoT-style. The model's first turn produces a program:
base_2023 = 4_700_000
growth_2024 = 0.23
growth_2025 = 0.31
churn_2026 = 0.10
growth_2026 = 0.25
revenue_2024 = base_2023 * (1 + growth_2024)
revenue_2025 = revenue_2024 * (1 + growth_2025)
post_churn_base_2026 = revenue_2025 * (1 - churn_2026)
revenue_2026 = post_churn_base_2026 * (1 + growth_2026)
print(f"2024 revenue: ${revenue_2024:,.2f}")
print(f"2025 revenue: ${revenue_2025:,.2f}")
print(f"2026 post-churn base: ${post_churn_base_2026:,.2f}")
print(f"2026 projected revenue: ${revenue_2026:,.2f}")
The runtime executes this and returns stdout:
2024 revenue: $5,781,000.00
2025 revenue: $7,573,110.00
2026 post-churn base: $6,815,799.00
2026 projected revenue: $8,519,748.75
On the second turn, the model synthesizes:
2026 projected revenue is $8,519,748.75 (≈ $8.52M), built from a 2023 base of $4.70M growing 23% in 2024 and 31% in 2025, then losing 10% to churn before growing 25% in 2026. Intermediate values are printed above for audit.
Rounded to $8.52M — the same final number the CoT happened to land on, but for different reasons. The CoT got there partly through rounding drift; the PoT got there through full-precision computation. Change any input by a non-round amount and the CoT will drift visibly while the PoT stays correct.
Scoring with the SurePrompts Quality Rubric
Scored against the SurePrompts Quality Rubric, the PoT version improves one dimension dramatically:
- Output validation: 2 → 5. Stdout is a deterministic artifact you can diff against expected values. Regression tests become trivial: given these inputs, stdout must match this string. Validation went from "human reads answer" to "CI compares strings."
- Instruction specificity: 3 → 5. The prompt does not ask for "an answer"; it asks for a program that defines inputs, computes intermediates, and prints labeled outputs. That is the specification, unambiguous.
- Constraint tightness: 2 → 4. "Do NOT solve in natural language" removes the dominant failure mode (hallucinated arithmetic). Other constraints (no external libraries, no network calls, print-only output) can be added to reach 5.
The other dimensions (role, context, format, examples) are unchanged in this framing — they are orthogonal to whether the computation happens in tokens or in a Python process.
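To make the "CI compares strings" claim concrete, a regression check over the worked example can be sketched as follows. The program text and expected string below are fixtures derived from the run in this post; the test harness shape is an assumption, not a prescribed framework:

```python
import subprocess

# The generated program from the worked example, stored as a test fixture.
PROGRAM = """
base_2023 = 4_700_000
revenue_2024 = base_2023 * 1.23
revenue_2025 = revenue_2024 * 1.31
post_churn = revenue_2025 * 0.90
revenue_2026 = post_churn * 1.25
print(f"2026 projected revenue: ${revenue_2026:,.2f}")
"""

EXPECTED = "2026 projected revenue: $8,519,748.75\n"

def test_forecast_program():
    stdout = subprocess.run(
        ["python3", "-"], input=PROGRAM, capture_output=True, text=True, timeout=5
    ).stdout
    # CI compares strings: any drift in the computation fails the build.
    assert stdout == EXPECTED
```

Nothing in this check requires an LLM at all — which is exactly the point of scoring Output Validation at 5.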
What the interpreter setup looks like
"Execute the program" is a loaded phrase in production. Four common realizations:
- Hosted code-interpreter tool. Both Anthropic and OpenAI ship a first-party code execution tool the model can call mid-generation. You do not run a sandbox yourself; the provider does. Simplest to set up, highest per-call cost, tightly coupled to the provider's environment.
- Your own subprocess sandbox. A `subprocess.run(["python3", "-"], input=code, timeout=5)` in a restricted environment — no network, no filesystem beyond a temp dir, resource limits. Full control, real engineering work to harden, cheap per-call once built.
- Hosted notebook execution. Jupyter kernel, Modal, E2B, Piston — services whose whole product is "run this code safely and return stdout." Middle ground on cost and operational surface.
- Pre-approved function catalog. Instead of arbitrary code, expose a fixed set of Python functions as function-calling targets. The model composes them. More constrained than raw PoT; safer for sensitive domains.
The common pattern in all four: the model's output is code or a tool call, a runtime executes it, and the runtime's output is fed back as the next turn's context. See Tool use prompting patterns for the broader shape of this loop.
Failure modes
PoT is not magic. The main failure modes we see:
Wrong formula in correct-looking code. The model writes syntactically valid Python that computes the wrong thing — applying growth to the wrong base, using simple interest where compound is needed, mixing up percentage point changes with percentage changes. The interpreter returns a number; the number is wrong because the program was wrong. Mitigation: scored test cases, unit tests on the generated function, or an LLM-as-judge that re-reads the code and flags formula suspicion.
Silent code failure that gets misinterpreted. The program raises an exception or prints nothing; the model, on its second turn, fabricates a plausible answer instead of surfacing the failure. Mitigation: the interpreter's return contract should make failure unmissable (e.g., "EXECUTION_FAILED: ..." prefix). The prompt should instruct the model to return a specific "computation failed" response if the stdout starts with the failure marker.
Hallucinated execution when not actually wired. The model writes code, then "runs" it in natural language — producing a fake stdout block that looks like real output. This is the single most dangerous failure because it is invisible. Mitigation: never infer execution from conversational cues; always route through an actual interpreter, and log the runtime's exit code alongside the output.
Dependency drift. The program imports libraries that are not in the sandbox. Mitigation: whitelist libraries in the prompt ("use only the Python standard library and numpy") and enforce at the sandbox layer.
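The first two mitigations can be built into the execution wrapper itself. A sketch, assuming the `EXECUTION_FAILED:` marker named above is the contract your prompt tells the model to watch for:

```python
import subprocess

FAILURE_MARKER = "EXECUTION_FAILED:"

def run_pot(code: str, timeout: int = 5) -> str:
    """Execute generated code; make every failure mode unmissable in the return value."""
    try:
        result = subprocess.run(
            ["python3", "-"], input=code, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return f"{FAILURE_MARKER} timed out after {timeout}s"
    if result.returncode != 0:
        # Exceptions become an explicit marker, not an empty string the model can paper over.
        return f"{FAILURE_MARKER} exit code {result.returncode}\n{result.stderr}"
    if not result.stdout.strip():
        return f"{FAILURE_MARKER} program produced no output"
    return result.stdout
```

With this contract, the second-turn prompt can say "if the output starts with `EXECUTION_FAILED:`, report a computation failure" and have something deterministic to key on.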
Variants and extensions
PoT generalizes beyond the simple compound-growth case:
- Financial modeling — DCF calculations, sensitivity tables, scenario analysis. The code expresses the model; the interpreter runs the scenarios.
- Data analysis — aggregations, joins, quantiles, statistical tests over a dataset the model has access to via the interpreter. See AI prompts for data analysis for task shapes that benefit.
- Physics and engineering — unit conversions, differential equations solved numerically, tolerance checks. Any domain where a human would reach for a calculator anyway.
- Combinatorics and search — counting problems, brute-force enumerations the model can express but not accurately predict.
- Reasoning models — even reasoning models benefit from PoT on arithmetic. Their internal reasoning is stronger than a non-reasoning model's, but it is still tokens predicting tokens. See the prompting reasoning models guide for how extended thinking and PoT stack.
Where plain Chain-of-Thought is still the right call: conceptual reasoning, argumentation, planning, summarization, any task where the answer is argued rather than computed. If there is nothing to calculate, there is no interpreter advantage.
Our position
- PoT is the default for numerical tasks in production. CoT is fine for drafts and exploration; anything that ships with a number in its output should be computing that number in an interpreter, not predicting it in tokens.
- A Program-of-Thoughts prompt without execution is a bug. Either wire the code interpreter or remove the code-generation framing. The middle state — code shown, output hallucinated — is the worst of both worlds and the easiest to miss in review.
- Log stdout with every PoT trace. Two reasons. First, it lets you replay and diff the computation. Second, it makes the Rubric's Output Validation scoring automatable — stdout is a checkable artifact.
- Pair PoT with structured output. Let the interpreter produce numbers; let a schema format the report around them. Keeps computation deterministic and presentation readable. See our guide to structured output prompting for the pairing.
- Do not apply PoT where there is no arithmetic. It adds latency, cost, and surface area to a model that was already fine at the qualitative task. PoT is a targeted tool, not a universal upgrade.
Related reading
- Chain-of-Thought Prompting: The Secret to Complex Problem Solving — the technique PoT is a targeted improvement over.
- The SurePrompts Quality Rubric — seven dimensions to score the full PoT flow, not just the code.
- Tool use prompting patterns — the broader loop PoT plugs into.
- Prompting reasoning models guide — how PoT stacks with extended thinking.
- Prompt engineering for developers — patterns for wiring prompts into real runtimes.
- AI prompts for data analysis — task shapes where PoT is the obvious fit.
- Advanced prompt engineering techniques — where PoT sits in the broader toolkit.
- RCAF prompt structure — the drafting skeleton that pairs with PoT for the wrapper prompt.