Tip
TL;DR: Chain-of-Code (CoC) prompting has the model write pseudocode that mixes executable Python with natural-language "pseudo-functions" the interpreter cannot run. The interpreter executes the code lines exactly; the model emulates the non-code lines by predicting what their output would be. You get the arithmetic guarantees of Program-of-Thoughts plus the flexibility of Chain-of-Thought — on tasks where part of the work is numeric and part is qualitative.
Key takeaways:
- Chain-of-Code bridges PoT and CoT. PoT handles computable tasks cleanly; CoT handles qualitative tasks cleanly. CoC is the mixed-workload primitive.
- From Li et al. (2023), "Chain of Code: Reasoning with a Language Model-Augmented Code Emulator." The framing treats the LLM as a second executor alongside the interpreter.
- Executable lines run exactly; non-executable lines are emulated by the model. The model plays interpreter when the interpreter cannot.
- The dispatch contract matters. Lines must be unambiguous about their executor, or the system silently hallucinates arithmetic that should have been computed.
- Score the full flow with the SurePrompts Quality Rubric; CoC improves Output Validation on the code lines and Instruction Specificity on the whole trace.
Where CoT and PoT leave gaps
Chain-of-Thought and Program-of-Thoughts have complementary ceilings. CoT reasons in natural language but is an unreliable calculator — "multiply 7.57 by 1.25 and round" gets the digits wrong because it is predicting tokens, not computing. PoT fixes that with a real interpreter, but is brittle on steps that cannot be coded: "classify the customer's emotional tone," "identify the likely root cause of this outage." Forcing a stand-in that returns "unknown" makes the code run while hiding that the hard part of the reasoning never happened.
Most real analytical tasks live in the middle: compute some quantities, then interpret them. CoT loses on the arithmetic; PoT loses on the interpretation. Practitioners hack around this by stitching the two with a third prompt — glue that is a source of bugs and drift.
Chain-of-Code proposes a cleaner primitive: one pseudocode trace where code lines are really executed and non-code lines are emulated by the model. The runtime dispatches each line to the right executor, producing a single coherent program that spans both modes.
The pattern
The CoC prompt skeleton:
```
You will be given a task that mixes computation with qualitative reasoning.

Write a single pseudocode program. It may contain:
(A) Standard Python lines — executed by a real interpreter.
(B) Natural-language pseudo-function calls — evaluated by YOU based on the
    function name, the inputs, and the context.

Name pseudo-functions with an `emulated_` prefix (e.g.,
`emulated_classify_tone(sentence)`) so the runtime can route them to you
instead of the interpreter.

The runtime will execute the program line by line, feeding interpreter and
emulated output back into the variable scope. You receive the full trace and
produce the final answer.
```
Four moves: the model writes the mixed program, the runtime partitions lines by type, each line runs on the correct executor feeding a shared scope, and the final answer reassembles from the trace. The emulated_ prefix is the contract — without it the runtime will either execute nonsense as Python or hand arithmetic to the model.
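The routing side of that contract can be sketched in a few lines. A minimal statement-per-line dispatcher (hypothetical names throughout; `call_model` is a stand-in for a real LLM invocation, and block statements such as loops would need real parsing rather than line-by-line `exec`):

```python
import re

# An emulated line is any line that calls an emulated_* pseudo-function.
EMULATED_CALL = re.compile(r"\bemulated_\w+\s*\(")

def call_model(line: str, scope: dict) -> str:
    # Placeholder for the LLM sub-call; a real runtime would send the line
    # plus the relevant scope to the model and return its predicted value.
    return f"<emulated: {line.strip()}>"

def run_coc(program: str) -> dict:
    """Route each line: emulated_* calls go to the model, the rest to exec()."""
    scope: dict = {}
    for line in program.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        if EMULATED_CALL.search(line):
            # Model executes: bind the emulated value to the assignment target.
            target = line.split("=", 1)[0].strip()
            scope[target] = call_model(line, scope)
        else:
            # Interpreter executes: real Python against the shared scope.
            exec(line, {}, scope)
    return scope

scope = run_coc("x = 7.57 * 1.25\ntone = emulated_classify_tone(x)")
```

The key property is the shared `scope`: interpreter results and emulated values land in the same namespace, so later lines can use either without knowing which executor produced them.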
Worked example
The starting problem
Given monthly revenue for 2024 and 2025 (24 data points), detect anomalies using a z-score threshold of ±2, and for each anomaly provide a likely cause and a recommended next action. Return a structured summary.
Hypothetical data, not a specific company. The anomaly detection is a clean computation, the cause explanation is qualitative judgment, and the recommended action combines the two. Pure Program-of-Thoughts would have to mock the cause step; pure Chain-of-Thought would mis-round the standard deviation. CoC routes each step to its natural executor.
The CoC pseudocode
The model's first turn produces a program like this. The emulated_ prefix marks lines the model will evaluate itself; everything else is real Python.
```python
import statistics

monthly_revenue = {
    "2024-01": 412_000, "2024-02": 398_000, "2024-03": 425_000,
    "2024-04": 441_000, "2024-05": 418_000, "2024-06": 432_000,
    "2024-07": 285_000, "2024-08": 447_000, "2024-09": 452_000,
    "2024-10": 468_000, "2024-11": 471_000, "2024-12": 503_000,
    "2025-01": 495_000, "2025-02": 489_000, "2025-03": 512_000,
    "2025-04": 521_000, "2025-05": 508_000, "2025-06": 534_000,
    "2025-07": 541_000, "2025-08": 538_000, "2025-09": 672_000,
    "2025-10": 552_000, "2025-11": 561_000, "2025-12": 588_000,
}

values = list(monthly_revenue.values())
mean = statistics.mean(values)
stdev = statistics.stdev(values)

anomalies = []
for month, revenue in monthly_revenue.items():
    z = (revenue - mean) / stdev
    if abs(z) >= 2:
        anomalies.append({"month": month, "revenue": revenue, "z": round(z, 2)})

# Emulated steps: the interpreter cannot explain causes or recommend actions.
enriched = []
for a in anomalies:
    cause = emulated_likely_cause(
        month=a["month"], revenue=a["revenue"], z=a["z"], mean=mean
    )
    action = emulated_recommended_action(month=a["month"], cause=cause, z=a["z"])
    enriched.append({**a, "cause": cause, "action": action})

summary = emulated_structured_summary(
    anomalies=enriched, mean=mean, stdev=stdev, horizon="2024-2025"
)
print(summary)
```
The execution trace
The runtime walks the program line by line. The imports, the dict, and the loop body are standard Python — the interpreter executes them and produces real values:
mean = 485,958.33
stdev = 76,871.14
anomalies detected (|z| >= 2):
- {"month": "2024-07", "revenue": 285,000, "z": -2.61}
- {"month": "2025-09", "revenue": 672,000, "z": 2.42}
(Those numbers are computed by the interpreter against the dataset above. They would be reproducible by anyone running the same code on the same data.)
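A standalone recomputation of the interpreter-side values (same 24-month dataset, stdlib only):

```python
import statistics

# The 24 monthly revenue figures from the program above, in order.
values = [
    412_000, 398_000, 425_000, 441_000, 418_000, 432_000,
    285_000, 447_000, 452_000, 468_000, 471_000, 503_000,
    495_000, 489_000, 512_000, 521_000, 508_000, 534_000,
    541_000, 538_000, 672_000, 552_000, 561_000, 588_000,
]
mean = statistics.mean(values)    # 485958.33 (rounded to 2 places)
stdev = statistics.stdev(values)  # 76871.14 (sample stdev, rounded)
z_jul24 = round((285_000 - mean) / stdev, 2)  # -2.61
z_sep25 = round((672_000 - mean) / stdev, 2)  # 2.42
```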
When the loop hits emulated_likely_cause(...), the runtime pauses, sends the call as a sub-prompt to the model, and the model responds. For the July 2024 anomaly:
emulated_likely_cause(month="2024-07", revenue=285000, z=-2.61, mean=485958.33) → "Sharp contraction of ~41% below the mean. Likely causes: (1) a large deal slipping out of the quarter, (2) a seasonal dip compounded by customer churn, (3) a billing incident delaying recognition. (1) is the most common explanation for a single-month negative z of this magnitude."
For September 2025:
emulated_likely_cause(month="2025-09", ...) → "Revenue spike of ~38% above the mean. Likely causes: (1) a large expansion deal booked in the month, (2) a campaign or product launch landing mid-quarter, (3) a one-time consulting engagement. Given the steady 2025 uptrend, (1) is the most common cause."
The emulated_recommended_action(...) calls produce short, task-aware responses. Finally emulated_structured_summary(...) composes the final text, which the interpreter prints.
The assembled answer
The runtime returns the full interleaved trace. Final user-facing output:
> Anomaly summary for 2024-2025. Two months fall outside the ±2 z-score threshold (mean \$486.0K, σ \$76.9K).
>
> - 2024-07 — \$285K, z = −2.61. Likely cause: large deal slipped out of the quarter (most probable), with seasonal dip and billing incident as alternates. Action: pull the July pipeline report and compare closed-won against the Q2 forecast.
> - 2025-09 — \$672K, z = +2.42. Likely cause: a large expansion deal or multi-year contract booked in the month. Action: confirm whether the revenue is recurring vs. one-time before adjusting the 2026 baseline.
Every number was computed by the interpreter; every explanation was emulated by the model. The division of labor is visible and auditable on both sides.
Scoring with the SurePrompts Quality Rubric
Against the SurePrompts Quality Rubric, CoC moves two dimensions relative to plain CoT:
- Output Validation: 2 → 5 on the code lines. The z-scores and anomaly list are deterministic functions of the dataset; a regression test pins them. Emulated lines stay at CoT's level — you have not made interpretation testable, only ringfenced it.
- Instruction Specificity: 3 → 5. The program is the specification: inputs, thresholds, emulation points. Ambiguity is pushed to named `emulated_*` functions, where it is at least visible.
- Constraint Tightness: 3 → 4. Pseudo-function signatures limit what emulation covers. The model cannot smuggle arithmetic into an emulation; the computations are already in Python above it.
How the "simulated" parts work
When the runtime hits emulated_likely_cause(...), it does not throw NameError. It dispatches the call to a model invocation with a sub-prompt:
```
You are acting as an interpreter for a pseudo-function call.

Function name: emulated_likely_cause
Inputs: month="2024-07", revenue=285000, z=-2.61, mean=485958.33

Produce the return value. Infer the expected type from the function name.
Return only the value, no preamble.
```
The model responds with a value; the runtime binds it to the variable on the left of the assignment; execution continues. This is the core divergence from PoT — in PoT the handoff is clean and one-way (model writes code, interpreter runs it); in CoC the handoff is interleaved and bidirectional (interpreter runs what it can, model runs what it cannot, both feed into the same scope).
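Constructing that sub-prompt is mechanical. A sketch of the helper a runtime might use (an illustrative function, not an API from the paper):

```python
def emulation_subprompt(fn_name: str, kwargs: dict) -> str:
    """Build the model-facing sub-prompt for one emulated_* call."""
    inputs = ", ".join(f"{k}={v!r}" for k, v in kwargs.items())
    return (
        "You are acting as an interpreter for a pseudo-function call.\n"
        f"Function name: {fn_name}\n"
        f"Inputs: {inputs}\n"
        "Produce the return value. Infer the expected type from the function name.\n"
        "Return only the value, no preamble."
    )

prompt = emulation_subprompt(
    "emulated_likely_cause",
    {"month": "2024-07", "revenue": 285000, "z": -2.61, "mean": 485958.33},
)
```

Passing the arguments by `repr` keeps the model's view of the inputs unambiguous, which matters when the return value gets bound back into the program's scope.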
Li et al. (2023) describe this formally: the LLM is an "emulator" that stands in for the interpreter on steps it cannot handle. The engineering consequence is a runtime to build — or extend, if you already run a code interpreter tool-use framework. Reasoning models benefit too: extended thinking helps plan the dispatch boundary but does not replace the interpreter.
When to pick CoC vs PoT vs CoT
| Dimension | CoT | PoT | CoC |
|---|---|---|---|
| Best on | Conceptual reasoning | Numerical computation | Compute + interpret |
| Execution | Token prediction | Interpreter | Interpreter + model emulator |
| Runtime cost | Low | Medium (sandbox) | Medium-high (sandbox + extra calls) |
| Arithmetic guarantee | Weak | Strong | Strong on code lines only |
| Interpretation guarantee | Weak | Not supported | Weak (CoT-level on emulated lines) |
| Failure if misused | Wrong numbers | Correct code, wrong formula | Dispatch boundary blurred |
Rule of thumb: no numbers → CoT; pure computation → PoT; mixed with non-codeable interpretation → CoC.
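The rule of thumb is mechanical enough to encode (a trivial sketch; the two predicates stand for whatever task triage you already run):

```python
def pick_pattern(needs_computation: bool, needs_interpretation: bool) -> str:
    """Route a task to CoT, PoT, or CoC per the rule of thumb above."""
    if needs_computation and needs_interpretation:
        return "CoC"   # mixed workload: compute, then interpret
    if needs_computation:
        return "PoT"   # pure computation: interpreter only
    return "CoT"       # no numbers: natural-language reasoning only
```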
Failure modes
CoC is not free. The failure modes cluster around the dispatch boundary:
Model skips the real execution. The model narrates outputs of the real code lines instead of letting the interpreter compute them — mean = 485,958.33 printed as a token prediction, not an interpreter result. Mitigation: the runtime, not the model, executes code lines. The model stops at code; the runtime inserts the real output; the model continues.
Hallucinated execution results. Adjacent failure — the model invents interpreter output for code lines. The most dangerous mode because invented numbers are plausible. Mitigation: log actual interpreter output alongside the model-visible trace; if the two diverge on a code line, the model hallucinated.
Mixed-format confusion. Missing emulated_ prefix on a qualitative step, or real code written in pseudo-function form. The runtime either raises NameError or emulates something that should have been computed. Mitigation: enforce the prefix contract; verify each emulated_* line is a call and every non-emulated_ line is valid Python.
Emulation leakage. The model writes emulated_compute_growth(base, rate) and "emulates" it by predicting a number — the exact LLM-as-calculator failure CoC was meant to avoid. Mitigation: reject pseudo-functions whose signatures look arithmetic (compute_*, calculate_*, sum_*) and instruct the model to use real Python for calculator-shaped work.
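The last two mitigations are cheap static checks. A sketch of a lint pass over one pseudocode line (hypothetical checks; extend the calculator-shaped name list to taste):

```python
import ast
import re

# emulated_* names that suggest computable work (the leakage failure above).
ARITHMETIC_SHAPED = re.compile(r"\bemulated_(compute|calculate|sum|count|total)\w*\s*\(")

def lint_coc_line(line: str) -> list[str]:
    """Return contract violations for one line of a CoC program."""
    problems = []
    if ARITHMETIC_SHAPED.search(line):
        problems.append("emulation leakage: calculator-shaped pseudo-function")
    if "emulated_" not in line:
        # Non-emulated lines must be valid Python (mixed-format check).
        try:
            ast.parse(line)
        except SyntaxError:
            problems.append("mixed-format confusion: not Python, not an emulated_ call")
    return problems
```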
Our position
- Chain-of-Code is the right default for mixed analytical tasks. Data analysis with commentary, experiment summaries with interpretation, diagnostic triage with root-cause reasoning — the tasks teams most often build awkward multi-prompt stitching for.
- The runtime is the hard part, not the prompt. CoC depends on a dispatcher that routes lines to the right executor, logs both sides, and surfaces boundary violations. See Tool use prompting patterns for the loop shape to extend.
- Start from a working Program-of-Thoughts runtime. Teams building from scratch underestimate the sandboxing work. Add an emulation dispatcher to an existing PoT stack and you are most of the way there.
- Do not force CoC on tasks that do not mix modes. Pure reasoning stays in CoT; pure computation stays in PoT. CoC earns its complexity specifically on mixed workloads.
- Log every dispatch decision. For each line, record whether it was executed by the interpreter or emulated by the model. This is the audit trail for Output Validation scoring and the starting point for every postmortem.
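One possible shape for that per-line audit record (illustrative; field names are ours, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class DispatchRecord:
    line_no: int   # position within the pseudocode program
    source: str    # the raw line as written by the model
    executor: str  # "interpreter" or "model"
    output: str    # real interpreter output, or the emulated return value

record = DispatchRecord(
    line_no=7,
    source="z = (revenue - mean) / stdev",
    executor="interpreter",
    output="-2.61",
)
```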
Related reading
- Program-of-Thoughts worked example — the pure-computation sibling CoC extends; read it first if you have not, then revisit it once you have the mixed case working.
- Chain-of-Thought Prompting — the qualitative-reasoning baseline CoC builds on.
- Tool use prompting patterns — the loop shape the CoC runtime plugs into.
- The SurePrompts Quality Rubric — scoring dimensions for the full CoC trace.
- Prompting reasoning models guide — how CoC stacks with extended thinking.
- AI prompts for data analysis — task shapes squarely in CoC territory.
- Prompt engineering for developers — patterns for building runtimes the model can call into.
- Advanced prompt engineering techniques — where CoC sits alongside other 2026-era techniques.