Tip
TL;DR: Chain-of-Code (CoC) prompting has the model write pseudocode that mixes executable Python with natural-language "pseudo-functions" the interpreter cannot run. The interpreter executes the code lines exactly; the model emulates the non-code lines by predicting what their output would be. You get the arithmetic guarantees of Program-of-Thoughts plus the flexibility of Chain-of-Thought — on tasks where part of the work is numeric and part is qualitative.
Key takeaways:
- Chain-of-Code bridges PoT and CoT. PoT handles computable tasks cleanly; CoT handles qualitative tasks cleanly. CoC is the mixed-workload primitive.
- From Li et al. (2023), "Chain of Code: Reasoning with a Language Model-Augmented Code Emulator." The framing treats the LLM as a second executor alongside the interpreter.
- Executable lines run exactly; non-executable lines are emulated by the model. The model plays interpreter when the interpreter cannot.
- The dispatch contract matters. Lines must be unambiguous about their executor, or the system silently hallucinates arithmetic that should have been computed.
- Score the full flow with the SurePrompts Quality Rubric; CoC improves Output Validation on the code lines and Instruction Specificity on the whole trace.
Where CoT and PoT leave gaps
Chain-of-Thought and Program-of-Thoughts have complementary ceilings. CoT reasons in natural language but is an unreliable calculator — "multiply 7.57 by 1.25 and round" gets the digits wrong because it is predicting tokens, not computing. PoT fixes that with a real interpreter, but is brittle on steps that cannot be coded: "classify the customer's emotional tone," "identify the likely root cause of this outage." Forcing a stand-in that returns "unknown" makes the code run while hiding that the hard part of the reasoning never happened.
Most real analytical tasks live in the middle: compute some quantities, then interpret them. CoT loses on the arithmetic; PoT loses on the interpretation. Practitioners hack around this by stitching the two with a third prompt — glue that is a source of bugs and drift.
Chain-of-Code proposes a cleaner primitive: one pseudocode trace where code lines are really executed and non-code lines are emulated by the model. The runtime dispatches each line to the right executor, producing a single coherent program that spans both modes.
The pattern
The CoC prompt skeleton:
```
You will be given a task that mixes computation with qualitative reasoning.

Write a single pseudocode program. It may contain:
(A) Standard Python lines — executed by a real interpreter.
(B) Natural-language pseudo-function calls — evaluated by YOU based on the
    function name, the inputs, and the context.

Name pseudo-functions with an `emulated_` prefix (e.g.,
`emulated_classify_tone(sentence)`) so the runtime can route them to you
instead of the interpreter.

The runtime will execute the program line by line, feeding interpreter and
emulated output back into the variable scope. You receive the full trace and
produce the final answer.
```
Four moves: the model writes the mixed program, the runtime partitions lines by type, each line runs on the correct executor feeding a shared scope, and the final answer reassembles from the trace. The emulated_ prefix is the contract — without it the runtime will either execute nonsense as Python or hand arithmetic to the model.
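The routing side of that contract can be sketched in a few lines. A minimal statement-per-line dispatcher (hypothetical names throughout; `call_model` is a stand-in for a real LLM invocation, and block statements such as loops would need real parsing rather than line-by-line `exec`):

```python
import re

# An emulated line is any line that calls an emulated_* pseudo-function.
EMULATED_CALL = re.compile(r"\bemulated_\w+\s*\(")

def call_model(line: str, scope: dict) -> str:
    # Placeholder for the LLM sub-call; a real runtime would send the line
    # plus the relevant scope to the model and return its predicted value.
    return f"<emulated: {line.strip()}>"

def run_coc(program: str) -> dict:
    """Route each line: emulated_* calls go to the model, the rest to exec()."""
    scope: dict = {}
    for line in program.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        if EMULATED_CALL.search(line):
            # Model executes: bind the emulated value to the assignment target.
            target = line.split("=", 1)[0].strip()
            scope[target] = call_model(line, scope)
        else:
            # Interpreter executes: real Python against the shared scope.
            exec(line, {}, scope)
    return scope

scope = run_coc("x = 7.57 * 1.25\ntone = emulated_classify_tone(x)")
```

The key property is the shared `scope`: interpreter results and emulated values land in the same namespace, so later lines can use either without knowing which executor produced them.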
Worked example
The starting problem
Given monthly revenue for 2024 and 2025 (24 data points), detect anomalies using a z-score threshold of ±2, and for each anomaly provide a likely cause and a recommended next action. Return a structured summary.
Hypothetical data, not a specific company. The anomaly detection is a clean computation, the cause explanation is qualitative judgment, and the recommended action combines the two. Pure Program-of-Thoughts would have to mock the cause step; pure Chain-of-Thought would mis-round the standard deviation. CoC routes each step to its natural executor.
The CoC pseudocode
The model's first turn produces a program like this. The emulated_ prefix marks lines the model will evaluate itself; everything else is real Python.
```python
import statistics

monthly_revenue = {
    "2024-01": 412_000, "2024-02": 398_000, "2024-03": 425_000,
    "2024-04": 441_000, "2024-05": 418_000, "2024-06": 432_000,
    "2024-07": 285_000, "2024-08": 447_000, "2024-09": 452_000,
    "2024-10": 468_000, "2024-11": 471_000, "2024-12": 503_000,
    "2025-01": 495_000, "2025-02": 489_000, "2025-03": 512_000,
    "2025-04": 521_000, "2025-05": 508_000, "2025-06": 534_000,
    "2025-07": 541_000, "2025-08": 538_000, "2025-09": 672_000,
    "2025-10": 552_000, "2025-11": 561_000, "2025-12": 588_000,
}

values = list(monthly_revenue.values())
mean = statistics.mean(values)
stdev = statistics.stdev(values)

anomalies = []
for month, revenue in monthly_revenue.items():
    z = (revenue - mean) / stdev
    if abs(z) >= 2:
        anomalies.append({"month": month, "revenue": revenue, "z": round(z, 2)})

# Emulated steps: the interpreter cannot explain causes or recommend actions.
enriched = []
for a in anomalies:
    cause = emulated_likely_cause(
        month=a["month"], revenue=a["revenue"], z=a["z"], mean=mean
    )
    action = emulated_recommended_action(month=a["month"], cause=cause, z=a["z"])
    enriched.append({**a, "cause": cause, "action": action})

summary = emulated_structured_summary(
    anomalies=enriched, mean=mean, stdev=stdev, horizon="2024-2025"
)
print(summary)
```
The execution trace
The runtime walks the program line by line. The imports, the dict, and the loop body are standard Python — the interpreter executes them and produces real values:
mean = 485,958.33
stdev = 76,871.14
anomalies detected (|z| >= 2):
- {"month": "2024-07", "revenue": 285,000, "z": -2.61}
- {"month": "2025-09", "revenue": 672,000, "z": 2.42}
(Those numbers are computed by the interpreter against the dataset above. They would be reproducible by anyone running the same code on the same data.)
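A standalone recomputation of the interpreter-side values (same 24-month dataset, stdlib only):

```python
import statistics

# The 24 monthly revenue figures from the program above, in order.
values = [
    412_000, 398_000, 425_000, 441_000, 418_000, 432_000,
    285_000, 447_000, 452_000, 468_000, 471_000, 503_000,
    495_000, 489_000, 512_000, 521_000, 508_000, 534_000,
    541_000, 538_000, 672_000, 552_000, 561_000, 588_000,
]
mean = statistics.mean(values)    # 485958.33 (rounded to 2 places)
stdev = statistics.stdev(values)  # 76871.14 (sample stdev, rounded)
z_jul24 = round((285_000 - mean) / stdev, 2)  # -2.61
z_sep25 = round((672_000 - mean) / stdev, 2)  # 2.42
```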
When the loop hits emulated_likely_cause(...), the runtime pauses, sends the call as a sub-prompt to the model, and the model responds. For the July 2024 anomaly:
emulated_likely_cause(month="2024-07", revenue=285000, z=-2.61, mean=485958.33) → "Sharp contraction of ~41% below the mean. Likely causes: (1) a large deal slipping out of the quarter, (2) a seasonal dip compounded by customer churn, (3) a billing incident delaying recognition. (1) is the most common explanation for a single-month negative z of this magnitude."
For September 2025:
emulated_likely_cause(month="2025-09", ...) → "Revenue spike of ~38% above the mean. Likely causes: (1) a large expansion deal booked in the month, (2) a campaign or product launch landing mid-quarter, (3) a one-time consulting engagement. Given the steady 2025 uptrend, (1) is the most common cause."
The emulated_recommended_action(...) calls produce short, task-aware responses. Finally emulated_structured_summary(...) composes the final text, which the interpreter prints.
The assembled answer
The runtime returns the full interleaved trace. Final user-facing output:
> Anomaly summary for 2024-2025. Two months fall outside the ±2 z-score threshold (mean \$486.0K, σ \$76.9K).
>
> - 2024-07 — \$285K, z = −2.61. Likely cause: large deal slipped out of the quarter (most probable), with seasonal dip and billing incident as alternates. Action: pull the July pipeline report and compare closed-won against the Q2 forecast.
> - 2025-09 — \$672K, z = +2.42. Likely cause: a large expansion deal or multi-year contract booked in the month. Action: confirm whether the revenue is recurring vs. one-time before adjusting the 2026 baseline.
Every number was computed by the interpreter; every explanation was emulated by the model. The division of labor is visible and auditable on both sides.
Scoring with the SurePrompts Quality Rubric
Against the SurePrompts Quality Rubric, CoC moves two dimensions relative to plain CoT:
- Output Validation: 2 → 5 on the code lines. The z-scores and anomaly list are deterministic functions of the dataset; a regression test pins them. Emulated lines stay at CoT's level — you have not made interpretation testable, only ringfenced it.
- Instruction Specificity: 3 → 5. The program is the specification: inputs, thresholds, emulation points. Ambiguity is pushed to named `emulated_*` functions, where it is at least visible.
- Constraint Tightness: 3 → 4. Pseudo-function signatures limit what emulation covers. The model cannot smuggle arithmetic into an emulation; the computations are already in Python above it.
How the "simulated" parts work
When the runtime hits emulated_likely_cause(...), it does not throw NameError. It dispatches the call to a model invocation with a sub-prompt:
```
You are acting as an interpreter for a pseudo-function call.

Function name: emulated_likely_cause
Inputs: month="2024-07", revenue=285000, z=-2.61, mean=485958.33

Produce the return value. Infer the expected type from the function name.
Return only the value, no preamble.
```
The model responds with a value; the runtime binds it to the variable on the left of the assignment; execution continues. This is the core divergence from PoT — in PoT the handoff is clean and one-way (model writes code, interpreter runs it); in CoC the handoff is interleaved and bidirectional (interpreter runs what it can, model runs what it cannot, both feed into the same scope).
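Constructing that sub-prompt is mechanical. A sketch of the helper a runtime might use (an illustrative function, not an API from the paper):

```python
def emulation_subprompt(fn_name: str, kwargs: dict) -> str:
    """Build the model-facing sub-prompt for one emulated_* call."""
    inputs = ", ".join(f"{k}={v!r}" for k, v in kwargs.items())
    return (
        "You are acting as an interpreter for a pseudo-function call.\n"
        f"Function name: {fn_name}\n"
        f"Inputs: {inputs}\n"
        "Produce the return value. Infer the expected type from the function name.\n"
        "Return only the value, no preamble."
    )

prompt = emulation_subprompt(
    "emulated_likely_cause",
    {"month": "2024-07", "revenue": 285000, "z": -2.61, "mean": 485958.33},
)
```

Passing the arguments by `repr` keeps the model's view of the inputs unambiguous, which matters when the return value gets bound back into the program's scope.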
Li et al. (2023) describe this formally: the LLM is an "emulator" that stands in for the interpreter on steps it cannot handle. The engineering consequence is a runtime to build — or extend, if you already run a code interpreter tool-use framework. Reasoning models benefit too: extended thinking helps plan the dispatch boundary but does not replace the interpreter.
When to pick CoC vs PoT vs CoT
| Dimension | CoT | PoT | CoC |
|---|---|---|---|
| Best on | Conceptual reasoning | Numerical computation | Compute + interpret |
| Execution | Token prediction | Interpreter | Interpreter + model emulator |
| Runtime cost | Low | Medium (sandbox) | Medium-high (sandbox + extra calls) |
| Arithmetic guarantee | Weak | Strong | Strong on code lines only |
| Interpretation guarantee | Weak | Not supported | Weak (CoT-level on emulated lines) |
| Failure if misused | Wrong numbers | Correct code, wrong formula | Dispatch boundary blurred |
Rule of thumb: no numbers → CoT; pure computation → PoT; mixed with non-codeable interpretation → CoC.
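The rule of thumb is mechanical enough to encode (a trivial sketch; the two predicates stand for whatever task triage you already run):

```python
def pick_pattern(needs_computation: bool, needs_interpretation: bool) -> str:
    """Route a task to CoT, PoT, or CoC per the rule of thumb above."""
    if needs_computation and needs_interpretation:
        return "CoC"   # mixed workload: compute, then interpret
    if needs_computation:
        return "PoT"   # pure computation: interpreter only
    return "CoT"       # no numbers: natural-language reasoning only
```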
Failure modes
CoC is not free. The failure modes cluster around the dispatch boundary:
Model skips the real execution. The model narrates outputs of the real code lines instead of letting the interpreter compute them — mean = 485,958.33 printed as a token prediction, not an interpreter result. Mitigation: the runtime, not the model, executes code lines. The model stops at code; the runtime inserts the real output; the model continues.
Hallucinated execution results. Adjacent failure — the model invents interpreter output for code lines. The most dangerous mode because invented numbers are plausible. Mitigation: log actual interpreter output alongside the model-visible trace; if the two diverge on a code line, the model hallucinated.
Mixed-format confusion. Missing emulated_ prefix on a qualitative step, or real code written in pseudo-function form. The runtime either raises NameError or emulates something that should have been computed. Mitigation: enforce the prefix contract; verify each emulated_* line is a call and every non-emulated_ line is valid Python.
Emulation leakage. The model writes emulated_compute_growth(base, rate) and "emulates" it by predicting a number — the exact LLM-as-calculator failure CoC was meant to avoid. Mitigation: reject pseudo-functions whose signatures look arithmetic (compute_*, calculate_*, sum_*) and instruct the model to use real Python for calculator-shaped work.
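The last two mitigations are cheap static checks. A sketch of a lint pass over one pseudocode line (hypothetical checks; extend the calculator-shaped name list to taste):

```python
import ast
import re

# emulated_* names that suggest computable work (the leakage failure above).
ARITHMETIC_SHAPED = re.compile(r"\bemulated_(compute|calculate|sum|count|total)\w*\s*\(")

def lint_coc_line(line: str) -> list[str]:
    """Return contract violations for one line of a CoC program."""
    problems = []
    if ARITHMETIC_SHAPED.search(line):
        problems.append("emulation leakage: calculator-shaped pseudo-function")
    if "emulated_" not in line:
        # Non-emulated lines must be valid Python (mixed-format check).
        try:
            ast.parse(line)
        except SyntaxError:
            problems.append("mixed-format confusion: not Python, not an emulated_ call")
    return problems
```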
Our position
- Chain-of-Code is the right default for mixed analytical tasks. Data analysis with commentary, experiment summaries with interpretation, diagnostic triage with root-cause reasoning — the tasks teams most often build awkward multi-prompt stitching for.
- The runtime is the hard part, not the prompt. CoC depends on a dispatcher that routes lines to the right executor, logs both sides, and surfaces boundary violations. See Tool use prompting patterns for the loop shape to extend.
- Start from a working Program-of-Thoughts runtime. Teams building from scratch underestimate the sandboxing work. Add an emulation dispatcher to an existing PoT stack and you are most of the way there.
- Do not force CoC on tasks that do not mix modes. Pure reasoning stays in CoT; pure computation stays in PoT. CoC earns its complexity specifically on mixed workloads.
- Log every dispatch decision. For each line, record whether it was executed by the interpreter or emulated by the model. This is the audit trail for Output Validation scoring and the starting point for every postmortem.
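One possible shape for that per-line audit record (illustrative; field names are ours, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class DispatchRecord:
    line_no: int   # position within the pseudocode program
    source: str    # the raw line as written by the model
    executor: str  # "interpreter" or "model"
    output: str    # real interpreter output, or the emulated return value

record = DispatchRecord(
    line_no=7,
    source="z = (revenue - mean) / stdev",
    executor="interpreter",
    output="-2.61",
)
```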
Related reading
- Program-of-Thoughts worked example — the pure-computation sibling CoC extends; read it first if you have not, then revisit it once you have the mixed case working.
- Chain-of-Thought Prompting — the qualitative-reasoning baseline CoC builds on.
- Tool use prompting patterns — the loop shape the CoC runtime plugs into.
- The SurePrompts Quality Rubric — scoring dimensions for the full CoC trace.
- Prompting reasoning models guide — how CoC stacks with extended thinking.
- AI prompts for data analysis — task shapes squarely in CoC territory.
- Prompt engineering for developers — patterns for building runtimes the model can call into.
- Advanced prompt engineering techniques — where CoC sits alongside other 2026-era techniques.