
Reflexion Prompting Guide: Verbal Self-Reflection After Failures (2026)

How reflexion prompting works — the agent writes a reflection after each failed attempt, accumulating episodic memory that guides later retries.

SurePrompts Team
April 20, 2026
12 min read

TL;DR

Reflexion adds a memory layer to self-refine: after a failed attempt, the agent writes a short reflection on why it failed, and that text becomes context for the next try. It works on tasks with clear failure signals.

Reflexion is a small idea with an outsized effect: after an attempt fails, the agent writes a short note explaining why it failed, and that note becomes context for the next attempt. Each retry starts from a richer prompt than the last, not a blank one. The memory is verbal — plain text the model reads back — and the signal comes from outside: a failing test, a rejected tool call, a verifier that says "no." When those ingredients are present, reflexion outperforms looping without memory. When they're not, it adds overhead without gain.

What Reflexion Is

Reflexion is a retry pattern with three moves per iteration:

  • Attempt. The agent produces an output or takes an action.
  • Evaluate. An external signal scores the attempt — a test suite runs, a schema validates, a tool returns an error, a human or verifier emits pass/fail.
  • Reflect. If the attempt failed, the agent writes a short, first-person reflection naming the mistake and what to do differently.

The reflection text is then prepended (or appended in a structured memory block) to the next attempt's prompt. Iteration repeats until the task passes or a retry cap trips. The "memory" is nothing more than a growing list of reflections — episodic memory in the loosest sense, persistent only across attempts of the same task.
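The three moves can be sketched as a short loop. This is a minimal, hypothetical skeleton — `solve`, `verify`, and `reflect` stand in for the model call, the external check, and the reflection call:

```python
# Minimal reflexion loop sketch. `solve`, `verify`, and `reflect` are
# hypothetical stand-ins: solve() calls the model with the task plus prior
# reflections, verify() runs the external check, reflect() asks for a lesson.
def reflexion_loop(task, solve, verify, reflect, max_attempts=3):
    reflections = []                       # episodic memory: plain-text lessons
    for _ in range(max_attempts):
        output = solve(task, reflections)  # prompt = task + prior reflections
        passed, signal = verify(output)    # external verdict, e.g. test results
        if passed:
            return output, reflections
        # Reflect only on failure; the note becomes context for the retry.
        reflections.append(reflect(task, output, signal))
    return None, reflections               # cap tripped: caller escalates
```

The harness, not the model, owns the loop: it decides when to stop and what memory the next attempt sees.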

It's related to chain-of-thought prompting in that both surface reasoning in prose, but the timing is different: CoT reasons before the answer, reflexion reasons about an answer that already failed. The reflexion paper (Shinn et al. 2023) introduced the term, but the pattern is simple enough to implement from scratch — you don't need a framework to run it.

Reflexion vs Self-Refine

The two patterns look similar at a glance and get conflated. They are not the same thing.

| Property | Self-refine | Reflexion |
| --- | --- | --- |
| Feedback source | Model's own critique | External pass/fail signal |
| Retries | Single session, critique-revise | Multiple full attempts |
| Memory | None across attempts | Reflection text carried forward |
| Best target | Checkable output in one shot | Tasks with retry-level verification |
| Failure mode | Polishes errors it can't see | Wasted retries without clear signal |

Self-refine prompting runs a critique-revise pass inside one attempt, using no outside information. Reflexion runs multiple full attempts, each one informed by a verbal summary of what went wrong in prior attempts, grounded in an external verdict. Put simply: self-refine polishes an output; reflexion tries again, smarter.

Both can compose. An agent can run self-refine inside each reflexion attempt, using the reflection as the critique rubric for the internal revise step. That's two layers of feedback — internal critique plus accumulated cross-attempt memory — and the cost is more tokens, not more complexity.
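The composition can be sketched like this — a hypothetical outline where `draft`, `critique`, `revise`, `verify`, and `reflect` all stand in for model or verifier calls, and the accumulated reflections double as the inner critique rubric:

```python
# Hypothetical sketch of composing the two patterns: a critique-revise pass
# runs inside each reflexion attempt, seeded by the accumulated reflections.
def attempt_with_self_refine(task, reflections, draft, critique, revise):
    output = draft(task, reflections)
    # Inner self-refine step: the reflections serve as the critique rubric.
    note = critique(output, rubric=reflections)
    return revise(output, note)

def composed_loop(task, fns, verify, reflect, max_attempts=3):
    reflections = []
    for _ in range(max_attempts):
        output = attempt_with_self_refine(task, reflections, *fns)
        passed, signal = verify(output)   # external verdict still decides
        if passed:
            return output
        reflections.append(reflect(task, output, signal))
    return None
```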

When Reflexion Wins

Reflexion pays off where three things hold at once: retries are possible, failures are detectable from outside the model, and the error is the kind you can describe in a sentence. Tasks that fit:

  • Code against a test suite. The suite runs, a test fails with a traceback, the agent reflects ("I assumed the input was a list, but the failing test passes a generator"), and the next attempt handles generators. Test output grounds the reflection in something concrete.
  • Structured output with validation. A schema validator rejects JSON for a missing field. The reflection names the missing field, the next attempt includes it. Repeat until valid.
  • Tool-using agents with retry budgets. A tool call returns an error — wrong arguments, missing auth, rate-limit signal. The reflection captures the tool's complaint and the next attempt adjusts. This is a natural fit for ReAct prompting pipelines, which already produce trajectories ripe for reflection.
  • Puzzles and tasks with scoring. Anything with a deterministic scorer — game levels, algorithmic challenges, math with verified answers — gives reflexion the ground truth it needs.
  • Regression debugging. An agent tries to fix a bug, the test still fails, the trace points at a new place. Each reflection carries forward the diagnosis so the agent stops re-proposing the fix that already failed. See agent debugging prompts for the full retry-and-diagnose shape.

The common element: the environment, not the model, decides whether an attempt passed. Reflection converts the environment's verdict into something the next prompt can act on.
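For the structured-output case, the external verdict can be as small as a required-field check. A toy sketch — the schema, field names, and reflection wording are invented for illustration:

```python
# Hypothetical schema check supplying the external signal for a reflexion
# loop over structured output. REQUIRED and the records are invented.
REQUIRED = {"name", "email", "age"}

def validate(record):
    missing = REQUIRED - record.keys()
    if missing:
        return False, f"missing required field(s): {sorted(missing)}"
    return True, "ok"

def reflect_on(signal):
    # A real loop would ask the model; here the reflection just names the gap.
    return f"Last attempt failed: {signal}. Include every required field next time."
```

The validator's message names the exact field, so the reflection carries a concrete, actionable lesson rather than a vague "the JSON was wrong."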

When Reflexion Loses

Reflexion stalls when there is no reliable external signal or when retries are expensive relative to the first pass:

  • Pure generation with no ground truth. Writing an essay, drafting a brand tagline, summarizing an open-ended document. Nothing says "failed," so there's nothing to reflect on. Reflexion here degenerates into vague self-critique, which is self-refine with extra overhead.
  • Single-shot tasks where retries aren't allowed. If the system has to return an answer immediately with no retry budget, reflexion adds cost without getting to use the memory.
  • Noisy verifiers. If the failure signal is flaky — intermittent test failures, environments with side effects — reflections start encoding the noise. "Last time the test failed because of network timeout" becomes a distractor for the next attempt instead of a lesson.
  • Very short tasks. A one-line transformation that succeeds on the first attempt ~95% of the time doesn't benefit from a reflection scaffold. The retry budget is wasted on the other 5%.
  • Tasks where the model lacks the knowledge, period. Reflection doesn't fix a knowledge gap. If the model can't solve the task because it doesn't know the domain, three reflections still won't. Use retrieval or a better model.

The tell for a failing reflexion loop: reflections repeat across attempts — the agent writes similar regret in iteration 2 as in iteration 1 — without the attempt actually changing. Memory is there but isn't driving behavior.
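A harness can watch for that tell automatically. A crude sketch — word-overlap between consecutive reflections, with an illustrative threshold; a real check might use embeddings instead:

```python
# Hypothetical detector for the stalled-loop tell: flag when the two most
# recent reflections are near-duplicates. Crude word-overlap (Jaccard);
# the 0.8 threshold is illustrative, not tuned.
def reflections_repeating(reflections, threshold=0.8):
    if len(reflections) < 2:
        return False
    a = set(reflections[-2].lower().split())
    b = set(reflections[-1].lower().split())
    overlap = len(a & b) / max(len(a | b), 1)
    return overlap >= threshold
```

When this fires, more retries with the same setup rarely help — it's a signal to enrich the memory or escalate.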

Anatomy of a Reflexion Prompt

A working reflexion loop has three prompt templates, each with a specific job:

  • Attempt prompt. Task statement, plus a memory block with prior reflections if any exist. The block is explicitly labeled so the model knows it's lessons from past failures, not part of the task itself.
  • Reflection prompt. Shown after a failed attempt. Supplies the task, the attempt, the external failure signal (traceback, validator error, verdict), and asks for a short first-person reflection — not a new attempt. Keep it to two or three sentences. One-paragraph reflections degrade into vague self-critique; one-sentence reflections carry enough signal without burning context.
  • Retry prompt. Structurally identical to the attempt prompt, but with the new reflection appended to the memory block. The model never sees its own prior output in the retry (just the reflection), which keeps context small and discourages copy-pasting broken code.

The reflection prompt is where the loop lives or dies. "Reflect on what went wrong" is too loose. Better: "In 2–3 sentences, name the specific mistake in the attempt above, and state one concrete change for the next attempt." Specificity in, specificity out.
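The three templates can be plain string builders. A sketch with the wording from above — the function names and exact phrasing are illustrative, not a fixed API:

```python
# Sketch of the three reflexion templates as plain functions. The retry
# prompt is just attempt_prompt() with a grown reflection list -- the model
# never sees its own prior output, only the distilled lessons.
def attempt_prompt(task, reflections):
    memory = "\n".join(f"- {r}" for r in reflections) or "(none yet)"
    return (
        f"Task: {task}\n\n"
        "Memory from prior attempts (lessons from past failures, "
        f"not part of the task):\n{memory}\n\n"
        "Output: your solution."
    )

def reflection_prompt(task, attempt, signal):
    return (
        f"Task: {task}\n"
        f"Previous attempt:\n{attempt}\n"
        f"Failure signal:\n{signal}\n\n"
        "In 2-3 sentences, name the specific mistake in the attempt above "
        "and state one concrete change for the next attempt. Do not write code."
    )
```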

Episodic Memory Design

What should carry across attempts? Three choices, with tradeoffs:

  • Reflection text only. Cheapest. The model only sees distilled lessons, not raw failures. Works when the reflection prompt produces high-signal text. Risk: important detail gets abstracted away.
  • Reflection + condensed trajectory. Reflection plus a one-line summary of what the attempt did ("tried list comprehension, failed on empty input"). More tokens, more grounding. Good for tool-using agents where the action mattered as much as the outcome.
  • Reflection + specific error evidence. Reflection plus the exact error message or failing-test output. Most grounded, most verbose. Right when the error itself is the key signal (tracebacks, validator messages).

For most tasks, reflection text only is enough. Upgrade to richer memory when you see the loop repeating mistakes — that means the reflection lost information the trajectory or error held.

Cap memory size. Past 3–5 reflections, older entries are usually superseded by newer ones. Either truncate (keep the last N) or summarize (ask the model to compress the list when it exceeds a threshold). Unbounded memory is a slow context-window leak.
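Both capping strategies are a few lines each. A sketch — `MAX_REFLECTIONS` and the token budget are illustrative numbers, and `summarize` stands in for a model call:

```python
# Two ways to cap episodic memory. Numbers are illustrative, not tuned.
MAX_REFLECTIONS = 5

def truncate_memory(reflections):
    # Keep only the most recent N lessons; older ones are usually superseded.
    return reflections[-MAX_REFLECTIONS:]

def maybe_summarize(reflections, summarize, word_budget=500):
    # `summarize` would be a model call compressing the list into one note.
    joined = "\n".join(reflections)
    if len(joined.split()) > word_budget:
        return [summarize(joined)]
    return reflections
```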

Termination

Reflexion loops must have a stop condition. Three patterns, in order of preference:

  • Task passes verifier. The external signal reports success — tests pass, the schema validates, the tool call returns clean. Terminate with success. This is the reason the pattern exists.
  • Retry cap. A hard maximum (k=3 or k=5) on attempts regardless of outcome. Mandatory. Without it a broken loop retries forever.
  • Escalation. If the retry cap is hit without success, don't just fail — escalate. Hand off to a human, switch to a larger model, surface the full reflection history as a diagnostic. The accumulated reflections are often the most useful thing the agent produced, even in failure.

Avoid combining reflexion with a "try harder" prompt that nudges k upward when things aren't working. More attempts with the same model on an unsolvable task is just more cost. Fail fast, escalate cleanly.

A Reflexion Loop — Hypothetical Example

An illustrative three-prompt sequence for a code-against-tests task. Hypothetical, meant to show the shape — the signals, tests, and outputs are invented.

# Attempt 1 (no prior reflections)

Task: Implement a function `flatten(seq)` that yields every non-iterable
leaf from a nested iterable. Pass the provided test suite.

Memory from prior attempts: (none yet — this is the first attempt.)

Output: your implementation as a single Python function.

---

# External signal after Attempt 1

Test run result:
  FAIL test_flatten_strings
    AssertionError: expected ['abc','def'], got ['a','b','c','d','e','f']
    input: ['abc', 'def']

(Strings are iterable but should be treated as leaves.)

---

# Reflection prompt

Below is your previous attempt and the test failure. In 2–3 sentences,
name the specific mistake and state one concrete change for the next
attempt. Do not write code here.

Previous attempt: <code from Attempt 1>
Failure: <test output above>

Reflection:

---

# Attempt 2 (reflection in memory block)

Task: <same as before>

Memory from prior attempts:
  - Attempt 1 treated strings as iterables and flattened their characters.
    Next attempt must treat `str` (and `bytes`) as leaves, not sequences to
    descend into.

Output: your implementation as a single Python function.

The shape repeats. The task stays constant, the memory block grows, the model's context each turn is task + distilled lessons — never the full mess of prior code. Wrap this in a harness that runs the test suite, calls the reflection step only on failure, and caps at k=3.

Common Anti-Patterns

  • Reflecting without a real failure signal. "Reflect on your output and try again" with no ground truth turns reflexion into self-refine with extra steps. Fix: require an external verdict; if there isn't one, use self-refine.
  • Over-long reflections. Paragraph-length reflections drift into generic self-criticism and crowd out context. Fix: cap reflections at 2–3 sentences; demand specificity.
  • Reflections that don't drive behavior. The agent writes a reflection, then produces an attempt structurally identical to the previous one. Fix: make the reflection prompt require one concrete change, and verify the retry prompt actually surfaces the reflection prominently.
  • Unbounded retry budgets. No cap means a stuck task consumes tokens forever. Fix: hard cap at k=3 or k=5 with explicit escalation on exhaustion.
  • Including the failed output verbatim in the retry. The model copies the broken code instead of rewriting. Fix: retry prompt sees the reflection, not the previous output.
  • Unbounded memory. After many retries the memory block dominates the prompt. Fix: keep the last N reflections or summarize when the block exceeds a token threshold.

FAQ

Does reflexion need an external verifier?

Effectively yes. Without a real signal that an attempt failed, the reflection step has nothing to ground on and you're doing self-refine under a different name. The verifier can be anything that produces pass/fail — tests, schema validators, tool errors, human review — but it has to exist. If you can't verify, don't reflect.

How long should a reflection be?

Two or three sentences is the sweet spot in practice. One sentence often loses nuance; a paragraph drifts into vague self-criticism and starts to overwhelm the memory block. Make the reflection prompt enforce the length explicitly.

Can I use a different model for the reflection step?

Yes, and sometimes it helps. A stronger model reflecting on a weaker model's failures can produce more actionable lessons than the weaker model reflecting on itself. The cost-benefit depends on task value — for cheap tasks, stick with one model; for expensive failures, a stronger critic is worth it.

How many retries is too many?

Past k=5, diminishing returns set in fast, and reflections start repeating themselves — the agent writes similar lessons in attempt 5 as in attempt 3. If the task hasn't passed by then, more retries rarely fix it. Escalate: different model, human review, or declare failure with the accumulated reflection history attached.

Does reflexion compose with self-refine?

Yes. A reflexion loop can run self-refine inside each attempt — the reflection becomes a strong seed for the critique step. The extra cost is another model call per attempt; the win is two feedback channels (internal critique plus cross-attempt memory) working together.

Wrap-Up

Reflexion is the pattern to reach for when you have retries and an external signal. The memory is verbal and lightweight, the failures teach the loop directly, and termination is natural because the verifier either passes or the retry budget runs out. Keep reflections short, cap retries, don't inflict the pattern on tasks without ground truth. For the larger pattern landscape see the complete guide to prompting AI coding agents; for the simpler inner loop see self-refine prompting; for tool-using agents where reflexion slots in naturally see ReAct prompting; for the retry-and-diagnose shape applied to bug fixing see agent debugging prompts.

Try it yourself

Build expert-level prompts from plain English with SurePrompts — 350+ templates with real-time preview.

Open Prompt Builder
