
Reflexion Prompting Guide: Verbal Self-Reflection After Failures (2026)

How reflexion prompting works — the agent writes a reflection after each failed attempt, accumulating episodic memory that guides later retries.

SurePrompts Team
April 20, 2026
12 min read

TL;DR

Reflexion adds a memory layer to self-refine: after a failed attempt, the agent writes a short reflection on why it failed, and that text becomes context for the next try. It works on tasks with clear failure signals.

Reflexion is a small idea with an outsized effect: after an attempt fails, the agent writes a short note explaining why it failed, and that note becomes context for the next attempt. Each retry starts from a richer prompt than the last, not a blank one. The memory is verbal — plain text the model reads back — and the signal comes from outside: a failing test, a rejected tool call, a verifier that says "no." When those ingredients are present, reflexion outperforms looping without memory. When they're not, it adds overhead without gain.

What Reflexion Is

Reflexion is a retry pattern with three moves per iteration:

  • Attempt. The agent produces an output or takes an action.
  • Evaluate. An external signal scores the attempt — a test suite runs, a schema validates, a tool returns an error, a human or verifier emits pass/fail.
  • Reflect. If the attempt failed, the agent writes a short, first-person reflection naming the mistake and what to do differently.

The reflection text is then prepended (or appended in a structured memory block) to the next attempt's prompt. Iteration repeats until the task passes or a retry cap trips. The "memory" is nothing more than a growing list of reflections — episodic memory in the loosest sense, persistent only across attempts of the same task.
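The three moves can be sketched as a short loop. This is a minimal, hypothetical skeleton — `solve`, `verify`, and `reflect` stand in for the model call, the external check, and the reflection call:

```python
# Minimal reflexion loop sketch. `solve`, `verify`, and `reflect` are
# hypothetical stand-ins: solve() calls the model with the task plus prior
# reflections, verify() runs the external check, reflect() asks for a lesson.
def reflexion_loop(task, solve, verify, reflect, max_attempts=3):
    reflections = []                       # episodic memory: plain-text lessons
    for _ in range(max_attempts):
        output = solve(task, reflections)  # prompt = task + prior reflections
        passed, signal = verify(output)    # external verdict, e.g. test results
        if passed:
            return output, reflections
        # Reflect only on failure; the note becomes context for the retry.
        reflections.append(reflect(task, output, signal))
    return None, reflections               # cap tripped: caller escalates
```

The harness, not the model, owns the loop: it decides when to stop and what memory the next attempt sees.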

It's related to chain-of-thought prompting in that both surface reasoning in prose, but the timing is different: CoT reasons before the answer, reflexion reasons about an answer that already failed. The reflexion paper (Shinn et al. 2023) introduced the term, but the pattern is simple enough to implement from scratch — you don't need a framework to run it.

Reflexion vs Self-Refine

The two patterns look similar at a glance and get conflated. They are not the same thing.

| Property | Self-refine | Reflexion |
| --- | --- | --- |
| Feedback source | Model's own critique | External pass/fail signal |
| Retries | Single session, critique-revise | Multiple full attempts |
| Memory | None across attempts | Reflection text carried forward |
| Best target | Checkable output in one shot | Tasks with retry-level verification |
| Failure mode | Polishes errors it can't see | Wasted retries without clear signal |

Self-refine prompting runs a critique-revise pass inside one attempt, using no outside information. Reflexion runs multiple full attempts, each one informed by a verbal summary of what went wrong in prior attempts, grounded in an external verdict. Put simply: self-refine polishes an output; reflexion tries again, smarter.

Both can compose. An agent can run self-refine inside each reflexion attempt, using the reflection as the critique rubric for the internal revise step. That's two layers of feedback — internal critique plus accumulated cross-attempt memory — and the cost is more tokens, not more complexity.
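The composition can be sketched like this — a hypothetical outline where `draft`, `critique`, `revise`, `verify`, and `reflect` all stand in for model or verifier calls, and the accumulated reflections double as the inner critique rubric:

```python
# Hypothetical sketch of composing the two patterns: a critique-revise pass
# runs inside each reflexion attempt, seeded by the accumulated reflections.
def attempt_with_self_refine(task, reflections, draft, critique, revise):
    output = draft(task, reflections)
    # Inner self-refine step: the reflections serve as the critique rubric.
    note = critique(output, rubric=reflections)
    return revise(output, note)

def composed_loop(task, fns, verify, reflect, max_attempts=3):
    reflections = []
    for _ in range(max_attempts):
        output = attempt_with_self_refine(task, reflections, *fns)
        passed, signal = verify(output)   # external verdict still decides
        if passed:
            return output
        reflections.append(reflect(task, output, signal))
    return None
```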

When Reflexion Wins

Reflexion pays off where three things hold at once: retries are possible, failures are detectable from outside the model, and the error is the kind you can describe in a sentence. Tasks that fit:

  • Code against a test suite. The suite runs, a test fails with a traceback, the agent reflects ("I assumed the input was a list, but the failing test passes a generator"), and the next attempt handles generators. Test output grounds the reflection in something concrete.
  • Structured output with validation. A schema validator rejects JSON for a missing field. The reflection names the missing field, the next attempt includes it. Repeat until valid.
  • Tool-using agents with retry budgets. A tool call returns an error — wrong arguments, missing auth, rate-limit signal. The reflection captures the tool's complaint and the next attempt adjusts. This is a natural fit for ReAct prompting pipelines, which already produce trajectories ripe for reflection.
  • Puzzles and tasks with scoring. Anything with a deterministic scorer — game levels, algorithmic challenges, math with verified answers — gives reflexion the ground truth it needs.
  • Regression debugging. An agent tries to fix a bug, the test still fails, the trace points at a new place. Each reflection carries forward the diagnosis so the agent stops re-proposing the fix that already failed. See agent debugging prompts for the full retry-and-diagnose shape.

The common element: the environment, not the model, decides whether an attempt passed. Reflection converts the environment's verdict into something the next prompt can act on.
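For the structured-output case, the external verdict can be as small as a required-field check. A toy sketch — the schema, field names, and reflection wording are invented for illustration:

```python
# Hypothetical schema check supplying the external signal for a reflexion
# loop over structured output. REQUIRED and the records are invented.
REQUIRED = {"name", "email", "age"}

def validate(record):
    missing = REQUIRED - record.keys()
    if missing:
        return False, f"missing required field(s): {sorted(missing)}"
    return True, "ok"

def reflect_on(signal):
    # A real loop would ask the model; here the reflection just names the gap.
    return f"Last attempt failed: {signal}. Include every required field next time."
```

The validator's message names the exact field, so the reflection carries a concrete, actionable lesson rather than a vague "the JSON was wrong."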

When Reflexion Loses

Reflexion stalls when there is no reliable external signal or when retries are expensive relative to the first pass:

  • Pure generation with no ground truth. Writing an essay, drafting a brand tagline, summarizing an open-ended document. Nothing says "failed," so there's nothing to reflect on. Reflexion here degenerates into vague self-critique, which is self-refine with extra overhead.
  • Single-shot tasks where retries aren't allowed. If the system has to return an answer immediately with no retry budget, reflexion adds cost without getting to use the memory.
  • Noisy verifiers. If the failure signal is flaky — intermittent test failures, environments with side effects — reflections start encoding the noise. "Last time the test failed because of network timeout" becomes a distractor for the next attempt instead of a lesson.
  • Very short tasks. A one-line transformation that succeeds on the first attempt ~95% of the time doesn't benefit from a reflection scaffold. The retry budget is wasted on the other 5%.
  • Tasks where the model lacks the knowledge, period. Reflection doesn't fix a knowledge gap. If the model can't solve the task because it doesn't know the domain, three reflections still won't. Use retrieval or a better model.

The tell for a failing reflexion loop: reflections repeat across attempts — the agent writes similar regret in iteration 2 as in iteration 1 — without the attempt actually changing. Memory is there but isn't driving behavior.
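A harness can watch for that tell automatically. A crude sketch — word-overlap between consecutive reflections, with an illustrative threshold; a real check might use embeddings instead:

```python
# Hypothetical detector for the stalled-loop tell: flag when the two most
# recent reflections are near-duplicates. Crude word-overlap (Jaccard);
# the 0.8 threshold is illustrative, not tuned.
def reflections_repeating(reflections, threshold=0.8):
    if len(reflections) < 2:
        return False
    a = set(reflections[-2].lower().split())
    b = set(reflections[-1].lower().split())
    overlap = len(a & b) / max(len(a | b), 1)
    return overlap >= threshold
```

When this fires, more retries with the same setup rarely help — it's a signal to enrich the memory or escalate.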

Anatomy of a Reflexion Prompt

A working reflexion loop has three prompt templates, each with a specific job:

  • Attempt prompt. Task statement, plus a memory block with prior reflections if any exist. The block is explicitly labeled so the model knows it's lessons from past failures, not part of the task itself.
  • Reflection prompt. Shown after a failed attempt. Supplies the task, the attempt, the external failure signal (traceback, validator error, verdict), and asks for a short first-person reflection — not a new attempt. Keep it to two or three sentences. One-paragraph reflections degrade into vague self-critique; one-sentence reflections carry enough signal without burning context.
  • Retry prompt. Structurally identical to the attempt prompt, but with the new reflection appended to the memory block. The model never sees its own prior output in the retry (just the reflection), which keeps context small and discourages copy-pasting broken code.

The reflection prompt is where the loop lives or dies. "Reflect on what went wrong" is too loose. Better: "In 2–3 sentences, name the specific mistake in the attempt above, and state one concrete change for the next attempt." Specificity in, specificity out.
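The three templates can be plain string builders. A sketch with the wording from above — the function names and exact phrasing are illustrative, not a fixed API:

```python
# Sketch of the three reflexion templates as plain functions. The retry
# prompt is just attempt_prompt() with a grown reflection list -- the model
# never sees its own prior output, only the distilled lessons.
def attempt_prompt(task, reflections):
    memory = "\n".join(f"- {r}" for r in reflections) or "(none yet)"
    return (
        f"Task: {task}\n\n"
        "Memory from prior attempts (lessons from past failures, "
        f"not part of the task):\n{memory}\n\n"
        "Output: your solution."
    )

def reflection_prompt(task, attempt, signal):
    return (
        f"Task: {task}\n"
        f"Previous attempt:\n{attempt}\n"
        f"Failure signal:\n{signal}\n\n"
        "In 2-3 sentences, name the specific mistake in the attempt above "
        "and state one concrete change for the next attempt. Do not write code."
    )
```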

Episodic Memory Design

What should carry across attempts? Three choices, with tradeoffs:

  • Reflection text only. Cheapest. The model only sees distilled lessons, not raw failures. Works when the reflection prompt produces high-signal text. Risk: important detail gets abstracted away.
  • Reflection + condensed trajectory. Reflection plus a one-line summary of what the attempt did ("tried list comprehension, failed on empty input"). More tokens, more grounding. Good for tool-using agents where the action mattered as much as the outcome.
  • Reflection + specific error evidence. Reflection plus the exact error message or failing-test output. Most grounded, most verbose. Right when the error itself is the key signal (tracebacks, validator messages).

For most tasks, reflection text only is enough. Upgrade to richer memory when you see the loop repeating mistakes — that means the reflection lost information the trajectory or error held.

Cap memory size. Past 3–5 reflections, older entries are usually superseded by newer ones. Either truncate (keep the last N) or summarize (ask the model to compress the list when it exceeds a threshold). Unbounded memory is a slow context-window leak.
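Both capping strategies are a few lines each. A sketch — `MAX_REFLECTIONS` and the token budget are illustrative numbers, and `summarize` stands in for a model call:

```python
# Two ways to cap episodic memory. Numbers are illustrative, not tuned.
MAX_REFLECTIONS = 5

def truncate_memory(reflections):
    # Keep only the most recent N lessons; older ones are usually superseded.
    return reflections[-MAX_REFLECTIONS:]

def maybe_summarize(reflections, summarize, word_budget=500):
    # `summarize` would be a model call compressing the list into one note.
    joined = "\n".join(reflections)
    if len(joined.split()) > word_budget:
        return [summarize(joined)]
    return reflections
```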

Termination

Reflexion loops must have a stop condition. Three patterns, in order of preference:

  • Task passes verifier. The external signal reports success — tests pass, the schema validates, the tool call returns clean. Terminate with success. This is the reason the pattern exists.
  • Retry cap. A hard maximum (k=3 or k=5) on attempts regardless of outcome. Mandatory. Without it a broken loop retries forever.
  • Escalation. If the retry cap is hit without success, don't just fail — escalate. Hand off to a human, switch to a larger model, surface the full reflection history as a diagnostic. The accumulated reflections are often the most useful thing the agent produced, even in failure.

Avoid combining reflexion with a "try harder" prompt that nudges k upward when things aren't working. More attempts with the same model on an unsolvable task is just more cost. Fail fast, escalate cleanly.

A Reflexion Loop — Hypothetical Example

An illustrative three-prompt sequence for a code-against-tests task. Hypothetical, meant to show the shape — the signals, tests, and outputs are invented.

# Attempt 1 (no prior reflections)

Task: Implement a function `flatten(seq)` that yields every non-iterable
leaf from a nested iterable. Pass the provided test suite.

Memory from prior attempts: (none yet — this is the first attempt.)

Output: your implementation as a single Python function.

---

# External signal after Attempt 1

Test run result:
  FAIL test_flatten_strings
    AssertionError: expected ['abc','def'], got ['a','b','c','d','e','f']
    input: ['abc', 'def']

(Strings are iterable but should be treated as leaves.)

---

# Reflection prompt

Below is your previous attempt and the test failure. In 2–3 sentences,
name the specific mistake and state one concrete change for the next
attempt. Do not write code here.

Previous attempt: <code from Attempt 1>
Failure: <test output above>

Reflection:

---

# Attempt 2 (reflection in memory block)

Task: <same as before>

Memory from prior attempts:
  - Attempt 1 treated strings as iterables and flattened their characters.
    Next attempt must treat `str` (and `bytes`) as leaves, not sequences to
    descend into.

Output: your implementation as a single Python function.

The shape repeats. The task stays constant, the memory block grows, the model's context each turn is task + distilled lessons — never the full mess of prior code. Wrap this in a harness that runs the test suite, calls the reflection step only on failure, and caps at k=3.

Common Anti-Patterns

  • Reflecting without a real failure signal. "Reflect on your output and try again" with no ground truth turns reflexion into self-refine with extra steps. Fix: require an external verdict; if there isn't one, use self-refine.
  • Over-long reflections. Paragraph-length reflections drift into generic self-criticism and crowd out context. Fix: cap reflections at 2–3 sentences; demand specificity.
  • Reflections that don't drive behavior. The agent writes a reflection, then produces an attempt structurally identical to the previous one. Fix: make the reflection prompt require one concrete change, and verify the retry prompt actually surfaces the reflection prominently.
  • Unbounded retry budgets. No cap means a stuck task consumes tokens forever. Fix: hard cap at k=3 or k=5 with explicit escalation on exhaustion.
  • Including the failed output verbatim in the retry. The model copies the broken code instead of rewriting. Fix: retry prompt sees the reflection, not the previous output.
  • Unbounded memory. After many retries the memory block dominates the prompt. Fix: keep the last N reflections or summarize when the block exceeds a token threshold.

FAQ

Does reflexion need an external verifier?

Effectively yes. Without a real signal that an attempt failed, the reflection step has nothing to ground on and you're doing self-refine under a different name. The verifier can be anything that produces pass/fail — tests, schema validators, tool errors, human review — but it has to exist. If you can't verify, don't reflect.

How long should a reflection be?

Two or three sentences is the sweet spot in practice. One sentence often loses nuance; a paragraph drifts into vague self-criticism and starts to overwhelm the memory block. Make the reflection prompt enforce the length explicitly.

Can I use a different model for the reflection step?

Yes, and sometimes it helps. A stronger model reflecting on a weaker model's failures can produce more actionable lessons than the weaker model reflecting on itself. The cost-benefit depends on task value — for cheap tasks, stick with one model; for expensive failures, a stronger critic is worth it.

How many retries is too many?

Past k=5, diminishing returns set in fast, and reflections start repeating themselves — the agent writes similar lessons in attempt 5 as in attempt 3. If the task hasn't passed by then, more retries rarely fix it. Escalate: different model, human review, or declare failure with the accumulated reflection history attached.

Does reflexion compose with self-refine?

Yes. A reflexion loop can run self-refine inside each attempt — the reflection becomes a strong seed for the critique step. The extra cost is another model call per attempt; the win is two feedback channels (internal critique plus cross-attempt memory) working together.

Wrap-Up

Reflexion is the pattern to reach for when you have retries and an external signal. The memory is verbal and lightweight, the failures teach the loop directly, and termination is natural because the verifier either passes or the retry budget runs out. Keep reflections short, cap retries, don't inflict the pattern on tasks without ground truth. For the larger pattern landscape see the complete guide to prompting AI coding agents; for the simpler inner loop see self-refine prompting; for tool-using agents where reflexion slots in naturally see ReAct prompting; for the retry-and-diagnose shape applied to bug fixing see agent debugging prompts.

Try it yourself

Build expert-level prompts from plain English with SurePrompts — 350+ templates with real-time preview.

Open Prompt Builder
