Reflexion is a self-correction technique that turns LLMs into their own editors. Instead of accepting the first response, the model evaluates its own output, identifies what went wrong, and tries again.
Why Self-Correction Matters
The first response to a complex prompt is rarely optimal. Humans write drafts, identify problems, and revise. LLMs traditionally don't — they output once and stop.
Reflexion gives models a structured revision loop:
- Generate an initial response
- Evaluate — what is wrong or missing?
- Reflect — produce a verbal diagnosis of failures
- Improve — generate a new response using the reflection as context
- Repeat until criteria are met or max iterations reached
The Reflexion Loop
Task
↓
Generate response
↓
Evaluate against criteria
↓ [Fails]
Write reflection: "My response failed because..."
↓
Generate improved response (with reflection in context)
↓
Evaluate again → [Passes] → Done
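The loop above can be sketched as a small generic driver. This is a minimal illustration, not a fixed API: `generate`, `evaluate`, and `reflect` are invented placeholder callables standing in for whatever LLM calls and criteria checks you actually use.

```python
def reflexion(task, generate, evaluate, reflect, max_iterations=3):
    """Generic Reflexion loop.

    generate(task, reflection) -> response   (reflection is None on the first pass)
    evaluate(response)         -> (passed: bool, feedback: str)
    reflect(response, feedback) -> diagnosis string fed into the next generation
    """
    reflection = None
    response = None
    for _ in range(max_iterations):
        # The previous reflection (if any) rides along as context.
        response = generate(task, reflection)
        passed, feedback = evaluate(response)
        if passed:
            return response
        # Turn the failure into a verbal diagnosis for the next pass.
        reflection = reflect(response, feedback)
    return response  # best attempt after max_iterations
```

The key design point is that `reflection` is threaded back into `generate`: the model retries with a diagnosis in context, not from scratch.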
Basic Reflexion Prompt Pattern
Turn 1: Initial Generation
[Task description]
Generate your best response to this task.
Turn 2: Evaluation + Reflection
Here was your previous response:
[Response from Turn 1]
Evaluate this response against these criteria:
- [Criterion 1]
- [Criterion 2]
- [Criterion 3]
For each criterion, state whether the response passes, fails, or only partially passes, and briefly explain why.
Then write a reflection: what specifically would you do differently in an improved version?
Format:
EVALUATION:
- Criterion 1: [Pass/Fail/Partial] — [reason]
- Criterion 2: [Pass/Fail/Partial] — [reason]
REFLECTION: [What you'd do differently]
Turn 3: Improved Generation
Based on your reflection, generate an improved version of your response:
REFLECTION (from previous step):
[Paste reflection]
Now generate an improved response that addresses all the identified issues.
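One way to wire these turns together programmatically is sketched below. This is illustrative only: the helper names are invented, and the pass/fail parsing is deliberately naive (it just scans criterion lines for the word "fail"), so treat it as a starting point rather than a robust parser.

```python
def build_evaluation_prompt(previous_response, criteria):
    """Fill the Turn 2 template with a prior response and a criteria list."""
    bullets = "\n".join(f"- {c}" for c in criteria)
    return (
        "Here was your previous response:\n"
        f"{previous_response}\n\n"
        "Evaluate this response against these criteria:\n"
        f"{bullets}\n\n"
        "For each criterion, state whether the response passes or fails "
        "and briefly explain why.\n"
        "Then write a reflection: what specifically would you do differently "
        "in an improved version?\n\n"
        "Format:\nEVALUATION:\n- <criterion>: [Pass/Fail] - <reason>\n"
        "REFLECTION: <what you'd do differently>"
    )

def all_criteria_pass(evaluation_text):
    """Stop the loop only when no criterion bullet contains 'fail'."""
    lines = [l for l in evaluation_text.splitlines() if l.lstrip().startswith("-")]
    return bool(lines) and not any("fail" in l.lower() for l in lines)
```

In practice you would send the built prompt to the model, run `all_criteria_pass` on the EVALUATION section of its reply, and feed the REFLECTION section into the Turn 3 prompt.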
Worked Example: Improving a Product Description
Initial response:
The XR500 headphones have great sound quality and are comfortable to wear.
Battery lasts 30 hours. Compatible with Bluetooth 5.0.
Evaluation:
EVALUATION:
- Compelling opening: FAIL — starts with the product name, not a hook
- Addresses pain points: FAIL — doesn't mention what problem it solves
- Specific features with benefits: PARTIAL — lists specs but not benefits
- Call to action: FAIL — no CTA
REFLECTION: I listed features without translating them into benefits. I should
open with a hook about the listening experience, tie each spec to a user benefit
(30 hours = no mid-trip charging), and close with a CTA.
Improved response:
Lose yourself in your music, not in cable management. The XR500 delivers
audiophile-grade sound in a form factor you'll forget you're wearing — all day,
every day. Bluetooth 5.0 means instant pairing and zero audio dropouts. Thirty hours
of battery means a full work week without reaching for a charger. Ready to hear
the difference? Order today with free 30-day returns.
Reflexion for Code
Reflexion is especially powerful for coding because test results provide unambiguous feedback:
```python
def reflexion_coding_loop(task: str, tests: str, max_iterations: int = 3) -> str:
    """Run a Reflexion loop until tests pass or max iterations are reached."""
    code = generate_initial_code(task)
    for iteration in range(max_iterations):
        test_results = run_tests(code, tests)
        if all_tests_pass(test_results):
            return code
        # Diagnose the failures in natural language...
        reflection = generate_reflection(
            task=task,
            code=code,
            test_results=test_results,
        )
        # ...then regenerate with the diagnosis in context.
        code = generate_improved_code(
            task=task,
            previous_code=code,
            reflection=reflection,
        )
    return code  # Best attempt after max iterations
```
The reflection for a failing test might look like:
My code failed test_edge_case_empty_input because I didn't handle the case where
the input list is empty — my code tries to access index 0 without checking length.
I also failed test_negative_numbers because I assumed all inputs were positive.
My improved version will add: (1) empty list check at the start, (2) abs() around
numeric operations.
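The loop above assumes `run_tests` and `all_tests_pass` exist. A minimal illustrative implementation is sketched below, assuming `tests` is a string of plain `assert` statements; a real harness would sandbox execution and use a proper test runner rather than `exec`.

```python
def run_tests(code: str, tests: str) -> dict:
    """Naive harness: exec the generated code, then exec the test script
    against it. Returns {"passed": bool, "error": str}.
    WARNING: exec-ing model-generated code is unsafe outside a sandbox.
    """
    namespace = {}
    try:
        exec(code, namespace)
        exec(tests, namespace)
        return {"passed": True, "error": ""}
    except Exception as e:  # includes AssertionError from failing asserts
        return {"passed": False, "error": f"{type(e).__name__}: {e}"}

def all_tests_pass(test_results: dict) -> bool:
    return test_results["passed"]
```

The `error` string is exactly what you would hand to `generate_reflection` as the unambiguous feedback signal.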
Multi-Pass Reflexion
For complex tasks, you can run multiple full Reflexion cycles:
| Iteration | Focus |
|---|---|
| 1 | Correctness — does it do what was asked? |
| 2 | Completeness — is anything missing? |
| 3 | Quality — style, tone, conciseness |
Each iteration's reflection becomes context for the next generation.
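A multi-pass schedule can be sketched as below. Everything here is illustrative: `multipass_reflexion`, `generate`, and `critique` are invented names, and a real version would make LLM calls; the point is that reflections accumulate, so each later pass sees every earlier diagnosis.

```python
def multipass_reflexion(task, generate, critique, focuses=None):
    """Run one critique -> regenerate cycle per focus area,
    accumulating reflections so later passes see earlier diagnoses.

    generate(task, reflections) -> response  (reflections is a list of strings)
    critique(response, focus)   -> reflection string for that focus
    """
    if focuses is None:
        focuses = ["correctness", "completeness", "quality"]
    reflections = []
    response = generate(task, reflections)
    for focus in focuses:
        # Each pass critiques a single dimension of the current response.
        reflections.append(critique(response, focus))
        response = generate(task, reflections)
    return response, reflections
```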
When to Use Reflexion
High-value use cases:
- Code generation (test results = clear feedback signal)
- Structured document creation (follow a rubric)
- Multi-part reasoning tasks (check each component)
- Argument or essay writing (evaluate logic and evidence)
Not worth the cost:
- Simple factual Q&A (one shot works fine)
- Creative tasks with no clear quality criteria
- High-volume, latency-sensitive applications
Reflexion vs. Similar Techniques
| Technique | Mechanism | Feedback source |
|---|---|---|
| Reflexion | Self-evaluation → revise | Model critiques its own output |
| Self-consistency | Sample many paths → vote | Statistical aggregation |
| Chain of Thought | Show reasoning before answer | No revision loop |
| Constitutional AI | Rule-based self-critique | Predefined principles |
Reflexion is unique in that the failure diagnosis becomes part of the context — not just "try again" but "try again knowing specifically what failed and why."
Key Takeaways
- Have the model evaluate its output against explicit criteria before revising
- The reflection (diagnosis) is as important as the revised output
- 2–3 iterations handle most tasks; stop early once criteria pass
- Works best when you can define clear success/failure criteria
- Especially powerful for code, structured documents, and agent tasks with environment feedback