Reflexion is a self-correction technique that turns LLMs into their own editors. Instead of accepting the first response, the model evaluates its own output, identifies what went wrong, and tries again.
Why Self-Correction Matters
The first response to a complex prompt is rarely optimal. Humans write drafts, identify problems, and revise. LLMs traditionally don't — they output once and stop.
Reflexion gives models a structured revision loop:
- Generate an initial response
- Evaluate — what is wrong or missing?
- Reflect — produce a verbal diagnosis of failures
- Improve — generate a new response using the reflection as context
- Repeat until criteria are met or max iterations reached
The Reflexion Loop
Task
↓
Generate response
↓
Evaluate against criteria
↓ [Fails]
Write reflection: "My response failed because..."
↓
Generate improved response (with reflection in context)
↓
Evaluate again → [Passes] → Done
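The loop above can be sketched as a small generic driver. This is a minimal illustration, not a fixed API: `generate`, `evaluate`, and `reflect` are invented placeholder callables standing in for whatever LLM calls and criteria checks you actually use.

```python
def reflexion(task, generate, evaluate, reflect, max_iterations=3):
    """Generic Reflexion loop.

    generate(task, reflection) -> response   (reflection is None on the first pass)
    evaluate(response)         -> (passed: bool, feedback: str)
    reflect(response, feedback) -> diagnosis string fed into the next generation
    """
    reflection = None
    response = None
    for _ in range(max_iterations):
        # The previous reflection (if any) rides along as context.
        response = generate(task, reflection)
        passed, feedback = evaluate(response)
        if passed:
            return response
        # Turn the failure into a verbal diagnosis for the next pass.
        reflection = reflect(response, feedback)
    return response  # best attempt after max_iterations
```

The key design point is that `reflection` is threaded back into `generate`: the model retries with a diagnosis in context, not from scratch.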
Basic Reflexion Prompt Pattern
Turn 1: Initial Generation
[Task description]
Generate your best response to this task.
Turn 2: Evaluation + Reflection
Here was your previous response:
[Response from Turn 1]
Evaluate this response against these criteria:
- [Criterion 1]
- [Criterion 2]
- [Criterion 3]
For each criterion, state whether the response passes, fails, or only partially passes, and briefly explain why.
Then write a reflection: what specifically would you do differently in an improved version?
Format:
EVALUATION:
- Criterion 1: [Pass/Fail/Partial] — [reason]
- Criterion 2: [Pass/Fail/Partial] — [reason]
REFLECTION: [What you'd do differently]
Turn 3: Improved Generation
Based on your reflection, generate an improved version of your response:
REFLECTION (from previous step):
[Paste reflection]
Now generate an improved response that addresses all the identified issues.
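One way to wire these turns together programmatically is sketched below. This is illustrative only: the helper names are invented, and the pass/fail parsing is deliberately naive (it just scans criterion lines for the word "fail"), so treat it as a starting point rather than a robust parser.

```python
def build_evaluation_prompt(previous_response, criteria):
    """Fill the Turn 2 template with a prior response and a criteria list."""
    bullets = "\n".join(f"- {c}" for c in criteria)
    return (
        "Here was your previous response:\n"
        f"{previous_response}\n\n"
        "Evaluate this response against these criteria:\n"
        f"{bullets}\n\n"
        "For each criterion, state whether the response passes or fails "
        "and briefly explain why.\n"
        "Then write a reflection: what specifically would you do differently "
        "in an improved version?\n\n"
        "Format:\nEVALUATION:\n- <criterion>: [Pass/Fail] - <reason>\n"
        "REFLECTION: <what you'd do differently>"
    )

def all_criteria_pass(evaluation_text):
    """Stop the loop only when no criterion bullet contains 'fail'."""
    lines = [l for l in evaluation_text.splitlines() if l.lstrip().startswith("-")]
    return bool(lines) and not any("fail" in l.lower() for l in lines)
```

In practice you would send the built prompt to the model, run `all_criteria_pass` on the EVALUATION section of its reply, and feed the REFLECTION section into the Turn 3 prompt.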
Worked Example: Improving a Product Description
Initial response:
The XR500 headphones have great sound quality and are comfortable to wear.
Battery lasts 30 hours. Compatible with Bluetooth 5.0.
Evaluation:
EVALUATION:
- Compelling opening: FAIL — starts with the product name, not a hook
- Addresses pain points: FAIL — doesn't mention what problem it solves
- Specific features with benefits: PARTIAL — lists specs but not benefits
- Call to action: FAIL — no CTA
REFLECTION: I listed features without translating them into benefits. I should
open with a hook about the listening experience, tie each spec to a user benefit
(30 hours = no mid-trip charging), and close with a CTA.
Improved response:
Lose yourself in your music, not in cable management. The XR500 delivers
audiophile-grade sound in a form factor you'll forget you're wearing — all day,
every day. Bluetooth 5.0 means instant pairing and zero audio dropouts. Thirty hours
of battery means a full work week without reaching for a charger. Ready to hear
the difference? Order today with free 30-day returns.
Reflexion for Code
Reflexion is especially powerful for coding because test results provide unambiguous feedback:
```python
def reflexion_coding_loop(task: str, tests: str, max_iterations: int = 3) -> str:
    """Run a Reflexion loop until tests pass or max iterations are reached."""
    code = generate_initial_code(task)
    for iteration in range(max_iterations):
        test_results = run_tests(code, tests)
        if all_tests_pass(test_results):
            return code
        # Diagnose the failures in natural language...
        reflection = generate_reflection(
            task=task,
            code=code,
            test_results=test_results,
        )
        # ...then regenerate with the diagnosis in context.
        code = generate_improved_code(
            task=task,
            previous_code=code,
            reflection=reflection,
        )
    return code  # Best attempt after max iterations
```
The reflection for a failing test might look like:
My code failed test_edge_case_empty_input because I didn't handle the case where
the input list is empty — my code tries to access index 0 without checking length.
I also failed test_negative_numbers because I assumed all inputs were positive.
My improved version will add: (1) empty list check at the start, (2) abs() around
numeric operations.
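The loop above assumes `run_tests` and `all_tests_pass` exist. A minimal illustrative implementation is sketched below, assuming `tests` is a string of plain `assert` statements; a real harness would sandbox execution and use a proper test runner rather than `exec`.

```python
def run_tests(code: str, tests: str) -> dict:
    """Naive harness: exec the generated code, then exec the test script
    against it. Returns {"passed": bool, "error": str}.
    WARNING: exec-ing model-generated code is unsafe outside a sandbox.
    """
    namespace = {}
    try:
        exec(code, namespace)
        exec(tests, namespace)
        return {"passed": True, "error": ""}
    except Exception as e:  # includes AssertionError from failing asserts
        return {"passed": False, "error": f"{type(e).__name__}: {e}"}

def all_tests_pass(test_results: dict) -> bool:
    return test_results["passed"]
```

The `error` string is exactly what you would hand to `generate_reflection` as the unambiguous feedback signal.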
Multi-Pass Reflexion
For complex tasks, you can run multiple full Reflexion cycles:
| Iteration | Focus |
|---|---|
| 1 | Correctness — does it do what was asked? |
| 2 | Completeness — is anything missing? |
| 3 | Quality — style, tone, conciseness |
Each iteration's reflection becomes context for the next generation.
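A multi-pass schedule can be sketched as below. Everything here is illustrative: `multipass_reflexion`, `generate`, and `critique` are invented names, and a real version would make LLM calls; the point is that reflections accumulate, so each later pass sees every earlier diagnosis.

```python
def multipass_reflexion(task, generate, critique, focuses=None):
    """Run one critique -> regenerate cycle per focus area,
    accumulating reflections so later passes see earlier diagnoses.

    generate(task, reflections) -> response  (reflections is a list of strings)
    critique(response, focus)   -> reflection string for that focus
    """
    if focuses is None:
        focuses = ["correctness", "completeness", "quality"]
    reflections = []
    response = generate(task, reflections)
    for focus in focuses:
        # Each pass critiques a single dimension of the current response.
        reflections.append(critique(response, focus))
        response = generate(task, reflections)
    return response, reflections
```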
When to Use Reflexion
High-value use cases:
- Code generation (test results = clear feedback signal)
- Structured document creation (follow a rubric)
- Multi-part reasoning tasks (check each component)
- Argument or essay writing (evaluate logic and evidence)
Not worth the cost:
- Simple factual Q&A (one shot works fine)
- Creative tasks with no clear quality criteria
- High-volume, latency-sensitive applications
Reflexion vs. Similar Techniques
| Technique | Mechanism | Feedback source |
|---|---|---|
| Reflexion | Self-evaluation → revise | Model critiques its own output |
| Self-consistency | Sample many paths → vote | Statistical aggregation |
| Chain of Thought | Show reasoning before answer | No revision loop |
| Constitutional AI | Rule-based self-critique | Predefined principles |
Reflexion is unique in that the failure diagnosis becomes part of the context — not just "try again" but "try again knowing specifically what failed and why."
Key Takeaways
- Have the model evaluate its output against explicit criteria before revising
- The reflection (diagnosis) is as important as the revised output
- 2–3 iterations handle most tasks; stop early once criteria pass
- Works best when you can define clear success/failure criteria
- Especially powerful for code, structured documents, and agent tasks with environment feedback