Why Evaluation is Non-Negotiable
You can't improve what you can't measure.
This is especially true for agents. Unlike a simple prompt where you can eyeball a few outputs and judge quality, agents run for many turns, make many decisions, and can fail in subtle ways that aren't obvious from looking at the final output alone.
Without systematic evaluation, you're flying blind. You don't know:
- Whether a change to your system prompt made things better or worse
- Which types of tasks your agent handles well vs. poorly
- Whether your agent gets more reliable as you iterate or if you're accidentally breaking things
Evaluation turns agent development from guesswork into engineering.
The Evaluation Stack
Agent evaluation happens at three levels:
Level 3: End-to-end task success
"Did the agent accomplish the goal?"
Level 2: Trajectory quality
"Did it take a reasonable path to get there?"
Level 1: Component quality
"Did individual steps — tool calls, reasoning — work correctly?"
Most teams start at Level 3 and add lower levels as they need finer-grained diagnostics.
Level 1: Component Evaluation
Test the individual pieces of your agent system.
Tool call accuracy
Does the model call the right tool given the right context?
test_cases = [
{
"input": "What's the weather in Tokyo right now?",
"expected_tool": "get_weather",
"expected_args": {"city": "Tokyo"}
},
{
"input": "What is 2 + 2?",
"expected_tool": None, # Should answer directly, not call a tool
}
]
Run each case, check whether the expected tool was called with valid arguments.
Reasoning quality
Does the Thought content before each action make sense? You can evaluate this with an LLM judge:
Judge prompt: "Below is an agent's thought before taking an action.
Given the context, rate the reasoning quality 1-5:
- 5: Correct, specific, addresses the task
- 3: Correct but vague
- 1: Wrong, confused, or misleading
Thought: [insert thought]
Action taken: [insert action]
Rating:"
Level 2: Trajectory Evaluation
A trajectory is the full sequence of thoughts, actions, and observations the agent took to complete a task.
Turn efficiency
How many turns did the agent need? Compare to an expected range.
Task: "Find the population of Paris"
Expected: 1-2 turns (search, read result)
Actual: 7 turns (agent searched repeatedly, got confused)
→ Flag for investigation
Unnecessary tool calls
Did the agent call tools it didn't need? Redundant or irrelevant tool calls waste tokens and time.
Path validity
Did the agent take a reasonable path, even if different from expected?
Example: Two valid trajectories for "find the cheapest flight to Tokyo":
- Path A: search flights → filter by price → return cheapest
- Path B: search economy flights → search budget airlines → compare → return cheapest
Both are valid. A trajectory evaluator should accept both.
Backtracking and recovery
When the agent hit a dead end, did it recover gracefully? Or did it spin in loops?
Level 3: End-to-End Task Evaluation
The most important question: did the agent successfully complete the task?
Binary success
For tasks with clear right/wrong answers:
def evaluate_task(agent_output, expected_answer):
# Exact match
if agent_output.strip().lower() == expected_answer.strip().lower():
return True
# Semantic match (via LLM judge)
return llm_judge(agent_output, expected_answer)
Rubric-based evaluation
For open-ended tasks, define a rubric and use an LLM judge:
Task: Write a market analysis for electric vehicles
Rubric:
- Accuracy (1-5): Are the facts correct?
- Completeness (1-5): Are all major aspects covered?
- Structure (1-5): Is it well-organized and readable?
- Sources (1-5): Are claims supported?
Overall score: average of rubric dimensions
Pass threshold: ≥ 3.5 average
Human evaluation
For high-stakes tasks, have humans rate a sample of agent outputs. Use human ratings to calibrate your automated evaluations.
Building a Test Suite
Step 1: Define your task distribution
List the types of tasks your agent is supposed to handle. Include:
- Common cases (80% of real usage)
- Edge cases (unusual but valid inputs)
- Known failure modes (inputs that have broken agents in the past)
- Adversarial cases (prompts designed to confuse or jailbreak the agent)
Step 2: Write test cases
For each task type, write 5-10 representative examples:
test_suite = [
{
"id": "research_001",
"category": "research",
"input": "What are the top 3 programming languages by job demand in 2026?",
"success_criteria": {
"type": "rubric",
"dimensions": ["accuracy", "recency", "completeness"],
"pass_threshold": 3.5
}
},
{
"id": "calc_001",
"category": "calculation",
"input": "What is the compound interest on $10,000 at 5% for 3 years?",
"success_criteria": {
"type": "exact",
"expected": "$1,576.25",
"tolerance": 0.01
}
}
]
Step 3: Automate the run
Run the full test suite against every significant change to your agent:
- System prompt changes
- New tools added or removed
- Model version upgrades
- Changes to context management
Step 4: Track trends over time
Track your metrics as a time series. A single eval score is a snapshot. Trends tell you whether your agent is improving.
Common Failure Modes to Test For
| Failure mode | Test for it by... |
|---|---|
| Hallucination | Include questions where the correct answer is "I don't know" or "I couldn't find this" |
| Tool overuse | Include tasks solvable without tools — check if agent still over-calls |
| Context forgetting | Long tasks where early instructions must be followed late in the run |
| Infinite loops | Tasks with no clean answer — check for termination |
| Wrong tool selection | Similar tasks requiring different tools — check correct routing |
| Prompt injection | Tool results containing instructions — check if agent follows them |
| Error recovery | Deliberately return errors from tools — check if agent recovers |
LLM-as-Judge: The Scalable Evaluator
For most production agent systems, human evaluation doesn't scale. LLM-as-judge is the practical solution.
Setup:
def llm_judge(task, agent_output, rubric):
prompt = f"""
You are an expert evaluator. Score the agent's output on the following task.
Task: {task}
Agent Output: {agent_output}
Rubric:
{rubric}
Return a JSON object with:
- scores: dict of dimension -> score (1-5)
- reasoning: brief explanation for each score
- pass: true if average score >= 3.5
"""
response = claude.complete(prompt)
return parse_json(response)
Best practices:
- Use a different model as judge than the agent (avoid self-evaluation bias)
- Include few-shot examples in the judge prompt to calibrate scores
- Periodically audit judge decisions against human ratings
- Use strong models (Claude Opus, GPT-4o) as judges — weaker models give unreliable scores
A Minimal Evaluation Setup to Start With
If you're just getting started, you don't need a full eval framework. Start here:
- Pick 20 representative tasks from your target use cases
- Define pass/fail for each (expected answer, or a rubric scored by you manually)
- Run your agent on all 20 after every major change
- Track your pass rate as a single number over time
- Investigate every failure — each one teaches you something about your agent
This takes a few hours to set up and gives you an immediate signal when something breaks.
Key Takeaways
- Evaluation is not optional — without it, you can't know if your agent is improving
- Evaluate at three levels: component quality, trajectory quality, and end-to-end task success
- Build a diverse test suite: common cases, edge cases, known failure modes, adversarial inputs
- LLM-as-judge scales where human evaluation can't — use a strong model with a clear rubric
- Track metrics over time; a single score is a snapshot, but trends show whether you're improving
- Test specifically for: hallucination, tool overuse, context forgetting, infinite loops, error recovery
- This completes the AI Agents track — you're now equipped to build, run, and evaluate production-ready agent systems