Why is evaluating AI agents harder than evaluating regular software?

Regular software has deterministic outputs — the same input always produces the same output, and you can write precise assertions. Agents are non-deterministic: the same task may be solved via different tool calls in different runs, and 'correct' often means nuanced things like 'found the right answer via a reasonable path.' You need evaluation criteria that capture intent, not just exact output matches.

What metrics should I track for an AI agent?

The most important metrics are: task success rate (did it complete the goal?), tool call accuracy (did it call the right tools in the right order?), turn efficiency (how many steps did it take?), error recovery rate (when it failed, did it recover?), and cost per task (tokens used). Which matters most depends on your use case.

How do I build a test suite for an agent?

Start with a diverse set of representative tasks covering your intended use cases, plus known edge cases and failure modes. For each task, define what success looks like — either an expected output, a set of required tool calls, or a rubric an LLM judge can evaluate against. Run the suite on every significant change to your agent.

What is LLM-as-a-judge evaluation?

LLM-as-a-judge is using a separate, capable language model to evaluate your agent's outputs against a rubric. It's particularly useful for open-ended tasks where outputs can't be compared with string matching. For example, a judge model can evaluate whether a research summary is accurate, complete, and well-structured.

Evaluating AI Agents: How to Know If Your Agent Works

Why Evaluation is Non-Negotiable

You can't improve what you can't measure.

This is especially true for agents. Unlike a simple prompt where you can eyeball a few outputs and judge quality, agents run for many turns, make many decisions, and can fail in subtle ways that aren't obvious from looking at the final output alone.

Without systematic evaluation, you're flying blind. You don't know:

Whether a change to your system prompt made things better or worse
Which types of tasks your agent handles well vs. poorly
Whether your agent gets more reliable as you iterate or if you're accidentally breaking things

Evaluation turns agent development from guesswork into engineering.

The Evaluation Stack

Agent evaluation happens at three levels:

Level 3: End-to-end task success
          "Did the agent accomplish the goal?"

Level 2: Trajectory quality
          "Did it take a reasonable path to get there?"

Level 1: Component quality
          "Did individual steps — tool calls, reasoning — work correctly?"

Most teams start at Level 3 and add lower levels as they need finer-grained diagnostics.

Level 1: Component Evaluation

Test the individual pieces of your agent system.

Tool call accuracy

Does the model call the right tool given the right context?

test_cases = [
    {
        "input": "What's the weather in Tokyo right now?",
        "expected_tool": "get_weather",
        "expected_args": {"city": "Tokyo"}
    },
    {
        "input": "What is 2 + 2?",
        "expected_tool": None,  # Should answer directly, not call a tool
    }
]

Run each case, check whether the expected tool was called with valid arguments.

Reasoning quality

Does the Thought content before each action make sense? You can evaluate this with an LLM judge:

Judge prompt: "Below is an agent's thought before taking an action.
Given the context, rate the reasoning quality 1-5:
- 5: Correct, specific, addresses the task
- 3: Correct but vague
- 1: Wrong, confused, or misleading

Thought: [insert thought]
Action taken: [insert action]
Rating:"

Level 2: Trajectory Evaluation

A trajectory is the full sequence of thoughts, actions, and observations the agent took to complete a task.

Turn efficiency

How many turns did the agent need? Compare to an expected range.

Task: "Find the population of Paris"
Expected: 1-2 turns (search, read result)
Actual: 7 turns (agent searched repeatedly, got confused)
→ Flag for investigation

Unnecessary tool calls

Did the agent call tools it didn't need? Redundant or irrelevant tool calls waste tokens and time.

Path validity

Did the agent take a reasonable path, even if different from expected?

Example: Two valid trajectories for "find the cheapest flight to Tokyo":

Path A: search flights → filter by price → return cheapest
Path B: search economy flights → search budget airlines → compare → return cheapest

Both are valid. A trajectory evaluator should accept both.

Backtracking and recovery

When the agent hit a dead end, did it recover gracefully? Or did it spin in loops?

Level 3: End-to-End Task Evaluation

The most important question: did the agent successfully complete the task?

Binary success

For tasks with clear right/wrong answers:

def evaluate_task(agent_output, expected_answer):
    # Exact match
    if agent_output.strip().lower() == expected_answer.strip().lower():
        return True
    # Semantic match (via LLM judge)
    return llm_judge(agent_output, expected_answer)

Rubric-based evaluation

For open-ended tasks, define a rubric and use an LLM judge:

Task: Write a market analysis for electric vehicles
Rubric:
- Accuracy (1-5): Are the facts correct?
- Completeness (1-5): Are all major aspects covered?
- Structure (1-5): Is it well-organized and readable?
- Sources (1-5): Are claims supported?

Overall score: average of rubric dimensions
Pass threshold: ≥ 3.5 average

Human evaluation

For high-stakes tasks, have humans rate a sample of agent outputs. Use human ratings to calibrate your automated evaluations.

Building a Test Suite

Step 1: Define your task distribution

List the types of tasks your agent is supposed to handle. Include:

Common cases (80% of real usage)
Edge cases (unusual but valid inputs)
Known failure modes (inputs that have broken agents in the past)
Adversarial cases (prompts designed to confuse or jailbreak the agent)

Step 2: Write test cases

For each task type, write 5-10 representative examples:

test_suite = [
    {
        "id": "research_001",
        "category": "research",
        "input": "What are the top 3 programming languages by job demand in 2026?",
        "success_criteria": {
            "type": "rubric",
            "dimensions": ["accuracy", "recency", "completeness"],
            "pass_threshold": 3.5
        }
    },
    {
        "id": "calc_001",
        "category": "calculation",
        "input": "What is the compound interest on $10,000 at 5% for 3 years?",
        "success_criteria": {
            "type": "exact",
            "expected": "$1,576.25",
            "tolerance": 0.01
        }
    }
]

Step 3: Automate the run

Run the full test suite against every significant change to your agent:

System prompt changes
New tools added or removed
Model version upgrades
Changes to context management

Step 4: Track trends over time

Track your metrics as a time series. A single eval score is a snapshot. Trends tell you whether your agent is improving.

Common Failure Modes to Test For

Failure mode	Test for it by...
Hallucination	Include questions where the correct answer is "I don't know" or "I couldn't find this"
Tool overuse	Include tasks solvable without tools — check if agent still over-calls
Context forgetting	Long tasks where early instructions must be followed late in the run
Infinite loops	Tasks with no clean answer — check for termination
Wrong tool selection	Similar tasks requiring different tools — check correct routing
Prompt injection	Tool results containing instructions — check if agent follows them
Error recovery	Deliberately return errors from tools — check if agent recovers

LLM-as-Judge: The Scalable Evaluator

For most production agent systems, human evaluation doesn't scale. LLM-as-judge is the practical solution.

Setup:

def llm_judge(task, agent_output, rubric):
    prompt = f"""
You are an expert evaluator. Score the agent's output on the following task.

Task: {task}
Agent Output: {agent_output}

Rubric:
{rubric}

Return a JSON object with:
- scores: dict of dimension -> score (1-5)
- reasoning: brief explanation for each score
- pass: true if average score >= 3.5
"""
    response = claude.complete(prompt)
    return parse_json(response)

Best practices:

Use a different model as judge than the agent (avoid self-evaluation bias)
Include few-shot examples in the judge prompt to calibrate scores
Periodically audit judge decisions against human ratings
Use strong models (Claude Opus, GPT-4o) as judges — weaker models give unreliable scores

A Minimal Evaluation Setup to Start With

If you're just getting started, you don't need a full eval framework. Start here:

Pick 20 representative tasks from your target use cases
Define pass/fail for each (expected answer, or a rubric scored by you manually)
Run your agent on all 20 after every major change
Track your pass rate as a single number over time
Investigate every failure — each one teaches you something about your agent

This takes a few hours to set up and gives you an immediate signal when something breaks.

Key Takeaways

Evaluation is not optional — without it, you can't know if your agent is improving
Evaluate at three levels: component quality, trajectory quality, and end-to-end task success
Build a diverse test suite: common cases, edge cases, known failure modes, adversarial inputs
LLM-as-judge scales where human evaluation can't — use a strong model with a clear rubric
Track metrics over time; a single score is a snapshot, but trends show whether you're improving
Test specifically for: hallucination, tool overuse, context forgetting, infinite loops, error recovery
This completes the AI Agents track — you're now equipped to build, run, and evaluate production-ready agent systems