Most agent deployments are tested by vibes: "I tried it a few times and it seemed fine." That's how production incidents happen.
Agents behave non-deterministically. The same input can produce different outputs depending on model temperature, retrieval results, tool call timing, and dozens of other factors. Testing like you test deterministic code doesn't work. You can't write a unit test that asserts the agent's response is exactly "The refund will be processed in 3-5 business days" — because it won't be, and that's fine, as long as it's accurate. What you need is an evaluation framework built for probabilistic systems.
Here's how to build one.
Tests, evals, and monitoring — they're not the same thing
Most teams use these terms interchangeably. They shouldn't. Each plays a distinct role.
Tests are pass/fail assertions on deterministic behavior. Did the agent call the search_orders tool when the user asked about their order status? Did the output include a specific field? Did the agent refuse when the user tried to access data they don't own? These are binary and fast to run. Think of them as sanity checks that your plumbing works.
Evals are graded assessments of quality on a held-out dataset. How accurately did the agent complete its task across 100 representative queries? What fraction of its answers were faithful to the retrieved context? These require more setup but tell you whether your agent is actually good at its job.
Monitoring is online observation of production behavior. What are real users experiencing right now? Are escalation rates increasing? Is session abandonment spiking? Monitoring runs continuously in production and catches problems that your offline evals missed.
All three are necessary. Most teams only have vibes.
The four dimensions to evaluate
When you're building an eval framework from scratch, start with these four metrics. They cover the most common failure modes and give you a useful baseline before you go deeper.
Task completion rate
The most important metric. Did the agent successfully do what was asked?
For simple tasks, this is binary: the agent either booked the meeting or it didn't. For complex tasks, you need a graded scale — "fully complete," "partially complete," "failed." Define what each level means before you start collecting data; otherwise your scores will be inconsistent.
Target for production: greater than 90% task completion rate. If you're below 80%, the agent isn't ready to ship. If you're at 85-90%, you probably have specific failure modes worth investigating — run a breakdown by task type or query length to find them.
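A graded scale and a per-task breakdown can be computed with a few lines. A sketch, assuming per-query records with an `outcome` label; the 0.5 weight for partial completion is an assumption you should set deliberately:

```python
# Graded task completion scoring plus a breakdown by task type.
from collections import defaultdict

SCORES = {"fully_complete": 1.0, "partially_complete": 0.5, "failed": 0.0}

def completion_rate(records: list[dict]) -> float:
    """Average graded score across all records."""
    return sum(SCORES[r["outcome"]] for r in records) / len(records)

def breakdown_by(records: list[dict], key: str) -> dict[str, float]:
    """Completion rate per group, to surface specific failure modes."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    return {k: completion_rate(v) for k, v in groups.items()}

records = [
    {"task_type": "booking", "outcome": "fully_complete"},
    {"task_type": "booking", "outcome": "failed"},
    {"task_type": "refund", "outcome": "partially_complete"},
    {"task_type": "refund", "outcome": "fully_complete"},
]
overall = completion_rate(records)              # 0.625
per_type = breakdown_by(records, "task_type")   # booking: 0.5, refund: 0.75
```

The breakdown is what turns an 85-90% overall score into an actionable finding: one weak task type, not a uniformly mediocre agent.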
Hallucination and faithfulness rate
Did the agent claim things not supported by its context?
This matters most for RAG agents, where the agent is supposed to answer based on retrieved documents. A faithfulness failure is when the agent states something that isn't in the documents — it either fabricated the information or over-extrapolated from partial evidence.
For customer-facing agents, a 5% hallucination rate sounds acceptable until you realize it means 1 in 20 users gets confidently wrong information. Aim for below 2% on factual claims.
Tools: RAGAS (purpose-built for RAG evaluation), TruLens (good for open-ended evaluation), or LLM-as-judge with a faithfulness rubric (covered below).
Latency
Users abandon after 10-15 seconds. Know your P50 and P95 response times, not just the average.
P50 is your typical experience. P95 is what your unluckiest 1-in-20 user experiences. Both matter. An agent with a P50 of 3 seconds and a P95 of 45 seconds has a latency bug you need to fix — probably an occasional tool call timeout or a runaway iteration loop.
Measure per-step latency too: retrieval time, LLM inference time, tool execution time. When P95 spikes, you need to know which step is causing it.
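A sketch of per-step percentile reporting, assuming session logs with a `step_latency` map (the field names and nearest-rank percentile method are assumptions, not a specific tracing schema):

```python
# Per-step P50/P95 latency from session logs.

def percentile(values: list[float], p: float) -> float:
    """p-th percentile via nearest-rank on sorted values."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))
    return ordered[idx]

def latency_report(sessions: list[dict]) -> dict[str, dict[str, float]]:
    by_step: dict[str, list[float]] = {}
    for s in sessions:
        for step, seconds in s["step_latency"].items():
            by_step.setdefault(step, []).append(seconds)
    return {
        step: {"p50": percentile(v, 50), "p95": percentile(v, 95)}
        for step, v in by_step.items()
    }

sessions = [
    {"step_latency": {"retrieval": 0.2, "llm": 2.5, "tool": 0.4}},
    {"step_latency": {"retrieval": 0.25, "llm": 2.1, "tool": 0.5}},
    {"step_latency": {"retrieval": 0.3, "llm": 2.8, "tool": 0.3}},
    {"step_latency": {"retrieval": 5.0, "llm": 2.4, "tool": 40.0}},
]
report = latency_report(sessions)   # report["tool"]["p95"] flags the outlier
```

In this toy data, the LLM step's P50 is unremarkable while the tool step's P95 is 40 seconds — exactly the kind of per-step signal that explains a spiky overall P95.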
Cost per session
Token count × per-token API pricing, summed across all LLM calls in a session. Know this number before you scale from 100 users to 10,000.
A $0.12 average session cost is fine for an enterprise product. It's catastrophic for a consumer app where you're charging $10/month. Run your cost numbers early — agent sessions get expensive fast with multi-turn conversations and multiple tool calls.
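The arithmetic is simple but worth automating per session. A sketch with hypothetical per-million-token prices; substitute your provider's actual rates:

```python
# Session cost accounting. Prices are assumed USD per million tokens.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}

def call_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_PER_MTOK["input"]
            + output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000

def session_cost(calls: list[dict]) -> float:
    """Sum cost over every LLM call in one session."""
    return sum(call_cost(c["input_tokens"], c["output_tokens"]) for c in calls)

# A three-call session: the context grows each turn with history
# and tool results, so input tokens climb quickly.
session = [
    {"input_tokens": 1_200, "output_tokens": 300},
    {"input_tokens": 2_500, "output_tokens": 150},
    {"input_tokens": 4_000, "output_tokens": 400},
]
cost = session_cost(session)   # about $0.036 for this short session
```

Note the growth pattern: because each turn resends the conversation so far, input tokens (and cost) scale roughly quadratically with turn count, not linearly.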
Building a golden evaluation set
This is the highest-leverage thing you can do for your agent. Everything else depends on having a good golden set.
A golden set is a collection of 50-200 representative input/output pairs where you know the correct answer. It's your held-out test suite. You run your agent against it, score the results, and track how scores change as you modify the agent.
How to build one:
- Once you have production traffic, 50% of your golden set should come from real user queries. Sample from different time periods and user cohorts to get distribution coverage.
- 50% should be hand-crafted to cover cases you know are important but might be underrepresented in real traffic: edge cases, things the agent should refuse or escalate, ambiguous queries, multi-step tasks.
- Every query in the golden set needs a reference answer: either the exact correct output, a rubric for scoring, or both.
What to include: happy-path queries (things the agent should handle well), adversarial queries (attempts to make it say something wrong), boundary cases (requests at the edge of what the agent is designed to do), and escalation triggers (cases where the agent should hand off to a human).
One rule: every production incident goes into the golden set. If the agent failed on something real, add it. Your golden set should encode your institutional memory of failure modes.
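A golden set doesn't need special infrastructure; a JSONL file works. A sketch of one possible record schema (the field names and categories are assumptions, chosen to mirror the query types above):

```python
# One golden-set record per line in a JSONL file.
import json

record = {
    "id": "gs-042",
    "query": "When will my refund arrive?",
    "source": "production",    # "production" or "hand_crafted"
    "category": "happy_path",  # happy_path | adversarial | boundary | escalation
    "reference_answer": "Refunds are processed in 3-5 business days.",
    "rubric": "Must state the 3-5 day window; must not promise an exact date.",
}

def append_record(path: str, rec: dict) -> None:
    """Append a record to the golden set; one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(rec) + "\n")
```

An append-only file also makes the "every production incident goes into the golden set" rule a one-line operation in your incident runbook.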
LLM-as-judge for subjective quality
For many eval dimensions — helpfulness, tone, completeness — there's no clean string match, and you can't write a rule that distinguishes a good answer from a mediocre one. Human review is the gold standard, but it doesn't scale.
LLM-as-judge fills the gap. You use a stronger model to grade your agent's outputs against a rubric.
A simple faithfulness rubric:
You are evaluating whether an AI assistant's response is faithful to the provided source documents.
Source documents: {retrieved_context}
User query: {query}
Assistant response: {response}
Score the response on faithfulness (1-5):
5 - Every claim is directly supported by the source documents
4 - Nearly all claims are supported; minor extrapolations only
3 - Most claims are supported; some unsupported assertions
2 - Several unsupported claims present
1 - Response contains fabricated information not in the source
Respond with: {"score": N, "reasoning": "..."}
When well-designed, LLM-as-judge rubrics correlate with human judgment at around 80-85%. That's good enough to catch regressions, identify failure modes, and track trends.
One important caveat: don't use the same model for generation and evaluation. A model grading its own outputs will be biased toward its own style and reasoning patterns. Use Claude to evaluate GPT-4o outputs, or vice versa, or use a dedicated evaluator model.
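A minimal judge harness, kept provider-agnostic: `complete` stands in for whatever client call sends a prompt to your evaluator model and returns its raw text (an assumption, so the harness itself stays testable). The prompt condenses the rubric above.

```python
# LLM-as-judge harness: format the rubric, call the judge, parse the verdict.
import json

RUBRIC = """You are evaluating whether an AI assistant's response is faithful
to the provided source documents.

Source documents: {retrieved_context}
User query: {query}
Assistant response: {response}

Score the response on faithfulness (1-5), using the 1-5 scale where 5 means
every claim is supported and 1 means fabricated information.
Respond with: {{"score": N, "reasoning": "..."}}"""

def judge_faithfulness(complete, retrieved_context, query, response) -> dict:
    prompt = RUBRIC.format(retrieved_context=retrieved_context,
                           query=query, response=response)
    result = json.loads(complete(prompt))   # expect {"score": N, "reasoning": ...}
    assert 1 <= result["score"] <= 5, "judge returned an out-of-range score"
    return result

# Stub judge, used here only to exercise the harness.
def fake_judge(prompt: str) -> str:
    return '{"score": 5, "reasoning": "All claims supported."}'

verdict = judge_faithfulness(fake_judge, "Refunds take 3-5 business days.",
                             "How long do refunds take?",
                             "Refunds take 3-5 business days.")
```

Keeping the judge call behind a plain callable also makes the cross-model caveat easy to honor: swap in a different provider's client for evaluation without touching the harness.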
Tooling — an opinionated shortlist
LangSmith is the default choice if you're building with LangChain or LangGraph. Tracing, dataset management, and evals are all in one place. The trace explorer is genuinely useful for debugging agent failures. Worth the cost for teams already in the LangChain ecosystem.
Braintrust is provider-agnostic and has a better UI for manual review workflows. Good choice if you're rolling your own agent framework or using multiple model providers. The experiment tracking and LLM-as-judge integrations are polished.
promptfoo is an open-source CLI tool that fits well into CI/CD pipelines. Strong for red-teaming, regression testing, and automated eval runs on every pull request. Lower setup friction than hosted tools.
RAGAS is specifically designed for RAG agents. It measures faithfulness, answer relevance, context precision, and context recall — the four metrics that matter most for retrieval-augmented systems. Use it alongside one of the above tools, not as a replacement.
DIY: If you have a small team and want to start immediately, a spreadsheet of golden-set queries + a Python script that calls your agent and logs the outputs is enough to begin. Don't let tooling paralysis stop you from running evals. Start simple.
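A sketch of that DIY script: read golden-set records from JSONL, call the agent, score against the reference answer, and log results to CSV. `run_agent` and `score` are placeholders for your own agent entry point and grading function (here, a naive substring check).

```python
# Minimal DIY eval runner: golden set in, scored CSV out.
import csv
import json

def run_agent(query: str) -> str:                  # placeholder agent
    return f"(agent answer to: {query})"

def score(output: str, reference: str) -> float:   # placeholder grader
    return 1.0 if reference.lower() in output.lower() else 0.0

def run_evals(golden_path: str, results_path: str) -> float:
    with open(golden_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    rows, total = [], 0.0
    for rec in records:
        out = run_agent(rec["query"])
        s = score(out, rec["reference_answer"])
        total += s
        rows.append({"id": rec["id"], "output": out, "score": s})
    with open(results_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "output", "score"])
        writer.writeheader()
        writer.writerows(rows)
    return total / len(records)   # mean score across the golden set
```

Swap the substring grader for an LLM-as-judge call once the loop is running; the structure stays the same.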
Regression testing — run evals before every deploy
Treat your golden set like a test suite and run it before shipping any change to your agent: prompt edits, retrieval parameter changes, model upgrades, tool modifications. All of them.
Acceptable regression threshold: less than 2% drop in task completion rate. This is conservative by software testing standards, but agents are sensitive — a prompt change that "obviously" improves one behavior can subtly degrade another.
Hard rule: any increase in hallucination rate means you don't ship. A 1% increase in task completion rate doesn't justify a 0.5% increase in hallucination rate. Accuracy is not a dial you trade off against other metrics.
Set up a CI check that runs your core golden set on every pull request. Even 20 queries covering your main use cases catches most regressions. Run the full set nightly or before major releases.
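The two rules above reduce to a small gate function your CI check can call after an eval run. A sketch, with metric names as assumptions; rates are expressed as fractions:

```python
# Pre-deploy regression gate: hard rule on hallucinations,
# soft threshold on task completion.

def should_ship(baseline: dict, candidate: dict) -> tuple[bool, str]:
    # Hard rule: any increase in hallucination rate blocks the deploy.
    if candidate["hallucination_rate"] > baseline["hallucination_rate"]:
        return False, "hallucination rate increased"
    # Soft rule: a task-completion drop of 2 points or more blocks it.
    drop = baseline["task_completion"] - candidate["task_completion"]
    if drop >= 0.02:
        return False, f"task completion dropped {drop:.1%}"
    return True, "ok"

baseline = {"task_completion": 0.92, "hallucination_rate": 0.015}
candidate = {"task_completion": 0.93, "hallucination_rate": 0.020}
ok, reason = should_ship(baseline, candidate)   # blocked: hallucinations rose
```

Note the example: the candidate improves task completion but still fails the gate, which is exactly the trade the hard rule exists to refuse.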
What to monitor in production
Once you've shipped, the eval framework becomes a monitoring framework. You're no longer running against a golden set — you're watching what real users are doing.
Four metrics worth instrumenting from day one:
Escalation rate: If your agent has a "hand off to human" path, watch how often it triggers. A rising escalation rate means the agent is encountering queries it can't handle — either new query types you didn't train for, or degrading performance on existing ones.
Session abandonment: Users who stop responding mid-conversation. Some abandonment is normal (they got their answer). Unusual spikes usually mean the agent said something confusing or unhelpful and the user gave up.
Tool call failure rate: Track how often tool calls return errors. Consistently failing tools indicate external API issues or malformed tool calls the agent is generating. This metric catches infrastructure problems before users report them.
User feedback signals: If you have thumbs up/down or any explicit feedback mechanism, track it. Even noisy feedback signals catch aggregate quality shifts that your offline evals didn't predict.
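All four metrics fall out of session event logs. A sketch, assuming your instrumentation records per-session flags and counters (the field names are assumptions about your own logging, not a standard schema):

```python
# Day-one production metrics computed from session logs.

def monitor_metrics(sessions: list[dict]) -> dict[str, float]:
    n = len(sessions)
    escalated = sum(1 for s in sessions if s.get("escalated"))
    abandoned = sum(1 for s in sessions if s.get("abandoned"))
    tool_calls = sum(s.get("tool_calls", 0) for s in sessions)
    tool_errors = sum(s.get("tool_errors", 0) for s in sessions)
    feedback = [s["thumbs_up"] for s in sessions if "thumbs_up" in s]
    return {
        "escalation_rate": escalated / n,
        "abandonment_rate": abandoned / n,
        "tool_failure_rate": tool_errors / tool_calls if tool_calls else 0.0,
        "positive_feedback_rate": sum(feedback) / len(feedback) if feedback else 0.0,
    }

sessions = [
    {"escalated": False, "abandoned": False, "tool_calls": 3, "tool_errors": 0, "thumbs_up": True},
    {"escalated": True,  "abandoned": False, "tool_calls": 2, "tool_errors": 1},
    {"escalated": False, "abandoned": True,  "tool_calls": 1, "tool_errors": 0, "thumbs_up": False},
    {"escalated": False, "abandoned": False, "tool_calls": 4, "tool_errors": 0, "thumbs_up": True},
]
metrics = monitor_metrics(sessions)
```

In practice you'd compute these over a rolling window and alert on deltas, not absolute values — the trend is the signal.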
Putting it together
The minimum viable eval setup for a production agent:
- A golden set of 50-100 queries with reference answers
- A task completion scoring rubric (and someone to apply it, or an LLM-as-judge prompt)
- A script that runs your agent against the golden set and logs scores
- A pre-deploy check: if scores drop, investigate before shipping
That's it. Most teams that skip this step end up doing expensive incident post-mortems instead. Evals are cheaper.
For the conceptual foundation on what to evaluate in agents, see our evaluating agents lesson — it covers the framework that the tooling above is designed to implement.