Your agent crushed it in testing. Every demo scenario worked. You shipped it. Three weeks later, a user mentions in a support ticket that the agent "just kept apologizing and never actually helped." You pull the logs. There aren't any. You have no idea what happened.
This is the production shock that hits almost every team building AI agents for the first time. The gap between "works in testing" and "observable in production" is wide, and falling into it is expensive.
The war story that changed how I instrument everything
We shipped a support agent for order lookups. It handled cancellations, status checks, refunds — all the things a support team deals with every day. In testing, it performed flawlessly. In production, users kept escalating to human agents.
Three weeks in, someone finally dug into the escalation patterns. The agent had a 40% tool-failure rate on order lookups. For three weeks.
What happened: a backend team had quietly deprecated an API endpoint and spun up a new one. The old endpoint started returning an HTML error page instead of JSON. The agent would receive the HTML, fail to parse it as order data, apologize to the user, and escalate. Every single time. The fix was changing one URL in a config file. Finding the problem took three weeks because nobody was tracing tool calls.
That experience is why I now instrument AI agents the same way I instrument payment systems — before they go live.
The 4 things you must instrument
Observability for agents isn't just "log the input and output." You need visibility into what happened between input and output, because that's where things break.
Token usage per step, not just per run. Total token count tells you cost. Token count per step tells you whether the agent is looping. A run that burns 40K tokens when your p95 baseline is 2K has a runaway loop. Build an alert at 3× your p95 baseline per run. This catches infinite loops before they drain your API budget.
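The check itself is a few lines once you accumulate per-step counts. A minimal sketch; the baseline constant is a placeholder you'd replace with your measured p95:

import logging

P95_BASELINE_TOKENS = 2_000  # placeholder: measure your own p95 from production traffic
ALERT_MULTIPLIER = 3         # the 3x rule described above

def check_token_budget(step_tokens: list[int], run_id: str) -> None:
    # step_tokens holds the token count recorded for each step of this run
    total = sum(step_tokens)
    if total > ALERT_MULTIPLIER * P95_BASELINE_TOKENS:
        logging.warning(
            "run %s used %d tokens across %d steps: possible runaway loop",
            run_id, total, len(step_tokens),
        )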
Tool call success and failure rate. Every tool call needs to log: the tool name, its inputs, whether it succeeded or failed, how long it took, and the error message if it failed. A 5% failure rate on any tool is worth investigating. A 15% failure rate is an incident. The order lookup story above? That was a 40% failure rate with zero visibility.
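A decorator is the cheapest way to capture all five fields for every tool without editing the tools themselves. A minimal sketch using stdlib logging:

import functools
import logging
import time

def logged_tool(fn):
    # Wrap any tool function to record name, inputs, outcome, latency, and error
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        success, err = True, None
        try:
            return fn(*args, **kwargs)
        except Exception as e:
            success, err = False, str(e)
            raise
        finally:
            logging.info(
                "tool=%s args=%r success=%s latency_ms=%.0f error=%s",
                fn.__name__, (args, kwargs), success,
                (time.perf_counter() - start) * 1000, err,
            )
    return wrapper

@logged_tool
def lookup_order(order_id: str) -> dict:
    ...  # your real tool here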
Latency per step, not just end-to-end. P50, P95, and P99 for each step. If P99 is 10× P50, you have a tail latency problem — probably a flaky downstream service or a tool call that hangs under load. End-to-end latency hides which step is the bottleneck.
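Once you record raw per-step latencies, the percentiles are a one-liner with the standard library. A small sketch; the sample latencies are illustrative:

import statistics

# Illustrative raw latencies (ms) for one step, e.g. the order lookup tool
lookup_latencies = [110.0, 120.0, 135.0, 128.0, 2400.0]

def latency_percentiles(latencies_ms: list[float]) -> dict:
    # n=100 yields 99 cut points; indexes 49/94/98 are p50/p95/p99
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

stats = latency_percentiles(lookup_latencies)
if stats["p99"] > 10 * stats["p50"]:
    print("tail latency problem: check downstream services for hangs")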
A hallucination proxy metric. This one isn't perfect, but it's useful. Build a simple check that flags responses containing URLs not in your knowledge base, prices that don't match your catalog, or names of people not in your system. Flag those for human review. You won't catch everything, but you'll catch the easy ones — and easy ones do happen at scale.
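The check can be plain set membership against data you already have. A sketch where KNOWN_URLS and the price set stand in for your real knowledge base and catalog export:

import re

KNOWN_URLS = {"https://yourapp.com/help", "https://yourapp.com/returns"}  # your KB
CATALOG_PRICES = {"₹499", "₹1,299"}  # exported from your catalog

def flag_for_review(response: str) -> list[str]:
    flags = []
    for url in re.findall(r"https?://[^\s)>\"]+", response):
        if url.rstrip(".,") not in KNOWN_URLS:
            flags.append(f"unknown URL: {url}")
    for price in re.findall(r"₹[\d,]+", response):
        if price not in CATALOG_PRICES:
            flags.append(f"price not in catalog: {price}")
    return flags  # non-empty: queue the response for human review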
LangSmith tracing setup
LangSmith is the fastest way to get full trace visibility on agent runs. The @traceable decorator wraps any function and automatically logs inputs, outputs, and nested calls to LangSmith's dashboard.
Here's how to wire it up with the aicredits.in API:
import os
import requests
from langsmith import traceable
from openai import OpenAI

# Turn on tracing before any traced function runs
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = os.environ["LANGSMITH_API_KEY"]

client = OpenAI(
    api_key=os.environ["AICREDITS_API_KEY"],
    base_url="https://api.aicredits.in/v1"
)

@traceable(name="lookup-order", tags=["tool", "order-system"])
def lookup_order(order_id: str) -> dict:
    # Inputs, output, latency, and any exception land in the trace
    response = requests.get(
        f"https://api.yourapp.com/orders/{order_id}",
        timeout=10,  # never let a tool call hang indefinitely
    )
    response.raise_for_status()
    return response.json()

@traceable(name="support-agent", run_type="chain")
def run_support_agent(user_message: str, user_id: str) -> str:
    response = client.chat.completions.create(
        model="anthropic/claude-sonnet-4-6",
        messages=[
            {"role": "system", "content": "You are a support agent. Use the lookup_order tool when given an order ID."},
            {"role": "user", "content": user_message}
        ]
    )
    return response.choices[0].message.content
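As written, run_support_agent never actually dispatches to lookup_order: the model is told about the tool but has no way to call it. One way to wire it in is an OpenAI-style tool-calling round trip. This is a minimal sketch handling a single tool call per turn; the ORDER_TOOL schema is an assumption of this example, not something LangSmith requires:

import json

# Hypothetical function schema so the model can request the tool
ORDER_TOOL = {
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Fetch order details by order ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}

@traceable(name="support-agent", run_type="chain")
def run_support_agent(user_message: str, user_id: str) -> str:
    messages = [
        {"role": "system", "content": "You are a support agent. Use the lookup_order tool when given an order ID."},
        {"role": "user", "content": user_message},
    ]
    response = client.chat.completions.create(
        model="anthropic/claude-sonnet-4-6", messages=messages, tools=[ORDER_TOOL]
    )
    msg = response.choices[0].message
    if msg.tool_calls:
        call = msg.tool_calls[0]
        # lookup_order is the @traceable tool above, so this call nests in the trace
        result = lookup_order(**json.loads(call.function.arguments))
        messages.append(msg)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)})
        response = client.chat.completions.create(
            model="anthropic/claude-sonnet-4-6", messages=messages
        )
    return response.choices[0].message.content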
Every call to run_support_agent creates a full trace in LangSmith — the user message, the model response, and every nested lookup_order call with its inputs and outputs. You can filter by tag, search by user ID, and replay any failing run.
Indian developers: access Claude, GPT-4o, and Gemini through AICredits.in — INR billing, UPI top-up, no international card.
The tags=["tool", "order-system"] on the tool function means you can filter LangSmith traces to just order lookup failures. When the next broken endpoint ships, you'll find it in minutes instead of weeks.
Structured logging for teams not using LangSmith
If you're shipping to an environment where LangSmith isn't approved, or you want logs in your existing stack alongside application logs, structured JSON logging gets you most of the way there:
import json
import uuid
from datetime import datetime, timezone

def log_agent_run(
    user_id: str,
    input_msg: str,
    output_msg: str,
    tool_calls: list,
    tokens_used: int,
    latency_ms: float,
    error: str | None = None
):
    log_entry = {
        "run_id": str(uuid.uuid4()),  # correlates a user complaint to one trace
        "timestamp": datetime.now(timezone.utc).isoformat(),  # utcnow() is deprecated
        "user_id": user_id,
        "input": input_msg[:500],  # truncated: don't log full history if it carries PII
        "output": output_msg[:500] if output_msg else None,
        "tool_calls": [
            {
                "tool": t["name"],
                "success": t["success"],
                "latency_ms": t["latency_ms"]
            }
            for t in tool_calls
        ],
        "tokens_used": tokens_used,
        "latency_ms": latency_ms,
        "error": error
    }
    print(json.dumps(log_entry))  # stdout, picked up by your log shipper
Ship this to Datadog, CloudWatch, or Grafana Loki. The structured shape means you can immediately build dashboards on tool_calls[*].success and tokens_used without any parsing work. run_id lets you correlate a user complaint to a specific trace. The 500-char truncation on input/output is intentional — you don't want to log full conversation history in plain text if you're storing user PII.
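At the call site, time the whole run and hand the pieces to log_agent_run. A minimal sketch; the tool_calls list and token count would come from your own agent's bookkeeping:

import time

def handle_message(user_id: str, user_msg: str) -> str | None:
    start = time.perf_counter()
    output, error = None, None
    try:
        output = run_support_agent(user_msg, user_id)
    except Exception as e:
        error = str(e)
    log_agent_run(
        user_id=user_id,
        input_msg=user_msg,
        output_msg=output,
        tool_calls=[],  # populate from your tool wrapper: {"name", "success", "latency_ms"}
        tokens_used=0,  # read from the API response's usage field in your agent
        latency_ms=(time.perf_counter() - start) * 1000,
        error=error,
    )
    return output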
Build two dashboards on day one: tool call success rate by tool name, and token usage distribution by run. Everything else comes later. These two will catch 80% of production problems.
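If those logs land as JSON lines in a file or a log pipeline, both day-one dashboards reduce to a few lines of aggregation. A rough sketch, assuming one log_agent_run entry per line in agent.log:

import json
from collections import defaultdict

per_tool = defaultdict(lambda: [0, 0])  # tool name -> [successes, total]
token_counts = []

with open("agent.log") as f:
    for line in f:
        entry = json.loads(line)
        token_counts.append(entry["tokens_used"])
        for call in entry["tool_calls"]:
            per_tool[call["tool"]][0] += call["success"]
            per_tool[call["tool"]][1] += 1

for tool, (ok, total) in per_tool.items():
    print(f"{tool}: {ok / total:.1%} success across {total} calls")

token_counts.sort()
print("p95 tokens per run:", token_counts[int(0.95 * (len(token_counts) - 1))])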
Alert patterns that actually matter
I've seen teams build elaborate alerting setups that fire on everything and end up ignored. Four rules, each actionable, are enough (a sketch of the first two checks follows the list):
Same tool called 3+ times in one run. This is almost always an infinite loop. The agent got a bad response, tried again, got the same bad response, and is stuck. Kill the run. Return a graceful error to the user. Log the loop pattern — it tells you which tool is breaking.
Token usage above 80% of context window. The next turn will fail with a context overflow error. Warn, and start summarizing conversation history automatically. Don't let this be a user-facing crash.
Tool error rate above 5% in the last 15 minutes. Check downstream service health immediately. This is the order-lookup-broken-endpoint scenario. The agent didn't break — something it depends on broke.
P95 latency above 2× your weekly average. Don't ignore latency spikes. They compound. A slow tool call makes the whole agent feel broken, and users give up before the agent finishes.
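The first two rules are per-run checks you can evaluate on each log entry; rules three and four are windowed aggregations that belong in your logging backend. A minimal sketch over the log_agent_run entries from earlier, where CONTEXT_WINDOW is an assumption you'd set for your model:

from collections import Counter

CONTEXT_WINDOW = 200_000  # model dependent; set this for the model you run

def check_run_alerts(run: dict) -> list[str]:
    # run is one log_agent_run entry (see the structured logging section)
    alerts = []
    tool_counts = Counter(call["tool"] for call in run["tool_calls"])
    for tool, n in tool_counts.items():
        if n >= 3:  # rule 1: likely infinite loop
            alerts.append(f"possible loop: {tool} called {n} times in one run")
    if run["tokens_used"] > 0.8 * CONTEXT_WINDOW:  # rule 2: overflow on the next turn
        alerts.append("above 80% of context window: summarize history now")
    return alerts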
Braintrust for evals alongside traces
LangSmith gives you traces. Braintrust gives you scores. The combination is what lets you ship improvements confidently.
The simplest useful scorer is binary: did the agent resolve the issue, or did it escalate? You can log this from your human agent handoff data. Pull your LangSmith traces, attach the resolution outcome, and run that as a Braintrust eval dataset.
from braintrust import Eval

def resolution_scorer(output, expected):
    # 1.0 when the agent's escalation decision matches the real outcome
    resolved = expected.get("resolved", False)
    agent_escalated = "escalate" in output.lower() or "transfer" in output.lower()
    return 1.0 if (resolved and not agent_escalated) or (not resolved and agent_escalated) else 0.0

Eval(
    "support-agent-resolution",
    data=lambda: load_traces_from_langsmith(last_n_days=7),  # your own trace loader
    task=lambda input: run_support_agent(input["message"], input["user_id"]),
    scores=[resolution_scorer]
)
Track the weekly average. If the resolution rate drops after a prompt change, you'll know before your support team tells you something is wrong. Even a 5-point drop in a week is worth investigating.
This is where most teams spend too little time. The eval setup takes a day. The value compounds for months. See our post on AI agent evaluation frameworks for a more complete treatment of scoring strategies.
What to do when you find a problem
Finding a problem in your traces is only useful if you have a triage process. When a tool failure alert fires:
- Pull the last 50 traces where that tool failed. Check if it's one user or many: one user points to bad input, many users point to a service issue. (A query sketch follows this list.)
- Check the error messages. "Connection timeout" and "invalid JSON response" point to different root causes.
- Check when it started. A sudden spike at 14:32 correlates with a deploy or a downstream change.
- Reproduce with the exact inputs from a failing trace. LangSmith makes this easy — you can replay a trace directly.
- Fix, deploy, and watch the dashboard for 30 minutes.
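Here's what step one can look like with the LangSmith SDK's run query. A rough sketch: the project name is yours, and the filter grammar is worth double-checking against the current LangSmith docs:

from langsmith import Client

client = Client()
# Pull recent failed runs of the tool that triggered the alert
failed = client.list_runs(
    project_name="support-agent-prod",  # your LangSmith project
    filter='eq(name, "lookup-order")',
    error=True,
    limit=50,
)
for run in failed:
    # Error strings separate "connection timeout" from "invalid JSON response"
    print(run.start_time, run.error, run.inputs)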
The triage process sounds obvious. It isn't, when you're half-asleep dealing with a production incident at 2am. Write it down before you need it.
Instrumentation is not optional
The mindset shift that matters: AI agents are not deterministic software. The same input can produce different outputs. The agent can take different tool call paths. It can succeed in one context and fail in another. You cannot reason about production behavior from code review alone.
Observability isn't overhead — it's the feedback loop that makes improvement possible. Without traces, every production problem is a mystery. With them, most problems are obvious within minutes.
Start with the four instrumentation points. Add LangSmith tracing if you can. Build the two dashboards. Set the four alerts. That's a day of work that will save you weeks of debugging.
For a deeper look at how to evaluate what you're seeing in those traces, the evaluating agents lesson covers scoring strategies and eval dataset construction from first principles.