Debugging a regular function is straightforward: wrong input produces wrong output, you trace through the code, you fix the function. The relationship is deterministic and the failure is usually local.
Debugging an agent is fundamentally different. The same input can produce different outputs on different runs. A failure that surfaces at step 7 of a reasoning chain was often caused by a mistake at step 3. The tool call was made with perfectly valid parameters — but the external API returned unexpected data, and the agent didn't know how to handle it. Without observability, you're not debugging. You're guessing.
This lesson covers what observability means for agents, how to trace execution, the five most common failure modes, and how to set up the monitoring you need before you launch.
Why agents are hard to debug
Several properties of agentic systems make traditional debugging approaches insufficient:
Non-determinism: Unlike a pure function, an LLM call with the same input can produce different outputs. Temperature settings, model updates, and sampling randomness all contribute. A bug you reproduce once might not reproduce on the next run.
Multi-step failure propagation: In a 7-step reasoning chain, the error introduced at step 3 may not produce an obviously wrong result until step 6 or 7. By then, the output looks plausible enough that you might not notice — unless you trace every step.
Opaque reasoning: Unless you're using an extended-thinking model that exposes its reasoning, you don't see why the agent made a decision. You see the decision, not the chain of reasoning that led to it.
External dependencies: Your agent's behavior depends on tool results, API responses, and retrieved content that you don't fully control. A tool that usually returns structured JSON occasionally returns an error string. A retrieval that usually surfaces the right document sometimes misses it. These external failures propagate into agent behavior in ways that are hard to anticipate.
Context sensitivity: The agent's behavior depends on everything in its context window. A subtle difference in a retrieved document, an extra message in the conversation history, a slightly different system prompt — any of these can change how the agent behaves. Reproducing exact failure conditions requires knowing the exact context state at the time of failure.
What observability means for agents
Observability for a traditional service means logs, metrics, and traces. For agents, it specifically means:
- A complete trace of every decision the agent made, in order
- Every tool call: which tool, what parameters, what was returned
- The exact prompt and context window contents at each decision point
- Latency and token count for each step (for cost and performance analysis)
- The final output and how it was constructed from the intermediate steps
Without this, you have a black box. With it, you can reconstruct exactly what happened on any given run and understand precisely where and why something went wrong.
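One way to make these requirements concrete is a per-step trace record. The sketch below is illustrative, not any particular tool's schema; every field name is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class TraceStep:
    """One step in an agent trace. Field names are illustrative, not a standard."""
    step: int                              # position in the reasoning chain
    kind: str                              # "reasoning" | "tool_call" | "tool_result" | "final_answer"
    content: Any                           # decision text, tool parameters, or raw tool output
    prompt_snapshot: Optional[str] = None  # context window contents at this decision point
    latency_ms: float = 0.0
    tokens: int = 0

@dataclass
class Trace:
    """A full run: ordered steps plus a session identifier for lookup."""
    session_id: str
    steps: list[TraceStep] = field(default_factory=list)

    def total_tokens(self) -> int:
        # Per-step token counts make cost analysis a one-liner.
        return sum(s.tokens for s in self.steps)
```

With records like these, "reconstruct exactly what happened" becomes iterating over `steps` in order and expanding the ones that look wrong.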
Tracing agent execution
A good trace captures the agent's full reasoning and action sequence. Here's what each step of a well-instrumented trace looks like:
Step 1 — Initial context: The full system prompt, conversation history, and any pre-loaded context passed into the agent. This is the starting state.
Step 2 — First reasoning step: What did the model decide to do? If it chose to call a tool, which tool and why? If it chose to respond directly, what triggered that decision?
Step 3 — Tool call: The tool name and the exact parameters passed. Not just "the agent searched" — but search_knowledge_base(query="refund policy for annual subscriptions").
Step 4 — Tool result: Exactly what the tool returned. The raw response, not a paraphrase. This is crucial — a vague summary of tool output hides the failures you need to see.
Step 5 — Post-tool reasoning: Did the tool result change the agent's plan? Did it call another tool, or proceed to respond? What changed in the context after the tool result was appended?
Step N — Final response: How was the final answer constructed? Which pieces of context were most relevant? Was it grounded in retrieved content or generated from the model's training?
A good tracing tool displays all of this as a timeline, with each step expandable to show full detail. When something goes wrong, you can scan the timeline, find where things diverged from the expected path, and understand exactly what happened.
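As an illustration, a trace for a single refund-policy question might serialize to something like the following. Every tool name and field here is hypothetical; the point is the one-line-per-step timeline view.

```python
# A hypothetical serialized trace for one run; all names are illustrative.
trace = [
    {"step": 1, "kind": "context",      "detail": "system prompt + history loaded"},
    {"step": 2, "kind": "reasoning",    "detail": "need the refund policy -> search"},
    {"step": 3, "kind": "tool_call",
     "detail": 'search_knowledge_base(query="refund policy for annual subscriptions")'},
    {"step": 4, "kind": "tool_result",  "detail": "3 chunks returned (raw text stored)"},
    {"step": 5, "kind": "reasoning",    "detail": "policy found -> answer directly"},
    {"step": 6, "kind": "final_answer", "detail": "grounded answer citing chunk 2"},
]

def timeline(steps):
    """Render a one-line-per-step summary: the 'scan the timeline' view."""
    return "\n".join(f"[{s['step']}] {s['kind']:12} {s['detail']}" for s in steps)

print(timeline(trace))
```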
The five most common agent failures
After instrumenting your agent, you'll be able to identify these failure modes in your traces:
Failure 1: Wrong tool selection
The agent calls the wrong tool for the job. It uses a general web search when it should have used the internal knowledge base. It calls a read tool when it should have called a write tool. It skips a required verification step and goes straight to an action.
What it looks like in a trace: The tool call name in step 3 doesn't match the intent expressed in step 2. The agent's reasoning said "I need to look up the user's order history" but then called search_web instead of get_order_history.
Fix: Improve your tool descriptions. Be explicit about when each tool should and should not be used. "Use this tool for internal customer order lookups. Do not use this tool for general product information — use search_knowledge_base for that." The more precisely you describe the tool's purpose and scope, the better the model is at choosing correctly.
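In practice this means writing tool definitions like the sketch below, in a generic JSON-schema style (the exact wrapper format depends on your model provider or framework). The descriptions spell out both scope and non-scope:

```python
# Hypothetical tool definitions in a generic JSON-schema style.
# The important part is the description text, which states what the tool
# is for AND what it is not for, naming the alternative.
TOOLS = [
    {
        "name": "get_order_history",
        "description": (
            "Use this tool for internal customer order lookups only. "
            "Do not use it for general product information; "
            "use search_knowledge_base for that."
        ),
        "parameters": {
            "type": "object",
            "properties": {"customer_id": {"type": "string"}},
            "required": ["customer_id"],
        },
    },
    {
        "name": "search_knowledge_base",
        "description": (
            "Use this tool for general product and policy questions. "
            "Do not use it to look up a specific customer's orders; "
            "use get_order_history for that."
        ),
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
]
```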
Failure 2: Malformed tool parameters
The agent calls the right tool but passes parameters in the wrong format. The order ID is passed as "order #1234" when the schema expects "1234". A date is formatted as "March 3rd" instead of "2026-03-03". A boolean field gets passed as the string "true" instead of true.
What it looks like in a trace: The tool call parameters don't match the schema. The tool returns a validation error or unexpected result, and the agent either retries with the same malformed input or produces an answer based on a failed tool call.
Fix: Add explicit format examples in your tool descriptions. Don't just say id: string — say id: string — the numeric order ID only, e.g. "1234" not "order 1234" or "#1234". The model follows examples far more reliably than it follows abstract format descriptions.
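A defensive complement to better descriptions is validating parameters at the tool boundary and returning a corrective error the agent can act on, rather than a bare failure. A minimal sketch (the function name and error wording are assumptions, not a framework convention):

```python
import re

def validate_order_id(raw: str) -> tuple[bool, str]:
    """Accept only a bare numeric order ID; otherwise return a corrective message.

    Returning the expected format in the error gives the agent something
    concrete to retry with, instead of repeating the same malformed input.
    """
    if re.fullmatch(r"\d+", raw):
        return True, raw
    # Salvage the digits so the error can show the expected form.
    digits = re.sub(r"\D", "", raw)
    hint = f'expected the numeric ID only, e.g. "{digits}"' if digits else "expected digits only"
    return False, f'Invalid order ID "{raw}": {hint}, not "order 1234" or "#1234".'
```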
Failure 3: Hallucination despite retrieval
This is one of the most insidious failures because the agent did everything right — it retrieved relevant documents — but then answered with something not actually contained in those documents.
What it looks like in a trace: The retrieval step returned relevant content. You can see it in the context at the time of the final response. But the answer includes claims that aren't present in the retrieved text.
Fix: Strengthen your grounding instructions in the system prompt. Be explicit: "Answer only using information from the provided context. If the context does not contain enough information to answer the question, say so — do not infer or speculate beyond what is provided." Also consider adding output validation that checks whether the key claims in the response can be found in the retrieved context.
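Output validation can start very simple, for example flagging response sentences with little word overlap against the retrieved context. The token-overlap check below is a rough heuristic and an assumption of this lesson, not a real hallucination detector; production systems typically use entailment models or LLM-as-judge:

```python
def ungrounded_sentences(response: str, context: str, threshold: float = 0.5):
    """Return response sentences whose content words mostly don't appear in context.

    Crude lexical heuristic: it misses paraphrases, but it reliably catches
    claims with no lexical support at all in the retrieved text.
    """
    context_words = set(context.lower().split())
    flagged = []
    for sentence in response.split("."):
        words = [w for w in sentence.lower().split() if len(w) > 3]
        if not words:
            continue
        overlap = sum(w in context_words for w in words) / len(words)
        if overlap < threshold:
            flagged.append(sentence.strip())
    return flagged
```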
Failure 4: Retrieval miss
The agent searched for relevant information but the search didn't return the right content. It then either said "I don't know" (better) or hallucinated to fill the gap (worse).
What it looks like in a trace: The retrieval step ran, but the returned chunks have low relevance scores or are clearly about a different topic than the query. The agent's answer is either a refusal or a fabrication.
Fix: This is a retrieval quality problem, not an agent reasoning problem. Common causes and fixes: your chunks are too large (break them down further), your embedding model isn't matching the query intent (try query expansion or a different model), or the information simply isn't in your knowledge base (add it). Examining the actual retrieved content in traces is the fastest way to diagnose which of these is the issue.
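A small trace-side check that flags low-relevance retrievals speeds up this diagnosis. This sketch assumes your retriever returns (chunk, score) pairs with higher-is-better scores; the threshold is an assumption to tune against your own scoring scale:

```python
def flag_retrieval_miss(results, min_score: float = 0.6, min_hits: int = 1):
    """Flag a retrieval as a likely miss if too few chunks clear the score bar.

    `results` is a list of (chunk_text, relevance_score) pairs as logged
    in the trace; the 0.6 threshold is illustrative, not universal.
    """
    hits = [(text, score) for text, score in results if score >= min_score]
    if len(hits) < min_hits:
        return True, f"only {len(hits)}/{len(results)} chunks scored >= {min_score}"
    return False, "ok"
```

Running this over logged retrievals separates "the agent reasoned badly" from "the agent never saw the right content" before you start changing prompts.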
Failure 5: Infinite loops and over-tool-calling
The agent calls the same tool repeatedly with slightly different queries, never converging on an answer. Or it calls 8 tools sequentially to answer a question that needed 1.
What it looks like in a trace: The same tool appears 3, 4, 5+ times in the timeline. The queries are variations on the same theme. The agent never commits to a final answer.
Fix: Add max_iterations limits to your agent loop — if it hasn't produced an answer after N steps, stop and return what you have. Add explicit system prompt instructions: "If you have searched twice and haven't found the information you need, tell the user you couldn't find it rather than searching again." For over-tool-calling, review whether your tool selection instructions are too loose, allowing the agent to call multiple tools when one would suffice.
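The loop guard can be as simple as the sketch below, where `step_fn` stands in for one model call that returns either a tool request or a final answer. All names and the action dict shape are illustrative assumptions:

```python
def run_agent(step_fn, max_iterations: int = 6):
    """Run an agent loop with a hard iteration cap and repeat-query detection."""
    seen_queries = set()
    history = []
    for _ in range(max_iterations):
        action = step_fn(history)          # hypothetical: one reasoning step
        if action["type"] == "final":
            return action["answer"]
        query = action["query"]
        if query in seen_queries:          # same search again: likely a loop
            return "I couldn't find the information you asked for."
        seen_queries.add(query)
        history.append(action)
    # Cap reached: stop and return something rather than looping forever.
    return "I couldn't complete this request within the step limit."
```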
Observability tools
Several tools are purpose-built for agent observability:
LangSmith: The best option if you're using LangChain or LangGraph. Provides real-time traces, a dataset management system for building eval sets, and integration with automated evaluation. The free tier is generous enough for development.
Langfuse: An open-source alternative to LangSmith. Self-hostable if you have data residency requirements. Good SDK support for multiple frameworks, not just LangChain.
Weights & Biases Weave: A strong option if your team already uses W&B for ML experimentation. Integrates agent traces with model training and evaluation workflows.
Arize Phoenix: Strong evaluation capabilities and production monitoring features. Good for teams who need to monitor agent quality at scale in production.
DIY minimal approach: If you don't want to set up a dedicated tool, the minimum viable logging approach is: wrap every tool call in a decorator that records the timestamp, tool name, input parameters, output result, and latency. Write these records to a file or database. This gives you enough to debug the majority of issues, even without a polished UI.
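That decorator might look like the following sketch, which appends one JSON line per call. The log path and record fields are assumptions; swap in a database write if you need one:

```python
import functools
import json
import time

LOG_PATH = "tool_calls.jsonl"  # assumption: a local JSONL file

def traced(tool_fn):
    """Wrap a tool so every call is logged with inputs, output, and latency."""
    @functools.wraps(tool_fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        error = None
        try:
            result = tool_fn(*args, **kwargs)
        except Exception as exc:           # log failures too; they matter most
            error, result = repr(exc), None
            raise
        finally:
            record = {
                "timestamp": start,
                "tool": tool_fn.__name__,
                "args": repr(args),
                "kwargs": repr(kwargs),
                "result": repr(result) if error is None else None,
                "error": error,
                "latency_ms": round((time.time() - start) * 1000, 2),
            }
            with open(LOG_PATH, "a") as f:
                f.write(json.dumps(record) + "\n")
        return result
    return wrapper
```

Decorate each tool function with `@traced` and every call, including the ones that raise, leaves a record you can grep later.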
Building your own minimal observability
If you're not ready for a full observability platform, here's a minimal setup that covers the most important ground:
Log every tool call: Before and after each tool execution, write a record with the tool name, parameters, result, and how long it took. This alone catches failure modes 1, 2, 4, and 5 from the list above.
Log session state at the start: When a session begins, write out the full system prompt and initial context. This gives you the starting conditions you need to reproduce failures.
Log agent decision points: When the agent decides to call a tool or produce a final answer, log the reasoning if it's available. Even just logging "agent chose to call X" is helpful for tracing the decision path.
Add unique session IDs: Every conversation should have a unique ID in your logs. When a user reports a problem, you can look up exactly that session.
This approach is not as powerful as a dedicated tool, but it's dramatically better than nothing — and you can implement it in an afternoon.
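Tying these pieces together, a session wrapper can stamp every record with a unique ID. A minimal sketch, with `uuid4` IDs and an in-memory record list as assumptions (a real setup would write to a file or database):

```python
import time
import uuid

class SessionLog:
    """Minimal session-scoped log: unique ID, initial context, decision points."""

    def __init__(self, system_prompt: str):
        self.session_id = str(uuid.uuid4())
        self.records = []  # assumption: in-memory; persist in a real setup
        # Log the starting conditions needed to reproduce failures.
        self.log("session_start", {"system_prompt": system_prompt})

    def log(self, event: str, detail: dict):
        self.records.append({
            "session_id": self.session_id,
            "timestamp": time.time(),
            "event": event,
            "detail": detail,
        })
```

When a user reports a problem, filtering your logs on that `session_id` reconstructs exactly their conversation.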
Offline evals vs online monitoring
Observability has two distinct modes, and you need both:
Offline evaluation runs before you deploy. You maintain a golden test set — a collection of inputs with known correct outputs — and run your agent against it before every change. This catches regressions: cases where a change you made broke something that was previously working. Run evals on every significant change to your agent, system prompt, tools, or retrieval configuration.
Online monitoring runs in production. You sample a percentage of real conversations (typically 5-10%) and review them — either manually or by running an LLM-as-judge evaluation. This catches distribution shift: cases where real-world inputs differ from your test set in ways you didn't anticipate.
Offline evals catch known failure modes before they reach users. Online monitoring catches unknown failure modes that only surface with real traffic. Both are necessary, and neither replaces the other.
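An offline eval harness can start as a loop over the golden set. A sketch, with a trivial exact-match grader standing in for whatever grading you actually use (in practice often LLM-as-judge); the case dict shape is an assumption:

```python
def run_offline_eval(agent_fn, golden_set, grade=None):
    """Run the agent over a golden test set and report failures.

    `agent_fn` maps an input string to an output; `grade` compares output
    to the expected answer (exact match by default, illustrative only).
    """
    grade = grade or (lambda out, expected: out == expected)
    failures = []
    for case in golden_set:
        output = agent_fn(case["input"])
        if not grade(output, case["expected"]):
            failures.append({"input": case["input"], "got": output,
                             "expected": case["expected"]})
    return {"total": len(golden_set), "failed": len(failures), "failures": failures}
```

Run it before every significant change; a nonzero `failed` count on previously passing cases is a regression.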
Key takeaway
You can't fix what you can't see. Instrument your agent before launching, not after your first production incident. The minimum viable observability is: log every tool call and its result. Everything else is built on top of that foundation.
The habit to build: every time you make a significant change to your agent, look at 5-10 traces from before and after the change. Compare them. This practice will surface regressions faster than any automated check, and it builds your intuition for how your specific agent behaves.
Next steps: For frameworks to systematically evaluate agent quality, see the Evaluating Agents lesson. For the full production checklist, see the Production-Ready Agents lesson.