I've shipped six production AI agents. Every item on this list comes from something that went wrong.
The gap between "it works in my notebook" and "it works for real users at 2am when I'm asleep" is not about model quality. It's about all the things you didn't build around the model. This checklist covers them.
Twenty items, five categories. Each one takes 30 minutes to an hour to implement. Skip one and you'll spend a weekend on-call debugging it instead.
Reliability
1. Retry logic on every LLM call
LLM APIs return 429s, 500s, and occasional timeouts. Without retries, a single transient error fails your user's request.
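import anthropic
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# Create the client once and reuse it. max_retries=0 disables the SDK's own
# retries so tenacity is the only retry layer.
client = anthropic.Anthropic(max_retries=0)

# Retry only transient failures: rate limits, 5xx responses, connection errors and timeouts.
TRANSIENT_ERRORS = (anthropic.RateLimitError, anthropic.InternalServerError, anthropic.APIConnectionError)

@retry(
    retry=retry_if_exception_type(TRANSIENT_ERRORS),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
)
def call_claude(messages: list, system: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1000,
        system=system,
        messages=messages,
    )
    return response.content[0].text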
What happens if you skip it: one API blip, one angry user, one lost session. At scale: 1–3% of requests fail permanently on the first error.
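If you want to see what the backoff schedule actually does, here is a dependency-free sketch of the same idea. `retry_with_backoff` and `TransientError` are illustrative names of mine, not part of tenacity or the Anthropic SDK, and `sleep` is injectable only so the example can be exercised without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

class TransientError(Exception):
    """Hypothetical stand-in for a 429, a 5xx, or a timeout."""

def retry_with_backoff(fn: Callable[[], T], attempts: int = 3, base: float = 2.0,
                       cap: float = 10.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            # Wait base * 2^(attempt - 1) seconds, capped: 2s, 4s, 8s, 10s, ...
            sleep(min(cap, base * 2 ** (attempt - 1)))
    raise RuntimeError("unreachable")
```

In production you would probably also add random jitter to each wait so that many clients don't retry in lockstep.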
2. Timeout handling
A hung LLM call will hang your entire agent. Set timeouts at every layer.
import anthropic
import httpx
# The Anthropic SDK accepts an httpx client with timeout config:
# 30 seconds overall, 5 seconds to establish the connection
http_client = httpx.Client(timeout=httpx.Timeout(30.0, connect=5.0))
client = anthropic.Anthropic(http_client=http_client)
30 seconds is a reasonable max for a single call. If your agent regularly needs more, the task is probably too big for one call.
What happens if you skip it: one slow response from the API, and your user's request hangs indefinitely. In serverless environments, this eats your function timeout budget.
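"Every layer" includes your own tool functions, not just the HTTP client. A minimal sketch of a generic wrapper using only the standard library follows; `run_with_timeout` is a name I made up for illustration.

```python
import concurrent.futures
import time

def run_with_timeout(fn, timeout_s: float, *args):
    """Run fn(*args) in a worker thread; raise TimeoutError if it takes longer than timeout_s."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"call exceeded {timeout_s}s") from None
    finally:
        # Do not block on the worker. The thread itself keeps running until
        # fn returns; Python cannot forcibly stop it.
        executor.shutdown(wait=False)
```

The caller gets control back on time, but the underlying work is not cancelled. For tools that hit the network, prefer the timeout options of the HTTP or database client and treat this wrapper as a backstop.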
3. Graceful degradation
When Claude is down (it happens, rarely but it does), what does your user see? "Something went wrong" is better than a spinner that never resolves. A cached response from the last successful run is even better.
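import logging

logger = logging.getLogger(__name__)

def get_agent_response(query: str) -> str:
    try:
        # call_claude_with_retries stands for your retry-wrapped LLM call (item 1)
        return call_claude_with_retries(query)
    except Exception:
        # Log the full traceback, then return a fallback the user can act on
        logger.error("LLM call failed after retries", exc_info=True)
        return "I'm having trouble processing that right now. Please try again in a moment."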
What happens if you skip it: silent failures, blank UI states, confused users who don't know if the agent is thinking or broken.
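The cached-response idea mentioned above could look roughly like this. It is a sketch under simple assumptions: an in-memory, per-process dictionary keyed by the exact query string, and a `call_llm` callable that you supply. `respond_with_cache` is a hypothetical name.

```python
from typing import Callable, Dict

FALLBACK = "I'm having trouble processing that right now. Please try again in a moment."
_last_good: Dict[str, str] = {}  # query -> last successful answer

def respond_with_cache(query: str, call_llm: Callable[[str], str]) -> str:
    try:
        answer = call_llm(query)
    except Exception:
        # Serve the last good answer for this exact query if there is one
        return _last_good.get(query, FALLBACK)
    _last_good[query] = answer
    return answer
```

This only makes sense for queries whose answers don't go stale quickly. For anything user-specific or time-sensitive, the plain fallback message is the safer choice.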
4. Circuit breaker
If Claude is returning errors, stop hammering it. A circuit breaker pauses calls for a period after repeated failures, then lets a test request through.
from circuitbreaker import circuit

# Open the circuit after 5 consecutive failures; allow a test call after 60 seconds
@circuit(failure_threshold=5, recovery_timeout=60)
def call_claude(messages):
    ...
What happens if you skip it: when there's a partial outage, your agent keeps making failing calls, burning retries, consuming your rate limit budget, and delaying recovery.
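To show what happens inside the decorator, here is a stripped-down breaker written from scratch. It is my own illustration of the pattern, not the `circuitbreaker` library's implementation; `SimpleCircuitBreaker` and `CircuitOpenError` are hypothetical names, and the `clock` argument exists so the behavior can be tested without waiting.

```python
import time
from typing import Callable, Optional

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class SimpleCircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; skipping call")
            # Recovery window elapsed: fall through and let one test call run
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result
```

This sketch is not thread-safe, and it lets every caller through once the recovery window has passed rather than exactly one. Those gaps are good reasons to use a maintained library in production.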
Safety
5. max_iterations cap
Every agent that can use tools needs a hard cap on how many tool calls it makes per session. Without it, a confused agent loops forever.
MAX_ITERATIONS = 15

def run_agent(messages: list):
    # call_claude here must return the full response object, not just its text
    iterations = 0
    while True:
        iterations += 1
        if iterations > MAX_ITERATIONS:
            return {"error": "Agent reached iteration limit. Please rephrase your request."}
        response = call_claude(messages)
        if response.stop_reason != "tool_use":
            break
        # handle tool call...
    return response
What happens if you skip it: a bad prompt or unexpected tool result can send your agent into a loop that burns tokens and money until the process is killed.
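Here is a self-contained version of the capped loop that you can run without an API key. The response shape (a dict with `stop_reason`, `text`, `tool`, and `input`) and the plain-text tool result message are simplifications I chose for the example, not the Anthropic API's actual message format. `call_model` is a hypothetical wrapper you would write around your client.

```python
from typing import Callable, Dict, List

MAX_ITERATIONS = 15

def run_agent(call_model: Callable[[List[dict]], dict],
              tools: Dict[str, Callable[[dict], str]],
              messages: List[dict]) -> dict:
    for _ in range(MAX_ITERATIONS):
        response = call_model(messages)
        if response["stop_reason"] != "tool_use":
            return {"text": response["text"]}
        # Run the requested tool and feed its result back to the model
        result = tools[response["tool"]](response["input"])
        messages.append({"role": "user", "content": f"Tool result: {result}"})
    # The loop ran out without a final answer: stop instead of looping forever
    return {"error": "Agent reached iteration limit. Please rephrase your request."}
```

Using `for ... in range(MAX_ITERATIONS)` instead of `while True` makes the cap impossible to forget, because there is no code path that loops without counting.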
6. Input sanitization
Don't pass raw user input directly into your agent's system prompt or tool calls. Validate and strip at the boundary.
import re

def sanitize_input(text: str, max_length: int = 2000) -> str:
    # Truncate
    text = text[:max_length]
    # Strip null bytes and control characters (keep newlines/tabs)
    text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]', '', text)
    return text.strip()
This doesn't prevent prompt injection entirely — see the prompt injection defense post for a fuller treatment — but it handles the obvious cases.
What happens if you skip it: users can inject content that overrides your system prompt, causes unexpected behavior, or extracts information they shouldn't have access to.
7. Output validation
Validate the agent's output before acting on it. If your agent is supposed to return JSON, parse it and reject malformed responses.
import logging

from pydantic import BaseModel, ValidationError

logger = logging.getLogger(__name__)

class AgentOutput(BaseModel):
    action: str
    parameters: dict

def parse_agent_output(raw: str) -> AgentOutput | None:
    try:
        return AgentOutput.model_validate_json(raw)
    except (ValidationError, ValueError):
        # Log a truncated copy of the bad output for debugging
        logger.warning("Agent returned invalid output", extra={"raw": raw[:500]})
        return None
What happens if you skip it: the agent returns "I'm not sure how to help with that" in a field your code tries to parse as JSON. Or it returns valid JSON with an action field you didn't expect. Both can crash downstream code.
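The second failure, valid JSON with an action you didn't expect, needs an explicit allowlist on the field's value, not just its type. A dependency-free sketch follows; the action names in `ALLOWED_ACTIONS` are placeholders I invented, and `parse_action` is a hypothetical helper.

```python
import json
from typing import Optional

ALLOWED_ACTIONS = {"search", "create_ticket"}  # hypothetical action names

def parse_action(raw: str) -> Optional[dict]:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not JSON at all, e.g. a plain-text apology
    if not isinstance(data, dict):
        return None
    action = data.get("action")
    # Reject anything that is not a string on the allowlist
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return None
    if not isinstance(data.get("parameters"), dict):
        return None
    return data
```

If you already use pydantic, typing the field as a `Literal` of the allowed values should give you the same check declaratively.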
8. Tool allowlist per role
Not every user should have access to every tool. An admin tool that deletes records should not be available to a user-facing agent.
USER_TOOLS = ["search_knowledge_base", "get_order_status", "create_support_ticket"]
ADMIN_TOOLS = USER_TOOLS + ["update_order", "refund_payment", "delete_record"]

def get_tools_for_role(role: str) -> list:
    return ADMIN_TOOLS if role == "admin" else USER_TOOLS
What happens if you skip it: a user who types the right prompt can trigger admin operations. This has happened in production agents.
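Filtering the tool list you send to the model is half the job. I'd also check the allowlist again at execution time, in case a tool name reaches your dispatcher some other way. A sketch, where `execute_tool` and the `registry` mapping are hypothetical:

```python
from typing import Callable, Dict

USER_TOOLS = {"search_knowledge_base", "get_order_status", "create_support_ticket"}
ADMIN_TOOLS = USER_TOOLS | {"update_order", "refund_payment", "delete_record"}

def execute_tool(role: str, name: str, args: dict,
                 registry: Dict[str, Callable[[dict], object]]):
    allowed = ADMIN_TOOLS if role == "admin" else USER_TOOLS
    if name not in allowed:
        # Fail closed: unknown roles get the user allowlist, unknown tools are refused
        raise PermissionError(f"role {role!r} may not call tool {name!r}")
    return registry[name](args)
```

The role should come from your authenticated session, never from anything the model or the user typed.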
Observability
9. Log every tool call
Every tool call should emit a structured log: what was called, what was passed, what was returned, how long it took.
import logging
import time

logger = logging.getLogger(__name__)

def logged_tool_call(tool_name: str, inputs: dict, fn):
    start = time.perf_counter()
    try:
        result = fn(inputs)
        logger.info("tool_call", extra={
            "tool": tool_name,
            "inputs": inputs,
            "output": repr(result)[:500],  # truncated copy of what the tool returned
            "latency_ms": int((time.perf_counter() - start) * 1000),
            "success": True,
        })
        return result
    except Exception as e:
        logger.error("tool_call_failed", extra={
            "tool": tool_name,
            "inputs": inputs,
            "error": str(e),
            "latency_ms": int((time.perf_counter() - start) * 1000),
            "success": False,
        })
        raise
What happens if you skip it: a user reports the agent "doing something weird" and you have no idea what it called or what it got back.
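One gotcha: Python's default log formatter does not print fields passed through `extra`, so the structured data above is silently dropped unless your formatter emits it. A minimal JSON formatter sketch, with `JsonFormatter` being my own illustrative class and not a standard library one:

```python
import json
import logging

# Attributes every LogRecord carries; anything else arrived through `extra`
_STANDARD = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {"message", "asctime"}

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "event": record.getMessage()}
        # Copy over the caller-supplied fields (tool, inputs, latency_ms, ...)
        payload.update({k: v for k, v in vars(record).items() if k not in _STANDARD})
        # default=str keeps logging from crashing on non-serializable values
        return json.dumps(payload, default=str)
```

Attach it with `handler.setFormatter(JsonFormatter())` on whichever handler ships logs to your aggregator. Consider redacting sensitive fields in `inputs` before they are logged.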
10. Distributed tracing
Logs tell you what happened. Traces tell you why — the full chain of LLM calls and tool calls that produced an output. LangSmith, Braintrust, and OpenTelemetry are all good options.
See the agent observability guide for setup details. Pick one and integrate it before you go live — retrofitting observability after a production incident is painful.
What happens if you skip it: you get a bug report for a complex multi-step agent failure and have no way to replay what happened.
11. Error rate and latency alerts
Set two alerts on day one: error rate > 5%, and p95 latency > 10 seconds. Both indicate something is wrong that needs human attention.
What happens if you skip it: the error rate climbs for 48 hours before someone notices. By then you have a week of bad user experiences to explain.
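Most monitoring tools compute these for you, but it helps to know exactly what the two thresholds mean. A sketch of both checks over one window of requests, assuming every request (failed or not) records a latency; `p95` uses the nearest-rank method and `check_alerts` is a hypothetical helper.

```python
from typing import List

def p95(latencies_s: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(latencies_s)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]

def check_alerts(latencies_s: List[float], failures: int,
                 max_error_rate: float = 0.05, max_p95_s: float = 10.0) -> List[str]:
    alerts: List[str] = []
    total = len(latencies_s)
    if total == 0:
        return alerts  # no traffic in this window, nothing to evaluate
    if failures / total > max_error_rate:
        alerts.append("error_rate")
    if p95(latencies_s) > max_p95_s:
        alerts.append("p95_latency")
    return alerts
```

Evaluate over a rolling window (say, five minutes) rather than per request, or a single slow call on a quiet night will page you.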
12. Review 50 production conversations before calling it stable
This is the one developers skip most. Before you declare an agent production-ready, manually read 50 real conversations. You'll find edge cases your evals didn't cover. Always.
What happens if you skip it: you ship confident in your eval suite, then discover that 8% of users ask a question phrased in a way your evals never tested and the agent handles it badly.
Cost
13. Estimated cost per run, documented
Know what a single agent run costs before you ship. If you don't know, you can't budget, you can't alert, and you can't explain the bill to your CFO.
# Rough estimate: input tokens × rate + output tokens × rate
# Claude Sonnet 4.6: $3/M input, $15/M output
def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000 * 3) + (output_tokens / 1_000_000 * 15)
Log this on every run. Set a budget per run and flag when individual runs exceed it by 2×.
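A worked example using the rates quoted above, plus the 2× flag. The token counts and the `over_budget` helper are made up for illustration.

```python
INPUT_RATE = 3.0    # USD per million input tokens (rate quoted above)
OUTPUT_RATE = 15.0  # USD per million output tokens (rate quoted above)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1_000_000 * INPUT_RATE + output_tokens / 1_000_000 * OUTPUT_RATE

def over_budget(cost: float, budget_per_run: float, factor: float = 2.0) -> bool:
    """True when a run costs more than `factor` times its budget."""
    return cost > budget_per_run * factor

# 10,000 input tokens cost about $0.03 and 1,000 output tokens about $0.015,
# so this run comes to roughly $0.045
cost = estimate_cost(input_tokens=10_000, output_tokens=1_000)
```

With a per-run budget of $0.02, this run would be flagged, since $0.045 is more than twice the budget. Remember that an agent run is usually several calls, so sum the cost across every iteration of the loop.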
14. Daily spend alert
Set a spend alert at 1.5× your expected daily budget. Unexpected cost spikes are almost always a bug — an agent looping, a prompt that generates 10× the expected tokens, a retry storm.
What happens if you skip it: you wake up to a $400 API bill from an agent that ran in a loop overnight.
15. Context window limit
Conversations that grow without bound eventually hit the context window — and then either fail or generate very expensive calls. Summarize or truncate history after a fixed number of turns.
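def trim_conversation(messages: list, max_turns: int = 20) -> list:
    # One turn = one user message plus one assistant message
    if len(messages) <= max_turns * 2:
        return messages
    # Keep only the most recent turns. The system prompt is passed separately
    # via `system`, so it is unaffected; older turns are dropped, not summarized.
    return messages[-(max_turns * 2):]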
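If dropping old turns loses too much, the summarizing variant might look like this. It is a sketch: `summarize` is a callable you provide (for example, a cheap LLM call), and I return the summary separately so you can append it to your system prompt instead of inventing an extra message.

```python
from typing import Callable, List, Optional, Tuple

def compact_history(messages: List[dict],
                    summarize: Callable[[List[dict]], str],
                    max_turns: int = 20) -> Tuple[Optional[str], List[dict]]:
    keep = max_turns * 2
    if len(messages) <= keep:
        return None, messages  # short enough: nothing to summarize
    older, recent = list(messages[:-keep]), list(messages[-keep:])
    # Keep the window starting on a user message by moving any leading
    # assistant message into the part that gets summarized
    while recent and recent[0]["role"] != "user":
        older.append(recent.pop(0))
    return summarize(older), recent
```

If your history contains tool calls, also make sure the cut does not separate a tool call from its result. This sketch does not handle that case.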
16. Smaller model for cheap steps
Not every step in your agent needs Sonnet. Classification, routing, and extraction can run on Haiku at a fraction of the cost (about 3× cheaper per token at list prices for Haiku 4.5).
def classify_intent(message: str) -> str:
    # Haiku for cheap classification
    response = anthropic.Anthropic().messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=20,
        messages=[{"role": "user", "content": f"Classify as support/sales/other: {message}"}],
    )
    return response.content[0].text.strip().lower()
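A small model won't always answer with exactly one of your labels, so I'd normalize the result before routing on it. A sketch, with `normalize_intent` as a hypothetical helper:

```python
ALLOWED_INTENTS = ("support", "sales", "other")

def normalize_intent(raw: str) -> str:
    # Trim whitespace, lowercase, and drop surrounding quotes or punctuation
    cleaned = raw.strip().lower().strip(".!\"'")
    # Anything that is not an exact known label falls back to "other"
    return cleaned if cleaned in ALLOWED_INTENTS else "other"
```

Routing unknown output to "other" means a chatty or malformed classification degrades to the default path instead of raising a KeyError in your router.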
User experience
17. Loading state — always
Never show a blank screen while the agent thinks. Even a simple "Thinking..." indicator is better than silence.
What happens if you skip it: users click the submit button 3 more times, creating 3 parallel agent runs, and you have a debugging nightmare.
18. Errors that make sense to users
anthropic.APIStatusError: 529 Overloaded means nothing to a user. Map it to something useful:
FRIENDLY_ERRORS = {
    "529": "We're experiencing high demand right now. Please try again in a moment.",
    "timeout": "That request took too long. Try breaking it into a smaller question.",
    "iteration_limit": "I couldn't complete that in one go. Try a more specific request.",
}
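One way to wire that mapping up is sketched below. It assumes API status errors expose an integer `status_code` attribute (the Anthropic SDK's `APIStatusError` does) and that timeouts surface as Python's built-in `TimeoutError`; if your client raises its own timeout class, add it to the `isinstance` check. `friendly_message` and `DEFAULT_ERROR` are hypothetical names.

```python
FRIENDLY_ERRORS = {
    "529": "We're experiencing high demand right now. Please try again in a moment.",
    "timeout": "That request took too long. Try breaking it into a smaller question.",
    "iteration_limit": "I couldn't complete that in one go. Try a more specific request.",
}
DEFAULT_ERROR = "Something went wrong on our side. Please try again."

def friendly_message(exc: Exception) -> str:
    if isinstance(exc, TimeoutError):
        return FRIENDLY_ERRORS["timeout"]
    # Look up the HTTP status if the exception carries one
    status = getattr(exc, "status_code", None)
    return FRIENDLY_ERRORS.get(str(status), DEFAULT_ERROR)
```

Always keep a default. The error you didn't anticipate is the one users will actually see.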
19. Clear escalation path
Users need to know when they've hit the limits of the agent and what to do next. "Contact support at support@yourco.com" is fine. Silence is not.
Every agent response that ends in failure or uncertainty should offer a next step.
20. Document what the agent can and can't do
Put it somewhere users will actually see it — the first message, a sidebar, an onboarding modal. Be specific about the scope.
"I can help with order status, returns, and product questions. For billing disputes and account security, contact our support team directly."
What happens if you skip it: users ask the agent to do things it can't do, get frustrated when it fails, and blame the product instead of the scope mismatch.
The full list also appears in the agent evaluation post as part of a broader framework for measuring agent quality. Some of these items (logging config, environment variable presence, context window size) can be checked programmatically, and a script that does so makes a good addition to your CI pipeline.
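A starting point for such a script is sketched here. The specific checks and the 1 to 50 range for `max_turns` are my assumptions; adjust them to your setup. `ANTHROPIC_API_KEY` is the variable the Anthropic SDK reads by default, and `preflight` is a hypothetical function.

```python
import logging
from typing import List, Mapping

REQUIRED_ENV = ("ANTHROPIC_API_KEY",)

def preflight(env: Mapping[str, str], max_turns: int,
              root_logger: logging.Logger) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems: List[str] = []
    for name in REQUIRED_ENV:
        if not env.get(name):
            problems.append(f"missing environment variable: {name}")
    if not root_logger.handlers:
        problems.append("no logging handlers configured")
    if not 1 <= max_turns <= 50:
        problems.append(f"max_turns={max_turns} is outside the expected range 1-50")
    return problems
```

In CI, call it with `os.environ`, your configured turn limit, and `logging.getLogger()`, then exit non-zero if the returned list is not empty.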
Ship with all 20 checked. You'll thank yourself when the first 2am alert fires and it's not actually a disaster.



