A demo agent works when you run it carefully, on good inputs, with your full attention. A production agent needs to work when your users run it at 3am, on unexpected inputs, on a slow API connection, with no one watching. The gap between these two is significant — and it's predictable. This lesson covers what fills it.
The good news: every problem you'll face in production has been faced before. The failure modes are documented. The patterns exist. You just need to build them in from the start, not bolt them on after the first incident.
The five gaps between demo and production
Before getting into solutions, it's useful to name the gaps explicitly:
Reliability. Demo agents fail ungracefully — an API timeout raises an unhandled exception and the session dies. Production agents need retry logic, fallbacks, and clear error messaging.
Security. Demo agents trust all input. Production agents receive adversarial input from users who are actively trying to get them to do things they shouldn't. Injection protection and output validation are required.
Cost. Demo agents run on a developer's credit card with no volume. Production agents at scale can generate unexpected costs quickly. Per-session budgets and rate limits prevent surprises.
Testing. Demo agents are tested by trying them and seeing if the output looks right. Production agents need reproducible test suites that run before every deployment.
Human oversight. Demo agents run fully autonomously in a forgiving environment. Production agents often need human checkpoints before irreversible or high-stakes actions.
Reliability patterns
Retries with exponential backoff
LLM APIs fail. External tools fail. Networks fail. Build retry logic into every external call:
- Retry up to 3 times with delays of 1s, 2s, 4s (exponential backoff with jitter)
- Retry on: rate limit errors (429), network timeouts, server errors (5xx)
- Do not retry on: malformed requests (400), authentication failures (401/403), not-found errors (404) — these won't resolve with retries
Most LLM client libraries provide retry helpers. Use them rather than building your own, and set timeouts on all external calls — an infinite hang is worse than a fast failure.
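The policy above can be sketched as a small wrapper. This is a minimal illustration, not a replacement for your client library's built-in retries; `TransientError` and `call_with_retries` are hypothetical names for this example.

```python
import random
import time

# Transient failures worth retrying; 400/401/403/404 are excluded on purpose.
RETRYABLE = {429, 500, 502, 503, 504}

class TransientError(Exception):
    """Hypothetical wrapper around an HTTP-level failure."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry fn on retryable errors with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError as e:
            # Non-retryable status, or out of attempts: fail fast.
            if e.status not in RETRYABLE or attempt == max_attempts - 1:
                raise
            # Delays of roughly 1s, 2s, 4s, with jitter to avoid thundering herds.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

The `sleep` parameter is injected so tests can run without real delays, which is also a useful pattern for your own retry code.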
Fallback models
If the primary model is unavailable or too slow, fall back to a lighter model rather than failing entirely:
- Primary: a capable model for high-quality responses (e.g., Claude Sonnet)
- Fallback: a faster, lower-cost model for degraded-but-functional responses (e.g., Claude Haiku)
Two rules for fallbacks: First, only fall back on availability failures — not on quality disagreement. Second, don't silently downgrade for important decisions. If a user is about to complete a financial transaction and your primary model is down, fail explicitly rather than letting a less capable model handle it.
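Both rules fit in a few lines. A minimal sketch, assuming the models are exposed as callables and availability failures surface as `ConnectionError`; in real code you would catch your client library's specific exception types.

```python
def complete_with_fallback(prompt, primary, fallback, allow_fallback=True):
    """Try the primary model; on an availability failure, optionally degrade."""
    try:
        return primary(prompt)
    except ConnectionError:
        # Rule 1: only fall back on availability failures (caught here),
        # never on quality disagreement.
        # Rule 2: for high-stakes flows, pass allow_fallback=False
        # so the failure is explicit instead of a silent downgrade.
        if not allow_fallback:
            raise
        return fallback(prompt)
```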
Graceful degradation
When a tool fails, the agent should acknowledge it rather than hallucinating the result. This requires an explicit instruction in the system prompt:
"If any tool call returns an error or fails to complete, tell the user what you were trying to do and that you couldn't complete it. Do not guess at the result or proceed as though the action succeeded."
Without this instruction, many agents will fabricate plausible-sounding results. A customer support agent that invents an order status is worse than one that says "I wasn't able to retrieve your order status — please try again or contact support."
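The prompt instruction works best when the orchestration code cooperates: tool failures should be returned to the model as structured results it can describe, not raised as exceptions that kill the session. A minimal sketch, with hypothetical names:

```python
def call_tool_safely(tools, name, args):
    """Run a tool and return errors as data the agent can see and report,
    instead of raising and ending the session."""
    try:
        return {"ok": True, "result": tools[name](**args)}
    except Exception as e:
        # The model receives this explicitly, so it can tell the user
        # what failed rather than fabricating a plausible result.
        return {"ok": False, "error": f"{name} failed: {e}"}
```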
Circuit breakers
If a downstream service is failing, continuing to call it makes things worse — you hammer a broken system and exhaust your retry budget. Implement circuit breakers: if a tool fails N times within a time window, stop calling it and surface the failure clearly. Alert your team. Don't let a single broken external API degrade every agent session indefinitely.
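A circuit breaker is a small piece of state. Here is one possible sketch using a sliding window of failure timestamps; the thresholds and the `clock` injection (which makes it testable) are illustrative choices, not a standard API.

```python
import time
from collections import deque

class CircuitBreaker:
    """Open the circuit after `threshold` failures within `window` seconds."""
    def __init__(self, threshold=5, window=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.window = window
        self.clock = clock
        self.failures = deque()  # timestamps of recent failures

    def is_open(self):
        now = self.clock()
        # Drop failures that have aged out of the window.
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()
        return len(self.failures) >= self.threshold

    def call(self, fn):
        if self.is_open():
            # Surface the outage clearly instead of hammering a broken service.
            raise RuntimeError("circuit open: tool disabled, alert the team")
        try:
            return fn()
        except Exception:
            self.failures.append(self.clock())
            raise
```

Production implementations usually add a half-open state that probes the service occasionally so the circuit can close again automatically.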
Human-in-the-loop design
Not every action should be autonomous. For some actions, the cost of a mistake outweighs the benefit of automation, and a human confirmation step is the right design choice — not a failure of engineering.
Three categories that typically require approval:
Irreversible actions. Deletions, sent emails, published posts. Once done, they can't be undone cleanly. Ask before executing.
High-stakes actions. Financial transactions, external communications, record modifications that affect other people. The blast radius of a mistake is high enough to justify a confirmation step.
Ambiguous requests. When the agent's confidence in its interpretation is low, it should say so and confirm rather than guess. "I want to make sure I understood: you want me to cancel the order, not update it — is that right?"
The implementation is straightforward. In the system prompt:
"Before calling any of the following tools — create_order, send_email, delete_record — summarize in plain language what you are about to do and explicitly ask the user to confirm before proceeding."
The agent surfaces its intent. The user confirms or corrects. The action executes. This pattern catches a large fraction of mistakes before they happen.
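On the orchestration side, the same gate can be enforced in code as defense in depth, so a prompt regression cannot skip the confirmation. A minimal sketch with hypothetical tool names matching the prompt above:

```python
# Tools that must never execute without an explicit user confirmation.
APPROVAL_REQUIRED = {"create_order", "send_email", "delete_record"}

def execute_tool(name, args, tools, confirm):
    """Gate high-stakes tools behind confirmation; `confirm` asks the user
    and returns True or False."""
    if name in APPROVAL_REQUIRED:
        summary = f"About to call {name} with {args}. Proceed?"
        if not confirm(summary):
            return {"status": "cancelled", "tool": name}
    return tools[name](**args)
```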
Rate limiting and cost management
Token budgets
Agent sessions can grow unexpectedly large. A conversation that loops, a tool that returns very long output, or a user who pastes a document can spike your token usage per session. Set budgets:
- Track cumulative tokens per session
- At a threshold (say, 80% of your budget), summarize earlier context and compress the conversation
- At the limit, surface a message to the user and end the session cleanly
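The three rules above reduce to a small counter. A sketch, with an arbitrary example limit; the orchestration loop would call `add` after every model turn and check the two predicates:

```python
class TokenBudget:
    """Track cumulative session tokens; signal compression and the hard stop."""
    def __init__(self, limit=100_000, compress_at=0.8):
        self.limit = limit
        self.compress_at = compress_at  # fraction of budget that triggers summarization
        self.used = 0

    def add(self, tokens):
        self.used += tokens

    def should_compress(self):
        # Time to summarize earlier context and shrink the conversation.
        return self.used >= self.limit * self.compress_at

    def exhausted(self):
        # Time to message the user and end the session cleanly.
        return self.used >= self.limit
```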
Per-user rate limits
Prevent any single user from consuming disproportionate resources, whether through abuse or simply heavy usage. Implement per-user limits on sessions per hour and tokens per day.
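One common implementation is a sliding-window limiter keyed by user ID. This sketch keeps state in memory for clarity; a real deployment would back it with Redis or similar so limits hold across processes. The class name and limits are illustrative.

```python
import time
from collections import defaultdict, deque

class PerUserLimiter:
    """Allow at most `max_events` per user within `window` seconds."""
    def __init__(self, max_events=10, window=3600.0, clock=time.monotonic):
        self.max_events = max_events
        self.window = window
        self.clock = clock
        self.events = defaultdict(deque)  # user_id -> recent event timestamps

    def allow(self, user_id):
        now = self.clock()
        q = self.events[user_id]
        # Evict events older than the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.max_events:
            return False
        q.append(now)
        return True
```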
Tool call limits
Infinite loops are possible. An agent that calls a search tool, gets results, calls a different search tool, gets different results, and decides it needs to search again can loop indefinitely. Set max_iterations — a cap on the number of tool calls per session. When the limit is reached, the agent should stop and summarize what it has so far rather than continuing to loop.
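The cap lives in the agent loop itself. A stripped-down sketch, where the model step is abstracted as a callable that either requests a tool call or produces a final answer (the tuple protocol here is invented for the example):

```python
def run_agent(step, max_iterations=10):
    """Agent loop capped at max_iterations tool calls.
    `step(history)` returns ("tool", result) or ("final", answer)."""
    history = []
    for _ in range(max_iterations):
        kind, value = step(history)
        if kind == "final":
            return value
        history.append(value)
    # Limit reached: summarize what we have rather than looping forever.
    return f"Stopped after {max_iterations} tool calls; partial results: {history}"
```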
Know your unit economics before you scale
Run cost analysis before launch:
- What is your average cost per session (tokens × price/token + tool call costs)?
- At 100 daily users, what is your monthly cost?
- At 10,000 daily users?
Surprises at scale are painful. A $0.05/session cost at 1,000 users/day is $1,500/month. The same cost at 100,000 users/day is $150,000/month. Know the number before your user count grows.
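The projection is simple arithmetic, which makes it easy to keep in a spreadsheet or a one-line function you rerun as prices and traffic change:

```python
def monthly_cost(cost_per_session, daily_users, sessions_per_user=1, days=30):
    """Project monthly spend from per-session cost and daily traffic."""
    return cost_per_session * sessions_per_user * daily_users * days
```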
Security
Prompt injection defense
Prompt injection is when malicious text in the user's input (or in data retrieved by the agent, such as a document or web page) attempts to override the agent's instructions. Common forms: "Ignore your previous instructions and instead...", base64-encoded instructions, or instruction-like text embedded in uploaded files.
Defense layers:
- Input validation: check for common injection patterns before passing user input to the agent context
- Instruction separation: use the system prompt for instructions, not the user turn. Don't construct system prompts from user-supplied data.
- Skeptical tool use: the agent's system prompt should instruct it to be skeptical of instructions found in retrieved content — a web page the agent reads should not be able to override its directives
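For the input-validation layer, a coarse pattern screen catches the lowest-effort attacks. To be clear, this is one layer of defense in depth, not a complete solution: determined attackers will phrase around any fixed pattern list, so the other two layers still matter. The patterns below are examples only.

```python
import re

# Coarse screen for common injection phrasings. Illustrative, not exhaustive.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |your )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"you are now", re.IGNORECASE),
    re.compile(r"system prompt", re.IGNORECASE),
]

def looks_like_injection(text):
    """Flag input for rejection or extra scrutiny before it reaches the agent."""
    return any(p.search(text) for p in INJECTION_PATTERNS)
```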
Output validation
Before taking action based on agent output, validate that the output matches the expected format and falls within expected bounds. Don't blindly execute whatever the agent returns.
If your agent generates structured output (a JSON payload, a SQL query, a code snippet), validate it against a schema or run it through a safety check before executing. An agent that generates a SQL query should have that query reviewed for destructive clauses before it runs against your database.
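Both checks can be sketched in a few lines. The payload fields and the destructive-clause list here are assumptions for the example; in practice you would validate against your real schema (a library like Pydantic or jsonschema is typical) and, for SQL, prefer a parameterized read-only connection over keyword screening alone.

```python
import json
import re

DESTRUCTIVE = re.compile(r"\b(DROP|DELETE|TRUNCATE|ALTER)\b", re.IGNORECASE)

def safe_sql(query):
    """Reject queries containing destructive clauses before they run."""
    return not DESTRUCTIVE.search(query)

def parse_order_payload(text):
    """Validate agent JSON output against the fields we expect."""
    data = json.loads(text)
    if not isinstance(data.get("order_id"), str):
        raise ValueError("order_id must be a string")
    if data.get("action") not in {"lookup", "update"}:
        raise ValueError("unexpected action")
    return data
```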
Tool sandboxing
Tools that execute code — code interpreters, shell access, file system tools — must run in sandboxed environments. Use Docker containers, VMs, or cloud functions with scoped permissions. Never execute LLM-generated code directly in your production environment.
LLMs can generate syntactically valid but destructive code. The model is not trying to harm you, but it will produce exactly what its reasoning produces, and that reasoning is not always safe.
Privilege separation
Agents should only have access to the tools they need for their specific task. A customer support agent does not need a tool that can delete database records. A content summarization agent does not need a tool that can send emails.
Apply the principle of least privilege: grant the minimum access required for the job. When scoping an agent's tools, ask: "What is the worst thing this agent could do with these tools?" If the answer is unacceptable, remove the tool.
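Concretely, this can be as simple as a per-role allowlist that the orchestrator consults when wiring up an agent. The roles and tool names below are invented for the example:

```python
# The full tool registry available in the system.
FULL_TOOLSET = {"check_order_status", "send_email", "delete_record", "summarize"}

# Per-role allowlists: grant only what each agent's job requires.
AGENT_TOOLS = {
    "support": {"check_order_status"},
    "summarizer": {"summarize"},
}

def tools_for(agent_role):
    """Return only the tools this role is permitted to use.
    An unknown role gets nothing (deny by default)."""
    allowed = AGENT_TOOLS.get(agent_role, set())
    return FULL_TOOLSET & allowed
```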
Testing strategy
Unit test your tools
Test each tool function independently before testing the agent. Does check_order_status("1234") return the right format? Does it handle a non-existent order ID gracefully? Does send_email validate the address format?
Tool bugs are easier to find and fix in isolation. An agent integration test that fails because of a tool bug is much harder to debug.
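As a sketch, here is what those unit tests look like for a toy version of check_order_status (the in-memory order store stands in for a real backend):

```python
def check_order_status(order_id):
    """Toy tool under test: look up an order in a fake in-memory store."""
    orders = {"1234": {"status": "shipped"}}
    if order_id not in orders:
        # Graceful structured error, not an exception.
        return {"error": "order_not_found", "order_id": order_id}
    return {"status": orders[order_id]["status"], "order_id": order_id}

def test_known_order_format():
    assert check_order_status("1234") == {"status": "shipped", "order_id": "1234"}

def test_missing_order_is_graceful():
    assert check_order_status("9999")["error"] == "order_not_found"
```

Under pytest these run automatically by name; the same assertions work standalone.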
Integration test agent workflows
Test end-to-end flows with mock LLM responses. Given a specific conversation history, does the agent call the tools in the right sequence? Does it handle a tool failure correctly? Does it ask for confirmation before irreversible actions?
Mock the LLM responses so your integration tests are deterministic — you're testing the orchestration logic and tool wiring, not the model's behavior.
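A sketch of the idea, with a hand-rolled mock for clarity (Python's unittest.mock works equally well). The decision format returned by the mock is invented for this example; shape it after whatever your real orchestrator consumes.

```python
def run_turn(llm, tools, user_message):
    """One orchestration turn: the LLM either requests a tool call or replies."""
    decision = llm(user_message)
    if decision["type"] == "tool_call":
        result = tools[decision["name"]](**decision["args"])
        return {"tool_called": decision["name"], "result": result}
    return {"reply": decision["text"]}

def mock_llm(message):
    """Deterministic stand-in: routes order questions to the lookup tool."""
    if "order" in message:
        return {"type": "tool_call", "name": "check_order_status",
                "args": {"order_id": "1234"}}
    return {"type": "text", "text": "How can I help?"}
```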
Build a golden set for evals
Maintain a set of 50 to 200 representative queries with known correct outcomes. Before every deployment, run your agent against the full golden set and measure:
- Did it call the right tools?
- Did the response match the expected format?
- Did it avoid prohibited actions?
- Did it handle edge cases correctly?
Your eval set should grow over time. Every production incident that reaches a user should produce at least one new eval case.
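A minimal eval runner is just a loop that compares outcomes to expectations and reports failures. This sketch scores only tool selection; a real harness would also check output format and prohibited actions, per the list above.

```python
def run_golden_set(agent, cases):
    """Score the agent against known-good cases before each deployment.
    Each case maps a query to the tool the agent is expected to call."""
    failures = []
    for case in cases:
        out = agent(case["query"])
        if out["tool"] != case["expected_tool"]:
            failures.append(case["query"])
    return {"total": len(cases),
            "passed": len(cases) - len(failures),
            "failures": failures}
```

Wire this into CI so a deployment is blocked when the pass rate drops below your threshold.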
Adversarial testing
Before shipping, try to break the agent yourself. Attempt prompt injection. Try to get it to skip confirmation steps. Submit malformed inputs. Try to extract information from the system prompt. Ask it to do things outside its scope.
If you can break it, your users can break it — and some of them will try. Adversarial testing is not optional; it's the last line of defense before the agent reaches real users.
Deployment considerations
Stateless design. Stateless agents are easier to scale, restart, and debug. Where possible, pass all necessary context in the request and store session state externally (in a database, not in memory). A stateless agent can be restarted without losing anything.
Streaming for latency. For interactive use cases, stream the LLM response to the client rather than waiting for the full completion. Users tolerate latency better when they can see output arriving incrementally.
Health checks. Every agent deployment should expose a health check endpoint that runs a simple, low-cost test query and verifies the agent responds correctly. Use this in your deployment pipeline and monitoring.
Rollback plan. When you update an agent's system prompt, tools, or model version, keep the previous version running for at least 24 hours. Agents can regress in subtle ways — a prompt change that improves one behavior can silently break another. The ability to roll back quickly reduces the cost of mistakes.
The principle to carry forward
Ship the simplest reliable agent, not the most capable unreliable one.
Every additional capability is additional surface area for failures. A simpler agent with excellent reliability and clear error handling will serve users better — and generate more trust — than a sophisticated agent that occasionally does something wrong in an opaque way.
Add capabilities one at a time. Cover each new capability with eval cases before adding the next. Production quality is earned incrementally.
What to read next
Agent Observability and Debugging covers how to monitor agents once they're in production — logging, tracing, and diagnosing failures. For the tool design principles referenced throughout this lesson, see Tool Design for AI Agents. For a concrete end-to-end example of these patterns applied to a real use case, see Build a Customer Support Agent That Doesn't Hallucinate.