An AI agent that generates bad text causes a problem you can fix by editing text. An AI agent that sends an email, deletes a file, or submits a form causes a problem that may not be fixable at all. This is why agent safety is fundamentally different from prompt safety — and why it demands more deliberate design.
Most of the failure modes discussed in prompt injection and jailbreaking are about getting models to say things they shouldn't. Agent failures are about getting models to do things they shouldn't. The stakes are higher and the blast radius is wider. A well-designed agent fails safely. A poorly designed one can cause irreversible damage in seconds.
## Why agent safety is different
When an agent has tools — web search, file system access, database writes, API calls, email sending — every action it takes has real-world consequences. The agent isn't just generating text for a human to evaluate; it's changing state in systems that other people depend on.
Three properties make agent failures more dangerous than text generation failures:
Automation amplifies mistakes. An agent can take 50 wrong actions in the time it takes a human to notice the first one. Errors compound.
Actions may be irreversible. You can't unsend an email. You can't easily undo a database deletion without a backup. You can't un-notify 10,000 users about a false emergency.
Agents are exploitable. An agent browsing the web or reading emails can be fed adversarial content designed to hijack its actions — this is prompt injection at the agent level, and it's more dangerous when the agent has write permissions.
## The 5 principles of responsible agent design
### 1. Minimal footprint
Request only the permissions the agent actually needs for its task. If an agent's job is to read customer records and generate summaries, it shouldn't have write access to the customer database. If it needs to send one type of email, it shouldn't have access to the full contact list.
This principle applies to:
- API scopes and permissions: Request read-only tokens when writes aren't needed
- File system access: Scope to specific directories, not the whole system
- Database access: Grant SELECT on specific tables, not full admin
- Tool availability: Don't give an agent tools it doesn't need for the current task
Minimal footprint limits the blast radius of a failure. An agent that can only read can't corrupt your data. An agent scoped to /tmp/agent-workspace/ can't accidentally delete production files.
In practice, this means parameterizing tool access per-task rather than giving every agent instance maximum permissions:
```python
# Instead of: agent.tools = all_tools
# Do:
tools_for_task = {
    "read_customer": customer_read_tool,
    "search_knowledge_base": kb_search_tool,
    # NOT: write_customer, delete_customer, send_email
}
agent = create_agent(tools=tools_for_task)
```
### 2. Human-in-the-loop checkpoints
Not every action needs human approval, but some do. The key is categorizing actions by consequence and requiring confirmation for high-consequence, low-reversibility actions.
Categories to think about:
| Action type | Example | Checkpoint needed? |
|---|---|---|
| Read-only lookup | Search knowledge base | No |
| Low-stakes write | Save a draft | Maybe |
| External communication | Send email | Yes |
| Irreversible deletion | Delete record | Yes, with explicit confirmation |
| Financial action | Process payment | Yes, with authorization |
| Bulk operation | Update 1000 records | Yes, with preview |
Implement checkpoints as explicit pause points in your agent loop, not just as instructions in the system prompt. Instructions can be overridden by sufficiently clever prompt injection. Code can't.
```python
# Actions in this set require human approval before execution
# (illustrative set — adapt to your own action types):
REQUIRES_CONFIRMATION = {"send_email", "delete_record", "process_payment", "bulk_update"}

async def execute_action(action, context):
    if action.type in REQUIRES_CONFIRMATION:
        confirmed = await request_human_approval(
            action=action,
            context=context,
            timeout_seconds=300,  # Auto-deny if no response
        )
        if not confirmed:
            return ActionResult(status="denied", reason="human_declined")
    return action.execute()
```
### 3. Graceful degradation
When an agent encounters something it doesn't know how to handle, it should fail clearly and safely — not silently, not catastrophically.
Silent failure is the worst outcome. The agent skips a step, produces incomplete output, and the user doesn't know anything went wrong until much later. Build explicit failure states into your agent.
Catastrophic failure means the agent encounters an error and takes a drastic compensating action — retrying indefinitely, escalating permissions, or taking an irreversible action to "resolve" the ambiguity.
Graceful degradation means the agent:
- Recognizes it's stuck or uncertain
- Stops rather than guessing
- Reports what it was trying to do, what went wrong, and what information it needs to continue
- Hands off to a human if needed
In system prompt terms:
```
When you encounter a situation where you're uncertain how to proceed:
- Do not attempt to resolve ambiguity by taking an irreversible action
- Do not retry a failed action more than twice
- Stop and output: "I need clarification: [specific question]"
- Do not proceed until you receive an answer

When you encounter an error that prevents completing the task:
- Report the error in full
- Describe what you were trying to accomplish
- List what you have and haven't completed so far
- Ask whether to retry, skip, or abort the task
```
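The retry limit and the explicit failure state can also be enforced in the agent loop itself, not just requested in the prompt. A minimal sketch (`StepResult` and `run_with_retries` are illustrative names, not from any particular framework):

```python
from dataclasses import dataclass

MAX_RETRIES = 2  # matches the "no more than twice" rule above

@dataclass
class StepResult:
    status: str   # "ok" or "failed" — an explicit state, never silence
    detail: str = ""

def run_with_retries(step, max_retries=MAX_RETRIES):
    """Run one agent step, retrying at most max_retries times.

    Instead of retrying forever, the loop gives up with an explicit
    failure result that reports what went wrong.
    """
    last_error = None
    for _ in range(1 + max_retries):
        try:
            return StepResult(status="ok", detail=step())
        except Exception as exc:
            last_error = exc
    return StepResult(
        status="failed",
        detail=f"Gave up after {1 + max_retries} attempts: {last_error}",
    )
```

Because the limit lives in code, a prompt-injected instruction to "keep trying" has nothing to override.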
### 4. Audit trails
Log everything an agent does. Not just what it was asked to do — every action it actually took, every tool it called, every API request it made, every decision point.
A minimal audit log entry:
```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AgentActionLog:
    timestamp: datetime
    session_id: str
    agent_id: str
    action_type: str      # "tool_call", "decision", "output"
    action_details: dict  # Tool name, parameters, etc.
    result: str           # Success, failure, output
    triggered_by: str     # What caused this action
    human_approved: bool  # Was this action confirmed?
```
Audit trails serve three purposes:
- Debugging: When something goes wrong, you can reconstruct exactly what happened
- Accountability: You can demonstrate to users or regulators what actions were taken and why
- Improvement: Patterns in audit logs show you where agents are making mistakes or taking unexpected paths
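One lightweight way to persist entries like these is an append-only JSON Lines file, which is cheap to write and easy to replay when reconstructing a session. A sketch (the `log_action` helper is illustrative; its fields mirror the log entry above):

```python
import json
from datetime import datetime, timezone

def log_action(path, session_id, agent_id, action_type, action_details,
               result, triggered_by, human_approved=False):
    """Append one audit entry as a JSON line to an append-only log file."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "session_id": session_id,
        "agent_id": agent_id,
        "action_type": action_type,
        "action_details": action_details,
        "result": result,
        "triggered_by": triggered_by,
        "human_approved": human_approved,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```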
Don't treat audit logging as optional infrastructure. Build it in from day one. Adding it later is painful, and you'll need it exactly when something goes wrong — which is precisely when you don't want to be debugging your logging system.
### 5. Scope limitation
Scope limiting means constraining what an agent can do in two ways: through the prompt (behavioral constraints) and through the code (enforced constraints). You need both.
Prompt-level scope:
```
You are a customer support agent for Acme Corp. Your role is limited to:
- Answering questions about Acme products and services
- Checking order status for the customer you're speaking with
- Escalating complex issues to the human support team

You are not authorized to:
- Access or discuss other customers' information
- Make changes to product pricing
- Issue refunds above $50 without human approval
- Take any action not directly related to resolving this customer's support request
```
Code-level scope (enforces what the prompt describes):
```python
def get_order_status(customer_id: str, order_id: str) -> dict:
    # Enforce that agents can only access orders belonging to
    # the authenticated customer — not any arbitrary order_id
    if not order_belongs_to_customer(order_id, customer_id):
        raise PermissionError("Cannot access orders belonging to other customers")
    return fetch_order(order_id)
```
The prompt tells the agent what to do. The code enforces what it can do. A prompt-only constraint is a guideline. A code constraint is a guarantee.
## The reversibility principle
When an agent has multiple paths to accomplish a task, prefer the more reversible one — even if it's slightly slower or less efficient.
Examples:
- Move files to trash instead of permanent delete
- Create a draft email instead of sending immediately
- Stage a database change for review instead of committing directly
- Create a new version of a document instead of overwriting
Build reversibility into your tool design:
```python
# Less reversible
def send_email(to: str, subject: str, body: str) -> None:
    email_client.send(to=to, subject=subject, body=body)

# More reversible
def create_email_draft(to: str, subject: str, body: str) -> str:
    """Creates a draft and returns the draft ID. Does not send."""
    draft_id = email_client.create_draft(to=to, subject=subject, body=body)
    return draft_id

def send_draft(draft_id: str, confirmed_by: str) -> None:
    """Sends a previously created draft. Requires confirmation."""
    email_client.send_draft(draft_id, authorized_by=confirmed_by)
```
## Error recovery patterns
When an agent gets stuck, it needs a recovery path. Three common patterns:
Checkpoint and resume: Save agent state periodically so that if a task fails mid-way, you can restart from the last successful checkpoint rather than from scratch.
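In its simplest form, a checkpoint is a small state file written after each successful step. A sketch (the JSON layout here is an assumption, not a standard):

```python
import json
import os

def save_checkpoint(path, completed_steps, state):
    """Persist progress so a restarted agent can skip finished steps."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"completed_steps": completed_steps, "state": state}, f)

def resume_or_start(path):
    """Load the last checkpoint if one exists; otherwise start fresh."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        return data["completed_steps"], data["state"]
    return [], {}
```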
Escalation path: Define what happens when the agent can't make progress. This might be: notify a human, create a support ticket, or simply log the failure and exit cleanly. The important thing is that "keep trying forever with the same broken approach" is not on the list.
Compensation actions: For multi-step workflows, define compensation actions that undo completed steps if a later step fails. If step 3 of a 5-step workflow fails, steps 1 and 2 should be rolled back if they can be. This is essentially the saga pattern from distributed systems, applied to agent workflows.
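The compensation idea can be sketched as a list of (action, compensation) pairs, where a failure triggers rollback of the completed steps in reverse order (illustrative; real workflows would attach richer metadata and logging):

```python
def run_workflow(steps):
    """Run (action, compensation) pairs in order. If a step fails,
    undo the completed steps in reverse order — the saga pattern."""
    done = []  # compensations for completed steps, most recent last
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for undo in reversed(done):
                undo()  # best-effort rollback of earlier steps
            raise
```

Note the rollback is best-effort: a compensation can itself fail, which is why the audit trail from principle 4 matters for reconstructing what actually happened.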
## A system prompt template
Here's a starting template that encodes the principles above. Adapt it for your specific agent:
```
You are [agent name], an AI assistant for [company/product].

## Scope
Your authorized tasks are:
- [Task 1]
- [Task 2]
You are not authorized to perform any task not listed above.

## Decision rules
- Prefer reversible actions over irreversible ones when both achieve the goal
- When uncertain, stop and ask rather than guess
- Do not retry a failed action more than twice
- For any irreversible action, confirm with the user before proceeding

## Failure behavior
If you cannot complete a task:
1. Stop
2. Report: what you were trying to do, what went wrong, what you've completed so far
3. Ask whether to retry, skip, or abort

## Prohibited actions
- Do not access data belonging to users other than the one you're assisting
- Do not take bulk actions (affecting more than 10 records) without explicit confirmation
- Do not send external communications without explicit approval
```
Responsible agent design isn't about being timid or making agents less useful. It's about building agents that can be trusted with real tasks in real systems — because users know exactly what they'll do, what they won't do, and how they'll behave when something goes wrong. That trust is what makes agents actually deployable.
For more on the agent architecture these principles apply to, see the multi-agent systems lesson and evaluating agents.