An AI agent that generates bad text causes a problem you can fix by editing text. An AI agent that sends an email, deletes a file, or submits a form causes a problem that may not be fixable at all. This is why agent safety is fundamentally different from prompt safety — and why it demands more deliberate design.
Most of the failure modes discussed in prompt injection and jailbreaking are about getting models to say things they shouldn't. Agent failures are about getting models to do things they shouldn't. The stakes are higher and the blast radius is wider. A well-designed agent fails safely. A poorly designed one can cause irreversible damage in seconds.
## Why agent safety is different
When an agent has tools — web search, file system access, database writes, API calls, email sending — every action it takes has real-world consequences. The agent isn't just generating text for a human to evaluate; it's changing state in systems that other people depend on.
Three properties make agent failures more dangerous than text generation failures:
Automation amplifies mistakes. An agent can take 50 wrong actions in the time it takes a human to notice the first one. Errors compound.
Actions may be irreversible. You can't unsend an email. You can't easily undo a database deletion without a backup. You can't un-notify 10,000 users about a false emergency.
Agents are exploitable. An agent browsing the web or reading emails can be fed adversarial content designed to hijack its actions — this is prompt injection at the agent level, and it's more dangerous when the agent has write permissions.
## The 5 principles of responsible agent design
### 1. Minimal footprint
Request only the permissions the agent actually needs for its task. If an agent's job is to read customer records and generate summaries, it shouldn't have write access to the customer database. If it needs to send one type of email, it shouldn't have access to the full contact list.
This principle applies to:
- API scopes and permissions: Request read-only tokens when writes aren't needed
- File system access: Scope to specific directories, not the whole system
- Database access: Grant SELECT on specific tables, not full admin
- Tool availability: Don't give an agent tools it doesn't need for the current task
Minimal footprint limits the blast radius of a failure. An agent that can only read can't corrupt your data. An agent scoped to /tmp/agent-workspace/ can't accidentally delete production files.
In practice, this means parameterizing tool access per-task rather than giving every agent instance maximum permissions:
```python
# Instead of: agent.tools = all_tools
# Do:
tools_for_task = {
    "read_customer": customer_read_tool,
    "search_knowledge_base": kb_search_tool,
    # NOT: write_customer, delete_customer, send_email
}
agent = create_agent(tools=tools_for_task)
```
### 2. Human-in-the-loop checkpoints
Not every action needs human approval, but some do. The key is categorizing actions by consequence and requiring confirmation for high-consequence, low-reversibility actions.
Categories to think about:
| Action type | Example | Checkpoint needed? |
|---|---|---|
| Read-only lookup | Search knowledge base | No |
| Low-stakes write | Save a draft | Maybe |
| External communication | Send email | Yes |
| Irreversible deletion | Delete record | Yes, with explicit confirmation |
| Financial action | Process payment | Yes, with authorization |
| Bulk operation | Update 1000 records | Yes, with preview |
Implement checkpoints as explicit pause points in your agent loop, not just as instructions in the system prompt. Instructions can be overridden by sufficiently clever prompt injection. Code can't.
```python
# Actions in this set require human approval before execution
# (illustrative set — adapt to your own action types):
REQUIRES_CONFIRMATION = {"send_email", "delete_record", "process_payment", "bulk_update"}

async def execute_action(action, context):
    if action.type in REQUIRES_CONFIRMATION:
        confirmed = await request_human_approval(
            action=action,
            context=context,
            timeout_seconds=300,  # Auto-deny if no response
        )
        if not confirmed:
            return ActionResult(status="denied", reason="human_declined")
    return action.execute()
```
### 3. Graceful degradation
When an agent encounters something it doesn't know how to handle, it should fail clearly and safely — not silently, not catastrophically.
Silent failure is the worst outcome. The agent skips a step, produces incomplete output, and the user doesn't know anything went wrong until much later. Build explicit failure states into your agent.
Catastrophic failure means the agent encounters an error and takes a drastic compensating action — retrying indefinitely, escalating permissions, or taking an irreversible action to "resolve" the ambiguity.
Graceful degradation means the agent:
- Recognizes it's stuck or uncertain
- Stops rather than guessing
- Reports what it was trying to do, what went wrong, and what information it needs to continue
- Hands off to a human if needed
In system prompt terms:
```
When you encounter a situation where you're uncertain how to proceed:
- Do not attempt to resolve ambiguity by taking an irreversible action
- Do not retry a failed action more than twice
- Stop and output: "I need clarification: [specific question]"
- Do not proceed until you receive an answer

When you encounter an error that prevents completing the task:
- Report the error in full
- Describe what you were trying to accomplish
- List what you have and haven't completed so far
- Ask whether to retry, skip, or abort the task
```
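The retry limit and the explicit failure state can also be enforced in the agent loop itself, not just requested in the prompt. A minimal sketch (`StepResult` and `run_with_retries` are illustrative names, not from any particular framework):

```python
from dataclasses import dataclass

MAX_RETRIES = 2  # matches the "no more than twice" rule above

@dataclass
class StepResult:
    status: str   # "ok" or "failed" — an explicit state, never silence
    detail: str = ""

def run_with_retries(step, max_retries=MAX_RETRIES):
    """Run one agent step, retrying at most max_retries times.

    Instead of retrying forever, the loop gives up with an explicit
    failure result that reports what went wrong.
    """
    last_error = None
    for _ in range(1 + max_retries):
        try:
            return StepResult(status="ok", detail=step())
        except Exception as exc:
            last_error = exc
    return StepResult(
        status="failed",
        detail=f"Gave up after {1 + max_retries} attempts: {last_error}",
    )
```

Because the limit lives in code, a prompt-injected instruction to "keep trying" has nothing to override.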
### 4. Audit trails
Log everything an agent does. Not just what it was asked to do — every action it actually took, every tool it called, every API request it made, every decision point.
A minimal audit log entry:
```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AgentActionLog:
    timestamp: datetime
    session_id: str
    agent_id: str
    action_type: str      # "tool_call", "decision", "output"
    action_details: dict  # Tool name, parameters, etc.
    result: str           # Success, failure, output
    triggered_by: str     # What caused this action
    human_approved: bool  # Was this action confirmed?
```
Audit trails serve three purposes:
- Debugging: When something goes wrong, you can reconstruct exactly what happened
- Accountability: You can demonstrate to users or regulators what actions were taken and why
- Improvement: Patterns in audit logs show you where agents are making mistakes or taking unexpected paths
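One lightweight way to persist entries like these is an append-only JSON Lines file, which is cheap to write and easy to replay when reconstructing a session. A sketch (the `log_action` helper is illustrative; its fields mirror the log entry above):

```python
import json
from datetime import datetime, timezone

def log_action(path, session_id, agent_id, action_type, action_details,
               result, triggered_by, human_approved=False):
    """Append one audit entry as a JSON line to an append-only log file."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "session_id": session_id,
        "agent_id": agent_id,
        "action_type": action_type,
        "action_details": action_details,
        "result": result,
        "triggered_by": triggered_by,
        "human_approved": human_approved,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```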
Don't treat audit logging as optional infrastructure. Build it in from day one. Adding it later is painful, and you'll need it exactly when something goes wrong — which is precisely when you don't want to be debugging your logging system.
### 5. Scope limitation
Scope limiting means constraining what an agent can do in two ways: through the prompt (behavioral constraints) and through the code (enforced constraints). You need both.
Prompt-level scope:
```
You are a customer support agent for Acme Corp. Your role is limited to:
- Answering questions about Acme products and services
- Checking order status for the customer you're speaking with
- Escalating complex issues to the human support team

You are not authorized to:
- Access or discuss other customers' information
- Make changes to product pricing
- Issue refunds above $50 without human approval
- Take any action not directly related to resolving this customer's support request
```
Code-level scope (enforces what the prompt describes):
```python
def get_order_status(customer_id: str, order_id: str) -> dict:
    # Enforce that agents can only access orders belonging to
    # the authenticated customer — not any arbitrary order_id
    if not order_belongs_to_customer(order_id, customer_id):
        raise PermissionError("Cannot access orders belonging to other customers")
    return fetch_order(order_id)
```
The prompt tells the agent what to do. The code enforces what it can do. A prompt-only constraint is a guideline. A code constraint is a guarantee.
## The reversibility principle
When an agent has multiple paths to accomplish a task, prefer the more reversible one — even if it's slightly slower or less efficient.
Examples:
- Move files to trash instead of permanent delete
- Create a draft email instead of sending immediately
- Stage a database change for review instead of committing directly
- Create a new version of a document instead of overwriting
Build reversibility into your tool design:
```python
# Less reversible
def send_email(to: str, subject: str, body: str) -> None:
    email_client.send(to=to, subject=subject, body=body)

# More reversible
def create_email_draft(to: str, subject: str, body: str) -> str:
    """Creates a draft and returns the draft ID. Does not send."""
    draft_id = email_client.create_draft(to=to, subject=subject, body=body)
    return draft_id

def send_draft(draft_id: str, confirmed_by: str) -> None:
    """Sends a previously created draft. Requires confirmation."""
    email_client.send_draft(draft_id, authorized_by=confirmed_by)
```
## Error recovery patterns
When an agent gets stuck, it needs a recovery path. Three common patterns:
Checkpoint and resume: Save agent state periodically so that if a task fails mid-way, you can restart from the last successful checkpoint rather than from scratch.
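In its simplest form, a checkpoint is a small state file written after each successful step. A sketch (the JSON layout here is an assumption, not a standard):

```python
import json
import os

def save_checkpoint(path, completed_steps, state):
    """Persist progress so a restarted agent can skip finished steps."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"completed_steps": completed_steps, "state": state}, f)

def resume_or_start(path):
    """Load the last checkpoint if one exists; otherwise start fresh."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        return data["completed_steps"], data["state"]
    return [], {}
```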
Escalation path: Define what happens when the agent can't make progress. This might be: notify a human, create a support ticket, or simply log the failure and exit cleanly. The important thing is that "keep trying forever with the same broken approach" is not on the list.
Compensation actions: For multi-step workflows, define compensation actions that undo completed steps if a later step fails. If step 3 of a 5-step workflow fails, steps 1 and 2 should be rolled back if they can be. This is essentially the saga pattern from distributed systems, applied to agent workflows.
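The compensation idea can be sketched as a list of (action, compensation) pairs, where a failure triggers rollback of the completed steps in reverse order (illustrative; real workflows would attach richer metadata and logging):

```python
def run_workflow(steps):
    """Run (action, compensation) pairs in order. If a step fails,
    undo the completed steps in reverse order — the saga pattern."""
    done = []  # compensations for completed steps, most recent last
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for undo in reversed(done):
                undo()  # best-effort rollback of earlier steps
            raise
```

Note the rollback is best-effort: a compensation can itself fail, which is why the audit trail from principle 4 matters for reconstructing what actually happened.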
## A system prompt template
Here's a starting template that encodes the principles above. Adapt it for your specific agent:
```
You are [agent name], an AI assistant for [company/product].

## Scope
Your authorized tasks are:
- [Task 1]
- [Task 2]
You are not authorized to perform any task not listed above.

## Decision rules
- Prefer reversible actions over irreversible ones when both achieve the goal
- When uncertain, stop and ask rather than guess
- Do not retry a failed action more than twice
- For any irreversible action, confirm with the user before proceeding

## Failure behavior
If you cannot complete a task:
1. Stop
2. Report: what you were trying to do, what went wrong, what you've completed so far
3. Ask whether to retry, skip, or abort

## Prohibited actions
- Do not access data belonging to users other than the one you're assisting
- Do not take bulk actions (affecting more than 10 records) without explicit confirmation
- Do not send external communications without explicit approval
```
Responsible agent design isn't about being timid or making agents less useful. It's about building agents that can be trusted with real tasks in real systems — because users know exactly what they'll do, what they won't do, and how they'll behave when something goes wrong. That trust is what makes agents actually deployable.
For more on the agent architecture these principles apply to, see the multi-agent systems lesson and evaluating agents.