How to cut AI agent costs 60% — optimization techniques for multi-agent systems

The real cost of a multi-agent system isn't the model fee per call. It's the fact that every tool invocation is an LLM call. A research agent making 8 tool calls costs 8× what a single call costs — and that's before you account for the context window growing with each step.

Here's what that looks like in production: a 10-step research agent using Sonnet 4.6 costs about $0.05 per run. At 1,000 runs/day that's $1,500/month. I applied the seven techniques below and brought the same workload to $580/month — a 61% reduction — without changing the agent's behavior from the user's perspective.
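The arithmetic behind those figures is easy to check. This is a minimal sketch that assumes a 30-day month; `monthly_cost` and `reduction` are illustrative helper names, not part of any SDK.

```python
def monthly_cost(cost_per_run: float, runs_per_day: int, days: int = 30) -> float:
    """Monthly spend in dollars, assuming a fixed cost per run and a 30-day month."""
    return cost_per_run * runs_per_day * days


def reduction(before: float, after: float) -> float:
    """Fractional cost reduction going from `before` to `after`."""
    return (before - after) / before


baseline = monthly_cost(0.05, 1_000)  # per-run cost and volume from the paragraph above
optimized = 580.0                     # reported monthly cost after optimization
print(f"baseline: ${baseline:,.0f}/month, reduction: {reduction(baseline, optimized):.0%}")
```

Swap in your own per-run cost and volume to see what a given percentage is worth in dollars before deciding which techniques to implement first.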

The cost baseline

Before optimizing, understand exactly what you're paying for:

import anthropic

def track_cost(response: anthropic.types.Message) -> float:
    # Claude Sonnet 4.6 pricing
    INPUT_COST_PER_M = 3.0   # $3.00 per million input tokens
    OUTPUT_COST_PER_M = 15.0  # $15.00 per million output tokens
    
    input_tokens = response.usage.input_tokens
    output_tokens = response.usage.output_tokens
    
    return (input_tokens / 1_000_000 * INPUT_COST_PER_M) + \
           (output_tokens / 1_000_000 * OUTPUT_COST_PER_M)

Log this on every call. After a week, aggregate by agent step and you'll immediately see which steps are expensive. Usually it's one or two — the synthesis step and the tool calls with large results.
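One possible way to do that aggregation, assuming you log one record per call with a step name and the cost returned by `track_cost`. The record shape and the sample numbers are made up for illustration.

```python
from collections import defaultdict


def aggregate_by_step(records: list[dict]) -> list[tuple[str, float, int]]:
    """Sum logged costs per agent step, most expensive first.

    Each record is assumed to look like {"step": "synthesis", "cost": 0.012}.
    Returns (step, total_cost, call_count) tuples.
    """
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for rec in records:
        totals[rec["step"]] += rec["cost"]
        counts[rec["step"]] += 1
    return sorted(
        ((step, totals[step], counts[step]) for step in totals),
        key=lambda row: row[1],
        reverse=True,
    )


# Made-up log entries, for illustration only
log = [
    {"step": "classify", "cost": 0.0004},
    {"step": "search", "cost": 0.0090},
    {"step": "synthesis", "cost": 0.0210},
    {"step": "search", "cost": 0.0110},
]
for step, total, calls in aggregate_by_step(log):
    print(f"{step:<10} ${total:.4f} over {calls} call(s)")
```

Sorting by total rather than per-call cost surfaces cheap steps that run very often as well as the obviously expensive ones.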

Technique 1: Route tasks to the cheapest capable model

Claude Haiku 4.5 is priced well below Sonnet 4.6 (roughly a third of the per-token rate at the time of writing; check the current pricing page) and handles classification, routing, extraction, and simple transformation tasks well. Sonnet is only necessary for generation, complex reasoning, and synthesis.

from enum import Enum

class TaskType(Enum):
    CLASSIFY = "classify"
    EXTRACT = "extract"
    ROUTE = "route"
    GENERATE = "generate"
    SYNTHESIZE = "synthesize"
    REASON = "reason"

CHEAP_TASKS = {TaskType.CLASSIFY, TaskType.EXTRACT, TaskType.ROUTE}
EXPENSIVE_TASKS = {TaskType.GENERATE, TaskType.SYNTHESIZE, TaskType.REASON}

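def get_model(task_type: TaskType) -> str:
    if task_type in CHEAP_TASKS:
        return "claude-haiku-4-5-20251001"  # cheaper tier for simple tasks
    return "claude-sonnet-4-6"

# In your agent:
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment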
def classify_intent(message: str) -> str:
    response = client.messages.create(
        model=get_model(TaskType.CLASSIFY),  # Haiku
        max_tokens=20,
        messages=[{"role": "user", "content": f"Classify as support/sales/other: {message}"}],
    )
    return response.content[0].text.strip()

def generate_reply(context: dict) -> str:
    response = client.messages.create(
        model=get_model(TaskType.GENERATE),  # Sonnet
        max_tokens=500,
        messages=[{"role": "user", "content": f"Write a reply to: {context}"}],
    )
    return response.content[0].text

Savings: 40–60% on classification-heavy pipelines. The model routing post covers automatic routing with LiteLLM.

Technique 2: Cache tool results

If your agent calls get_weather("Mumbai") 100 times today, that's 100 external API calls, plus 100 LLM calls if the tool uses a model to process the result. Cache the tool output for a sensible TTL and pay for it once.

import redis
import json
import hashlib
import functools
from typing import Callable, Any

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cached_tool(ttl_seconds: int = 300):
    """Decorator: cache tool call results in Redis by input hash."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Create a cache key from function name + arguments
            # (default=str keeps non-JSON arguments such as dates from raising)
            key_data = json.dumps({"fn": fn.__name__, "args": args, "kwargs": kwargs}, sort_keys=True, default=str)
            cache_key = f"tool:{hashlib.md5(key_data.encode()).hexdigest()}"
            
            cached = r.get(cache_key)
            if cached is not None:
                return json.loads(cached)
            
            result = fn(*args, **kwargs)
            r.setex(cache_key, ttl_seconds, json.dumps(result))
            return result
        return wrapper
    return decorator

# Wrap your tool functions:
@cached_tool(ttl_seconds=1800)  # cache 30 minutes
def get_weather(city: str) -> dict:
    # ... actual API call
    return {"temp": 32, "condition": "sunny"}

@cached_tool(ttl_seconds=86400)  # cache 24 hours
def get_company_info(domain: str) -> dict:
    # ... lookup
    return {"name": "...", "industry": "..."}

For tools that make external HTTP calls, this also reduces latency significantly — cached responses return in under 1ms vs 200–500ms for live calls.

Savings: varies by tool call frequency. In a pipeline processing similar inputs (e.g., classifying 1,000 support tickets), you might see 70%+ cache hits.
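To find out what hit rate you would actually get before standing up Redis, an in-process variant with counters can help. This is a sketch under assumptions: `memory_cached_tool` and `lookup_city` are hypothetical names, the cache lives in one process, and it is not thread-safe.

```python
import functools
import json
import time
from typing import Any, Callable


def memory_cached_tool(ttl_seconds: float = 300, clock: Callable[[], float] = time.monotonic):
    """In-process TTL cache with hit/miss counters (illustrative, not thread-safe)."""
    def decorator(fn: Callable) -> Callable:
        store: dict[str, tuple[float, Any]] = {}
        stats = {"hits": 0, "misses": 0}

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # One store per decorated function, so the key only needs the arguments
            key = json.dumps({"args": args, "kwargs": kwargs}, sort_keys=True, default=str)
            entry = store.get(key)
            now = clock()
            if entry is not None and now - entry[0] < ttl_seconds:
                stats["hits"] += 1
                return entry[1]
            stats["misses"] += 1
            result = fn(*args, **kwargs)
            store[key] = (now, result)
            return result

        wrapper.stats = stats  # expose counters so the hit rate can be logged
        return wrapper
    return decorator


@memory_cached_tool(ttl_seconds=1800)
def lookup_city(city: str) -> dict:
    # Stand-in for a slow external call
    return {"city": city, "temp": 32}


for name in ["Mumbai", "Mumbai", "Pune", "Mumbai"]:
    lookup_city(name)

stats = lookup_city.stats
hit_rate = stats["hits"] / (stats["hits"] + stats["misses"])
print(f"hit rate: {hit_rate:.0%}")
```

If the measured hit rate on real traffic is low, the Redis layer is unlikely to pay for itself for that tool.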

Technique 3: Prompt caching for shared system prompts

If all your agent calls share a system prompt longer than 1,024 tokens (the minimum cacheable length varies by model), Claude's prompt caching saves up to 90% on those input tokens. On cache hits, the cached prefix is charged at 0.1× the normal input rate.

# Enable caching by adding cache_control to the system message
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1000,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # your 2,000-token system prompt
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }
    ],
    messages=messages,
)

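The first call pays a premium to write the cache (1.25× the normal input rate for the default 5-minute cache). Subsequent calls within 5 minutes pay 0.1×, and each hit refreshes the timer. For a pipeline that makes 500 calls per hour all sharing the same system prompt, this cuts input costs by ~85% on those tokens.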
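A rough way to estimate the saving for your own prompt size and call volume. The sketch assumes one cache write followed only by hits, a 1.25× write multiplier for the 5-minute cache, and the $3.00 input rate from the baseline; `cached_prompt_cost` is an illustrative name.

```python
def cached_prompt_cost(
    prompt_tokens: int,
    calls: int,
    input_rate_per_m: float = 3.0,   # base input price per million tokens (from the baseline)
    write_multiplier: float = 1.25,  # assumed cache-write premium for the 5-minute cache
    read_multiplier: float = 0.1,    # cache-hit multiplier quoted above
) -> tuple[float, float]:
    """Return (uncached_cost, cached_cost) in dollars for the shared prefix only.

    Assumes one cache write followed by `calls - 1` hits, i.e. the cache never expires.
    """
    per_call = prompt_tokens / 1_000_000 * input_rate_per_m
    uncached = per_call * calls
    cached = per_call * write_multiplier + per_call * read_multiplier * (calls - 1)
    return uncached, cached


uncached, cached = cached_prompt_cost(prompt_tokens=2_000, calls=500)
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}, saving {1 - cached / uncached:.1%}")
```

This is the best case. Any gap longer than the cache lifetime triggers another write, so real pipelines land somewhat below this figure.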

See the Batch API post for how this interacts with batch pricing.

Technique 4: Truncate tool results before passing to the model

This is the mistake I see most in production agents: the tool makes an API call, returns a 4,000-token JSON response, and the agent passes the whole thing to the next LLM call. Now your context window is full of irrelevant nested objects.

Extract only what the model needs inside the tool function:

import requests

API_BASE = "https://api.example.com"  # placeholder; requests needs an absolute URL

# Bad: passes entire API response to model
def get_order(order_id: str) -> dict:
    return requests.get(f"{API_BASE}/api/orders/{order_id}", timeout=10).json()  # 3KB of nested JSON

# Good: extract relevant fields
def get_order(order_id: str) -> dict:
    data = requests.get(f"{API_BASE}/api/orders/{order_id}", timeout=10).json()
    return {
        "order_id": data["id"],
        "status": data["status"],
        "items": [{"sku": i["sku"], "qty": i["quantity"]} for i in data["line_items"]],
        # Rupees to paise; round() avoids float truncation (int(19.99 * 100) gives 1998)
        "total_paise": round(float(data["total"]) * 100),
        "estimated_delivery": (data.get("shipping") or {}).get("estimated_date"),
    }

Also cap text-based results:

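def fetch_webpage(url: str) -> str:
    # Assumes `firecrawl` is a configured scraping client whose scrape() returns the page text as a str
    content = firecrawl.scrape(url)
    return content[:3000]  # 3K chars is enough context for most tasks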

Savings: 30–50% on input tokens for tool-heavy agents.
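If you have many tools, writing a hand-picked dict for each one gets repetitive. One option is a pair of small generic helpers, sketched here; `pick` and `cap_text` are hypothetical names and the sample payload is made up.

```python
from typing import Any


def pick(data: dict, paths: list[str]) -> dict:
    """Keep only the dotted paths listed in `paths`; missing paths are skipped."""
    out: dict[str, Any] = {}
    for path in paths:
        node: Any = data
        for part in path.split("."):
            if not isinstance(node, dict) or part not in node:
                break
            node = node[part]
        else:
            # Loop finished without a break, so the full path exists
            out[path] = node
    return out


def cap_text(text: str, max_chars: int = 3000) -> str:
    """Truncate at the last space before `max_chars` and mark the cut."""
    if len(text) <= max_chars:
        return text
    cut = text.rfind(" ", 0, max_chars)
    if cut <= 0:
        cut = max_chars
    return text[:cut] + " [truncated]"


# Made-up API payload, for illustration only
raw = {
    "id": "A1",
    "status": "shipped",
    "shipping": {"estimated_date": "2025-01-05", "carrier": {"name": "X"}},
    "audit": ["..."],
}
print(pick(raw, ["id", "status", "shipping.estimated_date", "shipping.tracking"]))
print(cap_text("alpha beta gamma", max_chars=12))
```

Marking the cut with an explicit tag tells the model the text is incomplete, which tends to be safer than a silent mid-word slice.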

Technique 5: Summarize conversation history

Multi-turn agents accumulate context. By turn 15, you might be paying for 8,000 tokens of history on every call even though the last 3 turns contain everything relevant.

import json

def get_effective_history(messages: list, max_turns: int = 8) -> list:
    if len(messages) <= max_turns * 2:
        return messages
    
    # Summarize older turns. Caveat: a plain slice can separate a tool_use block
    # from its tool_result; adjust the cut point if your history contains tool calls.
    old_messages = messages[:-max_turns * 2]
    recent_messages = messages[-max_turns * 2:]
    
    summary_response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Cheap model for summarization
        max_tokens=200,
        messages=[{
            "role": "user",
            # default=str lets SDK content-block objects serialize instead of raising
            "content": f"Summarize the key context from this conversation in 3 sentences: {json.dumps(old_messages, default=str)}",
        }],
    )
    summary = summary_response.content[0].text
    
    return [
        {"role": "user", "content": f"[Earlier conversation summary: {summary}]"},
        {"role": "assistant", "content": "Understood. I'll keep that context in mind."},
        *recent_messages,
    ]

Use this in your agent loop before each LLM call:

messages = get_effective_history(messages, max_turns=8)
response = client.messages.create(..., messages=messages)

Savings: 40–60% on long conversations. The reduction grows as conversations get longer.
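To see why the saving grows with conversation length, here is a back-of-the-envelope model. It assumes every turn adds the same number of tokens and ignores the cost of the summarization calls; the 500-token turn size and 150-token summary are illustrative numbers, not measurements.

```python
from typing import Optional


def history_tokens(
    turns: int,
    tokens_per_turn: int,
    window: Optional[int] = None,
    summary_tokens: int = 0,
) -> int:
    """Total history tokens sent across `turns` calls.

    Call k resends all k turns so far. With `window` set, only the last
    `window` turns are resent, plus a fixed-size summary once older turns exist.
    """
    total = 0
    for k in range(1, turns + 1):
        if window is None or k <= window:
            total += k * tokens_per_turn
        else:
            total += window * tokens_per_turn + summary_tokens
    return total


full = history_tokens(turns=30, tokens_per_turn=500)
capped = history_tokens(turns=30, tokens_per_turn=500, window=8, summary_tokens=150)
print(f"full: {full:,} tokens, capped: {capped:,} tokens, saving {1 - capped / full:.0%}")
```

Full history grows quadratically with the number of turns, while the capped version grows linearly, so the gap keeps widening.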

Technique 6: Exit early when confidence is high

Don't run all 10 steps of your agent if the answer is clear after step 3. Add a confidence signal to your intermediate tool results and stop when it's high enough.

CONFIDENCE_THRESHOLD = 0.85

def run_agent_with_early_exit(question: str, max_steps: int = 10) -> dict:
    messages = [{"role": "user", "content": question}]
    steps = 0
    
    while steps < max_steps:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=500,
            system=SYSTEM_PROMPT,
            tools=tools,
            messages=messages,
        )
        steps += 1
        
        if response.stop_reason == "end_turn":
            return {"answer": response.content[0].text, "steps": steps}
        
        # process_tool_calls (your own helper) is assumed to return dicts with
        # "tool_use_id", "content", and an optional "confidence" score in [0, 1]
        tool_results = process_tool_calls(response.content)
        # Send only fields the API accepts; "confidence" is our own metadata
        result_blocks = [
            {"type": "tool_result", "tool_use_id": r["tool_use_id"], "content": r["content"]}
            for r in tool_results
        ]
        messages.append({"role": "assistant", "content": response.content})
        
        # If the latest result has high confidence, ask Claude to conclude
        if any(r.get("confidence", 0) > CONFIDENCE_THRESHOLD for r in tool_results):
            messages.append({
                "role": "user",
                "content": [
                    *result_blocks,
                    {"type": "text", "text": "Based on the findings so far, you have enough information to give a complete answer. Please respond now."}
                ]
            })
            final = client.messages.create(
                model="claude-sonnet-4-6", max_tokens=500, 
                system=SYSTEM_PROMPT, tools=tools, messages=messages,
            )
            # The model may still emit non-text blocks, so collect text blocks only
            answer = "".join(b.text for b in final.content if b.type == "text")
            return {"answer": answer, "steps": steps + 1}
        
        messages.append({"role": "user", "content": result_blocks})
    
    return {"answer": "Reached step limit", "steps": steps}

Savings: 20–40% on average, depending on your workload distribution.
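The loop relies on a `process_tool_calls` helper that the article does not show. One way it might look is sketched below, with the caveats that the `registry` parameter is an addition of this sketch (the loop above passes only the content) and that the "confidence" key is a convention of your own tools, not an API field.

```python
from types import SimpleNamespace
from typing import Any, Callable


def process_tool_calls(content: list[Any], registry: dict[str, Callable[..., dict]]) -> list[dict]:
    """Run every tool_use block and return result dicts with an optional confidence.

    Each tool is assumed to return {"content": str, "confidence": float}.
    """
    results = []
    for block in content:
        if getattr(block, "type", None) != "tool_use":
            continue  # skip text blocks
        output = registry[block.name](**block.input)
        results.append({
            "tool_use_id": block.id,
            "content": output["content"],
            "confidence": output.get("confidence", 0.0),
        })
    return results


# Stand-in tool and a fake response, for illustration only
def search_docs(query: str) -> dict:
    return {"content": f"1 exact match for {query!r}", "confidence": 0.9}


fake_content = [
    SimpleNamespace(type="text", text="Let me look that up."),
    SimpleNamespace(type="tool_use", id="toolu_demo", name="search_docs", input={"query": "refund policy"}),
]
results = process_tool_calls(fake_content, {"search_docs": search_docs})
print(results)
```

To keep the single-argument call used in the loop, bind the registry once with `functools.partial`. How each tool derives its confidence is up to you: an exact-match lookup can report a high score, a fuzzy search a lower one.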

Technique 7: Batch non-real-time workloads

The Anthropic Batch API gives 50% off all input and output tokens for asynchronous workloads. If your agent is processing documents, generating reports, or running analysis that doesn't need immediate results, batch it.

import anthropic

client = anthropic.Anthropic()

# Prepare a batch of 500 documents to process
batch_requests = [
    {
        "custom_id": f"doc-{i}",
        "params": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 500,
            "messages": [{"role": "user", "content": f"Extract key facts from: {doc_text}"}],
        }
    }
    for i, doc_text in enumerate(documents)
]

# Submit batch (returns in minutes to hours)
batch = client.messages.batches.create(requests=batch_requests)
print(f"Batch submitted: {batch.id}")

# Poll for completion and process results
import time
while True:
    status = client.messages.batches.retrieve(batch.id)
    if status.processing_status == "ended":
        break
    time.sleep(60)

for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        print(f"{result.custom_id}: {result.result.message.content[0].text}")

Savings: 50% on all token costs for batched workloads.
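For larger jobs you will probably want to split the work into several batches. The sketch below builds the request dicts and chunks them; the 10,000 cap is a conservative assumption rather than the documented limit, and `build_requests` and `chunked` are illustrative names.

```python
from typing import Iterator

MAX_REQUESTS_PER_BATCH = 10_000  # conservative assumed cap; check the current Batch API limits


def build_requests(documents: list[str], model: str = "claude-sonnet-4-6", max_tokens: int = 500) -> list[dict]:
    """Build one batch request per document, keyed by a stable custom_id."""
    return [
        {
            "custom_id": f"doc-{i}",
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": f"Extract key facts from: {text}"}],
            },
        }
        for i, text in enumerate(documents)
    ]


def chunked(requests: list[dict], size: int = MAX_REQUESTS_PER_BATCH) -> Iterator[list[dict]]:
    """Yield successive slices of at most `size` requests."""
    for start in range(0, len(requests), size):
        yield requests[start:start + size]


docs = [f"document {n}" for n in range(25)]
batches = list(chunked(build_requests(docs), size=10))
print([len(b) for b in batches])
```

Each chunk can then be passed to `client.messages.batches.create(requests=chunk)` as in the example above. Because `custom_id` is derived from the document index, results from different batches can be merged back in order.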

Before and after: the cost table

Step	Before	After	Technique applied
Intent classification	Sonnet, no cache	Haiku	Model routing
Weather API (50×/day)	50 LLM calls	1 LLM call + 49 cache hits	Tool caching
System prompt (2K tokens × 500 calls)	Full price	90% cache hits	Prompt caching
Search results (5K tokens each)	Full context	Truncated to 500 tokens	Result truncation
Conversation history (15 turns)	Full history	Summarized to 3 sentences	History summarization
Research agent (10 steps avg)	Always 10 steps	4.2 steps avg	Early exit
Nightly document processing	Real-time	Batch API	Batch processing

Total estimated reduction: 58–65% depending on workload.

The model routing guide covers automatic routing with LiteLLM if you want to make model selection dynamic rather than hardcoded by task type.



Related articles

A/B Testing Prompts in Production — A Statistical Guide

Async Python for LLM Apps — Patterns That Actually Work in Production

FastAPI + Claude API — Production Patterns for AI Backends
