The real cost of a multi-agent system isn't the model fee per call. It's the fact that every tool invocation is an LLM call. A research agent making 8 tool calls costs 8× what a single call costs — and that's before you account for the context window growing with each step.
Here's what that looks like in production: a 10-step research agent using Sonnet 4.6 costs about $0.05 per run. At 1,000 runs/day that's $1,500/month. I applied the seven techniques below and brought the same workload to $580/month — a 61% reduction — without changing the agent's behavior from the user's perspective.
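To make the compounding concrete, here is a rough sketch of how input tokens pile up over a multi-step run. The token counts (`base_prompt_tokens`, `tokens_added_per_step`, `output_tokens_per_step`) are illustrative assumptions of mine, not measurements; the prices are the Sonnet rates quoted in this post.

```python
def estimate_run_cost(
    steps: int,
    base_prompt_tokens: int = 500,      # assumed: system prompt + user question
    tokens_added_per_step: int = 200,   # assumed: tool call + result appended per step
    output_tokens_per_step: int = 100,  # assumed: model output per step
    input_cost_per_m: float = 3.0,
    output_cost_per_m: float = 15.0,
) -> float:
    """Estimate the dollar cost of one agent run when context grows every step."""
    total_input = 0
    for step in range(steps):
        # Each call re-sends the base prompt plus everything appended so far.
        total_input += base_prompt_tokens + step * tokens_added_per_step
    total_output = steps * output_tokens_per_step
    return (total_input / 1_000_000 * input_cost_per_m
            + total_output / 1_000_000 * output_cost_per_m)


one_call = estimate_run_cost(1)
eight_calls = estimate_run_cost(8)
print(f"8 steps cost {eight_calls / one_call:.1f}x a single call")
```

Under these assumed numbers the 8-step run costs well over 8× a single call, because every step re-sends the history that came before it.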
The cost baseline
Before optimizing, understand exactly what you're paying for:
import anthropic

def track_cost(response: anthropic.types.Message) -> float:
    # Claude Sonnet 4.6 pricing
    INPUT_COST_PER_M = 3.0    # $3.00 per million input tokens
    OUTPUT_COST_PER_M = 15.0  # $15.00 per million output tokens
    input_tokens = response.usage.input_tokens
    output_tokens = response.usage.output_tokens
    return (input_tokens / 1_000_000 * INPUT_COST_PER_M) + \
           (output_tokens / 1_000_000 * OUTPUT_COST_PER_M)
Log this on every call. After a week, aggregate by agent step and you'll immediately see which steps are expensive. Usually it's one or two — the synthesis step and the tool calls with large results.
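The aggregation can be as simple as summing logged costs per step. This is a minimal sketch that assumes each log record is a dict with a `step` name and a `cost_usd` value; that record shape is my assumption, so adapt it to whatever your logger writes.

```python
from collections import defaultdict


def aggregate_cost_by_step(records: list[dict]) -> list[tuple[str, float]]:
    """Sum logged costs per agent step, most expensive step first."""
    totals: dict[str, float] = defaultdict(float)
    for rec in records:
        totals[rec["step"]] += rec["cost_usd"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical log records for illustration
log = [
    {"step": "synthesize", "cost_usd": 0.02},
    {"step": "classify", "cost_usd": 0.001},
    {"step": "synthesize", "cost_usd": 0.03},
]
for step, cost in aggregate_cost_by_step(log):
    print(f"{step}: ${cost:.4f}")
```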
Technique 1: Route tasks to the cheapest capable model
Claude Haiku 4.5 costs about 3× less than Sonnet 4.6 ($1/$5 vs. $3/$15 per million input/output tokens) and handles classification, routing, extraction, and simple transformation tasks well. Reserve Sonnet for generation, complex reasoning, and synthesis.
from enum import Enum

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

class TaskType(Enum):
    CLASSIFY = "classify"
    EXTRACT = "extract"
    ROUTE = "route"
    GENERATE = "generate"
    SYNTHESIZE = "synthesize"
    REASON = "reason"

CHEAP_TASKS = {TaskType.CLASSIFY, TaskType.EXTRACT, TaskType.ROUTE}
EXPENSIVE_TASKS = {TaskType.GENERATE, TaskType.SYNTHESIZE, TaskType.REASON}

def get_model(task_type: TaskType) -> str:
    if task_type in CHEAP_TASKS:
        return "claude-haiku-4-5-20251001"  # cheaper tier
    return "claude-sonnet-4-6"

# In your agent:
def classify_intent(message: str) -> str:
    response = client.messages.create(
        model=get_model(TaskType.CLASSIFY),  # Haiku
        max_tokens=20,
        messages=[{"role": "user", "content": f"Classify as support/sales/other: {message}"}],
    )
    return response.content[0].text.strip()

def generate_reply(context: dict) -> str:
    response = client.messages.create(
        model=get_model(TaskType.GENERATE),  # Sonnet
        max_tokens=500,
        messages=[{"role": "user", "content": f"Write a reply to: {context}"}],
    )
    return response.content[0].text
Savings: 40–60% on classification-heavy pipelines. The model routing post covers automatic routing with LiteLLM.
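To sanity-check a savings estimate before rerouting anything, a back-of-the-envelope calculation helps. The sketch below assumes you know what share of current spend goes to tasks a cheaper model can handle (`cheap_share`) and the price ratio between the two models (`price_ratio`). Both names are mine, and it further assumes token counts stay the same on either model; plug in numbers from your own logs and the current pricing page.

```python
def routing_savings(cheap_share: float, price_ratio: float) -> float:
    """Fraction of total spend saved by moving `cheap_share` of spend to a
    model that costs `price_ratio` times the original (0.25 = a quarter)."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    if not 0.0 < price_ratio <= 1.0:
        raise ValueError("price_ratio must be in (0, 1]")
    return cheap_share * (1.0 - price_ratio)


# Hypothetical inputs: 60% of spend is routable, cheaper model costs a quarter
print(f"{routing_savings(0.6, 0.25):.0%} of spend saved")
```

The useful takeaway is that the saving is capped by how much of your spend is actually routable, so measure that share first.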
Technique 2: Cache tool results
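If your agent calls get_weather("Mumbai") 100 times today, that's 100 identical API calls, each followed by an LLM call to process the result. Cache the tool output for a sensible TTL and you pay for the external lookup once; the model still reads the result on each run unless you also cache its processed output.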
import redis
import json
import hashlib
import functools
from typing import Callable, Any

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cached_tool(ttl_seconds: int = 300):
    """Decorator: cache tool call results in Redis by input hash."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Create a cache key from function name + arguments
            key_data = json.dumps({"fn": fn.__name__, "args": args, "kwargs": kwargs}, sort_keys=True)
            cache_key = f"tool:{hashlib.md5(key_data.encode()).hexdigest()}"
            cached = r.get(cache_key)
            # Compare against None so the check only reflects a missing key
            if cached is not None:
                return json.loads(cached)
            result = fn(*args, **kwargs)
            r.setex(cache_key, ttl_seconds, json.dumps(result))
            return result
        return wrapper
    return decorator

# Wrap your tool functions:
@cached_tool(ttl_seconds=1800)  # cache 30 minutes
def get_weather(city: str) -> dict:
    # ... actual API call
    return {"temp": 32, "condition": "sunny"}

@cached_tool(ttl_seconds=86400)  # cache 24 hours
def get_company_info(domain: str) -> dict:
    # ... lookup
    return {"name": "...", "industry": "..."}
For tools that make external HTTP calls, this also reduces latency significantly — cached responses return in under 1ms vs 200–500ms for live calls.
Savings: varies by tool call frequency. In a pipeline processing similar inputs (e.g., classifying 1,000 support tickets), you might see 70%+ cache hits.
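To find out what hit rate you would actually get before wiring up Redis, an in-process stand-in with counters is enough. This is a sketch, not a replacement for the Redis version: `memory_cached_tool`, the `cache_stats` attribute, and the injectable `clock` parameter are all names I made up for illustration, and the cache lives only as long as the process.

```python
import functools
import json
import time
from typing import Any, Callable


def memory_cached_tool(ttl_seconds: int = 300,
                       clock: Callable[[], float] = time.monotonic):
    """In-process TTL cache for tool calls that also counts hits and misses."""
    def decorator(fn: Callable) -> Callable:
        store: dict[str, tuple[float, Any]] = {}  # key -> (expires_at, result)
        stats = {"hits": 0, "misses": 0}

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            key = json.dumps({"args": args, "kwargs": kwargs},
                             sort_keys=True, default=str)
            now = clock()
            entry = store.get(key)
            if entry is not None and entry[0] > now:
                stats["hits"] += 1
                return entry[1]
            stats["misses"] += 1
            result = fn(*args, **kwargs)
            store[key] = (now + ttl_seconds, result)
            return result

        wrapper.cache_stats = stats  # exposed so you can compute the hit rate
        return wrapper
    return decorator


@memory_cached_tool(ttl_seconds=1800)
def get_weather_stub(city: str) -> dict:
    return {"city": city, "temp": 32}  # placeholder for a real API call


for _ in range(5):
    get_weather_stub("Mumbai")
print(get_weather_stub.cache_stats)
```

Run a day of representative inputs through it and divide hits by total calls; if the rate is low, the Redis setup may not be worth the operational cost.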
Technique 3: Prompt caching for shared system prompts
If all your agent calls share a system prompt longer than 1,024 tokens, Claude's prompt caching saves up to 90% on those input tokens. The cache prefix is charged at 0.1× the normal input rate on cache hits.
# Enable caching by adding cache_control to the system message
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1000,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # your 2,000-token system prompt
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }
    ],
    messages=messages,
)
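The first call pays a premium to write the cache (1.25× the base input rate with the default 5-minute TTL). Subsequent calls within 5 minutes pay 0.1× for the cached prefix, and each hit refreshes the TTL. For a pipeline that makes 500 calls per hour all sharing the same system prompt, this cuts input costs by ~85% on those tokens.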
See the Batch API post for how this interacts with batch pricing.
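It is worth running the numbers for your own call volume, since the cache write is billed at a premium. The sketch below prices only the shared prefix and assumes every call after the first is a cache hit (that is, calls never sit more than the TTL apart). The multipliers are the ones Anthropic publishes for the 5-minute cache as far as I know; treat them as parameters and check the pricing page.

```python
def cached_prefix_cost(
    prefix_tokens: int,
    calls: int,
    input_cost_per_m: float = 3.0,
    write_multiplier: float = 1.25,  # assumed cache-write premium
    read_multiplier: float = 0.1,    # assumed cache-hit rate
) -> tuple[float, float]:
    """Return (cost_without_caching, cost_with_caching) for the prefix only."""
    if calls <= 0:
        return 0.0, 0.0
    per_call = prefix_tokens / 1_000_000 * input_cost_per_m
    without = per_call * calls
    # One cache write, then every remaining call reads from the cache.
    with_cache = per_call * (write_multiplier + read_multiplier * (calls - 1))
    return without, with_cache


without, with_cache = cached_prefix_cost(prefix_tokens=2_000, calls=500)
print(f"without: ${without:.2f}, with: ${with_cache:.2f}, "
      f"saved: {1 - with_cache / without:.0%}")
```

A single call is more expensive with caching than without, so the technique only pays off once the prefix is reused at least a couple of times inside the TTL window.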
Technique 4: Truncate tool results before passing to the model
This is the mistake I see most in production agents: the tool makes an API call, returns a 4,000-token JSON response, and the agent passes the whole thing to the next LLM call. Now your context window is full of irrelevant nested objects.
Extract only what the model needs inside the tool function:
import requests

API_BASE = "https://api.example.com"  # hypothetical base URL; requests needs an absolute URL

# Bad: passes entire API response to model
def get_order(order_id: str) -> dict:
    return requests.get(f"{API_BASE}/api/orders/{order_id}").json()  # 3KB of nested JSON

# Good: extract relevant fields
def get_order(order_id: str) -> dict:
    data = requests.get(f"{API_BASE}/api/orders/{order_id}").json()
    return {
        "order_id": data["id"],
        "status": data["status"],
        "items": [{"sku": i["sku"], "qty": i["quantity"]} for i in data["line_items"]],
        # Key renamed from total_inr: the value is in paise, and round() (not int())
        # keeps float error from dropping a paisa
        "total_paise": round(float(data["total"]) * 100),
        "estimated_delivery": data.get("shipping", {}).get("estimated_date"),
    }
Also cap text-based results:
def fetch_webpage(url: str) -> str:
    content = firecrawl.scrape(url)
    return content[:3000]  # 3K chars is enough context for most tasks
Savings: 30–50% on input tokens for tool-heavy agents.
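Both ideas generalize into two small helpers: one that whitelists fields, one that caps text without cutting a word in half and tells the model the content was trimmed. This is a sketch under my own naming (`pick_fields`, `cap_text`, and the `[truncated]` marker are not from any library), and a character cap is only a rough proxy for tokens.

```python
def pick_fields(data: dict, fields: list[str]) -> dict:
    """Keep only the whitelisted top-level keys that are present."""
    return {key: data[key] for key in fields if key in data}


def cap_text(text: str, max_chars: int = 3000, marker: str = " [truncated]") -> str:
    """Cap text at max_chars, backing up to the last space, and flag the cut."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    last_space = cut.rfind(" ")
    if last_space > 0:
        cut = cut[:last_space]  # avoid ending on a partial word
    return cut + marker


order = {"id": "A1", "status": "shipped", "internal_flags": ["x", "y"]}
print(pick_fields(order, ["id", "status"]))
print(cap_text("hello world foo", max_chars=13))
```

The marker matters more than it looks: without it the model has no way to know the page continued, and may treat the cut-off text as the whole document.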
Technique 5: Summarize conversation history
Multi-turn agents accumulate context. By turn 15, you might be paying for 8,000 tokens of history on every call even though the last 3 turns contain everything relevant.
def get_effective_history(messages: list, max_turns: int = 8) -> list:
    """Keep the last max_turns turns verbatim and summarize everything older."""
    if len(messages) <= max_turns * 2:
        return messages
    # Summarize older turns
    old_messages = messages[:-max_turns * 2]
    recent_messages = messages[-max_turns * 2:]
    summary_response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Cheap model for summarization
        max_tokens=200,
        messages=[{
            "role": "user",
            # default=str keeps json.dumps (needs `import json`) from failing on SDK content blocks
            "content": f"Summarize the key context from this conversation in 3 sentences: {json.dumps(old_messages, default=str)}",
        }],
    )
    summary = summary_response.content[0].text
    return [
        {"role": "user", "content": f"[Earlier conversation summary: {summary}]"},
        {"role": "assistant", "content": "Understood. I'll keep that context in mind."},
        *recent_messages,
    ]
Use this in your agent loop before each LLM call:
messages = get_effective_history(messages, max_turns=8)
response = client.messages.create(..., messages=messages)
Savings: 40–60% on long conversations. The reduction grows as conversations get longer.
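One caveat if your history contains tool calls: cutting at a fixed index can separate a `tool_use` block from its `tool_result`, and the API rejects a conversation where the two are split. A possible fix is to move the cut point back until it lands on a plain-text user message. The helper below is a sketch with a name of my own (`split_at_user_turn`), and it assumes plain user turns have string content while tool results arrive as lists of blocks.

```python
def split_at_user_turn(messages: list[dict], keep_last: int) -> tuple[list[dict], list[dict]]:
    """Split history into (old, recent) so that `recent` starts on a plain
    user message. The cut only moves earlier, so it keeps at least `keep_last`."""
    def is_plain_user(msg: dict) -> bool:
        return msg["role"] == "user" and isinstance(msg["content"], str)

    start = max(len(messages) - keep_last, 0)
    while start > 0 and not is_plain_user(messages[start]):
        start -= 1
    return messages[:start], messages[start:]


history = [
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "q2"},
    {"role": "assistant", "content": [{"type": "tool_use", "id": "t1", "name": "search", "input": {}}]},
    {"role": "user", "content": [{"type": "tool_result", "tool_use_id": "t1", "content": "..."}]},
    {"role": "assistant", "content": "a2"},
    {"role": "user", "content": "q3"},
]
old, recent = split_at_user_turn(history, keep_last=3)
print(len(old), len(recent))
```

The trade-off is that you sometimes keep a few more recent messages than requested, which costs a little and avoids a rejected request.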
Technique 6: Exit early when confidence is high
Don't run all 10 steps of your agent if the answer is clear after step 3. Add a confidence signal to your intermediate tool results and stop when it's high enough.
CONFIDENCE_THRESHOLD = 0.85

def run_agent_with_early_exit(question: str, max_steps: int = 10) -> dict:
    messages = [{"role": "user", "content": question}]
    steps = 0
    while steps < max_steps:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=500,
            system=SYSTEM_PROMPT,
            tools=tools,
            messages=messages,
        )
        steps += 1
        if response.stop_reason == "end_turn":
            return {"answer": response.content[0].text, "steps": steps}
        # process_tool_calls is your own helper: it runs each requested tool and
        # returns dicts with tool_use_id, content, and a confidence score
        # (you can also ask the model to return a confidence score explicitly)
        tool_results = process_tool_calls(response.content)
        # Drop the confidence key before sending: it is not a tool_result field
        result_blocks = [
            {"type": "tool_result", **{k: v for k, v in r.items() if k != "confidence"}}
            for r in tool_results
        ]
        messages.append({"role": "assistant", "content": response.content})
        # If the latest result has high confidence, ask Claude to conclude
        if any(r.get("confidence", 0) > CONFIDENCE_THRESHOLD for r in tool_results):
            messages.append({
                "role": "user",
                "content": [
                    *result_blocks,
                    {"type": "text", "text": "Based on the findings so far, you have enough information to give a complete answer. Please respond now."},
                ],
            })
            final = client.messages.create(
                model="claude-sonnet-4-6", max_tokens=500,
                system=SYSTEM_PROMPT, tools=tools, messages=messages,
            )
            # The reply may not open with a text block, so take the first one that is text
            answer = next((b.text for b in final.content if b.type == "text"), "")
            return {"answer": answer, "steps": steps + 1}
        messages.append({"role": "user", "content": result_blocks})
    return {"answer": "Reached step limit", "steps": steps}
Savings: 20–40% on average, depending on your workload distribution.
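If your tools do not produce a confidence value, the alternative mentioned in the code comment is to have the model report one. A minimal sketch, assuming you instruct the model to end its message with a line like `CONFIDENCE: 0.9` (that output format is my assumption, not an API feature), and keeping in mind that self-reported confidence is not calibrated:

```python
import re

# Matches a standalone line such as "CONFIDENCE: 0.92"
_CONFIDENCE_RE = re.compile(r"^CONFIDENCE:\s*([01](?:\.\d+)?)\s*$", re.MULTILINE)


def parse_confidence(text: str, default: float = 0.0) -> float:
    """Extract a self-reported confidence score, clamped to [0, 1].
    Returns `default` when no score line is found, so a missing score
    never triggers an early exit."""
    match = _CONFIDENCE_RE.search(text)
    if not match:
        return default
    return min(max(float(match.group(1)), 0.0), 1.0)


reply = "The order shipped on Tuesday.\nCONFIDENCE: 0.92"
print(parse_confidence(reply))
```

Defaulting to 0.0 is the conservative choice here: a malformed reply keeps the agent running its normal loop instead of stopping on a guess.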
Technique 7: Batch non-real-time workloads
The Anthropic Batch API gives 50% off all input and output tokens for asynchronous workloads. If your agent is processing documents, generating reports, or running analysis that doesn't need immediate results, batch it.
import time

import anthropic

client = anthropic.Anthropic()

# Prepare a batch of 500 documents to process
batch_requests = [
    {
        "custom_id": f"doc-{i}",
        "params": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 500,
            "messages": [{"role": "user", "content": f"Extract key facts from: {doc_text}"}],
        },
    }
    for i, doc_text in enumerate(documents)
]

# Submit batch (returns in minutes to hours)
batch = client.messages.batches.create(requests=batch_requests)
print(f"Batch submitted: {batch.id}")

# Poll for completion and process results
while True:
    status = client.messages.batches.retrieve(batch.id)
    if status.processing_status == "ended":
        break
    time.sleep(60)

for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        print(f"{result.custom_id}: {result.result.message.content[0].text}")
Savings: 50% on all token costs for batched workloads.
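For larger jobs you will probably want to split the work into several batches, since the Batch API caps how many requests (and how much payload) a single batch may contain. The helper below is a simple sketch: `chunk_requests` and `max_per_batch` are my own names, the cap you pass in should come from the current API docs, and it splits by request count only, not by payload size.

```python
from typing import Iterator


def chunk_requests(requests: list[dict], max_per_batch: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of `requests`, each at most max_per_batch long."""
    if max_per_batch < 1:
        raise ValueError("max_per_batch must be at least 1")
    for start in range(0, len(requests), max_per_batch):
        yield requests[start:start + max_per_batch]


# Hypothetical request stubs; real entries carry custom_id and params
stubs = [{"custom_id": f"doc-{i}"} for i in range(5)]
for chunk in chunk_requests(stubs, max_per_batch=2):
    print([req["custom_id"] for req in chunk])
```

Each chunk can then be submitted with its own `client.messages.batches.create(requests=chunk)` call and polled independently.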
Before and after: the cost table
| Step | Before | After | Technique applied |
|---|---|---|---|
| Intent classification | Sonnet, no cache | Haiku | Model routing |
| Weather API (50×/day) | 50 API calls | 1 API call + 49 cache hits | Tool caching |
| System prompt (2K tokens × 500 calls) | Full price | 90% cache hits | Prompt caching |
| Search results (5K tokens each) | Full context | Truncated to 500 tokens | Result truncation |
| Conversation history (15 turns) | Full history | Summarized to 3 sentences | History summarization |
| Research agent (10 steps avg) | Always 10 steps | 4.2 steps avg | Early exit |
| Nightly document processing | Real-time | Batch API | Batch processing |
Total estimated reduction: 58–65% depending on workload.
The model routing guide covers automatic routing with LiteLLM if you want to make model selection dynamic rather than hardcoded by task type.



