Every API call that repeats the same system prompt is paying full price to reprocess tokens Claude has already seen. For a simple chatbot that sends a 2,000-token system prompt with every message, that's 60 million wasted tokens per month at 1,000 queries/day — before you even count the user messages.
Prompt caching fixes this: cached tokens cost 10% of the standard input price. For a customer support bot with a 2,000-token system prompt handling 1,000 daily queries, that's the difference between roughly ₹22,000/month and ₹5,000/month. One line of code.
## How prompt caching works
Three phases:

- **Cache write (first request):** Claude processes the cacheable block at full input price plus a 25% write surcharge. The cache is stored for 5 minutes from the last access.
- **Cache read (subsequent requests):** Any request that sends the same cached block gets a cache hit. Cost drops to 10% of the standard input price. The 5-minute timer resets on every hit, so an active conversation keeps its cache alive indefinitely.
- **Cache miss (after 5 minutes of inactivity):** The cache expires. The next request pays full price plus the write surcharge again to recreate it.
Here's the full cost breakdown across models:
| Model | Standard input | Cache write | Cache read |
|---|---|---|---|
| Claude Sonnet 4.6 | $3.00/MTok | $3.75/MTok | $0.30/MTok |
| Claude Opus 4.6 | $5.00/MTok | $6.25/MTok | $0.50/MTok |
INR equivalents via AICredits.in (~10% markup on USD):
| Model | Standard input | Cache write | Cache read |
|---|---|---|---|
| Sonnet 4.6 | ₹331/MTok | ₹413/MTok | ₹33/MTok |
| Opus 4.6 | ₹552/MTok | ₹690/MTok | ₹55/MTok |
The math favors caching heavily once your system prompt exceeds ~1,000 tokens and you're making more than ~10 requests per session.
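In fact, the break-even point arrives sooner than that rule of thumb suggests: a single cache hit saves 90% of a block's input cost, while the write surcharge is a one-time 25%. A minimal sketch using the Sonnet 4.6 INR rates from the tables above:

```python
# Sonnet 4.6 INR rates from the tables above.
STANDARD = 331  # ₹/MTok, standard input
WRITE = 413     # ₹/MTok, cache write (25% surcharge)
READ = 33       # ₹/MTok, cache read (10% of standard)

def caching_saves(prompt_tokens: int, requests_per_session: int) -> bool:
    """True if caching a block beats paying the standard rate on every request."""
    mtok = prompt_tokens / 1_000_000
    uncached = STANDARD * mtok * requests_per_session
    cached = WRITE * mtok + READ * mtok * (requests_per_session - 1)
    return cached < uncached

print(caching_saves(2_000, 2))   # True: breaks even on the very second request
print(caching_saves(2_000, 10))  # True: and the gap only widens from there
```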
## What to cache (and what not to)
Cache these:
- System prompts — almost always worth caching if they're consistent across requests. If your system prompt is 2,000+ tokens of instructions, this is the highest-leverage thing you can do.
- Long documents sent with every request — knowledge bases, product catalogues, reference documents. If you're sending a 50,000-word document to Claude for every analysis request, you're paying full price every time.
- Tool definitions — each tool spec is 50-100 tokens. If you have 20 tools, that's 1,000-2,000 tokens per request that can be cached (see the sketch after this list).
- Few-shot examples — if you have a long example block to guide output format, cache it.
Don't cache these:
- The user message — changes every request, can't be cached.
- Short system prompts — below ~1,000 tokens the absolute savings are small, and blocks under the model's minimum cacheable length aren't cached at all.
- Content that changes frequently — if your "current context" block updates every 30 minutes, the cache will miss more than it hits.
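For the tool-definition case flagged in the list above, the usual pattern is to set a breakpoint on the last tool in the array, which caches every tool definition before it as a single prefix. A sketch with hypothetical tool names; `client`, `system_prompt`, and `user_message` are assumed from the surrounding examples:

```python
tools = [
    {
        "name": "get_order_status",  # hypothetical tool
        "description": "Look up the status of an order by its ID.",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
    # ...18 more tool definitions...
    {
        "name": "escalate_to_human",  # hypothetical tool
        "description": "Escalate the conversation to a human agent.",
        "input_schema": {"type": "object", "properties": {}},
        # Marking the LAST tool caches the entire tools array as one prefix.
        "cache_control": {"type": "ephemeral"},
    },
]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    system=system_prompt,
    messages=[{"role": "user", "content": user_message}],
)
```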
## Automatic caching (recommended for most use cases)
The simplest approach: pass `cache_control={"type": "auto"}` as a top-level request parameter. Claude places the cache point automatically at the end of the last cacheable block and moves it forward as the conversation grows.
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    cache_control={"type": "auto"},
    system="""You are a customer support agent for Acme Corp.

Company overview:
[2,000+ tokens of product documentation, policies, procedures...]
""",
    messages=[
        {"role": "user", "content": user_message}
    ]
)

print(response.content[0].text)
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Cache write tokens: {response.usage.cache_creation_input_tokens}")
```
That's it. The first request writes the cache. Every subsequent request within 5 minutes (extended on each hit) reads it for ₹33/MTok instead of ₹331/MTok.
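You can watch the lifecycle directly in those usage fields. A sketch assuming the same `client` and a 2,000-token `SYSTEM_PROMPT` string:

```python
# First call: cache write.
r1 = client.messages.create(model="claude-sonnet-4-6", max_tokens=1024,
                            cache_control={"type": "auto"}, system=SYSTEM_PROMPT,
                            messages=[{"role": "user", "content": "Where is my order?"}])
print(r1.usage.cache_creation_input_tokens)  # ≈ 2,000 (the system prompt, written at ₹413/MTok)
print(r1.usage.cache_read_input_tokens)      # 0

# Second call within 5 minutes: cache hit.
r2 = client.messages.create(model="claude-sonnet-4-6", max_tokens=1024,
                            cache_control={"type": "auto"}, system=SYSTEM_PROMPT,
                            messages=[{"role": "user", "content": "Can I change the address?"}])
print(r2.usage.cache_creation_input_tokens)  # 0
print(r2.usage.cache_read_input_tokens)      # ≈ 2,000 (read at ₹33/MTok)
```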
## Explicit cache breakpoints (for fine-grained control)
For more complex use cases — for example, you want to cache the system prompt but not a dynamically fetched document — use explicit cache breakpoints with `cache_control: {"type": "ephemeral"}` on specific content blocks.
```python
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": """You are a customer support agent for Acme Corp.
[Your 2,000-token static instructions here...]
Always respond in the user's language. Be concise. Escalate billing issues to billing@acme.com.""",
            "cache_control": {"type": "ephemeral"},  # Cache this block
        },
        {
            "type": "text",
            "text": f"""Current product catalogue (updated daily):
{product_catalogue}""",
            # No cache_control — this block changes daily, so it is reprocessed each request
        },
    ],
    messages=[
        {"role": "user", "content": user_message}
    ]
)
```
You can add up to 4 explicit cache breakpoints. Everything up to and including a breakpoint is cached together.
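Breakpoints also work on conversation history. A common pattern, sketched below assuming you store history as plain dicts (the `send_with_history` helper is ours, not part of the SDK), is to move a single breakpoint to the end of the accumulated history each turn, so the whole prefix is read from cache and only the newest message is processed at full price:

```python
def send_with_history(client, system_blocks, history, new_user_message):
    """Move one cache breakpoint to the end of the prior history, then send."""
    # Clear breakpoints set on earlier turns so we stay under the 4-breakpoint limit.
    for msg in history:
        if isinstance(msg["content"], list):
            for block in msg["content"]:
                block.pop("cache_control", None)

    if history:
        last = history[-1]
        # Normalise a plain-string turn into a content-block list so it can carry a breakpoint.
        if isinstance(last["content"], str):
            last["content"] = [{"type": "text", "text": last["content"]}]
        last["content"][-1]["cache_control"] = {"type": "ephemeral"}

    history.append({"role": "user", "content": new_user_message})
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system_blocks,
        messages=history,
    )
```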
## Real cost calculation — customer support bot in India
Let's price this out for a realistic scenario.
Setup: 1,000 customer queries/day. 2,000-token system prompt. 300-token average user message. 500-token average response. Claude Sonnet 4.6 via AICredits.in.
Without caching:
Every request processes: 2,000 (system) + 300 (user) = 2,300 input tokens.
Monthly input tokens: 2,300 × 1,000 × 30 = 69,000,000 tokens = 69 MTok
Cost: 69 × ₹331 = ₹22,839/month
With caching:
Cache write: One 2,000-token write each time the cache goes cold (simplified to roughly one per day, since steady traffic keeps it warm). At ₹413/MTok that's 0.002 MTok × ₹413 ≈ ₹0.83 per write, effectively negligible.
Cache reads: 2,000 cached tokens per request at ₹33/MTok + 300 uncached tokens at ₹331/MTok.
Per day (1,000 requests):
- Cached: 2,000,000 tokens × ₹33/MTok = ₹66
- Uncached: 300,000 tokens × ₹331/MTok = ₹99.30
- Total: ₹165.30/day
Monthly: ₹165.30 × 30 = ₹4,959/month
Savings: ₹17,880/month (~78%)
That's assuming you do nothing else. If your system prompt is longer (e.g., 10,000 tokens), or you're also caching tool definitions, the savings compound further.
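The whole calculation fits in a few lines if you want to plug in your own traffic profile (a sketch using the Sonnet 4.6 INR rates from the tables above):

```python
# Reproduce the numbers above; swap in your own traffic profile.
QUERIES_PER_DAY, DAYS = 1_000, 30
SYSTEM_TOK, USER_TOK = 2_000, 300
STANDARD, READ = 331, 33  # ₹/MTok, Sonnet 4.6 via AICredits.in

def mtok(tokens: float) -> float:
    """Convert a token count to millions of tokens."""
    return tokens / 1_000_000

requests = QUERIES_PER_DAY * DAYS

without_cache = mtok((SYSTEM_TOK + USER_TOK) * requests) * STANDARD
with_cache = (mtok(SYSTEM_TOK * requests) * READ
              + mtok(USER_TOK * requests) * STANDARD)
# Cache writes (~1/day if traffic keeps the cache warm) add
# roughly 30 × 0.002 MTok × ₹413 ≈ ₹25/month and are ignored here.

print(f"Without caching: ₹{without_cache:,.0f}/month")   # ₹22,839
print(f"With caching:    ₹{with_cache:,.0f}/month")      # ₹4,959
print(f"Savings: {1 - with_cache / without_cache:.0%}")  # 78%
```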
## Monitoring cache hit rates
You should always track whether your caching is actually working. The response usage object gives you everything you need:
```python
def make_cached_request(client, system_prompt, user_message):
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        cache_control={"type": "auto"},
        system=system_prompt,
        messages=[{"role": "user", "content": user_message}]
    )

    usage = response.usage
    cache_hits = usage.cache_read_input_tokens
    cache_writes = usage.cache_creation_input_tokens
    uncached = usage.input_tokens  # Tokens processed without cache
    total_input = cache_hits + cache_writes + uncached

    hit_rate = cache_hits / total_input if total_input > 0 else 0
    print(f"Cache hit rate: {hit_rate:.1%}")
    print(f"  Hit: {cache_hits} tokens (₹{cache_hits * 33 / 1_000_000:.4f})")
    print(f"  Write: {cache_writes} tokens (₹{cache_writes * 413 / 1_000_000:.4f})")
    print(f"  Uncached: {uncached} tokens (₹{uncached * 331 / 1_000_000:.4f})")

    return response
```
You want a cache hit rate above 80% for meaningful savings. If you're seeing a low hit rate:
- Check that you're not regenerating the system prompt with dynamic content on every call
- Verify requests are spaced less than 5 minutes apart (for persistent sessions)
- Confirm you're sending the exact same text — even a single character difference breaks the cache
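That last point is the most common silent failure, and it's easy to guard against in code. A minimal sketch (the helper name is ours, not part of any SDK) that hashes the cacheable prefix and warns when it drifts between calls:

```python
import hashlib

_last_prompt_hash = None

def warn_if_prompt_changed(system_prompt: str) -> None:
    """Warn when the cacheable prefix changes between calls; any change busts the cache."""
    global _last_prompt_hash
    digest = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
    if _last_prompt_hash is not None and digest != _last_prompt_hash:
        print(f"WARNING: system prompt changed ({_last_prompt_hash[:8]} -> {digest[:8]}): "
              "this forces a fresh cache write at the surcharged rate.")
    _last_prompt_hash = digest
```

Call it just before each request; an interpolated timestamp or user name in the prompt will show up immediately as a hash change.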
## Combining with the effort parameter for maximum savings
Prompt caching and the effort parameter stack independently. For high-volume applications where you want to minimise costs without sacrificing quality:
```python
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    effort="medium",                 # Reduces thinking token usage ~40-60%
    thinking={"type": "adaptive"},   # Scales thinking to task complexity
    cache_control={"type": "auto"},  # Caches system prompt at 10% of input price
    system=system_prompt,
    messages=conversation_history
)
```
The combined effect: ~78% savings from prompt caching on repeated system tokens, plus ~40-60% reduction in thinking token usage from `effort="medium"`. For a customer support bot where most queries don't need deep reasoning, this combination brings effective cost well below 20% of the naive default configuration.
💡 Track your ₹ savings per API key at AICredits.in — the dashboard shows per-request token breakdown including cache hits vs writes vs uncached.
## Practical checklist
Before you ship any production Claude integration:
- Is your system prompt > 1,000 tokens? Add `cache_control={"type": "auto"}` immediately.
- Are you sending the same document with every request? Put it in the system prompt under a cache breakpoint.
- Do you have many tool definitions? Cache them in a separate block.
- Are requests in the same session within 5 minutes of each other? If not, cache writes are mostly wasted — consider restructuring the session model.
- Are you logging `cache_read_input_tokens`? If you're not monitoring cache hits, you don't know if it's working.
## Next steps
- Claude 4.6 effort parameter and cost optimization — reduce thinking token costs
- Claude Opus 4.6 prompting guide — when to use Opus vs Sonnet
- AICredits.in review — UPI-based API access in India
- Prompt compression guide — shrink system prompts without losing effectiveness
## Try it now with AICredits.in
Access Claude, GPT-4o, Gemini, and 300+ models with UPI payment in ₹. No international card needed. Create free account →