Most teams running AI in production are leaving serious money on the table. If your application sends the same system prompt, tool definitions, or retrieved documents with every request — and you haven't set up prompt caching — you're paying full price for tokens your provider has already processed. In my experience, the savings from proper caching land between 60% and 80% of total token costs for typical production workloads. This post covers exactly how to implement it on both Anthropic and OpenAI APIs, with real pricing numbers and the ordering patterns that actually move your cache hit rate.
## What prompt caching actually is
When you call an AI API, the provider processes your entire input from scratch — tokenizing it, running it through attention layers, building the key-value (KV) cache internally. Prompt caching lets you reuse that computed KV cache across requests. Instead of reprocessing a 20,000-token system prompt on every call, the provider reads from its stored computation.
The key mechanic: caching only works on a prefix of your prompt. The provider checksums the beginning of your input. If a subsequent request starts with an identical prefix of sufficient length, it's a cache hit. Anything after the cached prefix still gets processed normally. This is why prompt structure matters — position your stable content at the top, variable content at the bottom.
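To make the prefix mechanic concrete, here is a toy simulation of prefix-hash lookup. This is purely illustrative: real providers cache computed KV state keyed on token prefixes, not raw text hashes with a character threshold.

```python
import hashlib

class ToyPrefixCache:
    """Illustrative only: real providers cache KV state, not text hashes."""

    def __init__(self, min_prefix_chars: int = 40):
        self.min_prefix_chars = min_prefix_chars
        self.store: set[str] = set()

    def lookup(self, prompt: str, prefix_len: int) -> str:
        """Return 'hit', 'write', or 'skip' for the given prefix length."""
        if prefix_len < self.min_prefix_chars:
            return "skip"  # below the minimum: silently not cached
        key = hashlib.sha256(prompt[:prefix_len].encode()).hexdigest()
        if key in self.store:
            return "hit"
        self.store.add(key)
        return "write"

cache = ToyPrefixCache()
static = "SYSTEM: You are a helpful analyst. " * 3   # stable prefix
print(cache.lookup(static + "Q: revenue?", len(static)))  # write
print(cache.lookup(static + "Q: margins?", len(static)))  # hit: same prefix
print(cache.lookup("Q: hi", 5))                           # skip: too short
```

The two different questions still hit the cache because only the prefix up to `prefix_len` is hashed; the dynamic tail never participates in the match.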
This is distinct from context caching at the infrastructure level, though the end result is similar: you pay less to reuse context.
## Anthropic prompt caching
Anthropic's implementation requires explicit opt-in. You mark cache breakpoints in your request using cache_control blocks, and the API stores everything up to that breakpoint.
### Pricing
| Token type | Cost (relative to base) |
|---|---|
| Cache write (first request) | 1.25× base input price |
| Cache read (subsequent hits) | 0.10× base input price |
| Normal input (uncached) | 1.00× base input price |
The write penalty is real — first-request cost goes up 25%. But from the second request onward, you're paying 10 cents on the dollar for those tokens. The breakeven is almost always the second request.
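A quick sanity check on that breakeven claim, using the multipliers from the table (prices are per million input tokens):

```python
def anthropic_cost(n_requests: int, tokens: int, base_per_mtok: float) -> float:
    """One cache write at 1.25x, then n-1 reads at 0.10x."""
    write = tokens * 1.25 * base_per_mtok / 1e6
    reads = (n_requests - 1) * tokens * 0.10 * base_per_mtok / 1e6
    return write + reads

def uncached_cost(n_requests: int, tokens: int, base_per_mtok: float) -> float:
    """Every request pays full input price."""
    return n_requests * tokens * base_per_mtok / 1e6

# 10,000-token prefix at $3/MTok: caching wins from the second request
print(round(uncached_cost(2, 10_000, 3.0), 4))   # 0.06
print(round(anthropic_cost(2, 10_000, 3.0), 4))  # 0.0405
```

With one request, caching costs more (1.25× vs 1.00×); with two, the cached path is already about 33% cheaper.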
### Minimum prefix lengths
- Claude Sonnet, Opus: 1,024 tokens minimum cacheable prefix
- Claude Haiku: 2,048 tokens minimum cacheable prefix
If your prefix is shorter than the minimum, the cache write is silently skipped. No error, no warning — just no cache hit. This trips up a lot of developers who wonder why their short system prompts aren't getting cached.
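One way to guard against the silent skip is a pre-flight length estimate. The 4-characters-per-token ratio below is a rough heuristic for English text, not a real tokenizer, and the helper names are illustrative:

```python
MIN_CACHEABLE_TOKENS = {"sonnet": 1024, "opus": 1024, "haiku": 2048}

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose.
    return len(text) // 4

def should_cache(prefix: str, model_family: str) -> bool:
    """Warn instead of silently losing the cache write."""
    minimum = MIN_CACHEABLE_TOKENS[model_family]
    estimated = estimate_tokens(prefix)
    if estimated < minimum:
        print(f"Warning: ~{estimated} tokens, below the {minimum}-token minimum")
        return False
    return True

print(should_cache("short prompt", "haiku"))  # False
print(should_cache("x" * 10_000, "sonnet"))   # True
```

For production use, swap the heuristic for a real token count (e.g. the provider's count-tokens endpoint) before trusting the threshold.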
### TTL
Cache entries expire after 5 minutes of inactivity. Each cache hit resets the TTL. For bursty workloads, this is fine. For low-traffic apps with long gaps between requests, you'll get cold cache misses more often.
### Python implementation
```python
import anthropic

client = anthropic.Anthropic()

# Long static document or system instructions
SYSTEM_PROMPT = """You are an expert financial analyst assistant...
[... 2000+ tokens of instructions, examples, and context ...]
"""

REFERENCE_DOCUMENT = """[Your 15,000-token document here]"""

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Cache breakpoint after the system prompt
            "cache_control": {"type": "ephemeral"},
        },
        {
            "type": "text",
            "text": REFERENCE_DOCUMENT,
            # Second cache breakpoint after the document
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[
        {
            "role": "user",
            "content": "What is the revenue trend for Q3?",  # Dynamic — NOT cached
        }
    ],
)

# Check cache performance
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
```
You can place multiple cache_control breakpoints in a single request (the API currently supports up to four). Anthropic caches the prefix up to each marked block. Use this when you have tiered static content — fixed system instructions first, then per-session context, then per-request data — each with its own breakpoint.
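A sketch of that tiering (the helper name and strings are illustrative; the block shape matches the Messages API example above):

```python
def build_system_blocks(instructions: str, session_context: str) -> list:
    """Tiered static content: app-wide instructions first, then
    per-session context, each with its own cache breakpoint."""
    return [
        {
            "type": "text",
            "text": instructions,  # changes only on deploy
            "cache_control": {"type": "ephemeral"},
        },
        {
            "type": "text",
            "text": session_context,  # changes once per session
            "cache_control": {"type": "ephemeral"},
        },
    ]

blocks = build_system_blocks(
    "You are an expert financial analyst assistant...",
    "Session document: Q3 earnings report...",
)
print(len(blocks))  # 2
```

Because each tier has its own breakpoint, a new session only invalidates the second block's cache; the instructions tier keeps hitting.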
### What the response tells you
The usage object returns three token counts:
- cache_creation_input_tokens — tokens written to cache (charged at 1.25×)
- cache_read_input_tokens — tokens read from cache (charged at 0.10×)
- input_tokens — tokens processed normally (charged at 1.00×)
On a warm cache hit, you'll see cache_read_input_tokens equal to the length of your cached prefix and cache_creation_input_tokens equal to zero.
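Those counters fold naturally into a status check. This is a sketch; the `usage` object here is faked with SimpleNamespace rather than taken from a live response:

```python
from types import SimpleNamespace

def cache_status(usage) -> str:
    """Classify a response as a warm hit, a cold write, or fully uncached."""
    if getattr(usage, "cache_read_input_tokens", 0) > 0:
        return "warm hit"
    if getattr(usage, "cache_creation_input_tokens", 0) > 0:
        return "cold write"
    return "uncached"

warm = SimpleNamespace(input_tokens=30,
                       cache_creation_input_tokens=0,
                       cache_read_input_tokens=17_000)
print(cache_status(warm))  # warm hit
```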
## OpenAI prompt caching
OpenAI's approach is the opposite of Anthropic's: fully automatic. No configuration required. The API silently caches prompt prefixes and applies a discount when it gets a hit.
### Pricing
| Token type | Cost (relative to base) |
|---|---|
| Cache hit (prefix) | 0.50× base input price |
| Cache miss / normal | 1.00× base input price |
The discount is 50%, not 90% like Anthropic. But you also don't pay a write penalty. First request is normal price, subsequent hits are half price.
### Minimum prefix length and TTL
- Minimum cacheable prefix: 1,024 tokens
- TTL: typically 5–10 minutes of inactivity, with entries always evicted within about an hour
### Structuring prompts for automatic caching
Since OpenAI caches automatically based on prefix matching, your job is purely structural: keep identical content at the beginning of every request.
````python
from openai import OpenAI

client = OpenAI()

# This prefix must be IDENTICAL across requests for caching to work
STATIC_SYSTEM = """You are a code review assistant specializing in Python.
You follow PEP 8 style guidelines and focus on:
1. Security vulnerabilities
2. Performance bottlenecks
3. Code maintainability
[... rest of your static instructions ...]"""

STATIC_EXAMPLES = """
## Example review
Code:
```python
def get_user(id):
    return db.query(f"SELECT * FROM users WHERE id={id}")
```
Review:
- SQL injection vulnerability: use parameterized queries
- Missing type hints
- No error handling for missing user
[... more examples ...]"""


def review_code(code_snippet: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                # Static prefix — same on every request
                "content": STATIC_SYSTEM + "\n\n" + STATIC_EXAMPLES,
            },
            {
                "role": "user",
                # Dynamic content always comes last
                "content": f"Please review this code:\n\n```python\n{code_snippet}\n```",
            },
        ],
    )
    return response.choices[0].message.content


# Check cache usage
def review_code_with_stats(code_snippet: str):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": STATIC_SYSTEM + "\n\n" + STATIC_EXAMPLES},
            {"role": "user", "content": f"Review:\n\n```python\n{code_snippet}\n```"},
        ],
    )
    usage = response.usage
    cached = usage.prompt_tokens_details.cached_tokens if usage.prompt_tokens_details else 0
    print(f"Total prompt tokens: {usage.prompt_tokens}")
    print(f"Cached tokens: {cached}")
    print(f"Cache hit rate: {cached / usage.prompt_tokens:.1%}")
    return response.choices[0].message.content
````
## Pricing comparison at a glance
| Provider | Cache write cost | Cache read cost | Min prefix | Config required |
|---|---|---|---|---|
| Anthropic | 1.25× base | 0.10× base | 1,024 tokens (Sonnet/Opus); 2,048 (Haiku) | Yes (cache_control) |
| OpenAI | 1.00× base | 0.50× base | 1,024 tokens | No (automatic) |
Anthropic gives a bigger discount on reads (90% off vs. 50% off) but charges more to write. For very high-traffic applications where the same prefix fires thousands of times per hour, Anthropic's model wins. For moderate traffic, OpenAI's zero-config approach is often the better starting point.
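One way to compare the two models is the effective per-token multiplier after n uses of the same prefix, relative to each provider's own base price:

```python
def anthropic_multiplier(n: int) -> float:
    """Average multiplier over n requests: one 1.25x write, n-1 0.10x reads."""
    return (1.25 + 0.10 * (n - 1)) / n

def openai_multiplier(n: int) -> float:
    """One full-price request, then n-1 half-price hits."""
    return (1.00 + 0.50 * (n - 1)) / n

for n in (1, 2, 5, 20):
    print(n, round(anthropic_multiplier(n), 4), round(openai_multiplier(n), 4))
```

Anthropic's effective rate drops below OpenAI's from the second use onward, converging toward 0.10× against OpenAI's 0.50× floor. This compares relative multipliers only; absolute costs also depend on each provider's base price.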
## The ordering rule that determines everything
Both implementations share one absolute requirement: the cached portion must be a stable prefix. The provider checksums the beginning of your input. If anything in that prefix changes between requests, it's a cache miss and you pay full price.
This means your prompt structure should follow a strict hierarchy from most-stable to least-stable:
- System instructions — never change between requests
- Tool/function definitions — change only when you update your app
- Few-shot examples — static or session-scoped
- Reference documents or retrieved context — change per session, not per request
- User message — changes every request
Most developers structure it backwards. They write code that prepends user context and appends boilerplate, then wonder why cache hit rates are near zero. The fix is always the same: invert the structure.
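The inversion, sketched with hypothetical helpers:

```python
STATIC_INSTRUCTIONS = "You are a support agent. Follow policy X. [... 1,500+ tokens ...]"

# Backwards: dynamic content first destroys the stable prefix on every request
def build_prompt_bad(user_context: str, question: str) -> str:
    return f"{user_context}\n\n{STATIC_INSTRUCTIONS}\n\n{question}"

# Inverted: stable prefix first, dynamic content last
def build_prompt_good(user_context: str, question: str) -> str:
    return f"{STATIC_INSTRUCTIONS}\n\n{user_context}\n\n{question}"

a = build_prompt_good("customer ctx A", "How do I reset my password?")
b = build_prompt_good("customer ctx B", "Where is my invoice?")
# The shared leading span is what the provider can actually cache:
print(a[:len(STATIC_INSTRUCTIONS)] == b[:len(STATIC_INSTRUCTIONS)])  # True
```

With the "bad" ordering, two requests diverge at the very first characters, so no usable prefix ever repeats.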
For a deeper look at how prefix ordering affects performance beyond caching, see the context engineering guide.
## High-impact use cases
RAG pipelines — this is the single biggest win. If you retrieve a 20K-token document and send it with every query in a session, you're paying full price for those tokens on every request. Cache the document after the first retrieval. Every follow-up question in that session hits the cache.
Agents with long tool definitions — complex agents often carry 3,000–8,000 tokens of tool definitions on every call in a loop. These definitions don't change mid-run. Cache them. On a 10-step agent loop, you pay write cost once and read cost nine times. For patterns on structuring this, see AI agent design patterns.
Document analysis pipelines — uploading a contract or report for analysis typically involves one large document and many smaller questions. Same document, different queries — exactly what caching is designed for.
Few-shot prompt libraries — if you maintain a library of 15–20 examples to improve output quality, those examples are a natural caching candidate. Put them in the system prompt before the user message, and they're paid for once per TTL window.
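The document-analysis pattern can be sketched against the Anthropic request shape shown earlier. The document string and helper name here are placeholders; building the kwargs separately keeps the structure easy to inspect:

```python
DOCUMENT = "[Your 20,000-token contract text here]"

def build_request(question: str, document: str = DOCUMENT) -> dict:
    """Kwargs for client.messages.create(**build_request(...)).
    The document block carries the breakpoint, so the first call pays the
    1.25x write and every follow-up within the TTL reads at 0.10x."""
    return {
        "model": "claude-opus-4-5",
        "max_tokens": 512,
        "system": [
            {
                "type": "text",
                "text": document,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": question}],  # dynamic, uncached
    }

req = build_request("What is the termination clause?")
print(req["system"][0]["cache_control"])  # {'type': 'ephemeral'}
```

Every subsequent question reuses the same `system` block byte-for-byte, which is exactly what the prefix match requires.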
For reducing what you cache in the first place, the prompt compression guide covers techniques that often combine well with caching.
## Real cost math
Here's a concrete example. A RAG pipeline that:
- Retrieves a 20,000-token document per session
- Receives 1,000 user queries per day
- Uses Claude Sonnet 4.6 (input price: $3.00 per million tokens)
- Average session has 5 queries against the same document
Without caching:
- 1,000 queries × 20,000 tokens = 20,000,000 input tokens/day
- Cost: 20M × $3.00/1M = $60/day
With Anthropic prompt caching:
- 200 cache writes (one per session start): 200 × 20,000 = 4M tokens at 1.25× = $15.00
- 800 cache reads (4 follow-ups per session): 800 × 20,000 = 16M tokens at 0.10× = $4.80
- Total: $19.80/day
- Savings: 67%
With OpenAI GPT-4o (input price: $2.50/1M):
- 200 cache writes (full price): 4M × $2.50/1M = $10.00
- 800 cache reads (50% off): 16M × $1.25/1M = $20.00
- Total: $30.00/day
- Savings: 40% (vs. $50/day uncached)
For workloads with even higher query multipliers per session, Anthropic's 90% read discount compounds dramatically.
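The arithmetic above can be checked with a small helper (a sketch; parameter names are illustrative):

```python
def daily_cost(sessions: int, queries_per_session: int, prefix_tokens: int,
               base_per_mtok: float, write_mult: float, read_mult: float) -> float:
    """Daily cost in dollars: one cache write per session, then cached reads."""
    writes = sessions * prefix_tokens * write_mult * base_per_mtok / 1e6
    reads = (sessions * (queries_per_session - 1) * prefix_tokens
             * read_mult * base_per_mtok / 1e6)
    return writes + reads

# 200 sessions/day x 5 queries each, 20,000-token document
print(round(daily_cost(200, 5, 20_000, 3.00, 1.25, 0.10), 2))  # 19.8 (Anthropic)
print(round(daily_cost(200, 5, 20_000, 2.50, 1.00, 0.50), 2))  # 30.0 (OpenAI)
```

Raising `queries_per_session` shifts more volume onto the read rate, which is where Anthropic's 0.10× multiplier pulls ahead fastest.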
## Monitoring cache performance in production
Don't guess — measure. Both APIs return cache metrics on every response.
Anthropic — check response.usage:
```python
usage = response.usage
cache_hit_rate = (
    usage.cache_read_input_tokens /
    (usage.cache_read_input_tokens + usage.cache_creation_input_tokens + usage.input_tokens)
)
```
OpenAI — check response.usage.prompt_tokens_details:
```python
details = response.usage.prompt_tokens_details
cache_hit_rate = details.cached_tokens / response.usage.prompt_tokens
```
Log these metrics per request and aggregate them. A healthy RAG pipeline should see 70–90% cache hit rates once the TTL window is warm. If you're seeing below 50%, the most common causes are:
- Dynamic content bleeding into your prefix (timestamps, request IDs, random seeds)
- Prefix length below the minimum threshold
- TTL expiring between requests for low-traffic routes
- Inconsistent whitespace or formatting in your static content
The fastest debug step: print the first 200 characters of your prompt on two consecutive requests and diff them. Any difference before the dynamic section explains your cache miss.
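That debug step, scripted with the standard library (the prompt strings here are toy examples):

```python
import difflib

def diff_prefixes(prompt_a: str, prompt_b: str, n_chars: int = 200) -> list:
    """Return the lines that differ in the first n_chars of two prompts."""
    return [
        line
        for line in difflib.ndiff(
            prompt_a[:n_chars].splitlines(), prompt_b[:n_chars].splitlines()
        )
        if line.startswith(("-", "+"))
    ]

a = "System: analyst\nDate: 2025-01-01\nQ: revenue?"
b = "System: analyst\nDate: 2025-01-02\nQ: margins?"
for line in diff_prefixes(a, b):
    print(line)  # the Date line differs: a cache-killer in the prefix
```

Here the differing Date line sits before the dynamic question, so every request misses; the question lines differing is expected and harmless.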
## Common mistakes
Randomizing content before the cache breakpoint. Shuffling few-shot examples for "variety" kills caching. Pick a fixed order and keep it.
Including timestamps or session IDs in the system prompt. I've seen prompts like "Current date: {datetime.now()}" in the static section. That's a cache miss on every single request.
Setting cache_control on the user message. Anthropic's cache breakpoint marks the end of the cacheable prefix. If you put it on the user message, you're caching the dynamic part and potentially thrashing the cache.
Forgetting the minimum token threshold. If your system prompt is 800 tokens, neither provider caches it. Consolidate your instructions and examples until you're comfortably above 1,024 tokens, or accept that you won't get caching benefits on that route.
Prompt caching isn't a premature optimization — for any app making repeated API calls with shared context, it's table stakes. The configuration cost is low (zero for OpenAI, a few extra fields for Anthropic), and the savings kick in immediately on the second request.



