If your AI application sends the same large document, knowledge base, or system instructions with every API call, you're paying to process that content repeatedly. Context caching lets you pay once and reuse.
It's not a new concept — databases have done this forever. But applied to LLM APIs, it can cut input token costs by 70–90% for the right use cases.
## How Context Caching Works
Without caching, every API call sends the full context:

```
[50,000 token document] + [200 token user question] → Response
Cost: 50,200 tokens × input price
```

With caching, the document is stored server-side:

```
First call:  [50,000 token document, cached] + [200 token user question] → Response
Cost: 50,200 tokens × full price, plus a one-time cache-write premium

Subsequent: [cache reference] + [200 token user question] → Response
Cost: 200 tokens × full price + 50,000 tokens × cache-read price (~10% of base)
```

After the first call, each subsequent call costs roughly 90% less for the cached portion.
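That cost structure can be sketched as a small model. The $3/M price and the 10% cache-read ratio below are illustrative assumptions; actual rates vary by provider and model:

```python
def call_cost(cached_tokens: int, fresh_tokens: int, price_per_mtok: float,
              cache_hit: bool, cache_read_ratio: float = 0.10) -> float:
    """Estimated input cost of one call, in dollars.

    cache_read_ratio is the cached-read price as a fraction of the base
    input price (illustrative; check your provider's pricing page).
    """
    per_tok = price_per_mtok / 1_000_000
    if cache_hit:
        # Only the fresh tokens pay full price; the cached portion is discounted
        return fresh_tokens * per_tok + cached_tokens * per_tok * cache_read_ratio
    # No cache: everything is billed at the base input price
    return (cached_tokens + fresh_tokens) * per_tok

# 50K-token document + 200-token question at ~$3/M input tokens
uncached = call_cost(50_000, 200, 3.0, cache_hit=False)  # full price every call
cached = call_cost(50_000, 200, 3.0, cache_hit=True)     # ~90% cheaper per call
```

The same function works for break-even analysis: multiply by expected call volume and compare the two paths.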
## Anthropic Claude: Prompt Caching
Claude's prompt caching uses `cache_control` markers on individual content blocks to designate content for caching.
```python
import anthropic

client = anthropic.Anthropic()

# Cache a large document for reuse across many queries
with open("product_manual.txt", "r") as f:
    product_manual = f.read()

def query_manual(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are a helpful product support agent. Answer questions based on the product manual provided.",
            },
            {
                "type": "text",
                "text": product_manual,
                "cache_control": {"type": "ephemeral"},  # Cache this content
            },
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

# First call: full processing cost
answer1 = query_manual("How do I reset the device to factory settings?")

# Subsequent calls: ~90% cost reduction on the cached portion
answer2 = query_manual("What's the warranty period?")
answer3 = query_manual("How do I connect to Wi-Fi?")
```
Key details:
- Cache lifetime: 5 minutes for the default ephemeral cache, refreshed each time the cached content is used; an extended 1-hour TTL option is also offered
- Minimum cacheable content: ~1,024 tokens (smaller content isn't worth caching)
- Cache pricing: roughly 10% of the base input price for reads; writes carry a premium of about 25% over the base input price
- The cache is per-account and per-API-version — not shared between users
What to cache with Claude:
- Long system prompts with extensive instructions
- Reference documents (manuals, knowledge bases, policies)
- Code files for code review/analysis workflows
- Large templates that appear in every call
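To verify caching is actually engaging, the Messages API response reports cache activity in its `usage` object (`cache_creation_input_tokens` and `cache_read_input_tokens`). A sketch that turns those counters into an effective input cost, assuming the approximate ~1.25× write / ~0.10× read ratios mentioned above:

```python
def claude_input_cost(usage: dict, base_price_per_mtok: float = 3.0) -> float:
    """Effective input cost in dollars, given Anthropic-style usage counters.

    Assumes cache writes bill at ~1.25x base and cache reads at ~0.10x base
    (approximate published ratios; check current pricing before relying on this).
    """
    per_tok = base_price_per_mtok / 1_000_000
    return (
        usage.get("input_tokens", 0) * per_tok                         # uncached input
        + usage.get("cache_creation_input_tokens", 0) * per_tok * 1.25  # cache write
        + usage.get("cache_read_input_tokens", 0) * per_tok * 0.10      # cache read
    )

# First call: the manual is written to the cache (slightly above full price)
first = claude_input_cost({"input_tokens": 200, "cache_creation_input_tokens": 50_000})
# Later calls: the manual is read from the cache (deep discount)
later = claude_input_cost({"input_tokens": 200, "cache_read_input_tokens": 50_000})
```

Logging this per request gives you a running cache hit rate, which is the number that tells you whether the optimization is paying off.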
## Google Gemini: Context Caching
Gemini's caching is handled via the Caches API and has an explicit TTL (time-to-live):
```python
import datetime

import google.generativeai as genai

genai.configure(api_key="your-api-key")

# Read a large document
with open("research_papers.txt", "r") as f:
    research_content = f.read()

# Create a cache with a 2-hour TTL
cache = genai.caching.CachedContent.create(
    model="gemini-2.0-flash",
    display_name="research-papers-cache",
    system_instruction="You are a research analyst. Answer questions based on the provided research papers.",
    contents=[research_content],
    ttl=datetime.timedelta(hours=2),
)

# Use the cached content for queries
model = genai.GenerativeModel.from_cached_content(cached_content=cache)

# All subsequent calls reuse the cached content
response1 = model.generate_content("What are the key findings on working memory?")
response2 = model.generate_content("How do the 2024 studies differ from 2023?")
response3 = model.generate_content("Summarize the methodology section of paper 3")

# Clean up when done
cache.delete()
```
Key details:
- Explicit TTL management — you set and control expiration
- Minimum size: 32,768 tokens for the model shown here (the threshold varies by model; smaller content is not cacheable)
- Cache pricing: roughly 25% of the base input price for reads, plus an hourly storage fee for keeping the cache alive
- Can cache system instructions, documents, and conversation history
Gemini caching is particularly useful for:
- Workflows querying the same large document many times (1M token context + many questions)
- Multi-user applications where all users query the same knowledge base
- Batch processing pipelines analyzing the same base documents
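Because the TTL is explicit, long-running workloads need to notice when a cache is about to lapse and recreate it (or extend the TTL) before issuing more queries. In the `google-generativeai` library a `CachedContent` exposes an `expire_time`; a small helper (a sketch, assuming a timezone-aware expiry timestamp) can make that decision:

```python
import datetime

def needs_refresh(expire_time: datetime.datetime,
                  margin: datetime.timedelta = datetime.timedelta(minutes=5)) -> bool:
    """True if the cache expires within `margin`, meaning the caller should
    recreate it (or extend its TTL) before sending more queries."""
    now = datetime.datetime.now(datetime.timezone.utc)
    return expire_time - now <= margin

# Hypothetical usage before each batch of queries:
#   if needs_refresh(cache.expire_time):
#       cache = recreate_cache()   # hypothetical helper that rebuilds the cache
```

The margin buys headroom so a batch started just before expiry doesn't fail midway.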
## OpenAI: Automatic Prompt Caching
OpenAI handles caching automatically for requests with long, repeated prefixes:
```python
from openai import OpenAI

client = OpenAI()

system_prompt = """[Your long system prompt here - at least 1,024 tokens for caching to kick in]"""

def query_with_cached_system(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```
OpenAI caching is automatic: No special API calls needed. If you send requests with the same content prefix (starting from the beginning of the prompt), OpenAI automatically caches and discounts it.
How to optimize for it:
- Keep static content (system prompts, documents) at the beginning of messages
- Keep dynamic content (user queries) at the end
- Don't modify the cached prefix between calls — even small changes break the cache
Pricing: Cache hits are approximately 50% of the base input price (less aggressive discount than Claude's 10% rate but requires zero configuration).
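The prefix rule can be made concrete: a cache hit requires the new request to begin with exactly the same content as a previous one. A toy illustration of why putting dynamic content first destroys reuse (whole words stand in for tokens here; real tokenization differs):

```python
def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Length of the shared leading run of tokens between two prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

system = "long static system prompt with extensive instructions".split()

# Static content first: the two prompts share the entire system prefix
q1 = system + "question one".split()
q2 = system + "question two".split()

# Dynamic content first: the shared prefix ends almost immediately
bad1 = "question one".split() + system
bad2 = "question two".split() + system
```

With the static-first ordering, the whole system prompt is a reusable prefix; with the dynamic-first ordering, the prompts diverge at the first differing word and nothing behind that point can be cached.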
## When Context Caching Is Worth Using
| Scenario | Cache Value |
|---|---|
| Same 50K+ token document, many queries | High — major cost reduction |
| Large system prompt (5K+ tokens), many calls | High — applies every call |
| Customer support bot with long policy docs | High |
| Each call has different context | None |
| Short prompts (<1K tokens) | None — minimum size requirements |
| One-off or infrequent calls | Low — setup cost vs. savings |
The math: If you have 50,000 tokens of fixed context and send 100 calls per day at Claude Sonnet 4.6 pricing (~$3/million input tokens):
- Without caching: 50,000 × 100 = 5M tokens/day × $3/M = $15/day
- With caching: 50,000 × 1 (write) + 50,000 × 99 × 0.1 (reads) = 545K effective tokens/day × $3/M ≈ $1.64/day (ignoring the small cache-write premium)
At scale, that's an 89% cost reduction on input tokens.
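The daily figures work out as plain arithmetic on the numbers above (like the text, this ignores the cache-write premium):

```python
DOC = 50_000           # fixed context tokens per call
CALLS = 100            # calls per day
PRICE = 3.0 / 1e6      # dollars per input token (~$3/M)

# Every call pays full price for the whole context
without = DOC * CALLS * PRICE

# One full-price write, then 99 reads at ~10% of the base price
with_cache = (DOC * 1 + DOC * (CALLS - 1) * 0.10) * PRICE

savings = 1 - with_cache / without   # fraction of input spend eliminated
```

Plugging in your own document size and call volume into the same three lines gives a quick go/no-go estimate before you touch any caching API.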
## Implementation Checklist
Before implementing caching:
- Identify what content is repeated across calls (system prompts, reference docs)
- Verify the repeated content exceeds minimum size thresholds
- Estimate call volume to quantify savings
- Choose whether to use automatic caching (OpenAI) or explicit (Claude, Gemini)
For explicit caching:
- Structure prompts with static content first, dynamic content last
- Set appropriate TTL for your use case
- Handle cache expiration gracefully (fall back to full prompt)
- Monitor cache hit rates to verify savings are materializing
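The graceful-fallback item in the checklist amounts to a small wrapper: attempt the cached path, and if it fails (expired or deleted cache, for example), resend the full prompt. A provider-agnostic sketch, where both callables are supplied by you:

```python
from typing import Callable

def query_with_fallback(cached_call: Callable[[str], str],
                        full_call: Callable[[str], str],
                        question: str) -> str:
    """Try the cached path first; on any failure fall back to the full prompt.

    cached_call and full_call are hypothetical wrappers around your provider's
    API - one referencing the cache, one sending the complete context.
    """
    try:
        return cached_call(question)
    except Exception:
        # Cache expired, was deleted, or the call failed: pay full price once
        return full_call(question)
```

In production you would narrow the `except` clause to the provider's cache-related error types and recreate the cache after a fallback, so only one call pays the full price.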
The implementation overhead is low. For applications with large repeated contexts, it's one of the highest-ROI optimizations you can make.


