If your AI application sends the same large document, knowledge base, or system instructions with every API call, you're paying to process that content repeatedly. Context caching lets you pay once and reuse.
It's not a new concept — databases have done this forever. But applied to LLM APIs, it can cut input token costs by 70–90% for the right use cases.
## How Context Caching Works
Without caching, every API call sends the full context:

```
[50,000 token document] + [200 token user question] → Response
Cost: 50,200 tokens × input price
```

With caching, the document is stored server-side:

```
First call:  [50,000 token document, cached] + [200 token user question] → Response
Cost: 50,200 tokens × full price, plus a one-time cache-write premium

Subsequent: [cache reference] + [200 token user question] → Response
Cost: 200 tokens × full price + 50,000 tokens × cache-read price (~10% of base)
```

After the first call, each subsequent call costs roughly 90% less for the cached portion.
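That cost structure can be sketched as a small model. The $3/M price and the 10% cache-read ratio below are illustrative assumptions; actual rates vary by provider and model:

```python
def call_cost(cached_tokens: int, fresh_tokens: int, price_per_mtok: float,
              cache_hit: bool, cache_read_ratio: float = 0.10) -> float:
    """Estimated input cost of one call, in dollars.

    cache_read_ratio is the cached-read price as a fraction of the base
    input price (illustrative; check your provider's pricing page).
    """
    per_tok = price_per_mtok / 1_000_000
    if cache_hit:
        # Only the fresh tokens pay full price; the cached portion is discounted
        return fresh_tokens * per_tok + cached_tokens * per_tok * cache_read_ratio
    # No cache: everything is billed at the base input price
    return (cached_tokens + fresh_tokens) * per_tok

# 50K-token document + 200-token question at ~$3/M input tokens
uncached = call_cost(50_000, 200, 3.0, cache_hit=False)  # full price every call
cached = call_cost(50_000, 200, 3.0, cache_hit=True)     # ~90% cheaper per call
```

The same function works for break-even analysis: multiply by expected call volume and compare the two paths.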
## Anthropic Claude: Prompt Caching
Claude's prompt caching uses `cache_control` markers on individual content blocks to designate content for caching.
```python
import anthropic

client = anthropic.Anthropic()

# Cache a large document for reuse across many queries
with open("product_manual.txt", "r") as f:
    product_manual = f.read()

def query_manual(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are a helpful product support agent. Answer questions based on the product manual provided.",
            },
            {
                "type": "text",
                "text": product_manual,
                "cache_control": {"type": "ephemeral"},  # Cache this content
            },
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

# First call: full processing cost
answer1 = query_manual("How do I reset the device to factory settings?")

# Subsequent calls: ~90% cost reduction on the cached portion
answer2 = query_manual("What's the warranty period?")
answer3 = query_manual("How do I connect to Wi-Fi?")
```
Key details:
- Cache lifetime: 5 minutes for the default ephemeral cache, refreshed each time the cached content is used; an extended 1-hour TTL option is also offered
- Minimum cacheable content: ~1,024 tokens (smaller content isn't worth caching)
- Cache pricing: roughly 10% of the base input price for reads; writes carry a premium of about 25% over the base input price
- The cache is per-account and per-API-version — not shared between users
What to cache with Claude:
- Long system prompts with extensive instructions
- Reference documents (manuals, knowledge bases, policies)
- Code files for code review/analysis workflows
- Large templates that appear in every call
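To verify caching is actually engaging, the Messages API response reports cache activity in its `usage` object (`cache_creation_input_tokens` and `cache_read_input_tokens`). A sketch that turns those counters into an effective input cost, assuming the approximate ~1.25× write / ~0.10× read ratios mentioned above:

```python
def claude_input_cost(usage: dict, base_price_per_mtok: float = 3.0) -> float:
    """Effective input cost in dollars, given Anthropic-style usage counters.

    Assumes cache writes bill at ~1.25x base and cache reads at ~0.10x base
    (approximate published ratios; check current pricing before relying on this).
    """
    per_tok = base_price_per_mtok / 1_000_000
    return (
        usage.get("input_tokens", 0) * per_tok                         # uncached input
        + usage.get("cache_creation_input_tokens", 0) * per_tok * 1.25  # cache write
        + usage.get("cache_read_input_tokens", 0) * per_tok * 0.10      # cache read
    )

# First call: the manual is written to the cache (slightly above full price)
first = claude_input_cost({"input_tokens": 200, "cache_creation_input_tokens": 50_000})
# Later calls: the manual is read from the cache (deep discount)
later = claude_input_cost({"input_tokens": 200, "cache_read_input_tokens": 50_000})
```

Logging this per request gives you a running cache hit rate, which is the number that tells you whether the optimization is paying off.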
## Google Gemini: Context Caching
Gemini's caching is handled via the Caches API and has an explicit TTL (time-to-live):
```python
import datetime

import google.generativeai as genai

genai.configure(api_key="your-api-key")

# Read a large document
with open("research_papers.txt", "r") as f:
    research_content = f.read()

# Create a cache with a 2-hour TTL
cache = genai.caching.CachedContent.create(
    model="gemini-2.0-flash",
    display_name="research-papers-cache",
    system_instruction="You are a research analyst. Answer questions based on the provided research papers.",
    contents=[research_content],
    ttl=datetime.timedelta(hours=2),
)

# Use the cached content for queries
model = genai.GenerativeModel.from_cached_content(cached_content=cache)

# All subsequent calls reuse the cached content
response1 = model.generate_content("What are the key findings on working memory?")
response2 = model.generate_content("How do the 2024 studies differ from 2023?")
response3 = model.generate_content("Summarize the methodology section of paper 3")

# Clean up when done
cache.delete()
```
Key details:
- Explicit TTL management — you set and control expiration
- Minimum size: 32,768 tokens for the model shown here (the threshold varies by model; smaller content is not cacheable)
- Cache pricing: roughly 25% of the base input price for reads, plus an hourly storage fee for keeping the cache alive
- Can cache system instructions, documents, and conversation history
Gemini caching is particularly useful for:
- Workflows querying the same large document many times (1M token context + many questions)
- Multi-user applications where all users query the same knowledge base
- Batch processing pipelines analyzing the same base documents
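Because the TTL is explicit, long-running workloads need to notice when a cache is about to lapse and recreate it (or extend the TTL) before issuing more queries. In the `google-generativeai` library a `CachedContent` exposes an `expire_time`; a small helper (a sketch, assuming a timezone-aware expiry timestamp) can make that decision:

```python
import datetime

def needs_refresh(expire_time: datetime.datetime,
                  margin: datetime.timedelta = datetime.timedelta(minutes=5)) -> bool:
    """True if the cache expires within `margin`, meaning the caller should
    recreate it (or extend its TTL) before sending more queries."""
    now = datetime.datetime.now(datetime.timezone.utc)
    return expire_time - now <= margin

# Hypothetical usage before each batch of queries:
#   if needs_refresh(cache.expire_time):
#       cache = recreate_cache()   # hypothetical helper that rebuilds the cache
```

The margin buys headroom so a batch started just before expiry doesn't fail midway.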
## OpenAI: Automatic Prompt Caching
OpenAI handles caching automatically for requests with long, repeated prefixes:
```python
from openai import OpenAI

client = OpenAI()

system_prompt = """[Your long system prompt here - at least 1,024 tokens for caching to kick in]"""

def query_with_cached_system(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```
OpenAI caching is automatic: No special API calls needed. If you send requests with the same content prefix (starting from the beginning of the prompt), OpenAI automatically caches and discounts it.
How to optimize for it:
- Keep static content (system prompts, documents) at the beginning of messages
- Keep dynamic content (user queries) at the end
- Don't modify the cached prefix between calls — even small changes break the cache
Pricing: Cache hits are approximately 50% of the base input price (less aggressive discount than Claude's 10% rate but requires zero configuration).
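The prefix rule can be made concrete: a cache hit requires the new request to begin with exactly the same content as a previous one. A toy illustration of why putting dynamic content first destroys reuse (whole words stand in for tokens here; real tokenization differs):

```python
def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Length of the shared leading run of tokens between two prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

system = "long static system prompt with extensive instructions".split()

# Static content first: the two prompts share the entire system prefix
q1 = system + "question one".split()
q2 = system + "question two".split()

# Dynamic content first: the shared prefix ends almost immediately
bad1 = "question one".split() + system
bad2 = "question two".split() + system
```

With the static-first ordering, the whole system prompt is a reusable prefix; with the dynamic-first ordering, the prompts diverge at the first differing word and nothing behind that point can be cached.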
## When Context Caching Is Worth Using
| Scenario | Cache Value |
|---|---|
| Same 50K+ token document, many queries | High — major cost reduction |
| Large system prompt (5K+ tokens), many calls | High — applies every call |
| Customer support bot with long policy docs | High |
| Each call has different context | None |
| Short prompts (<1K tokens) | None — minimum size requirements |
| One-off or infrequent calls | Low — setup cost vs. savings |
The math: If you have 50,000 tokens of fixed context and send 100 calls per day at Claude Sonnet 4.6 pricing (~$3/million input tokens):
- Without caching: 50,000 × 100 = 5M tokens/day × $3/M = $15/day
- With caching: 50,000 × 1 (write) + 50,000 × 99 × 0.1 (reads) = 545K effective tokens/day × $3/M ≈ $1.64/day (ignoring the small cache-write premium)
At scale, that's an 89% cost reduction on input tokens.
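The daily figures work out as plain arithmetic on the numbers above (like the text, this ignores the cache-write premium):

```python
DOC = 50_000           # fixed context tokens per call
CALLS = 100            # calls per day
PRICE = 3.0 / 1e6      # dollars per input token (~$3/M)

# Every call pays full price for the whole context
without = DOC * CALLS * PRICE

# One full-price write, then 99 reads at ~10% of the base price
with_cache = (DOC * 1 + DOC * (CALLS - 1) * 0.10) * PRICE

savings = 1 - with_cache / without   # fraction of input spend eliminated
```

Plugging in your own document size and call volume into the same three lines gives a quick go/no-go estimate before you touch any caching API.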
## Implementation Checklist
Before implementing caching:
- Identify what content is repeated across calls (system prompts, reference docs)
- Verify the repeated content exceeds minimum size thresholds
- Estimate call volume to quantify savings
- Choose whether to use automatic caching (OpenAI) or explicit (Claude, Gemini)
For explicit caching:
- Structure prompts with static content first, dynamic content last
- Set appropriate TTL for your use case
- Handle cache expiration gracefully (fall back to full prompt)
- Monitor cache hit rates to verify savings are materializing
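The graceful-fallback item in the checklist amounts to a small wrapper: attempt the cached path, and if it fails (expired or deleted cache, for example), resend the full prompt. A provider-agnostic sketch, where both callables are supplied by you:

```python
from typing import Callable

def query_with_fallback(cached_call: Callable[[str], str],
                        full_call: Callable[[str], str],
                        question: str) -> str:
    """Try the cached path first; on any failure fall back to the full prompt.

    cached_call and full_call are hypothetical wrappers around your provider's
    API - one referencing the cache, one sending the complete context.
    """
    try:
        return cached_call(question)
    except Exception:
        # Cache expired, was deleted, or the call failed: pay full price once
        return full_call(question)
```

In production you would narrow the `except` clause to the provider's cache-related error types and recreate the cache after a fallback, so only one call pays the full price.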
The implementation overhead is low. For applications with large repeated contexts, it's one of the highest-ROI optimizations you can make.


