Most teams running AI in production are leaving serious money on the table. If your application sends the same system prompt, tool definitions, or retrieved documents with every request — and you haven't set up prompt caching — you're paying full price for tokens your provider has already processed. In my experience, the savings from proper caching land between 60% and 80% of total token costs for typical production workloads. This post covers exactly how to implement it on both Anthropic and OpenAI APIs, with real pricing numbers and the ordering patterns that actually move your cache hit rate.
## What prompt caching actually is
When you call an AI API, the provider processes your entire input from scratch — tokenizing it, running it through attention layers, building the key-value (KV) cache internally. Prompt caching lets you reuse that computed KV cache across requests. Instead of reprocessing a 20,000-token system prompt on every call, the provider reads from its stored computation.
The key mechanic: caching only works on a prefix of your prompt. The provider checksums the beginning of your input. If a subsequent request starts with an identical prefix of sufficient length, it's a cache hit. Anything after the cached prefix still gets processed normally. This is why prompt structure matters — position your stable content at the top, variable content at the bottom.
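To make the prefix mechanic concrete, here is a toy simulation of prefix-hash lookup. This is purely illustrative: real providers cache computed KV state keyed on token prefixes, not raw text hashes with a character threshold.

```python
import hashlib

class ToyPrefixCache:
    """Illustrative only: real providers cache KV state, not text hashes."""

    def __init__(self, min_prefix_chars: int = 40):
        self.min_prefix_chars = min_prefix_chars
        self.store: set[str] = set()

    def lookup(self, prompt: str, prefix_len: int) -> str:
        """Return 'hit', 'write', or 'skip' for the given prefix length."""
        if prefix_len < self.min_prefix_chars:
            return "skip"  # below the minimum: silently not cached
        key = hashlib.sha256(prompt[:prefix_len].encode()).hexdigest()
        if key in self.store:
            return "hit"
        self.store.add(key)
        return "write"

cache = ToyPrefixCache()
static = "SYSTEM: You are a helpful analyst. " * 3   # stable prefix
print(cache.lookup(static + "Q: revenue?", len(static)))  # write
print(cache.lookup(static + "Q: margins?", len(static)))  # hit: same prefix
print(cache.lookup("Q: hi", 5))                           # skip: too short
```

The two different questions still hit the cache because only the prefix up to `prefix_len` is hashed; the dynamic tail never participates in the match.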
This is distinct from context caching at the infrastructure level, though the end result is similar: you pay less to reuse context.
## Anthropic prompt caching
Anthropic's implementation requires explicit opt-in. You mark cache breakpoints in your request using cache_control blocks, and the API stores everything up to that breakpoint.
### Pricing
| Token type | Cost (relative to base) |
|---|---|
| Cache write (first request) | 1.25× base input price |
| Cache read (subsequent hits) | 0.10× base input price |
| Normal input (uncached) | 1.00× base input price |
The write penalty is real — first-request cost goes up 25%. But from the second request onward, you're paying 10 cents on the dollar for those tokens. The breakeven is almost always the second request.
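A quick sanity check on that breakeven claim, using the multipliers from the table (prices are per million input tokens):

```python
def anthropic_cost(n_requests: int, tokens: int, base_per_mtok: float) -> float:
    """One cache write at 1.25x, then n-1 reads at 0.10x."""
    write = tokens * 1.25 * base_per_mtok / 1e6
    reads = (n_requests - 1) * tokens * 0.10 * base_per_mtok / 1e6
    return write + reads

def uncached_cost(n_requests: int, tokens: int, base_per_mtok: float) -> float:
    """Every request pays full input price."""
    return n_requests * tokens * base_per_mtok / 1e6

# 10,000-token prefix at $3/MTok: caching wins from the second request
print(round(uncached_cost(2, 10_000, 3.0), 4))   # 0.06
print(round(anthropic_cost(2, 10_000, 3.0), 4))  # 0.0405
```

With one request, caching costs more (1.25× vs 1.00×); with two, the cached path is already about 33% cheaper.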
### Minimum prefix lengths
- Claude Sonnet, Opus: 1,024 tokens minimum cacheable prefix
- Claude Haiku: 2,048 tokens minimum cacheable prefix
If your prefix is shorter than the minimum, the cache write is silently skipped. No error, no warning — just no cache hit. This trips up a lot of developers who wonder why their short system prompts aren't getting cached.
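One way to guard against the silent skip is a pre-flight length estimate. The 4-characters-per-token ratio below is a rough heuristic for English text, not a real tokenizer, and the helper names are illustrative:

```python
MIN_CACHEABLE_TOKENS = {"sonnet": 1024, "opus": 1024, "haiku": 2048}

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose.
    return len(text) // 4

def should_cache(prefix: str, model_family: str) -> bool:
    """Warn instead of silently losing the cache write."""
    minimum = MIN_CACHEABLE_TOKENS[model_family]
    estimated = estimate_tokens(prefix)
    if estimated < minimum:
        print(f"Warning: ~{estimated} tokens, below the {minimum}-token minimum")
        return False
    return True

print(should_cache("short prompt", "haiku"))  # False
print(should_cache("x" * 10_000, "sonnet"))   # True
```

For production use, swap the heuristic for a real token count (e.g. the provider's count-tokens endpoint) before trusting the threshold.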
### TTL
Cache entries expire after 5 minutes of inactivity. Each cache hit resets the TTL. For bursty workloads, this is fine. For low-traffic apps with long gaps between requests, you'll get cold cache misses more often.
### Python implementation
```python
import anthropic

client = anthropic.Anthropic()

# Long static document or system instructions
SYSTEM_PROMPT = """You are an expert financial analyst assistant...
[... 2000+ tokens of instructions, examples, and context ...]
"""

REFERENCE_DOCUMENT = """[Your 15,000-token document here]"""

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Cache breakpoint after the system prompt
            "cache_control": {"type": "ephemeral"},
        },
        {
            "type": "text",
            "text": REFERENCE_DOCUMENT,
            # Second cache breakpoint after the document
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[
        {
            "role": "user",
            "content": "What is the revenue trend for Q3?",  # Dynamic — NOT cached
        }
    ],
)

# Check cache performance
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
```
You can place multiple cache_control breakpoints in a single request (the API currently supports up to four). Anthropic caches the prefix up to each marked block. Use this when you have tiered static content — fixed system instructions first, then per-session context, then per-request data — each with its own breakpoint.
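A sketch of that tiering (the helper name and strings are illustrative; the block shape matches the Messages API example above):

```python
def build_system_blocks(instructions: str, session_context: str) -> list:
    """Tiered static content: app-wide instructions first, then
    per-session context, each with its own cache breakpoint."""
    return [
        {
            "type": "text",
            "text": instructions,  # changes only on deploy
            "cache_control": {"type": "ephemeral"},
        },
        {
            "type": "text",
            "text": session_context,  # changes once per session
            "cache_control": {"type": "ephemeral"},
        },
    ]

blocks = build_system_blocks(
    "You are an expert financial analyst assistant...",
    "Session document: Q3 earnings report...",
)
print(len(blocks))  # 2
```

Because each tier has its own breakpoint, a new session only invalidates the second block's cache; the instructions tier keeps hitting.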
### What the response tells you
The usage object returns three token counts:
- cache_creation_input_tokens — tokens written to cache (charged at 1.25×)
- cache_read_input_tokens — tokens read from cache (charged at 0.10×)
- input_tokens — tokens processed normally (charged at 1.00×)
On a warm cache hit, you'll see cache_read_input_tokens equal to the length of your cached prefix and cache_creation_input_tokens equal to zero.
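Those counters fold naturally into a status check. This is a sketch; the `usage` object here is faked with SimpleNamespace rather than taken from a live response:

```python
from types import SimpleNamespace

def cache_status(usage) -> str:
    """Classify a response as a warm hit, a cold write, or fully uncached."""
    if getattr(usage, "cache_read_input_tokens", 0) > 0:
        return "warm hit"
    if getattr(usage, "cache_creation_input_tokens", 0) > 0:
        return "cold write"
    return "uncached"

warm = SimpleNamespace(input_tokens=30,
                       cache_creation_input_tokens=0,
                       cache_read_input_tokens=17_000)
print(cache_status(warm))  # warm hit
```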
## OpenAI prompt caching
OpenAI's approach is the opposite of Anthropic's: fully automatic. No configuration required. The API silently caches prompt prefixes and applies a discount when it gets a hit.
### Pricing
| Token type | Cost (relative to base) |
|---|---|
| Cache hit (prefix) | 0.50× base input price |
| Cache miss / normal | 1.00× base input price |
The discount is 50%, not 90% like Anthropic. But you also don't pay a write penalty. First request is normal price, subsequent hits are half price.
### Minimum prefix length and TTL
- Minimum cacheable prefix: 1,024 tokens
- TTL: typically 5–10 minutes of inactivity, with entries always evicted within about an hour
### Structuring prompts for automatic caching
Since OpenAI caches automatically based on prefix matching, your job is purely structural: keep identical content at the beginning of every request.
````python
from openai import OpenAI

client = OpenAI()

# This prefix must be IDENTICAL across requests for caching to work
STATIC_SYSTEM = """You are a code review assistant specializing in Python.
You follow PEP 8 style guidelines and focus on:
1. Security vulnerabilities
2. Performance bottlenecks
3. Code maintainability
[... rest of your static instructions ...]"""

STATIC_EXAMPLES = """
## Example review
Code:
```python
def get_user(id):
    return db.query(f"SELECT * FROM users WHERE id={id}")
```
Review:
- SQL injection vulnerability: use parameterized queries
- Missing type hints
- No error handling for missing user
[... more examples ...]"""


def review_code(code_snippet: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                # Static prefix — same on every request
                "content": STATIC_SYSTEM + "\n\n" + STATIC_EXAMPLES,
            },
            {
                "role": "user",
                # Dynamic content always comes last
                "content": f"Please review this code:\n\n```python\n{code_snippet}\n```",
            },
        ],
    )
    return response.choices[0].message.content


# Check cache usage
def review_code_with_stats(code_snippet: str):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": STATIC_SYSTEM + "\n\n" + STATIC_EXAMPLES},
            {"role": "user", "content": f"Review:\n\n```python\n{code_snippet}\n```"},
        ],
    )
    usage = response.usage
    cached = usage.prompt_tokens_details.cached_tokens if usage.prompt_tokens_details else 0
    print(f"Total prompt tokens: {usage.prompt_tokens}")
    print(f"Cached tokens: {cached}")
    print(f"Cache hit rate: {cached / usage.prompt_tokens:.1%}")
    return response.choices[0].message.content
````
## Pricing comparison at a glance
| Provider | Cache write cost | Cache read cost | Min prefix | Config required |
|---|---|---|---|---|
| Anthropic | 1.25× base | 0.10× base | 1,024 tokens (Sonnet/Opus); 2,048 (Haiku) | Yes (cache_control) |
| OpenAI | 1.00× base | 0.50× base | 1,024 tokens | No (automatic) |
Anthropic gives a bigger discount on reads (90% off vs. 50% off) but charges more to write. For very high-traffic applications where the same prefix fires thousands of times per hour, Anthropic's model wins. For moderate traffic, OpenAI's zero-config approach is often the better starting point.
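One way to compare the two models is the effective per-token multiplier after n uses of the same prefix, relative to each provider's own base price:

```python
def anthropic_multiplier(n: int) -> float:
    """Average multiplier over n requests: one 1.25x write, n-1 0.10x reads."""
    return (1.25 + 0.10 * (n - 1)) / n

def openai_multiplier(n: int) -> float:
    """One full-price request, then n-1 half-price hits."""
    return (1.00 + 0.50 * (n - 1)) / n

for n in (1, 2, 5, 20):
    print(n, round(anthropic_multiplier(n), 4), round(openai_multiplier(n), 4))
```

Anthropic's effective rate drops below OpenAI's from the second use onward, converging toward 0.10× against OpenAI's 0.50× floor. This compares relative multipliers only; absolute costs also depend on each provider's base price.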
## The ordering rule that determines everything
Both implementations share one absolute requirement: the cached portion must be a stable prefix. The provider checksums the beginning of your input. If anything in that prefix changes between requests, it's a cache miss and you pay full price.
This means your prompt structure should follow a strict hierarchy from most-stable to least-stable:
- System instructions — never change between requests
- Tool/function definitions — change only when you update your app
- Few-shot examples — static or session-scoped
- Reference documents or retrieved context — change per session, not per request
- User message — changes every request
Most developers structure it backwards. They write code that prepends user context and appends boilerplate, then wonder why cache hit rates are near zero. The fix is always the same: invert the structure.
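The inversion, sketched with hypothetical helpers:

```python
STATIC_INSTRUCTIONS = "You are a support agent. Follow policy X. [... 1,500+ tokens ...]"

# Backwards: dynamic content first destroys the stable prefix on every request
def build_prompt_bad(user_context: str, question: str) -> str:
    return f"{user_context}\n\n{STATIC_INSTRUCTIONS}\n\n{question}"

# Inverted: stable prefix first, dynamic content last
def build_prompt_good(user_context: str, question: str) -> str:
    return f"{STATIC_INSTRUCTIONS}\n\n{user_context}\n\n{question}"

a = build_prompt_good("customer ctx A", "How do I reset my password?")
b = build_prompt_good("customer ctx B", "Where is my invoice?")
# The shared leading span is what the provider can actually cache:
print(a[:len(STATIC_INSTRUCTIONS)] == b[:len(STATIC_INSTRUCTIONS)])  # True
```

With the "bad" ordering, two requests diverge at the very first characters, so no usable prefix ever repeats.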
For a deeper look at how prefix ordering affects performance beyond caching, see the context engineering guide.
## High-impact use cases
RAG pipelines — this is the single biggest win. If you retrieve a 20K-token document and send it with every query in a session, you're paying full price for those tokens on every request. Cache the document after the first retrieval. Every follow-up question in that session hits the cache.
Agents with long tool definitions — complex agents often carry 3,000–8,000 tokens of tool definitions on every call in a loop. These definitions don't change mid-run. Cache them. On a 10-step agent loop, you pay write cost once and read cost nine times. For patterns on structuring this, see AI agent design patterns.
Document analysis pipelines — uploading a contract or report for analysis typically involves one large document and many smaller questions. Same document, different queries — exactly what caching is designed for.
Few-shot prompt libraries — if you maintain a library of 15–20 examples to improve output quality, those examples are a natural caching candidate. Put them in the system prompt before the user message, and they're paid for once per TTL window.
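The document-analysis pattern can be sketched against the Anthropic request shape shown earlier. The document string and helper name here are placeholders; building the kwargs separately keeps the structure easy to inspect:

```python
DOCUMENT = "[Your 20,000-token contract text here]"

def build_request(question: str, document: str = DOCUMENT) -> dict:
    """Kwargs for client.messages.create(**build_request(...)).
    The document block carries the breakpoint, so the first call pays the
    1.25x write and every follow-up within the TTL reads at 0.10x."""
    return {
        "model": "claude-opus-4-5",
        "max_tokens": 512,
        "system": [
            {
                "type": "text",
                "text": document,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": question}],  # dynamic, uncached
    }

req = build_request("What is the termination clause?")
print(req["system"][0]["cache_control"])  # {'type': 'ephemeral'}
```

Every subsequent question reuses the same `system` block byte-for-byte, which is exactly what the prefix match requires.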
For reducing what you cache in the first place, the prompt compression guide covers techniques that often combine well with caching.
## Real cost math
Here's a concrete example. A RAG pipeline that:
- Retrieves a 20,000-token document per session
- Receives 1,000 user queries per day
- Uses Claude Sonnet 4.6 (input price: $3.00 per million tokens)
- Average session has 5 queries against the same document
Without caching:
- 1,000 queries × 20,000 tokens = 20,000,000 input tokens/day
- Cost: 20M × $3.00/1M = $60/day
With Anthropic prompt caching:
- 200 cache writes (one per session start): 200 × 20,000 = 4M tokens at 1.25× = $15.00
- 800 cache reads (4 follow-ups per session): 800 × 20,000 = 16M tokens at 0.10× = $4.80
- Total: $19.80/day
- Savings: 67%
With OpenAI GPT-4o (input price: $2.50/1M):
- 200 cache writes (full price): 4M × $2.50/1M = $10.00
- 800 cache reads (50% off): 16M × $1.25/1M = $20.00
- Total: $30.00/day
- Savings: 40% (vs. $50/day uncached)
For workloads with even higher query multipliers per session, Anthropic's 90% read discount compounds dramatically.
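The arithmetic above can be checked with a small helper (a sketch; parameter names are illustrative):

```python
def daily_cost(sessions: int, queries_per_session: int, prefix_tokens: int,
               base_per_mtok: float, write_mult: float, read_mult: float) -> float:
    """Daily cost in dollars: one cache write per session, then cached reads."""
    writes = sessions * prefix_tokens * write_mult * base_per_mtok / 1e6
    reads = (sessions * (queries_per_session - 1) * prefix_tokens
             * read_mult * base_per_mtok / 1e6)
    return writes + reads

# 200 sessions/day x 5 queries each, 20,000-token document
print(round(daily_cost(200, 5, 20_000, 3.00, 1.25, 0.10), 2))  # 19.8 (Anthropic)
print(round(daily_cost(200, 5, 20_000, 2.50, 1.00, 0.50), 2))  # 30.0 (OpenAI)
```

Raising `queries_per_session` shifts more volume onto the read rate, which is where Anthropic's 0.10× multiplier pulls ahead fastest.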
## Monitoring cache performance in production
Don't guess — measure. Both APIs return cache metrics on every response.
Anthropic — check response.usage:
```python
usage = response.usage
cache_hit_rate = (
    usage.cache_read_input_tokens /
    (usage.cache_read_input_tokens + usage.cache_creation_input_tokens + usage.input_tokens)
)
```
OpenAI — check response.usage.prompt_tokens_details:
```python
details = response.usage.prompt_tokens_details
cache_hit_rate = details.cached_tokens / response.usage.prompt_tokens
```
Log these metrics per request and aggregate them. A healthy RAG pipeline should see 70–90% cache hit rates once the TTL window is warm. If you're seeing below 50%, the most common causes are:
- Dynamic content bleeding into your prefix (timestamps, request IDs, random seeds)
- Prefix length below the minimum threshold
- TTL expiring between requests for low-traffic routes
- Inconsistent whitespace or formatting in your static content
The fastest debug step: print the first 200 characters of your prompt on two consecutive requests and diff them. Any difference before the dynamic section explains your cache miss.
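That debug step, scripted with the standard library (the prompt strings here are toy examples):

```python
import difflib

def diff_prefixes(prompt_a: str, prompt_b: str, n_chars: int = 200) -> list:
    """Return the lines that differ in the first n_chars of two prompts."""
    return [
        line
        for line in difflib.ndiff(
            prompt_a[:n_chars].splitlines(), prompt_b[:n_chars].splitlines()
        )
        if line.startswith(("-", "+"))
    ]

a = "System: analyst\nDate: 2025-01-01\nQ: revenue?"
b = "System: analyst\nDate: 2025-01-02\nQ: margins?"
for line in diff_prefixes(a, b):
    print(line)  # the Date line differs: a cache-killer in the prefix
```

Here the differing Date line sits before the dynamic question, so every request misses; the question lines differing is expected and harmless.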
## Common mistakes
Randomizing content before the cache breakpoint. Shuffling few-shot examples for "variety" kills caching. Pick a fixed order and keep it.
Including timestamps or session IDs in the system prompt. I've seen prompts like "Current date: {datetime.now()}" in the static section. That's a cache miss on every single request.
Setting cache_control on the user message. Anthropic's cache breakpoint marks the end of the cacheable prefix. If you put it on the user message, you're caching the dynamic part and potentially thrashing the cache.
Forgetting the minimum token threshold. If your system prompt is 800 tokens, neither provider caches it. Consolidate your instructions and examples until you're comfortably above 1,024 tokens, or accept that you won't get caching benefits on that route.
Prompt caching isn't a premature optimization — for any app making repeated API calls with shared context, it's table stakes. The configuration cost is low (zero for OpenAI, a few extra fields for Anthropic), and the savings kick in immediately on the second request.



