Every API call that repeats the same system prompt is paying full price to reprocess tokens Claude has already seen. For a simple chatbot that sends a 2,000-token system prompt with every message, that's 60 million wasted tokens per month at 1,000 queries/day — before you even count the user messages.
Prompt caching fixes this: cached tokens cost 10% of the standard input price. For a customer support bot with a 2,000-token system prompt handling 1,000 daily queries, that's the difference between roughly ₹22,000/month and ₹5,000/month. One line of code.
## How prompt caching works
Three phases:

- **Cache write (first request):** Claude processes the cacheable block at full input price plus a 25% write surcharge. The cache is stored for 5 minutes from the last access.
- **Cache read (subsequent requests):** Any request that sends the same cached block gets a cache hit. Cost drops to 10% of the standard input price. The 5-minute timer resets on every hit, so an active conversation keeps its cache alive indefinitely.
- **Cache miss (after 5 minutes of inactivity):** The cache expires. The next request pays full price plus the write surcharge again to recreate it.
Here's the full cost breakdown across models:
| Model | Standard input | Cache write | Cache read |
|---|---|---|---|
| Claude Sonnet 4.6 | $3.00/MTok | $3.75/MTok | $0.30/MTok |
| Claude Opus 4.6 | $5.00/MTok | $6.25/MTok | $0.50/MTok |
INR equivalents via AICredits.in (~10% markup on USD):
| Model | Standard input | Cache write | Cache read |
|---|---|---|---|
| Sonnet 4.6 | ₹331/MTok | ₹413/MTok | ₹33/MTok |
| Opus 4.6 | ₹552/MTok | ₹690/MTok | ₹55/MTok |
The math favors caching heavily once your system prompt exceeds ~1,000 tokens and you're making more than ~10 requests per session.
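In fact, the break-even point arrives sooner than that rule of thumb suggests: a single cache hit saves 90% of a block's input cost, while the write surcharge is a one-time 25%. A minimal sketch using the Sonnet 4.6 INR rates from the tables above:

```python
# Sonnet 4.6 INR rates from the tables above.
STANDARD = 331  # ₹/MTok, standard input
WRITE = 413     # ₹/MTok, cache write (25% surcharge)
READ = 33       # ₹/MTok, cache read (10% of standard)

def caching_saves(prompt_tokens: int, requests_per_session: int) -> bool:
    """True if caching a block beats paying the standard rate on every request."""
    mtok = prompt_tokens / 1_000_000
    uncached = STANDARD * mtok * requests_per_session
    cached = WRITE * mtok + READ * mtok * (requests_per_session - 1)
    return cached < uncached

print(caching_saves(2_000, 2))   # True: breaks even on the very second request
print(caching_saves(2_000, 10))  # True: and the gap only widens from there
```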
## What to cache (and what not to)
Cache these:
- System prompts — almost always worth caching if they're consistent across requests. If your system prompt is 2,000+ tokens of instructions, this is the highest-leverage thing you can do.
- Long documents sent with every request — knowledge bases, product catalogues, reference documents. If you're sending a 50,000-word document to Claude for every analysis request, you're paying full price every time.
- Tool definitions — each tool spec is 50-100 tokens. If you have 20 tools, that's 1,000-2,000 tokens per request that can be cached (see the sketch after this list).
- Few-shot examples — if you have a long example block to guide output format, cache it.
Don't cache these:
- The user message — changes every request, can't be cached.
- Short system prompts — below ~1,000 tokens the absolute savings are small, and blocks under the model's minimum cacheable length aren't cached at all.
- Content that changes frequently — if your "current context" block updates every 30 minutes, the cache will miss more than it hits.
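For the tool-definition case flagged in the list above, the usual pattern is to set a breakpoint on the last tool in the array, which caches every tool definition before it as a single prefix. A sketch with hypothetical tool names; `client`, `system_prompt`, and `user_message` are assumed from the surrounding examples:

```python
tools = [
    {
        "name": "get_order_status",  # hypothetical tool
        "description": "Look up the status of an order by its ID.",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
    # ...18 more tool definitions...
    {
        "name": "escalate_to_human",  # hypothetical tool
        "description": "Escalate the conversation to a human agent.",
        "input_schema": {"type": "object", "properties": {}},
        # Marking the LAST tool caches the entire tools array as one prefix.
        "cache_control": {"type": "ephemeral"},
    },
]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    system=system_prompt,
    messages=[{"role": "user", "content": user_message}],
)
```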
## Automatic caching (recommended for most use cases)
The simplest approach: pass `cache_control={"type": "auto"}` as a top-level request parameter. Claude places the cache point automatically at the end of the last cacheable block and moves it forward as the conversation grows.
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    cache_control={"type": "auto"},
    system="""You are a customer support agent for Acme Corp.

Company overview:
[2,000+ tokens of product documentation, policies, procedures...]
""",
    messages=[
        {"role": "user", "content": user_message}
    ]
)

print(response.content[0].text)
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Cache write tokens: {response.usage.cache_creation_input_tokens}")
```
That's it. The first request writes the cache. Every subsequent request within 5 minutes (extended on each hit) reads it for ₹33/MTok instead of ₹331/MTok.
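You can watch the lifecycle directly in those usage fields. A sketch assuming the same `client` and a 2,000-token `SYSTEM_PROMPT` string:

```python
# First call: cache write.
r1 = client.messages.create(model="claude-sonnet-4-6", max_tokens=1024,
                            cache_control={"type": "auto"}, system=SYSTEM_PROMPT,
                            messages=[{"role": "user", "content": "Where is my order?"}])
print(r1.usage.cache_creation_input_tokens)  # ≈ 2,000 (the system prompt, written at ₹413/MTok)
print(r1.usage.cache_read_input_tokens)      # 0

# Second call within 5 minutes: cache hit.
r2 = client.messages.create(model="claude-sonnet-4-6", max_tokens=1024,
                            cache_control={"type": "auto"}, system=SYSTEM_PROMPT,
                            messages=[{"role": "user", "content": "Can I change the address?"}])
print(r2.usage.cache_creation_input_tokens)  # 0
print(r2.usage.cache_read_input_tokens)      # ≈ 2,000 (read at ₹33/MTok)
```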
## Explicit cache breakpoints (for fine-grained control)
For more complex use cases — for example, you want to cache the system prompt but not a dynamically fetched document — use explicit cache breakpoints with `cache_control: {"type": "ephemeral"}` on specific content blocks.
```python
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": """You are a customer support agent for Acme Corp.
[Your 2,000-token static instructions here...]
Always respond in the user's language. Be concise. Escalate billing issues to billing@acme.com.""",
            "cache_control": {"type": "ephemeral"},  # Cache this block
        },
        {
            "type": "text",
            "text": f"""Current product catalogue (updated daily):
{product_catalogue}""",
            # No cache_control — this block changes daily, so it is reprocessed each request
        },
    ],
    messages=[
        {"role": "user", "content": user_message}
    ]
)
```
You can add up to 4 explicit cache breakpoints. Everything up to and including a breakpoint is cached together.
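Breakpoints also work on conversation history. A common pattern, sketched below assuming you store history as plain dicts (the `send_with_history` helper is ours, not part of the SDK), is to move a single breakpoint to the end of the accumulated history each turn, so the whole prefix is read from cache and only the newest message is processed at full price:

```python
def send_with_history(client, system_blocks, history, new_user_message):
    """Move one cache breakpoint to the end of the prior history, then send."""
    # Clear breakpoints set on earlier turns so we stay under the 4-breakpoint limit.
    for msg in history:
        if isinstance(msg["content"], list):
            for block in msg["content"]:
                block.pop("cache_control", None)

    if history:
        last = history[-1]
        # Normalise a plain-string turn into a content-block list so it can carry a breakpoint.
        if isinstance(last["content"], str):
            last["content"] = [{"type": "text", "text": last["content"]}]
        last["content"][-1]["cache_control"] = {"type": "ephemeral"}

    history.append({"role": "user", "content": new_user_message})
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system_blocks,
        messages=history,
    )
```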
## Real cost calculation — customer support bot in India
Let's price this out for a realistic scenario.
Setup: 1,000 customer queries/day. 2,000-token system prompt. 300-token average user message. 500-token average response. Claude Sonnet 4.6 via AICredits.in.
Without caching:
Every request processes: 2,000 (system) + 300 (user) = 2,300 input tokens.
Monthly input tokens: 2,300 × 1,000 × 30 = 69,000,000 tokens = 69 MTok
Cost: 69 × ₹331 = ₹22,839/month
With caching:
Cache write: One 2,000-token write each time the cache goes cold (simplified to roughly one per day, since steady traffic keeps it warm). At ₹413/MTok that's 0.002 MTok × ₹413 ≈ ₹0.83 per write, effectively negligible.
Cache reads: 2,000 cached tokens per request at ₹33/MTok + 300 uncached tokens at ₹331/MTok.
Per day (1,000 requests):
- Cached: 2,000,000 tokens × ₹33/MTok = ₹66
- Uncached: 300,000 tokens × ₹331/MTok = ₹99.30
- Total: ₹165.30/day
Monthly: ₹165.30 × 30 = ₹4,959/month
Savings: ₹17,880/month (~78%)
That's assuming you do nothing else. If your system prompt is longer (e.g., 10,000 tokens), or you're also caching tool definitions, the savings compound further.
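The whole calculation fits in a few lines if you want to plug in your own traffic profile (a sketch using the Sonnet 4.6 INR rates from the tables above):

```python
# Reproduce the numbers above; swap in your own traffic profile.
QUERIES_PER_DAY, DAYS = 1_000, 30
SYSTEM_TOK, USER_TOK = 2_000, 300
STANDARD, READ = 331, 33  # ₹/MTok, Sonnet 4.6 via AICredits.in

def mtok(tokens: float) -> float:
    """Convert a token count to millions of tokens."""
    return tokens / 1_000_000

requests = QUERIES_PER_DAY * DAYS

without_cache = mtok((SYSTEM_TOK + USER_TOK) * requests) * STANDARD
with_cache = (mtok(SYSTEM_TOK * requests) * READ
              + mtok(USER_TOK * requests) * STANDARD)
# Cache writes (~1/day if traffic keeps the cache warm) add
# roughly 30 × 0.002 MTok × ₹413 ≈ ₹25/month and are ignored here.

print(f"Without caching: ₹{without_cache:,.0f}/month")   # ₹22,839
print(f"With caching:    ₹{with_cache:,.0f}/month")      # ₹4,959
print(f"Savings: {1 - with_cache / without_cache:.0%}")  # 78%
```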
## Monitoring cache hit rates
You should always track whether your caching is actually working. The response usage object gives you everything you need:
```python
def make_cached_request(client, system_prompt, user_message):
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        cache_control={"type": "auto"},
        system=system_prompt,
        messages=[{"role": "user", "content": user_message}]
    )

    usage = response.usage
    cache_hits = usage.cache_read_input_tokens
    cache_writes = usage.cache_creation_input_tokens
    uncached = usage.input_tokens  # Tokens processed without cache
    total_input = cache_hits + cache_writes + uncached

    hit_rate = cache_hits / total_input if total_input > 0 else 0
    print(f"Cache hit rate: {hit_rate:.1%}")
    print(f"  Hit: {cache_hits} tokens (₹{cache_hits * 33 / 1_000_000:.4f})")
    print(f"  Write: {cache_writes} tokens (₹{cache_writes * 413 / 1_000_000:.4f})")
    print(f"  Uncached: {uncached} tokens (₹{uncached * 331 / 1_000_000:.4f})")

    return response
```
You want a cache hit rate above 80% for meaningful savings. If you're seeing a low hit rate:
- Check that you're not regenerating the system prompt with dynamic content on every call
- Verify requests are spaced less than 5 minutes apart (for persistent sessions)
- Confirm you're sending the exact same text — even a single character difference breaks the cache
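That last point is the most common silent failure, and it's easy to guard against in code. A minimal sketch (the helper name is ours, not part of any SDK) that hashes the cacheable prefix and warns when it drifts between calls:

```python
import hashlib

_last_prompt_hash = None

def warn_if_prompt_changed(system_prompt: str) -> None:
    """Warn when the cacheable prefix changes between calls; any change busts the cache."""
    global _last_prompt_hash
    digest = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
    if _last_prompt_hash is not None and digest != _last_prompt_hash:
        print(f"WARNING: system prompt changed ({_last_prompt_hash[:8]} -> {digest[:8]}): "
              "this forces a fresh cache write at the surcharged rate.")
    _last_prompt_hash = digest
```

Call it just before each request; an interpolated timestamp or user name in the prompt will show up immediately as a hash change.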
## Combining with the effort parameter for maximum savings
Prompt caching and the effort parameter stack independently. For high-volume applications where you want to minimise costs without sacrificing quality:
```python
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    effort="medium",                 # Reduces thinking token usage ~40-60%
    thinking={"type": "adaptive"},   # Scales thinking to task complexity
    cache_control={"type": "auto"},  # Caches system prompt at 10% of input price
    system=system_prompt,
    messages=conversation_history
)
```
The combined effect: ~78% savings from prompt caching on repeated system tokens, plus ~40-60% reduction in thinking token usage from `effort="medium"`. For a customer support bot where most queries don't need deep reasoning, this combination brings effective cost well below 20% of the naive default configuration.
💡 Track your ₹ savings per API key at AICredits.in — the dashboard shows per-request token breakdown including cache hits vs writes vs uncached.
## Practical checklist
Before you ship any production Claude integration:
- Is your system prompt > 1,000 tokens? Add `cache_control={"type": "auto"}` immediately.
- Are you sending the same document with every request? Put it in the system prompt under a cache breakpoint.
- Do you have many tool definitions? Cache them in a separate block.
- Are requests in the same session within 5 minutes of each other? If not, cache writes are mostly wasted — consider restructuring the session model.
- Are you logging `cache_read_input_tokens`? If you're not monitoring cache hits, you don't know if it's working.
## Next steps
- Claude 4.6 effort parameter and cost optimization — reduce thinking token costs
- Claude Opus 4.6 prompting guide — when to use Opus vs Sonnet
- AICredits.in review — UPI-based API access in India
- Prompt compression guide — shrink system prompts without losing effectiveness
## Try it now with AICredits.in
Access Claude, GPT-4o, Gemini, and 300+ models with UPI payment in ₹. No international card needed. Create free account →