What is prompt engineering?

Prompt engineering is the practice of crafting inputs to AI language models to produce accurate, useful, and reliable outputs. It involves choosing the right words, structure, context, and format to guide the AI toward the response you actually need — rather than a generic or off-target one.

Which AI models benefit most from better prompting?

All major large language models — including ChatGPT (GPT-4o), Claude, and Gemini — respond significantly to prompt quality. The same task can produce dramatically different results depending on how you structure your request. Better prompting improves output across every major model.

Do I need technical skills to do prompt engineering?

No. Prompt engineering is done in natural language — you write text instructions, not code. Basic prompting needs no technical background at all. Advanced techniques like prompt chaining or agentic workflows can benefit from light scripting knowledge, but the core skill is clear written communication.

Where can I learn more about prompt engineering?

MasterPrompting.net offers a structured curriculum from beginner to advanced, covering every major technique from basic clarity and context to chain-of-thought, meta-prompting, and agentic workflows. Start with the Beginner track to build a solid foundation.

Token Counting and Context Management — A Practical Guide for LLM Apps

Token counting broke my app in production. A customer support bot I'd built started throwing 413 Request Entity Too Large errors after three days of conversations. I hadn't budgeted for conversation history growth, and the fix took longer than the original build. Don't make the same mistake.

This guide covers everything you need to manage tokens properly: counting accurately, respecting context limits, avoiding the lost-in-the-middle attention problem, and building truncation strategies that hold up in production.

How token counting actually works

Tokens aren't words. They're subword chunks — roughly 4 characters per token for English text, but that varies a lot by language (Chinese and Korean are 1-2 chars per token; code with repeated patterns compresses well). A common rule of thumb is 1 token ≈ 0.75 words, but don't rely on it for budget planning.

Claude's 200k context window sounds huge until you realize input and output share the limit. If you're sending 190k tokens of context, the model has 10k tokens left to generate a response. For a complex reasoning task, that's not much.
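To make the shared-limit arithmetic concrete, here's a minimal sketch. The function name output_headroom and the 200,000 default are illustrative assumptions, not part of any SDK; check your model's documented limit before relying on the number.

```python
def output_headroom(input_tokens: int, context_window: int = 200_000) -> int:
    """Tokens left for the response once the input is counted.

    context_window is an assumed limit; substitute your model's real one.
    """
    return max(0, context_window - input_tokens)


print(output_headroom(190_000))  # 10000
```

If the headroom is smaller than the max_tokens you plan to request, trim the input first rather than hoping the response stays short.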

Here's how to count precisely with the Anthropic SDK:

import anthropic

client = anthropic.Anthropic()

# Count tokens before sending — no generation happens, fast and cheap
response = client.beta.messages.count_tokens(
    model="claude-sonnet-4-5",
    system="You are a helpful customer support agent.",
    messages=[
        {"role": "user", "content": "What's your return policy for electronics?"}
    ],
    betas=["token-counting-2024-11-01"],
)

print(f"Input tokens: {response.input_tokens}")

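The count_tokens endpoint launched as a beta behind the token-counting-2024-11-01 header, which is why the example goes through client.beta. Recent SDK versions also expose it as client.messages.count_tokens, so check the API reference for the form your version supports. It returns input_tokens without running inference, so you can check your budget before committing to a call. That makes it useful for validating agent state before expensive multi-step runs.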

For faster estimates without an API call, use tiktoken. It's not perfectly accurate for Claude (which uses its own tokenizer), but it's within 5-10% for English text:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 tokenizer — reasonable Claude approximation

def estimate_tokens(text: str) -> int:
    return len(enc.encode(text))

# Quick check before building your payload
system_tokens = estimate_tokens("You are a helpful assistant.")
user_tokens = estimate_tokens("Explain quantum entanglement simply.")
total_estimate = system_tokens + user_tokens + 4  # 4 tokens overhead per message

Use tiktoken for hot paths where you're checking tokens in a loop. Use the Anthropic count_tokens endpoint for pre-flight checks on expensive calls.

The 200k context window reality

Claude Sonnet and Opus support a 200k-token context window (shared between input and output), but that doesn't mean you should fill it. A few things to know:

Cost scales linearly with input tokens. At $3 per million input tokens (Sonnet pricing), sending 100k tokens costs $0.30 per call. If you're doing 1,000 calls per hour, that's $7,200/day in input tokens alone. Prompt caching dramatically cuts this when your system prompt or documents are repeated across calls.
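The daily figure is easy to get wrong by a factor of 24, so here's the arithmetic as a small sketch. The helper name daily_input_cost is hypothetical, and it assumes a steady call rate around the clock with no prompt caching.

```python
def daily_input_cost(tokens_per_call: int, calls_per_hour: int,
                     usd_per_million: float) -> float:
    """Input-token spend per day in USD, assuming a steady call rate."""
    tokens_per_day = tokens_per_call * calls_per_hour * 24
    return tokens_per_day / 1_000_000 * usd_per_million


# The figures from the paragraph above: 100k tokens, 1,000 calls/hour, $3/MTok
print(f"${daily_input_cost(100_000, 1_000, 3.00):,.0f}/day")  # $7,200/day
```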

Latency increases with context. Time-to-first-token grows roughly linearly with input size. At 100k tokens, you're waiting noticeably longer. At 180k tokens, real-time streaming feels sluggish.

Attention quality degrades on irrelevant context. Models technically "see" everything in the context window, but they attend selectively. Padding your context with tangentially related documents doesn't help — it adds noise.

The lost-in-the-middle problem

This is well-documented in research: models attend most strongly to content at the beginning and end of the context window. Content in the middle gets less reliable attention.

In practice, this means:

Your system prompt (start of context) is usually well-attended
The most recent user message (end of context) is usually well-attended
Retrieved documents or conversation history stuffed in the middle may be partially ignored

Mitigation strategies:

Put the most critical information first — if you're doing RAG, put the most relevant chunk at the top, not buried at position 5 of 10
Repeat key constraints at the end — if there's a rule the model must follow, state it in the system prompt AND in the final user turn
Fewer, better chunks — retrieving 3 highly relevant documents beats 10 loosely relevant ones

See the long context prompting guide for more on ordering and placement strategies.
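The three mitigation strategies can be sketched in a single prompt builder. This is an illustration under assumptions: build_rag_prompt is a hypothetical helper, the (score, text) pairs stand in for whatever your retriever returns, and higher scores are assumed to mean more relevant.

```python
def build_rag_prompt(question: str, scored_chunks: list[tuple[float, str]],
                     constraint: str, top_k: int = 3) -> str:
    """Order retrieved chunks by relevance and repeat the key constraint last."""
    # Fewer, better chunks: keep only the top_k, most relevant first
    best = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)[:top_k]
    docs = "\n\n".join(
        f"[Document {i}]\n{text}" for i, (_, text) in enumerate(best, start=1)
    )
    # The constraint is restated at the very end of the user turn
    return f"{docs}\n\nQuestion: {question}\n\nReminder: {constraint}"


chunks = [
    (0.41, "Shipping takes 3-5 days."),
    (0.93, "Electronics can be returned within 30 days."),
    (0.12, "Our office is closed on Sundays."),
    (0.77, "Returns require the original receipt."),
]
prompt = build_rag_prompt(
    "What's the return policy for electronics?",
    chunks,
    "Answer only from the documents above.",
)
print(prompt)
```

The same constraint would normally also live in the system prompt; the builder only handles the end-of-context repetition.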

Managing conversation history

Conversation history is the most common source of runaway token usage. Every turn appends to the message list, and without management, you'll hit context limits in long sessions.

The sliding window approach — keep the last N turns, drop the rest:

def sliding_window_history(
    messages: list[dict],
    max_turns: int = 10,
    max_tokens: int = 50_000
) -> list[dict]:
    """Keep recent turns within token budget."""
    # Start from the last max_turns turns
    windowed = messages[-max_turns * 2:]  # *2 because each turn = user + assistant message

    # The Messages API expects the first message to have the "user" role,
    # so drop a leading assistant message left over from an odd-length slice
    while windowed and windowed[0]["role"] != "user":
        windowed = windowed[1:]

    # Trim further while the estimate is over budget
    while len(windowed) > 2 and estimate_tokens(str(windowed)) > max_tokens:
        windowed = windowed[2:]  # Drop oldest user+assistant pair

    return windowed

This is simple and predictable, but it loses context abruptly. If the user asked for "the report" ten turns ago and you've windowed it out, the model has no idea what report they mean.

The summarize-and-compress approach handles this better. When history exceeds a threshold, summarize old turns with a separate LLM call:

import anthropic

client = anthropic.Anthropic()

def compress_history(
    old_messages: list[dict],
    keep_recent: int = 6
) -> list[dict]:
    """Summarize old turns, keep recent ones verbatim."""
    if len(old_messages) <= keep_recent * 2:
        return old_messages
    
    to_compress = old_messages[:-keep_recent * 2]
    to_keep = old_messages[-keep_recent * 2:]
    
    # Summarize old turns
    summary_prompt = f"""Summarize this conversation history concisely. 
    Capture: key decisions made, important context established, user preferences, 
    any commitments made. Be specific, not vague.
    
    Conversation:
    {format_messages(to_compress)}
    """
    
    summary_response = client.messages.create(
        model="claude-3-5-haiku-latest",  # Use a cheap model for compression
        max_tokens=500,
        messages=[{"role": "user", "content": summary_prompt}]
    )
    
    summary_text = summary_response.content[0].text
    
    # Inject summary as a synthetic "system context" message
    summary_message = {
        "role": "user",
        "content": f"[Previous conversation summary: {summary_text}]"
    }
    filler = {"role": "assistant", "content": "Understood, I'll keep that context in mind."}
    
    return [summary_message, filler] + to_keep

def format_messages(messages: list[dict]) -> str:
    return "\n".join(f"{m['role'].upper()}: {m['content']}" for m in messages)

Use Haiku for the compression call: at the prices listed under Cost estimation it's roughly 4x cheaper than Sonnet on both input and output, and "summarize this conversation" is an easy task. The summary injection adds ~200-300 tokens but lets you drop thousands.
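Deciding when to compress is a separate question from how. A minimal sketch of the threshold check, assuming a hypothetical needs_compression helper, a 30,000-token threshold picked purely for illustration, and the rough 4-characters-per-token heuristic rather than a real tokenizer:

```python
def needs_compression(messages: list[dict], threshold_tokens: int = 30_000,
                      chars_per_token: int = 4) -> bool:
    """Return True once the estimated history size passes the threshold."""
    total_chars = sum(len(str(m["content"])) for m in messages)
    return total_chars / chars_per_token > threshold_tokens


history = [
    {"role": "user", "content": "x" * 80_000},
    {"role": "assistant", "content": "y" * 80_000},
]
print(needs_compression(history))  # True: ~40k estimated tokens
```

Running the check before each request keeps the summarization call off the common path; it only fires on sessions that have actually grown long.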

Context budget planning for agents

Agents are where token management gets genuinely tricky. An agent loop might have:

System prompt: 2k tokens
Tool definitions: 3k tokens
Conversation history: 5k tokens
Retrieved context: 20k tokens
Tool call results: variable (can be huge if a tool returns a full file)
Reserved for response: 4k tokens

That's 34k tokens before you account for tool results. If a tool returns a 50k-token file, you've blown the budget.

Budget planning pattern:

class AgentContextBudget:
    def __init__(self, model_limit: int = 200_000):
        self.model_limit = model_limit
        self.system_prompt_tokens = 0
        self.tool_definitions_tokens = 0
        self.history_tokens = 0
        self.reserved_output_tokens = 4_000  # Minimum response budget
        self.reserved_tool_results_tokens = 10_000  # Buffer for tool outputs
    
    @property
    def available_for_context(self) -> int:
        used = (
            self.system_prompt_tokens
            + self.tool_definitions_tokens
            + self.history_tokens
            + self.reserved_output_tokens
            + self.reserved_tool_results_tokens
        )
        return max(0, self.model_limit - used)
    
    def can_add_retrieved_context(self, chunk_tokens: int) -> bool:
        return chunk_tokens <= self.available_for_context
    
    def truncate_tool_result(self, result: str, max_tokens: int = 8_000) -> str:
        tokens = estimate_tokens(result)
        if tokens <= max_tokens:
            return result
        # Rough character truncation (4 chars/token estimate)
        char_limit = max_tokens * 4
        return result[:char_limit] + f"\n\n[Truncated. Original was ~{tokens} tokens.]"

Always truncate tool results before they go into the context. A bash tool that returns 100k characters of log output will silently eat your context budget.
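For log-style output, cutting from the front can throw away the error at the bottom. One possible variant, sketched here as an alternative to the front-only truncation above, keeps both ends and drops the middle. The name truncate_head_tail and the 4-characters-per-token estimate are assumptions for illustration.

```python
def truncate_head_tail(result: str, max_tokens: int = 8_000,
                       chars_per_token: int = 4) -> str:
    """Keep the start and end of an oversized tool result, drop the middle."""
    char_limit = max_tokens * chars_per_token
    if len(result) <= char_limit:
        return result
    half = char_limit // 2
    omitted = len(result) - 2 * half
    return (
        result[:half]
        + f"\n\n[... {omitted} characters omitted ...]\n\n"
        + result[len(result) - half:]
    )


log = "start of log\n" + "noise\n" * 50_000 + "ERROR: disk full"
short = truncate_head_tail(log, max_tokens=100)
print(len(log), len(short))
```

The marker in the middle tells the model that content is missing, so it can ask for a narrower tool call instead of reasoning over a gap it can't see.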

Cost estimation

Tokens × price per token. Simple formula, but useful to automate:

# Pricing as of mid-2026 (check Anthropic's pricing page for current rates)
# Keys should match the model IDs you actually pass to the API
PRICING = {
    "claude-opus-4-0": {"input": 15.00, "output": 75.00},     # per million tokens
    "claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
    "claude-3-5-haiku-latest": {"input": 0.80, "output": 4.00},
}

def estimate_call_cost(
    model: str,
    input_tokens: int,
    output_tokens: int
) -> float:
    """Returns cost in USD."""
    if model not in PRICING:
        raise ValueError(f"Unknown model: {model}")
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example
cost = estimate_call_cost("claude-sonnet-4-5", input_tokens=50_000, output_tokens=2_000)
print(f"Estimated cost: ${cost:.4f}")  # $0.1800

For agents running thousands of calls per day, track this in your logging pipeline. See agent cost optimization for strategies to bring costs down at scale.

Practical chunking rules

When splitting documents for retrieval or context injection:

Chunk size: 512-1024 tokens per chunk is a safe default. Smaller chunks = more precise retrieval but more chunks to manage. Larger chunks = more context per retrieval but noisier.
Overlap: 10-15% overlap between chunks prevents losing information at boundaries. A 512-token chunk with 50-token overlap means each chunk shares 50 tokens with its neighbors.
Hard limits: Never split mid-sentence or mid-code-block. Split on paragraph breaks, heading boundaries, or logical separations.
Character-to-token rough math: 1 token ≈ 4 English characters. A 2,000-character paragraph ≈ 500 tokens. Use this for quick mental estimates.
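Here's one way the rules above might fit together, as a sketch rather than a production splitter. It assumes paragraphs are separated by blank lines, uses the 4-characters-per-token estimate instead of a real tokenizer, and keeps an oversized paragraph whole rather than splitting it mid-sentence. The names rough_tokens and chunk_paragraphs are hypothetical.

```python
def rough_tokens(text: str) -> int:
    # Heuristic from the list above: ~4 English characters per token
    return max(1, len(text) // 4)


def chunk_paragraphs(text: str, max_tokens: int = 512,
                     overlap_tokens: int = 50) -> list[str]:
    """Pack whole paragraphs into chunks, carrying trailing ones over as overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for para in paragraphs:
        para_tokens = rough_tokens(para)
        if current and current_tokens + para_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            # Carry over trailing paragraphs that fit in the overlap budget
            carried: list[str] = []
            carried_tokens = 0
            for prev in reversed(current):
                prev_tokens = rough_tokens(prev)
                if carried_tokens + prev_tokens > overlap_tokens:
                    break
                carried.insert(0, prev)
                carried_tokens += prev_tokens
            # Skip the overlap if it would push the new chunk over budget
            if carried_tokens + para_tokens > max_tokens:
                carried, carried_tokens = [], 0
            current, current_tokens = carried, carried_tokens
        current.append(para)
        current_tokens += para_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Six 40-character paragraphs, so about 10 estimated tokens each
paras = [f"paragraph {i}".ljust(40, ".") for i in range(6)]
chunks = chunk_paragraphs("\n\n".join(paras), max_tokens=30, overlap_tokens=10)
for chunk in chunks:
    print(chunk.count("\n\n") + 1, "paragraphs")
```

Overlap here is paragraph-granular: a chunk shares whole trailing paragraphs with its successor, which keeps boundaries clean at the cost of less precise overlap sizes.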

What to actually monitor in production

Once your app is running, track these per endpoint and per model:

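P50/P95 input token count: tells you if context is growing unexpectedly
P50/P95 output token count: a max_tokens set too high lets verbose responses run long, while one set too low truncates answers (watch for a stop_reason of "max_tokens")
Context limit hit rate: how often you're hitting truncation or errors
Cost per session: with history management in place, cost should grow roughly linearly with session length; faster growth usually means the full history is being resent on every turn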

Set alerts when your daily token spend exceeds 1.5× your baseline. Spikes usually mean a bug — infinite loops, missing truncation, or a tool returning unexpectedly large results.
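A minimal sketch of the percentile and alert math, using only the standard library. The helper names, the sample counts, and the baseline figure are made up for illustration; in practice the inputs would come from your logging pipeline.

```python
import statistics


def p50_p95(values: list[int]) -> tuple[float, float]:
    """Median and 95th percentile of per-request token counts."""
    # quantiles(n=100) returns 99 cut points; index 49 is P50, index 94 is P95
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[49], cuts[94]


def spend_spike(today_tokens: int, baseline_tokens: float,
                factor: float = 1.5) -> bool:
    """True when today's token spend exceeds factor x the baseline."""
    return today_tokens > factor * baseline_tokens


input_counts = [1_200, 1_350, 1_100, 1_280, 9_800, 1_190, 1_310, 1_250]
p50, p95 = p50_p95(input_counts)
print(f"P50={p50:.0f} P95={p95:.0f}")
print(spend_spike(today_tokens=4_000_000, baseline_tokens=2_500_000))  # True
```

A single outlier barely moves P50 but drags P95 up sharply, which is why tracking both catches a runaway tool result that an average would smooth over.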

The context engineering guide covers the full mental model for thinking about context as a resource. The Claude Sonnet 4.6 guide has model-specific limits and capabilities if you're evaluating which model to use for your context budget.

Token management isn't glamorous, but it's the difference between an app that scales and one that breaks expensively in production.

Related articles

Async Python for LLM Apps — Patterns That Actually Work in Production

50 Best AI Prompts for Claude That Actually Work (2026)

Claude Extended Thinking — How to Prompt for Deep Reasoning
