Token counting broke my app in production. A customer support bot I'd built started throwing 413 Request Entity Too Large errors after three days of conversations. I hadn't budgeted for conversation history growth, and the fix took longer than the original build. Don't make the same mistake.
This guide covers everything you need to manage tokens properly: counting accurately, respecting context limits, avoiding the lost-in-the-middle attention problem, and building truncation strategies that hold up in production.
How token counting actually works
Tokens aren't words. They're subword chunks — roughly 4 characters per token for English text, but that varies a lot by language (Chinese and Korean are 1-2 chars per token; code with repeated patterns compresses well). A common rule of thumb is 1 token ≈ 0.75 words, but don't rely on it for budget planning.
Claude's 200k context window sounds huge until you realize input and output share the limit. If you're sending 190k tokens of context, the model has 10k tokens left to generate a response. For a complex reasoning task, that's not much.
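The shared-limit arithmetic is worth putting in a helper so it gets checked on every call. A minimal sketch, assuming the 200k window described above; `output_headroom` is a hypothetical name, not part of any SDK:

```python
def output_headroom(input_tokens: int, context_window: int = 200_000) -> int:
    """Tokens left for the response once the input is counted against the window."""
    # Clamp at zero so an oversized input never yields a negative budget
    return max(0, context_window - input_tokens)


print(output_headroom(190_000))  # 10000
```

If the headroom comes back smaller than the response you expect, trim the input before sending rather than hoping the model stays brief.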
Here's how to count precisely with the Anthropic SDK:
```python
import anthropic

client = anthropic.Anthropic()

# Count tokens before sending: no generation happens, so it's fast and cheap
response = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    system="You are a helpful customer support agent.",
    messages=[
        {"role": "user", "content": "What's your return policy for electronics?"}
    ],
)
print(f"Input tokens: {response.input_tokens}")
```
The count_tokens endpoint launched as a beta (older code calls client.beta.messages.count_tokens with a token-counting-2024-11-01 beta flag) and is now available directly as client.messages.count_tokens in current SDK versions. It returns input_tokens without actually running inference, so you can use it to check budget before committing to a call. Useful for validating agent state before expensive multi-step runs.
For faster estimates without an API call, use tiktoken. It's not accurate for Claude, which uses its own tokenizer, but it gives a workable ballpark for English text. The gap varies by model and content, so leave a safety margin:
```python
import tiktoken

# GPT-4 tokenizer, used here as a rough stand-in for Claude's
enc = tiktoken.get_encoding("cl100k_base")


def estimate_tokens(text: str) -> int:
    return len(enc.encode(text))


# Quick check before building your payload
system_tokens = estimate_tokens("You are a helpful assistant.")
user_tokens = estimate_tokens("Explain quantum entanglement simply.")
total_estimate = system_tokens + user_tokens + 4  # ~4 tokens of per-message overhead (rough allowance)
```
Use tiktoken for hot paths where you're checking tokens in a loop. Use the Anthropic count_tokens endpoint for pre-flight checks on expensive calls.
The 200k context window reality
Claude Sonnet and Opus support up to 200k input tokens, but that doesn't mean you should fill them. A few things to know:
Cost scales linearly with input tokens. At $3 per million input tokens (Sonnet pricing), sending 100k tokens costs $0.30 per call. If you're doing 1,000 calls per hour, that's $7,200/day in input tokens alone. Prompt caching dramatically cuts this when your system prompt or documents are repeated across calls.
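The daily figure above is simple multiplication, but it's handy to have as a function when you're comparing traffic scenarios. A small sketch; `daily_input_cost` is a hypothetical helper, and the price is the Sonnet input rate quoted in this guide:

```python
def daily_input_cost(
    tokens_per_call: int,
    calls_per_hour: int,
    price_per_million: float,
) -> float:
    """Input-token spend in USD for 24 hours of steady traffic."""
    tokens_per_day = tokens_per_call * calls_per_hour * 24
    return tokens_per_day * price_per_million / 1_000_000


print(f"${daily_input_cost(100_000, 1_000, 3.00):,.2f}")  # $7,200.00
```

This ignores output tokens and any prompt-caching discount, so treat it as an upper bound on input spend for uncached traffic.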
Latency increases with context. Time-to-first-token grows roughly linearly with input size. At 100k tokens, you're waiting noticeably longer. At 180k tokens, real-time streaming feels sluggish.
Attention quality degrades on irrelevant context. Models technically "see" everything in the context window, but they attend selectively. Padding your context with tangentially related documents doesn't help — it adds noise.
The lost-in-the-middle problem
This is well-documented in research: models attend most strongly to content at the beginning and end of the context window. Content in the middle gets less reliable attention.
In practice, this means:
- Your system prompt (start of context) is usually well-attended
- The most recent user message (end of context) is usually well-attended
- Retrieved documents or conversation history stuffed in the middle may be partially ignored
Mitigation strategies:
- Put the most critical information first — if you're doing RAG, put the most relevant chunk at the top, not buried at position 5 of 10
- Repeat key constraints at the end — if there's a rule the model must follow, state it in the system prompt AND in the final user turn
- Fewer, better chunks — retrieving 3 highly relevant documents beats 10 loosely relevant ones
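The three mitigations can be combined in one prompt-assembly step. This is a sketch under assumptions: `build_user_turn` is a hypothetical helper, the relevance scores are whatever your retriever returns, and the `[Document N]` and `Reminder:` labels are arbitrary formatting choices.

```python
def build_user_turn(
    scored_chunks: list[tuple[float, str]],
    question: str,
    key_constraint: str,
    top_k: int = 3,
) -> str:
    """Order retrieved chunks most-relevant-first and restate the constraint last."""
    # Fewer, better chunks: keep only the top_k highest-scoring ones
    best = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)[:top_k]
    # Most relevant chunk goes first, at the top of the context
    documents = "\n\n".join(
        f"[Document {i}]\n{text}" for i, (_, text) in enumerate(best, start=1)
    )
    # Repeat the key constraint at the very end of the final user turn
    return f"{documents}\n\nQuestion: {question}\n\nReminder: {key_constraint}"


prompt = build_user_turn(
    [(0.42, "Shipping takes 3-5 days."), (0.91, "Electronics can be returned within 30 days.")],
    question="Can I return a laptop?",
    key_constraint="Only answer from the documents above.",
)
print(prompt)
```

The constraint passed as `key_constraint` should also live in the system prompt; this helper only covers the end-of-context repetition.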
See the long context prompting guide for more on ordering and placement strategies.
Managing conversation history
Conversation history is the most common source of runaway token usage. Every turn appends to the message list, and without management, you'll hit context limits in long sessions.
The sliding window approach — keep the last N turns, drop the rest:
```python
def sliding_window_history(
    messages: list[dict],
    max_turns: int = 10,
    max_tokens: int = 50_000,
) -> list[dict]:
    """Keep recent turns within token budget."""
    # Keep at most the last max_turns turns
    windowed = messages[-max_turns * 2:]  # *2 because each turn = user + assistant message

    # Check token count and trim further if needed
    # (estimate_tokens is the tiktoken helper defined earlier)
    while len(windowed) > 2:
        token_count = estimate_tokens(str(windowed))
        if token_count <= max_tokens:
            break
        windowed = windowed[2:]  # Drop the two oldest messages

    # Slicing can leave an assistant message at the front; start on a user turn
    while windowed and windowed[0]["role"] != "user":
        windowed = windowed[1:]
    return windowed
```
This is simple and predictable, but it loses context abruptly. If the user asked for "the report" ten turns ago and you've windowed it out, the model has no idea what report they mean.
The summarize-and-compress approach handles this better. When history exceeds a threshold, summarize old turns with a separate LLM call:
```python
import anthropic

client = anthropic.Anthropic()


def compress_history(
    old_messages: list[dict],
    keep_recent: int = 6,
) -> list[dict]:
    """Summarize old turns, keep recent ones verbatim."""
    if len(old_messages) <= keep_recent * 2:
        return old_messages

    to_compress = old_messages[:-keep_recent * 2]
    to_keep = old_messages[-keep_recent * 2:]

    # Summarize old turns
    summary_prompt = f"""Summarize this conversation history concisely.
Capture: key decisions made, important context established, user preferences,
any commitments made. Be specific, not vague.

Conversation:
{format_messages(to_compress)}
"""
    summary_response = client.messages.create(
        model="claude-3-5-haiku-latest",  # Use cheap model for compression
        max_tokens=500,
        messages=[{"role": "user", "content": summary_prompt}],
    )
    summary_text = summary_response.content[0].text

    # Inject summary as a synthetic "system context" message
    summary_message = {
        "role": "user",
        "content": f"[Previous conversation summary: {summary_text}]",
    }
    filler = {"role": "assistant", "content": "Understood, I'll keep that context in mind."}
    return [summary_message, filler] + to_keep


def format_messages(messages: list[dict]) -> str:
    # Assumes each message's content is a plain string
    return "\n".join(f"{m['role'].upper()}: {m['content']}" for m in messages)
```
Use Haiku for the compression call. At the prices listed later in this guide it costs roughly a quarter of what Sonnet does, and "summarize this conversation" is an easy task. The summary injection adds ~200-300 tokens but lets you drop thousands.
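Deciding when to trigger compression is a separate question from how to do it. One possible check is sketched below; `needs_compression` and `approx_tokens` are hypothetical helpers, the 30k threshold is an arbitrary assumption, and the 4-characters-per-token estimate stands in for a real tokenizer so the check stays cheap.

```python
def approx_tokens(text: str) -> int:
    """Rough estimate using the 1 token ≈ 4 characters rule of thumb."""
    return len(text) // 4


def needs_compression(messages: list[dict], threshold_tokens: int = 30_000) -> bool:
    """True once the estimated history size passes the threshold."""
    # Assumes each message's content is a plain string
    total = sum(approx_tokens(m["content"]) for m in messages)
    return total > threshold_tokens


history = [{"role": "user", "content": "x" * 400}] * 3  # ~300 estimated tokens
print(needs_compression(history, threshold_tokens=250))  # True
```

Running this before each model call keeps the summarization cost off the turns that don't need it.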
Context budget planning for agents
Agents are where token management gets genuinely tricky. An agent loop might have:
- System prompt: 2k tokens
- Tool definitions: 3k tokens
- Conversation history: 5k tokens
- Retrieved context: 20k tokens
- Tool call results: variable (can be huge if a tool returns a full file)
- Reserved for response: 4k tokens
That's 34k tokens before you account for tool results. If a tool returns a 50k-token file, you've blown the budget.
Budget planning pattern:
```python
class AgentContextBudget:
    def __init__(self, model_limit: int = 200_000):
        self.model_limit = model_limit
        self.system_prompt_tokens = 0
        self.tool_definitions_tokens = 0
        self.history_tokens = 0
        self.reserved_output_tokens = 4_000  # Minimum response budget
        self.reserved_tool_results_tokens = 10_000  # Buffer for tool outputs

    @property
    def available_for_context(self) -> int:
        used = (
            self.system_prompt_tokens
            + self.tool_definitions_tokens
            + self.history_tokens
            + self.reserved_output_tokens
            + self.reserved_tool_results_tokens
        )
        return max(0, self.model_limit - used)

    def can_add_retrieved_context(self, chunk_tokens: int) -> bool:
        return chunk_tokens <= self.available_for_context

    def truncate_tool_result(self, result: str, max_tokens: int = 8_000) -> str:
        # estimate_tokens is the tiktoken helper defined earlier
        tokens = estimate_tokens(result)
        if tokens <= max_tokens:
            return result
        # Rough character truncation (4 chars/token estimate)
        char_limit = max_tokens * 4
        return result[:char_limit] + f"\n\n[Truncated. Original was ~{tokens} tokens.]"
```
Always truncate tool results before they go into the context. A bash tool that returns 100k characters of log output will silently eat your context budget.
Cost estimation
Tokens × price per token. Simple formula, but useful to automate:
```python
# Pricing as of mid-2026 (check Anthropic's pricing page for current rates)
# Keys must match the model strings you pass to the API
PRICING = {
    "claude-opus-4": {"input": 15.00, "output": 75.00},  # per million tokens
    "claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
    "claude-3-5-haiku-latest": {"input": 0.80, "output": 4.00},
}


def estimate_call_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
) -> float:
    """Returns cost in USD."""
    if model not in PRICING:
        raise ValueError(f"Unknown model: {model}")
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


# Example
cost = estimate_call_cost("claude-sonnet-4-5", input_tokens=50_000, output_tokens=2_000)
print(f"Estimated cost: ${cost:.4f}")  # $0.1800
```
For agents running thousands of calls per day, track this in your logging pipeline. See agent cost optimization for strategies to bring costs down at scale.
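For the logging side, a per-session accumulator is often enough to start with. The sketch below is illustrative: `SessionCostTracker` is a hypothetical class, and it assumes you feed it the `input_tokens` and `output_tokens` values reported on each response's `usage` object.

```python
from dataclasses import dataclass


@dataclass
class SessionCostTracker:
    """Accumulates estimated spend for one conversation or agent run."""

    input_price: float   # USD per million input tokens
    output_price: float  # USD per million output tokens
    calls: int = 0
    total_usd: float = 0.0

    def record(self, input_tokens: int, output_tokens: int) -> float:
        """Add one call's usage and return that call's cost in USD."""
        cost = (
            input_tokens * self.input_price + output_tokens * self.output_price
        ) / 1_000_000
        self.calls += 1
        self.total_usd += cost
        return cost


tracker = SessionCostTracker(input_price=3.00, output_price=15.00)
tracker.record(input_tokens=50_000, output_tokens=2_000)
tracker.record(input_tokens=52_000, output_tokens=1_500)
print(f"{tracker.calls} calls, ${tracker.total_usd:.4f} total")
```

Emitting `total_usd` alongside a session ID in your logs makes it straightforward to aggregate cost per session later.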
Practical chunking rules
When splitting documents for retrieval or context injection:
- Chunk size: 512-1024 tokens per chunk is a safe default. Smaller chunks = more precise retrieval but more chunks to manage. Larger chunks = more context per retrieval but noisier.
- Overlap: 10-15% overlap between chunks prevents losing information at boundaries. A 512-token chunk with 50-token overlap means each chunk shares 50 tokens with its neighbors.
- Hard limits: Never split mid-sentence or mid-code-block. Split on paragraph breaks, heading boundaries, or logical separations.
- Character-to-token rough math: 1 token ≈ 4 English characters. A 2,000-character paragraph ≈ 500 tokens. Use this for quick mental estimates.
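These rules can be sketched as a paragraph-aware chunker. This is one possible implementation under stated assumptions: `chunk_paragraphs` and `approx_tokens` are hypothetical helpers, paragraphs are separated by blank lines, and token counts use the 4-characters-per-token estimate rather than a real tokenizer.

```python
def approx_tokens(text: str) -> int:
    """Rough estimate using the 1 token ≈ 4 characters rule."""
    return max(1, len(text) // 4)


def chunk_paragraphs(
    text: str,
    max_tokens: int = 512,
    overlap_tokens: int = 50,
) -> list[str]:
    """Pack whole paragraphs into chunks, carrying trailing paragraphs over as overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0

    for para in paragraphs:
        para_tokens = approx_tokens(para)
        if current and current_tokens + para_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            # Seed the next chunk with trailing paragraphs that fit the overlap budget
            carried: list[str] = []
            carried_tokens = 0
            for prev in reversed(current):
                prev_tokens = approx_tokens(prev)
                if carried_tokens + prev_tokens > overlap_tokens:
                    break
                carried.insert(0, prev)
                carried_tokens += prev_tokens
            current, current_tokens = carried, carried_tokens
        current.append(para)
        current_tokens += para_tokens

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because it never splits inside a paragraph, a single paragraph longer than `max_tokens` becomes its own oversized chunk, and a chunk can run slightly over the limit once the overlap is added. Overlap is also only carried when a whole trailing paragraph fits within `overlap_tokens`.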
What to actually monitor in production
Once your app is running, track these per endpoint and per model:
- P50/P95 input token count: tells you if context is growing unexpectedly
- P50/P95 output token count: a max_tokens value set too high lets responses run long and inflates output cost
- Context limit hit rate: how often you're hitting truncation or errors
- Cost per session: for conversation apps, cost should be roughly linear with session length
Set alerts when your daily token spend exceeds 1.5× your baseline. Spikes usually mean a bug — infinite loops, missing truncation, or a tool returning unexpectedly large results.
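Both the percentile tracking and the spend alert fit in a few lines of standard-library Python. A minimal sketch; `p50_p95` and `spend_alert` are hypothetical helpers, and how you compute the baseline (for example, a trailing average of daily totals) is left as an assumption.

```python
import statistics


def p50_p95(token_counts: list[int]) -> tuple[float, float]:
    """Median and 95th percentile of per-call token counts (needs 2+ samples)."""
    cuts = statistics.quantiles(token_counts, n=100, method="inclusive")
    return cuts[49], cuts[94]  # cut points are 1-indexed percentiles


def spend_alert(today_tokens: int, baseline_tokens: float, factor: float = 1.5) -> bool:
    """True when today's token spend exceeds factor × the baseline."""
    return today_tokens > factor * baseline_tokens


p50, p95 = p50_p95([1_200, 1_350, 1_100, 9_800, 1_250, 1_300])
print(f"P50={p50:.0f} P95={p95:.0f}")
print(spend_alert(today_tokens=4_000_000, baseline_tokens=2_000_000))  # True
```

A wide gap between P50 and P95, as in the sample data, is often the first sign that a few calls are carrying far more context than the rest.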
The context engineering guide covers the full mental model for thinking about context as a resource. The Claude Sonnet 4.6 guide has model-specific limits and capabilities if you're evaluating which model to use for your context budget.
Token management isn't glamorous, but it's the difference between an app that scales and one that breaks expensively in production.



