As AI applications grow more complex, prompts get longer. Longer prompts cost more, hit context limits, and can actually degrade performance (the "lost in the middle" problem — models attend less well to content buried in the middle of a long context).
Prompt compression is the set of techniques for reducing context size without losing the information that matters for the task.
## Why Compression Matters
**Cost.** Input tokens are billed per token. A 100K-token prompt costs 10× more than a 10K-token prompt at the same per-token price. At scale (thousands of API calls per day), this compounds fast.

**Performance.** Beyond a certain context length, model attention quality degrades for content in the middle of the context. Key information buried in the middle of a 200K-token context is attended to less reliably than the same information in a shorter context.

**Latency.** Time to first token scales with context length. For real-time applications, a 50K-token context is noticeably slower than a 5K-token one.

**Context window limits.** Even with 1M-token windows available, many production pipelines accumulate context fast enough to hit limits in long-running agentic tasks.
## Technique 1: Semantic Summarization
The simplest approach: use an LLM to summarize retrieved content before injecting it.
```python
import anthropic

client = anthropic.Anthropic()

def compress_document(document: str, query: str, max_tokens: int = 500) -> str:
    """Compress a document to retain only query-relevant information."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # use the cheapest model for compression
        max_tokens=max_tokens,
        messages=[
            {
                "role": "user",
                "content": f"""Extract only the information from this document that is relevant to answering: "{query}"

Discard irrelevant sections entirely. Preserve exact quotes and specific facts.
Keep your response under {max_tokens} tokens.

Document:
{document}""",
            }
        ],
    )
    return response.content[0].text

# Usage in a RAG pipeline
def build_compressed_context(retrieved_docs: list[str], query: str) -> str:
    compressed = [compress_document(doc, query) for doc in retrieved_docs]
    return "\n\n---\n\n".join(compressed)
```
**When to use:** When retrieved documents contain significant irrelevant content. Particularly effective for web search results, long PDF documents, and database records with many fields.

**Cost consideration:** You're using a cheap model (Haiku) to compress, then an expensive model (Opus/Sonnet) to answer. This is net-positive when the tokens stripped from the expensive call are worth more than the compression call itself costs.
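A back-of-envelope check makes the trade-off concrete. The sketch below is illustrative: the function name and the prices in the example are hypothetical, not current vendor pricing.

```python
def compression_breakeven(
    doc_tokens: int,
    compressed_tokens: int,
    cheap_in: float,       # $/1M input tokens, compression model
    cheap_out: float,      # $/1M output tokens, compression model
    expensive_in: float,   # $/1M input tokens, answering model
) -> float:
    """Net dollars saved per document by compressing before answering."""
    compression_cost = (doc_tokens * cheap_in + compressed_tokens * cheap_out) / 1e6
    savings = (doc_tokens - compressed_tokens) * expensive_in / 1e6
    return savings - compression_cost

# A 50K-token doc compressed to 500 tokens, with illustrative prices:
# savings  = 49,500 * $15/1M  = $0.7425
# overhead = (50,000 * $1 + 500 * $5)/1M = $0.0525
# net ≈ $0.69 saved per document
```

Compression pays off whenever the savings exceed the overhead, which favors large documents and a wide price gap between the two models.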
## Technique 2: LLMLingua — Token-Level Compression
LLMLingua is an open-source library from Microsoft Research that compresses prompts at the token level using a small language model to determine token importance.
```python
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

long_prompt = """
[Long retrieved context here — potentially tens of thousands of tokens]
"""

compressed = compressor.compress_prompt(
    long_prompt,
    rate=0.4,                       # compress to 40% of original length
    force_tokens=["\n", "?", "!"],  # always preserve these tokens
    drop_consecutive=True,
)

print(f"Original: {compressed['origin_tokens']} tokens")
print(f"Compressed: {compressed['compressed_tokens']} tokens")
print(f"Ratio: {compressed['ratio']}")
```
LLMLingua works by:
- Running a small model to score the importance of each token
- Dropping tokens below the importance threshold
- Preserving structure and key tokens
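The idea can be illustrated with a toy version: score each whitespace token by rarity and keep only the highest-scoring fraction. This is a deliberately crude stand-in for LLMLingua's learned importance model, not how the library actually scores tokens; all names here are illustrative.

```python
from collections import Counter

def toy_token_compress(text: str, rate: float = 0.5) -> str:
    """Keep roughly the top `rate` fraction of tokens by a crude rarity score."""
    tokens = text.split()
    counts = Counter(t.lower() for t in tokens)
    # Rare tokens score higher: frequently repeated words carry less information here.
    scores = [1.0 / counts[t.lower()] for t in tokens]
    k = max(1, int(len(tokens) * rate))
    threshold = sorted(scores, reverse=True)[k - 1]
    # Ties at the threshold are kept, so output may slightly exceed `rate`.
    return " ".join(t for t, s in zip(tokens, scores) if s >= threshold)

print(toy_token_compress("the cat sat on the mat the end", rate=0.5))
# drops the repeated "the", keeps the rarer content words
```

A real importance model conditions on context rather than raw frequency, but the keep/drop mechanics are the same.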
**Typical results:** 3–10× compression with quality degradation of 5–15% on most benchmarks. Works best for retrieval-augmented contexts, less well for structured data.

**When to use:** High-volume pipelines where every token counts and a small quality degradation is tolerable. Not recommended for tasks requiring precise detail from every part of the context.
## Technique 3: Selective Field Inclusion
For structured data (database records, API responses, JSON objects), don't include fields that aren't relevant to the task.
```python
def filter_fields(record: dict, task: str) -> dict:
    """Include only fields relevant to the task."""
    # Define which fields matter for which tasks
    field_map = {
        "summarize_order": ["order_id", "items", "total", "status", "created_at"],
        "check_shipping": ["order_id", "shipping_address", "carrier", "tracking_number", "estimated_delivery"],
        "billing_issue": ["order_id", "payment_method", "amount", "billing_address", "transaction_id"],
    }
    relevant_fields = field_map.get(task, list(record.keys()))
    return {k: v for k, v in record.items() if k in relevant_fields}

# Instead of sending:
# {"order_id": "123", "customer_id": "456", "items": [...], "shipping": {...},
#  "billing": {...}, "metadata": {...}, "internal_notes": [...], ...}
# Send only:
# {"order_id": "123", "items": [...], "total": 89.99, "status": "shipped"}
```
This is especially effective for API integrations where responses contain dozens of fields but only a few are relevant to the AI task.
## Technique 4: Conversation History Compression
For multi-turn applications, conversation history grows unbounded. Compress it:
```python
def compress_history(messages: list[dict], max_messages: int = 10) -> list[dict]:
    """Keep recent messages in full, compress older ones into a summary."""
    if len(messages) <= max_messages:
        return messages

    # Split into old (to compress) and recent (to keep)
    old_messages = messages[:-max_messages]
    recent_messages = messages[-max_messages:]

    # Summarize the old conversation (reuses the `client` from Technique 1)
    history_text = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
    summary_response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"Summarize the key decisions and information from this conversation in 2-3 sentences:\n\n{history_text}",
        }],
    )
    summary = summary_response.content[0].text

    # Prepend the summary as an assistant-turn note
    compressed_history = [
        {"role": "assistant", "content": f"[Earlier conversation summary: {summary}]"},
        *recent_messages,
    ]
    return compressed_history
```
## Technique 5: Positional Optimization
You can't always compress content, but you can position it better. Place the most important information at the beginning and end of the context:
```python
def build_optimized_context(
    system_instructions: str,
    critical_docs: list[str],
    supporting_docs: list[str],
    query: str,
) -> str:
    """Position content for optimal attention."""
    # Join outside the f-string: backslashes inside f-string expressions
    # require Python 3.12+.
    critical = "\n\n".join(critical_docs)
    supporting = "\n\n".join(supporting_docs)
    return f"""{system_instructions}

## Primary Reference (Most Important)
{critical}

## Supporting Context
{supporting}

## Task
{query}"""
```
This doesn't reduce token count but exploits the "primacy and recency" attention bias in transformer models.
## Choosing the Right Technique
| Situation | Best Technique |
|---|---|
| Retrieved web/PDF documents | Semantic summarization |
| High-volume, cost-sensitive pipeline | LLMLingua token compression |
| Structured DB/API responses | Selective field inclusion |
| Long conversation history | History summarization |
| Content is already minimal | Positional optimization |
| Multiple strategies available | Combine: retrieve → filter → summarize → position |
The practical stack for a production RAG pipeline:
1. Retrieve top-k chunks (RAG)
2. Filter to task-relevant fields if structured
3. Summarize semantically with a cheap model
4. Position summaries with critical info first
5. Add the query at the end
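The steps above can be tied together in one function. This is a minimal sketch: `summarize` stands in for `compress_document` (or any cheap-model call), and the chunk and field names are illustrative.

```python
def build_pipeline_context(
    chunks: list[dict],
    query: str,
    relevant_fields: set[str],
    summarize,  # callable (text, query) -> str, e.g. compress_document
) -> str:
    """Retrieve -> filter -> summarize -> position, with the query last."""
    # 1. Filter structured chunks to task-relevant fields.
    filtered = [{k: v for k, v in c.items() if k in relevant_fields} for c in chunks]
    # 2. Summarize each filtered chunk with the cheap model.
    summaries = [summarize(str(c), query) for c in filtered]
    # 3. Position: compressed context first, the task/query at the very end.
    body = "\n\n---\n\n".join(summaries)
    return f"## Context\n{body}\n\n## Task\n{query}"
```

Retrieval itself happens upstream; this function takes the retrieved chunks as input and applies steps 2–5.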
Applied consistently, this pipeline can reduce context by 70–80% vs. naive full-document injection.
## Measuring Quality Impact
Always measure before deploying compression in production:
```python
def evaluate_compression_quality(
    test_cases: list[dict],
    model: str,
    full_context_fn,
    compressed_context_fn,
) -> dict:
    """Compare quality between full and compressed contexts.

    `query_model`, `score_answer_similarity`, and `count_tokens` are
    application-specific helpers you supply.
    """
    results = []
    for case in test_cases:
        full_answer = query_model(full_context_fn(case), case["question"], model)
        compressed_answer = query_model(compressed_context_fn(case), case["question"], model)

        # Score similarity (or use a judge model for quality scoring)
        similarity = score_answer_similarity(
            full_answer, compressed_answer, case["reference_answer"]
        )
        results.append({
            "question": case["question"],
            "full_tokens": count_tokens(full_context_fn(case)),
            "compressed_tokens": count_tokens(compressed_context_fn(case)),
            "quality_delta": similarity,
        })

    avg_compression = sum(r["compressed_tokens"] for r in results) / sum(r["full_tokens"] for r in results)
    avg_quality = sum(r["quality_delta"] for r in results) / len(results)
    return {"compression_ratio": avg_compression, "quality_retention": avg_quality}
```
The goal: maximize compression while staying above your quality threshold. For most use cases, 60–70% compression with >90% quality retention is achievable.
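In practice, the returned metrics can gate a rollout. A minimal sketch, assuming the metrics dict returned by `evaluate_compression_quality` above; the function name and default thresholds are illustrative.

```python
def should_deploy_compression(
    metrics: dict,
    min_quality: float = 0.90,  # retain at least 90% answer quality
    max_ratio: float = 0.40,    # keep at most 40% of the original tokens
) -> bool:
    """Gate deployment on both quality retention and token savings."""
    return (
        metrics["quality_retention"] >= min_quality
        and metrics["compression_ratio"] <= max_ratio
    )
```

Re-run the evaluation whenever the compression prompt, model, or retrieval setup changes; a gate like this only protects you if the metrics stay current.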
