As AI applications grow more complex, prompts get longer. Longer prompts cost more, hit context limits, and can actually degrade performance (the "lost in the middle" problem — models attend less well to content buried in the middle of a long context).
Prompt compression is the set of techniques for reducing context size without losing the information that matters for the task.
## Why Compression Matters
**Cost.** Input tokens are billed per token. A 100K-token prompt costs 10× more than a 10K-token prompt at the same per-token price. At scale (thousands of API calls per day), this compounds fast.

**Performance.** Beyond a certain context length, model attention quality degrades for content in the middle of the context. Key information buried in the middle of a 200K-token context is attended to less reliably than the same information in a shorter context.

**Latency.** Time to first token scales with context length. For real-time applications, a 50K-token context is noticeably slower than a 5K-token one.

**Context window limits.** Even with 1M-token windows available, many production pipelines accumulate context fast enough to hit limits in long-running agentic tasks.
## Technique 1: Semantic Summarization
The simplest approach: use an LLM to summarize retrieved content before injecting it.
```python
import anthropic

client = anthropic.Anthropic()

def compress_document(document: str, query: str, max_tokens: int = 500) -> str:
    """Compress a document to retain only query-relevant information."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # use the cheapest model for compression
        max_tokens=max_tokens,
        messages=[
            {
                "role": "user",
                "content": f"""Extract only the information from this document that is relevant to answering: "{query}"

Discard irrelevant sections entirely. Preserve exact quotes and specific facts.
Keep your response under {max_tokens} tokens.

Document:
{document}""",
            }
        ],
    )
    return response.content[0].text

# Usage in a RAG pipeline
def build_compressed_context(retrieved_docs: list[str], query: str) -> str:
    compressed = [compress_document(doc, query) for doc in retrieved_docs]
    return "\n\n---\n\n".join(compressed)
```
**When to use:** When retrieved documents contain significant irrelevant content. Particularly effective for web search results, long PDF documents, and database records with many fields.

**Cost consideration:** You're using a cheap model (Haiku) to compress, then an expensive model (Opus/Sonnet) to answer. This is net-positive when the tokens stripped from the expensive call are worth more than the compression call itself costs.
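A back-of-envelope check makes the trade-off concrete. The sketch below is illustrative: the function name and the prices in the example are hypothetical, not current vendor pricing.

```python
def compression_breakeven(
    doc_tokens: int,
    compressed_tokens: int,
    cheap_in: float,       # $/1M input tokens, compression model
    cheap_out: float,      # $/1M output tokens, compression model
    expensive_in: float,   # $/1M input tokens, answering model
) -> float:
    """Net dollars saved per document by compressing before answering."""
    compression_cost = (doc_tokens * cheap_in + compressed_tokens * cheap_out) / 1e6
    savings = (doc_tokens - compressed_tokens) * expensive_in / 1e6
    return savings - compression_cost

# A 50K-token doc compressed to 500 tokens, with illustrative prices:
# savings  = 49,500 * $15/1M  = $0.7425
# overhead = (50,000 * $1 + 500 * $5)/1M = $0.0525
# net ≈ $0.69 saved per document
```

Compression pays off whenever the savings exceed the overhead, which favors large documents and a wide price gap between the two models.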
## Technique 2: LLMLingua — Token-Level Compression
LLMLingua is an open-source library from Microsoft Research that compresses prompts at the token level using a small language model to determine token importance.
```python
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

long_prompt = """
[Long retrieved context here — potentially tens of thousands of tokens]
"""

compressed = compressor.compress_prompt(
    long_prompt,
    rate=0.4,                       # compress to 40% of original length
    force_tokens=["\n", "?", "!"],  # always preserve these tokens
    drop_consecutive=True,
)

print(f"Original: {compressed['origin_tokens']} tokens")
print(f"Compressed: {compressed['compressed_tokens']} tokens")
print(f"Ratio: {compressed['ratio']}")
```
LLMLingua works by:
- Running a small model to score the importance of each token
- Dropping tokens below the importance threshold
- Preserving structure and key tokens
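The idea can be illustrated with a toy version: score each whitespace token by rarity and keep only the highest-scoring fraction. This is a deliberately crude stand-in for LLMLingua's learned importance model, not how the library actually scores tokens; all names here are illustrative.

```python
from collections import Counter

def toy_token_compress(text: str, rate: float = 0.5) -> str:
    """Keep roughly the top `rate` fraction of tokens by a crude rarity score."""
    tokens = text.split()
    counts = Counter(t.lower() for t in tokens)
    # Rare tokens score higher: frequently repeated words carry less information here.
    scores = [1.0 / counts[t.lower()] for t in tokens]
    k = max(1, int(len(tokens) * rate))
    threshold = sorted(scores, reverse=True)[k - 1]
    # Ties at the threshold are kept, so output may slightly exceed `rate`.
    return " ".join(t for t, s in zip(tokens, scores) if s >= threshold)

print(toy_token_compress("the cat sat on the mat the end", rate=0.5))
# drops the repeated "the", keeps the rarer content words
```

A real importance model conditions on context rather than raw frequency, but the keep/drop mechanics are the same.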
**Typical results:** 3–10× compression with quality degradation of 5–15% on most benchmarks. Works best for retrieval-augmented contexts, less well for structured data.

**When to use:** High-volume pipelines where every token counts and a small quality degradation is tolerable. Not recommended for tasks requiring precise detail from every part of the context.
## Technique 3: Selective Field Inclusion
For structured data (database records, API responses, JSON objects), don't include fields that aren't relevant to the task.
```python
def filter_fields(record: dict, task: str) -> dict:
    """Include only fields relevant to the task."""
    # Define which fields matter for which tasks
    field_map = {
        "summarize_order": ["order_id", "items", "total", "status", "created_at"],
        "check_shipping": ["order_id", "shipping_address", "carrier", "tracking_number", "estimated_delivery"],
        "billing_issue": ["order_id", "payment_method", "amount", "billing_address", "transaction_id"],
    }
    relevant_fields = field_map.get(task, list(record.keys()))
    return {k: v for k, v in record.items() if k in relevant_fields}

# Instead of sending:
# {"order_id": "123", "customer_id": "456", "items": [...], "shipping": {...},
#  "billing": {...}, "metadata": {...}, "internal_notes": [...], ...}
# Send only:
# {"order_id": "123", "items": [...], "total": 89.99, "status": "shipped"}
```
This is especially effective for API integrations where responses contain dozens of fields but only a few are relevant to the AI task.
## Technique 4: Conversation History Compression
For multi-turn applications, conversation history grows unbounded. Compress it:
```python
def compress_history(messages: list[dict], max_messages: int = 10) -> list[dict]:
    """Keep recent messages in full, compress older ones into a summary."""
    if len(messages) <= max_messages:
        return messages

    # Split into old (to compress) and recent (to keep)
    old_messages = messages[:-max_messages]
    recent_messages = messages[-max_messages:]

    # Summarize the old conversation (reuses the `client` from Technique 1)
    history_text = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
    summary_response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"Summarize the key decisions and information from this conversation in 2-3 sentences:\n\n{history_text}",
        }],
    )
    summary = summary_response.content[0].text

    # Prepend the summary as an assistant-turn note
    compressed_history = [
        {"role": "assistant", "content": f"[Earlier conversation summary: {summary}]"},
        *recent_messages,
    ]
    return compressed_history
```
## Technique 5: Positional Optimization
You can't always compress content, but you can position it better. Place the most important information at the beginning and end of the context:
```python
def build_optimized_context(
    system_instructions: str,
    critical_docs: list[str],
    supporting_docs: list[str],
    query: str,
) -> str:
    """Position content for optimal attention."""
    # Join outside the f-string: backslashes inside f-string expressions
    # require Python 3.12+.
    critical = "\n\n".join(critical_docs)
    supporting = "\n\n".join(supporting_docs)
    return f"""{system_instructions}

## Primary Reference (Most Important)
{critical}

## Supporting Context
{supporting}

## Task
{query}"""
```
This doesn't reduce token count but exploits the "primacy and recency" attention bias in transformer models.
## Choosing the Right Technique
| Situation | Best Technique |
|---|---|
| Retrieved web/PDF documents | Semantic summarization |
| High-volume, cost-sensitive pipeline | LLMLingua token compression |
| Structured DB/API responses | Selective field inclusion |
| Long conversation history | History summarization |
| Content is already minimal | Positional optimization |
| Multiple strategies available | Combine: retrieve → filter → summarize → position |
The practical stack for a production RAG pipeline:
1. Retrieve top-k chunks (RAG)
2. Filter to task-relevant fields if structured
3. Summarize semantically with a cheap model
4. Position summaries with critical info first
5. Add the query at the end
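The steps above can be tied together in one function. This is a minimal sketch: `summarize` stands in for `compress_document` (or any cheap-model call), and the chunk and field names are illustrative.

```python
def build_pipeline_context(
    chunks: list[dict],
    query: str,
    relevant_fields: set[str],
    summarize,  # callable (text, query) -> str, e.g. compress_document
) -> str:
    """Retrieve -> filter -> summarize -> position, with the query last."""
    # 1. Filter structured chunks to task-relevant fields.
    filtered = [{k: v for k, v in c.items() if k in relevant_fields} for c in chunks]
    # 2. Summarize each filtered chunk with the cheap model.
    summaries = [summarize(str(c), query) for c in filtered]
    # 3. Position: compressed context first, the task/query at the very end.
    body = "\n\n---\n\n".join(summaries)
    return f"## Context\n{body}\n\n## Task\n{query}"
```

Retrieval itself happens upstream; this function takes the retrieved chunks as input and applies steps 2–5.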
Applied consistently, this pipeline can reduce context by 70–80% vs. naive full-document injection.
## Measuring Quality Impact
Always measure before deploying compression in production:
```python
def evaluate_compression_quality(
    test_cases: list[dict],
    model: str,
    full_context_fn,
    compressed_context_fn,
) -> dict:
    """Compare quality between full and compressed contexts.

    `query_model`, `score_answer_similarity`, and `count_tokens` are
    application-specific helpers you supply.
    """
    results = []
    for case in test_cases:
        full_answer = query_model(full_context_fn(case), case["question"], model)
        compressed_answer = query_model(compressed_context_fn(case), case["question"], model)

        # Score similarity (or use a judge model for quality scoring)
        similarity = score_answer_similarity(
            full_answer, compressed_answer, case["reference_answer"]
        )
        results.append({
            "question": case["question"],
            "full_tokens": count_tokens(full_context_fn(case)),
            "compressed_tokens": count_tokens(compressed_context_fn(case)),
            "quality_delta": similarity,
        })

    avg_compression = sum(r["compressed_tokens"] for r in results) / sum(r["full_tokens"] for r in results)
    avg_quality = sum(r["quality_delta"] for r in results) / len(results)
    return {"compression_ratio": avg_compression, "quality_retention": avg_quality}
```
The goal: maximize compression while staying above your quality threshold. For most use cases, 60–70% compression with >90% quality retention is achievable.
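In practice, the returned metrics can gate a rollout. A minimal sketch, assuming the metrics dict returned by `evaluate_compression_quality` above; the function name and default thresholds are illustrative.

```python
def should_deploy_compression(
    metrics: dict,
    min_quality: float = 0.90,  # retain at least 90% answer quality
    max_ratio: float = 0.40,    # keep at most 40% of the original tokens
) -> bool:
    """Gate deployment on both quality retention and token savings."""
    return (
        metrics["quality_retention"] >= min_quality
        and metrics["compression_ratio"] <= max_ratio
    )
```

Re-run the evaluation whenever the compression prompt, model, or retrieval setup changes; a gate like this only protects you if the metrics stay current.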
