Token count affects cost, latency, and context window capacity. But prompt compression is not just an optimization problem — it's a quality problem. Prompts that are too long often perform worse than concise ones because they bury the signal in noise.
This lesson covers strategies for reducing token usage while maintaining or improving output quality.
Why Compression Matters
Cost: Most LLM APIs charge per token. For high-volume applications, prompt length directly affects the bill. If a prompt runs 10,000 times per day, trimming it by 200 tokens saves 2 million tokens daily.
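The arithmetic above can be sketched as a pair of helpers. The per-million-token price here is an illustrative assumption, not any provider's real rate:

```python
def daily_token_savings(runs_per_day: int, tokens_trimmed: int) -> int:
    """Tokens saved per day by trimming a prompt that runs many times."""
    return runs_per_day * tokens_trimmed

def daily_cost_savings(tokens_saved: int, usd_per_million_tokens: float) -> float:
    """Convert token savings to dollars at a per-million-token rate."""
    return tokens_saved / 1_000_000 * usd_per_million_tokens

saved = daily_token_savings(10_000, 200)  # 2,000,000 tokens per day
```

At a hypothetical $3 per million input tokens, that trim is worth about $6/day, or roughly $2,000/year, for a single prompt.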
Latency: Processing time grows with prompt length, so longer prompts respond more slowly. For real-time applications, every extra token adds delay.
Context window: Every token of prompt is a token unavailable for context, conversation history, or output. In tasks requiring large inputs (documents, codebases), a bloated system prompt leaves less room for actual content.
Output quality: Counterintuitively, longer prompts don't always produce better outputs. Long prompts with redundant instructions cause the model to lose focus on what actually matters.
The Compression Audit
Before compressing, audit your prompt for these common sources of waste:
Redundant instructions — Saying the same thing multiple ways ("Be concise. Keep responses brief. Don't write long answers.") adds tokens without adding signal. Pick the best phrasing and drop the rest.
Hedging language — Phrases like "Please try to..." or "If possible, you should attempt to..." add tokens while weakening the instruction. Use direct imperatives.
Excessive caveats — "This is a complex task, but..." and "While there are many approaches..." add nothing. Get to the instruction.
Example padding — If you have 5 examples where 2 would do, you're paying for 3 unnecessary examples. Test the minimum number of examples that achieves your quality target.
Role preamble — "You are a helpful AI assistant..." when the model already knows this. Only include role framing when it meaningfully changes behavior.
Explained constraints — "Don't use bullet points because they can be hard to read and we prefer a more flowing prose style" → "Use prose only, not bullet points." The reason doesn't help the model comply better.
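A first-pass audit can be partially automated. The sketch below flags a few of the hedging phrases from the checklist above; the phrase list is illustrative, not exhaustive, and a real audit would still need human review:

```python
import re

# Illustrative hedging phrases drawn from the audit checklist; extend as needed.
HEDGES = [
    r"please try to",
    r"if possible",
    r"you should attempt to",
    r"while there are many approaches",
]

def find_hedges(prompt: str) -> list[str]:
    """Return the hedging phrases found in a prompt (case-insensitive)."""
    return [h for h in HEDGES if re.search(h, prompt, re.IGNORECASE)]
```

Running this over your prompt library is a cheap way to surface candidates for the "use direct imperatives" fix before doing a manual pass.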
Compression Techniques
1. Replace prose with structured shorthand
Prose instructions compress well into structured formats:
Before (47 tokens):
When writing your response, make sure to always start with a brief summary,
then provide the detailed explanation, and finish with a concrete example.
After (14 tokens):
Format: Summary → Explanation → Example
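To compare versions like these quickly, a crude token estimate is often enough to rank candidates (a real tokenizer such as tiktoken gives exact counts; the 1.3 tokens-per-word ratio below is a rough rule-of-thumb assumption):

```python
def approx_tokens(text: str) -> int:
    """Very rough token estimate: ~1.3 tokens per whitespace-separated word."""
    return round(len(text.split()) * 1.3)

before = ("When writing your response, make sure to always start with a brief "
          "summary, then provide the detailed explanation, and finish with a "
          "concrete example.")
after = "Format: Summary -> Explanation -> Example"

approx_tokens(before), approx_tokens(after)  # the shorthand is far smaller
```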
2. Move repeated context to a shared location
If you're injecting the same information into every prompt in a chain, move it to the system prompt and reference it instead.
Instead of repeating:
The user is a senior software engineer working on a Python backend service
that handles payment processing for an e-commerce platform...
Set this once in the system prompt and reference it implicitly in task prompts.
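A minimal sketch of this pattern, assuming an OpenAI-style chat message format (role/content dicts; field names vary by provider):

```python
# Shared context lives in one system message instead of every task prompt.
SYSTEM_CONTEXT = (
    "User profile: senior software engineer; Python backend service; "
    "payment processing for an e-commerce platform."
)

def build_messages(task_prompt: str, history: list[dict]) -> list[dict]:
    """Prepend the shared system context once rather than repeating it per task."""
    return [{"role": "system", "content": SYSTEM_CONTEXT},
            *history,
            {"role": "user", "content": task_prompt}]
```

Each task prompt can now say "review this function" without restating who the user is or what the service does.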
3. Abbreviate reference information
For lookup-style data (options, categories, codes), compress long names to abbreviations with a legend:
Before:
The ticket must be categorized as one of: Billing Issue, Technical Problem,
Feature Request, Account Access, General Inquiry, Other.
After:
Categories: BILL, TECH, FEAT, ACCT, GEN, OTHER
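The legend then lives in your application code, not in the prompt. A sketch of the expansion side, using the categories above:

```python
# Legend mapping compressed codes back to full category names.
CATEGORY_LEGEND = {
    "BILL": "Billing Issue",
    "TECH": "Technical Problem",
    "FEAT": "Feature Request",
    "ACCT": "Account Access",
    "GEN": "General Inquiry",
    "OTHER": "Other",
}

def expand_category(code: str) -> str:
    """Map a model-emitted code back to its full category name."""
    return CATEGORY_LEGEND[code.strip().upper()]
```

The model only ever sees (and emits) the short codes; downstream systems get the full names.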
4. Use XML structure instead of prose labels
XML tags communicate structure efficiently without verbose labels:
Before (more tokens):
Here is the document you need to summarize:
---
[document text]
---
Please summarize the above document in 3 bullet points.
After (fewer tokens):
Summarize in 3 bullets:
<doc>[document text]</doc>
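Assembled programmatically, the compressed version is a one-liner (a minimal sketch; the tag name `<doc>` is just a convention, not required by any API):

```python
def xml_prompt(document: str, n_bullets: int = 3) -> str:
    """Build a compact summarization prompt with an XML-tagged document."""
    return f"Summarize in {n_bullets} bullets:\n<doc>{document}</doc>"
```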
5. Cut examples that don't change behavior
Remove examples one at a time, measuring output quality after each cut. When quality degrades, restore the last example you removed.
For many tasks, 1–2 high-quality examples outperform 5 mediocre ones.
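A harness for this search might look like the sketch below. `score_fn` is a hypothetical stand-in for your real evaluation (an evaluator model or rubric scoring), not a library call:

```python
def min_examples(examples: list[str], score_fn, floor: float) -> list[str]:
    """Drop trailing examples while measured quality stays at or above `floor`."""
    kept = list(examples)
    while len(kept) > 1 and score_fn(kept[:-1]) >= floor:
        kept = kept[:-1]
    return kept
```

Because each candidate set is scored before committing to the cut, the loop stops with the smallest set that still meets the quality floor.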
Compressing Context
Beyond the prompt itself, conversation history and injected context often consume more tokens than the system prompt. Several techniques help:
Summarize conversation history — Instead of passing the full history, summarize earlier turns:
[Earlier conversation summary: User asked for help debugging a React performance issue.
Assistant identified the problem as unnecessary re-renders in the ProductList component
and suggested using React.memo. User has implemented memo and the issue persists.]
User: Still getting re-renders. Here's the profiler output: [...]
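A sketch of the mechanics: replace all but the most recent turns with a single summary message. Here `summarize` is a hypothetical hook for a call to a summarizer model:

```python
def compress_history(turns: list[dict], summarize, keep_recent: int = 2) -> list[dict]:
    """Summarize all but the most recent turns into a single system note."""
    if len(turns) <= keep_recent:
        return turns
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    note = {"role": "system",
            "content": f"[Earlier conversation summary: {summarize(older)}]"}
    return [note, *recent]
```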
Truncate from the middle, not the end — When context must be cut, removing from the middle (between the task setup and the most recent exchange) usually hurts less than cutting from the end.
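A minimal middle-truncation sketch, measured in characters for simplicity (a real version would count tokens, and assumes `max_len` comfortably exceeds `keep_head` plus the marker):

```python
def truncate_middle(text: str, max_len: int, keep_head: int) -> str:
    """Cut from the middle, keeping the head (setup) and tail (recent turns)."""
    if len(text) <= max_len:
        return text
    marker = "\n[...truncated...]\n"
    keep_tail = max_len - keep_head - len(marker)
    return text[:keep_head] + marker + text[-keep_tail:]
```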
Extract only relevant sections — For document QA tasks, extract the relevant passages rather than injecting the entire document.
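Even naive extraction beats injecting everything. The keyword-overlap sketch below is a stand-in for real retrieval (embeddings or BM25 would rank far better):

```python
def extract_relevant(document: str, query: str, k: int = 2) -> list[str]:
    """Return the k paragraphs sharing the most words with the query."""
    q = set(query.lower().split())
    paras = [p for p in document.split("\n\n") if p.strip()]
    ranked = sorted(paras, key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]
```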
When to Use Structured vs. Prose Formats
Structured formats (JSON, XML, CSV, tables) tend to be more token-efficient for:
- Lookup data (categories, options, codes)
- Multi-field inputs that need to be referenced by name
- Output formats that will be parsed programmatically
Prose is more token-efficient for:
- Instructions that flow naturally as sentences
- Nuanced guidance that loses meaning when compressed
- Explanations that reference previous sentences
Don't use JSON for your system prompt just because it looks engineered. Use it when it genuinely reduces tokens or improves clarity.
Measuring the Impact of Compression
Token savings are meaningless if they come with quality losses. The right measurement framework:
- Define a test set — 20–50 representative inputs covering your use cases
- Establish a quality baseline — score outputs from your current prompt on your quality dimensions
- Apply compression — make one or two changes at a time
- Re-score — run the same test set through the compressed prompt
- Compare — if quality holds, keep the compression; if it drops, restore what you cut
For automated pipelines, track both token count and a quality metric (which could be automated via an evaluator model or logged human feedback).
The key discipline: never measure only token count. Compression that saves tokens while reducing quality is a regression, not an improvement.
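The five steps above reduce to a small skeleton. `run_prompt` and `score_output` are hypothetical hooks for your model call and quality metric, and are assumptions, not real APIs:

```python
def compare_prompts(test_set, baseline_prompt, compressed_prompt,
                    run_prompt, score_output):
    """Return (baseline_avg, compressed_avg) quality over the same test set."""
    def avg(prompt):
        scores = [score_output(run_prompt(prompt, x), x) for x in test_set]
        return sum(scores) / len(scores)
    return avg(baseline_prompt), avg(compressed_prompt)
```

If the compressed average holds (within your tolerance), keep the cut; if it drops, restore what you removed, exactly as step 5 prescribes.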
Compression Trade-off Reference
| Technique | Token savings | Quality risk | Best for |
|---|---|---|---|
| Remove redundant instructions | Medium | Low | All prompts |
| Cut hedging language | Low | None | All prompts |
| Compress examples | High | Medium | Few-shot prompts |
| Summarize context | High | Medium | Long conversations |
| Abbreviate lookup data | Medium | Low | Structured data |
| Prose → structured format | Medium | Low | Instructions |
Key Takeaways
- Token count affects cost, latency, context capacity, and often output quality
- Audit for: redundant instructions, hedging language, excessive examples, verbose role preambles
- Compress prose instructions to structured shorthand where possible
- Summarize conversation history rather than passing raw transcript
- Always measure quality after compression — token savings that hurt quality are not wins
- Test the minimum number of examples that achieves your target quality
This is the final lesson in the Advanced Track. You've now covered the full spectrum from meta-prompting through adversarial robustness to token efficiency. Return to the Advanced Track →