In 2025, "context engineering" emerged as the term for something practitioners had been doing for years: carefully architecting everything inside an LLM's context window, not just the prompt.
Prompt Engineering vs. Context Engineering
Prompt engineering asks: "How should I phrase this instruction?"
Context engineering asks: "What is the complete information environment the model should reason from?"
For simple Q&A, these are the same question. For production AI systems — agents, chatbots, RAG pipelines, multi-step workflows — context engineering is the dominant skill.
```
Context Window = System Prompt
               + Conversation History (partial/full/summarized)
               + Retrieved Documents (RAG)
               + Tool Results (function-calling outputs)
               + Injected Structured Data (user profile, state)
               + Few-Shot Examples
```
Every one of these components is a design decision.
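One common ordering of these components can be sketched as a small assembly function. This is a sketch, not a standard API; all names are hypothetical, and other orderings are equally defensible.

```python
def build_context(system_prompt, history, retrieved_docs, tool_results,
                  structured_data, examples):
    """Assemble the full context window from its components.

    Each argument after `system_prompt` is a list of strings. The
    ordering below (system prompt first, history last, nearest the
    user message) is one common choice, not the only one.
    """
    parts = [system_prompt]
    parts.extend(examples)          # static few-shot examples
    parts.extend(retrieved_docs)    # RAG chunks
    parts.extend(tool_results)      # function-calling outputs
    parts.extend(structured_data)   # user profile, session state
    parts.extend(history)           # prior conversation turns
    return "\n\n".join(p for p in parts if p)

ctx = build_context(
    system_prompt="You are a helpful coding assistant.",
    history=["User: How do I sort a list?", "Assistant: Use sorted()."],
    retrieved_docs=["[Doc 1] sorted() returns a new sorted list."],
    tool_results=[],
    structured_data=["Plan: Pro"],
    examples=[],
)
```

Every argument you pass (and every argument you leave empty) is one of the design decisions this article covers.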
The Four Layers of Context
1. System Prompt (Stable Instructions)
The foundation. Contains:
- Role and persona ("You are a helpful coding assistant")
- Task instructions and constraints
- Output format requirements
- Hard guardrails ("Never discuss competitor pricing")
- Optionally: static few-shot examples
The system prompt doesn't change between turns. Keep it focused — every token here is repeated for every message.
Context engineering decision: What level of detail belongs in the system prompt vs. injected dynamically per request?
2. Conversation History
Every prior turn in the session. The naive approach is to include everything forever, which breaks down as conversations grow long: responses get slower and more expensive, and eventually you hit the context limit.
Strategies:
- Full history (best for short conversations) — include everything
- Sliding window (simple, loses old context) — keep last N turns
- Summarization (preserves key facts) — summarize old turns, keep recent ones verbatim
- Selective retrieval (sophisticated) — embed all turns, retrieve only relevant ones
Context engineering decision: How many turns to keep verbatim? When to summarize? What's worth preserving?
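A hybrid of the sliding-window and summarization strategies might look like the sketch below. The summarizer is a placeholder for an LLM call; the default lambda exists only so the sketch runs on its own.

```python
def compress_history(turns, keep_verbatim=4, summarize=None):
    """Keep the last `keep_verbatim` turns verbatim; collapse the rest.

    `summarize` stands in for an LLM summarization call; the default
    below is a placeholder so this sketch is self-contained.
    """
    if len(turns) <= keep_verbatim:
        return list(turns)
    old, recent = turns[:-keep_verbatim], turns[-keep_verbatim:]
    if summarize is None:
        summarize = lambda ts: f"[Summary of {len(ts)} earlier turns]"
    return [summarize(old)] + recent
```

The `keep_verbatim` threshold is exactly the "how many turns to keep verbatim?" decision above; there is no universal right value.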
3. Retrieved Content (RAG)
Externally retrieved documents, database records, or API results injected into the context. This is the most powerful tool for grounding an LLM in specific, up-to-date facts.
Context engineering decisions:
- How many chunks to retrieve? (More context helps, but adds noise)
- How to format retrieved content? (Headers, source labels, separators)
- Where in the context to place retrieved documents? (After system prompt, before user message)
- What metadata to include? (Source URL, date, relevance score)
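A minimal formatter covering the labeling, metadata, and separator decisions might look like this. The field names (`text`, `source`, `date`) are assumptions for illustration, not a standard schema.

```python
def format_chunks(chunks):
    """Render retrieved chunks with source labels, dates, and clear
    separators. Field names ('text', 'source', 'date') are illustrative.
    """
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        header = f"[Document {i} | source: {chunk['source']} | date: {chunk['date']}]"
        blocks.append(f"{header}\n{chunk['text']}")
    return "\n---\n".join(blocks)

formatted = format_chunks([
    {"text": "Refunds are issued within 14 days.",
     "source": "docs/refunds.md", "date": "2025-01-10"},
    {"text": "Shipping takes 3-5 business days.",
     "source": "docs/shipping.md", "date": "2025-02-01"},
])
```

Explicit headers and separators make it easier for the model to cite sources and to avoid blending adjacent chunks together.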
4. Tool Results and Structured Data
Outputs from function calls, current user state, session variables — anything dynamically injected per-request.
```
[Tool: get_user_profile]
Name: Sarah Chen
Plan: Pro
Last login: 2026-02-24
Active projects: 3

[Tool: get_account_status]
Status: Active
Outstanding invoices: 0
```
Context engineering decision: How to format structured data for maximum model comprehension? JSON vs. key-value vs. natural language?
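A sketch of the key-value option, producing blocks like the example above. The function name is hypothetical; whether key-value beats JSON depends on how nested the data is.

```python
def to_key_value(tool_name, data):
    """Render a tool result as labeled key-value lines.

    Flat key-value text is often easier for a model to read than
    nested JSON; for deeply nested data, JSON may still win.
    """
    lines = [f"[Tool: {tool_name}]"]
    lines.extend(f"{k}: {v}" for k, v in data.items())
    return "\n".join(lines)

profile = {"Name": "Sarah Chen", "Plan": "Pro", "Active projects": 3}
block = to_key_value("get_user_profile", profile)
```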
Common Context Architecture Patterns
Pattern 1: Simple Q&A System
```
System: [Short instructions]
User: [Question]
Assistant: [Answer]
```
No retrieval, no history — minimal context. Works for isolated queries.
Pattern 2: RAG Chatbot
```
System: [Instructions + grounding requirements]
[Retrieved documents — 3-5 chunks]
History: [Last 5 turns]
User: [Current message]
```
Balances recency (last 5 turns) with factual grounding (retrieved docs).
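Assembling Pattern 2 as an OpenAI-style `messages` list might look like the sketch below. Placing the retrieved documents inside the system message is one of several reasonable choices; a separate pre-user message is equally common.

```python
def build_rag_messages(system, docs_block, history, user_msg):
    """Assemble a chat-style message list for the RAG chatbot pattern.

    `history` is a list of {"role": ..., "content": ...} dicts;
    keeping the last 10 entries approximates "last 5 turns" if a
    turn is one user/assistant pair.
    """
    messages = [{"role": "system", "content": system + "\n\n" + docs_block}]
    messages.extend(history[-10:])
    messages.append({"role": "user", "content": user_msg})
    return messages

history = [
    {"role": "user", "content": "Do you ship to Canada?"},
    {"role": "assistant", "content": "Yes, we do."},
]
msgs = build_rag_messages("Answer using only the documents provided.",
                          "[Doc 1] Shipping takes 3-5 business days.",
                          history,
                          "How long does delivery take?")
```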
Pattern 3: Long-Running Agent
```
System: [Role + tools + instructions]
[Agent memory — key facts from prior sessions]
[Current task state]
[Recent tool results]
History: [Summary of prior conversation] + [Last 3 turns verbatim]
User: [Current input]
```
Agents need persistent state across sessions — context engineering handles what's worth keeping.
Pattern 4: Personalized Assistant
```
System: [Base instructions]
[User profile: preferences, past interactions, active projects]
[Current context: time of day, location, recent activity]
History: [Last N turns]
User: [Message]
```
Injects user-specific data per request to personalize responses without fine-tuning.
Context Compression Techniques
As context windows grow (Claude now supports 200K tokens), the temptation is to dump everything in. But:
- Longer context = slower, more expensive responses
- Models don't attend uniformly — content in the middle of long contexts is under-attended (the "lost in the middle" problem)
- Noise hurts accuracy — irrelevant retrieved content degrades response quality
Compression strategies:
| Technique | How | When |
|---|---|---|
| Conversation summarization | LLM summarizes old turns | Sessions > 10 turns |
| Chunk re-ranking | Score retrieved chunks, keep top-k | When retrieval is noisy |
| Dynamic few-shot selection | Pick examples relevant to current query | Large example banks |
| Schema stripping | Remove unused JSON fields | Structured data injection |
| Hierarchical context | Summary at top, details below | Long documents |
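Chunk re-ranking from the table can be sketched as follows. The keyword-overlap scorer is deliberately naive and only a stand-in; production systems use a cross-encoder or embedding similarity.

```python
def rerank(chunks, query, k=3):
    """Keep the top-k chunks by relevance to the query.

    The keyword-overlap scorer below is a toy stand-in for a real
    relevance model (cross-encoder or embedding similarity).
    """
    query_terms = set(query.lower().split())

    def score(chunk):
        return len(query_terms & set(chunk.lower().split()))

    return sorted(chunks, key=score, reverse=True)[:k]

top = rerank(
    ["refund policy details", "shipping times", "refund request form"],
    "how do I get a refund",
    k=2,
)
```

Dropping low-scoring chunks trades a little recall for less noise, which the "lost in the middle" discussion below suggests is usually a good trade.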
The "Lost in the Middle" Problem
Research shows LLMs pay disproportionate attention to content at the beginning and end of long contexts — content in the middle gets under-attended.
Practical implications:
- Put the most important instructions at the start of the system prompt
- Put critical retrieved documents near the end of the retrieved content block (closest to the user message)
- If you have many retrieved chunks, re-rank to put the most relevant ones at positions 1 and N (not the middle)
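The positioning advice can be sketched as a reordering helper that alternates ranked chunks between the two ends of the block, so the weakest chunks land in the under-attended middle. This is a simple heuristic, not a standard algorithm.

```python
def order_for_attention(ranked_chunks):
    """Reorder chunks (given best-first) so the most relevant land at
    the start and end of the block, with weaker chunks in the middle.
    """
    out = [None] * len(ranked_chunks)
    left, right = 0, len(ranked_chunks) - 1
    for i, chunk in enumerate(ranked_chunks):
        if i % 2 == 0:
            out[left] = chunk
            left += 1
        else:
            out[right] = chunk
            right -= 1
    return out
```

For five chunks ranked `c1` (best) through `c5` (worst), this yields `[c1, c3, c5, c4, c2]`: the two strongest chunks occupy positions 1 and N.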
Context Engineering Checklist
Before deploying any LLM system, review:
- Does the system prompt contain anything that would be better injected dynamically per request?
- Is conversation history being compressed at the right threshold?
- Is retrieved content formatted with clear source labels and separators?
- Is structured data injected in a format the model reads reliably?
- Are the most critical pieces of context near the beginning or end (not buried in the middle)?
- Is irrelevant content excluded to reduce noise?
- What's the worst-case token count? Does the system degrade gracefully as context grows?
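A rough worst-case budget check for the last item might look like this. The ~4-characters-per-token ratio is a heuristic for English text, not an exact figure; use the model's actual tokenizer for real numbers.

```python
def estimate_tokens(parts, chars_per_token=4):
    """Rough token estimate via the ~4-chars-per-token heuristic for
    English text; use the model's real tokenizer in production."""
    return sum(len(p) for p in parts) // chars_per_token

def within_budget(parts, limit=200_000):
    """True if the assembled context parts should fit the window."""
    return estimate_tokens(parts) <= limit
```

Running this check against the worst-case assembled context (longest history, maximum retrieved chunks, largest tool results) tells you whether the system degrades gracefully or simply overflows.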
Key Takeaways
- Context engineering is about architecting the full context window, not just writing better prompts
- The four layers: system prompt, conversation history, retrieved content, injected data
- Compress conversation history before it degrades performance or hits limits
- Models attend less to content in the middle of long contexts — position matters
- Irrelevant context hurts as much as missing context — include what's needed, exclude what isn't