In 2023, "prompt engineering" meant figuring out how to phrase your ChatGPT message to get a better response. The context window was 4,000 tokens. The model couldn't access the internet, call APIs, or remember last Tuesday's conversation. You gave it words, it gave you words back.
In 2026, none of that is still true. Context windows are 200,000+ tokens. Models run multi-hour autonomous tasks, call external APIs, maintain memory across sessions, and receive inputs from a dozen different sources before generating a single word of output. The prompt you write is one small piece of a much larger context management problem.
This shift is why context engineering has become the actual skill. Prompting — the act of writing instructions well — is still important. But it's now embedded inside a larger discipline that determines what the model knows, in what form, at what point in its processing. Get context engineering wrong and it doesn't matter how well-crafted your prompt is.
What changed between 2023 and 2026
The numbers tell the story. GPT-3.5's context window in 2023: 4,096 tokens. Claude 3.5 Sonnet: 200,000 tokens. That's nearly a 50x increase in how much information a model can hold in working memory at once.
But the more significant change isn't the window size — it's what's being put into it.
2023 LLM input: System prompt + user message. That's it.
2026 LLM input: System prompt + retrieved documents from vector search + tool call results from external APIs + conversation history from previous sessions + files the user uploaded + agent scratchpad from previous reasoning steps + structured data from a database query + the user's current message.
A production AI agent in 2026 might have a context window that contains:
- A 2,000-token system prompt defining the agent's behaviour
- 15,000 tokens of retrieved knowledge base articles (RAG)
- 8,000 tokens of conversation history (past 10 turns)
- 12,000 tokens of tool call results (web searches, API responses, code execution output)
- 3,000 tokens of "agent scratchpad" from previous reasoning steps
- 500 tokens of the actual user message
That's 40,000 tokens of context wrapped around a 500-token user message. Managing what goes in there — and what doesn't — is the new core skill.
What context engineering actually is
Here's the most useful definition I've found: Context engineering is the discipline of designing and managing everything that goes into an LLM's context window to produce reliable, accurate, and appropriate outputs.
The components of an LLM's context include:
- System prompt: Instructions, persona, constraints, output format
- Retrieved documents (RAG): External knowledge retrieved based on the current query
- Tool call results: Outputs from API calls, web searches, code execution, database queries
- Conversation history: Previous turns in the current session
- Agent scratchpad: Intermediate reasoning and partial work from multi-step tasks
- User-provided files: Documents, images, data the user explicitly included
- Long-term memory: Facts retrieved from a persistent memory store about this user or task
Context engineering is the practice of deciding: what to include, in what format, in what order, at what level of granularity, and what to exclude entirely.
The prompt you write is one input. Context engineering manages all the rest.
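One way to make that scope concrete is to model the context as a structured object rather than one big string. A minimal sketch in Python (every class and field name here is illustrative, not from any particular framework):

```python
from dataclasses import dataclass, field

@dataclass
class LLMContext:
    """Everything that will be assembled into the model's context window."""
    system_prompt: str                                            # instructions, persona, constraints
    long_term_memories: list[str] = field(default_factory=list)  # persistent facts about this user
    retrieved_docs: list[str] = field(default_factory=list)      # RAG chunks
    tool_results: list[str] = field(default_factory=list)        # API / search / code output
    conversation_history: list[str] = field(default_factory=list)
    scratchpad: str = ""                                          # intermediate agent reasoning
    user_files: list[str] = field(default_factory=list)
    user_message: str = ""

    def assemble(self) -> str:
        """Inclusion and ordering become explicit decisions, not accidents of concatenation."""
        sections = (
            self.long_term_memories
            + self.retrieved_docs
            + self.tool_results
            + self.conversation_history
            + [self.scratchpad]
            + self.user_files
            + [self.user_message]
        )
        return "\n\n".join(s for s in sections if s)
```

The system prompt stays out of `assemble()` because most APIs take it as a separate field; everything else is a deliberate choice about what the model gets to see.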
The core problems context engineering solves
The lost-in-the-middle problem
Research from 2024 confirmed what practitioners had been observing: LLMs pay less attention to information in the middle of long contexts than to information at the beginning or end. Put your most important instruction at position 50,000 in a 100,000-token context and the model will systematically underweight it compared to the same instruction at position 1,000.
This has practical consequences for RAG systems. When you retrieve 10 documents and stuff them into context, the model pays more attention to documents 1 and 10 than to documents 4-7. If your most relevant chunk is in the middle, you're leaving quality on the table.
The fix: Structure context so critical information is at the beginning (system prompt and immediate task context) or end (the current query and most relevant retrieved content). When building RAG, put the highest-scoring retrieved chunk first, not last — and consider a "top-and-tail" structure where you put the most relevant content at both the top and bottom of the retrieved section.
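A minimal sketch of that ordering step, assuming each retrieved chunk carries a relevance_score from your retriever. This is one reading of "top-and-tail": strongest chunk first, runner-up last.

```python
def order_for_context(chunks):
    """Counter lost-in-the-middle: strongest chunk at the top, second-strongest
    at the bottom, weaker chunks in the middle where attention is lowest."""
    ranked = sorted(chunks, key=lambda c: c.relevance_score, reverse=True)
    if len(ranked) <= 2:
        return ranked
    return [ranked[0], *ranked[2:], ranked[1]]
```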
Context poisoning
A bad piece of information in context doesn't just fail to help — it actively hurts. If your vector search retrieves a document that's confidently wrong about the topic at hand, the model will often incorporate that wrong information into its answer even when it knows better from its training.
This is context poisoning: a contaminated context that degrades output quality downstream.
The fix: Don't retrieve blindly based on semantic similarity alone. Add a relevance threshold — if your top cosine similarity score is below 0.75 (or whatever threshold you've validated for your use case), consider injecting no retrieved content rather than injecting low-relevance content. A well-calibrated model with no RAG often outperforms a model with poorly-retrieved RAG.
Reranking helps significantly here. Retrieve broadly (top 20 chunks), then rerank with a cross-encoder model to identify the genuinely relevant subset (top 3-5 chunks), then inject only the reranked results. The extra compute is worth it.
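A sketch of that retrieve-broadly-then-rerank flow. The cross-encoder comes from the sentence-transformers library; the `vector_store` object, its `search()` method, and the `.text` / `.similarity` attributes are stand-ins for whatever store you use, and the 0.75 cosine threshold is the illustrative figure from above, not a universal constant.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # one common choice

def retrieve_for_context(query, vector_store, sim_threshold=0.75, top_n=5):
    # 1. Retrieve broadly on embedding similarity.
    candidates = vector_store.search(query, top_k=20)

    # 2. If even the best match is weak, inject nothing rather than low-relevance noise.
    if not candidates or max(c.similarity for c in candidates) < sim_threshold:
        return []

    # 3. Rerank with a cross-encoder, which scores each (query, document) pair jointly,
    #    and keep only the top few for injection.
    scores = reranker.predict([(query, c.text) for c in candidates])
    reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in reranked[:top_n]]
```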
Context overflow
What happens when your agent's inputs exceed the context window? In naive implementations: the model silently loses information. Older conversation history drops off. Retrieved documents get truncated. The model starts working from incomplete information without knowing it.
The fix: Implement explicit overflow handling. For conversation history: instead of a sliding window that silently drops old messages, run a summarisation step that compresses older turns into a structured summary, then prepend that summary at the start of the conversation history. The model now has access to the gist of the full conversation, not just the last 10 turns.
For agent tasks that might run long: implement "context checkpoints" where the agent periodically writes a structured state summary — what it's done, what it's decided, what it still needs to do. If the context gets too full, you can restart with that checkpoint rather than from scratch.
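A minimal sketch of a checkpoint, assuming the agent keeps its state in a plain dict where `goal` is a string and `completed_steps`, `decisions`, and `open_items` are lists of short strings (all hypothetical names):

```python
import json

def write_checkpoint(state: dict, path: str = "checkpoint.json") -> None:
    """Persist a compact structured summary of progress so a fresh context
    can be rebuilt from it rather than from scratch."""
    with open(path, "w") as f:
        json.dump(
            {
                "goal": state["goal"],
                "completed_steps": state["completed_steps"],
                "decisions": state["decisions"],
                "open_items": state["open_items"],
            },
            f,
            indent=2,
        )

def resume_preamble(path: str = "checkpoint.json") -> str:
    """Render the checkpoint as a short preamble for the restarted context."""
    with open(path) as f:
        cp = json.load(f)
    return (
        f"Task goal: {cp['goal']}\n"
        f"Already done: {'; '.join(cp['completed_steps'])}\n"
        f"Decisions made: {'; '.join(cp['decisions'])}\n"
        f"Still to do: {'; '.join(cp['open_items'])}"
    )
```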
Stateful agents across sessions
A user comes back to your agent the next day to continue a task. The naive implementation: the agent has no memory of yesterday. The user has to re-explain everything. This is the state problem that most agent builders underestimate.
The fix: Separate working memory from long-term memory. Working memory is the current task's context — what the agent is doing right now. Long-term memory is persistent facts that should survive across sessions: user preferences, decisions already made, past task outcomes.
When a session starts, retrieve relevant long-term memories and inject them into the system prompt or conversation preamble. This gives the agent continuity without requiring the user to repeat themselves. What to store in long-term memory: user-stated preferences, completed task outcomes, decisions made, key facts stated by the user. What to re-retrieve each time rather than store: external data that might change (prices, availability, current events).
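A sketch of that session-start injection. Here `memory_store.search()` is a stand-in for Mem0 or any store with semantic retrieval, and the assumption is that each memory comes back as a dict with a `value` field:

```python
def session_preamble(user_id, task_description, memory_store, limit=5):
    """Pull the most relevant long-term memories for this user and task and
    render them as a short block for the system prompt or conversation preamble."""
    memories = memory_store.search(query=task_description, user_id=user_id, limit=limit)
    if not memories:
        return ""
    lines = [f"- {m['value']}" for m in memories]
    return "Known from previous sessions:\n" + "\n".join(lines)
```

Prepend the returned block to the system prompt when the session starts; if nothing relevant is found, nothing gets injected.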
Context engineering patterns in 2026
The RAG context pattern
The full pipeline: chunk documents → embed chunks → store in vector database → on query, retrieve top-K chunks → rerank with cross-encoder → inject into context.
Where prompt engineering stops and context engineering starts: prompt engineering handles "how do I ask the model to use retrieved information well?" Context engineering handles everything that determines which information it gets to use in the first place.
Good RAG context structure:
```
[SYSTEM PROMPT — task definition and behaviour]
[RETRIEVED CONTEXT]
Source 1 (relevance score: 0.91): [content]
Source 2 (relevance score: 0.88): [content]
Source 3 (relevance score: 0.82): [content]
Note: Retrieved from [knowledge base name] on [date]. Treat as authoritative for this domain.
[/RETRIEVED CONTEXT]
[CONVERSATION HISTORY — last 5 turns]
[CURRENT USER MESSAGE]
```
The explicit relevance scores in the injected context help the model weight the sources appropriately. The date tells the model how to handle potentially stale information.
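A small helper that renders reranked chunks in that shape. The function name matches the one used in the longer example later in this piece; the chunk `.text` and `.relevance_score` attributes are assumptions about your retriever.

```python
from datetime import date

def format_retrieved_context(chunks, kb_name="knowledge base"):
    """Render reranked chunks with explicit relevance scores and a retrieval date."""
    lines = ["[RETRIEVED CONTEXT]"]
    for i, chunk in enumerate(chunks, start=1):
        lines.append(f"Source {i} (relevance score: {chunk.relevance_score:.2f}): {chunk.text}")
    lines.append(
        f"Note: Retrieved from {kb_name} on {date.today().isoformat()}. "
        "Treat as authoritative for this domain."
    )
    lines.append("[/RETRIEVED CONTEXT]")
    return "\n".join(lines)
```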
The agent memory pattern
Working memory = everything about the current task: the goal, steps taken, partial results, current state. This lives in the context window and gets rebuilt on task resumption from checkpoints.
Long-term memory = facts that persist across tasks and sessions. Use a memory system like Mem0 or a simple key-value store with semantic search. Retrieve relevant memories at session start based on the current task context.
The pattern I've found most reliable: store long-term memories as structured facts, not raw text. Instead of storing "The user said they prefer shorter responses and don't like bullet points," store:
```json
{
  "type": "preference",
  "subject": "response_format",
  "value": "concise prose, avoid bullets",
  "confidence": "explicit",
  "source": "user_statement",
  "date": "2026-03-15"
}
```
Structured facts are easier to retrieve accurately and less likely to be misinterpreted when injected into context.
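As a sketch of why this helps at injection time, a structured fact can be rendered into a single unambiguous context line (the field names are the ones from the example above):

```python
def render_memory(fact: dict) -> str:
    """Render one structured memory fact as a compact context line."""
    return (
        f"- [{fact['type']}] {fact['subject']}: {fact['value']} "
        f"(source: {fact['source']}, noted {fact['date']})"
    )
```

The preference above becomes `- [preference] response_format: concise prose, avoid bullets (source: user_statement, noted 2026-03-15)`: cheaper in tokens and harder to misread than the original sentence.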
The tool result pattern
When an agent calls a tool — a web search, an API, a database query — the result goes back into context. How you format that result matters more than most people realise.
Compare these two formats for a weather API result:
Raw JSON dump (bad):

```json
{"location":{"name":"Mumbai","region":"Maharashtra","country":"India"},"current":{"temp_c":34.0,"condition":{"text":"Partly cloudy"},"humidity":78,"wind_kph":19.4}}
```

Structured narrative (better):

```
[Weather API result for Mumbai, queried 2026-04-15 14:32 IST]
Current conditions: 34°C, partly cloudy, 78% humidity, 19 km/h wind
```
The structured narrative is shorter and the model comprehends it more reliably than raw JSON. For tool results that include large data structures, pre-process them into the relevant extracted facts before injecting into context.
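A sketch of that pre-processing step for the weather example above. The payload shape follows the raw dump shown; the function itself is hypothetical, not part of any weather client.

```python
def summarise_weather(raw: dict, queried_at: str) -> str:
    """Extract only the facts the model needs from a raw weather API payload."""
    current = raw["current"]
    return (
        f"[Weather API result for {raw['location']['name']}, queried {queried_at}]\n"
        f"Current conditions: {current['temp_c']:.0f}°C, {current['condition']['text'].lower()}, "
        f"{current['humidity']}% humidity, {current['wind_kph']:.0f} km/h wind"
    )
```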
The CLAUDE.md pattern
If you're using Claude Code, CLAUDE.md is a persistent context file that loads into every session. It's the most concrete example of long-term context engineering in practice.
What belongs in CLAUDE.md vs the system prompt: CLAUDE.md is for project-specific knowledge that stays stable — architecture decisions, coding conventions, file layout, key constraints. The system prompt (or the conversation) is for task-specific instructions.
The distinction matters because context space is precious. Don't put in CLAUDE.md what changes per-task. Don't put in the system prompt what's stable per-project.
From naive to engineered: a practical example
Here's a common naive approach and its engineered equivalent.
Naive approach: Concatenate everything into one giant string.
```python
def get_response(user_query, documents, history):
    prompt = f"""
    Context: {' '.join([doc.text for doc in documents])}
    History: {' '.join([f"{m.role}: {m.content}" for m in history])}
    User: {user_query}
    """
    return call_llm(prompt)
```
Engineered approach: Build context deliberately, with explicit structure, filtering, and overflow handling.
```python
def build_context(user_query, history, max_tokens=150000):
    context_parts = []
    token_budget = max_tokens

    # 1. System prompt (reserved, always included)
    system = load_system_prompt()  # 2,000 tokens
    token_budget -= 2000

    # 2. Retrieve and rerank documents
    candidates = retrieve_chunks(user_query, top_k=20)
    reranked = rerank(candidates, user_query, top_n=5)

    # Only include chunks above relevance threshold
    relevant = [c for c in reranked if c.relevance_score > 0.75]
    if relevant:
        doc_section = format_retrieved_context(relevant)
        doc_tokens = count_tokens(doc_section)
        if doc_tokens < token_budget * 0.4:  # Max 40% of budget for docs
            context_parts.append(doc_section)
            token_budget -= doc_tokens

    # 3. Conversation history — summarise if needed
    history_tokens = count_tokens(format_history(history))
    if history_tokens > token_budget * 0.3:
        # Summarise older turns, keep recent turns verbatim
        summary = summarise_old_turns(history[:-5])
        recent = history[-5:]
        history_section = (
            f"[Earlier conversation summary]\n{summary}\n\n"
            f"[Recent turns]\n{format_history(recent)}"
        )
    else:
        history_section = format_history(history)
    context_parts.append(history_section)

    # 4. Current query
    context_parts.append(f"User: {user_query}")

    return system, "\n\n".join(context_parts)
```
The difference isn't just cleaner code. The engineered version actively manages what the model knows and prevents context poisoning, overflow, and the lost-in-the-middle problem.
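A sketch of the call site, with call_llm standing in for whichever client you use; unlike the naive version, it receives the system prompt and the assembled context as separate fields:

```python
system, context = build_context(user_query, history)
response = call_llm(system_prompt=system, user_content=context)
```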
Tools for context engineering in 2026
| Tool | What it does | Best for |
|---|---|---|
| LangChain | Chain primitives, retrieval, document loaders | Building retrieval pipelines, chain orchestration |
| LlamaIndex | Document indexing, node management, advanced RAG | Complex document retrieval, enterprise knowledge bases |
| Mem0 | Persistent agent memory with semantic retrieval | Cross-session memory for personal AI agents |
| Langfuse | Observability and tracing for LLM applications | Debugging context issues, tracking token usage |
| Ragas | RAG evaluation framework | Measuring RAG quality (faithfulness, answer relevance) |
Observability tools like Langfuse are particularly valuable for context engineering work — they let you inspect exactly what went into the context window for any given request, making debugging much faster than logging raw strings.
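Even without a dedicated tool, the underlying practice is easy to sketch: log what went into each section of the context for every request. Nothing here is tied to Langfuse's API, and count_tokens is the same hypothetical helper as in the example above.

```python
import json
import logging
import uuid

logger = logging.getLogger("context_audit")

def log_context(sections: dict, request_id=None):
    """Record per-section token counts and previews for one request, so
    'what did the model actually see?' is answerable after the fact."""
    record = {
        "request_id": request_id or str(uuid.uuid4()),
        "sections": {
            name: {"tokens": count_tokens(text), "preview": text[:200]}
            for name, text in sections.items()
        },
    }
    logger.info(json.dumps(record))
```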
💡 Want to go deeper on agents? The AI Agents track covers context engineering for agents in full depth — from working memory to multi-agent systems that share state across multiple LLM instances.
What to read next
- Context engineering for advanced prompting — the fundamentals before diving into agents
- Context engineering for AI agents — applying these patterns in agentic systems
- What is context engineering? — the conceptual overview if you want to start from first principles
- Agentic RAG: beyond simple question answering — RAG patterns for production agent workflows



