Claude handles 200K tokens. Gemini 2.5 Pro handles 1M. You'd think this means you can dump an entire codebase, policy manual, or legal contract into a prompt and get perfect answers. You can't. Long context windows are powerful — but they degrade in ways that aren't obvious until you're debugging wrong answers in production.
The window size is not the same as reliable attention across that window. And if you're building on top of 200K tokens without accounting for that, you're shipping a system that quietly fails on the most important content.
The lost-in-the-middle problem
There's a documented pattern where LLMs attend most strongly to the beginning and end of their context, and weakly to the middle. I ran a straightforward test: took a 100K-token document, placed a simple unique fact ("the contract was signed on March 15, 2024") at positions 0%, 25%, 50%, 75%, and 100% of the document, then ran 20 queries per position asking for that date.
The results:
| Position in context | Accuracy |
|---|---|
| 0% (start) | 94% |
| 25% | 81% |
| 50% (middle) | 67% |
| 75% | 74% |
| 100% (end) | 91% |
The middle third of a 200K context is a reliability dead zone. At 50% position, you're dropping to 67% accuracy on a simple factual retrieval task. That's not a model limitation you can prompt your way out of — it's a structural property of transformer attention at scale. You have to design around it.
Strategy 1: anchor critical information at both ends
The most impactful change you can make to any long-context prompt is this: put the task definition and key constraints at the top, put the document content in the middle, and restate the exact question at the bottom.
[System prompt — role, constraints, output format]
[Critical facts the model must remember — 5-10 bullets]
[Document body — 180K tokens of content]
[Restate: "Based on the document above, specifically answer: {exact question}"]
Don't trust that the model remembers a constraint you mentioned 150K tokens ago. If you define output format at the top and the response ignores it, you haven't necessarily hit a bug — you may have hit the lost-in-the-middle effect. Move that constraint to the bottom and it will often snap back into compliance.
This applies to legal review, code audits, policy analysis. Any time you have a question that depends on content scattered across a long document, the question itself should appear twice: once framing the task, once restating it after all the content.
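A minimal sketch of this layout as a prompt builder, matching the OpenAI-compatible client used later in this post; the function and parameter names are illustrative, not a fixed API:

def build_long_context_prompt(task: str, constraints: list[str], document: str, question: str) -> list[dict]:
    """Anchor the task and constraints at the top, the document in the
    middle, and a restatement of the exact question at the bottom."""
    system = (
        f"{task}\n\n"
        "Key constraints:\n" + "\n".join(f"- {c}" for c in constraints)
    )
    user = (
        f"Question: {question}\n\n"
        f"Document:\n{document}\n\n"
        # Restate the question after the document so it lands in the
        # high-attention region at the end of the context.
        f"Based on the document above, specifically answer: {question}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]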
Strategy 2: explicit referencing — tell the model where to look
Long-context models don't read documents like humans do. When you ask "what are the payment penalties?", they're not scanning for the relevant section — they're generating a response based on weighted attention across the entire context. A document map helps.
system_prompt = """You are a contract review specialist.
The document structure is:
- Section 1 (pages 1-4): Definitions and parties
- Section 3 (pages 8-12): Payment terms and penalties
- Section 8 (pages 31-35): Termination conditions
When answering questions, cite the specific section and quote the relevant language directly."""
This does two things. First, it activates attention toward the relevant sections when those section headers appear in the document. Second, requiring quoted citations forces the model to ground its answer in specific text rather than hallucinate a paraphrase of what the contract "probably says."
For code review, the equivalent is including a file tree at the top with brief descriptions of what each file does. For policy analysis, it's a table of contents with section summaries. The more navigational structure you give the model, the more reliably it finds what it needs.
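One way to generate that map programmatically; a sketch only, with illustrative file paths and descriptions standing in for whatever structure your own documents have:

def build_document_map(sections: dict[str, str]) -> str:
    """Render a navigational map (section name -> one-line description)
    to place in the system prompt ahead of the full document body."""
    lines = ["The document structure is:"]
    for name, description in sections.items():
        lines.append(f"- {name}: {description}")
    lines.append(
        "When answering, cite the specific section and quote the relevant language directly."
    )
    return "\n".join(lines)

# Example for a code audit -- file paths and descriptions are illustrative.
file_map = build_document_map({
    "src/auth.py": "session handling and token validation",
    "src/payments.py": "payment processing and retry logic",
    "src/admin/roles.py": "role definitions and permission checks",
})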
Strategy 3: hierarchical summarization for very long documents
When exact quotes matter less than overall reasoning — "what are the strategic risks in this 400-page report?" — hierarchical summarization often outperforms single-context stuffing.
The approach: split the document into sections, summarize each section first, then reason over the summaries.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["AICREDITS_API_KEY"],
    base_url="https://api.aicredits.in/v1",
)

def summarize_section(section_text: str, section_name: str) -> str:
    """Produce a dense summary of one section, preserving key facts and figures."""
    response = client.chat.completions.create(
        model="anthropic/claude-sonnet-4-6",
        messages=[
            {
                "role": "system",
                "content": "Summarize this document section. Preserve key facts, dates, numbers, and commitments. Be dense — no padding."
            },
            {
                "role": "user",
                "content": f"Section: {section_name}\n\n{section_text}"
            }
        ],
        max_tokens=500
    )
    return response.choices[0].message.content
def analyze_with_hierarchical_context(sections: dict[str, str], question: str) -> str:
    """Summarize each section first, then answer the question over the combined summaries."""
    summaries = {name: summarize_section(text, name) for name, text in sections.items()}
    combined_summary = "\n\n".join(
        [f"## {name}\n{summary}" for name, summary in summaries.items()]
    )
    response = client.chat.completions.create(
        model="anthropic/claude-sonnet-4-6",
        messages=[
            {
                "role": "system",
                "content": "Answer questions based on the document summaries provided."
            },
            {
                "role": "user",
                "content": f"Document summaries:\n\n{combined_summary}\n\nQuestion: {question}"
            }
        ]
    )
    return response.choices[0].message.content
Indian developers: access Claude and all major LLMs through AICredits.in — INR billing, UPI top-up, no international card.
Hierarchical summarization trades fine-grained detail for dramatically better high-level reasoning. It's right for "analyze the risks across this entire contract suite." It's wrong when you need the model to find and quote specific language from section 12.3. Know which one you're doing before you choose your approach.
Strategy 4: know when to stuff vs. when to retrieve
The answer to "should I use RAG or just stuff everything into context?" depends on your specific constraints.
Stuff everything when:
- It's a single-session analysis and the document fits in context
- You need the model to reason across the entire document simultaneously (legal review, code audit, security analysis)
- You're looking for patterns or contradictions that span the full document
- Latency doesn't matter much and you can afford the token cost
Use RAG when:
- The document changes frequently and you'd have to re-embed frequently anyway
- You're running many different queries against the same document and retrieval is faster than loading 200K tokens each time
- The document is larger than your context window
- Latency matters and fetching 5 relevant chunks is faster than loading 200K tokens
The mistake I see most often is defaulting to RAG because it "scales better" — and then watching it miss answers that require synthesizing content from 15 different sections. If you need reasoning across the whole document, stuff it. If you need fast point lookups against a large corpus, retrieve.
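One way to encode those criteria as a rough pre-flight check; the thresholds here are illustrative assumptions, not measured cutoffs:

def choose_context_strategy(
    doc_tokens: int,
    context_window: int,
    queries_per_document: int,
    needs_whole_document_reasoning: bool,
) -> str:
    """Rough heuristic for stuffing vs. retrieval.

    Thresholds are illustrative -- tune them against your own latency
    and cost measurements.
    """
    if doc_tokens > context_window:
        return "rag"  # no choice: the document does not fit
    if needs_whole_document_reasoning:
        return "stuff"  # cross-section synthesis needs the full document
    if queries_per_document > 50:
        return "rag"  # many point lookups: retrieval is faster and cheaper
    return "stuff"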
See the context engineering post for a deeper treatment of how to structure what goes into your context window and why; long-context layout is one component of that broader discipline.
Context caching: when you're sending the same document repeatedly
If you're running a knowledge base assistant where every conversation starts with the same 150K-token document, you're paying to process that document on every API call. Claude's prompt caching reduces cached token cost to roughly 10% of standard input price.
The implementation adds one field to your request:
system_with_cache = {
    "role": "system",
    "content": [
        {
            "type": "text",
            "text": f"Company knowledge base:\n\n{knowledge_base_text}",
            # Mark the large, unchanging prefix as cacheable
            "cache_control": {"type": "ephemeral"}
        },
        {
            "type": "text",
            "text": "Answer user questions based only on the knowledge base above."
        }
    ]
}
The first call processes at full price plus a small write surcharge. Every subsequent call within the cache TTL reads the cached version at 10% cost. For a 150K-token knowledge base with 500 queries per day, this saves roughly 80% on input costs with zero change to response quality.
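A minimal sketch of using that cached system block in a request, assuming the OpenAI-compatible endpoint from earlier passes cache_control through to the provider; the variable names follow the snippets above:

def ask_knowledge_base(question: str) -> str:
    """Send a query that reuses the cached knowledge-base system block.

    Assumes `client` and `system_with_cache` are defined as in the
    snippets above; repeated calls within the cache TTL read the
    knowledge base from cache instead of reprocessing it.
    """
    response = client.chat.completions.create(
        model="anthropic/claude-sonnet-4-6",
        messages=[
            system_with_cache,
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content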
The needle-in-haystack test
Before shipping any long-context feature, run this test. Insert a specific unique fact at various positions in your document. Run 20 queries per position that require finding that fact. Plot accuracy by position.
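A minimal harness for that test, reusing the client defined earlier; the needle, the question, and the insertion logic are illustrative and should be adapted to your own documents:

def needle_test(document: str, positions=(0.0, 0.25, 0.5, 0.75, 1.0), runs: int = 20) -> dict[float, float]:
    """Insert a unique fact at each relative position and measure retrieval accuracy."""
    needle = "The contract was signed on March 15, 2024."  # unique, easy-to-verify fact
    question = "On what exact date was the contract signed?"
    results = {}
    for pos in positions:
        split = int(len(document) * pos)
        doc_with_needle = document[:split] + "\n" + needle + "\n" + document[split:]
        correct = 0
        for _ in range(runs):
            response = client.chat.completions.create(
                model="anthropic/claude-sonnet-4-6",
                messages=[
                    {"role": "user", "content": f"{doc_with_needle}\n\nQuestion: {question}"},
                ],
                max_tokens=100,
            )
            if "March 15, 2024" in response.choices[0].message.content:
                correct += 1
        results[pos] = correct / runs
    return results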
If accuracy at the middle positions (roughly 40-75% of the way into the context) is more than 10 percentage points lower than at positions 0-10%, your context layout needs work. Options:
- Move the most critical content to the first 20% and last 10% of your context
- Add a document map in the system prompt pointing to where key facts live
- Restate the question at the bottom of the prompt
- Consider hierarchical summarization instead of direct stuffing
Don't ship 200K-token context without this test. The failure mode is silent — the model returns a confident wrong answer, not an error.
How the major models compare in practice
Based on production testing, not benchmarks:
Claude Sonnet 4.6 (200K): The strongest instruction-following at long context in this price range. Retrieval accuracy holds up better past the 100K mark than GPT-4o at equivalent positions. The document map strategy works particularly well here.
Gemini 2.5 Pro (1M): Impressive at handling massive breadth — analyzing an entire large codebase in one shot, for example. Less reliable on fine-grained extraction tasks where you need exact quotes from specific clauses. Better for "what's the overall architecture?" than "what does line 847 of auth.py do?"
GPT-4o (128K): Starts degrading noticeably after ~80K tokens. The lost-in-the-middle effect is more pronounced here than in Claude. If your document pushes past 80K tokens, Claude or Gemini is the better choice.
For production systems, the model choice should follow from your use case. Long-context legal analysis on 150K-token contracts: Claude Sonnet 4.6. Codebase-wide reasoning across a monorepo: Gemini 2.5 Pro. Customer support with a 40K-token knowledge base: any of the three.
What to do right now
If you have a long-context feature in production:
- Run the needle-in-haystack test on your actual documents
- Add question restatement at the bottom of every long-context prompt
- Add a document structure map to your system prompt
- If you're sending the same document repeatedly, add `cache_control`
If you're designing a new feature:
- Decide RAG vs. stuffing based on reasoning requirements, not "scales better" intuition
- Budget for the fact that middle-context reliability is 20-30% lower than start/end
- For analysis tasks, try hierarchical summarization — it often outperforms direct stuffing on reasoning quality
Long context is genuinely powerful. A 200K-token window that you've designed for is far more capable than a 10K RAG pipeline that misses cross-document reasoning. But "large context window" is not the same as "reliable reasoning across all of it." Design accordingly.



