Here's a pattern I've seen in dozens of RAG implementations: the team obsesses over the generation model — which LLM to use, how to write the system prompt, how to format the output. Then they go live and the answers are wrong. They blame the model. They switch models. The answers are still wrong.
The problem was retrieval all along. Bad chunks. Bad queries. Bad ranking. The generation model was doing its best with irrelevant context.
Most RAG problems are retrieval problems dressed up as generation problems. Fix the queries, fix the retrieval — and the generation almost takes care of itself.
The 3-step RAG loop and where prompts matter
A standard RAG pipeline has three stages:
- Query — transform the user's input into something the retrieval system can work with
- Retrieve — fetch the most relevant chunks from the vector store
- Generate — produce an answer grounded in the retrieved context
Prompts matter at all three steps, but the impact is uneven. Query-side prompt engineering is where most of the leverage is. Generation prompts are easier to get right and less likely to be the root cause of failures.
Step 1: Query rewriting prompts
Raw user queries are terrible search queries. Users ask conversational questions. They use pronouns without antecedents ("how does it work?"). They ask compound questions. They use terminology that doesn't match your documents.
The solution is a query rewriting step before retrieval.
HyDE (Hypothetical Document Embedding)
Instead of embedding the user's question, you generate a hypothetical answer, then embed that. The hypothetical answer reads like a chunk from your knowledge base, so it lives in the same semantic space as your documents and vector similarity search works better.
System: You are a search optimization assistant. When given a user question,
generate a brief hypothetical document (2-4 sentences) that would perfectly
answer that question. Write it as if it were an excerpt from a technical
documentation page, not as an answer to a question. Do not include phrases
like "The answer is" or "To answer this question." Just write the content
directly.
User question: {user_query}
Output only the hypothetical document excerpt.
HyDE works especially well when your knowledge base consists of declarative content (documentation, articles, policies) and users ask questions in natural language. The gap between question space and document space is the core retrieval problem, and HyDE bridges it.
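The flow can be sketched in a few lines. This is a minimal illustration, not a specific SDK: `llm`, `embed`, and `vector_store` are placeholders for whatever completion client, embedding model, and store you actually use.

```python
# Sketch of a HyDE retrieval step. `llm`, `embed`, and `vector_store`
# are placeholders for your own clients, injected as arguments.
HYDE_SYSTEM = (
    "You are a search optimization assistant. When given a user question, "
    "generate a brief hypothetical document (2-4 sentences) that would "
    "perfectly answer that question. Write it as if it were an excerpt "
    "from a technical documentation page, not as an answer to a question."
)

def build_hyde_prompt(user_query: str) -> str:
    """Assemble the user-turn of the HyDE rewriting prompt."""
    return (
        f"User question: {user_query}\n"
        "Output only the hypothetical document excerpt."
    )

def hyde_retrieve(user_query: str, llm, embed, vector_store, k: int = 10):
    """Embed a hypothetical answer instead of the raw question."""
    hypothetical_doc = llm(system=HYDE_SYSTEM, user=build_hyde_prompt(user_query))
    return vector_store.search(embed(hypothetical_doc), top_k=k)
```

The key move is that `embed` is called on the generated document, never on the raw question.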
Query expansion
Generate multiple versions of the query and retrieve for each, then merge and deduplicate.
Given this user question, generate 3 alternative phrasings that capture the
same intent but use different vocabulary. A good set of alternatives covers:
- More specific technical terminology
- More general phrasing
- Related but distinct angle on the same topic
Output as a JSON array of strings.
User question: {user_query}
A question like "how do I stop the app from crashing?" might expand to "application crash error handling," "exception handling best practices," and "debugging runtime errors." Each version might hit different relevant chunks.
Query decomposition
Compound questions — "what are the pricing tiers and how do they compare to competitors?" — should be split before retrieval. Each sub-question retrieves independently, then the answers are synthesized.
Analyze this user question and determine if it contains multiple distinct
information needs. If yes, decompose it into 2-4 focused sub-questions, each
targeting a single piece of information. If it's a single focused question,
return it unchanged.
Output format:
- If compound: return a JSON array of sub-question strings
- If simple: return a JSON array with just the original question
Question: {user_query}
Decomposition adds latency (multiple retrieval calls) but dramatically improves recall for compound queries. Gate it behind a classifier if you need to keep latency down.
Step 2: Reranking prompts
Vector similarity is a proxy for relevance. It works well on average and fails in specific, predictable ways: it prioritizes lexical and semantic overlap over actual usefulness, it doesn't understand the query's intent at a deeper level, and it's sensitive to embedding model quality.
A reranker — either a cross-encoder model or an LLM-based relevance scorer — applies a more expensive but more accurate relevance signal to the top-K retrieved chunks.
LLM-based relevance scoring
You are a relevance judge. Given a user query and a document chunk, rate
how useful the chunk is for answering the query.
Rating scale:
3 = Directly answers the query with specific, accurate information
2 = Partially relevant — related topic but doesn't directly answer
1 = Tangentially related — same general domain but not useful for this query
0 = Irrelevant
Return only the numeric rating (0, 1, 2, or 3).
Query: {query}
Document chunk:
---
{chunk}
---
Rating:
Run this for each of your top-20 retrieved chunks, sort by score, keep the top 5-7. The LLM reranker is slower and more expensive than a cross-encoder like Cohere Rerank or BGE Reranker, but it's more controllable and doesn't require a separate model deployment.
Use cross-encoders (faster, cheaper) for production at scale. Use LLM reranking in development to establish a quality bar, then distill that judgment into a cross-encoder.
Step 3: Generation prompts
Once you have good context, the generation prompt is simpler than most people think. The two biggest mistakes are (1) not being explicit enough about grounding and (2) not handling the "no answer found" case.
The core generation prompt
System: You are a helpful assistant for [COMPANY/PRODUCT]. Answer the user's
question using ONLY the information provided in the context below.
Rules:
- If the context doesn't contain enough information to answer the question,
say exactly: "I don't have enough information to answer that question."
Do not guess or use outside knowledge.
- Quote or closely paraphrase the context when possible.
- If multiple context chunks are relevant, synthesize them into a coherent answer.
- Keep your answer focused on what was asked. Don't include tangential information
from the context.
Context:
---
{retrieved_chunks}
---
User question: {user_query}
Handling no-result cases
The "I don't have enough information" fallback is critical. Without it, models hallucinate. They're trained to be helpful, and "I don't know" feels unhelpful, so they fill the gap with plausible-sounding nonsense.
Test your system specifically on questions that fall outside your knowledge base. If the model is hallucinating on those, tighten the grounding instruction:
IMPORTANT: You must not use any knowledge outside the provided context.
If the context does not contain the answer, you must respond with:
"I don't have enough information in my knowledge base to answer that."
Even if you know the answer from your training data, do not use it here.
The doubled-down instruction sounds redundant, but it matters. Models need explicit permission to say they don't know.
Citation formatting
For production systems where users need to verify answers, add source attribution:
When answering, cite the source of your information using the format [Source N]
where N corresponds to the numbered chunks provided. At the end of your answer,
include a "Sources" section listing which chunks you used.
Context:
[1] {chunk_1}
[2] {chunk_2}
[3] {chunk_3}
Common RAG prompt mistakes
Asking compound questions without decomposition: "Tell me about the refund policy, return windows, and how to contact support" will retrieve sub-optimally because the query vector averages across all three topics. Split it.
Not telling the model its limits: Without explicit grounding instructions, GPT-4o and Claude will helpfully fill gaps with training knowledge. This feels right until users find the one answer that was wrong.
Over-stuffing context: More context isn't always better. The relevant-to-irrelevant ratio matters. If you're stuffing 15 chunks into context hoping one of them is relevant, you're diluting the signal. Better retrieval beats more context.
Ignoring chunk boundaries: If your chunks cut sentences in half or split tables, the model can't make sense of them. Prompts can't fix bad chunking. Fix it in the pipeline.
Asking the model to "search" the context: Prompts like "search the context for information about X" confuse generation with retrieval. The model sees all context simultaneously — it doesn't search. Just ask the question.
Worked example: customer support RAG
Here's the full prompt chain for a customer support system:
Stage 1 — Query rewriting (runs before retrieval)
The user sent this support message: "{raw_message}"
Rewrite it as a clear, specific search query that would retrieve the most
relevant help documentation. Remove conversational filler. Preserve all
technical details, error messages, and product names exactly as stated.
Output only the rewritten query.
Stage 2 — Reranking the top 10 chunks
Is this documentation chunk relevant to answering: "{rewritten_query}"?
Chunk: "{chunk}"
Answer with only YES or NO.
Filter to YES chunks, then pass top 5 to generation.
Stage 3 — Generation
You are a customer support agent for Acme Corp. Answer the customer's
question using only the documentation provided. Be warm but concise.
If the documentation doesn't cover their issue, apologize and provide
the support email: support@acme.com.
Documentation:
{relevant_chunks}
Customer message: {raw_message}
Three prompts, each doing one job. This structure makes debugging straightforward: if answers are wrong, check which stage is failing by logging intermediate outputs.
Metrics to track
You can't improve what you don't measure. Three metrics cover the essentials:
Faithfulness: Is the answer supported by the retrieved context? Check this manually on a 50-question eval set, or use an LLM-as-judge prompt. Target >90% faithfulness before shipping.
Context relevance: Are the retrieved chunks actually relevant to the query? Score each retrieved chunk — a high-quality pipeline should have >70% of retrieved chunks be directly relevant.
Answer relevance: Does the final answer actually address what was asked? This is distinct from faithfulness (an answer can be faithful to the context but still not answer the question).
The RAG lesson covers the fundamentals of how retrieval-augmented generation works at the architecture level. The how-rag-works post goes deeper on the vector similarity mechanics. This post is specifically about the prompt layer — what you can control without touching the retrieval infrastructure.
Most RAG systems ship with weak query-side prompting and decent generation prompting. The easiest wins are on the retrieval side: add HyDE, add query decomposition for compound questions, add a reranking step. You'll see measurable accuracy improvements without changing a single line of your generation prompt or switching models.
Fix the retrieval. Then worry about the generation.