Retrieval-Augmented Generation (RAG) is one of the most widely deployed techniques in production LLM systems. It addresses the two biggest problems with raw LLMs: outdated knowledge and hallucination.
## The Problem RAG Solves
LLMs have a training cutoff. Ask GPT-4o about a product released last month and it genuinely doesn't know. Ask it about your company's internal pricing policy and it will invent a confident-sounding answer.
RAG fixes both:
- Outdated knowledge — Connect the model to a live knowledge base that you control
- Hallucination — Force the model to cite its sources, making fabrication much harder
## How RAG Works
A RAG pipeline has three stages:
### Stage 1: Indexing (offline)
Your documents are split into chunks, converted into vector embeddings (numerical representations of meaning), and stored in a vector database.
Document → Split into chunks → Embed each chunk → Store in vector DB
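The indexing stage can be sketched in a few lines. Here `embed` is a stand-in for a real embedding model (in practice this would be an API or local model call); it just hashes words into a small vector so the example runs on its own, and the list `index` stands in for a vector database.

```python
import hashlib

def embed(text, dim=8):
    # Toy embedding: bucket words into `dim` slots by hash.
    # A real system would call an embedding model here.
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

def build_index(documents, chunk_size=50):
    """Split each document into word chunks, embed each chunk, and store it."""
    index = []  # stand-in for a vector database
    for doc_id, text in documents.items():
        words = text.split()
        for start in range(0, len(words), chunk_size):
            chunk = " ".join(words[start:start + chunk_size])
            index.append({"doc": doc_id, "text": chunk, "vector": embed(chunk)})
    return index

index = build_index({"refund-policy": "Customers may request a refund within 14 days."})
```

The structure of each stored entry (source document, chunk text, vector) is what lets the generation stage cite sources later.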
### Stage 2: Retrieval (at query time)
The user's query is converted to an embedding. The vector database returns the top-k most semantically similar chunks.
User query → Embed query → Search vector DB → Return top-k chunks
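Retrieval is a nearest-neighbour search over the stored vectors. A pure-Python cosine-similarity version (a real system would use a vector database with an approximate-nearest-neighbour index) might look like:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, index, k=3):
    """Return the top-k chunks most similar to the query vector."""
    scored = sorted(index, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return scored[:k]

# Toy index with 2-D vectors standing in for real embeddings.
index = [
    {"text": "refund within 14 days", "vector": [1.0, 0.1]},
    {"text": "shipping takes 5 days", "vector": [0.1, 1.0]},
]
top = retrieve([1.0, 0.0], index, k=1)  # most similar to the first chunk
```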
### Stage 3: Generation (at query time)
The retrieved chunks are injected into the prompt as context. The LLM generates its answer using only that context.
System prompt + Retrieved chunks + User query → LLM → Grounded answer
## The RAG Prompt Pattern

The prompt structure in a RAG system typically looks like this:

```
System: You are a helpful assistant. Answer questions using ONLY the provided
context documents. If the answer is not in the documents, say:
"I don't have information about that in my knowledge base."

Context:
---
[Document 1 — retrieved chunk]
---
[Document 2 — retrieved chunk]
---
[Document 3 — retrieved chunk]
---

User: [user's question]
```
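This pattern is straightforward to assemble programmatically. A minimal sketch, with the system instruction and `---` separators from the pattern above:

```python
def build_rag_prompt(chunks, question):
    """Assemble the RAG prompt: grounding rules, delimited context, question."""
    system = (
        "You are a helpful assistant. Answer questions using ONLY the provided "
        "context documents. If the answer is not in the documents, say: "
        "\"I don't have information about that in my knowledge base.\""
    )
    context = "\n---\n".join(chunk["text"] for chunk in chunks)
    return f"{system}\n\nContext:\n---\n{context}\n---\n\nUser: {question}"

prompt = build_rag_prompt(
    [{"text": "Refunds are available within 14 days."}],
    "What is the refund policy?",
)
```

In production the system instruction would go in the system message and the context and question in the user message, but the structure is the same.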
Key prompt instructions for RAG:
| Instruction | What it does |
|---|---|
| Answer using only the provided documents | Prevents the model from mixing in training knowledge |
| If not in the documents, say so | Prevents confident hallucinations |
| Cite the document title when quoting | Creates an audit trail |
| Do not make up missing information | Explicit reinforcement |
## What Good RAG Prompting Looks Like

Without grounding instructions:

```
User: What is our refund policy?
Context: [policy documents]
Response: Our refund policy allows returns within 30 days...
```

(The model may mix the policy with its own assumptions.)

With grounding instructions:

```
System: Answer only from the provided documents. If the answer is not
in the documents, say "I don't have that information."
Context: [policy documents]
User: What is our refund policy?
Response: According to the Refund Policy document: customers may request
a full refund within 14 days of purchase by contacting support@company.com.
After 14 days, store credit is offered at our discretion.
```

(The model stays precisely within the provided text.)
## Retrieval Quality vs. Prompt Quality
A common mistake is blaming the prompt when retrieval is the real problem. The model can only work with what it's given.
Signs of a retrieval problem (not a prompt problem):
- The model says "I don't have information about that" when the information clearly exists in the knowledge base
- Answers are vague or miss key details that are present in the documents
- The model contradicts the source documents (often because stale or conflicting chunks were retrieved)
Signs of a prompt problem:
- The model ignores the context and answers from training data
- The model doesn't follow citation or format instructions
- The model fabricates when asked to stay grounded
Diagnosis: Print the retrieved chunks. If the right information is in the retrieved documents and the model still fails, it's a prompt problem. If the right information isn't in the retrieved chunks, it's a retrieval problem.
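That diagnosis step can be automated crudely. The sketch below (the helper name and the substring check are illustrative, not a standard API) prints the retrieved chunks and classifies the failure by whether the expected answer text was actually retrieved:

```python
def diagnose(retrieved_chunks, expected_phrase):
    """Crude triage: is the expected answer in what retrieval returned?"""
    for i, chunk in enumerate(retrieved_chunks):
        print(f"[chunk {i}] {chunk[:80]}")  # inspect what the model actually saw
    found = any(expected_phrase.lower() in c.lower() for c in retrieved_chunks)
    # Answer was retrieved but the model failed -> blame the prompt.
    # Answer never reached the model -> blame retrieval/chunking.
    return "prompt problem" if found else "retrieval problem"
```

A substring match is a blunt instrument, but it is often enough to decide which half of the pipeline to debug first.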
## Chunking Strategy Matters
How you split documents affects retrieval quality more than almost anything else.
| Strategy | When to use |
|---|---|
| Fixed-size chunks (512 tokens) | General purpose, fast to implement |
| Sentence-based chunks | Good for conversational content |
| Paragraph-based chunks | Best for structured documents (policies, manuals) |
| Hierarchical chunks (summary + detail) | Long documents where you need both overview and specifics |
The classic mistake is chunking too small. A 100-token chunk often loses the context needed to understand a sentence. Start with 512 tokens and 10–20% overlap between chunks.
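Fixed-size chunking with overlap is simple to implement. A sketch over a pre-tokenized list (a real pipeline would use the embedding model's tokenizer; 64-token overlap is roughly 12% of a 512-token chunk, inside the 10–20% range above):

```python
def chunk_fixed(tokens, size=512, overlap=64):
    """Split a token list into fixed-size chunks with overlap between neighbours."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # avoid a trailing chunk that is pure overlap
    return chunks
```

The overlap means a sentence split at a chunk boundary still appears whole in one of the two neighbouring chunks.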
## Common RAG Failure Modes
| Failure | Cause | Fix |
|---|---|---|
| Model ignores retrieved context | Prompt doesn't enforce grounding | Add explicit "use only the context" instruction |
| Retrieved chunks miss the answer | Retrieval quality / chunking issue | Improve chunking, add metadata filters |
| Model hallucinates despite context | Insufficient grounding instruction | Stronger instructions + lower temperature |
| Answers miss key details | Chunk too small, splits mid-concept | Increase chunk size, add overlap |
| Slow responses | Retrieval returning too many chunks | Reduce k (top-k), use re-ranking |
## When to Use RAG vs. Other Approaches
| Scenario | Best approach |
|---|---|
| Knowledge changes frequently (prices, policies) | RAG |
| Need to cite sources | RAG |
| General reasoning tasks | Base prompting |
| Domain-specific style/format | Fine-tuning |
| Both factual grounding + style | RAG + fine-tuning |
RAG is the right default for any use case where accuracy on specific facts matters and the information can't all fit in a single prompt.