Retrieval-Augmented Generation (RAG) is one of the most widely deployed techniques in production LLM systems. It addresses the two biggest problems with raw LLMs: outdated knowledge and hallucination.
## The Problem RAG Solves
LLMs have a training cutoff. Ask GPT-4o about a product released last month and it genuinely doesn't know. Ask it about your company's internal pricing policy and it will invent a confident-sounding answer.
RAG fixes both:
- Outdated knowledge — Connect the model to a live knowledge base that you control
- Hallucination — Force the model to cite its sources, making fabrication much harder
## How RAG Works
A RAG pipeline has three stages:
### Stage 1: Indexing (offline)
Your documents are split into chunks, converted into vector embeddings (numerical representations of meaning), and stored in a vector database.
Document → Split into chunks → Embed each chunk → Store in vector DB
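The indexing stage can be sketched in a few lines. Here `embed` is a stand-in for a real embedding model (in practice this would be an API or local model call); it just hashes words into a small vector so the example runs on its own, and the list `index` stands in for a vector database.

```python
import hashlib

def embed(text, dim=8):
    # Toy embedding: bucket words into `dim` slots by hash.
    # A real system would call an embedding model here.
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

def build_index(documents, chunk_size=50):
    """Split each document into word chunks, embed each chunk, and store it."""
    index = []  # stand-in for a vector database
    for doc_id, text in documents.items():
        words = text.split()
        for start in range(0, len(words), chunk_size):
            chunk = " ".join(words[start:start + chunk_size])
            index.append({"doc": doc_id, "text": chunk, "vector": embed(chunk)})
    return index

index = build_index({"refund-policy": "Customers may request a refund within 14 days."})
```

The structure of each stored entry (source document, chunk text, vector) is what lets the generation stage cite sources later.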
### Stage 2: Retrieval (at query time)
The user's query is converted to an embedding. The vector database returns the top-k most semantically similar chunks.
User query → Embed query → Search vector DB → Return top-k chunks
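Retrieval is a nearest-neighbour search over the stored vectors. A pure-Python cosine-similarity version (a real system would use a vector database with an approximate-nearest-neighbour index) might look like:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, index, k=3):
    """Return the top-k chunks most similar to the query vector."""
    scored = sorted(index, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return scored[:k]

# Toy index with 2-D vectors standing in for real embeddings.
index = [
    {"text": "refund within 14 days", "vector": [1.0, 0.1]},
    {"text": "shipping takes 5 days", "vector": [0.1, 1.0]},
]
top = retrieve([1.0, 0.0], index, k=1)  # most similar to the first chunk
```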
### Stage 3: Generation (at query time)
The retrieved chunks are injected into the prompt as context. The LLM generates its answer using only that context.
System prompt + Retrieved chunks + User query → LLM → Grounded answer
## The RAG Prompt Pattern

The prompt structure in a RAG system typically looks like this:

```
System: You are a helpful assistant. Answer questions using ONLY the provided
context documents. If the answer is not in the documents, say:
"I don't have information about that in my knowledge base."

Context:
---
[Document 1 — retrieved chunk]
---
[Document 2 — retrieved chunk]
---
[Document 3 — retrieved chunk]
---

User: [user's question]
```
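This pattern is straightforward to assemble programmatically. A minimal sketch, with the system instruction and `---` separators from the pattern above:

```python
def build_rag_prompt(chunks, question):
    """Assemble the RAG prompt: grounding rules, delimited context, question."""
    system = (
        "You are a helpful assistant. Answer questions using ONLY the provided "
        "context documents. If the answer is not in the documents, say: "
        "\"I don't have information about that in my knowledge base.\""
    )
    context = "\n---\n".join(chunk["text"] for chunk in chunks)
    return f"{system}\n\nContext:\n---\n{context}\n---\n\nUser: {question}"

prompt = build_rag_prompt(
    [{"text": "Refunds are available within 14 days."}],
    "What is the refund policy?",
)
```

In production the system instruction would go in the system message and the context and question in the user message, but the structure is the same.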
Key prompt instructions for RAG:
| Instruction | What it does |
|---|---|
| Answer using only the provided documents | Prevents the model from mixing in training knowledge |
| If not in the documents, say so | Prevents confident hallucinations |
| Cite the document title when quoting | Creates an audit trail |
| Do not make up missing information | Explicit reinforcement |
## What Good RAG Prompting Looks Like

Without grounding instructions:

```
User: What is our refund policy?
Context: [policy documents]
Response: Our refund policy allows returns within 30 days...
```

(The model may mix the policy with its own assumptions.)

With grounding instructions:

```
System: Answer only from the provided documents. If the answer is not
in the documents, say "I don't have that information."
Context: [policy documents]
User: What is our refund policy?
Response: According to the Refund Policy document: customers may request
a full refund within 14 days of purchase by contacting support@company.com.
After 14 days, store credit is offered at our discretion.
```

(The model stays precisely within the provided text.)
## Retrieval Quality vs. Prompt Quality
A common mistake is blaming the prompt when retrieval is the real problem. The model can only work with what it's given.
Signs of a retrieval problem (not a prompt problem):
- The model says "I don't have information about that" when the information clearly exists in the knowledge base
- Answers are vague or miss key details that are present in the documents
- The model contradicts the source documents (often because stale or conflicting chunks were retrieved)
Signs of a prompt problem:
- The model ignores the context and answers from training data
- The model doesn't follow citation or format instructions
- The model fabricates when asked to stay grounded
Diagnosis: Print the retrieved chunks. If the right information is in the retrieved documents and the model still fails, it's a prompt problem. If the right information isn't in the retrieved chunks, it's a retrieval problem.
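That diagnosis step can be automated crudely. The sketch below (the helper name and the substring check are illustrative, not a standard API) prints the retrieved chunks and classifies the failure by whether the expected answer text was actually retrieved:

```python
def diagnose(retrieved_chunks, expected_phrase):
    """Crude triage: is the expected answer in what retrieval returned?"""
    for i, chunk in enumerate(retrieved_chunks):
        print(f"[chunk {i}] {chunk[:80]}")  # inspect what the model actually saw
    found = any(expected_phrase.lower() in c.lower() for c in retrieved_chunks)
    # Answer was retrieved but the model failed -> blame the prompt.
    # Answer never reached the model -> blame retrieval/chunking.
    return "prompt problem" if found else "retrieval problem"
```

A substring match is a blunt instrument, but it is often enough to decide which half of the pipeline to debug first.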
## Chunking Strategy Matters
How you split documents affects retrieval quality more than almost anything else.
| Strategy | When to use |
|---|---|
| Fixed-size chunks (512 tokens) | General purpose, fast to implement |
| Sentence-based chunks | Good for conversational content |
| Paragraph-based chunks | Best for structured documents (policies, manuals) |
| Hierarchical chunks (summary + detail) | Long documents where you need both overview and specifics |
The classic mistake is chunking too small. A 100-token chunk often loses the context needed to understand a sentence. Start with 512 tokens and 10–20% overlap between chunks.
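Fixed-size chunking with overlap is simple to implement. A sketch over a pre-tokenized list (a real pipeline would use the embedding model's tokenizer; 64-token overlap is roughly 12% of a 512-token chunk, inside the 10–20% range above):

```python
def chunk_fixed(tokens, size=512, overlap=64):
    """Split a token list into fixed-size chunks with overlap between neighbours."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # avoid a trailing chunk that is pure overlap
    return chunks
```

The overlap means a sentence split at a chunk boundary still appears whole in one of the two neighbouring chunks.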
## Common RAG Failure Modes
| Failure | Cause | Fix |
|---|---|---|
| Model ignores retrieved context | Prompt doesn't enforce grounding | Add explicit "use only the context" instruction |
| Retrieved chunks miss the answer | Retrieval quality / chunking issue | Improve chunking, add metadata filters |
| Model hallucinates despite context | Insufficient grounding instruction | Stronger instructions + lower temperature |
| Answers miss key details | Chunk too small, splits mid-concept | Increase chunk size, add overlap |
| Slow responses | Retrieval returning too many chunks | Reduce k (top-k), use re-ranking |
## When to Use RAG vs. Other Approaches
| Scenario | Best approach |
|---|---|
| Knowledge changes frequently (prices, policies) | RAG |
| Need to cite sources | RAG |
| General reasoning tasks | Base prompting |
| Domain-specific style/format | Fine-tuning |
| Both factual grounding + style | RAG + fine-tuning |
RAG is the right default for any use case where accuracy on specific facts matters and the information can't all fit in a single prompt.