Standard RAG is simple: embed the query, retrieve the top-k chunks, paste them into the prompt, answer the question.
It works great for simple lookup questions — "What's your cancellation policy?", "What's the capital of France?", "Does this product support SSO?" One retrieval, one answer.
It fails when questions require multiple retrieval steps. When the first retrieval doesn't surface the right context. When answering requires synthesizing information from several different parts of a knowledge base. For those cases, giving the model a fixed retrieve-once-and-answer pattern is like giving a researcher one page to consult before writing a report.
Agentic RAG fixes this by giving the model control over the retrieval process itself.
Simple RAG vs agentic RAG
The architectural difference is small. The behavior difference is large.
Simple RAG:
User query → embed query → retrieve top-5 chunks → [query + chunks] → LLM → answer
One retrieval step, fixed at pipeline design time. The query that drives retrieval is always the user's original query. The model takes whatever the retriever gives it and works with that.
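That one-shot pipeline can be sketched in a few lines. The retriever below is a toy keyword-overlap scorer standing in for a real embedding model plus vector search, and `DOCS`, `retrieve`, and `build_prompt` are illustrative names, not any library's API:

```python
import re

# Toy knowledge base standing in for an indexed document store.
DOCS = [
    "Returns are accepted within 30 days of purchase with a receipt.",
    "The Pro plan includes 10GB of storage and priority support.",
    "Single sign-on (SSO) is available on Enterprise plans only.",
]

def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank documents by query-word overlap -- a stand-in for vector search."""
    return sorted(DOCS, key=lambda d: -len(words(query) & words(d)))[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Paste the retrieved chunks into the prompt -- the 'simple RAG' step."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

query = "Does this product support SSO?"
print(build_prompt(query, retrieve(query)))
```

The point is the shape: one retrieval, driven by the user's original query, then one generation call.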
Agentic RAG:
User query → agent reasons about what to search → retrieval #1 → observe results
→ decide if more retrieval needed → retrieval #2 (different query) → ... → synthesize → answer
The agent decides what to retrieve. It can search multiple times with different queries. It can look at what it found and decide whether it has enough information. It can pivot if the first retrieval missed.
The key shift: the retrieval query is generated by the model at runtime, not hardcoded at design time. The model reads the initial results and decides what to search for next based on what it found — and didn't find.
When simple RAG is enough
Don't over-engineer. Agentic RAG adds cost and latency. If simple RAG handles your use case, use it.
Simple RAG works well when:
Questions are single-hop. "What's your return policy?" requires one retrieval — find the return policy document and quote it. There's no follow-up retrieval that would improve the answer.
Your knowledge base is well-structured and high-precision. If top-1 retrieval almost always returns the right document, adding iterative retrieval gains you little. This is common with small, curated knowledge bases where semantic search is reliable.
Latency is critical. Each agentic retrieval step adds time — another vector search, another LLM call to reason about results, another set of chunks to process. If you need sub-two-second responses, agentic RAG is probably off the table.
Questions have unambiguous retrieval queries. If the user's question almost directly maps to the right search query, the model doesn't gain much from being able to reformulate. Simple retrieval on the original query is fine.
When you need agentic RAG
Multi-hop questions. "Compare our Q3 performance to Q2 and explain the variance" requires retrieving Q3 data, Q2 data, and ideally any analysis documents that discuss the difference. A single retrieval on "Q3 vs Q2 performance" probably returns one of these at best. The model needs to search for each component separately.
Exploratory queries. When users don't know exactly what they're looking for — "What do we know about our competitor's pricing strategy?" — the initial query might be vague. The model benefits from being able to refine its search after seeing what the first retrieval surfaces.
Self-correction scenarios. Retrieval systems fail. The user's exact query might not match the vocabulary in your knowledge base. The documents might be organized around different concepts than the user is thinking in. Agentic RAG lets the model recognize when retrieval failed and try a different angle.
Research synthesis. "Summarize everything we know about customer complaints about feature X" requires pulling from multiple sources: support tickets, feedback documents, internal notes, forum posts. A single-shot retrieval gives you a slice. Multiple targeted retrievals give you coverage.
The 3 agentic RAG patterns
Pattern 1: Query decomposition
Break a complex question into sub-questions, run a retrieval for each, then synthesize.
Example: "How does our pricing compare to Competitor A and Competitor B?"
A simple RAG system retrieves on "pricing comparison Competitor A Competitor B." It might get a document or two. An agentic system decomposes this into three searches: "our current pricing plans," "Competitor A pricing," "Competitor B pricing" — then synthesizes the results into a comparison.
The synthesis step is where the value shows up. The model has retrieved clean, targeted information for each component rather than hoping a single retrieval caught all three.
This pattern works best for multi-faceted questions with distinct retrieval needs — questions that have "and" in them are usually good candidates.
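The control flow can be sketched as follows. In production an LLM call produces the sub-queries; here a hypothetical rule-based `decompose` stands in so the structure is visible, and `retrieve` is a stub for any search function:

```python
def decompose(question: str) -> list[str]:
    """Toy stand-in for the LLM decomposition step: turn
    'How does our pricing compare to A and B?' into targeted sub-queries."""
    targets = question.rstrip("?").split("compare to ")[-1].split(" and ")
    return ["our current pricing plans"] + [f"{t} pricing" for t in targets]

def retrieve(query: str) -> list[str]:
    """Stub retriever -- returns a labeled placeholder per query."""
    return [f"top chunks for: {query}"]

def answer_with_decomposition(question: str) -> dict[str, list[str]]:
    """One targeted retrieval per sub-question. A real system follows this
    with a single LLM synthesis call over all the gathered evidence."""
    return {q: retrieve(q) for q in decompose(question)}

evidence = answer_with_decomposition(
    "How does our pricing compare to Competitor A and Competitor B?"
)
print(list(evidence))
```

Because the sub-queries are independent, the retrievals can run in parallel, which keeps the latency cost of this pattern close to a single retrieval.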
Pattern 2: Iterative retrieval
Retrieve, read what you got, and if it's not enough, search again with a refined query informed by what you found.
The second query is better than the first because the model now knows something it didn't before. It can see what the knowledge base contains, which terms appear, which concepts are covered, and where the gaps are.
Concrete example: A user asks "What are the limitations of the Pro plan?" The agent searches "Pro plan limitations." The results mention a 10GB storage cap and reference "Advanced Storage features." The agent searches "Advanced Storage features" to understand what the Pro plan lacks that advanced tiers have. Now it has a complete answer.
The model essentially bootstraps its retrieval queries from partial knowledge. Each retrieval makes the next one more targeted.
This pattern is useful for open-ended questions where the right retrieval query isn't obvious upfront, and for knowledge bases with rich internal cross-referencing where following document trails is valuable.
Pattern 3: Self-correcting retrieval
Retrieve, evaluate whether the results are actually relevant, and if they're not, try alternative queries or a different search strategy.
This requires the agent to score its own retrieval quality before answering. A basic version: after each retrieval, the agent checks whether the returned documents are relevant to the original question. If they score below a threshold, it tries again with a different query formulation.
Example: User asks about "account suspension." The first retrieval returns documents about account creation and deletion — close but not right. The agent recognizes these don't answer the question and searches "suspended accounts policy" instead. Better results.
This pattern matters most for knowledge bases with inconsistent structure, sparse coverage on certain topics, or vocabulary mismatches between how users ask questions and how documents are titled.
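A minimal sketch of the retry loop, assuming an injected `retrieve` function. The relevance check here is a crude word-overlap heuristic standing in for an LLM grading step, and `fake_retrieve` is a scripted stub that mirrors the account-suspension example above:

```python
import re

def relevance(query: str, chunk: str) -> float:
    """Crude relevance score: fraction of query words whose 6-letter prefix
    appears in the chunk. A real system would ask the LLM to grade instead."""
    q_words = re.findall(r"[a-z]+", query.lower())
    text = chunk.lower()
    return sum(1 for w in q_words if w[:6] in text) / max(len(q_words), 1)

def self_correcting_search(query, retrieve, reformulations, threshold=0.6):
    """Try the original query, then each reformulation, until the best
    retrieved chunk clears the relevance threshold."""
    for q in [query] + reformulations:
        chunks = retrieve(q)
        if chunks and max(relevance(query, c) for c in chunks) >= threshold:
            return q, chunks
    return query, retrieve(query)  # nothing cleared the bar: fall back

# Scripted stub: the literal query surfaces the wrong documents,
# the reformulated query finds the right one.
def fake_retrieve(q: str) -> list[str]:
    kb = {
        "account suspension": ["Creating an account is easy and free."],
        "suspended accounts policy": [
            "Accounts may be suspended for policy violations; "
            "suspension lasts 30 days."
        ],
    }
    return kb.get(q, [])

used_query, chunks = self_correcting_search(
    "account suspension", fake_retrieve, ["suspended accounts policy"]
)
print(used_query)
```

In a real agent, the reformulations would also come from the model rather than a fixed list, but the evaluate-then-retry structure is the same.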
How to implement iterative retrieval
The implementation is simpler than it sounds. You don't need a specialized framework.
Give your RAG agent a search_knowledge_base tool:
```python
def search_knowledge_base(query: str) -> list[str]:
    """Search the knowledge base and return relevant document chunks."""
    # vector_store is your existing vector store client, e.g. a LangChain
    # VectorStore exposing similarity_search(query, k).
    results = vector_store.similarity_search(query, k=5)
    return [doc.page_content for doc in results]
```
Give it a system prompt that authorizes multiple retrievals:
You are a helpful assistant with access to a knowledge base.
Use the search_knowledge_base tool to find relevant information before answering.
You may call it multiple times with different queries if the first search
doesn't return the information you need. Always search before answering —
never answer from memory alone.
Set a maximum iteration count to prevent infinite loops. Five iterations covers almost all real cases — if the agent hasn't found what it needs in five searches, either the information doesn't exist in the knowledge base or it needs to say so.
That's the core. The model handles retrieval strategy. You provide the tool and the guardrails.
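Putting the pieces together, the loop can be sketched with the model's decision step stubbed out as a plain callable. `agentic_answer`, `scripted_llm`, and `stub_search` are illustrative names, not a framework API; in production the `llm` callable would be a real tool-calling model:

```python
MAX_ITERATIONS = 5  # guardrail against infinite retrieval loops

def agentic_answer(question, search_knowledge_base, llm):
    """Minimal iterative-retrieval loop. `llm` is any callable that, given
    the question and the evidence gathered so far, returns either
    ("search", next_query) or ("answer", final_text)."""
    evidence: list[str] = []
    for _ in range(MAX_ITERATIONS):
        action, payload = llm(question, evidence)
        if action == "answer":
            return payload
        evidence.extend(search_knowledge_base(payload))  # action == "search"
    # Iteration budget exhausted: force a final answer from what was found.
    return llm(question, evidence, force_answer=True)[1]

# Scripted stand-ins so the loop runs without a real model or vector store.
def stub_search(query):
    return [f"chunk about {query}"]

def scripted_llm(question, evidence, force_answer=False):
    if force_answer or len(evidence) >= 2:
        return ("answer", f"synthesized from {len(evidence)} chunks")
    if not evidence:
        return ("search", "Pro plan limitations")
    return ("search", "Advanced Storage features")

print(agentic_answer(
    "What are the limitations of the Pro plan?", stub_search, scripted_llm
))
```

Note that the loop itself contains no retrieval strategy at all — which queries to run, and when to stop, is entirely the model's call within the iteration budget.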
The cost trade-off
Each retrieval step adds tokens — the retrieved chunks go into context — plus latency from the vector search and the additional LLM call to reason about results.
A 3-hop agentic RAG interaction costs roughly 3× as much as single-shot retrieval in raw token terms. In practice the multiplier is lower because the agent often finds what it needs in one or two retrievals, but you should budget for worst-case multi-hop behavior.
Run the math before you commit to agentic RAG in production:
- Average retrieved tokens per search: N chunks × average chunk size
- Average retrieval iterations: empirically measure this on a sample of your real queries
- Cost per 1K tokens: check your model provider's pricing
- Average sessions per day: your traffic estimate
If the per-session cost is acceptable at your projected scale, proceed. If not, look at whether query decomposition (one LLM call to decompose, then parallel retrievals) is cheaper than iterative retrieval for your use case — it often is for predictable multi-hop questions.
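The budgeting arithmetic itself is trivial; the work is in measuring the inputs. Every number below is an illustrative assumption, not a benchmark — substitute your own measurements:

```python
# Illustrative per-session cost estimate for iterative retrieval.
chunks_per_search = 5
tokens_per_chunk = 300        # assumed average chunk size in tokens
avg_iterations = 2.2          # measure this on a sample of real queries
overhead_tokens = 800         # system prompt + question + answer, per session
price_per_1k_tokens = 0.003   # placeholder -- check your provider's pricing
sessions_per_day = 10_000     # traffic estimate

tokens_per_session = avg_iterations * chunks_per_search * tokens_per_chunk \
    + overhead_tokens
cost_per_session = tokens_per_session / 1000 * price_per_1k_tokens
daily_cost = cost_per_session * sessions_per_day
print(f"{tokens_per_session:.0f} tokens/session, ${daily_cost:.2f}/day")
```

With these assumed numbers that works out to about 4,100 tokens per session; the useful exercise is rerunning it with your measured iteration count and your worst-case (5-iteration) ceiling.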
Hallucination risk
More retrieval generally reduces hallucination risk because the model has more facts to ground its answer on. A model that searched three times and found relevant information for each component of its answer is less likely to fabricate than one that retrieved nothing relevant and is working from parametric memory.
The exception: irrelevant retrieval. If the agent retrieves documents that aren't actually related to the question, those documents add noise to the context. A model that's trying to synthesize 20 irrelevant chunks into an answer will sometimes confabulate connections that aren't there.
This is why the self-correcting retrieval pattern matters. Don't just retrieve more — retrieve better. Garbage in, garbage out applies here as much as anywhere.
Quality of the search tool itself is a bottleneck. If your embeddings are poor or your chunking strategy is bad, more retrieval calls won't fix it — they'll just surface more bad results. Before moving to agentic retrieval, validate that your retrieval quality is decent with a standard RAG setup. Agentic behavior amplifies the quality of your underlying retrieval system, for better or worse.
Choosing the right pattern
Quick decision guide:
| Question type | Pattern |
|---|---|
| "Compare X and Y and Z" | Query decomposition |
| "What's the full story on X?" | Iterative retrieval |
| "Tell me about X" (open-ended) | Iterative retrieval |
| Precise lookup, small KB | Simple RAG |
| Poor vocabulary match | Self-correcting retrieval |
| Latency under 2s required | Simple RAG |
Most production systems benefit from combining decomposition and iterative retrieval — decompose complex questions, then iterate within each sub-question if the first retrieval isn't sufficient.
Start with iterative retrieval. It's the simplest to implement and handles the most common failure mode of standard RAG: the first retrieval not being quite right. Add decomposition when you identify that multi-faceted questions are a significant portion of your query distribution.
For the fundamentals of how retrieval-augmented generation works, see how RAG works. The RAG lesson in the intermediate track covers embedding, chunking, and retrieval strategies in depth.