Every few months someone asks me whether they should fine-tune their model. Nine times out of ten, they shouldn't — at least not yet. The debate between fine-tuning, RAG, and prompt engineering isn't really a debate. They solve different problems. The question is which one solves your problem.
I've made the wrong call in both directions: spent three weeks building a RAG pipeline for something that a better system prompt would've handled, and written elaborate prompt chains for tasks that clearly needed fine-tuning. Here's the decision framework I use now.
The three levers
Prompting is changing how you ask. Zero modifications to the model, no infrastructure, no dataset. You ship today. It should always be your first attempt — not because it always works, but because the cost of trying is nearly zero.
RAG (Retrieval-Augmented Generation) leaves the model untouched but injects relevant context at query time. You build a knowledge base, embed it, retrieve the right chunks when a question comes in, and pass them to the model alongside the question. The model's weights never change; it just sees better context. If you want the fundamentals, the RAG lesson covers the mechanics.
Fine-tuning updates the model's weights on your data. The model genuinely learns new patterns, styles, and domain conventions. It's not just reading instructions; it's internalizing them. This is also why it's expensive and slow to iterate on.
The decision framework
Work through these in order. Stop when you hit a "yes."
1. Does the base model already know how to do this?
GPT-4o, Claude Sonnet, Gemini 1.5 Pro — these models are remarkably capable out of the box. Before adding infrastructure, write a serious prompt. Use few-shot examples. Try chain-of-thought. I've seen teams spend months building fine-tuning pipelines for tasks that Claude nailed with a well-structured system prompt and three examples.
If better prompting solves it: ship it.
2. Does the model need current or private information?
Models have training cutoffs. They don't know your company's internal policies, your product catalog as of this morning, or what happened in the news last week. No amount of clever instruction-writing gives a model information it was never trained on; the information itself has to reach the model at query time. Fine-tuning doesn't solve this either, because you'd have to retrain constantly.
RAG is the right tool here. The "how RAG works" post goes deeper, but the short version: embed your knowledge base, retrieve relevant chunks at query time, pass them in context. The model stays current because the retrieval stays current.
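To make the embed, retrieve, pass-in-context loop concrete, here is a minimal sketch. It is not a production pipeline: `embed` is a toy bag-of-words stand-in for a real embedding model, and the knowledge base, function names, and prompt wording are all made up for illustration.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

# Hypothetical knowledge base, already split into chunks.
KNOWLEDGE_BASE = [
    "Refunds are available within 30 days of purchase.",
    "Support hours are 9am to 5pm on weekdays.",
    "Enterprise plans include single sign-on.",
]

question = "When are refunds available?"
top_chunks = retrieve(question, KNOWLEDGE_BASE, k=1)

# The retrieved text travels with the question; the model's weights never change.
prompt = "Context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {question}"
```

Updating the knowledge base means editing `KNOWLEDGE_BASE` and re-embedding, which is why the model stays current without retraining.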
3. Is the knowledge base too large to fit in context?
Even with 200k context windows, you can't dump an entire company wiki into every prompt. At some point you need selective retrieval. That's still RAG, even if the context window is large — you want the right information, not all of it.
4. Does the model need a very specific style or format that prompting can't reliably produce?
This is the real fine-tuning signal. Not "I want it to sound professional" — that's a prompting problem. I mean: you have a proprietary JSON schema with 40 fields and specific nesting rules, and even with 5 examples in the prompt the model still gets it wrong 15% of the time. Or you need output that consistently matches a very specific legal writing style across thousands of documents.
Fine-tuning bakes this in. The model stops guessing from examples and starts just knowing.
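Before treating a number like "wrong 15% of the time" as a fine-tuning signal, it helps to actually measure it. A rough sketch of how I'd do that, assuming the outputs are supposed to be pure JSON: the three required fields below are hypothetical stand-ins for a real 40-field schema, and the sample outputs are invented.

```python
import json

# Hypothetical required fields and their expected JSON types.
REQUIRED_FIELDS = {"order_id": str, "items": list, "total": (int, float)}

def is_valid(raw_output: str) -> bool:
    """True if the output parses as a JSON object with every required field."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        name in data and isinstance(data[name], expected)
        for name, expected in REQUIRED_FIELDS.items()
    )

def failure_rate(outputs: list[str]) -> float:
    """Fraction of outputs that fail validation."""
    if not outputs:
        raise ValueError("need at least one output to measure")
    return sum(not is_valid(o) for o in outputs) / len(outputs)

# Invented model outputs for illustration.
sample_outputs = [
    '{"order_id": "A1", "items": ["pen"], "total": 3.5}',
    '{"order_id": "A2", "items": ["ink"]}',                 # missing "total"
    'Sure! Here is the JSON: {"order_id": "A3"}',           # prose around the JSON
    '{"order_id": "A4", "items": [], "total": 0}',
]
rate = failure_rate(sample_outputs)
```

Run the same check on a few hundred real outputs before and after each prompt change; if the rate stops moving, that's the point where fine-tuning becomes a reasonable next step.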
5. Are you doing millions of API calls with a large system prompt?
A 2,000-token system prompt at Sonnet 4.6 pricing (about $3 per million input tokens) costs roughly $6 per 1,000 calls. At 1 million calls per month, that's $6,000, just for the system prompt. Fine-tuning lets you bake those instructions into the weights, drop the system prompt to 100 tokens, and cut that cost by 95%. The fine-tuning cost ($500–$2,000 one-time) pays back within the first month.
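The arithmetic is worth running with your own numbers. This sketch assumes an input price of $3 per million tokens (the rate implied by "$6 per 1,000 calls") and, as a simplification, that the fine-tuned model is billed at the same per-token rate as the base model, which may not hold for your provider.

```python
# Assumed input price in USD per 1M tokens, implied by "$6 per 1,000 calls".
INPUT_PRICE_PER_MILLION = 3.00

def system_prompt_cost(prompt_tokens: int, calls: int,
                       price_per_million: float = INPUT_PRICE_PER_MILLION) -> float:
    """Monthly cost in USD of sending the system prompt on every call."""
    return prompt_tokens * calls * price_per_million / 1_000_000

per_thousand_calls = system_prompt_cost(2_000, 1_000)   # 6.0
before = system_prompt_cost(2_000, 1_000_000)           # 6000.0
after = system_prompt_cost(100, 1_000_000)              # 300.0
monthly_savings = before - after                        # 5700.0
reduction = monthly_savings / before                    # 0.95
```

At 1 million calls a month the savings dwarf a one-time $500–$2,000 fine-tuning bill; at 10,000 calls a month the same prompt costs $60, and the case disappears.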
When prompting wins
General-purpose tasks where the model already has the knowledge. Rapid iteration. Experiments. Anything where you're still figuring out what you want.
The failure mode is predictable: the model ignores instructions, drifts in format, or produces inconsistent outputs at scale. Before escalating to fine-tuning, try:
- Adding 3–5 few-shot examples of exactly the format you want
- Restructuring the prompt so the output format comes after the input, not before
- Using XML tags to separate sections clearly
- Asking the model to reason before generating the final output
If you've done all that and you're still at 85% reliability on format, then fine-tuning is worth considering.
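Here is one way the four fixes above can be combined in a single prompt builder. The tag names, the ticket-classification task, and the wording are my own choices for illustration; nothing here is a required format.

```python
def build_prompt(task: str, examples: list[tuple[str, str]],
                 user_input: str, output_format: str) -> str:
    """Assemble a prompt: few-shot examples, XML-tagged sections,
    format instructions placed after the input, reasoning requested first."""
    shots = "\n".join(
        f"<example>\n<input>{inp}</input>\n<output>{out}</output>\n</example>"
        for inp, out in examples
    )
    return (
        f"<task>{task}</task>\n"
        f"<examples>\n{shots}\n</examples>\n"
        f"<input>{user_input}</input>\n"
        "<instructions>\n"
        "First reason step by step inside <thinking> tags.\n"
        f"Then give the final answer inside <answer> tags, formatted as: {output_format}\n"
        "</instructions>"
    )

prompt = build_prompt(
    task="Classify the support ticket by urgency.",
    examples=[
        ("Site is down for all users", "high"),
        ("Typo on the pricing page", "low"),
        ("Checkout fails for some cards", "high"),
    ],
    user_input="Password reset email never arrives",
    output_format="one word, either high or low",
)
```

Keeping the builder in one function also makes it easy to A/B a single change, such as moving the format instructions, without touching the rest of the prompt.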
When RAG wins
Private data. Updated data. Anything requiring citations. Customer support bots that need to reference your actual documentation. Internal tools that pull from your company wiki. Legal research tools that cite real cases.
RAG is also dramatically cheaper to update than fine-tuning. When your product documentation changes, you re-index the changed documents. You don't retrain the model.
The failure mode is retrieval returning wrong or irrelevant chunks. Symptoms: the model answers confidently with completely unrelated information, or says "I don't have information about that" when it's definitely in the knowledge base. Fixes:
- Better chunking (smaller chunks with more overlap for precise questions, larger chunks for synthesis questions)
- Add a reranking step — retrieve 20 chunks, rerank to the top 5
- Hybrid search (semantic + keyword) instead of pure vector similarity
- Metadata filtering to scope retrieval to the right document types
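For the hybrid search and "retrieve many, keep few" ideas, one simple option is reciprocal rank fusion: merge the semantic and keyword rankings by rank position, then keep the top results. This is a sketch of that one technique, not the only way to do hybrid search or reranking; the document IDs are invented, and `k=60` is the commonly used default constant.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60,
                           top_n: int = 5) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) in every list it appears in,
    so documents ranked well by multiple retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]

# Hypothetical results from a vector search and a keyword search.
semantic = ["doc3", "doc1", "doc7", "doc2"]
keyword = ["doc1", "doc9", "doc3", "doc4"]

fused = reciprocal_rank_fusion([semantic, keyword], top_n=3)
```

Here `doc1` and `doc3` win because both retrievers found them, which is the behavior you want when pure vector similarity misses exact product names or error codes.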
Good RAG implementations with proper chunking and reranking can get retrieval accuracy above 95% on well-structured knowledge bases. Most RAG failures are chunking problems, not retrieval algorithm problems.
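Since chunking is where most of the trouble starts, here is a minimal sketch of overlapping chunks. It splits on whitespace-separated words for simplicity; a real pipeline would more likely count tokens and respect sentence or section boundaries, and the function name and parameters are my own.

```python
def chunk_words(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of `chunk_size` words, each sharing
    `overlap` words with the previous chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

small = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=2)
large = chunk_words("a b c d e f g h i j", chunk_size=8, overlap=0)
```

Treat `chunk_size` and `overlap` as tuning knobs: smaller chunks with more overlap for precise lookups, larger ones for synthesis, then check retrieval quality on real questions after each change.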
When fine-tuning wins
Three scenarios where I've seen fine-tuning genuinely earn its cost:
Reliable structured output. A client needed a medical record extractor that output a very specific FHIR-ish JSON schema. With prompting and few-shot examples, accuracy was around 82%. After fine-tuning on 800 annotated examples, it hit 97%. The difference was real: at 10,000 documents per day, that 15-point gap meant roughly 1,500 fewer manual corrections every day.
Brand voice at scale. A content team was generating 500 product descriptions per day. They had a very specific tone — casual but precise, no superlatives, specific sentence length patterns. Even with a detailed style guide in the prompt, human editors were correcting 30% of outputs. After fine-tuning on 600 approved examples, editorial corrections dropped to under 5%. The fine-tuning cost was covered in two weeks of editor time saved.
Cost reduction at scale. If you have a massive system prompt and millions of calls per month, the math above applies. Fine-tuning pays for itself.
What fine-tuning doesn't fix: knowledge gaps, factual accuracy issues, or anything that requires the model to know things it was never taught. Fine-tuning teaches how to respond, not what to know.
The hybrid approaches (what production systems actually use)
Most serious production systems aren't pure anything.
RAG + prompting is the most common combination. RAG retrieves relevant context; a well-engineered prompt tells the model how to synthesize it, what format to use, and what to do when the retrieved context doesn't answer the question. This is the right default architecture for knowledge-intensive applications.
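The prompt half of that architecture is mostly about being explicit. A sketch of a RAG prompt that covers synthesis, citation format, and the "context doesn't answer the question" case; the fallback sentence and instruction wording are placeholders I made up, not a recommended standard.

```python
# Hypothetical fallback reply for questions the context cannot answer.
FALLBACK = "I don't have information about that."

def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Wrap retrieved chunks in instructions that say how to use them
    and what to do when they don't contain the answer."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the numbered context passages.\n"
        "Cite the passages you used, like [1].\n"
        f'If the context does not contain the answer, reply exactly: "{FALLBACK}"\n\n'
        f"Context:\n{numbered}\n\n"
        f"Question: {question}"
    )

prompt = build_rag_prompt(
    "What is the refund window?",
    ["Refunds are available within 30 days.", "Support hours are 9am to 5pm."],
)
```

An exact fallback string is easy to detect in code, so you can log those cases and tell retrieval misses apart from genuine gaps in the knowledge base.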
Fine-tuning + RAG is the highest-quality combination and also the most expensive. Fine-tune the model on your style and format requirements, then use RAG to give it current knowledge. Customer-facing products where you need both brand-consistent output and accurate, up-to-date information often land here. The "prompt engineering for RAG pipelines" post covers how to structure prompts in this setup.
Fine-tuning + prompting is rarer. Usually what people think they want but don't actually need. The cases where it makes sense: you've fine-tuned for cost reduction (short system prompt) but still need dynamic instructions at query time.
2026 cost reality check
| Approach | Setup cost | Per-call overhead | Maintenance |
|---|---|---|---|
| Prompting | $0 | Base API cost only | Low — iterate in hours |
| RAG | $100–$500 | Base + ~20% more tokens for retrieved context | Medium — keep KB indexed and fresh |
| Fine-tuning | $500–$5,000 | Base API cost (often lower — shorter prompts) | High — retrain when requirements change |
The RAG setup cost includes embedding your initial knowledge base, setting up a vector store (Pinecone starts at $70/month; Chroma and FAISS are free if self-hosted, see the "build vector store" guide), and building the retrieval pipeline. Ongoing costs are mostly compute for re-indexing when documents change.
Fine-tuning costs vary by model and provider. OpenAI charges ~$8 per 1M training tokens for GPT-4o. Anthropic's fine-tuning is invite-only and priced similarly. A 1,000-example dataset with 500 tokens per example is 500,000 training tokens, or roughly $4 per training epoch at that rate. The real cost is dataset curation and the engineering time to set up evaluation.
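The training-compute estimate is a one-liner you can rerun for your own dataset. This sketch uses the ~$8 per 1M training tokens figure quoted above; the `epochs` parameter is my addition, on the assumption that your provider bills for every pass over the data.

```python
def training_cost(examples: int, tokens_per_example: int,
                  price_per_million: float, epochs: int = 1) -> float:
    """Estimated training cost in USD: total trained tokens times the token price."""
    trained_tokens = examples * tokens_per_example * epochs
    return trained_tokens * price_per_million / 1_000_000

one_epoch = training_cost(1_000, 500, price_per_million=8.00)               # 4.0
three_epochs = training_cost(1_000, 500, price_per_million=8.00, epochs=3)  # 12.0
```

Even at several epochs the compute bill stays in the tens of dollars for a dataset this size, which is why curation and evaluation dominate the budget.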
The decision in one rule
Start with prompting. Add RAG when you need private or frequently updated information. Consider fine-tuning only when you have specific format or style requirements that prompting demonstrably can't solve, or when you have more than 100,000 API calls per month with a large system prompt.
That last condition is the filter most teams skip. I've seen companies spend $10,000 fine-tuning a model to save $200/month on tokens. The ROI math has to work.
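The check I'd run before approving any cost-driven fine-tune is a simple payback period. A sketch using the two scenarios from this post; the function name is mine, and the $5,700 monthly savings figure comes from the earlier system-prompt example ($6,000 minus $300).

```python
def payback_months(one_time_cost: float, monthly_savings: float) -> float:
    """Months until a one-time cost is recovered; infinite if nothing is saved."""
    if monthly_savings <= 0:
        return float("inf")
    return one_time_cost / monthly_savings

bad_deal = payback_months(10_000, 200)     # 50.0 months, over four years
good_deal = payback_months(2_000, 5_700)   # well under one month
```

Anything with a payback period longer than you expect the model or the requirements to stay stable is not really a saving.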
If you're at the stage where you're choosing between RAG and fine-tuning, you're past the easy part. Both require real engineering investment. The question to ask isn't "which is better" but "which problem do I actually have?" Knowledge gaps and stale-information problems → RAG. Style, format, and cost problems → fine-tuning. Most teams need RAG first.
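If it helps to see the whole rule in one place, here is a sketch that encodes it as a function. The field names are hypothetical, and the 2,000-token threshold for a "large" system prompt is my assumption, borrowed from the cost example earlier; adjust both to your situation.

```python
from dataclasses import dataclass

# Assumed cutoff for a "large" system prompt, taken from the cost example.
LARGE_PROMPT_TOKENS = 2_000

@dataclass
class Situation:
    needs_private_or_fresh_data: bool
    format_fails_despite_prompting: bool
    monthly_calls: int
    system_prompt_tokens: int

def recommend(s: Situation) -> list[str]:
    """Apply the one-rule summary: always start with prompting,
    add RAG for knowledge problems, consider fine-tuning for
    format, style, or cost problems."""
    plan = ["prompting"]
    if s.needs_private_or_fresh_data:
        plan.append("rag")
    large_prompt_at_scale = (
        s.monthly_calls > 100_000
        and s.system_prompt_tokens >= LARGE_PROMPT_TOKENS
    )
    if s.format_fails_despite_prompting or large_prompt_at_scale:
        plan.append("consider fine-tuning")
    return plan

support_bot = recommend(Situation(True, False, 20_000, 800))
extractor = recommend(Situation(False, True, 300_000, 2_500))
```

The output is a list rather than a single answer on purpose, since most production systems end up combining approaches.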
The "fine-tuning vs prompting" deep dive covers the fine-tuning decision in more detail if you're past the RAG question. If you're still early in the decision, the RAG lesson is the right starting point.



