At some point, prompting runs into its ceiling. You've written a detailed system prompt, added examples, constrained the output format, and the model still produces inconsistent output. You've heard about fine-tuning. You wonder if that's the answer.
Sometimes it is. Often it isn't — not because fine-tuning doesn't work, but because it's solving a different problem than the one you actually have.
This lesson maps the decision clearly.
What Each Approach Actually Does
Prompting changes the input to the model. The model itself is unchanged. You're giving it context, instructions, examples, and constraints at runtime — guiding a general-purpose model toward specific behavior.
Fine-tuning changes the model itself. You provide training examples (input/output pairs) and update the model's weights so it produces that type of output more naturally — without needing extensive prompting.
The analogy: prompting is like giving someone detailed instructions every time they do a task. Fine-tuning is like training them until the skill is internalized.
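To make "training examples" concrete, here is a sketch of what fine-tuning data typically looks like: input/output pairs serialized as one JSON object per line (JSONL), in the chat-style format used by several hosted fine-tuning APIs. The task, reviews, and file name are made up for illustration.

```python
import json

# Illustrative input/output pairs for a product-name extraction task,
# in chat-style JSONL (system / user / assistant turns per example).
examples = [
    {
        "messages": [
            {"role": "system", "content": "Extract the product name from the review."},
            {"role": "user", "content": "Loved the AeroPress, brews fast."},
            {"role": "assistant", "content": "AeroPress"},
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "Extract the product name from the review."},
            {"role": "user", "content": "The Kindle's battery lasts weeks."},
            {"role": "assistant", "content": "Kindle"},
        ]
    },
]

# One JSON object per line: the conventional training-file layout.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

After fine-tuning on hundreds of examples like these, the model produces the extraction behavior without the system prompt or demonstrations being sent at inference time.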
When Prompting Is the Right Choice
The vast majority of use cases should start with prompting, and many should stay there.
Prompting is preferable when:
You're still iterating on the task definition. Fine-tuning locks behavior in. If you're not sure yet what "good" looks like — if the right format, tone, or output might change — fine-tuning commits you too early. Prompting lets you adjust on the fly.
The task needs up-to-date knowledge. Fine-tuned models don't get smarter about your changing data. If you need the model to know about current events, recent documents, or data that changes regularly, prompting with retrieved context (RAG) is the right architecture, not fine-tuning.
Your task is complex and multi-step. Prompt chaining and agentic workflows can handle complexity that a single fine-tuned call can't. Fine-tuning is best for well-defined, single-output tasks.
You're under cost and time constraints. Fine-tuning requires training data collection, training runs, evaluation, and iteration. That's weeks and significant cost. Prompting is immediate.
You want to use the best available model. When new model versions are released, your fine-tuned model may need retraining. Prompt-based systems benefit from improved models automatically.
When Fine-Tuning Makes Sense
Fine-tuning has real advantages in specific situations. Don't dismiss it — just use it when it's genuinely warranted.
For consistent style and format at scale. If you need thousands of outputs that all follow a specific style, format, or persona, and prompting produces too much variance, fine-tuning can reliably internalize that pattern. Customer service voices, brand writing styles, structured data extraction formats — these are fine-tuning's home turf.
When your task has a specialized domain language. Medical, legal, financial, and other technical domains have vocabularies and conventions that general models handle imperfectly. Fine-tuning on domain-specific data improves reliability meaningfully.
For low-latency, high-volume applications. Fine-tuned models don't need long system prompts or few-shot examples, which reduces token counts and improves response time. At very high volumes, this cost saving compounds.
When the task is well-defined and stable. Fine-tuning is worth the investment when the task definition is clear, correct answers are well-understood, and the task won't change substantially. If those conditions aren't met, you'll be retraining constantly.
For behavior that's hard to specify in words. Some behaviors are easier to demonstrate than describe. Writing in a very specific author's style, producing a proprietary data format, classifying according to an idiosyncratic taxonomy — if you can produce 500–1,000 input/output examples, fine-tuning can capture the pattern without you needing to articulate every rule.
The Middle Path: Few-Shot Prompting
Before committing to fine-tuning, try few-shot prompting aggressively.
A well-designed prompt with 10–20 carefully chosen examples often achieves 80–90% of the quality benefit of fine-tuning — with none of the infrastructure cost.
Test this order:
- Zero-shot prompting with a detailed system prompt
- Few-shot prompting with 5–20 examples in the prompt
- Retrieval-augmented generation if you need domain knowledge
- Fine-tuning only if the above still don't meet your quality bar
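Step 2 of this ladder is mostly mechanical: demonstrations are packed into the message list as alternating user/assistant turns before the real input. A minimal sketch, assuming a chat API that accepts role/content messages; the classification task and examples are placeholders.

```python
# Sketch of few-shot prompting: each demonstration becomes a user turn
# (the input) followed by an assistant turn (the ideal output).
SYSTEM_PROMPT = "Classify the support ticket as 'billing', 'bug', or 'other'."

FEW_SHOT_EXAMPLES = [
    ("I was charged twice this month.", "billing"),
    ("The export button crashes the app.", "bug"),
]

def build_messages(user_input: str) -> list[dict]:
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for question, answer in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    # The real input goes last, so the model continues the pattern.
    messages.append({"role": "user", "content": user_input})
    return messages

msgs = build_messages("Why did my invoice go up?")
```

Swapping examples in and out of this list is the fast iteration loop that fine-tuning doesn't give you.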
Most use cases that people assume require fine-tuning are actually solvable at step 2 or 3.
The Data Requirement Is Real
Fine-tuning requires training data. Not a little — hundreds to thousands of high-quality input/output examples, depending on the task.
Collecting this data is the part most teams underestimate. It involves:
- Defining exactly what "correct" output looks like
- Generating or curating examples at that quality bar
- Quality review (bad training data produces bad fine-tuned models)
- Ongoing curation as the task definition evolves
If you can't produce 500 high-quality labeled examples without significant effort, fine-tuning may not be practical for your situation.
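Even a crude automated quality gate catches the most common data problems (empty fields, duplicate inputs) before they reach a training run. A minimal sketch, assuming each candidate example is a dict with "input" and "output" strings; real review still requires human judgment on correctness.

```python
# Filter candidate training examples: drop empties and duplicate inputs,
# and report what was dropped so the dataset can be repaired upstream.
def validate_examples(examples: list[dict]) -> tuple[list[dict], list[str]]:
    seen = set()
    kept, problems = [], []
    for i, ex in enumerate(examples):
        inp = ex.get("input", "").strip()
        out = ex.get("output", "").strip()
        if not inp or not out:
            problems.append(f"example {i}: missing input or output")
            continue
        if inp in seen:
            problems.append(f"example {i}: duplicate input")
            continue
        seen.add(inp)
        kept.append({"input": inp, "output": out})
    return kept, problems
```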
A Decision Framework
Ask these questions in order:
1. Can a good system prompt + examples produce acceptable quality?
If yes: stop here. Don't over-engineer.
2. Is the inconsistency in prompting due to unclear task definition, or inherent model variance?
Unclear task definition → refine the prompt, not the model.
Inherent variance in well-defined task → consider fine-tuning.
3. Is the task stable and well-defined enough to train on?
If the task might change significantly in 6 months → prompting.
If the task is stable → fine-tuning is viable.
4. Do you have the data and resources?
500+ quality examples + training infrastructure + evaluation capacity → fine-tuning is practical.
Can't produce that → stay with prompting + RAG.
5. Does latency or cost at scale justify it?
High volume + token-heavy prompts → fine-tuning economics may work out.
Low volume or prompt-light → fine-tuning likely doesn't pay off.
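The five questions can be collapsed into a single decision function. This is purely illustrative: real decisions weigh these factors rather than short-circuiting on booleans, and the parameter names are my own.

```python
# The decision framework above, expressed as sequential short-circuits.
def choose_approach(
    prompt_quality_acceptable: bool,
    task_well_defined: bool,
    task_stable: bool,
    has_500_plus_examples: bool,
    high_volume: bool,
) -> str:
    if prompt_quality_acceptable:
        return "prompting"        # Q1: don't over-engineer
    if not task_well_defined:
        return "prompting"        # Q2: refine the prompt, not the model
    if not task_stable:
        return "prompting"        # Q3: too early to train on it
    if not has_500_plus_examples:
        return "prompting + RAG"  # Q4: no data, no fine-tune
    if high_volume:
        return "fine-tuning"      # Q5: the economics work out
    return "prompting"            # low volume: fine-tuning likely won't pay off
```

Note how many paths terminate at "prompting": fine-tuning is the answer only when every earlier question resolves in its favor.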
RAG: The Often-Overlooked Third Option
For many use cases framed as "prompting vs fine-tuning," the actual right answer is Retrieval-Augmented Generation (RAG).
RAG means: at query time, retrieve relevant documents from a knowledge base, include them in the prompt, and answer based on that context.
This handles:
- Domain-specific knowledge (without fine-tuning)
- Up-to-date information (without retraining)
- Long-tail questions about your specific business, products, or data
If the reason you're considering fine-tuning is "the model doesn't know enough about our domain," RAG is almost always the better answer. It's more maintainable (update the knowledge base, not the model), more transparent (you can see what context was retrieved), and more current.
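The retrieve-then-prompt loop is simple enough to sketch end to end. Here retrieval is a toy word-overlap score over an invented knowledge base; production systems use embeddings and a vector store, but the shape of the pipeline is the same.

```python
# Toy RAG pipeline: score documents by word overlap with the query,
# then splice the top matches into the prompt as context.
KNOWLEDGE_BASE = [
    "Refunds are processed within 5 business days.",
    "The Pro plan includes priority support and SSO.",
    "API rate limits reset every 60 seconds.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    q_words = set(query.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_rag_prompt(query: str) -> str:
    context = "\n".join(f"- {doc}" for doc in retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Updating the system means editing `KNOWLEDGE_BASE`, not retraining anything, which is exactly the maintainability advantage described above.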
Summary Table
| Factor | Prompting | Fine-Tuning |
|--------|-----------|-------------|
| Setup time | Minutes | Weeks |
| Cost | Per token | Training cost + per token |
| Task stability needed | Low | High |
| Data needed | 0–20 examples | 500–10,000+ |
| Handles changing data | With RAG | Requires retraining |
| Style/format consistency | Moderate | High |
| Iteration speed | Fast | Slow |
| Works with latest models | Automatically | Requires retraining |
Key Takeaways
- Prompting changes input; fine-tuning changes the model — they solve different problems
- Try zero-shot → few-shot → RAG before committing to fine-tuning
- Fine-tuning earns its cost when the task is stable, well-defined, high-volume, and needs consistency that prompting can't reliably achieve
- Data quality and quantity are the most underestimated fine-tuning challenges
- For domain knowledge, RAG is usually the right answer before fine-tuning
You've completed the Advanced Track. You now have the full picture — from basic prompt construction through meta-prompting, evaluation, tree-of-thought reasoning, adversarial robustness, and the architecture decision between prompting and fine-tuning.
The next step is practice: take these techniques into your actual work, iterate on what you've learned, and keep building. The playground is where you can experiment hands-on.