Here's a pattern I've seen in dozens of RAG implementations: the team obsesses over the generation model — which LLM to use, how to write the system prompt, how to format the output. Then they go live and the answers are wrong. They blame the model. They switch models. The answers are still wrong.
The problem was retrieval all along. Bad chunks. Bad queries. Bad ranking. The generation model was doing its best with irrelevant context.
Most RAG problems are retrieval problems dressed up as generation problems. Fix the queries, fix the retrieval — and the generation almost takes care of itself.
The 3-step RAG loop and where prompts matter
A standard RAG pipeline has three stages:
- Query — transform the user's input into something the retrieval system can work with
- Retrieve — fetch the most relevant chunks from the vector store
- Generate — produce an answer grounded in the retrieved context
Prompts matter at all three steps, but the impact is uneven. Query-side prompt engineering is where most of the leverage is. Generation prompts are easier to get right and less likely to be the root cause of failures.
Step 1: Query rewriting prompts
Raw user queries are terrible search queries. Users ask conversational questions. They use pronouns without antecedents ("how does it work?"). They ask compound questions. They use terminology that doesn't match your documents.
The solution is a query rewriting step before retrieval.
HyDE (Hypothetical Document Embedding)
Instead of embedding the user's question, you generate a hypothetical answer, then embed that. The hypothetical answer reads like a chunk from your knowledge base, so it lives in the same semantic space as your documents and vector similarity search works better.
System: You are a search optimization assistant. When given a user question,
generate a brief hypothetical document (2-4 sentences) that would perfectly
answer that question. Write it as if it were an excerpt from a technical
documentation page, not as an answer to a question. Do not include phrases
like "The answer is" or "To answer this question." Just write the content
directly.
User question: {user_query}
Output only the hypothetical document excerpt.
HyDE works especially well when your knowledge base consists of declarative content (documentation, articles, policies) and users ask questions in natural language. The gap between question space and document space is the core retrieval problem, and HyDE bridges it.
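The flow can be sketched in a few lines. This is a minimal illustration, not a specific SDK: `llm`, `embed`, and `vector_store` are placeholders for whatever completion client, embedding model, and store you actually use.

```python
# Sketch of a HyDE retrieval step. `llm`, `embed`, and `vector_store`
# are placeholders for your own clients, injected as arguments.
HYDE_SYSTEM = (
    "You are a search optimization assistant. When given a user question, "
    "generate a brief hypothetical document (2-4 sentences) that would "
    "perfectly answer that question. Write it as if it were an excerpt "
    "from a technical documentation page, not as an answer to a question."
)

def build_hyde_prompt(user_query: str) -> str:
    """Assemble the user-turn of the HyDE rewriting prompt."""
    return (
        f"User question: {user_query}\n"
        "Output only the hypothetical document excerpt."
    )

def hyde_retrieve(user_query: str, llm, embed, vector_store, k: int = 10):
    """Embed a hypothetical answer instead of the raw question."""
    hypothetical_doc = llm(system=HYDE_SYSTEM, user=build_hyde_prompt(user_query))
    return vector_store.search(embed(hypothetical_doc), top_k=k)
```

The key move is that `embed` is called on the generated document, never on the raw question.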
Query expansion
Generate multiple versions of the query and retrieve for each, then merge and deduplicate.
Given this user question, generate 3 alternative phrasings that capture the
same intent but use different vocabulary. A good set of alternatives covers:
- More specific technical terminology
- More general phrasing
- Related but distinct angle on the same topic
Output as a JSON array of strings.
User question: {user_query}
A question like "how do I stop the app from crashing?" might expand to "application crash error handling," "exception handling best practices," and "debugging runtime errors." Each version might hit different relevant chunks.
Query decomposition
Compound questions — "what are the pricing tiers and how do they compare to competitors?" — should be split before retrieval. Each sub-question retrieves independently, then the answers are synthesized.
Analyze this user question and determine if it contains multiple distinct
information needs. If yes, decompose it into 2-4 focused sub-questions, each
targeting a single piece of information. If it's a single focused question,
return it unchanged.
Output format:
- If compound: return a JSON array of sub-question strings
- If simple: return a JSON array with just the original question
Question: {user_query}
Decomposition adds latency (multiple retrieval calls) but dramatically improves recall for compound queries. Gate it behind a classifier if you need to keep latency down.
Step 2: Reranking prompts
Vector similarity is a proxy for relevance. It works well on average and fails in specific, predictable ways: it prioritizes lexical and semantic overlap over actual usefulness, it doesn't understand the query's intent at a deeper level, and it's sensitive to embedding model quality.
A reranker — either a cross-encoder model or an LLM-based relevance scorer — applies a more expensive but more accurate relevance signal to the top-K retrieved chunks.
LLM-based relevance scoring
You are a relevance judge. Given a user query and a document chunk, rate
how useful the chunk is for answering the query.
Rating scale:
3 = Directly answers the query with specific, accurate information
2 = Partially relevant — related topic but doesn't directly answer
1 = Tangentially related — same general domain but not useful for this query
0 = Irrelevant
Return only the numeric rating (0, 1, 2, or 3).
Query: {query}
Document chunk:
---
{chunk}
---
Rating:
Run this for each of your top-20 retrieved chunks, sort by score, keep the top 5-7. The LLM reranker is slower and more expensive than a cross-encoder like Cohere Rerank or BGE Reranker, but it's more controllable and doesn't require a separate model deployment.
Use cross-encoders (faster, cheaper) for production at scale. Use LLM reranking in development to establish a quality bar, then distill that judgment into a cross-encoder.
Step 3: Generation prompts
Once you have good context, the generation prompt is simpler than most people think. The two biggest mistakes are (1) not being explicit enough about grounding and (2) not handling the "no answer found" case.
The core generation prompt
System: You are a helpful assistant for [COMPANY/PRODUCT]. Answer the user's
question using ONLY the information provided in the context below.
Rules:
- If the context doesn't contain enough information to answer the question,
say exactly: "I don't have enough information to answer that question."
Do not guess or use outside knowledge.
- Quote or closely paraphrase the context when possible.
- If multiple context chunks are relevant, synthesize them into a coherent answer.
- Keep your answer focused on what was asked. Don't include tangential information
from the context.
Context:
---
{retrieved_chunks}
---
User question: {user_query}
Handling no-result cases
The "I don't have enough information" fallback is critical. Without it, models hallucinate. They're trained to be helpful, and "I don't know" feels unhelpful, so they fill the gap with plausible-sounding nonsense.
Test your system specifically on questions that fall outside your knowledge base. If the model is hallucinating on those, tighten the grounding instruction:
IMPORTANT: You must not use any knowledge outside the provided context.
If the context does not contain the answer, you must respond with:
"I don't have enough information in my knowledge base to answer that."
Even if you know the answer from your training data, do not use it here.
The doubled-down instruction sounds redundant, but it matters. Models need explicit permission to say they don't know.
Citation formatting
For production systems where users need to verify answers, add source attribution:
When answering, cite the source of your information using the format [Source N]
where N corresponds to the numbered chunks provided. At the end of your answer,
include a "Sources" section listing which chunks you used.
Context:
[1] {chunk_1}
[2] {chunk_2}
[3] {chunk_3}
Common RAG prompt mistakes
Asking compound questions without decomposition: "Tell me about the refund policy, return windows, and how to contact support" will retrieve sub-optimally because the query vector averages across all three topics. Split it.
Not telling the model its limits: Without explicit grounding instructions, GPT-4o and Claude will helpfully fill gaps with training knowledge. This feels right until users find the one answer that was wrong.
Over-stuffing context: More context isn't always better. The relevant-to-irrelevant ratio matters. If you're stuffing 15 chunks into context hoping one of them is relevant, you're diluting the signal. Better retrieval beats more context.
Ignoring chunk boundaries: If your chunks cut sentences in half or split tables, the model can't make sense of them. Prompts can't fix bad chunking. Fix it in the pipeline.
Asking the model to "search" the context: Prompts like "search the context for information about X" confuse generation with retrieval. The model sees all context simultaneously — it doesn't search. Just ask the question.
Worked example: customer support RAG
Here's the full prompt chain for a customer support system:
Stage 1 — Query rewriting (runs before retrieval)
The user sent this support message: "{raw_message}"
Rewrite it as a clear, specific search query that would retrieve the most
relevant help documentation. Remove conversational filler. Preserve all
technical details, error messages, and product names exactly as stated.
Output only the rewritten query.
Stage 2 — Reranking the top 10 chunks
Is this documentation chunk relevant to answering: "{rewritten_query}"?
Chunk: "{chunk}"
Answer with only YES or NO.
Filter to YES chunks, then pass top 5 to generation.
Stage 3 — Generation
You are a customer support agent for Acme Corp. Answer the customer's
question using only the documentation provided. Be warm but concise.
If the documentation doesn't cover their issue, apologize and provide
the support email: support@acme.com.
Documentation:
{relevant_chunks}
Customer message: {raw_message}
Three prompts, each doing one job. This structure makes debugging straightforward: if answers are wrong, check which stage is failing by logging intermediate outputs.
Metrics to track
You can't improve what you don't measure. Three metrics cover the essentials:
Faithfulness: Is the answer supported by the retrieved context? Check this manually on a 50-question eval set, or use an LLM-as-judge prompt. Target >90% faithfulness before shipping.
Context relevance: Are the retrieved chunks actually relevant to the query? Score each retrieved chunk — a high-quality pipeline should have >70% of retrieved chunks be directly relevant.
Answer relevance: Does the final answer actually address what was asked? This is distinct from faithfulness (an answer can be faithful to the context but still not answer the question).
The RAG lesson covers the fundamentals of how retrieval-augmented generation works at the architecture level. The how-rag-works post goes deeper on the vector similarity mechanics. This post is specifically about the prompt layer — what you can control without touching the retrieval infrastructure.
Most RAG systems ship with weak query-side prompting and decent generation prompting. The easiest wins are on the retrieval side: add HyDE, add query decomposition for compound questions, add a reranking step. You'll see measurable accuracy improvements without changing a single line of your generation prompt or switching models.
Fix the retrieval. Then worry about the generation.