Most people evaluate prompts by reading the output and thinking "that looks about right." This is fine for personal use — but if you're building AI-powered products, automating business processes, or using prompts at scale, you need a more rigorous approach.
This guide covers how to measure prompt performance like an engineer, not a guesser.
The Problem with "Eyeball Testing"
Eyeball testing is:
- Non-reproducible — different people judge quality differently
- Non-scalable — you can't eyeball 1,000 outputs
- Biased toward recent results — the last output you saw dominates your judgment
- Misleading on edge cases — your 3 test cases might all miss the real failure modes
The fix: define success criteria before you start, and measure against them.
Step 1: Define What "Good" Looks Like
Before writing your prompt, answer:
- What must always be true? (Required conditions — must pass 100% of the time)
- What should usually be true? (Target conditions — aim for 90%+)
- What should never happen? (Failure modes — must be 0%)
Example for a customer support response prompt:
| Criterion | Type | Target | |-----------|------|--------| | Response addresses the customer's question | Required | 100% | | Tone is professional and empathetic | Target | 95% | | Response under 150 words | Target | 90% | | No incorrect policy information | Required | 100% | | Never tells customer to "calm down" | Never | 0% |
Step 2: Build a Test Dataset
You need real, diverse test cases — not just the happy path.
What to include:
- Normal cases (the typical input your prompt will handle)
- Edge cases (unusual, ambiguous, or tricky inputs)
- Adversarial cases (inputs that might break your prompt)
- Failure cases (real examples where a previous prompt failed)
How many: Start with 20-50 cases. More is better, but even 20 diverse cases beats infinite eyeball testing of 3 examples.
Where to get them:
- Real production logs (if you have them)
- Manually crafted examples
- Generated examples (ask an LLM to generate diverse test cases for your use case)
Step 3: Define Evaluation Methods
Automated evaluation (fast, scalable)
Check output programmatically:
def evaluate_output(output: str) -> dict:
return {
"under_150_words": len(output.split()) <= 150,
"contains_greeting": output.lower().startswith(("hi", "hello", "dear")),
"no_policy_errors": not contains_bad_claims(output),
"valid_json": is_valid_json(output) if json_required else True,
}
LLM-as-judge (flexible, scalable)
Use a second LLM call to evaluate the first output. Works for subjective criteria:
You are evaluating an AI customer support response.
Criteria:
1. Does it address the customer's question? (Yes/No)
2. Is the tone professional and empathetic? (1-5 scale)
3. Is it under 150 words? (Yes/No)
4. Are there any factual errors? (Yes/No)
Customer message: [message]
AI response: [response]
Return your evaluation as JSON matching this schema:
{
"addresses_question": boolean,
"tone_score": 1-5,
"under_150_words": boolean,
"factual_errors": boolean
}
Human evaluation (highest quality, slowest)
For high-stakes prompts, have humans score a sample. Create a rubric so different evaluators apply consistent standards.
Step 4: A/B Test Your Prompts
Never deploy a new prompt without comparing it to your baseline.
Setup:
- Prompt A = your current/baseline prompt
- Prompt B = your new/experimental prompt
- Run both prompts against your full test dataset
- Score both on your evaluation criteria
- Compare scores statistically
Key rules:
- Same test cases for both prompts
- Same model and temperature
- At least 20-50 test cases (3 examples is not enough)
- Report metrics, not vibes
Example results table:
| Metric | Prompt A | Prompt B | Winner | |--------|----------|----------|--------| | Addresses question | 91% | 96% | B | | Tone score (avg) | 3.8 | 4.1 | B | | Under 150 words | 78% | 94% | B | | Factual errors | 3% | 1% | B | | Latency (avg) | 1.2s | 1.8s | A |
Prompt B wins on quality but is slower. Now you make an informed tradeoff decision.
Step 5: Build a Golden Dataset
A golden dataset is a curated set of test cases with known-correct answers. It's the foundation of systematic prompt improvement.
Characteristics of a good golden dataset:
- Representative — covers the real distribution of inputs
- Diverse — includes edge cases and failure modes
- Labeled — has the "correct" output or evaluation for each case
- Stable — doesn't change frequently (so you can compare across prompt versions)
- Realistic — sourced from real production data, not just hand-crafted examples
Grow your golden dataset over time: whenever a prompt fails on a real input, add that input to the dataset.
Metrics Reference
| Metric | What it measures | Good for | |--------|-----------------|---------| | Pass rate | % of outputs meeting a binary criterion | Hard requirements | | Average score | Mean quality rating across outputs | Subjective criteria | | Failure rate | % with any critical failure | Safety/reliability | | Consistency | Variance across repeated runs | Reliability | | Latency | Time to generate output | UX and cost | | Token count | Output length | Cost optimization |
Key Takeaway
Good prompt engineering is iterative and measurable. Define success criteria first. Build a diverse test dataset. Automate evaluation where possible. A/B test every change. Build a golden dataset and grow it over time. This turns prompt engineering from art into engineering — and makes your prompts reliably better over time.
Next: Learn Tree of Thought Prompting — how to make AI explore multiple reasoning paths for complex decisions.