Shipping a prompt to production without testing it is like shipping code without running it. Most prompts get written, tried a few times in the playground, and called done. That works until it doesn't — and when it doesn't, you have no way to tell whether a model update, a small wording change, or edge-case input caused the regression.
This lesson covers how to test prompts like code: with golden sets, regression suites, automated evaluation, and metrics that actually tell you if a prompt is getting better or worse.
Why prompt testing is different from code testing
Code tests have deterministic outputs — the same input always produces the same output, and a failing test is a clear signal. Prompt outputs are probabilistic. The same prompt can produce slightly different responses across runs, and "correct" is often subjective.
This changes what testing means. You're not checking for exact string matches — you're checking for:
- Correctness: does the output contain the right information?
- Format compliance: does the output follow the required structure?
- Tone and safety: does it stay within boundaries?
- Consistency: does quality hold across a range of inputs?
The goal is a test suite that catches regressions — cases where a prompt change or model update makes things measurably worse — without requiring you to manually read hundreds of outputs.
Building a golden set
A golden set is a collection of (input, expected output) pairs that represent the range of real-world inputs your prompt will encounter.
Building a useful golden set:
- Collect real inputs: If your prompt is in production, sample actual user inputs. If it's pre-launch, generate representative synthetic examples.
- Cover the distribution: Include typical cases (what 80% of inputs look like), edge cases (empty input, very short, very long, unusual formatting), and adversarial cases (inputs that try to break the expected output format).
- Write expected outputs carefully: For factual tasks, write exact expected answers. For generative tasks, write criteria instead of exact answers: "should contain X", "should not contain Y", "should be under 200 words".
- Size the set: 20-50 examples is usually enough to catch regressions. More is better for critical production prompts, but 20 beats zero.
```python
golden_set = [
    {
        "input": "Classify this support ticket: 'I can't log in to my account'",
        "expected_category": "Account Access",
        "must_not_contain": ["billing", "refund"],
        "max_length": 50
    },
    {
        "input": "Classify this support ticket: ''",  # edge case: empty input
        "expected_behavior": "should ask for more information or return 'Unknown'",
        "must_contain": ["unclear", "more information", "unknown"]
    },
    # ... more examples
]
```
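Before spending API calls, it can be worth sanity-checking the golden set itself. A minimal sketch (`validate_golden_set` is a hypothetical helper, not part of any library; the criterion key names mirror the example above):

```python
def validate_golden_set(golden_set: list) -> list:
    """Return descriptions of malformed entries (empty list means all OK)."""
    criterion_keys = {"expected_category", "must_contain",
                      "must_not_contain", "max_length", "expected_behavior"}
    problems = []
    for i, example in enumerate(golden_set):
        if "input" not in example:
            problems.append(f"example {i}: missing 'input'")
        if not criterion_keys & set(example):
            # A typo'd key would otherwise make the example auto-pass
            problems.append(f"example {i}: no recognized criteria")
    return problems
```

Running this once at load time means a misspelled key fails loudly instead of silently turning an example into a guaranteed pass.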
Running evaluations automatically
The simplest evaluation approach: run each golden set input through your prompt, then check the output against your criteria.
```python
import anthropic

client = anthropic.Anthropic()

def evaluate_prompt(system_prompt: str, golden_set: list) -> dict:
    results = []
    for example in golden_set:
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=200,
            system=system_prompt,
            messages=[{"role": "user", "content": example["input"]}]
        )
        output = response.content[0].text

        # Check each criterion the example defines
        checks = {}
        if "expected_category" in example:
            checks["category_correct"] = example["expected_category"].lower() in output.lower()
        if "must_contain" in example:
            checks["contains_required"] = all(
                term.lower() in output.lower()
                for term in example["must_contain"]
            )
        if "must_not_contain" in example:
            checks["no_forbidden"] = not any(
                term.lower() in output.lower()
                for term in example["must_not_contain"]
            )
        if "max_length" in example:
            checks["within_length"] = len(output) <= example["max_length"]

        results.append({
            "input": example["input"],
            "output": output,
            "checks": checks,
            "passed": all(checks.values())
        })

    total = len(results)
    passed = sum(1 for r in results if r["passed"])
    return {
        "pass_rate": passed / total,
        "passed": passed,
        "total": total,
        "failures": [r for r in results if not r["passed"]],
        "all_results": results  # per-example detail, needed for side-by-side comparison
    }
```
Run this every time you change your prompt. If your pass rate drops, the change is a regression.
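Because outputs are probabilistic, a single run can pass or fail by luck. One hedge is to run each example several times and score the fraction of passing runs. A sketch with injected callables so the logic stays testable without API calls (`generate` and `check` are assumptions of this sketch; in practice `generate` would wrap `client.messages.create`):

```python
def stable_pass_rate(generate, check, example: dict, runs: int = 3) -> float:
    """Fraction of repeated runs that pass the example's checks.

    generate(input_text) -> output text; check(output, example) -> bool.
    Both are injected so this sketch has no API dependency.
    """
    passes = sum(1 for _ in range(runs) if check(generate(example["input"]), example))
    return passes / runs
```

An example that only passes 2 out of 3 runs is itself useful information: it flags a flaky prompt behavior that a single-run suite would report inconsistently.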
LLM-as-judge evaluation
For tasks where output quality can't be checked with string matching — writing quality, reasoning depth, tone adherence — you can use an LLM to evaluate the output:
```python
import json

def judge_output(prompt_output: str, criteria: str) -> dict:
    """Use Claude to evaluate whether an output meets specified criteria."""
    judge_response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=200,
        system="You are an evaluator assessing AI output quality. Be objective and strict.",
        messages=[{
            "role": "user",
            "content": f"""Evaluate this output against the criteria below.

Output to evaluate:
{prompt_output}

Evaluation criteria:
{criteria}

Respond with JSON: {{"score": 1-5, "reasoning": "brief explanation", "pass": true/false}}
Use pass=true only if score >= 4."""
        }]
    )
    return json.loads(judge_response.content[0].text)
```
LLM-as-judge isn't perfect — the judge model has its own biases — but it's far more scalable than manual review for large test suites. Use a different model as judge than the model being evaluated to reduce self-serving bias.
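One practical wrinkle: calling `json.loads` on the raw judge response fails whenever the judge wraps its JSON in prose or code fences. A best-effort extraction helper (a sketch; the regex fallback is a heuristic, not a full JSON parser):

```python
import json
import re

def parse_judge_json(text: str) -> dict:
    """Best-effort extraction of a JSON object from a judge response.

    Try a direct parse first; if the model wrapped its answer in prose
    or fences, fall back to the outermost {...} span in the text.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", text, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise
```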
A/B testing prompt variants
When you're choosing between two prompt approaches, run them both against your golden set and compare:
```python
def compare_prompts(prompt_a: str, prompt_b: str, golden_set: list) -> None:
    results_a = evaluate_prompt(prompt_a, golden_set)
    results_b = evaluate_prompt(prompt_b, golden_set)

    print(f"Prompt A pass rate: {results_a['pass_rate']:.1%}")
    print(f"Prompt B pass rate: {results_b['pass_rate']:.1%}")

    # Show cases where they disagree
    for i, (r_a, r_b) in enumerate(zip(results_a.get("all_results", []),
                                       results_b.get("all_results", []))):
        if r_a["passed"] != r_b["passed"]:
            print(f"\nDisagreement on example {i}:")
            print(f"  Input: {r_a['input'][:100]}")
            print(f"  A passed: {r_a['passed']}, B passed: {r_b['passed']}")
```
This reveals exactly which cases each prompt handles better or worse — not just an overall win/loss.
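On a 20-50 example golden set, a few percentage points of pass-rate difference can be pure noise. A quick sanity check is a sign test on the discordant pairs, i.e. the examples where exactly one prompt passed (a sketch, assuming you count those cases from the disagreement output):

```python
from math import comb

def sign_test_p(a_only_wins: int, b_only_wins: int) -> float:
    """Two-sided sign-test p-value on discordant pairs.

    a_only_wins / b_only_wins count examples where exactly one prompt
    passed. A small p suggests the gap is real; on a 20-example set,
    a 2-3 case gap usually isn't.
    """
    n = a_only_wins + b_only_wins
    if n == 0:
        return 1.0
    k = min(a_only_wins, b_only_wins)
    # Two-sided tail probability under Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(2 * tail, 1.0)
```

The sign test is deliberately crude: it needs no distributional assumptions and works on exactly the paired pass/fail data the comparison already produces.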
Regression testing in CI
Once you have a golden set and evaluation script, plug it into your CI pipeline. If your prompts are stored in the codebase, run the evaluation every time one changes:
```yaml
# .github/workflows/prompt-eval.yml
name: Prompt evaluation

on:
  push:
    paths:
      - 'prompts/**'
      - 'src/prompts/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run prompt evaluation and check pass rate
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          # The script prints the pass rate as an integer percentage
          PASS_RATE=$(python scripts/evaluate_prompts.py)
          if [ "$PASS_RATE" -lt "90" ]; then
            echo "Prompt evaluation failed: pass rate below 90%"
            exit 1
          fi
```
This makes prompt regressions as visible as code regressions — they show up as failed builds.
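One way to structure the gating script is sketched below. `run_eval` is injected here (an assumption of this sketch) so the gating logic is testable without API calls; the real script would wire it to `evaluate_prompt` with the golden set, and print the integer pass rate the workflow compares against 90.

```python
def pass_rate_percent(results: dict) -> int:
    """Convert an evaluation result dict into an integer percentage."""
    return int(round(results["pass_rate"] * 100))

def main(run_eval) -> int:
    """Print the pass rate and return a shell exit code for CI.

    In the real scripts/evaluate_prompts.py, wrap this in an
    if __name__ == "__main__" block and pass the real evaluation call.
    """
    rate = pass_rate_percent(run_eval())
    print(rate)
    return 0 if rate >= 90 else 1
```

Returning a nonzero exit code on failure gives you a second line of defense: even if the shell-side threshold check is removed, the build still fails.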
What to measure
The metrics that matter most depend on your use case:
- For classification tasks: accuracy, precision, recall per class, confusion matrix
- For extraction tasks: field-level accuracy (what % of fields were extracted correctly), false positive rate (fields extracted that shouldn't be)
- For generation tasks: format compliance rate, criteria pass rate, latency, token count
- For agents: task completion rate, tool call accuracy, error recovery rate, total cost per task
Pick 2-3 metrics that directly reflect whether your prompt is doing its job. Track them over time. A chart of your pass rate across prompt versions tells you immediately whether your iterations are improvements or regressions.
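For the classification case, the per-class numbers fall out of (expected, predicted) pairs directly. A minimal sketch:

```python
from collections import Counter

def per_class_metrics(pairs: list) -> dict:
    """Precision and recall per class from (expected, predicted) pairs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for expected, predicted in pairs:
        if expected == predicted:
            tp[expected] += 1
        else:
            fp[predicted] += 1  # wrongly claimed this class
            fn[expected] += 1   # missed this class
    metrics = {}
    for cls in set(tp) | set(fp) | set(fn):
        p_denom = tp[cls] + fp[cls]
        r_denom = tp[cls] + fn[cls]
        metrics[cls] = {
            "precision": tp[cls] / p_denom if p_denom else 0.0,
            "recall": tp[cls] / r_denom if r_denom else 0.0,
        }
    return metrics
```

Per-class numbers matter because an overall accuracy figure can hide a prompt that has quietly stopped predicting a rare category at all.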
The manual review you can't skip
Automated evaluation catches many things but misses others — especially subtle quality regressions and failure modes you haven't anticipated. Build in regular manual review:
- Every time you push a new prompt version, read 20 random outputs side-by-side with the previous version
- When pass rates drop, read the failing cases — they reveal the actual failure mode, not just the metric
- Monthly: sample 50 production outputs and check for quality drift even if your automated metrics look healthy
Automated evaluation scales. Manual review maintains quality. You need both.
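For the side-by-side reading, sampling the same pairs each time helps two reviewers compare notes. A small sketch (the fixed seed is a convenience assumption for reproducibility):

```python
import random

def sample_for_review(old_outputs: list, new_outputs: list,
                      n: int = 20, seed: int = 0) -> list:
    """Pick up to n matched (old, new) output pairs for manual review."""
    pairs = list(zip(old_outputs, new_outputs))
    rng = random.Random(seed)  # fixed seed: everyone reviews the same sample
    return rng.sample(pairs, min(n, len(pairs)))
```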
The workflow: write prompt → build golden set → automate evaluation → iterate on prompt → run regression test → deploy → monitor in production. That cycle is slower than "write prompt → ship" but it's the only one that catches problems before users do.