I changed a system prompt and the outputs looked better. So I shipped it. Three days later, ticket resolution rates had dropped 8%. The outputs felt better — shorter, punchier — but users were following up more because the bot was leaving out important details.
Eyeballing LLM output quality is unreliable. You need A/B testing with actual metrics and enough samples to trust the result. This guide shows you how to do it properly.
Why your intuition fails
Confirmation bias is the obvious problem. You wrote Prompt B, so you're looking for evidence it's better. But there are subtler failure modes:
Small sample sizes amplify noise. LLM outputs are stochastic. The same prompt can produce noticeably different quality on different calls. If you compare 10 outputs from each variant, you're measuring variance, not quality.
Qualitative judgment is inconsistent. Ask two people to rate which response is better and they'll agree maybe 70% of the time. Ask the same person twice with a week in between, and they agree with their own earlier rating only about 80% of the time.
You're measuring the wrong thing. "This response sounds better" doesn't tell you if users got what they needed. Downstream metrics — did they follow up? Did they convert? Did they leave the chat? — are what matter.
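To get a feel for how much noise ten samples carry, here is a minimal simulation sketch. It assumes a binary pass/fail quality signal and two variants with the same true pass rate (60%, an illustrative number), then counts how often they still appear to differ by 20 points or more. The function name and thresholds are hypothetical.

```python
import random

def spurious_gap_rate(n_per_variant: int, true_rate: float = 0.60,
                      gap: float = 0.20, trials: int = 5000, seed: int = 0) -> float:
    """Fraction of A/A comparisons whose observed pass rates differ by >= gap."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Both "variants" are drawn from the same distribution on purpose.
        passes_a = sum(rng.random() < true_rate for _ in range(n_per_variant))
        passes_b = sum(rng.random() < true_rate for _ in range(n_per_variant))
        if abs(passes_a - passes_b) / n_per_variant >= gap:
            hits += 1
    return hits / trials

small = spurious_gap_rate(10)
large = spurious_gap_rate(500, trials=500)
print(f"n=10:  {small:.0%} of identical-prompt comparisons show a 20pt+ gap")
print(f"n=500: {large:.0%}")
```

With ten outputs per variant, a large share of comparisons between identical prompts look like a meaningful win for one side; at a few hundred samples that effect all but disappears.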
Define your metric first
A/B testing a prompt without a pre-defined metric is just vibes with extra steps. Your metric needs to be:
- Measurable — you can compute it from logs without human judgment (or at least automate the judgment)
- Causally linked to prompt quality — changing the prompt should move this metric
- Sensitive enough — if only 1 in 1000 users triggers the behavior you care about, you'll need enormous sample sizes
Common metrics for LLM features:
| Feature | Metric |
|---|---|
| Support bot | Resolution rate (no follow-up within 24h) |
| Writing assistant | User edits the suggestion vs accepts as-is |
| Search/Q&A | Thumbs up/down, session abandonment |
| Code assistant | Code accepted without modification |
| Extraction | Accuracy against labeled ground truth |
LLM-as-judge metrics (1-5 scale, scored by a second LLM call) are useful when you can't measure downstream behavior directly. Use a strong model (Opus or GPT-4o) as judge with a detailed rubric. See building evaluation datasets for how to build rubrics that produce reliable scores.
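A judge model returns free text, so the score has to be parsed defensively before it can be averaged. The sketch below assumes the rubric instructs the judge to reply in the form `Score: N` with N from 1 to 5; that reply format, and the helper names, are assumptions for illustration rather than part of any library.

```python
import re
from statistics import mean
from typing import Optional

# Assumed reply format: "Score: <1-5>" somewhere in the judge's answer.
SCORE_PATTERN = re.compile(r"score\s*:\s*([1-5])\b", re.IGNORECASE)

def parse_judge_score(judge_reply: str) -> Optional[int]:
    """Return the 1-5 score, or None if the reply does not follow the format."""
    match = SCORE_PATTERN.search(judge_reply)
    return int(match.group(1)) if match else None

def mean_score(replies: list[str]) -> tuple[float, int]:
    """Average the parseable scores and report how many replies were unusable."""
    scores = [s for s in map(parse_judge_score, replies) if s is not None]
    unparsed = len(replies) - len(scores)
    return (mean(scores) if scores else float("nan")), unparsed

replies = ["Score: 4 - covers every step", "score: 2, misses the refund policy", "n/a"]
average, unparsed = mean_score(replies)
print(f"mean score {average:.2f}, {unparsed} unparseable")
```

Tracking the unparseable count matters: if one variant produces replies the judge struggles to score, silently dropping them biases the comparison.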
Calculating sample size
This is where most teams skip ahead and get burned. Minimum detectable effect (MDE) drives sample size:
- If Prompt A resolves 60% of tickets and you want to detect a 10% relative improvement (to 66%), you need ~1,000 samples per variant
- If you want to detect a 5% relative improvement (to 63%), you need ~4,100 samples per variant
- If you want to detect a 2% relative improvement (to 61.2%), you need ~26,000 samples per variant
The math (two-proportion z-test, 80% power, 95% confidence):
import numpy as np
from scipy import stats


def required_sample_size(
    baseline_rate: float,
    minimum_detectable_effect: float,
    alpha: float = 0.05,  # significance level
    power: float = 0.80,  # statistical power
) -> int:
    """
    Calculate required n per variant.

    Args:
        baseline_rate: expected rate in control group (e.g. 0.60 for 60%)
        minimum_detectable_effect: smallest absolute change worth detecting (e.g. 0.10 for 10pp lift)
        alpha: false positive rate (0.05 = 5%)
        power: probability of detecting a real effect (0.80 = 80%)
    """
    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-sided critical value
    z_beta = stats.norm.ppf(power)
    pooled_p = (p1 + p2) / 2
    n = (
        (z_alpha * np.sqrt(2 * pooled_p * (1 - pooled_p)) +
         z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    ) / (p1 - p2) ** 2
    return int(np.ceil(n))


# Customer support example
n = required_sample_size(
    baseline_rate=0.60,              # 60% resolution rate currently
    minimum_detectable_effect=0.10,  # want to detect a 10pp improvement
)
print(f"Need {n} samples per variant ({n*2} total)")  # 356 per variant
If your traffic is low, be realistic about the MDE you can afford to detect. A 10pp improvement is a big swing; a subtler prompt change produces a smaller effect, and detecting it takes far more data.
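To make that trade-off concrete, the sketch below sweeps a few MDE values and converts each sample size into calendar days. It re-implements the same two-proportion formula with the standard library so it runs on its own; the 100 conversations per day figure and the helper names are illustrative assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_variant(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Two-proportion z-test sample size per variant (two-sided alpha)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * pooled * (1 - pooled))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

def days_to_run(n: int, daily_conversations: int) -> int:
    """Days needed to fill both variants at a 50/50 split."""
    return ceil(2 * n / daily_conversations)

baseline = 0.60
daily_traffic = 100  # illustrative traffic level
for lift_pp in (10, 5, 3, 2):
    n = n_per_variant(baseline, baseline + lift_pp / 100)
    print(f"{lift_pp:>2}pp lift: {n:>6} per variant, ~{days_to_run(n, daily_traffic)} days")
```

Halving the MDE roughly quadruples the required sample, which is why small effects are so expensive to confirm on a low-traffic feature.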
A worked example: customer support bot
Here's a concrete test from start to finish.
Context: A B2B SaaS support bot. Current system prompt is verbose (800 tokens). New prompt is tighter (400 tokens) with more explicit resolution steps.
Metric: Resolution rate — defined as no follow-up ticket or follow-up message within 24 hours of the bot's response.
Baseline rate: 60% (measured over the previous 4 weeks from logs).
MDE: We want to detect a 10pp improvement (60% → 70%). Anything less isn't worth the prompt change.
Sample size calculation:
n = required_sample_size(0.60, 0.10)  # → 356 per variant, 712 total
Traffic: ~100 support conversations per day. At a 50/50 split, that's 50 per variant per day. We need 356 per variant, so about 8 days to reach the planned sample size; budget a few extra days for conversations dropped due to incomplete data.
Traffic splitting: Assign users to variants by hashing their user ID modulo 2:
import hashlib

# Placeholders: substitute your real prompt text.
SYSTEM_PROMPT_A = "..."  # Original
SYSTEM_PROMPT_B = "..."  # New


def get_variant(user_id: str, test_name: str = "prompt_test_v1") -> str:
    """Deterministic variant assignment: same user always gets same variant."""
    # Including the test name reshuffles users between different tests.
    hash_input = f"{test_name}:{user_id}"
    hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
    return "A" if hash_value % 2 == 0 else "B"


def get_system_prompt(user_id: str) -> str:
    variant = get_variant(user_id)
    if variant == "A":
        return SYSTEM_PROMPT_A  # Original
    return SYSTEM_PROMPT_B  # New
Hashing on user ID (not session ID) ensures the same user always sees the same variant. Mixing variants within a user's experience is a common mistake that inflates noise.
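Before trusting the split, it is worth checking that the hash actually divides users close to 50/50 and that assignment is stable across calls. A minimal sketch, using synthetic user IDs (the `user-N` format is made up for the check):

```python
import hashlib
from collections import Counter

def get_variant(user_id: str, test_name: str = "prompt_test_v1") -> str:
    """Same hashing scheme as the assignment function in this guide."""
    digest = hashlib.md5(f"{test_name}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# Synthetic IDs purely for the balance check.
user_ids = [f"user-{i}" for i in range(10_000)]
counts = Counter(get_variant(uid) for uid in user_ids)
share_a = counts["A"] / len(user_ids)

# Calling twice with the same ID must give the same answer.
stable = all(get_variant(uid) == get_variant(uid) for uid in user_ids[:100])

print(dict(counts), f"share A = {share_a:.3f}", f"stable = {stable}")
```

Running the same check on real logged assignments during the test is a cheap way to spot a sample ratio mismatch, which usually points to a logging or routing bug rather than bad luck.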
Logging: Log every response with the variant, timestamp, user ID, and conversation ID. After 24 hours, check if each conversation has a follow-up. That's your resolution label.
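The labeling step can be sketched as follows. The record fields (`variant`, `bot_response_at`, `follow_ups`) are hypothetical names for whatever your logging pipeline stores; the only logic taken from the text is the 24-hour follow-up window.

```python
from datetime import datetime, timedelta

FOLLOW_UP_WINDOW = timedelta(hours=24)

def label_resolved(bot_response_at: datetime, follow_ups: list[datetime]) -> bool:
    """Resolved means no user follow-up within 24 hours of the bot's response."""
    deadline = bot_response_at + FOLLOW_UP_WINDOW
    return not any(bot_response_at < ts <= deadline for ts in follow_ups)

def resolution_rates(conversations: list[dict]) -> dict[str, float]:
    """Resolution rate per variant from labeled conversation records."""
    totals: dict[str, int] = {}
    resolved: dict[str, int] = {}
    for conv in conversations:
        variant = conv["variant"]
        totals[variant] = totals.get(variant, 0) + 1
        if label_resolved(conv["bot_response_at"], conv["follow_ups"]):
            resolved[variant] = resolved.get(variant, 0) + 1
    return {v: resolved.get(v, 0) / totals[v] for v in totals}

t0 = datetime(2025, 1, 6, 9, 0)
conversations = [
    {"variant": "A", "bot_response_at": t0, "follow_ups": []},
    {"variant": "A", "bot_response_at": t0, "follow_ups": [t0 + timedelta(hours=3)]},
    {"variant": "B", "bot_response_at": t0, "follow_ups": [t0 + timedelta(hours=30)]},
    {"variant": "B", "bot_response_at": t0, "follow_ups": []},
]
rates = resolution_rates(conversations)
print(rates)
```

Only label a conversation once its full 24-hour window has elapsed; labeling earlier counts unresolved conversations as resolved and flatters both variants.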
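After 7 days: 350 usable conversations, 175 per variant (50 dropped due to incomplete data). That is roughly half of the 356 per variant that required_sample_size(0.60, 0.10) returns, so this run has less power than planned; the analysis steps are the same either way.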
# proportions_ztest lives in statsmodels, not scipy.stats
from statsmodels.stats.proportion import proportions_ztest

# Results after 7 days
group_a = {"n": 175, "resolved": 107}  # 61.1%
group_b = {"n": 175, "resolved": 126}  # 72.0%

rate_a = group_a["resolved"] / group_a["n"]
rate_b = group_b["resolved"] / group_b["n"]

# Two-proportion z-test (two-sided, pooled variance)
count = [group_a["resolved"], group_b["resolved"]]
nobs = [group_a["n"], group_b["n"]]
z_stat, p_value = proportions_ztest(count, nobs)

print(f"Variant A: {rate_a:.1%}")
print(f"Variant B: {rate_b:.1%}")
print(f"Lift: {(rate_b - rate_a):.1%}")
print(f"p-value: {p_value:.4f}")
print(f"Significant: {p_value < 0.05}")
Output: Variant B at 72.0% vs Variant A at 61.1%, a lift of 10.9 percentage points, p ≈ 0.031. Significant at the 0.05 level. Ship it.
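A p-value says whether there is an effect, not how large it plausibly is. As a complement, here is a minimal sketch of a 95% confidence interval for the lift, using the normal approximation with unpooled variance and the counts from the test above. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def lift_confidence_interval(resolved_a: int, n_a: int, resolved_b: int, n_b: int,
                             confidence: float = 0.95) -> tuple[float, float]:
    """Normal-approximation CI for (rate_b - rate_a)."""
    rate_a, rate_b = resolved_a / n_a, resolved_b / n_b
    # Unpooled standard error: each variant contributes its own variance.
    se = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = rate_b - rate_a
    return diff - z * se, diff + z * se

low, high = lift_confidence_interval(107, 175, 126, 175)
print(f"95% CI for the lift: {low:+.1%} to {high:+.1%}")
```

For these counts the interval runs from about 1 to about 21 percentage points: the data support "B is better" far more firmly than they support any particular size of improvement.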
The most important rule: don't stop early
You ran the test for 4 days, peeked at the data, saw Variant B was winning 72% vs 58%, and wanted to stop. Don't.
Early stopping inflates false positive rates dramatically. If you check after every 10 samples and stop when p < 0.05, your actual false positive rate is closer to 30%, not 5%. You need to pre-commit to your sample size and let the test run.
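The inflation is easy to reproduce. The sketch below simulates A/A tests, where both variants share the same true rate so every "significant" result is a false positive, and compares a single look at the end with a look after every 10 samples per variant. The 200-sample horizon, the 60% rate, and the trial count are illustrative assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist

_NORMAL = NormalDist()

def two_sided_p(success_a: int, success_b: int, n: int) -> float:
    """Pooled two-proportion z-test p-value with n samples per variant."""
    pooled = (success_a + success_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return 1.0  # no variation observed yet, nothing to test
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    z = abs(success_a - success_b) / n / se
    return 2 * (1 - _NORMAL.cdf(z))

def false_positive_rates(n_final: int = 200, peek_every: int = 10, rate: float = 0.60,
                         trials: int = 2000, seed: int = 0) -> tuple[float, float]:
    """Return (FPR when stopping at any significant peek, FPR with one final look)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(trials):
        a = b = 0
        ever_significant = False
        for i in range(1, n_final + 1):
            a += rng.random() < rate
            b += rng.random() < rate
            if i % peek_every == 0 and two_sided_p(a, b, i) < 0.05:
                ever_significant = True
        peeking_hits += ever_significant
        fixed_hits += two_sided_p(a, b, n_final) < 0.05
    return peeking_hits / trials, fixed_hits / trials

peeking_fpr, fixed_fpr = false_positive_rates()
print(f"peek every 10 samples: {peeking_fpr:.1%} false positives")
print(f"single look at n=200:  {fixed_fpr:.1%} false positives")
```

The single-look rate should land near the nominal 5%, while the peeking rate comes out several times higher, and it keeps growing the longer the test runs and the more often you look.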
If you absolutely can't wait, use sequential testing methods (like the mSPRT or always-valid p-values) which are designed for interim peeking. The sequential Python package implements these.
Using Promptfoo for automated A/B comparison
Promptfoo makes it easy to run variants against a fixed test set. It's not the same as live A/B testing (you're testing against pre-labeled examples, not live traffic), but it's fast and good for pre-ship validation.
# promptfoo.yaml
prompts:
  - id: prompt_a
    raw: "{{system_prompt_a}}"
  - id: prompt_b
    raw: "{{system_prompt_b}}"
providers:
  # Model ID shown as an example; check your provider's current model list
  - anthropic:messages:claude-3-5-haiku-20241022
tests:
  - vars:
      user_message: "How do I cancel my subscription?"
    assert:
      - type: llm-rubric
        value: "Response explains cancellation steps clearly and completely"
  - vars:
      user_message: "I was charged twice this month"
    assert:
      - type: llm-rubric
        value: "Response acknowledges the billing issue and provides next steps"
Run with promptfoo eval and you get a side-by-side comparison with pass rates per variant. Good for catching obvious regressions before you start a live test.
Braintrust is better for ongoing tracking. You push eval results to Braintrust after every prompt change, and it tracks improvement trends over time. Useful when you're iterating quickly and want a history of eval scores tied to specific prompt versions.
The practical shortcut: blind evaluation
When you can't wait for statistical significance — maybe you're testing on a new feature with low traffic — blind evaluation gets you directional signal faster.
Take 50 real user queries. Run both prompts. Strip the variant label. Have a colleague (or a strong LLM) judge which response is better for each pair without knowing which prompt produced it.
import anthropic
import random

client = anthropic.Anthropic()


def blind_evaluate(
    query: str,
    response_a: str,
    response_b: str,
    rubric: str
) -> str:
    """Returns 'A' or 'B' or 'tie'."""
    # Randomize order to prevent position bias
    if random.random() > 0.5:
        first, second, labels = response_a, response_b, ("A", "B")
    else:
        first, second, labels = response_b, response_a, ("B", "A")

    judge_prompt = f"""You are evaluating two AI responses to a user query.

Query: {query}

Rubric: {rubric}

Response 1:
{first}

Response 2:
{second}

Which response better satisfies the rubric? Reply with "1", "2", or "tie". Then explain in one sentence."""

    result = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=100,
        messages=[{"role": "user", "content": judge_prompt}]
    )
    text = result.content[0].text.strip()
    # Anything that does not start with "1" or "2" is treated as a tie
    winner_num = "1" if text.startswith("1") else "2" if text.startswith("2") else "tie"

    # Map back to original labels
    if winner_num == "1":
        return labels[0]
    elif winner_num == "2":
        return labels[1]
    return "tie"
At 50 samples, if Prompt B wins 38/50 comparisons (76%), that's a strong directional signal: a split that lopsided would occur by chance less than 0.1% of the time if the two prompts were equally good. If it's 27/50 (54%), it's indistinguishable from noise, and you need a larger test.
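Those two cases can be checked with an exact sign test, which asks how likely a split at least this lopsided would be if each comparison were a coin flip. A minimal standard-library sketch, assuming ties have already been dropped from the counts (the function name is hypothetical):

```python
from math import comb

def sign_test_p_value(wins_b: int, wins_a: int) -> float:
    """Two-sided exact binomial test against a 50/50 split (ties excluded)."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Probability of k or more wins for one side under a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

for wins_b in (38, 27):
    p = sign_test_p_value(wins_b, 50 - wins_b)
    print(f"B wins {wins_b}/50: p = {p:.4f}")
```

Keep in mind that this only accounts for sampling noise. If the judge itself is biased, for example toward longer answers, a small p-value reflects that bias as readily as a real quality difference.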
Common mistakes
Testing on too few outputs from your eval set. If you have 20 labeled examples and run each prompt 3 times, you have 60 data points, but the three runs of each example are strongly correlated. These aren't independent samples, so your effective sample size is much closer to 20 than 60.
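One simple way to avoid overcounting is to average the repeated runs for each example first, then treat each example as a single data point. A minimal sketch, assuming each run is recorded as an (example ID, score) pair; the data here is made up:

```python
from collections import defaultdict
from statistics import mean

def per_example_scores(runs: list[tuple[str, float]]) -> dict[str, float]:
    """Collapse repeated runs into one averaged score per eval example."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for example_id, score in runs:
        grouped[example_id].append(score)
    return {example_id: mean(scores) for example_id, scores in grouped.items()}

# Three runs each of two examples: six rows, but only two independent units.
runs = [
    ("q1", 1.0), ("q1", 1.0), ("q1", 1.0),
    ("q2", 0.0), ("q2", 0.0), ("q2", 1.0),
]
collapsed = per_example_scores(runs)
print(f"{len(runs)} runs -> {len(collapsed)} independent examples: {collapsed}")
```

Any significance test should then use the number of examples, not the number of runs, as its sample size.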
Changing the prompt mid-test. If you notice a problem with Prompt B and fix it on day 3, your test data is now from two different prompts. Start over.
Testing on non-representative queries. If your eval set is 20 easy questions, it won't reveal that Prompt B fails on edge cases. See the evaluation datasets guide for how to build representative test sets.
Ignoring latency. Prompt B might score better on quality but take 2 seconds longer. That matters. Measure it.
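Measuring it can be as simple as timing each call and comparing percentiles per variant. In the sketch below, `fake_model_call` is a stand-in for your real request function, and the percentile choices are an assumption; tail latency (p95) usually matters more to users than the median.

```python
import time
from statistics import quantiles

def measure_latency(call, inputs) -> dict[str, float]:
    """Time each call and return median and 95th-percentile latency in seconds."""
    timings = []
    for item in inputs:
        start = time.perf_counter()
        call(item)
        timings.append(time.perf_counter() - start)
    # n=20 yields 19 cut points: index 9 is the median, index 18 is p95.
    cuts = quantiles(timings, n=20)
    return {"p50": cuts[9], "p95": cuts[18]}

def fake_model_call(query: str) -> str:
    """Placeholder for a real LLM request."""
    time.sleep(0.001)
    return f"answer to {query}"

latency = measure_latency(fake_model_call, [f"query {i}" for i in range(20)])
print(f"p50 = {latency['p50'] * 1000:.1f} ms, p95 = {latency['p95'] * 1000:.1f} ms")
```

Run it against both prompts with the same inputs, and log latency alongside the variant in production so the comparison uses real traffic.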
A/B testing prompts properly takes more discipline than most teams apply. But one rigorous test that correctly catches a regression is worth more than twenty gut-feel comparisons that led you in the wrong direction.
For setting up systematic eval infrastructure, the LLM evaluation frameworks guide covers the full toolchain. And before shipping any variant to production, run through the agent production checklist to make sure you've covered your bases.



