A/B Testing Prompts in Production — A Statistical Guide

I changed a system prompt and the outputs looked better. So I shipped it. Three days later, ticket resolution rates had dropped 8%. The outputs felt better — shorter, punchier — but users were following up more because the bot was leaving out important details.

Eyeballing LLM output quality is unreliable. You need A/B testing with actual metrics and enough samples to trust the result. This guide shows you how to do it properly.

Why your intuition fails

Confirmation bias is the obvious problem. You wrote Prompt B, so you're looking for evidence it's better. But there are subtler failure modes:

Small sample sizes amplify noise. LLM outputs are stochastic. The same prompt can produce noticeably different quality on different calls. If you compare 10 outputs from each variant, you're measuring variance, not quality.

Qualitative judgment is inconsistent. Ask two people to rate which response is better and they'll agree maybe 70% of the time. Ask the same person twice with a week in between and they'll match their own earlier rating only about 80% of the time.

You're measuring the wrong thing. "This response sounds better" doesn't tell you if users got what they needed. Downstream metrics — did they follow up? Did they convert? Did they leave the chat? — are what matter.
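To make the small-sample point concrete, here is a minimal simulation sketch. It assumes both "variants" are really the same prompt with a true 60% success rate (an illustrative number, not measured data) and counts how often their observed rates still differ by 20 percentage points or more. The function name and defaults are hypothetical.

```python
import numpy as np

def spurious_gap_rate(n_per_variant: int, true_rate: float = 0.60,
                      gap: float = 0.20, n_sims: int = 20_000,
                      seed: int = 0) -> float:
    """Share of A/A comparisons whose observed rates differ by at least `gap`."""
    rng = np.random.default_rng(seed)
    # Both arms are drawn from the SAME true rate, so any gap is pure noise.
    a = rng.binomial(n_per_variant, true_rate, size=n_sims) / n_per_variant
    b = rng.binomial(n_per_variant, true_rate, size=n_sims) / n_per_variant
    # Small epsilon guards against floating-point rounding at exactly `gap`.
    return float(np.mean(np.abs(a - b) >= gap - 1e-9))

for n in (10, 50, 200, 1000):
    print(f"n={n:>4}: {spurious_gap_rate(n):.1%} of identical-prompt pairs differ by 20pp+")
```

At 10 outputs per variant a large share of these identical-prompt comparisons show a gap that looks like a real improvement; the share shrinks quickly as n grows.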

Define your metric first

A/B testing a prompt without a pre-defined metric is just vibes with extra steps. Your metric needs to be:

Measurable — you can compute it from logs without human judgment (or at least automate the judgment)
Causally linked to prompt quality — changing the prompt should move this metric
Sensitive enough — if only 1 in 1000 users triggers the behavior you care about, you'll need enormous sample sizes

Common metrics for LLM features:

Feature	Metric
Support bot	Resolution rate (no follow-up within 24h)
Writing assistant	User edits the suggestion vs accepts as-is
Search/Q&A	Thumbs up/down, session abandonment
Code assistant	Code accepted without modification
Extraction	Accuracy against labeled ground truth

LLM-as-judge metrics (1-5 scale, scored by a second LLM call) are useful when you can't measure downstream behavior directly. Use a strong model (Opus or GPT-4o) as judge with a detailed rubric. See building evaluation datasets for how to build rubrics that produce reliable scores.

Calculating sample size

This is where most teams skip ahead and get burned. Minimum detectable effect (MDE) drives sample size:

If Prompt A resolves 60% of tickets and you want to detect a 10% relative improvement (to 66%), you need ~1,000 samples per variant
If you want to detect a 5% relative improvement (to 63%), you need ~4,100 samples per variant
If you want to detect a 2% relative improvement (to 61.2%), you need ~26,000 samples per variant

The math (two-proportion z-test, 80% power, 95% confidence):

import numpy as np
from scipy import stats

def required_sample_size(
    baseline_rate: float,
    minimum_detectable_effect: float,
    alpha: float = 0.05,  # significance level
    power: float = 0.80   # statistical power
) -> int:
    """
    Calculate required n per variant.
    
    Args:
        baseline_rate: expected rate in control group (e.g. 0.60 for 60%)
        minimum_detectable_effect: smallest change worth detecting (e.g. 0.10 for 10pp lift)
        alpha: false positive rate (0.05 = 5%)
        power: probability of detecting a real effect (0.80 = 80%)
    """
    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect
    
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    
    pooled_p = (p1 + p2) / 2
    
    n = (
        (z_alpha * np.sqrt(2 * pooled_p * (1 - pooled_p)) +
         z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    ) / (p1 - p2) ** 2
    
    return int(np.ceil(n))

# Customer support example
n = required_sample_size(
    baseline_rate=0.60,      # 60% resolution rate currently
    minimum_detectable_effect=0.10  # want to detect a 10pp improvement
)
print(f"Need {n} samples per variant ({n*2} total)")  # 356 per variant, 712 total
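For reference, the expression that required_sample_size evaluates can be written out as follows. The symbols are shorthand introduced here: p_1 is the baseline rate, p_2 is the baseline plus the MDE, and p-bar is their average.

```latex
n = \frac{\left( z_{1-\alpha/2}\,\sqrt{2\,\bar{p}\,(1-\bar{p})}
      + z_{1-\beta}\,\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^{2}}
     {(p_1 - p_2)^{2}},
\qquad \bar{p} = \frac{p_1 + p_2}{2}
```

Here z_{1-\alpha/2} corresponds to z_alpha in the code and z_{1-\beta} to z_beta (the normal quantile at the chosen power). Because the difference p_1 - p_2 is squared in the denominator, halving the MDE roughly quadruples the required n.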

If your traffic is low, consider being realistic about MDE. A 10pp improvement is a big swing — if your prompt change is more subtle, you need more data.
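One way to pick a realistic MDE is to tabulate the required sample size and test duration for several candidates before committing. The sketch below restates the same two-proportion formula using only the standard library; n_per_variant and DAILY_PER_VARIANT are hypothetical names, and the traffic figure is an assumption you should replace with your own.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_variant(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Two-proportion z-test sample size per variant (two-sided alpha)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

BASELINE = 0.60
DAILY_PER_VARIANT = 50  # assumption: 100 conversations/day, split 50/50

for mde in (0.10, 0.05, 0.03, 0.02):
    n = n_per_variant(BASELINE, BASELINE + mde)
    days = ceil(n / DAILY_PER_VARIANT)
    print(f"MDE {mde * 100:.0f}pp: {n:>6} per variant, about {days} days")
```

If the smallest effect you care about implies a test that runs for months, that is a signal to choose a more sensitive metric or accept a larger MDE.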

A worked example: customer support bot

Here's a concrete test from start to finish.

Context: A B2B SaaS support bot. Current system prompt is verbose (800 tokens). New prompt is tighter (400 tokens) with more explicit resolution steps.

Metric: Resolution rate — defined as no follow-up ticket or follow-up message within 24 hours of the bot's response.

Baseline rate: 60% (measured over the previous 4 weeks from logs).

MDE: We want to detect a 10pp improvement (60% → 70%). Anything less isn't worth the prompt change.

Sample size calculation:

n = required_sample_size(0.60, 0.10)  # → 356 per variant, 712 total

Traffic: ~100 support conversations per day. At 50/50 split, that's 50 per variant per day. We need 356 per variant, so about 8 days to reach the planned sample size, plus a buffer for conversations that have to be dropped.

Traffic splitting: Assign users to variants by hashing their user ID modulo 2:

import hashlib

def get_variant(user_id: str, test_name: str = "prompt_test_v1") -> str:
    """Deterministic variant assignment — same user always gets same variant."""
    hash_input = f"{test_name}:{user_id}"
    hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
    return "A" if hash_value % 2 == 0 else "B"

def get_system_prompt(user_id: str) -> str:
    variant = get_variant(user_id)
    if variant == "A":
        return SYSTEM_PROMPT_A  # Original
    return SYSTEM_PROMPT_B      # New

Hashing on user ID (not session ID) ensures the same user always sees the same variant. Mixing variants within a user's experience is a common mistake that inflates noise.
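Before trusting any result, it is worth checking that the hash actually splits traffic close to 50/50 (a sample ratio mismatch check). The sketch below repeats get_variant so it runs on its own, uses synthetic user IDs as a stand-in for real ones, and applies a chi-square goodness-of-fit test; the 0.001 alert threshold is a common convention, not something from this guide.

```python
import hashlib
from collections import Counter
from scipy.stats import chisquare

def get_variant(user_id: str, test_name: str = "prompt_test_v1") -> str:
    """Same deterministic assignment as above."""
    hash_value = int(hashlib.md5(f"{test_name}:{user_id}".encode()).hexdigest(), 16)
    return "A" if hash_value % 2 == 0 else "B"

# Hypothetical user IDs; in practice, use the IDs that appear in your logs.
user_ids = [f"user-{i}" for i in range(10_000)]
counts = Counter(get_variant(uid) for uid in user_ids)

# Chi-square goodness-of-fit; expected frequencies default to an even split.
stat, p_value = chisquare([counts["A"], counts["B"]])
print(f"A={counts['A']}, B={counts['B']}, SRM p-value={p_value:.3f}")
if p_value < 0.001:
    print("Possible sample ratio mismatch: investigate before reading the results")
```

A very small p-value here usually points to a logging or assignment bug rather than to anything about the prompts.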

Logging: Log every response with the variant, timestamp, user ID, and conversation ID. After 24 hours, check if each conversation has a follow-up. That's your resolution label.
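The labeling step can be sketched roughly like this. The record layout (variant, bot_response_at, follow_ups) is hypothetical; adapt it to whatever your logging pipeline actually stores.

```python
from datetime import datetime, timedelta

FOLLOW_UP_WINDOW = timedelta(hours=24)

def label_resolution(bot_response_at: datetime, follow_ups: list[datetime]) -> bool:
    """True if no follow-up arrived within 24h after the bot's response."""
    deadline = bot_response_at + FOLLOW_UP_WINDOW
    return not any(bot_response_at < ts <= deadline for ts in follow_ups)

def resolution_rates(conversations: list[dict]) -> dict[str, float]:
    """Resolution rate per variant from log records (field names are hypothetical)."""
    totals: dict[str, int] = {}
    resolved: dict[str, int] = {}
    for conv in conversations:
        variant = conv["variant"]
        totals[variant] = totals.get(variant, 0) + 1
        if label_resolution(conv["bot_response_at"], conv["follow_ups"]):
            resolved[variant] = resolved.get(variant, 0) + 1
    return {v: resolved.get(v, 0) / totals[v] for v in totals}

t0 = datetime(2025, 1, 6, 9, 0)
logs = [
    {"variant": "A", "bot_response_at": t0, "follow_ups": []},
    {"variant": "A", "bot_response_at": t0, "follow_ups": [t0 + timedelta(hours=3)]},
    {"variant": "B", "bot_response_at": t0, "follow_ups": [t0 + timedelta(hours=30)]},
    {"variant": "B", "bot_response_at": t0, "follow_ups": []},
]
print(resolution_rates(logs))  # {'A': 0.5, 'B': 1.0}
```

Only label a conversation once its full 24-hour window has elapsed; labeling earlier counts unresolved conversations as resolved.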

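After 7 days: 350 usable conversations (50 dropped due to incomplete data), 175 per variant. That is fewer than the planned sample size, which lowers the test's power but does not invalidate a significant result, provided the stopping point was not chosen by peeking at the data.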

from statsmodels.stats.proportion import proportions_ztest

# Results after 7 days
group_a = {"n": 175, "resolved": 107}  # 61.1%
group_b = {"n": 175, "resolved": 126}  # 72.0%

rate_a = group_a["resolved"] / group_a["n"]
rate_b = group_b["resolved"] / group_b["n"]

# Two-proportion z-test (two-sided by default).
# Note: proportions_ztest lives in statsmodels, not in scipy.stats.
count = [group_a["resolved"], group_b["resolved"]]
nobs = [group_a["n"], group_b["n"]]

z_stat, p_value = proportions_ztest(count, nobs)

print(f"Variant A: {rate_a:.1%}")
print(f"Variant B: {rate_b:.1%}")
print(f"Lift: {(rate_b - rate_a) * 100:.1f}pp")
print(f"p-value: {p_value:.4f}")
print(f"Significant: {p_value < 0.05}")

Output: Variant B, 72.0% vs 61.1%, a lift of 10.9 percentage points, p ≈ 0.031. Significant at the 0.05 level. Ship it.
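A p-value says the lift is probably not zero, but not how large it is. As a rough complement, the sketch below computes a 95% Wald confidence interval for the difference alongside the pooled z-test, using only the standard library. The helper name is hypothetical, and the Wald interval is one simple choice among several.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_summary(x_a: int, n_a: int, x_b: int, n_b: int, conf: float = 0.95):
    """Pooled two-sided z-test p-value plus a Wald interval for (rate_b - rate_a)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    diff = p_b - p_a
    # The test uses the pooled rate, i.e. the variance under "no difference".
    pooled = (x_a + x_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se_pooled)))
    # The interval uses each arm's own variance.
    se_unpooled = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return diff, p_value, (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

diff, p_value, (low, high) = two_proportion_summary(107, 175, 126, 175)
print(f"Lift: {diff * 100:.1f}pp, p = {p_value:.3f}, "
      f"95% CI: {low * 100:.1f}pp to {high * 100:.1f}pp")
```

With 175 conversations per variant the interval is wide, running from roughly 1 to 21 percentage points, so the data support "B is better" far more firmly than they support any particular size of improvement.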

The most important rule: don't stop early

You ran the test for 4 days, peeked at the data, saw Variant B was winning 72% vs 58%, and wanted to stop. Don't.

Early stopping inflates false positive rates dramatically. If you check after every 10 samples and stop when p < 0.05, your actual false positive rate is closer to 30%, not 5%. You need to pre-commit to your sample size and let the test run.
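The inflation is easy to reproduce with an A/A simulation, where both arms share the same true rate so every "significant" result is a false positive. The sketch below is illustrative only: the 60% rate, 200 samples per arm, and check-every-10 schedule are assumptions, and the exact inflated rate depends on how often you peek.

```python
import numpy as np
from scipy.stats import norm

def simulate_peeking(n_sims: int = 2000, n_max: int = 200, step: int = 10,
                     rate: float = 0.60, alpha: float = 0.05, seed: int = 0):
    """Return (false positive rate with peeking, false positive rate at fixed n)."""
    rng = np.random.default_rng(seed)
    # Both arms use the SAME rate, so there is no real effect to find.
    a = rng.random((n_sims, n_max)) < rate
    b = rng.random((n_sims, n_max)) < rate
    looks = np.arange(step, n_max + 1, step)       # sample sizes at each peek
    succ_a = a.cumsum(axis=1)[:, looks - 1]
    succ_b = b.cumsum(axis=1)[:, looks - 1]
    pooled = (succ_a + succ_b) / (2 * looks)
    se = np.sqrt(pooled * (1 - pooled) * 2 / looks)
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (succ_b - succ_a) / looks / se
    p = np.nan_to_num(2 * norm.sf(np.abs(z)), nan=1.0)  # undefined test -> not significant
    peeking_fpr = float((p < alpha).any(axis=1).mean())  # stop at first p < alpha
    fixed_fpr = float((p[:, -1] < alpha).mean())         # test once, at the end
    return peeking_fpr, fixed_fpr

peeking_fpr, fixed_fpr = simulate_peeking()
print(f"False positives with peeking: {peeking_fpr:.1%}")
print(f"False positives at fixed n:   {fixed_fpr:.1%}")
```

The fixed-n rate should land near the nominal 5%, while the peeking rate comes out several times higher.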

If you absolutely can't wait, use sequential testing methods (like the mSPRT or always-valid p-values) which are designed for interim peeking. The sequential Python package implements these.

Using Promptfoo for automated A/B comparison

Promptfoo makes it easy to run variants against a fixed test set. It's not the same as live A/B testing (you're testing against pre-labeled examples, not live traffic), but it's fast and good for pre-ship validation.

# promptfoo.yaml
prompts:
  - id: prompt_a
    raw: "{{system_prompt_a}}"
  - id: prompt_b  
    raw: "{{system_prompt_b}}"

providers:
  - anthropic:messages:claude-3-5-haiku-20241022

tests:
  - vars:
      user_message: "How do I cancel my subscription?"
    assert:
      - type: llm-rubric
        value: "Response explains cancellation steps clearly and completely"
  - vars:
      user_message: "I was charged twice this month"
    assert:
      - type: llm-rubric
        value: "Response acknowledges the billing issue and provides next steps"

Run with promptfoo eval and you get a side-by-side comparison with pass rates per variant. Good for catching obvious regressions before you start a live test.

Braintrust is better for ongoing tracking. You push eval results to Braintrust after every prompt change, and it tracks improvement trends over time. Useful when you're iterating quickly and want a history of eval scores tied to specific prompt versions.

The practical shortcut: blind evaluation

When you can't wait for statistical significance — maybe you're testing on a new feature with low traffic — blind evaluation gets you directional signal faster.

Take 50 real user queries. Run both prompts. Strip the variant label. Have a colleague (or a strong LLM) judge which response is better for each pair without knowing which prompt produced it.

import anthropic
import random

client = anthropic.Anthropic()

def blind_evaluate(
    query: str,
    response_a: str,
    response_b: str,
    rubric: str
) -> str:
    """Returns 'A' or 'B' or 'tie'."""
    
    # Randomize order to prevent position bias
    if random.random() > 0.5:
        first, second, labels = response_a, response_b, ("A", "B")
    else:
        first, second, labels = response_b, response_a, ("B", "A")
    
    judge_prompt = f"""You are evaluating two AI responses to a user query.

Query: {query}

Rubric: {rubric}

Response 1:
{first}

Response 2:
{second}

Which response better satisfies the rubric? Reply with "1", "2", or "tie". Then explain in one sentence."""
    
    result = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=100,
        messages=[{"role": "user", "content": judge_prompt}]
    )
    
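    # Normalize the reply: the judge may wrap its answer in quotes or vary the casing.
    text = result.content[0].text.strip().lstrip("\"'").lower()
    winner_num = "1" if text.startswith("1") else "2" if text.startswith("2") else "tie"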
    
    # Map back to original labels
    if winner_num == "1":
        return labels[0]
    elif winner_num == "2":
        return labels[1]
    return "tie"

At 50 samples, if Prompt B wins 38/50 comparisons (76%), that's a strong signal: a split that lopsided would be very unlikely if the two prompts were equally good. If it's 27/50 (54%), it's indistinguishable from a coin flip and you need a larger test.
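If you want a number for how surprising a win count is, a sign test is one simple option. This sketch uses scipy's binomtest and assumes ties have already been set aside; the sign_test wrapper is a hypothetical helper.

```python
from scipy.stats import binomtest

def sign_test(wins_b: int, wins_a: int) -> float:
    """Two-sided p-value for 'B wins as often as A'; ties are excluded beforehand."""
    decided = wins_a + wins_b
    return binomtest(wins_b, decided, p=0.5, alternative="two-sided").pvalue

for wins_b, wins_a in [(38, 12), (27, 23)]:
    print(f"B wins {wins_b}/{wins_b + wins_a}: p = {sign_test(wins_b, wins_a):.4f}")
```

This only tells you which prompt the judge prefers, not whether users resolve more tickets, so treat it as a filter for what is worth a live test.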

Common mistakes

Testing on too few outputs from your eval set. If you have 20 labeled examples and run each prompt 3 times, you have 60 data points but heavy autocorrelation. These aren't independent samples.
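One way to avoid overcounting repeated runs is to collapse them to a single number per example before comparing prompts, so that the unit of analysis is the example. A minimal sketch, with hypothetical 1-5 judge scores and a hypothetical helper name:

```python
from statistics import mean

def per_example_differences(scores_a: dict[str, list[float]],
                            scores_b: dict[str, list[float]]) -> list[float]:
    """One paired difference (B minus A) per example, averaging over repeated runs."""
    shared = sorted(scores_a.keys() & scores_b.keys())
    return [mean(scores_b[ex]) - mean(scores_a[ex]) for ex in shared]

# Hypothetical judge scores (1-5) for 3 runs of each prompt on 3 examples.
scores_a = {"ex1": [3, 4, 3], "ex2": [2, 2, 3], "ex3": [5, 4, 5]}
scores_b = {"ex1": [4, 4, 5], "ex2": [2, 3, 2], "ex3": [5, 5, 4]}

diffs = per_example_differences(scores_a, scores_b)
total_runs = sum(len(runs) for runs in scores_a.values())
print(f"Independent units: {len(diffs)} examples, not {total_runs} runs")
print(f"Examples where B scored higher on average: {sum(d > 0 for d in diffs)}/{len(diffs)}")
```

Repeated runs still help by reducing noise within each example, but they do not increase the number of independent comparisons.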

Changing the prompt mid-test. If you notice a problem with Prompt B and fix it on day 3, your test data is now from two different prompts. Start over.

Testing on non-representative queries. If your eval set is 20 easy questions, it won't reveal that Prompt B fails on edge cases. See the evaluation datasets guide for how to build representative test sets.

Ignoring latency. Prompt B might score better on quality but take 2 seconds longer. That matters. Measure it.
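Latency can be logged with the same harness you use for quality. The sketch below times a callable over a list of inputs and reports the median and 95th percentile; fake_call is a stand-in for a real model call, and all names here are hypothetical.

```python
import time
from statistics import median, quantiles

def measure_latencies(call, inputs) -> list[float]:
    """Wall-clock seconds for each call(item)."""
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        call(item)
        latencies.append(time.perf_counter() - start)
    return latencies

def summarize(latencies: list[float]) -> dict[str, float]:
    """Median and 95th percentile; needs at least two measurements."""
    # With n=20 there are 19 cut points; the last one is the 95th percentile.
    p95 = quantiles(latencies, n=20, method="inclusive")[-1]
    return {"p50": median(latencies), "p95": p95}

# Stand-in for a real model call; replace with your own client code.
def fake_call(query: str) -> str:
    time.sleep(0.001)
    return query.upper()

stats_a = summarize(measure_latencies(fake_call, ["q1", "q2", "q3", "q4", "q5"]))
print(stats_a)
```

Compare tail latency (p95) as well as the median: a prompt that is usually fast but occasionally very slow can still hurt the user experience.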

A/B testing prompts properly takes more discipline than most teams apply. But one rigorous test that correctly catches a regression is worth more than twenty gut-feel comparisons that led you in the wrong direction.

For setting up systematic eval infrastructure, the LLM evaluation frameworks guide covers the full toolchain. And before shipping any variant to production, run through the agent production checklist to make sure you've covered your bases.

Related articles

Async Python for LLM Apps — Patterns That Actually Work in Production

FastAPI + Claude API — Production Patterns for AI Backends

LLM Observability with OpenTelemetry — Beyond LangSmith
