
A/B Test Results Interpreter

Paste A/B test results and get a statistically sound interpretation, including whether to ship, wait, or iterate.

Intermediate · Works with any model · Data
Prompt
Interpret the following A/B test results and give me a clear ship/wait/iterate recommendation.

**Test name:** [TEST_NAME]
**What was changed:** [CHANGE_DESCRIPTION]
**Metric being measured:** [PRIMARY_METRIC] (primary) and [SECONDARY_METRICS] (guardrail metrics)
**Test duration:** [DURATION]
**Sample size:** Control: [CONTROL_N] users | Variant: [VARIANT_N] users

**Results:**
- Control [PRIMARY_METRIC]: [CONTROL_VALUE]
- Variant [PRIMARY_METRIC]: [VARIANT_VALUE]
- Reported p-value: [P_VALUE]
- Confidence interval: [CONFIDENCE_INTERVAL]
- Guardrail metric results: [GUARDRAIL_RESULTS]

Analyze this test with:

1. **Statistical validity check** — is the sample size adequate? Was the test run long enough? Any signs of peeking, novelty effects, or Simpson's paradox to watch for?

2. **Practical significance** — is the measured lift large enough to matter for the business? State the absolute difference and the relative lift separately.

3. **Recommendation** — Ship / Wait / Iterate, with one clear sentence of reasoning.

4. **Risks** — what could make this result misleading? (Segment differences, interaction effects, short-term vs. long-term behavior)

5. **Next steps** — if shipping: monitoring plan. If waiting: what additional data is needed. If iterating: what hypothesis to test next.

How to Use

Fill in the test parameters and results from your experimentation platform (Optimizely, VWO, Google Optimize, or a custom setup). The more guardrail metrics you include, the better the AI can spot cases where the primary metric improved at the cost of something else. Note that [CONFIDENCE_INTERVAL] is the range around the effect size, not the confidence level.
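If your platform reports only raw counts, you can compute the p-value and 95% CI yourself before filling in the template. Below is a minimal sketch of a two-sided two-proportion z-test using only the Python standard library; the function name and the example numbers are illustrative, not from any particular platform.

```python
import math

def two_proportion_test(control_n, control_conv, variant_n, variant_conv):
    """Two-sided two-proportion z-test with a 95% CI on the absolute lift.

    Returns (absolute_diff, relative_lift, p_value, ci_low, ci_high).
    """
    p1 = control_conv / control_n
    p2 = variant_conv / variant_n
    diff = p2 - p1

    # Pooled standard error for the z statistic (null hypothesis: p1 == p2)
    pooled = (control_conv + variant_conv) / (control_n + variant_n)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / variant_n))
    z = diff / se_pooled

    # Two-sided p-value from the standard normal CDF (via math.erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

    # Unpooled standard error for the CI around the observed difference
    se = math.sqrt(p1 * (1 - p1) / control_n + p2 * (1 - p2) / variant_n)
    ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

    return diff, diff / p1, p_value, ci_low, ci_high

# Hypothetical example: 5.0% vs 5.6% conversion on 10,000 users per arm
diff, lift, p, lo, hi = two_proportion_test(10_000, 500, 10_000, 560)
```

Paste `diff` and `lift` into the results section, `p` as [P_VALUE], and `(lo, hi)` as [CONFIDENCE_INTERVAL]. This normal approximation is standard for conversion-rate tests but assumes reasonably large samples in both arms.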

Variables

| Variable | Description |
| --- | --- |
| [TEST_NAME] | Short descriptive name for the test |
| [CHANGE_DESCRIPTION] | What was different in the variant (e.g., "button color changed from grey to blue") |
| [PRIMARY_METRIC] | The main success metric (e.g., checkout conversion rate) |
| [SECONDARY_METRICS] | Guardrail metrics to check for negative side effects (e.g., session duration, return rate) |
| [DURATION] | How long the test ran |
| [CONTROL_N] / [VARIANT_N] | User counts in each group |
| [CONTROL_VALUE] / [VARIANT_VALUE] | Measured metric values for each group |
| [P_VALUE] | Statistical significance result from your platform |
| [CONFIDENCE_INTERVAL] | The 95% CI on the effect size |
| [GUARDRAIL_RESULTS] | Results for secondary metrics |

Tips

  • If you don't have a confidence interval, ask: "Calculate a 95% confidence interval given these results" before running this prompt.
  • For inconclusive tests (p > 0.05), ask: "What sample size would I need to detect a lift of X% at 80% power?" to plan the next iteration.
  • Always include at least one guardrail metric — a test that improves conversion but increases refund rate is not a win.
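The sample-size question in the second tip can also be answered locally. Here is a minimal sketch of the standard normal-approximation power calculation for a two-proportion test; the function name and the example (5% baseline, 10% relative lift) are illustrative assumptions.

```python
import math
from statistics import NormalDist

def required_sample_size(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided two-proportion z-test.

    baseline_rate: control conversion rate (e.g., 0.05 for 5%)
    relative_lift: minimum detectable relative lift (e.g., 0.10 for +10%)
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
         / (p2 - p1) ** 2)
    return math.ceil(n)

# Hypothetical example: detect a +10% relative lift on a 5% baseline
n = required_sample_size(0.05, 0.10)  # roughly 31k users per arm
```

Numbers in the tens of thousands per arm are typical for small lifts on low baseline rates, which is why inconclusive tests usually need a longer run rather than a rerun.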