🧪 A/B Test Analyzer

Verified by Community

Guides you through proper A/B test methodology: hypothesis formation, sample size calculation, test duration, statistical significance, and how to avoid common pitfalls like peeking.

ab-testing, statistics, experiments, optimization, conversion

Design and analyze A/B tests with proper statistical methodology.

Usage

  1. Form a clear hypothesis: "Changing X will improve Y by Z%"
  2. Calculate the required sample size from the baseline rate, minimum detectable effect, significance level, and statistical power (see the sketch after these steps)
  3. Determine test duration and traffic allocation
  4. Monitor for technical issues without peeking at results
  5. Analyze results with proper statistical tests and make a decision
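
A minimal sketch of the sample-size step (steps 2–3), using the standard two-proportion normal approximation; the function name and defaults are illustrative:

```python
from math import sqrt
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed per variant to detect a relative lift in a
    conversion rate (two-sided two-proportion test, normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return numerator / (p2 - p1) ** 2

n = sample_size_per_variant(0.05, 0.10)  # 5% baseline, 10% relative MDE
print(round(n))                          # ~31,200 per variant
print(round(2 * n / 1000))               # ~62 days at 1,000 visitors/day, 50/50 split
```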

Examples

  • Sample size calculation: Baseline conversion rate: 5%. Minimum detectable effect: 10% relative (0.5 percentage points). Significance level: 95% (alpha=0.05). Power: 80%. Required sample: ~31,000 per variant. At 1,000 visitors/day with a 50/50 split, the test needs ~62 days. If that's too long, either increase traffic or accept a larger minimum detectable effect
  • Analyzing results: Control: 5,000 visitors, 250 conversions (5.0%). Variant: 5,000 visitors, 280 conversions (5.6%). Relative lift: +12%. P-value: 0.18. NOT statistically significant (p > 0.05). Decision: do not ship the variant. The apparent improvement could be due to chance; either collect more data or accept that the effect may not be real (see the z-test sketch after these examples)
  • Segmented analysis: Overall result: no significant difference. But segmenting by device, mobile shows +15% (significant) while desktop shows -5% (not significant). This suggests a mobile-specific improvement. Validate with a follow-up mobile-only test before concluding, because segment analysis inflates false positives (a multiple-comparisons sketch follows below)
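
A minimal sketch of the significance test behind the second example, a two-sided two-proportion z-test (the function name is illustrative); it reproduces the p ≈ 0.18 above:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)       # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p = two_proportion_z_test(250, 5000, 280, 5000)
print(f"z = {z:.2f}, p = {p:.2f}")                 # z = 1.34, p = 0.18
```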

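For the segmented analysis, a simple Bonferroni correction is one way to guard against the inflated false positives it warns about. This sketch reuses two_proportion_z_test from above; the segment counts are made up for illustration:

```python
# Reuses two_proportion_z_test from the previous sketch.
# Segment counts below are hypothetical, for illustration only.
segments = {
    "mobile":  (120, 2400, 141, 2450),   # conv_a, n_a, conv_b, n_b
    "desktop": (130, 2600, 124, 2550),
}
adjusted_alpha = 0.05 / len(segments)    # Bonferroni: split alpha across k segments
for name, counts in segments.items():
    _, p = two_proportion_z_test(*counts)
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"{name}: p = {p:.3f} -> {verdict} at alpha = {adjusted_alpha:.3f}")
```
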
Guidelines

  • Never peek at results before reaching your pre-calculated sample size; repeated peeking can inflate the false positive rate from the nominal 5% to 30%+ (see the simulation after this list)
  • If you must check early, use sequential testing methods (always-valid p-values) instead of fixed-horizon tests
  • Test one change at a time. If you change headline AND button color, you can't attribute the effect to either
  • Run tests for full weeks (7, 14, 21 days) to account for day-of-week effects
  • A "non-significant" result is still a result: it tells you the change produced no effect large enough to detect at your chosen sensitivity, and so isn't worth further investment
  • Document every test: hypothesis, variants, sample size, duration, result, decision. Build an institutional testing knowledge base
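
The peeking guideline can be sanity-checked with a quick simulation: run A/A tests (no real difference between arms), apply the z-test after each day's traffic, and stop at the first p < 0.05. All parameters here are illustrative:

```python
import random
from math import sqrt

def peeking_false_positive_rate(n_sims=1000, daily=100, days=20, p=0.05, seed=1):
    """Simulate A/A tests where both arms share the true rate p, run a
    z-test after every day, and stop the moment it looks significant.
    Returns the fraction of simulations that falsely declare a winner."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = n = 0
        for _ in range(days):
            conv_a += sum(rng.random() < p for _ in range(daily))
            conv_b += sum(rng.random() < p for _ in range(daily))
            n += daily
            pool = (conv_a + conv_b) / (2 * n)
            se = sqrt(pool * (1 - pool) * 2 / n)
            if se > 0 and abs(conv_b - conv_a) / n / se > 1.96:
                hits += 1                # peeked, "won", stopped early
                break
    return hits / n_sims

print(peeking_false_positive_rate())     # typically well above the nominal 5%
```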