


How to Use the A/B Test Sample Size Calculator

1. Enter Your Baseline Conversion Rate

Type in your current control conversion rate — the percentage of visitors who currently complete the goal action (e.g., purchase, sign-up, click). If you're not sure, check your analytics platform for the last 30–90 days of data on the specific metric you plan to test.

2. Set Your Minimum Detectable Effect

Enter the smallest lift that would be worth acting on. Choose 'Relative' to express the MDE as a percentage of the baseline (e.g., 20% relative on a 5% baseline = detecting lifts of 1 percentage point or more). Choose 'Absolute' to specify the raw percentage-point lift directly. A larger MDE means fewer visitors are required, but you risk missing smaller real improvements.
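The relative-to-absolute conversion is simple arithmetic. A minimal sketch (the function name is illustrative, not part of the calculator):

```python
def mde_to_absolute(baseline: float, mde: float, relative: bool) -> float:
    """Convert an MDE setting to the absolute lift in percentage points.

    baseline and mde are percentages, e.g. baseline=5.0, mde=20.0.
    """
    if relative:
        # 20% relative on a 5% baseline -> 5 * 0.20 = 1 percentage point
        return baseline * mde / 100.0
    # An absolute MDE is already expressed in percentage points
    return mde

print(mde_to_absolute(5.0, 20.0, relative=True))   # 1.0 pp
print(mde_to_absolute(5.0, 1.0, relative=False))   # 1.0 pp
```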

3. Choose Significance Level and Power

Select your significance level (95% is standard, meaning a 5% false-positive risk) and statistical power (80% is the industry default, meaning an 80% chance of detecting a real effect). Higher confidence or higher power both increase the required sample size. For critical business decisions, consider 99% confidence. Add your daily visitor count to get an estimated test duration.
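The standard closed-form approximation behind calculators like this is the two-proportions z-test sample-size formula. A sketch, assuming a pooled-variance-free ("unpooled") form; the exact formula a given tool uses may differ slightly:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline: float, mde_abs: float,
                            alpha: float = 0.05, power: float = 0.80,
                            two_tailed: bool = True) -> int:
    """Visitors needed per variant for a two-proportions z-test.

    baseline and mde_abs are proportions, e.g. 0.05 baseline, 0.01 lift.
    """
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - (alpha / 2 if two_tailed else alpha))
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)

# 5% baseline, 1 pp absolute lift, 95% significance, 80% power
print(sample_size_per_variant(0.05, 0.01))  # roughly 8,000+ per variant
```

Note how the required n scales with the inverse square of the MDE: halving the detectable lift roughly quadruples the sample size.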

4. Analyze Post-Test Results

After your test concludes, switch to the 'Analyze Results' tab and enter the actual visitor and conversion counts for your control and variation. The calculator outputs a p-value, Z-score, confidence interval, and observed power — and flags any Sample Ratio Mismatch that might indicate a bucketing problem. Add your average order value and monthly visitors to see projected revenue impact.
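The core of the results analysis is a two-proportions z-test. A self-contained sketch of the z-score, p-value, and confidence interval (SRM and observed power are separate checks, and the function name is illustrative):

```python
from statistics import NormalDist

def analyze(visitors_a: int, conv_a: int, visitors_b: int, conv_b: int,
            alpha: float = 0.05):
    """Two-proportions z-test: returns (z, p_value, ci_low, ci_high)."""
    p_a = conv_a / visitors_a
    p_b = conv_b / visitors_b
    # Pooled proportion under the null hypothesis of no difference
    p_pool = (conv_a + conv_b) / (visitors_a + visitors_b)
    se_pool = (p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b)) ** 0.5
    z = (p_b - p_a) / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed
    # Unpooled standard error for the CI on the difference in rates
    se = (p_a * (1 - p_a) / visitors_a + p_b * (1 - p_b) / visitors_b) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    diff = p_b - p_a
    return z, p_value, diff - z_crit * se, diff + z_crit * se

# Control: 500/10,000 (5.0%); variation: 600/10,000 (6.0%)
z, p, lo, hi = analyze(10000, 500, 10000, 600)
print(f"z={z:.2f}, p={p:.4f}, 95% CI=({lo:.4f}, {hi:.4f})")
```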

Frequently Asked Questions

What is the difference between absolute and relative MDE?

Relative MDE expresses the minimum detectable effect as a percentage of your baseline. For example, a 20% relative MDE on a 5% baseline means you want to detect at least a 1 percentage-point lift (5% × 20% = 1 pp). Absolute MDE specifies the percentage-point change directly — a 1% absolute MDE also means detecting a 1 pp lift, but the framing is independent of your baseline. For low-baseline metrics (e.g., 1% purchase rate), relative MDE is usually more intuitive. For high-baseline metrics (e.g., 50% button click rate), absolute MDE makes the target clearer. Either way, a smaller MDE requires a substantially larger sample size.

When should I use a one-tailed vs. two-tailed test?

A two-tailed test checks whether the variation is different from the control in either direction — better or worse. A one-tailed test checks only whether the variation is better (or only whether it is worse). One-tailed tests require a smaller sample size because the critical region is on one side of the distribution, but they carry a risk: if your variation unexpectedly hurts conversions, a one-tailed test configured to detect improvements will fail to flag the harm. Best practice for most A/B tests is two-tailed, especially for changes that could plausibly have negative effects. Reserve one-tailed tests for situations where a negative result is genuinely impossible or irrelevant.
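The sample-size saving from going one-tailed comes from the smaller critical value (about 1.645 vs. 1.960 at α = 0.05). A quick sketch of the arithmetic:

```python
from statistics import NormalDist

alpha = 0.05
z_two = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
z_one = NormalDist().inv_cdf(1 - alpha)      # one-tailed critical value
z_beta = NormalDist().inv_cdf(0.80)          # 80% power

# Required n scales with (z_alpha + z_beta)^2, so the one-tailed saving is:
saving = 1 - ((z_one + z_beta) / (z_two + z_beta)) ** 2
print(f"two-tailed z={z_two:.3f}, one-tailed z={z_one:.3f}, "
      f"~{saving:.0%} fewer visitors one-tailed")
```

At 95% significance and 80% power this works out to roughly a one-fifth reduction in sample size, which is exactly the temptation the warning above is about.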

Why does the calculator recommend running for at least 14 days?

User behavior varies significantly by day of week. Monday morning shoppers behave differently from Saturday afternoon browsers. If your test runs for only a few days, it may capture a disproportionate slice of one type of user, producing results that don't generalize. Running for at least two full weeks (14 days) ensures every day of the week is represented roughly twice in each variant's data. Some practitioners extend this to 3–4 weeks for B2B products with long consideration cycles. The 14-day minimum is a heuristic recommended by major A/B testing platforms including AB Tasty, Optimizely, and VWO.
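A duration estimate with a 14-day floor is straightforward to sketch (the function name and the ~8,000-per-variant figure are illustrative):

```python
import math

def estimated_duration_days(total_sample_size: int, daily_visitors: int,
                            min_days: int = 14) -> int:
    """Days to reach the required sample, floored at a 14-day minimum."""
    raw_days = math.ceil(total_sample_size / daily_visitors)
    return max(raw_days, min_days)

# Two variants at ~8,000 visitors each, 2,000 visitors/day entering the test:
# 16,000 / 2,000 = 8 raw days, floored to the 14-day minimum
print(estimated_duration_days(2 * 8000, 2000))  # 14
```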

What is a Sample Ratio Mismatch (SRM) and why does it matter?

A Sample Ratio Mismatch occurs when the traffic split between your control and variation doesn't match the intended ratio. For a 50/50 test, you'd expect roughly equal visitors in each bucket. If one bucket has significantly more visitors than the other (as detected by a chi-squared goodness-of-fit test), it suggests your bucketing mechanism is flawed — perhaps due to bot traffic, caching issues, redirect chains, or bugs in your experimentation SDK. An SRM invalidates your test results because the two groups are no longer comparable. Always check for SRM before concluding any test, especially if you're running on custom infrastructure rather than a dedicated A/B testing platform.
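The SRM check itself is a one-degree-of-freedom chi-squared goodness-of-fit test. A minimal sketch for a two-bucket split, using the identity that a chi-squared variable with 1 df is a squared standard normal:

```python
import math

def srm_p_value(n_control: int, n_variant: int,
                expected_ratio: float = 0.5) -> float:
    """Chi-squared goodness-of-fit test (1 df) for a two-bucket split."""
    total = n_control + n_variant
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    observed = [n_control, n_variant]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2))

# A 50/50 split that drifted: 10,000 vs. 10,500 visitors
p = srm_p_value(10000, 10500)
print(f"SRM p-value: {p:.6f}")  # well below 0.001 -> investigate bucketing
```

A common convention is to flag SRM when this p-value falls below 0.001; a 500-visitor imbalance on ~20,000 total, as above, already trips that threshold.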

What is statistical power, and why is 80% the standard?

Statistical power (1−β) is the probability that your test will detect a real effect of the specified size, given that the effect truly exists. At 80% power, if your variation truly improves conversion by your MDE, you have an 80% chance of observing a statistically significant result and a 20% chance of missing it (Type II error). The 80% convention comes from Jacob Cohen's 1988 statistical power handbook, where he suggested that a 4:1 ratio of Type II to Type I errors (β/α = 4 at α=0.05) was a reasonable default. Higher power (90%, 95%) increases confidence in negative results but substantially increases required sample size. Most CRO practitioners accept 80% as a practical trade-off between rigor and test speed.

Can I use this calculator for metrics other than conversion rates?

This calculator is designed for binomial (0/1) conversion rate metrics — did the user convert or not? It uses the two-proportions Z-test formula, which is appropriate when your metric is a proportion. For continuous metrics like average order value, revenue per visitor, or session duration, you need a two-sample t-test with variance estimation, which requires knowing the standard deviation of your metric — information this calculator does not collect. For those use cases, consider using an experiment platform that supports continuous metrics, or calculate the required sample size using a t-test power calculator with your metric's historical standard deviation.
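For continuous metrics, the analogous sample-size calculation needs the metric's standard deviation. A sketch using the normal approximation to the two-sample t-test (the function name and dollar figures are illustrative):

```python
import math
from statistics import NormalDist

def t_test_sample_size(std_dev: float, mde: float,
                       alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-variant n for a two-sample test on a continuous metric.

    Normal approximation: n = 2 * (z_alpha/2 + z_beta)^2 * sigma^2 / delta^2.
    std_dev and mde are in the metric's own units (e.g. dollars of AOV).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_beta) ** 2 * std_dev ** 2 / mde ** 2
    return math.ceil(n)

# Detect a $2 lift in average order value, historical sigma of $40
print(t_test_sample_size(std_dev=40.0, mde=2.0))
```

Note that the variance term here comes from your historical data rather than from the baseline rate, which is exactly the input this calculator does not collect.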