EverydayTools — Simple • Free • Fast

A/B Test Sample Size Calculator

Plan your experiment and analyze results with statistical confidence

Running an A/B test without knowing how many visitors you need is like setting off on a road trip without a map — you might reach your destination, but you're far more likely to run out of fuel halfway there. The A/B Test Sample Size Calculator solves this foundational problem: it tells you exactly how many visitors each variation needs before your results can be trusted.

A/B testing — also called split testing — is the backbone of modern conversion rate optimization. The concept is straightforward: you show two versions of a page, email, or feature to different users, measure which one performs better, and use statistics to determine whether the observed difference is real or just noise. But the math hiding beneath this process is what most practitioners get wrong. Running a test for only a few days, or stopping as soon as you see a promising lift, inflates your false-positive rate and produces results you can't act on.

The key inputs to any sample size calculation are your baseline conversion rate, the minimum detectable effect (MDE), your significance level, and your desired statistical power. Your baseline conversion rate is simply your current rate — the percentage of visitors who complete the goal action in the control group. The MDE is the smallest improvement you care about detecting: if a 1% absolute lift isn't worth the engineering cost to ship, there's no point designing a test that can detect it. Your significance level (α) controls your false-positive rate: at 95% confidence, you accept a 5% chance that a test will appear significant even when the variation has no real effect. Statistical power (1−β) is the flip side: it controls your false-negative rate — at 80% power, you have an 80% chance of detecting a real effect of the stated size. These four inputs feed into the two-proportions Z-test formula, which produces the required sample size per variation.
The formula is: n = (Z_α/2 + Z_β)² × [p₁(1−p₁) + p₂(1−p₂)] ÷ (p₁−p₂)², where p₁ is your baseline rate and p₂ is your expected rate with the MDE applied. Once you have a sample size, divide the total required sample (all variants combined) by your daily visitor count to estimate how many days the test must run.

The tool also supports post-test significance analysis. After a test concludes, you enter the actual visitor and conversion counts for your control and variation. The calculator computes the Z-score, p-value, confidence interval, and observed power. If the p-value is below your significance threshold, the result is statistically significant and you can ship the winning variation with confidence.

One practical note that many calculators omit: even if the math says a test needs only 7 days, industry best practice recommends running for a minimum of 14 days. This ensures you capture full weekly seasonality cycles — user behavior on Mondays differs substantially from behavior on Saturdays, and a test that runs for fewer than two weeks may miss this variation.

This calculator supports relative and absolute MDE, one-tailed and two-tailed hypotheses, two to five variants (for A/B/C/D/E tests), and percent-of-traffic controls. The MDE sensitivity table shows how dramatically sample size grows as you try to detect smaller and smaller effects — a key insight into why low-traffic sites should focus on large lifts. The power curve chart makes this relationship visual: sample size grows steeply, roughly with the inverse square of the MDE, as the MDE shrinks toward zero. For e-commerce teams, the revenue impact section adds practical context: given your average order value and monthly traffic, the calculator estimates the monetary value of the observed uplift, making the business case for CRO investment concrete and defensible.

Understanding A/B Test Statistics

What Is Statistical Significance in A/B Testing?

Statistical significance measures how unlikely your observed difference would be if the two variations actually performed identically. When you run an A/B test, you're sampling from a larger population — your total user base. Even if both variations perform identically, random fluctuations will cause them to show different conversion rates in your sample. Statistical significance, expressed as a confidence level (95%, 99%, etc.), is the threshold below which you declare a result 'real.' At 95% confidence, you accept a 5% chance of a false positive — declaring a winner when no true difference exists. Setting a higher threshold (99%) reduces false positives but requires a larger sample size. Most practitioners use 95% as the standard, though high-stakes decisions (medical, financial) may warrant 99%.

How Is Sample Size Calculated?

The sample size formula for comparing two conversion rates is based on the two-proportions Z-test: n = (Z_α/2 + Z_β)² × [p₁(1−p₁) + p₂(1−p₂)] / (p₁−p₂)², where p₁ is your baseline conversion rate, p₂ is p₁ plus your minimum detectable effect (MDE), Z_α/2 is the critical value for your significance level (1.96 for 95% two-tailed), and Z_β is the critical value for your desired power (0.842 for 80% power). The result n is the minimum number of visitors required per variation. For tests with more than two variants, multiply n by the number of variants to get total required traffic. The formula reveals why small effects require enormous samples: as (p₁−p₂) approaches zero, the denominator shrinks toward zero and n explodes toward infinity.
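As a sanity check, the formula can be sketched in a few lines of Python using the standard library's NormalDist for the Z critical values (the function name and defaults below are illustrative, not the calculator's actual code):

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline, relative_mde,
                              confidence=0.95, power=0.80):
    """Two-proportions Z-test sample size, two-tailed."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)        # expected rate with MDE applied
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # 1.960 at 95%
    z_beta = NormalDist().inv_cdf(power)                      # 0.842 at 80%
    numerator = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    denominator = (p1 - p2) ** 2
    return math.ceil(numerator / denominator)  # always round up

# 3.5% baseline, 20% relative MDE (0.7 pp), 95% confidence, 80% power
print(sample_size_per_variation(0.035, 0.20))  # → 11856
```

With full-precision Z values the result (11,856) lands a visitor or two below hand calculations that round intermediate values, as in the worked examples further down.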

Why Proper Sample Sizing Matters

Underpowered A/B tests are one of the most costly mistakes in digital experimentation. When you stop a test early — as soon as you see a positive result — you exploit the randomness in your data. This practice, called 'peeking,' inflates your true false-positive rate far above the nominal α level. Academic research has shown that stopping a test at the first significant result can increase your false-positive rate to 30–50%, even when running at a nominal 5% significance level. A properly sized test prevents this: you decide the sample size before the test begins, run to completion, and make a single decision. This discipline is what separates rigorously validated results from wishful thinking.

Limitations and Important Caveats

This calculator uses the classical frequentist Z-test for proportions, which works well for conversion rate metrics on moderate to large sample sizes. It is not appropriate for continuous metrics like average order value or revenue per visitor (which require a t-test with variance estimation) or for situations with very low expected conversion rates (fewer than ~5 events per cell), where Fisher's exact test is more appropriate. The calculator also assumes a 50/50 traffic split between control and variation; unequal splits require modified formulas. Finally, statistical significance does not mean practical significance — a 0.1% lift that is highly significant may not be worth shipping if it adds engineering complexity. Always combine statistical analysis with business judgment.

Formulas

n = (Z_α/2 + Z_β)² × [p₁(1−p₁) + p₂(1−p₂)] ÷ (p₁−p₂)²

Calculates the minimum number of visitors per variation. p₁ is the baseline conversion rate, p₂ is the expected rate with MDE applied, Z_α/2 is the critical value for the significance level (1.96 for 95% two-tailed), and Z_β is the critical value for statistical power (0.842 for 80%).

Days = (n × number of variants) ÷ (daily visitors × fraction of traffic in the test)

Estimates how many days the test must run to collect enough data. Divides the total required sample by the effective daily traffic entering the experiment.
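The duration estimate can be sketched as follows; the variant multiplier and traffic-fraction parameter mirror the percent-of-traffic control mentioned earlier, and the 14-day floor reflects the best-practice minimum (parameter names are my own):

```python
import math

def estimated_duration_days(n_per_variation, variants=2,
                            daily_visitors=1000, traffic_fraction=1.0,
                            minimum_days=14):
    """Days needed to collect the full sample, floored at 14 days."""
    total_needed = n_per_variation * variants
    effective_daily = daily_visitors * traffic_fraction
    raw_days = math.ceil(total_needed / effective_daily)
    return max(raw_days, minimum_days)

# 11,856 per variation, 2 variants, 2,000 visitors/day, all traffic in test:
# 23,712 / 2,000 = 11.9 days, raised to the 14-day minimum
print(estimated_duration_days(11856, variants=2, daily_visitors=2000))  # → 14
```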

Z = (p̂_B − p̂_A) ÷ √[p̂(1−p̂) × (1/n_A + 1/n_B)]

Computes the test statistic for comparing two observed proportions, where p̂ is the pooled proportion: p̂ = (x_A + x_B) ÷ (n_A + n_B). A Z-score beyond the critical value indicates statistical significance.

CI = (p̂_B − p̂_A) ± Z_α/2 × √[p̂_A(1−p̂_A)/n_A + p̂_B(1−p̂_B)/n_B]

Gives the range within which the true difference in conversion rates likely falls. If the interval excludes zero, the result is statistically significant at the chosen confidence level.
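Both post-test statistics — the pooled-SE Z-score with its two-tailed p-value, and the unpooled-SE confidence interval — can be sketched together (illustrative code, not the calculator's own; the inputs are the 420/12,000 vs 480/12,000 figures used in the worked examples):

```python
import math
from statistics import NormalDist

def analyze_ab_test(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Two-tailed two-proportions Z-test plus a CI on the difference."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # The confidence interval uses the unpooled standard error
    se_ci = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return z, p_value, (diff - z_crit * se_ci, diff + z_crit * se_ci)

z, p, (lo, hi) = analyze_ab_test(420, 12000, 480, 12000)
print(round(z, 3), round(p, 4))  # z ≈ 2.04, p ≈ 0.041
print(lo > 0)                    # interval excludes zero → significant
```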

Reference Tables

Z Critical Values for Common Significance and Power Levels

Parameter                 | Level        | Z Value
Significance (two-tailed) | 90% (α=0.10) | 1.645
Significance (two-tailed) | 95% (α=0.05) | 1.960
Significance (two-tailed) | 98% (α=0.02) | 2.326
Significance (two-tailed) | 99% (α=0.01) | 2.576
Power                     | 70% (β=0.30) | 0.524
Power                     | 80% (β=0.20) | 0.842
Power                     | 85% (β=0.15) | 1.036
Power                     | 90% (β=0.10) | 1.282

Industry Benchmark Baseline Conversion Rates

Industry / Metric      | Typical Baseline CR | Recommended MDE
E-commerce Purchase    | 2–4%                | 10–20% relative
E-commerce Add-to-Cart | 8–12%               | 5–10% relative
SaaS Free Trial Signup | 3–7%                | 10–15% relative
Lead Generation Form   | 5–15%               | 10–20% relative
Email Click-Through    | 2–5%                | 15–25% relative
Landing Page CTA Click | 10–25%              | 5–10% relative

Worked Examples

E-commerce Purchase Rate Test

Baseline 3.5%, 20% relative MDE (0.7 pp absolute), 95% confidence, 80% power:

1. p₁ = 0.035, p₂ = 0.035 + 0.007 = 0.042
2. Z_α/2 = 1.96, Z_β = 0.842
3. Numerator = (1.96 + 0.842)² × [0.035×0.965 + 0.042×0.958] = 7.849 × [0.03378 + 0.04024] = 7.849 × 0.07402 = 0.5810
4. Denominator = (0.042 − 0.035)² = 0.007² = 0.000049
5. n = 0.5810 ÷ 0.000049 = 11,857.1, rounded up to 11,858 visitors per variation

SaaS Free Trial with Higher Power

Baseline 5%, 15% relative MDE, 95% confidence, 90% power:

1. p₁ = 0.05, p₂ = 0.05 × 1.15 = 0.0575
2. Z_α/2 = 1.96, Z_β = 1.282
3. Numerator = (1.96 + 1.282)² × [0.05×0.95 + 0.0575×0.9425] = 10.511 × [0.0475 + 0.05419] = 10.511 × 0.10169 = 1.0689
4. Denominator = (0.0575 − 0.05)² = 0.0075² = 0.00005625
5. n = 1.0689 ÷ 0.00005625 = 19,002.7, rounded up to 19,003 visitors per variation

Post-Test Significance Analysis

Control: 420 conversions from 12,000 visitors; variation: 480 conversions from 12,000 visitors:

1. p̂_A = 420 ÷ 12,000 = 0.035, p̂_B = 480 ÷ 12,000 = 0.04
2. Pooled p̂ = (420 + 480) ÷ (12,000 + 12,000) = 900 ÷ 24,000 = 0.0375
3. SE = √[0.0375 × 0.9625 × (1/12,000 + 1/12,000)] = √[0.03609 × 0.0001667] = √0.000006015 = 0.002453
4. Z = (0.04 − 0.035) ÷ 0.002453 = 0.005 ÷ 0.002453 = 2.038
5. p-value (two-tailed) = 2 × (1 − Φ(2.038)) ≈ 2 × 0.0208 = 0.0416 — below 0.05, so the lift is statistically significant at the 95% level

How to Use the A/B Test Sample Size Calculator

1. Enter Your Baseline Conversion Rate

Type in your current control conversion rate — the percentage of visitors who currently complete the goal action (e.g., purchase, sign-up, click). If you're not sure, check your analytics platform for the last 30–90 days of data on the specific metric you plan to test.

2. Set Your Minimum Detectable Effect

Enter the smallest lift that would be worth acting on. Choose 'Relative' if you want to express MDE as a percentage of the baseline (e.g., 20% relative on a 5% baseline = detecting lifts of 1 percentage point or more). Choose 'Absolute' if you want to specify the raw percentage-point lift directly. A larger MDE means fewer visitors are required, but you risk missing smaller real improvements.

3. Choose Significance Level and Power

Select your significance level (95% is standard, corresponding to a 5% false-positive risk) and statistical power (80% is the industry default, giving an 80% chance of detecting a real effect of the stated size). Higher confidence or higher power both increase the required sample size. For critical business decisions, consider 99% confidence. Add your daily visitor count to get an estimated test duration.

4. Analyze Post-Test Results

After your test concludes, switch to the 'Analyze Results' tab and enter the actual visitor and conversion counts for your control and variation. The calculator outputs a p-value, Z-score, confidence interval, and observed power — and flags any Sample Ratio Mismatch that might indicate a bucketing problem. Add your average order value and monthly visitors to see projected revenue impact.

Frequently Asked Questions

What is the difference between absolute and relative MDE?

Relative MDE expresses the minimum detectable effect as a percentage of your baseline. For example, a 20% relative MDE on a 5% baseline means you want to detect at least a 1 percentage-point lift (5% × 20% = 1 pp). Absolute MDE specifies the percentage-point change directly — a 1% absolute MDE also means detecting a 1 pp lift, but the framing is independent of your baseline. For low-baseline metrics (e.g., 1% purchase rate), relative MDE is usually more intuitive. For high-baseline metrics (e.g., 50% button click rate), absolute MDE makes the target clearer. Either way, a smaller MDE requires a substantially larger sample size.
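The conversion between the two framings is one multiplication or division; a tiny illustrative helper:

```python
def relative_to_absolute_mde(baseline, relative_mde):
    """E.g. a 20% relative MDE on a 5% baseline is a 1 pp absolute lift."""
    return baseline * relative_mde

def absolute_to_relative_mde(baseline, absolute_mde):
    """E.g. a 1 pp absolute MDE on a 50% baseline is a 2% relative lift."""
    return absolute_mde / baseline

print(round(relative_to_absolute_mde(0.05, 0.20), 4))  # → 0.01 (1 pp)
print(round(absolute_to_relative_mde(0.50, 0.01), 4))  # → 0.02 (2% relative)
```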

When should I use a one-tailed vs. two-tailed test?

A two-tailed test checks whether the variation is different from the control in either direction — better or worse. A one-tailed test checks only whether the variation is better (or only whether it is worse). One-tailed tests require a smaller sample size because the critical region is on one side of the distribution, but they carry a risk: if your variation unexpectedly hurts conversions, a one-tailed test configured to detect improvements will fail to flag the harm. Best practice for most A/B tests is two-tailed, especially for changes that could plausibly have negative effects. Reserve one-tailed tests for situations where a negative result is genuinely impossible or irrelevant.

Why does the calculator recommend running for at least 14 days?

User behavior varies significantly by day of week. Monday morning shoppers behave differently from Saturday afternoon browsers. If your test runs for only a few days, it may capture a disproportionate slice of one type of user, producing results that don't generalize. Running for at least two full weeks (14 days) ensures every day of the week is represented roughly twice in each variant's data. Some practitioners extend this to 3–4 weeks for B2B products with long consideration cycles. The 14-day minimum is a heuristic recommended by major A/B testing platforms including AB Tasty, Optimizely, and VWO.

What is a Sample Ratio Mismatch (SRM) and why does it matter?

A Sample Ratio Mismatch occurs when the traffic split between your control and variation doesn't match the intended ratio. For a 50/50 test, you'd expect roughly equal visitors in each bucket. If one bucket has significantly more visitors than the other (as detected by a chi-squared goodness-of-fit test), it suggests your bucketing mechanism is flawed — perhaps due to bot traffic, caching issues, redirect chains, or bugs in your experimentation SDK. An SRM invalidates your test results because the two groups are no longer comparable. Always check for SRM before concluding any test, especially if you're running on custom infrastructure rather than a dedicated A/B testing platform.
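For a two-variant test with an intended 50/50 split, the chi-squared goodness-of-fit check reduces to a few lines of arithmetic. This sketch uses a p < 0.001 flag threshold — a common SRM convention, not something this page specifies — and exploits the fact that with one degree of freedom the chi-squared p-value can be taken straight from the normal CDF, so the standard library suffices:

```python
import math
from statistics import NormalDist

def srm_check(n_a, n_b, alpha=0.001):
    """Chi-squared goodness-of-fit test against an intended 50/50 split.

    With 1 degree of freedom, the square root of the chi-squared statistic
    is a standard normal |Z|, so the p-value comes from the normal CDF.
    """
    expected = (n_a + n_b) / 2
    chi2 = ((n_a - expected) ** 2 + (n_b - expected) ** 2) / expected
    p_value = 2 * (1 - NormalDist().cdf(math.sqrt(chi2)))
    return p_value, p_value < alpha   # True → likely SRM, distrust the test

print(srm_check(11856, 11902)[1])  # small wobble → False, no SRM
print(srm_check(11856, 12600)[1])  # large imbalance → True, SRM flagged
```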

What is statistical power, and why is 80% the standard?

Statistical power (1−β) is the probability that your test will detect a real effect of the specified size, given that the effect truly exists. At 80% power, if your variation truly improves conversion by your MDE, you have an 80% chance of observing a statistically significant result and a 20% chance of missing it (Type II error). The 80% convention comes from Jacob Cohen's Statistical Power Analysis for the Behavioral Sciences (2nd ed., 1988), where he suggested that a 4:1 ratio of Type II to Type I errors (β/α = 4 at α=0.05) was a reasonable default. Higher power (90%, 95%) increases confidence in negative results but substantially increases required sample size. Most CRO practitioners accept 80% as a practical trade-off between rigor and test speed.

Can I use this calculator for metrics other than conversion rates?

This calculator is designed for binomial (0/1) conversion rate metrics — did the user convert or not? It uses the two-proportions Z-test formula, which is appropriate when your metric is a proportion. For continuous metrics like average order value, revenue per visitor, or session duration, you need a two-sample t-test with variance estimation, which requires knowing the standard deviation of your metric — information this calculator does not collect. For those use cases, consider using an experiment platform that supports continuous metrics, or calculate the required sample size using a t-test power calculator with your metric's historical standard deviation.

Related Tools

Cpk Calculator

Percentage Calculator

Average Calculator

© 2026 EverydayTools.io. All rights reserved.