A/B testing has limitations. An A/B test can end in one of three outcomes:
- Control > Solution
- Control = Solution
- Control < Solution
From the above, A/B testing can tell you whether the solution is better or worse than the control, but it does not tell you by how much.
In an A/B test, the default assumption is that the solution and the control have the same win rate.
This default assumption is called the null hypothesis.
The goal of an A/B test is to disprove the null hypothesis: to gather enough evidence that the solution is genuinely better or worse than the control.
This is done using statistical significance.
Statistical significance keeps the risk that the observed win rate misrepresents the true win rate below a threshold. Some risk of error remains, but it is low enough to be acceptable.
Statistical significance is controlled by 2 parameters: the p-value and the statistical power.
P-value: the probability of seeing a difference at least as large as the observed one when there is no real difference. A p-value threshold of 5% means there is at most a 5% chance of falsely calling a positive or negative impact (Type 1 error).
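The p-value for a difference in win rates can be computed with a two-proportion z-test. A minimal sketch using only the standard library, with hypothetical conversion numbers (200/2000 for control, 250/2000 for the solution):

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# hypothetical example: control converts 200/2000, solution 250/2000
z, p = two_proportion_z_test(200, 2000, 250, 2000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Here the p-value comes out below 0.05, so at the conventional threshold the null hypothesis would be rejected.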
Statistical power: the probability of detecting a real difference when one exists. A false null is when there is a real difference between control and solution, but the data suggests they are the same. A statistical power of 80% means there is a 20% chance of falsely calling a null impact (Type 2 error).
Type 1 error: calling a false positive/negative.
Type 2 error: accepting the null hypothesis when one variation is actually different from the other.
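The Type 2 error rate can be seen directly by simulation: generate many experiments where the true rates really do differ and count how often the test fails to notice. A sketch with hypothetical rates (control 10%, solution 12%) and 1,000 users per arm:

```python
import random
from math import sqrt, erf

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference of two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pool * (1 - pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

random.seed(42)
n, trials, detected = 1000, 1000, 0
for _ in range(trials):
    # the null hypothesis is false here: true rates are 10% vs 12%
    conv_a = sum(random.random() < 0.10 for _ in range(n))
    conv_b = sum(random.random() < 0.12 for _ in range(n))
    if p_value(conv_a, n, conv_b, n) < 0.05:
        detected += 1

power = detected / trials               # empirical statistical power
print(f"empirical power: {power:.2f}")  # Type 2 error rate is 1 - power
```

With these numbers the empirical power lands well below 80%, showing that 1,000 users per arm is not enough to reliably detect a 2-point lift.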
The widely adopted convention is that the maximum p-value should be 0.05, or 5%.
False negative rate
For false nulls.
The false negative rate (the complement of statistical power, i.e. 1 − power) measures the probability of a false null (Type 2 error). The lower, the better.
The maximum false negative rate is usually 0.2, or 20%.
Evan Miller's A/B testing tools can be used to calculate the required sample size.
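The required sample size per arm can also be estimated with the standard normal-approximation formula for two proportions, which is the same style of calculation such calculators perform. A sketch with a hypothetical 10% baseline rate and a 2-point absolute minimum detectable effect:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size(p1, mde, alpha=0.05, power=0.80):
    """Per-arm sample size to detect an absolute lift `mde` over
    baseline rate `p1` with a two-sided test (normal approximation)."""
    p2 = p1 + mde
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)           # e.g. 0.84 for power = 0.80
    p_bar = (p1 + p2) / 2
    n = ((z_a * sqrt(2 * p_bar * (1 - p_bar))
          + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / mde ** 2
    return ceil(n)

# hypothetical inputs: 10% baseline, detect an absolute lift of 2 points
n = sample_size(0.10, 0.02)
print(f"users needed per arm: {n}")
```

Tightening any knob (lower alpha, higher power, or a smaller detectable effect) drives the required sample size up.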