Steve Howard

7/28/2014

- Overview of A/B testing
- Two approaches
- Frequentist hypothesis testing
- A Bayesian approach

- Monte Carlo simulation of A/B testing
- Results of simulation for various approaches

- Signup page gets a steady stream of visitors
- Find a version of the page that maximizes **conversion rate** (proportion of visitors that sign up)
- Will do a sequence of many experiments one after another

- Two groups, **treatment** vs **baseline**
- **Randomize** each visitor into one group
- Run both groups over **same time period**
- At some point, **choose one** to keep
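The randomization step can be as simple as a coin flip per visitor; a minimal sketch (the group names and 50/50 split are assumptions, not from the talk):

```python
import random

rng = random.Random(0)

def assign_group():
    # Randomize each visitor into one of the two groups with equal probability
    return "treatment" if rng.random() < 0.5 else "baseline"

groups = [assign_group() for _ in range(10_000)]
```

With a seeded generator the split is reproducible, which helps when re-running an analysis.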

- When do I stop the test?
- Which group do I choose to keep?

(But this could vary greatly with context)

- Bias and confounding *(not discussed here)*
- Might choose a bad treatment that did well by chance
- Might turn down a good treatment that did poorly by chance
- Opportunity cost of running long tests
- Simultaneous tests can be tricky

- Choose a significance level
- "If baseline and treatment are identical, we'll keep the baseline X% of the time"

- Choose a power level
- "If the treatment is better, we'll choose it Y% of the time"

- Choose an effect size for the desired power:
- "If the treatment is better by Z%, we'll choose it Y% of the time"

- Compute sample size and wait for that many visitors
- Plug visitor and conversion counts into hypothesis test
- Make decision based on p-value of test

- Sample size calculation
- Test at 95% significance
- Want 80% power to detect a 10% relative lift (this is our effect size)
- 15% baseline conversion rate
- Then we need **9,257** visitors in each group
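The sample size above can be reproduced with the standard normal-approximation formula for a two-sided two-proportion test; a stdlib-only sketch (the function name is mine):

```python
import math
from statistics import NormalDist

def sample_size(p_base, relative_lift, alpha=0.05, power=0.80):
    # Normal-approximation sample size per group for a two-sided
    # two-proportion z-test at significance alpha and the given power
    p_treat = p_base * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_base + p_treat) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p_base * (1 - p_base)
                                      + p_treat * (1 - p_treat))) ** 2
    return math.ceil(numerator / (p_treat - p_base) ** 2)

print(sample_size(0.15, 0.10))  # → 9257 per group
```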

- Suppose we observe 1,350 conversions in baseline, 1,425 conversions in treatment
- Chi-squared test gives a p-value of 0.13
- Insufficient evidence that treatment is better
- Decide to keep the baseline
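That p-value can be checked with a 2×2 chi-squared test (1 degree of freedom, Yates continuity correction); a stdlib-only sketch:

```python
import math
from statistics import NormalDist

def chi2_yates_pvalue(conv_a, n_a, conv_b, n_b):
    # 2x2 chi-squared test of independence with Yates continuity correction
    table = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
    total = n_a + n_b
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]
    chi2 = 0.0
    for i, n_row in enumerate((n_a, n_b)):
        for j, n_col in enumerate(col_totals):
            expected = n_row * n_col / total
            chi2 += (abs(table[i][j] - expected) - 0.5) ** 2 / expected
    # With 1 degree of freedom, P(X > chi2) = 2 * (1 - Phi(sqrt(chi2)))
    return 2 * (1 - NormalDist().cdf(math.sqrt(chi2)))

print(round(chi2_yates_pvalue(1350, 9257, 1425, 9257), 2))  # → 0.13
```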

- Directly compute (posterior) "probability" treatment is better than baseline
- Chris Stucchio's Bayesian decision rule:
- Choose a **threshold of caring**: "If baseline and treatment conversion rates differ by no more than X, I don't care which one I choose"
- At any point, compute the **expected loss** from choosing the current winner and being wrong
- If expected loss is less than threshold of caring, stop the test

- Evan Miller's closed-form probability formula for the above

- Choose threshold of caring: 1% relative lift
- At 10% baseline conversion rate, threshold of caring is 0.1% loss
- Suppose we have 10 / 100 baseline conversions, 12 / 100 treatment conversions
- Expected loss is about 1%, keep running

- Suppose we later have 100 / 1,000 baseline conversions, 120 / 1,000 treatment conversions
- Expected loss is about 0.05%, choose treatment

- Suppose we instead reached 10,000 / 100,000 conversions in both groups
- Expected loss is about 0.05%, choose either group
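The expected-loss figures in this example can be reproduced by Monte Carlo sampling from the two Beta posteriors; a sketch, assuming uniform Beta(1, 1) priors (the closed-form computation is shown later):

```python
import random

def expected_loss(conv_win, n_win, conv_lose, n_lose,
                  samples=100_000, seed=0):
    # E[max(p_lose - p_win, 0)]: the conversion rate we give up if we
    # keep the current winner but the other group is actually better
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        p_win = rng.betavariate(1 + conv_win, 1 + n_win - conv_win)
        p_lose = rng.betavariate(1 + conv_lose, 1 + n_lose - conv_lose)
        total += max(p_lose - p_win, 0.0)
    return total / samples

# 12/100 treatment vs 10/100 baseline: loss of picking treatment is about 1%
print(expected_loss(12, 100, 10, 100))
# 120/1000 vs 100/1000: about 0.05%, below the threshold of caring
print(expected_loss(120, 1000, 100, 1000))
```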

- Bayesian test will always end eventually

Imagine a sequence of experiments on a single page over one million total visitors.

1. Start with a baseline page, conversion rate of 10%
2. Come up with a variation page (treatment), conversion rate randomly drawn around baseline
3. Consume visitors in both groups until a decision is made
4. Keep chosen page as new baseline and repeat from step 2
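The steps above might be sketched as follows. This is a minimal stand-in, not the talk's actual simulation: the decision rule here is a deliberately naive placeholder (run to a fixed sample size, keep whichever group converted more), and the 1% spread of treatment rates is an assumption.

```python
import random

TOTAL_VISITORS = 1_000_000
IMPLEMENTATION_COST = 5_000   # visitors consumed while building each treatment
GROUP_SIZE = 9_257            # placeholder fixed-sample-size decision rule

def simulate(seed=0):
    rng = random.Random(seed)
    baseline_rate = 0.10
    visitors_used = 0
    while visitors_used < TOTAL_VISITORS:
        # Step 2: treatment rate drawn around the current baseline
        treatment_rate = min(max(rng.gauss(baseline_rate, 0.01), 0.0), 1.0)
        visitors_used += IMPLEMENTATION_COST
        # Step 3: consume visitors in both groups until a decision is made
        baseline_conv = sum(rng.random() < baseline_rate
                            for _ in range(GROUP_SIZE))
        treatment_conv = sum(rng.random() < treatment_rate
                             for _ in range(GROUP_SIZE))
        visitors_used += 2 * GROUP_SIZE
        # Step 4: keep the chosen page as the new baseline and repeat
        if treatment_conv > baseline_conv:
            baseline_rate = treatment_rate
    return baseline_rate

print(simulate())
```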

Consume 5,000 visitors during "implementation" of new treatment

- Set them
- Log them
- Careful when forking
- Consider injecting RNGs
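Injecting RNG instances, per the last bullet, might look like this sketch: each component receives its own explicitly seeded generator, so results are reproducible and the streams stay independent even if the program later forks or adds components.

```python
import random

def run_group(rng, rate, n):
    # Conversions among n visitors, using an injected RNG instance
    # instead of the shared module-level generator
    return sum(rng.random() < rate for _ in range(n))

# Separate, logged seeds per group: reruns reproduce the same streams
baseline_rng = random.Random(42)
treatment_rng = random.Random(43)

print(run_group(baseline_rng, 0.15, 1000))
print(run_group(treatment_rng, 0.15, 1000))
```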

```
import math

from scipy.special import betaln as log_beta  # log of the Beta function

def loop_posterior(baseline_alpha, baseline_beta,
                   treatment_alpha, treatment_beta):
    # Evan Miller's closed-form probability that the treatment's
    # conversion rate beats the baseline's, given Beta posteriors
    sum_result = 0
    for i in range(treatment_alpha):
        sum_result += math.exp(
            log_beta(baseline_alpha + i, treatment_beta + baseline_beta)
            - log_beta(1 + i, treatment_beta)
            - log_beta(baseline_alpha, baseline_beta)
        ) / (treatment_beta + i)
    return sum_result
```

```
import numpy

from scipy.special import betaln as log_beta  # log of the Beta function

def vectorized_posterior(baseline_alpha, baseline_beta,
                         treatment_alpha, treatment_beta):
    # Same sum as loop_posterior, but over a numpy array of i values
    i_values = numpy.arange(treatment_alpha)
    return numpy.sum(
        numpy.exp(
            log_beta(baseline_alpha + i_values, treatment_beta + baseline_beta)
            - log_beta(1 + i_values, treatment_beta)
            - log_beta(baseline_alpha, baseline_beta)
        ) / (treatment_beta + i_values)
    )
```

```
$ python benchmark_bayesian_posterior.py
Samples       Slow      Fast   Ratio
     10    0.032us   0.033us    1.0x
    100    0.208us   0.041us    5.1x
   1000    2.006us   0.101us   19.8x
  10000   20.282us   0.601us   33.7x
 100000  206.210us   6.160us   33.5x
```

- Should treatment rate distribution be stationary around baseline rate?
- Shape, location, scale of treatment rate distribution
- Implementation cost
- Additional costs of adopting new treatment
- Etc etc etc

- Frequentist
- Tests other than Chi-squared
- Sequential testing

- Bayesian
- Different (informative) prior
- Different loss function
- Other decision rules

- Bandit methods
- Etc etc etc

- Common significance levels may be too conservative
- Bayesian approach minimizes loss with shorter tests
- It pays to give this stuff some thought!