Steve Howard
7/28/2014
(But this could vary greatly with context)
Imagine a sequence of experiments on a single page over one million total visitors.
Consume 5,000 visitors during "implementation" of the new treatment
import math

from scipy.special import betaln as log_beta  # one plausible log_beta (assumed; not shown in the original)

def loop_posterior(baseline_alpha, baseline_beta,
                   treatment_alpha, treatment_beta):
    # Sum the closed-form series for P(treatment rate > baseline rate),
    # one term at a time in pure Python.
    sum_result = 0
    for i in xrange(treatment_alpha):
        sum_result += math.exp(
            log_beta(baseline_alpha + i, treatment_beta + baseline_beta)
            - log_beta(1 + i, treatment_beta)
            - log_beta(baseline_alpha, baseline_beta)
        ) / (treatment_beta + i)
    return sum_result
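For example — with uniform Beta(1, 1) priors and made-up visitor counts (both assumptions for illustration only, and still assuming log_beta is the log of the Beta function as above) — the posterior probability that the treatment's conversion rate exceeds the baseline's could be computed like this:

# Hypothetical counts; with Beta(1, 1) priors, alpha = 1 + conversions
# and beta = 1 + non-conversions.
baseline_conversions, baseline_visitors = 200, 10000
treatment_conversions, treatment_visitors = 230, 10000

probability = loop_posterior(
    1 + baseline_conversions,
    1 + baseline_visitors - baseline_conversions,
    1 + treatment_conversions,
    1 + treatment_visitors - treatment_conversions)
print('P(treatment beats baseline) = %.3f' % probability)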
def vectorized_posterior(baseline_alpha, baseline_beta,
treatment_alpha, treatment_beta):
i_values = numpy.arange(treatment_alpha)
return numpy.sum(
numpy.exp(
log_beta(baseline_alpha + i_values, treatment_beta + baseline_beta)
- log_beta(1 + i_values, treatment_beta)
- log_beta(baseline_alpha, baseline_beta)
) / (treatment_beta + i_values)
)
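The benchmark script itself isn't reproduced here; a minimal sketch of what a harness like benchmark_bayesian_posterior.py might do — timing both functions above with timeit, using arbitrary posterior parameters — is:

import timeit

# Assumes loop_posterior and vectorized_posterior (defined above) are in scope.
for samples in (10, 100, 1000, 10000, 100000):
    # "samples" stands in for treatment_alpha, which sets how many terms
    # the sum has; the other parameters are arbitrary here.
    args = (samples, samples, samples, samples)
    slow = timeit.timeit(lambda: loop_posterior(*args), number=10) / 10
    fast = timeit.timeit(lambda: vectorized_posterior(*args), number=10) / 10
    print('%7d  slow=%.6fs  fast=%.6fs  ratio=%.1fx'
          % (samples, slow, fast, slow / fast))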
$ python benchmark_bayesian_posterior.py
Samples       Slow      Fast  Ratio
     10    0.032us   0.033us   1.0x
    100    0.208us   0.041us   5.1x
   1000    2.006us   0.101us  19.8x
  10000   20.282us   0.601us  33.7x
 100000  206.210us   6.160us  33.5x