Understanding A/B testing through Monte Carlo simulation

Steve Howard

7/28/2014

Overview

Background

Conversion optimization

Overview of testing

A decision problem

  1. When do I stop the test?
  2. Which group do I choose to keep?

Goal: to maximize final conversion rate

(But this could vary greatly with context)

Pitfalls

Frequentist hypothesis testing

  1. Choose a significance level
    • "If baseline and treatment are identical, we'll keep the baseline X% of the time"
  2. Choose a power level
    • "If the treatment is better, we'll choose it Y% of the time"
  3. Choose an effect size for the desired power:
    • "If the treatment is better by Z%, we'll choose it Y% of the time"
  4. Compute sample size and wait for that many visitors
  5. Plug visitor and conversion counts into hypothesis test
  6. Make decision based on p-value of test
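
Steps 4 to 6 above can be sketched in a few lines. This is a minimal illustration, not the talk's own code: it assumes a two-sided two-proportion z-test and the usual normal-approximation sample size formula, and the function names and example rates are hypothetical.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def sample_size_per_group(baseline_rate, treatment_rate,
                          significance=0.05, power=0.8):
    """Visitors needed in each group (normal approximation, two-sided test)."""
    z_alpha = STANDARD_NORMAL.inv_cdf(1 - significance / 2)
    z_power = STANDARD_NORMAL.inv_cdf(power)
    pooled = (baseline_rate + treatment_rate) / 2
    numerator = (
        z_alpha * math.sqrt(2 * pooled * (1 - pooled))
        + z_power * math.sqrt(
            baseline_rate * (1 - baseline_rate)
            + treatment_rate * (1 - treatment_rate)
        )
    ) ** 2
    return math.ceil(numerator / (treatment_rate - baseline_rate) ** 2)


def two_proportion_p_value(baseline_conversions, baseline_visitors,
                           treatment_conversions, treatment_visitors):
    """Two-sided p-value for the difference between two conversion rates."""
    baseline_rate = baseline_conversions / baseline_visitors
    treatment_rate = treatment_conversions / treatment_visitors
    pooled = ((baseline_conversions + treatment_conversions)
              / (baseline_visitors + treatment_visitors))
    standard_error = math.sqrt(
        pooled * (1 - pooled) * (1 / baseline_visitors + 1 / treatment_visitors)
    )
    z = (treatment_rate - baseline_rate) / standard_error
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))


# Hypothetical numbers: 10% baseline, hoping to detect a lift to 12%.
visitors_needed = sample_size_per_group(0.10, 0.12)
p_value = two_proportion_p_value(100, 1000, 150, 1000)
keep_treatment = p_value < 0.05
```

The decision rule in the last line only compares the p-value to the chosen significance level; a real test would also check the direction of the difference.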

Frequentist example

A Bayesian approach

Bayesian example

Bayesian example, continued

Simulation

Imagine a sequence of experiments on a single page over one million total visitors.

  1. Start with a baseline page, conversion rate of 10%
  2. Come up with a variation page (treatment), conversion rate randomly drawn around baseline
  3. Consume visitors in both groups until a decision is made
  4. Keep chosen page as new baseline and repeat from step 2

Implementation cost

Consume 5,000 visitors during "implementation" of new treatment
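
The experiment loop and the implementation cost described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the talk: the decision rule (run a fixed number of visitors per group, then keep whichever page converted more) is a placeholder, and `spread`, `visitors_per_group`, and the function names are hypothetical.

```python
import random


def count_conversions(rng, rate, visitors):
    """Simulate `visitors` independent visits that convert with probability `rate`."""
    return sum(rng.random() < rate for _ in range(visitors))


def simulate(total_visitors=1_000_000, start_rate=0.10, spread=0.01,
             implementation_cost=5_000, visitors_per_group=10_000, seed=0):
    """Return (overall conversion rate, final baseline rate)."""
    rng = random.Random(seed)
    baseline = start_rate
    remaining = total_visitors
    conversions = 0
    while remaining > 0:
        # While the next treatment is being built, everyone sees the baseline.
        waiting = min(implementation_cost, remaining)
        conversions += count_conversions(rng, baseline, waiting)
        remaining -= waiting

        # Assumed: the treatment's true rate is normally distributed around
        # the baseline, clamped to a valid probability.
        treatment = min(1.0, max(0.0, rng.gauss(baseline, spread)))

        # Placeholder decision rule: a fixed number of visitors per group.
        group = min(visitors_per_group, remaining // 2)
        baseline_hits = count_conversions(rng, baseline, group)
        treatment_hits = count_conversions(rng, treatment, group)
        conversions += baseline_hits + treatment_hits
        remaining -= 2 * group

        if group < visitors_per_group:
            # Ran out of visitors before a decision; any leftover visitor
            # sees the current baseline.
            conversions += count_conversions(rng, baseline, remaining)
            remaining = 0
            break
        if treatment_hits > baseline_hits:
            baseline = treatment
    return conversions / total_visitors, baseline


overall_rate, final_baseline = simulate(total_visitors=50_000, seed=1)
```

Swapping the placeholder rule for a frequentist or Bayesian stopping rule is the point of the comparison; only the block between drawing the treatment and updating the baseline would change.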

Aside: technical tidbits

Mind your seeds
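
The slide gives only the title, so the following is one plausible reading rather than the speaker's stated point: give every simulation run its own explicitly seeded generator so that results are reproducible and runs do not silently share random state. The function name is hypothetical.

```python
import random


def run_simulation(seed, draws=5):
    """Use a private, explicitly seeded generator instead of the global one."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(draws)]


same_a = run_simulation(seed=42)
same_b = run_simulation(seed=42)   # identical seed, identical draws
other = run_simulation(seed=43)    # different seed, different draws
```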

Computation of posterior

import math

import numpy
# Assumed: log_beta is the log of the Beta function, as provided by SciPy.
from scipy.special import betaln as log_beta


def loop_posterior(baseline_alpha, baseline_beta,
                   treatment_alpha, treatment_beta):
    # P(treatment rate > baseline rate) for Beta posteriors, summed one term
    # at a time. treatment_alpha must be a positive integer.
    sum_result = 0
    for i in range(treatment_alpha):
        sum_result += math.exp(
            log_beta(baseline_alpha + i, treatment_beta + baseline_beta)
            - log_beta(1 + i, treatment_beta)
            - log_beta(baseline_alpha, baseline_beta)
        ) / (treatment_beta + i)
    return sum_result


def vectorized_posterior(baseline_alpha, baseline_beta,
                         treatment_alpha, treatment_beta):
    # Same sum, with every term computed at once as a NumPy array.
    i_values = numpy.arange(treatment_alpha)
    return numpy.sum(
        numpy.exp(
            log_beta(baseline_alpha + i_values, treatment_beta + baseline_beta)
            - log_beta(1 + i_values, treatment_beta)
            - log_beta(baseline_alpha, baseline_beta)
        ) / (treatment_beta + i_values)
    )

Vectorized version over 30x faster at large sample sizes

$ python benchmark_bayesian_posterior.py
   Samples       Slow       Fast      Ratio
        10    0.032us    0.033us       1.0x
       100    0.208us    0.041us       5.1x
      1000    2.006us    0.101us      19.8x
     10000   20.282us    0.601us      33.7x
    100000  206.210us    6.160us      33.5x
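
One way to sanity-check the posterior sum is to compare it with a direct Monte Carlo estimate. The sketch below uses only the standard library: it assumes `log_beta` is the log of the Beta function (written here with `math.lgamma`), reads the sum as the probability that the treatment's rate exceeds the baseline's, and uses hypothetical counts and function names.

```python
import math
import random


def log_beta(a, b):
    """Log of the Beta function, via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def closed_form_probability(baseline_alpha, baseline_beta,
                            treatment_alpha, treatment_beta):
    """Same sum as the loop version; treatment_alpha must be an integer."""
    total = 0.0
    for i in range(treatment_alpha):
        total += math.exp(
            log_beta(baseline_alpha + i, baseline_beta + treatment_beta)
            - log_beta(1 + i, treatment_beta)
            - log_beta(baseline_alpha, baseline_beta)
        ) / (treatment_beta + i)
    return total


def sampled_probability(baseline_alpha, baseline_beta,
                        treatment_alpha, treatment_beta,
                        draws=20_000, seed=0):
    """Estimate P(treatment > baseline) by sampling both Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        baseline_rate = rng.betavariate(baseline_alpha, baseline_beta)
        treatment_rate = rng.betavariate(treatment_alpha, treatment_beta)
        wins += treatment_rate > baseline_rate
    return wins / draws


# Hypothetical data with uniform Beta(1, 1) priors: 100 of 1,000 baseline
# visitors converted, 120 of 1,000 treatment visitors converted.
exact = closed_form_probability(101, 901, 121, 881)
approx = sampled_probability(101, 901, 121, 881)
```

The two numbers should agree to within sampling noise; a large gap would suggest a mistake in the sum or in how the posterior parameters were built from the counts.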

Results

Caveats

All results depend on simulation parameters

There are many other approaches to testing

Conclusions

Questions?

github.com/gostevehoward/absimulation