(DRAFT: in progress)
This article is about efficient estimation using data from randomized experiments. By "efficient", I mean that we will employ available side information to reduce variance as much as possible, that is, to get more precise estimates with smaller confidence intervals. I'll discuss estimation of the average treatment effect, relative lift, heterogeneous treatment effects, quantile treatment effects, values of personalized policies, and more. Perhaps surprisingly, all of these problems can be tackled with a single approach, which will be the main focus of this article. This approach has different names in different literatures, but I'll refer to it as augmented inverse propensity weighting (AIPW), a common name in the statistics and causal inference literature and an apt name in the setting of randomized experiments. AIPW can appear bewildering at first sight, but I believe that, once one understands the "imputation principle" behind AIPW, it becomes "obvious" how to efficiently solve a variety of estimation problems.
Setup: mean estimation with missing data
Many problems in causal inference can be viewed as missing data problems, and, at its heart, AIPW is a method for estimation in the presence of missing data. As such, I'll present AIPW as a solution to the problem of mean estimation with missing data. This problem is simpler than what we usually encounter, but all the fundamentals will apply to the more realistic problems we tackle below.
As an example to illustrate the setup, suppose a ridesharing service is interested in offering riders a discount and would like to know what proportion of riders will use the discount if offered. We could simply offer a discount to all riders and measure how many use it. But that might be prohibitively expensive. Instead, we could offer a discount to a small random sample of riders. We might want the proportion of riders who receive a discount to vary by geography, for example to sample more riders in less densely populated areas.
Our statistical model is as follows:
- There is a (hypothetical, infinite) population of units, each represented by the triple $(X, Y, M)$. These might represent all possible riders on the service.
- $X$ is a vector of covariates, or side information about a unit observed prior to treatment. In our example, this might include geographical information as well as past usage, such as the number of rides in the past month.
- $Y$ is an outcome of interest. In our example, this is whether the rider would use a discount if offered one. In our model, this outcome exists for every rider, including those who do not actually receive a discount offer (in which case it is a counterfactual outcome).
- $M$ is a binary indicator for whether this unit's outcome is observed. In our example, $M = 1$ if we offer a discount to the rider.
- We are interested in estimating $\theta = \mathbb{E}[Y]$, this expectation being taken over the full population. In our example, this is the proportion of all riders who would use the discount if offered one.
- We'll perform estimation based on an i.i.d. sample of $n$ units indexed by $i = 1, \dots, n$. However, we don't observe the full triple $(X_i, Y_i, M_i)$ for all units in our sample. Rather, we only observe outcomes for units with $M_i = 1$; otherwise, the outcome is missing. That is, we observe i.i.d. triples $(X_i, M_i, M_i Y_i)$ for $i = 1, \dots, n$.
- The propensity score is $e(x) = \mathbb{P}(M = 1 \mid X = x)$. We'll assume this is known, as in a designed experiment, and bounded away from zero.
- In a simple experiment, $e(x)$ would be constant. That makes the problem much easier, though there's still a lot to say, and we'll return to this special case often. But we get a lot of useful mileage out of treating the general case.
- When $e(x)$ is unknown, we can estimate it. This is how we would analyze an observational study, "as if" it were a randomized experiment. All the algorithmic ideas below still apply, but things get much more complicated conceptually and analytically, and we won't discuss any of that here.
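To make the setup concrete, here is a small simulation sketch in Python (numpy). The data-generating process, feature names, and propensity formula are invented purely for illustration; the only structural requirements are that the propensity score is known and bounded away from zero, and that the outcome is observed only when the indicator is one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Covariates X: a population-density feature and past rides (both invented).
density = rng.uniform(0, 1, size=n)
past_rides = rng.poisson(3, size=n)
X = np.column_stack([density, past_rides])

# Known propensity score e(X): sample more heavily in less dense areas.
e = 0.05 + 0.10 * (1 - density)
M = rng.binomial(1, e)                    # 1 if the outcome is observed

# Full outcome Y (would the rider use the discount?), observed only when M == 1.
p_use = 1 / (1 + np.exp(-(-1.0 + 0.3 * past_rides - 0.5 * density)))
Y_full = rng.binomial(1, p_use)
Y = np.where(M == 1, Y_full, np.nan)      # missing outcomes stored as NaN

theta_true = p_use.mean()                 # approximately E[Y] in this simulation
```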
Identification
As things stand, there are two different distributions in play:
- the full distribution, that of $(X, Y, M)$; and
- the observed distribution, that of $(X, M, MY)$.
We have in hand an i.i.d. sample from the observed distribution, so it is clear that we can estimate any aspect of that distribution we like, given enough data. But our target estimand $\theta = \mathbb{E}[Y]$ is a functional of the full distribution, not the observed distribution, and it's not true in general that we can estimate it given data from the observed distribution, even in the limit of infinite data. We need an assumption to connect the observed distribution to the full distribution. This is the problem of identification.
To understand why identification assumptions are important, it's worth recognizing that without additional assumptions, identification may not be possible. As a counterexample, we describe two full distributions which have different values of the estimand but which both lead to the exact same observed distribution:
- Full distribution 1: $Y \sim \mathrm{Bernoulli}(1/2)$ and $M \sim \mathrm{Bernoulli}(1/2)$, with $Y$ and $M$ independent
- The true mean is $\mathbb{E}[Y] = 1/2$
- Observable distribution: $\mathbb{P}(M = 1) = 1/2$ and $\mathbb{P}(Y = 1 \mid M = 1) = 1/2$
- Full distribution 2: $Y \sim \mathrm{Bernoulli}(3/4)$, with $\mathbb{P}(M = 1 \mid Y = 1) = 1/3$ and $\mathbb{P}(M = 1 \mid Y = 0) = 1$
- The true mean is $\mathbb{E}[Y] = 3/4$
- Observable distribution: $\mathbb{P}(M = 1) = \tfrac{3}{4} \cdot \tfrac{1}{3} + \tfrac{1}{4} \cdot 1 = 1/2$ and $\mathbb{P}(Y = 1 \mid M = 1) = 1/2$ (by Bayes' rule)
The observable distribution is identical between both cases. Since the only data we have comes from the observable distribution, we cannot hope to reach different conclusions under the two full distributions, no matter how much data we observe and what algorithm we apply to it. We simply cannot distinguish these two states of the world, even though they have very different values of $\theta = \mathbb{E}[Y]$.
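Here is a quick Monte Carlo check of the counterexample, using the concrete numbers above (one possible instantiation of the two full distributions):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000

# Full distribution 1: Y ~ Bern(1/2), M ~ Bern(1/2), independent.
y1 = rng.binomial(1, 0.5, N)
m1 = rng.binomial(1, 0.5, N)

# Full distribution 2: Y ~ Bern(3/4); P(M=1|Y=1) = 1/3, P(M=1|Y=0) = 1.
y2 = rng.binomial(1, 0.75, N)
m2 = rng.binomial(1, np.where(y2 == 1, 1 / 3, 1.0))

for y, m in [(y1, m1), (y2, m2)]:
    print(f"E[Y] = {y.mean():.3f}   P(M=1) = {m.mean():.3f}   "
          f"P(Y=1 | M=1) = {y[m == 1].mean():.3f}")
# Both cases print P(M=1) ~ 0.5 and P(Y=1 | M=1) ~ 0.5, but E[Y] differs (0.5 vs 0.75).
```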
The identifying assumption we'll make is that $Y$ is independent of $M$ given $X$, typically called unconfoundedness in the causal inference literature, although it has other names. This assumption is extremely common and the basis of much research, though it is often unconvincing in practice unless some randomization has been done. In any randomized experiment where the randomization probability depends only on $X$, including when it's constant, unconfoundedness is assured by the design.
The unconfoundedness assumption implies $\mathbb{E}[Y \mid X, M = 1] = \mathbb{E}[Y \mid X]$, that is, the conditional mean outcome among units with observed outcomes is equal to the conditional mean outcome among all units. The left-hand expectation can be computed from the observed distribution alone, while the right-hand expectation requires the full distribution. This equality is the key to identification using unconfoundedness. Taking expectations of both sides over $X$, we can connect the observed distribution to the estimand $\theta$:

$$\mathbb{E}\big[\mathbb{E}[Y \mid X, M = 1]\big] = \mathbb{E}\big[\mathbb{E}[Y \mid X]\big] = \mathbb{E}[Y] = \theta. \tag{I}$$
Again, the leftmost expectation can be computed from the observed distribution alone, so this equation proves that we can estimate $\theta$ given enough data from the observed distribution. Note that "Full distribution 2" in the counterexample above violates unconfoundedness, as $M$ depends on $Y$ in a way that is not explained by $X$ (indeed, I have not included any covariates in the counterexample). This must be the case, as our identification equation proves that we cannot have a situation like the above: we cannot have two full distributions satisfying unconfoundedness with different values of $\theta$ which both lead to the same observed distribution.
With identification taken care of, we can turn to algorithms we might use to actually estimate $\theta$ from a finite sample, as well as how to estimate the variance of our estimator.
Inverse propensity weighting
Consider the special case when $e(x)$ is constant for all $x$, say equal to some $e$. This corresponds to a simple random experiment in which we observe the outcome for a proportion $e$ of all eligible units. In this case, the sample with $M_i = 1$ is fully representative of the population (on average) and the most obvious (and probably most common) estimator is the sample mean of observed outcomes,

$$\hat\theta_{\mathrm{S}} = \frac{1}{n_1} \sum_{i : M_i = 1} Y_i, \tag{S}$$
where $n_1 = \sum_{i=1}^n M_i$ is the number of observed outcomes. This is a special case of the inverse propensity weighting (IPW) estimator,

$$\hat\theta_{\mathrm{IPW}} = \frac{1}{n} \sum_{i=1}^n \frac{M_i Y_i}{e(X_i)}.$$
The IPW estimator is motivated by the identity

$$\mathbb{E}\left[\frac{M Y}{e(X)}\right] = \mathbb{E}\left[\frac{e(X)\,\mathbb{E}[Y \mid X, M = 1]}{e(X)}\right] = \mathbb{E}\big[\mathbb{E}[Y \mid X, M = 1]\big],$$
which, together with the identifying equation (I), shows that $\hat\theta_{\mathrm{IPW}}$ is unbiased. This owes to the fact that we know the propensity score $e(\cdot)$ exactly.
The sample mean (S) is a special case of $\hat\theta_{\mathrm{IPW}}$ when $e(X_i) = n_1 / n$. That is the propensity score if we actually randomized by selecting $n_1$ units uniformly at random from the full sample of $n$ units, so that $n_1$ is fixed by design. But if we randomized by i.i.d. coin flips with probability $e$, for example, then $n_1 / n$ will differ from $e$, in general, and using $n_1 / n$ may appear to be cheating. It may seem that we ought to be using the "true" IPW estimator

$$\frac{1}{n} \sum_{i=1}^n \frac{M_i Y_i}{e},$$
which uses the expectation $e$ of $n_1 / n$ in place of the realized value in (S). But it's not hard to see that, in this case, (S) is also unbiased. In fact, (S) is not only marginally unbiased (as the "true" IPW estimator is) but also unbiased conditional on $n_1$ (which the "true" IPW estimator is not). So we can think of (S) as an IPW estimator conditional on $n_1$, and in doing "conditional" IPW, we obtain conditional unbiasedness and thus lower variance for free. In general, as a matter of practice, if the propensity score is constant on strata, we ought to condition on the number of observed outcomes within each stratum, which amounts to setting the propensity score equal to the realized fraction of observed outcomes in each stratum. There is more to say about this, but it's a topic for another article.
The last thing we need to discuss is variance estimation, that is, estimating $\mathrm{Var}(\hat\theta_{\mathrm{IPW}})$. Fortunately, this is easy: $\hat\theta_{\mathrm{IPW}}$ is the mean of $n$ i.i.d. quantities, so we estimate its variance by

$$\widehat{\mathrm{Var}}(\hat\theta_{\mathrm{IPW}}) = \frac{1}{n^2} \sum_{i=1}^n \left(\frac{M_i Y_i}{e(X_i)} - \hat\theta_{\mathrm{IPW}}\right)^2. \tag{V}$$
(We are ignoring the distinction between $n$ and $n - 1$ in the sample variance; we typically would use IPW and AIPW with samples large enough that the distinction is immaterial.) This is just as in a one-sample t-test, and we would construct confidence intervals in the usual way.
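Continuing the simulated example from the setup section, here is a minimal sketch of the IPW estimate, the variance estimate (V), and a 95% confidence interval:

```python
import numpy as np

# IPW terms M_i * Y_i / e(X_i); a unit with a missing outcome contributes zero.
ipw_terms = np.where(M == 1, Y, 0.0) / e
theta_ipw = ipw_terms.mean()

# Variance estimate (V): treat the IPW terms as an i.i.d. sample.
var_ipw = ipw_terms.var(ddof=1) / n
ci_ipw = theta_ipw + np.array([-1.0, 1.0]) * 1.96 * np.sqrt(var_ipw)
```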
In short, the IPW estimator is unbiased and lends itself to simple variance estimation and confidence interval construction. But it only uses the subset of units with observed outcomes, completely discarding the rest of the units in our sample. As we will see, those units often contain useful information, and discarding them can lead to unnecessarily large variance.
Outcome modeling
An alternative idea to estimate $\theta$ comes directly from the identification equation (I), which we recall as

$$\theta = \mathbb{E}\big[\mathbb{E}[Y \mid X, M = 1]\big]. \tag{I}$$
Interpret the RHS from the inside out:
- The inner expectation is the conditional mean function of $Y$ given $X$ among units with observed outcomes, $\mu(x) = \mathbb{E}[Y \mid X = x, M = 1]$. This is a function of $x$.
- The outer expectation averages this function over the distribution of $X$, yielding $\theta = \mathbb{E}[\mu(X)]$.
It's natural, then, to estimate $\theta$ in two analogous stages:
- Fit a regression function $\hat\mu(x)$ to estimate $\mu(x)$, by regressing $Y$ on $X$ among units with $M = 1$. Here "regressing" could mean something simple like OLS, or could mean any complicated, high-dimensional, regularized machine learning model you like.
- Then estimate $\theta$ by the plug-in average $\hat\theta_{\mathrm{OM}} = \frac{1}{n} \sum_{i=1}^n \hat\mu(X_i)$.
This is the outcome modeling estimator. While the regression that produces $\hat\mu$ only uses the subset of units with observed outcomes, the final estimator averages over the full sample, so no data is wasted and the outcome modeling estimator tends to have low variance.
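In code, continuing the simulated example, the outcome modeling estimator might look like this (the choice of gradient boosting is arbitrary; any regression method works):

```python
from sklearn.ensemble import GradientBoostingRegressor

# Fit the outcome model only on units with observed outcomes...
obs = M == 1
outcome_model = GradientBoostingRegressor().fit(X[obs], Y[obs])

# ...but average its predictions over the full sample.
theta_om = outcome_model.predict(X).mean()
```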
Unfortunately, we no longer have any guarantees about bias. If our fitted outcome model $\hat\mu$ is a poor approximation to the true conditional mean function $\mu$, whether due to model misspecification, regularization bias, "double-dipping bias" (when we overfit our model to the sample at hand, then compute $\hat\theta_{\mathrm{OM}}$ from the same sample), or anything else, then our estimator could be biased. There's no obvious way to quantify or control this bias with a pure outcome modeling approach.
A more subtle issue concerns variance estimation. The variance of $\hat\theta_{\mathrm{OM}}$ will be driven by the variance of the fitted model itself, and estimating the variance due to model fitting may be difficult. We either need to use a simple model like OLS, perform some complicated variance calculations, or use a general method like the bootstrap, though even that could be very computationally inconvenient compared to a simple formula like the IPW variance estimate (V).
The outcome modeling approach is sort of the opposite of IPW: it yields a low-variance estimator but may be biased, and variance estimation may be challenging.
Augmented inverse propensity weighting
Augmented inverse propensity weighting combines the IPW and outcome modeling approaches. In so doing, we retain all the advantages of IPW—unbiasedness and easy variance estimation—while potentially achieving lower variance.
We first fit an outcome model as above. Actually, we typically fit the outcome model two or more times. We'll notate this as a separate fitted model for each unit, $\hat\mu^{(-i)}$, though in practice we never actually fit $n$ separate models. The important requirement is that $\hat\mu^{(-i)}$ is independent of $(X_i, Y_i, M_i)$. That is, the fitted model we use for unit $i$ was fit only on other units. We typically implement this via cross-fitting: divide the data into $K$ random folds, and for all units in fold $k$, use an outcome model fit on the other folds.
Then we estimate $\theta$ by the following formula:

$$\hat\theta_{\mathrm{AIPW}} = \frac{1}{n} \sum_{i=1}^n \left[ \hat\mu^{(-i)}(X_i) + \frac{M_i}{e(X_i)} \big(Y_i - \hat\mu^{(-i)}(X_i)\big) \right].$$
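Here is a sketch of the cross-fitted AIPW estimator, again continuing the simulated example (the number of folds and the regression model are arbitrary choices):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

# Cross-fitted predictions: units in each fold are predicted by a model
# fit only on the observed outcomes in the other folds.
mu_hat = np.empty(n)
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    obs_train = train_idx[M[train_idx] == 1]
    model = GradientBoostingRegressor().fit(X[obs_train], Y[obs_train])
    mu_hat[test_idx] = model.predict(X[test_idx])

# AIPW pseudo-outcomes and point estimate.
residual = np.where(M == 1, Y - mu_hat, 0.0)
pseudo = mu_hat + M / e * residual
theta_aipw = pseudo.mean()
```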
Unbiasedness of AIPW
The first thing to know about $\hat\theta_{\mathrm{AIPW}}$ is that it is unbiased regardless of the outcome model you use. This fact may seem bewildering at first, but it is fundamental to the success of AIPW. To see that this is true, take the expectation of the $i$-th summand conditional on $X_i$ and $\hat\mu^{(-i)}$, and use the facts that (a) $\hat\mu^{(-i)}$ is independent of $(X_i, Y_i, M_i)$ because of cross-fitting and (b) $Y_i$ and $M_i$ are conditionally independent given $X_i$ by our identifying assumption:

$$\mathbb{E}\left[\hat\mu^{(-i)}(X_i) + \frac{M_i}{e(X_i)}\big(Y_i - \hat\mu^{(-i)}(X_i)\big) \,\Big|\, X_i,\, \hat\mu^{(-i)}\right] = \hat\mu^{(-i)}(X_i) + \frac{\mathbb{E}[M_i \mid X_i]}{e(X_i)}\big(\mathbb{E}[Y_i \mid X_i] - \hat\mu^{(-i)}(X_i)\big). \tag{U}$$
Since $\mathbb{E}[M_i \mid X_i] = e(X_i)$, the $\hat\mu^{(-i)}(X_i)$ terms cancel in (U), and we are left only with $\mathbb{E}[Y_i \mid X_i]$. Taking an outer expectation over $X_i$, we see that each summand in $\hat\theta_{\mathrm{AIPW}}$ is unbiased for $\theta$ conditional on its outcome model. This fact immediately implies $\hat\theta_{\mathrm{AIPW}}$ is unbiased, but it also turns out to be critical for variance derivations, as discussed below.
Connecting AIPW with IPW and outcome modeling
To get intuition for $\hat\theta_{\mathrm{AIPW}}$, we can relate it to $\hat\theta_{\mathrm{IPW}}$ and $\hat\theta_{\mathrm{OM}}$.
- AIPW as outcome modeling plus a bias correction
We can write the AIPW estimator as an outcome model estimator plus an IPW bias correction term:

$$\hat\theta_{\mathrm{AIPW}} = \frac{1}{n}\sum_{i=1}^n \hat\mu^{(-i)}(X_i) + \frac{1}{n}\sum_{i=1}^n \frac{M_i}{e(X_i)}\big(Y_i - \hat\mu^{(-i)}(X_i)\big).$$
When $e(X_i) = n_1 / n$, the bias correction is simply the average residual among units with observed outcomes:

$$\hat\theta_{\mathrm{AIPW}} = \frac{1}{n}\sum_{i=1}^n \hat\mu^{(-i)}(X_i) + \frac{1}{n_1}\sum_{i : M_i = 1} \big(Y_i - \hat\mu^{(-i)}(X_i)\big).$$
Our knowledge of the propensity score enabled unbiased estimation in the IPW strategy, and here we use it to debias $\hat\theta_{\mathrm{OM}}$. Recall that $\hat\theta_{\mathrm{OM}}$ has low variance, but the bias correction term is an average only over the subset of units with observed outcomes, so it may have larger variance. If the model is good, however, then the residuals will be small, and the bias correction term will also have small variance, resulting in an overall variance reduction.
For outcome models with zero average residual, e.g., in-sample OLS with an intercept, $\hat\theta_{\mathrm{AIPW}} = \hat\theta_{\mathrm{OM}}$, and hence the outcome modeling estimator is unbiased. Another way to view AIPW, then, is that we start with an outcome modeling estimator and simply adjust the model's intercept so that the average residual is zero within our sample of units with observed outcomes. I prefer thinking of it as AIPW, however.
- AIPW as IPW plus an imbalance correction
Alternatively, we can write the AIPW estimator as an IPW estimator plus a mean-zero correction for covariate imbalance:

$$\hat\theta_{\mathrm{AIPW}} = \hat\theta_{\mathrm{IPW}} + \frac{1}{n}\sum_{i=1}^n \left(1 - \frac{M_i}{e(X_i)}\right)\hat\mu^{(-i)}(X_i).$$
This form makes it obvious that IPW is a special case of AIPW with $\hat\mu^{(-i)} \equiv 0$. When $e(X_i) = n_1 / n$, this reduces to

$$\hat\theta_{\mathrm{AIPW}} = \hat\theta_{\mathrm{IPW}} + \left(\frac{1}{n}\sum_{i=1}^n \hat\mu^{(-i)}(X_i) - \frac{1}{n_1}\sum_{i : M_i = 1} \hat\mu^{(-i)}(X_i)\right).$$
The correction term involves the difference between the average prediction over the full sample and the average prediction over the subset with observed outcomes. Since the predictions are a function of $X$ alone, this difference reflects imbalance in covariates between the full sample and the subset with observed outcomes. Since both averages are unbiased for $\mathbb{E}[\hat\mu(X)]$, the correction term is mean zero. It corrects for chance imbalance in covariates and hence reduces the variance of the estimator.
Variance estimation
We have seen that AIPW inherits the IPW estimator's unbiasedness. The other advantage of IPW over outcome modeling was that variance estimation is straightforward, as given in equation (V), since the IPW estimator is a mean of i.i.d. quantities. Since AIPW involves a random outcome model, the terms of the mean are no longer i.i.d., so the argument isn't quite as straightforward. However, it turns out to still be true: we can estimate the variance of the AIPW estimator by

$$\widehat{\mathrm{Var}}(\hat\theta_{\mathrm{AIPW}}) = \frac{1}{n^2}\sum_{i=1}^n \left( \hat\mu^{(-i)}(X_i) + \frac{M_i}{e(X_i)}\big(Y_i - \hat\mu^{(-i)}(X_i)\big) - \hat\theta_{\mathrm{AIPW}} \right)^2,$$
just as if it were a mean of i.i.d. observations, ignoring the randomness in the fitted outcome models. This is extremely convenient and again follows from the key fact that each summand in $\hat\theta_{\mathrm{AIPW}}$ is unbiased conditional on its fitted outcome model. The proof is not trivial, but follows, for example, from Theorem 5.1 of Chernozhukov et al. (2016).
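Continuing the sketch above, the variance estimate and confidence interval are computed exactly as for a sample mean of the pseudo-outcomes:

```python
# Treat the AIPW pseudo-outcomes as an i.i.d. sample for variance estimation.
var_aipw = pseudo.var(ddof=1) / n
ci_aipw = theta_aipw + np.array([-1.0, 1.0]) * 1.96 * np.sqrt(var_aipw)
```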
AIPW isn't leaving any variance reduction on the table
We have shown that AIPW inherits the advantages of IPW while providing for possible variance reduction. The amount of variance reduction depends on the explanatory power of the covariates and the quality of the outcome model. We've been a bit vague on this, but there is a powerful fact which says, in essence, that AIPW provides the best possible variance reduction given the covariates and outcome model it has to work with. More specifically, if we had access to the true mean outcome function $\mu$ (or, more realistically, if our outcome models converged to the truth in an appropriate sense), then the variance of $\hat\theta_{\mathrm{AIPW}}$ would be the smallest possible (under mild assumptions, among a large class of estimators which reasonably includes all estimators of interest). This is a slightly obscure theoretical fact (semiparametric efficiency), but it gives us some reassurance that we're not leaving anything obviously on the table by using AIPW. Of course, this all depends on choosing good covariates and a good outcome model, but the message is reassuring: no matter what outcome model we fit, AIPW remains unbiased, and if we happen to pick a good one, we achieve the best possible variance reduction.
The AIPW imputation principle
As a segue to applications, we present one final way to think about AIPW: we are imputing a "pseudo-outcome" for every unit,

$$\tilde Y_i = \hat\mu^{(-i)}(X_i) + \frac{M_i}{e(X_i)}\big(Y_i - \hat\mu^{(-i)}(X_i)\big),$$
and then acting as if these pseudo-outcomes are the real outcomes, but now observed for the entire sample. In particular, we forget that we're in a situation with covariates and missing outcomes. We simply estimate the population mean by the sample mean of pseudo-outcomes, and construct a CI with the usual sample variance and CLT argument. The magic of AIPW is that we can do this without worrying about bias or (co)variance introduced by our imputation model.
Note that we impute pseudo-outcomes for all units, not just those with missing outcomes: for units with observed outcomes, we replace the observed outcome with the pseudo-outcome, which is generally different from the observed outcome. It would not work, in general, to only impute pseudo-outcomes for those units with missing outcomes. We need to replace the true observed outcomes with pseudo-outcomes as well to ensure the bias correction works.
The imputation principle doesn't add much to what we've already discussed for the problem of mean estimation with missing data, but it will be useful in the applications below. In particular, causal inference problems are usually hard because of missing potential outcomes. If we could observe all potential outcomes for all units, then many estimation problems would have fairly obvious solutions. The imputation principle lets us fill in the full potential outcomes table, then pretend as if these are true potential outcomes and thus solve various problems in the "obvious" way. And the resulting estimators are typically efficient.
Applications
We depart now from the notation introduced at the start and switch to a standard potential outcomes model. Units in the population (the full, unobserved distribution) are given by the quadruple $(X, W, Y(0), Y(1))$, where $W \in \{0, 1\}$ is a binary indicator for whether we assign the unit to "control" or "treatment", and $Y(0)$ and $Y(1)$ are potential outcomes under these two assignments. We observe the triple $(X, W, Y)$ with $Y = Y(W)$. Unconfoundedness still holds in the sense that $(Y(0), Y(1))$ are independent of $W$ conditional on $X$.
Average treatment effect estimation
Our first task is to estimate the average treatment effect $\tau = \mathbb{E}[Y(1) - Y(0)]$. Looking at each of the estimands $\mathbb{E}[Y(1)]$ and $\mathbb{E}[Y(0)]$ separately, we are back to mean estimation with missing data, so we can apply AIPW as described above:
- Take $M = W$ and outcome $Y(1)$ to estimate $\mathbb{E}[Y(1)]$, with propensity score $e(x) = \mathbb{P}(W = 1 \mid X = x)$.
- Take $M = 1 - W$ and outcome $Y(0)$ to estimate $\mathbb{E}[Y(0)]$, with propensity score $1 - e(x)$.
Write $\hat\mu_0^{(-i)}$ and $\hat\mu_1^{(-i)}$ for unit $i$'s estimated control and treatment outcome models, respectively, and for $i = 1, \dots, n$ define

$$\tilde Y_i(1) = \hat\mu_1^{(-i)}(X_i) + \frac{W_i}{e(X_i)}\big(Y_i - \hat\mu_1^{(-i)}(X_i)\big), \qquad \tilde Y_i(0) = \hat\mu_0^{(-i)}(X_i) + \frac{1 - W_i}{1 - e(X_i)}\big(Y_i - \hat\mu_0^{(-i)}(X_i)\big).$$
Then the AIPW estimator of $\tau$ is $\hat\tau = \frac{1}{n}\sum_{i=1}^n \big(\tilde Y_i(1) - \tilde Y_i(0)\big)$. Following the imputation principle above, we wish we could observe the full potential outcomes table:

| unit | $Y_i(0)$ | $Y_i(1)$ |
|---|---|---|
| 1 | $Y_1(0)$ | $Y_1(1)$ |
| 2 | $Y_2(0)$ | $Y_2(1)$ |
| 3 | $Y_3(0)$ | $Y_3(1)$ |
| ⋮ | ⋮ | ⋮ |
But we only observe one entry in each row:

| unit | $Y_i(0)$ | $Y_i(1)$ |
|---|---|---|
| 1 | ? | $Y_1$ |
| 2 | $Y_2$ | ? |
| 3 | ? | $Y_3$ |
| ⋮ | ⋮ | ⋮ |
So we replace it with the imputed table

| unit | $\tilde Y_i(0)$ | $\tilde Y_i(1)$ |
|---|---|---|
| 1 | $\tilde Y_1(0)$ | $\tilde Y_1(1)$ |
| 2 | $\tilde Y_2(0)$ | $\tilde Y_2(1)$ |
| 3 | $\tilde Y_3(0)$ | $\tilde Y_3(1)$ |
| ⋮ | ⋮ | ⋮ |
and then act as if this were the full table of potential outcomes and do the "obvious" thing: we have "observed" every unit's (pseudo-)individual treatment effect $\tilde Y_i(1) - \tilde Y_i(0)$, so our ATE estimator is the mean $\hat\tau$ of these differences and its estimated variance is the sample variance of the per-unit differences divided by $n$. This ATE estimator is unbiased and can have greatly reduced variance when the outcome models are effective.
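As a sketch, here is a small function that computes the ATE estimate and its standard error from cross-fitted outcome-model predictions (the function and argument names are illustrative, not standard):

```python
import numpy as np

def aipw_ate(W, Y, e, mu0_hat, mu1_hat):
    """AIPW estimate of the ATE and its standard error.

    W: treatment indicators; e: known propensities P(W=1|X) for each unit;
    mu0_hat, mu1_hat: cross-fitted predictions of E[Y(0)|X] and E[Y(1)|X],
    each unit's prediction coming from a model fit without that unit.
    """
    y1 = mu1_hat + W / e * (Y - mu1_hat)               # imputed Y_i(1)
    y0 = mu0_hat + (1 - W) / (1 - e) * (Y - mu0_hat)   # imputed Y_i(0)
    tau_i = y1 - y0                                    # pseudo individual effects
    return tau_i.mean(), tau_i.std(ddof=1) / np.sqrt(len(tau_i))
```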
Relative lift estimation
Continuing from above, if we instead want to estimate the relative lift $\mathbb{E}[Y(1)] / \mathbb{E}[Y(0)] - 1$ for a nonnegative outcome, we can use

$$\widehat{\mathrm{lift}} = \frac{\bar{\tilde Y}(1)}{\bar{\tilde Y}(0)} - 1,$$
defining $\bar{\tilde Y}(w) = \frac{1}{n}\sum_{i=1}^n \tilde Y_i(w)$ for $w \in \{0, 1\}$. Variance estimation is not trivial, because the numerator and denominator are not independent, but it's not too hard. A first-order Taylor expansion shows that

$$\widehat{\mathrm{lift}} - \mathrm{lift} \approx \frac{1}{n}\sum_{i=1}^n \frac{\big(\tilde Y_i(1) - \mathbb{E}[Y(1)]\big) - \frac{\mathbb{E}[Y(1)]}{\mathbb{E}[Y(0)]}\big(\tilde Y_i(0) - \mathbb{E}[Y(0)]\big)}{\mathbb{E}[Y(0)]}.$$
This is basically saying that the lift estimator "looks like" an average, asymptotically. So the (asymptotically valid) estimated variance of the lift estimator is

$$\frac{1}{n^2}\sum_{i=1}^n \left( \frac{\tilde Y_i(1) - \big(\widehat{\mathrm{lift}} + 1\big)\,\tilde Y_i(0)}{\bar{\tilde Y}(0)} \right)^2.$$
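A sketch of the lift estimate and its delta-method standard error, given the two vectors of pseudo-outcomes (names are illustrative):

```python
import numpy as np

def aipw_lift(y1_pseudo, y0_pseudo):
    """Relative lift E[Y(1)]/E[Y(0)] - 1 with a delta-method standard error."""
    mu1, mu0 = y1_pseudo.mean(), y0_pseudo.mean()
    lift = mu1 / mu0 - 1
    # Plug-in influence terms from the first-order Taylor expansion of the ratio.
    g = (y1_pseudo - (mu1 / mu0) * y0_pseudo) / mu0
    return lift, g.std(ddof=1) / np.sqrt(len(g))
```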
CATE estimation with DR-learner
When we're interested in treatment effect heterogeneity, a natural estimand is the CATE function $\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]$. If we could observe the full potential outcomes table in our sample, it would be reasonable to regress the individual treatment effects $Y_i(1) - Y_i(0)$ against $X_i$ using a regression method of your choice (OLS, boosting, NNs, ...). The imputation view suggests we instead regress the imputed differences $\tilde Y_i(1) - \tilde Y_i(0)$ against $X_i$. This method is called the "DR-learner" in Kennedy (2022), where it is shown to have good properties.
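A sketch of the DR-learner, with the regression method an arbitrary choice:

```python
from sklearn.ensemble import GradientBoostingRegressor

def dr_learner(X, y1_pseudo, y0_pseudo):
    """Regress imputed individual treatment effects on covariates to estimate the CATE."""
    return GradientBoostingRegressor().fit(X, y1_pseudo - y0_pseudo)

# Usage: cate_model = dr_learner(X, y1_pseudo, y0_pseudo); cate_model.predict(x_new)
```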
Policy evaluation and optimization
One reason we might be interested in the CATE function is because we want to treat only some users, for example those users with positive treatment effects. In that case, it may be better to sidestep the problem of CATE estimation and directly solve the problem of policy optimization. That is, we want to find a policy $\pi$, a mapping from covariates $x$ to an action $\pi(x) \in \{0, 1\}$, to optimize average outcomes in some way. In our model, this is equivalent to a two-armed stochastic contextual bandit problem.
The value of a policy is $V(\pi) = \mathbb{E}[Y(\pi(X))]$, and the problem of estimating this value given a policy is called policy evaluation. If we had access to the full potential outcomes table for our sample, we could estimate $V(\pi)$ by $\frac{1}{n}\sum_{i=1}^n Y_i(\pi(X_i))$. Again, the imputation view suggests we use

$$\hat V(\pi) = \frac{1}{n}\sum_{i=1}^n \tilde Y_i(\pi(X_i)),$$
and that’s a good idea. The ATE example can be viewed as evaluation of two trivial policies, one that treats everyone and the other that treats no one. This view generalizes easily to more than two actions.
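A sketch of policy evaluation from the imputed potential outcomes (the policy here is any function mapping a covariate vector to 0 or 1):

```python
import numpy as np

def aipw_policy_value(pi, X, y1_pseudo, y0_pseudo):
    """Estimated value of policy pi, with a standard error, using imputed outcomes."""
    actions = np.array([pi(x) for x in X])
    v = np.where(actions == 1, y1_pseudo, y0_pseudo)
    return v.mean(), v.std(ddof=1) / np.sqrt(len(v))
```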
Now that we have a good estimator of the value of any policy—a function of the policy—it's natural to treat it as an optimization objective and maximize it as a method of policy optimization, akin to empirical risk minimization but using AIPW to deal with missingness. The objective is equivalent to cost-sensitive multiclass classification. In the binary treatment case (two actions), the objective is equivalent to weighted binary classification, with labels $Z_i = \mathbb{1}\{\tilde Y_i(1) > \tilde Y_i(0)\}$ and weights $|\tilde Y_i(1) - \tilde Y_i(0)|$. To see this, write the objective as

$$\hat V(\pi) = \frac{1}{n}\sum_{i=1}^n \tilde Y_i(\pi(X_i)) = \frac{1}{n}\sum_{i=1}^n \max\big(\tilde Y_i(0), \tilde Y_i(1)\big) - \frac{1}{n}\sum_{i=1}^n \big|\tilde Y_i(1) - \tilde Y_i(0)\big|\,\mathbb{1}\{\pi(X_i) \neq Z_i\}.$$
The minimizer of the last term is the same as the maximizer of $\hat V(\pi)$, since the first term is constant with respect to the policy. The last term has the form of a weighted classification loss: it is like zero-one loss, but instead of paying loss one for each misclassification, we pay loss $|\tilde Y_i(1) - \tilde Y_i(0)|$.
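A sketch of the reduction to weighted classification; any classifier that accepts sample weights can stand in for the gradient boosting model here:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def aipw_policy_learner(X, y1_pseudo, y0_pseudo):
    """Learn a treatment policy by weighted binary classification.

    Label: which imputed potential outcome is larger. Weight: the absolute imputed
    effect. Minimizing the weighted misclassification loss approximately maximizes
    the estimated policy value."""
    labels = (y1_pseudo > y0_pseudo).astype(int)
    weights = np.abs(y1_pseudo - y0_pseudo)
    return GradientBoostingClassifier().fit(X, labels, sample_weight=weights)

# The returned classifier's predict(...) is the learned policy: treat when it returns 1.
```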
M-estimators
Everything above has focused on mean estimation in one form or another, but the AIPW idea can be applied more generally to M-estimators. Return to the missing data model and suppose we wish to estimate a parameter $\theta$ defined by a population estimating equation:

$$\mathbb{E}\big[\psi(Y; \theta)\big] = 0.$$
An M-estimator would ordinarily be defined by the sample estimating equation

$$\frac{1}{n}\sum_{i=1}^n \psi(Y_i; \hat\theta) = 0,$$
but, as before, we only observe $(X_i, M_i, M_i Y_i)$ rather than observing $Y_i$ for all units, so we cannot even construct the sample estimating equation. But we can apply AIPW to the estimating equation itself. That is, we find $\hat\theta$ to solve

$$\frac{1}{n}\sum_{i=1}^n \left[ \hat g^{(-i)}(X_i; \theta) + \frac{M_i}{e(X_i)}\Big(\psi(Y_i; \theta) - \hat g^{(-i)}(X_i; \theta)\Big) \right] = 0,$$

where $\hat g^{(-i)}(x; \theta)$ is a cross-fitted model of $\mathbb{E}[\psi(Y; \theta) \mid X = x, M = 1]$, fit on units with observed outcomes.
It's easier to read if we define, with slight abuse of notation, $\psi_i(\theta) = \psi(Y_i; \theta)$ and $\hat g_i(\theta) = \hat g^{(-i)}(X_i; \theta)$, so that the AIPW estimating equation becomes

$$\frac{1}{n}\sum_{i=1}^n \left[ \hat g_i(\theta) + \frac{M_i}{e(X_i)}\big(\psi_i(\theta) - \hat g_i(\theta)\big) \right] = 0.$$
In principle, to obtain confidence intervals, we can construct a confidence region for the LHS for each value of $\theta$, say $C(\theta)$, and then the set $\{\theta : 0 \in C(\theta)\}$ gives a confidence region for the true parameter. In practice this may be computationally difficult.
As an example, assume $Y$ comes from a continuous distribution and set $\psi(y; \theta) = \mathbb{1}\{y \le \theta\} - q$ for some $q \in (0, 1)$, which estimates the $q$-th quantile of $Y$. Let's also assume $e(x) = e$ is constant for simplicity. The AIPW estimating equation becomes

$$\frac{1}{n}\sum_{i=1}^n \left[ \hat F_i(\theta) + \frac{M_i}{e}\Big(\mathbb{1}\{Y_i \le \theta\} - \hat F_i(\theta)\Big) \right] = q,$$

where $\hat F_i(\theta)$ denotes a cross-fitted estimate of the conditional CDF $\mathbb{P}(Y \le \theta \mid X = X_i, M = 1)$.
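A sketch of solving this estimating equation numerically; here `cond_cdf` stands in for a cross-fitted conditional CDF model, which is an assumption of this sketch rather than something specified above:

```python
import numpy as np
from scipy.optimize import brentq

def aipw_quantile(Y, M, e, cond_cdf, q):
    """Solve the AIPW estimating equation for the q-th quantile of Y.

    cond_cdf(theta) should return, for each unit, a cross-fitted estimate of
    P(Y <= theta | X, M = 1); e is the (constant) propensity score.
    """
    def estimating_eq(theta):
        F = cond_cdf(theta)
        y_filled = np.where(M == 1, Y, np.inf)   # missing outcomes never count as <= theta
        ind = (y_filled <= theta).astype(float)
        return (F + M / e * (ind - F)).mean() - q

    # Root-find between the smallest and largest observed outcomes
    # (assumes the estimating function changes sign on this interval).
    return brentq(estimating_eq, np.nanmin(Y), np.nanmax(Y))
```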
A few references
- Stefan Wager’s STATS 361 notes are really nice and cover AIPW in Chapter 3 in the more general setting of observational studies.
- AIPW is usually attributed to Robins, Rotnitsky & Zhao (1994). See bibliographic notes in chapter 3 of Wager’s notes for more early references.
- Chernozhukov et al (2016) did a lot to popularize the use of cross-fitting together with AIPW-style estimators, though that paper is much more general.
- Jin & Ba (2022) focus on the variance reduction for ATE estimation in randomized experiments, including for "ratio metrics" such as click-through rates, and give some very clean theorems.
- Dudík, Langford & Li (2011) is a great source on policy evaluation with AIPW (which they call the doubly robust method). There are more recent and more general sources, but Dudík et al is practical and easy to read.
- Kennedy (2022) analyzes the "DR-learner" for CATE estimation. That estimator has been proposed in other, earlier references, but Kennedy gives stronger theoretical results.
- Angelopoulos et al (2023) discusses variance reduction for M-estimators. They operate in a totally different context, but the problem is really the same.