17 Experimental : causal

An inquiry is causal if it involves a comparison of counterfactual states of the world and a data strategy is experimental if it involves explicit assignment of units to treatment conditions. Experimental designs for causal inference combine these two elements. The designs in this section aim to estimate causal effects and the procedure for doing so involves actively allocating treatments.

Many experimental designs for causal inference in the social sciences take advantage of researcher control over the assignment of treatments to assign treatments at random. In the archetypal two-arm randomized trial, a group of \(N\) subjects is recruited, \(m\) of them are chosen at random to receive treatment, and the remaining \(N-m\) do not receive treatment and serve as controls. The inquiry is the average treatment effect; the answer strategy is the difference-in-means estimator. The strength of the design can be appreciated by analogy to random sampling. The \(m\) outcomes in the treatment group represent a random sample from the treated potential outcomes among all \(N\) subjects, so the sample mean in the treatment group is a good estimator of the true average treated potential outcome; an analogous claim holds for the control group.

The randomization of treatments to estimate average causal effects is a relatively recent human invention. While glimmers of the idea appeared earlier, it wasn’t until at least the 1920s that explicit randomization appeared in agricultural science, medicine, education, and political science (Jamison 2019). Only a few generations of scientists have had access to this tool. Sometimes critics of experiments will charge “you can’t randomize [important causal variable].” There are of course practical constraints on what treatments researchers can control, be they ethical, financial, or otherwise. We think the main constraint is researcher creativity. The scientific history of randomized experiments is short – just because it hasn’t been randomized doesn’t mean it can’t be. (By the same token, just because it can be randomized doesn’t mean that it should be.)

Randomized experiments are rightly praised for their desirable inferential properties, but of course they can go wrong in many ways that designers of experiments should anticipate and minimize. These problems include problems in the data strategy (randomization implementation failures, excludability violations, noncompliance, attrition, and interference between units), problems in the answer strategy (conditioning on post-treatment variables, failure to account for clustering, \(p\)-hacking), and even problems in the inquiry (estimator-inquiry mismatches). Of course all these problems apply a fortiori to nonexperimental studies, but they are important to emphasize for experimental studies since they are often characterized as being “unbiased” without qualification.

The designs in this chapter proceed from the simplest experimental design – the two-arm trial – up through very complex designs like the randomized saturation design. The chapter can profitably be read alongside Gerber and Green (2012).

17.1 Two-arm randomized experiments

We declare a canonical two-arm trial, motivate key diagnosands for assessing the quality of a design, use diagnosis and redesign to explore the properties of two-arm trials, and discuss key risks to inference. This entry includes code for a “designer” which lets you quickly design and redesign two-arm trials.

All two-arm randomized trials have in common that subjects are randomly assigned to one of two conditions. Canonically, the two conditions include one treatment condition and one control condition. Some two-arm trials eschew the pure control condition in favor of a placebo control condition, or even a second treatment condition. The uniting feature of all these designs is that the model includes two and only two potential outcomes for each unit and that the data strategy randomly assigns which of these potential outcomes will be revealed.

A key choice in the design of two-arm trials is the random assignment procedure. Will we use simple (coin flip, or Bernoulli) random assignment or will we use complete random assignment? Will the randomization be blocked or clustered? Will we “restrict” the randomization so that only randomizations that generate acceptable levels of balance on pre-treatment characteristics are permitted? We will explore the implications of some of these choices in the coming sections, but for the moment, the main point is that saying “treatments were assigned at random” is insufficient. We need to describe the randomization procedure in detail in order to know how to analyze the resulting experiment. See Section 8.1.2 for a description of many different random assignment procedures.

In this chapter, we’ll consider the canonical two-arm trial design described in Gerber and Green (2012). The canonical design conducts complete random assignment in a fixed population, then uses difference-in-means to estimate the average treatment effect. We’ll now unpack this shorthand into the components of M, I, D, and A.

The model specifies a fixed sample of \(N\) subjects. Here we aren’t imagining that we are sampling from a larger population first. We have in mind a fixed set of units among whom we will conduct our experiment. That is, we are conducting “finite sample inference.” Under the model, each unit is endowed with two latent potential outcomes: a treated potential outcome and an untreated potential outcome. The difference between them is the individual treatment effect. In the canonical design, we assume that potential outcomes are “stable,” in the sense that all \(N\) units’ potential outcomes are defined with respect to the same treatment and that units’ potential outcomes do not depend on the treatment status of other units. This assumption is often referred to as the “stable unit treatment value assumption,” or SUTVA (Rubin 1980).

The potential outcomes themselves have a correlation of \(\rho\). If units with higher untreated potential outcomes also have higher treated potential outcomes, \(\rho\) will be positive. Developing intuitions about \(\rho\) is frustrated by the fundamental problem of causal inference. Since we can only ever observe a unit in its treated or untreated state (but not both), we can’t directly observe the correlation in potential outcomes. In order to make a guess about \(\rho\), we need to reason about treatment effect heterogeneity. If treatment effects are very similar from unit to unit, \(\rho\) will be close to 1. In the limiting case of exactly constant effects, \(\rho\) is equal to 1.

It is difficult (but not impossible) to imagine settings in which \(\rho\) is negative. So-called “Robin Hood” treatments generate negatively correlated potential outcomes, because they “take” from units with high untreated potential outcomes and “give” to units with low untreated outcomes. An example of a Robin Hood treatment might be a “surprising” partisan cue in the context of the American party system. Imagine that in the control condition, Democratic subjects tend to support a policy (\(Y_i(0)\) is high) and Republicans tend to oppose it (\(Y_i(0)\) is low). The treatment is a “surprise” endorsement of the policy by a Republican elite: treatment group Republicans will find themselves supporting the policy (\(Y_i(1)\) is high) whereas treatment group Democrats will infer from the Republican endorsement that the policy must not be a good one (\(Y_i(1)\) is low.) Treatments with extreme heterogeneity like this example could in principle cause negatively correlated potential outcomes.
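To build intuition for \(\rho\), the sketch below computes the correlation between potential outcomes under three stylized kinds of treatment effects. The six outcome values and the variable names are invented for illustration; they are not drawn from any declaration in this chapter.

```r
# Hypothetical untreated potential outcomes for six units (illustrative values)
Y_Z_0 <- c(1, 2, 3, 4, 5, 6)

# Constant effects: every unit gains 0.5, so rho is exactly 1
rho_constant <- cor(Y_Z_0, Y_Z_0 + 0.5)

# Heterogeneous effects: rho stays positive but falls below 1
rho_heterogeneous <- cor(Y_Z_0, Y_Z_0 + c(0, 2, -1, 3, 0, 1))

# "Robin Hood" effects: units with high Y(0) lose and units with low Y(0) gain,
# so the ranking of units flips and rho turns negative (here exactly -1)
rho_robin_hood <- cor(Y_Z_0, 7 - Y_Z_0)
```

In a real experiment we never see both columns at once, so these correlations can only be reasoned about, not computed from data.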

Because the model specifies a fixed sample, the inquiries are also defined at the sample level. The most common inquiry for a two-arm trial is the sample average treatment effect, or SATE. It is equal to the average difference between the treated and untreated potential outcomes for the units in the sample: \(\mathbb{E}_{i\in N}[Y_i(1) - Y_i(0)]\). Two-arm trials can also support other inquiries like the SATE among a subgroup (called a conditional average treatment effect, or CATE), but we’ll leave those inquiries to the side for the moment.

The data strategy uses complete random assignment in which exactly \(m\) of \(N\) units are assigned to treatment (\(Z_i = 1\)) and the remainder are assigned to control (\(Z_i = 0\)). We measure observed outcomes in such a way that we measure the treated potential outcome in the treatment group and untreated potential outcomes in the control group: \(Y_i = Y_i(1) \times Z_i + Y_i(0)\times(1 - Z_i)\). This expression is sometimes called the “switching equation” because of the way it “switches” which potential outcome is revealed by the treatment assignment. It also embeds the crucial assumption that indeed units reveal the potential outcomes they are assigned to. If the experiment encounters noncompliance, this assumption is violated. It’s also violated if “excludability” is violated, i.e., if something other than treatment moves with assignment to treatment. For example, if the treatment group is measured differently from the control group, excludability would be violated.
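The switching equation can be sketched in a few lines of base R. The potential outcome values, the seed, and the variable names below are assumptions made for illustration only.

```r
# Hypothetical potential outcomes for N = 6 units (illustrative values only)
Y_Z_0 <- c(1, 2, 3, 4, 5, 6)   # untreated potential outcomes
Y_Z_1 <- c(2, 2, 5, 4, 8, 7)   # treated potential outcomes

# Complete random assignment: exactly m of N units are treated
set.seed(1)
N <- length(Y_Z_0)
m <- 3
Z <- sample(rep(c(0, 1), times = c(N - m, m)))

# Switching equation: Z "switches" which potential outcome is revealed
Y <- Y_Z_1 * Z + Y_Z_0 * (1 - Z)
```

Only `Y` and `Z` would be available to the analyst; the two potential outcome columns exist here because we wrote down the model ourselves.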

The answer strategy is the difference-in-means estimator with so-called Neyman standard errors:

\[\begin{align} \widehat{DIM} &= \frac{\sum_1^mY_i}{m} - \frac{\sum_{m + 1}^NY_i}{N-m} \\ \widehat{se(DIM)} &= \sqrt{\frac{\widehat{Var}(Y_i|Z = 1)}{m} + \frac{\widehat{Var}(Y_i|Z = 0)}{N-m}} \end{align}\]

The estimated standard error can be used as an input for two other statistical procedures: null hypothesis significance testing via a \(t\)-test and the construction of a 95% confidence interval.
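As a rough sketch of this answer strategy, the base R code below computes the difference-in-means estimate and the Neyman standard error, which adds the two arm-level variance terms, by hand. The eight outcome values are invented, and the confidence interval uses a normal critical value as a simplifying assumption rather than the \(t\) distribution.

```r
# Hypothetical observed data from a two-arm trial (illustrative values only)
Y <- c(3.1, 4.0, 2.5, 5.2, 1.9, 2.2, 3.0, 1.4)
Z <- c(1, 1, 1, 1, 0, 0, 0, 0)

m <- sum(Z)      # number of treated units
N <- length(Z)   # total number of units

# Difference-in-means estimate
dim_hat <- mean(Y[Z == 1]) - mean(Y[Z == 0])

# Neyman standard error: the two arm-level variance terms are added
se_hat <- sqrt(var(Y[Z == 1]) / m + var(Y[Z == 0]) / (N - m))

# Approximate 95% confidence interval (normal critical value, an assumption)
ci <- dim_hat + c(-1, 1) * qnorm(0.975) * se_hat

# The same standard error underlies the unequal-variance (Welch) t-test
t_by_hand <- dim_hat / se_hat
t_welch <- unname(t.test(Y[Z == 1], Y[Z == 0])$statistic)
```

The last two lines are a consistency check: the hand-computed ratio should match the statistic from `t.test()`, whose default does not assume equal variances.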

The DAG corresponding to a two-arm randomized trial is very simple. An outcome \(Y\) is affected by unknown factors \(U\) and a treatment \(Z\). The measurement procedure \(Q\) affects \(Y\) in the sense that it measures a latent \(Y\) and records the measurement in a dataset. No arrows lead into \(Z\) because it is randomly assigned. No arrow leads from \(Z\) to \(Q\), because we assume no excludability violations wherein the treatment changes how units are measured. This simple DAG confirms that the average causal effect of \(Z\) on \(Y\) is nonparametrically identified because no back-door paths lead from \(Z\) to \(Y\).


Figure 17.1: DAG of a two-arm randomized experiment

17.1.1 Analytic design diagnosis

The statistical theory for the canonical two-arm design is very well explored, so analytic expressions for many diagnosands are available.

  1. Bias of the difference-in-means estimator. Equation 2.14 in Gerber and Green (2012) demonstrates that regardless of the values (except in degenerate cases) of \(m\), \(N\), or \(\rho\), the bias diagnosand is equal to zero. This is the “unbiasedness” property of many randomized experimental designs. On average, the difference-in-means estimates from the canonical design will equal the average treatment effect. As we’ll explore later in this chapter, not every experimental design yields unbiased estimates. Some (like blocked experiments with differential probabilities of assignments) require fix-ups in the answer strategy and others (like clustered experiments with unequal cluster sizes) require fix-ups in the data strategy.

  2. The true standard error of the difference-in-means estimator. Equation 3.4 in Gerber and Green (2012) provides an exact expression for the true standard error of the canonical two-arm trial.

\[ SE(DIM)= \sqrt{\frac{1}{N-1}\left\{\frac{m\mathbb{V}(Y_i(0))}{N-m} + \frac{(N-m)\mathbb{V}(Y_i(1))}{m} + 2Cov(Y_i(0), Y_i(1))\right\}} \]

This equation contains many design lessons. It shows how the standard error decreases as sample size (\(N\)) increases and as the variances of the potential outcomes decrease. It provides a justification for “balanced” designs that assign the same proportion of subjects to treatment and control. If the variances of \(Y_i(0)\) and \(Y_i(1)\) are equal, then a balanced split of subjects across conditions will yield the lowest standard error. If the variances of the potential outcomes are not equal, the expression suggests allocating more units to the condition with the higher variance.
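One way to make these lessons concrete is to code the expression from Gerber and Green (2012) and check it against brute-force enumeration of every possible assignment in a tiny sample. The function name `true_se`, the helper functions, and the six-unit potential outcomes are our own illustrative choices. We assume the variances and covariance in the expression are the divide-by-\(N\) (population) versions, which is what makes the formula and the enumeration agree.

```r
# True standard error of difference-in-means under complete random assignment.
# Variances and covariance use the population (divide-by-N) definitions.
true_se <- function(N, m, var_Y0, var_Y1, cov_Y0_Y1) {
  sqrt((m * var_Y0 / (N - m) + (N - m) * var_Y1 / m + 2 * cov_Y0_Y1) / (N - 1))
}

# Tiny hypothetical sample for the brute-force check
Y_Z_0 <- c(1, 2, 3, 4, 5, 6)
Y_Z_1 <- c(2, 2, 5, 4, 8, 7)
N <- 6
m <- 3
pop_var <- function(x) mean((x - mean(x))^2)
pop_cov <- function(x, y) mean((x - mean(x)) * (y - mean(y)))

# Every possible set of m treated units, and the estimate each would produce
treated_sets <- combn(N, m)
estimates <- apply(treated_sets, 2, function(idx) {
  mean(Y_Z_1[idx]) - mean(Y_Z_0[-idx])
})

se_enumerated <- sqrt(mean((estimates - mean(estimates))^2))
se_formula <- true_se(N, m, pop_var(Y_Z_0), pop_var(Y_Z_1),
                      pop_cov(Y_Z_0, Y_Z_1))

# Allocation lesson: with unequal variances, search over m for the lowest SE
m_grid <- seq(10, 90, 10)
se_grid <- true_se(100, m_grid, var_Y0 = 1, var_Y1 = 2, cov_Y0_Y1 = 0.5)
best_m <- m_grid[which.min(se_grid)]
```

With the treated variance set to twice the control variance, the grid search selects `m = 60`, but the standard error there (about 0.197) is only slightly below the value at `m = 50` (about 0.201).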

  3. Bias of the standard error estimator. Equation 3.4 is the true standard error. We also learn from analytic design diagnosis that the standard error estimator is upwardly biased, which is to say that it is conservative (see Section 9.2.1). The intuition for this bias is that we can’t directly estimate the covariance term in Equation 3.4, so we bound the variance under a worst-case assumption. The amount of bias in the standard error estimator depends on how wrong this worst case assumption is. When \(\rho\) is equal to 1, the bias goes to zero.

  4. Coverage. Since the standard errors are upwardly biased – they are “too big” – the statistics that are built on them will inherit this bias as well. The 95% confidence intervals will also be “too big,” so the coverage diagnosand will be above nominal, that is, 95% confidence intervals will cover the true parameter more frequently than 95% of the time.

  5. Power. Our answer strategy involves conducting a statistical significance test against the null hypothesis that the average outcome in the control group is equal to the average outcome in the treatment group. This test is also built on the estimated standard error, so the upward bias in the standard error estimator will put downward pressure on statistical power. In Section 17.1.1, we reproduced the formula given in Gerber and Green (2012) for statistical power that makes two further restrictions on the canonical design: equally-sized treatment groups and equal variances in the potential outcomes.
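The conservativeness of the variance estimator can be checked exactly in a small example by enumerating every assignment. The sketch below is illustrative only: the function name `neyman_variance_bias` and the potential outcome values are our own, and the comparison is made on the variance scale rather than the standard error scale.

```r
# Exact bias of the Neyman variance estimator, by enumerating every assignment
neyman_variance_bias <- function(Y_Z_0, Y_Z_1, m) {
  N <- length(Y_Z_0)
  treated_sets <- combn(N, m)   # each column is one possible treated group
  results <- apply(treated_sets, 2, function(idx) {
    y1 <- Y_Z_1[idx]    # revealed treated outcomes
    y0 <- Y_Z_0[-idx]   # revealed untreated outcomes
    c(estimate = mean(y1) - mean(y0),
      var_hat = var(y1) / m + var(y0) / (N - m))
  })
  true_var <- mean((results["estimate", ] - mean(results["estimate", ]))^2)
  mean(results["var_hat", ]) - true_var
}

Y_Z_0 <- c(1, 2, 3, 4, 5, 6)   # hypothetical values

# Heterogeneous effects: the estimator is too large on average
bias_heterogeneous <- neyman_variance_bias(Y_Z_0, c(2, 2, 5, 4, 8, 7), m = 3)

# Constant effects (rho = 1 with equal variances): the bias vanishes
bias_constant <- neyman_variance_bias(Y_Z_0, Y_Z_0 + 1, m = 3)
```

In this setup the bias equals the variance of the unit-level treatment effects divided by \(N\), which is why it disappears when effects are constant.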

Analytic design diagnosis is tremendously useful, for two reasons. First, we obtain guarantees for a large class of designs. Any experiment that fits into the canonical design will have these properties. Second, we learn from the analytic design diagnosis what the important design parameters are. In our model, we need to think about treatment effect heterogeneity in order to develop expectations about the variances and covariances of the potential outcomes. In our inquiry, we need to be thinking about specific average causal effects – the SATE, not the PATE or the CATE or the LATE. The data strategy in the canonical design is complete random assignment, so we need to think about how many units to assign to treatment (\(m\)) relative to control (\(N-m\)). The answer strategy is difference-in-means with Neyman standard errors – difference in means is unbiased for the ATE, but the Neyman standard error estimator is upwardly biased. This means our coverage will be conservative and we’ll take a small hit to statistical power.

17.1.2 Design diagnosis through simulation

Of course we can also declare this design and conduct design diagnosis using simulation. This process will confirm the analytic results, as well as provide estimates of diagnosands for which statisticians have not yet derived analytic expressions. This code produces a “designer” that allows us to easily vary the important components of the design.

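library(DeclareDesign) # declare_* functions, complete_ra(), reveal_outcomes()
library(magrittr)      # provides the %>% pipe

eq_3.4_designer <-
  function(N, m, var_Y0, var_Y1, cov_Y0_Y1, mean_Y0, mean_Y1) {
    # Draw a fixed sample of potential outcomes from a bivariate normal
    fixed_sample <-
      MASS::mvrnorm(
        n = N,
        mu = c(mean_Y0, mean_Y1),
        Sigma = matrix(c(var_Y0, cov_Y0_Y1, cov_Y0_Y1, var_Y1), nrow = 2),
        empirical = TRUE # this line makes the means and variances "exact" in the sample data
      ) %>%
      magrittr::set_colnames(c("Y_Z_0", "Y_Z_1")) %>%
      as.data.frame()
    declare_model(data = fixed_sample) +
      declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
      # complete random assignment: exactly m of N units treated
      declare_assignment(Z = complete_ra(N = N, m = m)) +
      declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
      declare_estimator(Y ~ Z, inquiry = "ATE")
  }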

This simulation investigates how much of the sample we should allocate to the treatment group if the treatment group variance is twice as large as the control group variance. The diagnosis confirms the bias is zero, the true standard errors are what Equation 3.4 predicts, coverage is above nominal, and that we are above the 80% power target for a middle range of \(m\). We learn that power is maximized (and the true standard error is minimized) when we allocate 60 or 70 units (of 100 total) to treatment. We also learn from this that the gains from choosing the unbalanced design (relative to a 50/50 allocation) are very small. Even when the variance in the treatment group is twice as large as the variance in the control group, we don’t lose much when sticking with the balanced design. Since we can never be sure of the relative variances of the treatment and control groups ex ante, this exercise provides further support for choosing balanced designs in many design settings.

designs <- 
  expand_design(designer = eq_3.4_designer,
                N = 100,
                m = seq(10, 90, 10),
                var_Y0 = 1,
                var_Y1 = 2,
                cov_Y0_Y1 = 0.5,
                mean_Y0 = 1.0,
                mean_Y1 = 1.75)

dx <- diagnose_designs(designs, sims = 100, bootstrap_sims = FALSE)

Figure 17.2 illustrates how key diagnosands respond to treatment assignment propensities. We see that bias and coverage are unaffected, while standard errors are minimized, and so statistical power is maximized, at middling assignment propensities.


Figure 17.2: How diagnosis depends on the number of units assigned to treatment

17.1.3 What can go wrong

Even for the simplest two-arm trial design, many things can go wrong. Naturally, these problems can be described in terms of M, I, D, and A.

The most crucial assumption we made in the model is that there are exactly two potential outcomes for each unit. This is violated if there are “spillovers” between units, say between housemates. If a unit’s outcome depends on the treatment status of their housemate, we could imagine four potential outcomes for each unit: only unit \(i\) is treated, only unit \(i\)’s housemate is treated, both are treated, or neither is treated. If there are indeed spillovers, but they are ignored, the very definition of the inquiry is malformed. If units don’t even have clearly defined “treated” and “untreated” potential outcomes because of spillovers from others, then we can’t define the ATE in the usual way. The solution to this problem is to elaborate the model to account for all the potential outcomes, then to redefine the inquiry with respect to those potential outcomes. We explore two experimental designs for learning about spillovers in Sections 17.10 and 17.11.

Other ways for a two-arm trial to go wrong concern the data strategy. If you think you are using complete random assignment but you in fact are not, bias may creep in. A nonrandom assignment procedure might be something like “first-come, first-served.” If you assign the “first” \(m\) units to treatment and the remainder to control, the assignment procedure is not randomized. Bias will occur if the potential outcomes of the first \(m\) are unlike the potential outcomes of the remaining \(N-m\) units.
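A deterministic toy example shows how large this bias can be. Everything below is hypothetical: we assume that earlier arrivals have higher untreated outcomes and that the true effect is a constant 0.2.

```r
N <- 100
m <- 50
arrival_rank <- 1:N

# Hypothetical model: earlier arrivals have higher untreated outcomes
Y_Z_0 <- (N - arrival_rank) / N
Y_Z_1 <- Y_Z_0 + 0.2   # constant true effect of 0.2

# "First-come, first-served": the first m arrivals are treated
Z_fcfs <- as.numeric(arrival_rank <= m)
Y <- Y_Z_1 * Z_fcfs + Y_Z_0 * (1 - Z_fcfs)

estimate_fcfs <- mean(Y[Z_fcfs == 1]) - mean(Y[Z_fcfs == 0])
true_ate <- mean(Y_Z_1 - Y_Z_0)
```

Here the difference-in-means estimate is 0.7 although the true average treatment effect is 0.2; the extra 0.5 is the pre-existing gap between early and late arrivals.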

Sometimes researchers do successfully conduct random assignment, but the random assignment happens to produce treatment and control groups that are unlike each other in observable ways. The unbiasedness property applies to the whole procedure – over many hypothetical iterations of the experiment, the average estimate will be equal to the value of the inquiry. But any particular estimate can be close to or far from the true value. A solution to this problem is to change the answer strategy to adjust estimates for covariates, though we would recommend adjusting for covariates regardless of whether the treatment and control groups appear imbalanced. We explore procedures for including covariates in the data strategy (blocking) and in the answer strategy (covariate adjustment) in Section 17.2.

Other data strategy problems include noncompliance and attrition. Noncompliance occurs when units’ treatment status differs from their treatment assignment. We describe two designs for addressing noncompliance in Sections 17.6 and 17.7. Attrition occurs when outcome data are missing. The attrition problem in experiments is exactly analogous to the attrition problem in descriptive studies (see Section 14.1), since we no longer have random samples of each set of potential outcomes.

Further reading

  • Chapters 2 and 3 of Gerber and Green (2012) cover the potential outcomes framework and features of the sampling distribution of the difference-in-means estimator of the average treatment effect.

  • Chapter 2 of Angrist and Pischke (2008) describes the two-arm trial as the “experimental ideal” that observational studies are trying to emulate.

  • Chapter 7 of Aronow and Miller (2019) provides a rigorous mathematical foundation for causal inference in the two-arm randomized experimental setting.

17.2 Block-randomized experiments

We declare a block-randomized trial in which subjects are assigned to treatment and control conditions within groups. We use design diagnosis to assess the reductions in estimation variance that can be achieved from block randomization, examine possible downsides of block randomization, and compare strategies that randomize ex ante and that introduce controls ex post.

In a block-randomized experimental design, homogeneous sets of units are grouped together into blocks on the basis of covariates. The ideal blocking would group together units with identical potential outcomes, but since we don’t have access to any outcome information at the moment of treatment assignment, let alone the full set of potential outcomes, we have to make do grouping together units on the basis of covariates we hope are strongly correlated with potential outcomes. The more strongly the blocking variable predicts potential outcomes, the more blocking increases precision.

Blocks can be formed on the basis of the levels of a single discrete covariate. We might be able to do better by blocking on the intersection of the levels of two discrete covariates. We could coarsen a continuous variable in order to create strata. We might want to create matched quartets of units, partitioning the sample into sets of four units that are as similar as possible on many covariates. Methodologists have crafted many algorithms for creating blocks, each with their own tradeoffs in terms of computational speed and efficiency guarantees (BlockTools, SoftBlock, Gram-Schmidt). The main point is that there are many paths to the creation of blocks on the basis of covariates and which one is the best choice in any particular setting will depend on the availability of covariate information that is correlated with potential outcomes.

In this design, we block our assignment on a binary covariate X. We assign different fractions of each block to treatment to illustrate the notion that probabilities of assignment need not be constant across blocks; if they aren’t, we need to weight units by the inverse of the probability of assignment to the condition that they are in. In the answer strategy, we adjust for blocks using the Lin (2013) regression adjustment estimator with inverse probability weights (IPW). A fuller motivation for the Lin estimator is presented at the end of this section.

Declaration 17.1 \(~\)

design <-
  declare_model(
    N = 500,
    X = rep(c(0, 1), each = N / 2),
    U = rnorm(N, sd = 0.25),
    potential_outcomes(Y ~ 0.2 * Z + X + U)
  ) +
  declare_assignment(
    # treat 20% of the X = 0 block and 50% of the X = 1 block
    Z = block_ra(blocks = X, block_prob = c(0.2, 0.5)),
    # probability of landing in the condition each unit was assigned to
    probs =
      obtain_condition_probabilities(assignment = Z, blocks = X, 
                                     block_prob = c(0.2, 0.5)),
    ipw = 1 / probs
  ) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(
    Y ~ Z,
    covariates = ~ X,
    model = lm_lin,
    weights = ipw,
    label = "Lin"
  )
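To see why the inverse probability weights matter, the base R sketch below builds a small blocked assignment by hand and compares three estimates. It is a simplified illustration, not the Lin estimator used in the declaration: the sample size, seed, outcome model, and variable names are all assumptions.

```r
set.seed(2)
N <- 20
X <- rep(c(0, 1), each = N / 2)
prob_treat <- ifelse(X == 0, 0.2, 0.5)   # assignment probability by block

# Block random assignment by hand: treat prob * block size units per block
Z <- numeric(N)
for (b in unique(X)) {
  in_block <- which(X == b)
  n_treat <- round(prob_treat[in_block][1] * length(in_block))
  Z[sample(in_block, n_treat)] <- 1
}

Y <- 0.2 * Z + X + rnorm(N, sd = 0.25)   # hypothetical observed outcomes

# Inverse of the probability of the condition each unit actually landed in
ipw <- 1 / ifelse(Z == 1, prob_treat, 1 - prob_treat)

# IPW-weighted difference in means
ipw_dim <- weighted.mean(Y[Z == 1], ipw[Z == 1]) -
  weighted.mean(Y[Z == 0], ipw[Z == 0])

# Same quantity built block by block, weighting block estimates by block size
block_dims <- sapply(unique(X), function(b) {
  mean(Y[Z == 1 & X == b]) - mean(Y[Z == 0 & X == b])
})
stratified <- weighted.mean(block_dims, as.vector(table(X)))

# Naive comparison that ignores the unequal assignment probabilities
naive_dim <- mean(Y[Z == 1]) - mean(Y[Z == 0])
```

The weighted estimate coincides with the block-size-weighted average of the block-level differences in means. The naive estimate mixes in the effect of \(X\), because units with \(X = 1\) are overrepresented in the treatment group.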


17.2.1 Why does blocking help?

Why does blocking increase the precision with which we estimate the ATE? One piece of intuition is that blocking rules out “poor” random assignments that exhibit imbalance on the blocking variable. If \(N\) = 12 and \(m\) = 6, complete random assignment allows choose(12, 6) = 924 possible assignments. If we form two blocks of size 6 and conduct block random assignment, then there are choose(6, 3) * choose(6, 3) = 400 remaining possible assignments. The assignments that are ruled out are those in which too many or too few units in a block are assigned to treatment, because blocking requires that exactly \(m_B\) units be treated in each block \(B\). When potential outcomes are correlated with the blocking variable, those “extreme” assignments produce estimates that are in the tails of the sampling distribution associated with complete random assignment.
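The counts in this example can be reproduced by listing every assignment and keeping only those that a blocked procedure would allow. The block labels below are hypothetical.

```r
N <- 12
block <- rep(c("A", "B"), each = 6)   # hypothetical block labels

# All ways to treat 6 of 12 units under complete random assignment
treated_sets <- combn(N, 6)
n_complete <- ncol(treated_sets)

# Keep only assignments that treat exactly 3 units in block A
# (and therefore exactly 3 in block B)
treated_in_A <- apply(treated_sets, 2, function(idx) sum(block[idx] == "A"))
n_blocked <- sum(treated_in_A == 3)

c(complete = n_complete, blocked = n_blocked)
```

Blocking therefore discards 524 of the 924 assignments, all of which treat an unequal number of units across the two blocks.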

This intuition behind blocking is illustrated in Figure 17.3, which shows the sampling distribution of the difference-in-means estimator under complete random assignment. The histogram is shaded according to whether the particular random assignment is permissible under a procedure that blocks on the binary covariate \(X\). The sampling distribution of the estimator among the set of assignments that are permissible under blocking is more tightly distributed around the true average treatment effect than the estimates associated with assignments that are not perfectly balanced. Here we can see the value of a blocking procedure – it rules out by design those assignments that are not perfectly balanced.


Figure 17.3: Sampling distribution under complete random assignment, by covariate balance

17.2.2 Can blocking ever hurt?

We showed above how blocking typically increases precision by ruling out some random assignments allowed under complete random assignment. When we form blocks of units whose potential outcomes are similar, then assignments that generate estimates that are far from the ATE are ruled out. A “bad blocking” occurs if the assignments that are ruled out are the ones that generate estimates that are close to the ATE. This possibility can only occur if, rather than forming blocks of units with similar potential outcomes, we unwittingly form blocks of units whose potential outcomes are very different. Doing so turns out to be rare and requires a convoluted blocking strategy, but it is possible.

Here is an example of a design in which blocking hurts precision. We block on “couple,” but we imagine an “opposites attract” model of romance: the unit with the highest value of X is paired with the unit with the lowest value, the second highest with the second lowest, and so on. The diagnosis shows that in this odd case, the complete random assignment design has a tighter sampling distribution than the block random assignment design. Problems like this can be avoided by blocking together units who are similar on a prognostic covariate, not dissimilar.

Declaration 17.2 \(~\)

MI <- declare_model(
  N = 100,
  X = sort(rnorm(N)),
  couple = c(1:(N / 2), (N / 2):1),
  U = rnorm(N, sd = 0.1),
  potential_outcomes(Y ~ Z + X * Z + U)
) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0))

design_complete <- MI +
  declare_assignment(Z = complete_ra(N)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE") # link to the ATE so bias can be diagnosed

design_blocked <- MI +
  declare_assignment(Z = block_ra(blocks = couple)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE")


simulations <- simulate_designs(design_complete, design_blocked)

Table 17.1: Bad blocking leads to precision loss

Design            Bias     Standard Error
design_blocked    -0.001   0.164
design_complete    0.000   0.142

17.2.3 Connection of blocking to covariate adjustment

Choosing block random assignment over complete random assignment is a way to incorporate covariate information into the data strategy D. We can also incorporate covariate information into the answer strategy A for the same purpose (increasing precision), by controlling for covariates or otherwise conditioning on them when estimating the average treatment effect. In observational settings like the one we explore in Section 15.2, conditioning on covariates is used to block back-door paths to address confounding. Here, confounding is no problem – the treatment is assigned at random by design, so we do not need to control for covariates in order to decrease bias. Instead, we control for covariates in order to reduce sampling variability.

Figure 17.4 illustrates this point. The sampling distribution under difference-in-means is shown on the top line and the sampling distribution under ordinary least squares (Y ~ Z + X) is shown on the bottom line. Estimates that are on the extreme ends of the distribution under difference-in-means are pulled in more tightly to center on the ATE. One interesting wrinkle this graph reveals is that covariate adjustment does not tighten up the estimates for assignments that exactly balance X – it only helps the assignments that are slightly imbalanced.


Figure 17.4: Impact of covariate adjustment on the sampling distribution

17.2.4 Simulation comparing blocking to covariate adjustment

Adjusting for pre-treatment covariates that are predictive of the outcome almost always increases precision; blocking on covariates that are predictive of the outcome almost always increases precision too. Another way of putting this idea is that covariate information can be incorporated in the answer strategy through covariate adjustment or in the data strategy through blocking and that in this way, the two procedures are approximately equivalent.

We’ll now declare and diagnose four closely-related experimental designs. To begin, we describe a fixed population of 100 units with a binary covariate \(X\) and unobserved heterogeneity \(U\). Potential outcomes are a function of the treatment \(Z\) and are correlated with \(X\). Throughout this exercise, our inquiry is the ATE.

Declaration 17.3 \(~\)

fixed_pop <-
  fabricate(
    N = 100,
    X = rbinom(N, 1, 0.5),
    U = rnorm(N),
    potential_outcomes(Y ~ 0.2*Z + X + U)
  )

MI <-
  declare_model(data = fixed_pop) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0))

We have two answer strategies and two data strategies that we’ll mix-and-match.

# Data strategies
complete_assignment <- 
  declare_assignment(Z = complete_ra(N = N)) + 
  declare_measurement(Y = reveal_outcomes(Y ~ Z))
blocked_assignment <- 
  declare_assignment(Z = block_ra(blocks = X)) + 
  declare_measurement(Y = reveal_outcomes(Y ~ Z))

# Answer strategies
unadjusted_estimator <- declare_estimator(Y ~ Z, inquiry = "ATE")
adjusted_estimator <- declare_estimator(Y ~ Z + X, model = lm_robust, inquiry = "ATE")

These combine to create four designs, which we then diagnose.

design_1 <- MI + complete_assignment + unadjusted_estimator
design_2 <- MI + blocked_assignment + unadjusted_estimator
design_3 <- MI + complete_assignment + adjusted_estimator
design_4 <- MI + blocked_assignment + adjusted_estimator


diagnose_designs(list(design_1, design_2, design_3, design_4))

Table 17.2: Comparison of block randomization to covariate adjustment

Data Strategy                Answer Strategy        True Standard Error   Average Estimated Standard Error
Complete Random Assignment   Difference-in-means    0.229                 0.217
Block Random Assignment      Difference-in-means    0.168                 0.218
Complete Random Assignment   Covariate Adjustment   0.190                 0.181
Block Random Assignment      Covariate Adjustment   0.168                 0.181

The diagnosis shows that incorporating covariate information either in the data strategy or in the answer strategy yields similar gains to the true standard error. Relative to the canonical design (complete random assignment with difference-in-means), any of the alternatives represents an improvement. Blocking on \(X\) in the data strategy decreases sampling variability. Controlling for \(X\) in the answer strategy decreases sampling variability. Doing both – blocking on \(X\) and controlling for \(X\) – does not yield additional gains, but controlling for \(X\) is nevertheless appropriate when using a blocked design. The reason for this can be seen in the “average estimated standard error” diagnosand. If we block, but still use the difference-in-means estimator, the estimated standard errors do not decrease relative to complete random assignment. The usual Neyman variance estimator doesn’t “know” about the blocking. A number of fixes to this problem are available. You can, as we do in the simulation, control for the blocking variable in an OLS regression. Alternatively, you can use the “stratified” estimator that obtains block-level ATE estimates, then averages them together, weighting by block size. The stratified estimator has an associated standard error estimator – see Gerber and Green (2012), pages 73-74. The stratified estimator is an instance of Principle 3.7: Seek M:I::D:A parallelism. Respecting the data strategy in the answer strategy (by adjusting for the blocking) brings down the estimated standard error as well.

17.2.5 Can controlling for covariates hurt precision?

Freedman (2008) critiques the practice of using OLS regression to adjust experimental data. While the difference-in-means estimator is unbiased for the average treatment effect, the covariate-adjusted OLS estimator exhibits a small-sample bias (sometimes called “Freedman bias”) that diminishes quickly as sample sizes increase. More worrying is the critique that covariate adjustment can even hurt precision. Lin (2013) unpacks the circumstances under which this precision loss occurs and offers an alternative estimator that is guaranteed to be at least as precise as the unadjusted estimator. The trouble occurs when the relationship between covariates and outcomes differs sharply between the treatment and control conditions and when designs are strongly imbalanced in the sense of having large proportions of treated or untreated units. We refer the reader to this excellent and quite readable paper for details and the connection between covariate adjustment in randomized experiments and covariate adjustment in random sampling designs. In sum, the Lin estimator deals with the problem by performing covariate adjustment in each arm of the experiment separately, which is equivalent to the inclusion of a full set of treatment-by-covariate interactions. In a clever bit of regression magic, Lin shows how first pre-processing the data by de-meaning the covariates renders the coefficient on the treatment regressor an estimate of the overall ATE. The lm_lin estimator in the estimatr package implements this pre-processing seamlessly.
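To make the de-meaning step concrete, here is a minimal sketch in base R (using `lm` rather than estimatr, and without the robust standard errors one would want in practice). The simulated data and all object names are assumptions for illustration. It checks that the coefficient on treatment in the centered, fully interacted regression matches the estimate obtained by fitting each arm separately and averaging the predicted differences over all units.

```r
set.seed(1)
n <- 200
dat <- data.frame(x = runif(n))
dat$z <- rbinom(n, 1, 0.3)  # imbalanced assignment, for illustration only
# Opposite slopes in the two arms, loosely echoing the text's worst case
dat$y <- ifelse(dat$z == 1, 1 * dat$x, -1 * dat$x) + rnorm(n, sd = 0.1)

# Lin-style adjustment: center the covariate, then interact it with treatment
dat$x_c <- dat$x - mean(dat$x)
lin_fit <- lm(y ~ z * x_c, data = dat)
lin_estimate <- unname(coef(lin_fit)["z"])

# Equivalent view: adjust within each arm separately, predict both
# potential outcomes for every unit, and average the difference
fit_1 <- lm(y ~ x, data = dat, subset = z == 1)
fit_0 <- lm(y ~ x, data = dat, subset = z == 0)
imputation_estimate <- mean(predict(fit_1, newdata = dat) -
                              predict(fit_0, newdata = dat))

c(lin = lin_estimate, imputation = imputation_estimate)
```

The two numbers agree up to floating-point error, which is the sense in which the interacted regression performs covariate adjustment in each arm separately.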

Declaration 17.4 will help us to explore the precision of three estimators under a variety of circumstances. We want to understand the performance of the difference-in-means, OLS, and Lin estimators depending on how different the correlation between X and the outcome is by treatment arm, and depending on the fraction of units assigned to treatment.

Declaration 17.4 \(~\)

prob = 0.5
control_slope = -1

design <-
  declare_model(N = 100,
                X = runif(N, 0, 1),
                U = rnorm(N, sd = 0.1),
                Y_Z_1 = 1*X + U,
                Y_Z_0 = control_slope*X + U
                ) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N = N, prob = prob)) + 
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE", label = "DIM") +
  declare_estimator(Y ~ Z + X, model = lm_robust, inquiry = "ATE", label = "OLS") +
  declare_estimator(Y ~ Z, covariates = ~X, model = lm_lin, inquiry = "ATE", label = "Lin")


designs <- redesign(design, 
                    control_slope = seq(-1, 1, 0.5), 
                    prob = seq(0.1, 0.9, 0.1))

simulations <- simulate_designs(designs)

Figure 17.5 considers a range of designs under five possible models. The five models are described by the top row of facets. In all cases, the slope of the treated potential outcomes with respect to \(X\) is set to 1. All the way to the left, the slope of the control potential outcomes with respect to \(X\) is set to -1, and all the way to the right, it is set to +1. The bottom row of facets shows the performance of three estimators along a range of treatment assignment probabilities.

When the control slope is -1, we can see Freedman’s precision critique. The standard error of the OLS estimator is larger than that of difference-in-means for many designs, though they coincide when the fraction treated is 50%. This problem persists in some form until the slope of the control potential outcomes with respect to \(X\) gets close enough to the slope of the treated potential outcomes with respect to \(X\).

All along this range, however, the Lin estimator dominates OLS and difference-in-means. Regardless of the fraction assigned to treatment and the model of potential outcomes, the Lin estimator achieves equal or better precision than either difference-in-means or OLS.

Figure 17.5: Performance of three estimators

17.2.6 Summary

Covariate information can help experimenters increase the precision of their estimates. This precision gain is “free” if covariate information is easily available. If measuring covariates is costly, then experimenters face a tradeoff: should scarce resources be spent on increasing the sample size or on measuring covariate information? When covariates are especially predictive of outcomes, the measurement of covariates can be a good investment.

Covariate information can be included either in the data strategy as the basis for blocks or in the answer strategy as a control procedure. To a first approximation, we can achieve the same gains using either approach, though we take the view that, where possible, it would be preferable to incorporate covariates in the data strategy rather than the answer strategy, since inferences will be less dependent on the specifics of the estimator.

The incorporation of covariates almost always increases precision, but bad blocking or perverse control can cause decreases in precision. Bad blocks are easy to avoid if we block on covariates that are predictive of the outcome. Adverse consequences of regression adjustment can be easily sidestepped by adopting the Lin estimator as the default form of covariate adjustment.

17.3 Cluster-randomized experiments

We declare a cluster-randomized trial in which subjects are assigned to treatment and control conditions in groups. We use design diagnosis to assess the reductions in the variance of estimates that can be achieved from block randomization, examine possible downsides of block randomization, and compare strategies that randomize ex ante and that introduce controls ex post.

When whole groups of units are assigned to treatment conditions together, we say that the assignment procedure is clustered. A common example is an education experiment that is randomized at the classroom level. All students in a classroom are assigned to either treatment or control together; assignments do not vary within a classroom. Clusters can be localities, like villages, precincts, or neighborhoods. Clusters can be households if treatments are assigned at the household level.

Typically, cluster-randomized trials exhibit higher variance than the equivalent individually randomized trial. How much higher depends on a statistic that can be hard to think about, the intra-cluster correlation (ICC). The total variance can be decomposed into the variance of the cluster means \(\sigma^2_{between}\) plus the individual variance of the cluster-demeaned outcome \(\sigma^2_{within}\). The ICC is a number between zero and one that describes the fraction of the total variance that is due to the between variance: \(\frac{\sigma^2_{between}}{\sigma^2_{between} + \sigma^2_{within}}\). If ICC equals one, then all units within a cluster express the same outcome, and all of the variation in outcomes is due to cluster-level differences. If ICC equals zero, then the cluster means are all identical, but the individuals vary within each cluster. When ICC is one, the effective sample size is equal to the number of clusters. When ICC is zero, the effective sample size is equal to the number of individuals. Since ICC is usually somewhere between these two values, we can see that clustering decreases the effective sample size from the number of individuals. The size of this decrease depends on how similar outcomes are within clusters compared to how similar outcomes are across clusters.
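The two limiting cases above can be checked with a small base R sketch. It relies on the standard design-effect approximation for equal-sized clusters, \(N / (1 + (m - 1) \times \mathrm{ICC})\) with \(m\) units per cluster, which is an assumption brought in for illustration and is not part of the declaration in this section. The helper names are hypothetical.

```r
# Hypothetical helpers; equal cluster sizes are assumed throughout.
icc_from_variances <- function(sigma2_between, sigma2_within) {
  sigma2_between / (sigma2_between + sigma2_within)
}

# Standard design-effect approximation (an assumption of this sketch):
# n_eff = N / (1 + (m - 1) * rho), with m units per cluster
effective_n <- function(n_clusters, cluster_size, rho) {
  n_total <- n_clusters * cluster_size
  n_total / (1 + (cluster_size - 1) * rho)
}

n_eff_icc_one  <- effective_n(n_clusters = 10, cluster_size = 30, rho = 1)
n_eff_icc_zero <- effective_n(n_clusters = 10, cluster_size = 30, rho = 0)
n_eff_mixed    <- effective_n(10, 30, icc_from_variances(0.1, 0.9))

c(icc_one = n_eff_icc_one, icc_zero = n_eff_icc_zero, icc_0.1 = n_eff_mixed)
```

With an ICC of one the effective sample size collapses to the 10 clusters, and with an ICC of zero it equals all 300 individuals; intermediate values fall in between.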

For these reasons, clustered random assignment is not usually a desirable feature of a design. However, it is sometimes useful or even necessary for logistical or ethical reasons for subjects to be assigned together in groups.

To demonstrate the consequences of clustering, Declaration 17.5 shows a design in which both the untreated outcome Y_Z_0 and the treatment effect tau_i exhibit intra-cluster correlation.

The inquiry is the average treatment effect over individuals which can be defined without reference to the clustered structure of the data.

The data strategy employs clustered random assignment. We highlight two features of the cluster assignment.

First we highlight that the clustered nature of the data does not itself call for clustered assignment. In principle one could assign at the individual level or subgroup level even if outcomes are correlated within groups.

Second, surprisingly, random assignment of clusters to conditions does not guarantee that the difference-in-means estimator is unbiased when clusters are of unequal size (Middleton 2008; Imai, King, and Nall 2009). The bias stems from the possibility that potential outcomes could be correlated with cluster size. With uneven cluster sizes, the total number of units (the denominator in the mean estimation) in each group bounces around from assignment to assignment. Since the expectation of a ratio is not, in general, equal to the ratio of expectations, any dependence between cluster size and potential outcomes will cause bias. We can address this problem by blocking clusters into groups according to cluster size. If all clusters in a block are of the same size, then the overall size of the treatment group will remain stable from assignment to assignment. For this reason the design below uses clustered assignment blocked on cluster size.
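The bias can be seen exactly in a toy example by enumerating every possible assignment. The four clusters, their sizes, and their potential outcomes below are invented for illustration; outcomes are constant within cluster and rise with cluster size, which is the dependence the argument requires.

```r
# Toy population (assumed values): 4 clusters of unequal size
cluster_size <- c(10, 10, 50, 50)
y0  <- c(1, 1, 5, 5)   # untreated outcome, constant within cluster
tau <- c(0, 0, 2, 2)   # treatment effect, larger in the big clusters
y1  <- y0 + tau

# True ATE over individuals
true_ate <- sum(cluster_size * tau) / sum(cluster_size)

# Individual-level difference-in-means for a given set of treated clusters
dim_estimate <- function(treated) {
  control <- setdiff(seq_along(cluster_size), treated)
  mean_t <- sum(cluster_size[treated] * y1[treated]) / sum(cluster_size[treated])
  mean_c <- sum(cluster_size[control] * y0[control]) / sum(cluster_size[control])
  mean_t - mean_c
}

# (a) Unblocked: any 2 of the 4 clusters may be treated
unblocked <- combn(4, 2, simplify = FALSE)
expected_unblocked <- mean(sapply(unblocked, dim_estimate))

# (b) Blocked on size: one small and one large cluster treated
blocked <- list(c(1, 3), c(1, 4), c(2, 3), c(2, 4))
expected_blocked <- mean(sapply(blocked, dim_estimate))

c(true_ate = true_ate,
  unblocked = expected_unblocked,
  blocked = expected_blocked)
```

Averaged over the six unblocked assignments, the estimator does not equal the true ATE, because the two assignments that treat both small or both large clusters change the group sizes. Under blocking on cluster size, every assignment yields a treatment group of 60 units and the average matches the true ATE.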

Declaration 17.5 \(~\)

ICC <- 0.9

design <-
  declare_model(
    cluster =
      add_level(
        N = 10,
        cluster_size = rep(seq(10, 50, 10), 2),
        cluster_shock =
          scale(cluster_size + rnorm(N, sd = 5)) * sqrt(ICC),
        cluster_tau = rnorm(N, sd = sqrt(ICC))
      ),
    individual =
      add_level(
        N = cluster_size,
        individual_shock = rnorm(N, sd = sqrt(1 - ICC)),
        individual_tau = rnorm(N, sd = sqrt(1 - ICC)),
        Y_Z_0 = cluster_shock + individual_shock,
        Y_Z_1 = Y_Z_0 + cluster_tau + individual_tau
      )
  ) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(
    Z = block_and_cluster_ra(clusters = cluster, blocks = cluster_size)
  ) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z,
                    clusters = cluster,
                    inquiry = "ATE")

designs <- redesign(design, ICC = seq(0.1, 0.9, by = 0.4))


diagnoses <- diagnose_designs(designs)

Figure 17.6 shows the sampling distribution of the difference-in-means estimator under cluster random assignment at the levels of intra-cluster correlation set in the redesign call, which range from 0.1 to 0.9.

The top row of panels plots the treatment effect on the vertical axis and the untreated potential outcome on the horizontal axis. Clusters of units are circled. At low levels of ICC, the circles all overlap, because the differences across clusters are smaller than the differences within cluster. At high levels of ICC, the differences across clusters are more pronounced than differences within cluster. The bottom row of panels shows that the sampling distribution of the difference-in-means estimator spreads out as the ICC increases. At low levels of ICC, the standard error is small; at high levels the standard error is high.

Figure 17.6: Sampling distribution under different ICCs

This diagnosis clarifies the costs of cluster assignment. These costs are greatest when there are few clusters and when units within clusters have similar potential outcomes. Diagnosis can be further used to compare these costs to advantages and assess the merits of variations in the design that seek to alter the number or size of clusters.

17.4 Subgroup designs

We declare and diagnose a design that is targeted at understanding the difference in treatment effects between subgroups. The design combines a sampling strategy that ensures reasonable numbers within each group of interest and a blocking assignment strategy to minimize variance.

Subgroup designs are experimental designs that have been tailored to a specific inquiry, the difference-in-CATEs. A CATE is a “conditional average treatment effect,” or the average treatment effect conditional on membership in some group. A difference-in-CATEs is just the difference between two CATEs.

For example, studies of political communication often have the difference in response to a party cue by subject partisanship as the main inquiry, since Republican subjects tend to respond positively to a Republican party cue, whereas Democratic subjects tend to respond negatively.

Subgroup designs share much in common with factorial designs, discussed in detail in Section 17.5. The main source of commonality is the answer strategy for the difference-in-CATEs inquiry. In subgroup designs and factorial designs, the usual approach is to inspect the interaction term from an OLS regression. The two designs differ because in the subgroup design, the difference-in-CATEs is a descriptive difference. We don’t randomly assign partisanship, so we can’t attribute the difference in response to treatment to partisanship, which could just be a marker for the true causes of the difference in response. In the factorial design, we randomize the levels of all treatments, so the differences-in-CATEs carry with them a causal interpretation.

Since we don’t randomly assign membership in subgroups, how can we optimize the design to target the difference-in-CATEs? Our main data strategy choice comes in sampling. We need to obtain sufficient numbers of both groups in order to generate sharp enough estimates of each CATE, the better to estimate their difference. For example, at the time of this writing, many sources of convenience samples (Mechanical Turk, Lucid, Prolific, and many others) appear to underrepresent Republicans, so researchers sometimes need to make special efforts to increase their numbers in the eventual sample.

Declaration 17.6 describes a fixed population of 10,000 units, among whom people with X = 1 are relatively rare (only 20%). In the potential_outcomes call, we build in both baseline differences in the outcome and oppositely signed responses to treatment. Those with X = 0 have a CATE of 0.1 and those with X = 1 have a CATE of 0.1 - 0.2 = -0.1. The true difference-in-CATEs is therefore -0.2, a gap of 20 percentage points.

If we were to draw a sample of 1000 at random, we would expect to obtain only about 200 people with X = 1. Here we improve upon that through stratified sampling. We deliberately sample 500 units with X = 1 and 500 with X = 0, then block randomly assign the treatment by X.

Declaration 17.6 \(~\)

fixed_pop <-
  fabricate(N = 10000,
            X = rbinom(N, 1, 0.2),
            potential_outcomes(
              Y ~ rbinom(N, 1,
                         prob = 0.7 + 0.1 * Z  - 0.4 * X - 0.2 * Z * X)))

total_n <- 1000
n_x1 <- 500
# Note: n_x0 = total_n - n_x1

design <-
  declare_population(data = fixed_pop) +
  declare_inquiry(
    CATE_X1 = mean(Y_Z_1[X == 1] - Y_Z_0[X == 1]),
    CATE_X0 = mean(Y_Z_1[X == 0] - Y_Z_0[X == 0]),
    diff_in_CATEs = CATE_X1 - CATE_X0
  ) +
  declare_sampling(
    S = strata_rs(strata = X, strata_n = c(total_n - n_x1, n_x1))
  ) +
  declare_assignment(Z = block_ra(blocks = X)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z + X + Z * X, 
                    term = "Z:X", 
                    inquiry = "diff_in_CATEs")


To show the benefits of stratified sampling for experiments, we redesign over many values of under- and over-sampling units with X = 1, holding the total sample size fixed at 1000. The top panel of Figure 17.7 shows the distribution of difference-in-CATE estimates at each size of the X = 1 group. When very small or very large fractions of the total sample have X = 1, the variance of the estimator is much larger than when the two groups are the same size.

The bottom panel of the figure shows how three diagnosands change over the oversampling design parameter. Bias is never a problem – even small subgroups will generate unbiased difference-in-CATE estimates. As suggested by the top panel, the standard error is minimized in the middle and is largest at the extremes. Likewise, statistical power is maximized in the middle, but drops off surprisingly quickly as we move away from evenly balanced recruitment.

designs <- redesign(design, n_x1 = seq(20, 980, by = 96))
simulations <- simulate_designs(designs)

Figure 17.7: Design performance given different sampling strategies.
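The shape of the standard error and power curves can be approximated analytically without simulation. The base R sketch below is a rough check, not a replacement for the diagnosis: it assumes each stratum is split evenly between treatment and control, uses the Bernoulli variance of each cell implied by the outcome probabilities in Declaration 17.6, ignores finite-population corrections, and uses a normal approximation for power. All object names are hypothetical.

```r
# Cell-level outcome probabilities implied by
# prob = 0.7 + 0.1 * Z - 0.4 * X - 0.2 * Z * X
cell_prob <- function(z, x) 0.7 + 0.1 * z - 0.4 * x - 0.2 * z * x
bern_var  <- function(p) p * (1 - p)

total_n <- 1000
n_x1 <- seq(20, 980, by = 96)
n_x0 <- total_n - n_x1

# Approximate variance of each CATE estimate (half of each stratum treated)
var_cate_x1 <- (bern_var(cell_prob(1, 1)) + bern_var(cell_prob(0, 1))) / (n_x1 / 2)
var_cate_x0 <- (bern_var(cell_prob(1, 0)) + bern_var(cell_prob(0, 0))) / (n_x0 / 2)
se_diff <- sqrt(var_cate_x1 + var_cate_x0)

# True difference-in-CATEs and approximate power of a two-sided 5% test
true_diff <- (cell_prob(1, 1) - cell_prob(0, 1)) -
  (cell_prob(1, 0) - cell_prob(0, 0))
approx_power <- pnorm(abs(true_diff) / se_diff - qnorm(0.975))

data.frame(n_x1, se_diff = round(se_diff, 3), power = round(approx_power, 2))
```

Under these assumptions the approximate standard error is smallest when the two groups are the same size and grows toward either extreme, in line with the pattern described above.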


  • Suppose it costs $2 to recruit an X = 1 and $1 to recruit an X = 0 and we have a budget of $1000. Modify the redesign to find the power-maximizing financially feasible design.

Further reading

  • Druckman and Kam (2011) make the point that a main difficulty when using convenience samples (in their case, student samples) is a lack of variation on crucial moderating variables.

17.5 Factorial experiments

We declare and diagnose a canonical factorial design in which two different treatments are crossed. The design allows for unbiased estimation of a wide range of estimands including conditional effects and interaction effects. We highlight the difficulty of achieving statistical power for interaction terms and the risks of treating a difference between a significant conditional effect and a nonsignificant effect as itself significant.

In factorial experiments, researchers randomly assign the level of not just one treatment, but multiple treatments. The prototypical factorial design is a “two-by-two” factorial design in which factor 1 has two levels and factor 2 has two levels as well. Similarly, a “three-by-three” factorial design has two factors, each of which has three levels. We can entertain any number of factors with any number of levels. For example, a “two-by-three-by-two” factorial design has three factors, two of which have two levels and one of which has three levels. Conjoint experiments (Section 16.3) are highly factorial, often including six or more factors with two or more levels each.

Factorial designs can help researchers answer many inquiries, so it is crucial to design factorials with a particular set in mind. Let’s consider the two-by-two case, which is complicated enough. Let’s call the first factor Z1 and the second factor Z2, each of which can take on the values of zero or one. Considering only average effects, this design can support seven separate inquiries:

  1. the average treatment effect (ATE) of Z1,
  2. the ATE of Z2,
  3. the conditional average treatment effect (CATE) of Z1 given Z2 is 0,
  4. the CATE of Z1 given Z2 is 1,
  5. the CATE of Z2 given Z1 is 0,
  6. the CATE of Z2 given Z1 is 1,
  7. the difference-in-CATEs of Z1 given Z2 is 1 and of Z1 given Z2 is 0, which is numerically equivalent to the difference-in-CATEs of Z2 given Z1 is 1 and of Z2 given Z1 is 0.

The reason we distinguish between the ATE of Z1 versus the CATEs of Z1 depending on the level of Z2 is that the two factors may “interact.” When factors interact, the effects of Z1 are heterogeneous in the sense that they are different depending on Z2. We often care about the difference-in-CATEs inquiry because of theoretical settings in which the effects of one treatment are supposed to depend on the level of another treatment.

However, if we are not so interested in the difference-in-CATEs, then factorial experiments have another good justification – we can learn about the ATEs of each treatment for half price, in the sense that we apply treatments to the same subject pool using the same measurement strategy. Conjoint experiments, a kind of factorial design (discussed in Section 16.3), often target average treatment effects that average over the levels of the other factors.

Here we declare a factorial design with two treatments and a normally distributed outcome variable. We imagine that the CATE of Z1 when Z2 is zero equals 0.2 standard units, the CATE of Z2 when Z1 is zero equals 0.1, and the interaction of the two is 0.1 as well.

Declaration 17.7 2x2 Factorial design

CATE_Z1_Z2_0 <- 0.2
CATE_Z2_Z1_0 <- 0.1
interaction <- 0.1
N <- 1000

design <-
  declare_model(
    N = N,
    U = rnorm(N),
    potential_outcomes(Y ~ CATE_Z1_Z2_0 * Z1 +
                         CATE_Z2_Z1_0 * Z2 +
                         interaction * Z1 * Z2 + U,
                       conditions = list(Z1 = c(0, 1),
                                         Z2 = c(0, 1)))) +
  declare_inquiry(
    CATE_Z1_Z2_0 = mean(Y_Z1_1_Z2_0 - Y_Z1_0_Z2_0),
    CATE_Z1_Z2_1 = mean(Y_Z1_1_Z2_1 - Y_Z1_0_Z2_1),
    ATE_Z1 = 0.5 * CATE_Z1_Z2_0 + 0.5 * CATE_Z1_Z2_1,
    CATE_Z2_Z1_0 = mean(Y_Z1_0_Z2_1 - Y_Z1_0_Z2_0),
    CATE_Z2_Z1_1 = mean(Y_Z1_1_Z2_1 - Y_Z1_1_Z2_0),
    ATE_Z2 = 0.5 * CATE_Z2_Z1_0 + 0.5 * CATE_Z2_Z1_1,
    diff_in_CATEs_Z1 = CATE_Z1_Z2_1 - CATE_Z1_Z2_0,
    diff_in_CATEs_Z2 = CATE_Z2_Z1_1 - CATE_Z2_Z1_0
  ) + 
  declare_assignment(Z1 = complete_ra(N),
                     Z2 = block_ra(Z1)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z1 + Z2)) +
  declare_estimator(Y ~ Z1, subset = (Z2 == 0), 
                    inquiry = "CATE_Z1_Z2_0", label = 1) +
  declare_estimator(Y ~ Z1, subset = (Z2 == 1), 
                    inquiry = "CATE_Z1_Z2_1", label = 2) +
  declare_estimator(Y ~ Z2, subset = (Z1 == 0), 
                    inquiry = "CATE_Z2_Z1_0", label = 3) +
  declare_estimator(Y ~ Z2, subset = (Z1 == 1),
                    inquiry = "CATE_Z2_Z1_1", label = 4) +
  declare_estimator(Y ~ Z1 + Z2, term = c("Z1", "Z2"), 
                    inquiry = c("ATE_Z1", "ATE_Z2"), label = 5) +
  declare_estimator(Y ~ Z1 + Z2 + Z1*Z2, term = "Z1:Z2", 
                    inquiry = c("diff_in_CATEs_Z1", "diff_in_CATEs_Z2"), 
                    label = 6) 


We now redesign this factorial over many sample sizes, considering the statistical power for each of the inquiries. Figure 17.8 shows that depending on the inquiry, the statistical power of this design can vary dramatically. The average treatment effect of Z1 is relatively large at 0.25 standard units, so power is above the 80% threshold at all the sample sizes we consider. The ATE of Z2 is smaller, at 0.15 standard units, so power is lower, but not dramatically so. Both ATEs use all \(N\) data points, so power is manageable for the average effects. The conditional average effects generally fare worse, mainly because each is estimated on only half the sample. The power for the 0.1 standard unit difference-in-CATEs is abysmal at all sample sizes considered here.

designs <- redesign(design, N = seq(500, 3000, 500))
simulations <- simulate_design(designs, sims = 100)

Figure 17.8: Power for factorial inquiries
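A back-of-the-envelope calculation helps explain why power differs so much across inquiries. As a simplifying assumption, suppose the four treatment cells each contain \(N/4\) units and the outcome has the same variance \(\sigma^2\) in every cell (here \(\sigma = 1\), since \(U\) is standard normal). The approximate standard errors are then:

\[\begin{align*}
\mathrm{SE}(\widehat{\mathrm{ATE}}) &\approx \sqrt{\frac{\sigma^2}{N/2} + \frac{\sigma^2}{N/2}} = \frac{2\sigma}{\sqrt{N}} \\
\mathrm{SE}(\widehat{\mathrm{CATE}}) &\approx \sqrt{\frac{\sigma^2}{N/4} + \frac{\sigma^2}{N/4}} = \frac{2\sqrt{2}\,\sigma}{\sqrt{N}} \\
\mathrm{SE}(\widehat{\text{diff-in-CATEs}}) &\approx \sqrt{4 \times \frac{\sigma^2}{N/4}} = \frac{4\sigma}{\sqrt{N}}
\end{align*}\]

Under these assumptions, the difference-in-CATEs is estimated with twice the standard error of an ATE, and in this design it is also the smallest of the effects (0.1). At \(N = 1000\) its approximate standard error is about 0.126, larger than the effect itself.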

17.5.1 Avoiding misleading inferences

The very poor power for the difference-in-CATEs sometimes leads researchers to rely on a different answer strategy for considering whether the effects of Z1 depend on the level of Z2. Sometimes, researchers will consider the statistical significance of each of Z1’s CATEs separately, then conclude the CATEs are “different” if the effect is significant for one CATE but not the other. This is bad practice.

Here we diagnose over the true values of the Z1 ATE, setting the true interaction term to zero. Our diagnosis question is: how frequently do we conclude the two CATEs are different under each of two strategies? The first is the usual approach, i.e., we consider the statistical significance of the interaction term. The second considers whether one, but not the other, of the two CATE estimates is significant.

designs <- redesign(
  design,
  CATE_Z1_Z2_0 = seq(0, 0.5, 0.05),
  CATE_Z2_Z1_0 = 0.2,
  interaction = 0
)

simulations <- simulate_design(designs, sims = 500)

Figure 17.9 shows that the error rate when we consider the statistical significance of the interaction term is nominal. Only 5% of the time do we falsely reject the null that the difference-in-CATEs is zero. But when we claim “treatment effect heterogeneity!” when one CATE is significant but not the other, we make egregious errors. When the true (constant) average effect of Z1 approaches 0.2, we falsely conclude that there is heterogeneity nearly 50% of the time!

Figure 17.9: False conclusions of heterogeneity
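The roughly 50% figure has a simple approximate explanation. If each CATE test rejects with probability \(p\) and the two tests are treated as independent (an assumption of this sketch, motivated by the fact that they use disjoint halves of the sample), then exactly one of them is significant with probability \(2p(1 - p)\), which peaks at 0.5 when \(p = 0.5\). The base R code below applies a normal approximation with unit outcome variance and 1000 subjects split evenly across four cells; the object names are hypothetical.

```r
# Normal-approximation power of a two-sided 5% test
approx_power <- function(effect, se) {
  crit <- qnorm(0.975)
  pnorm(effect / se - crit) + pnorm(-effect / se - crit)
}

n <- 1000
# Each CATE compares two cells of n / 4 units with unit outcome variance
se_cate <- sqrt(1 / (n / 4) + 1 / (n / 4))

true_effect <- seq(0, 0.5, 0.05)   # constant effect of Z1, no interaction
p <- approx_power(true_effect, se_cate)

# Probability that exactly one of two independent tests is significant
p_one_not_other <- 2 * p * (1 - p)

data.frame(true_effect,
           power_each = round(p, 2),
           one_not_other = round(p_one_not_other, 2))
```

The false-heterogeneity rate is highest for true effects where each CATE test has roughly even odds of rejecting, which under these assumptions happens for effects a little below 0.2.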


  1. With 1000 subjects, how big does the interaction term need to be to achieve 80% power?
  2. Holding the interaction at 0.1, how many subjects do we need for the interaction term to achieve 80% power?

17.6 Encouragement designs

In many experimental settings, we can’t require units we assign to take treatment to actually take treatment. Nor can we require units assigned to the control group not to take treatment. Instead, we have to content ourselves with “encouraging” units assigned to the treatment group to take treatment and “encouraging” units assigned to the control group not to.

Encouragements are often only partially successful. Some units assigned to treatment refuse treatment and some units assigned to control find a way to obtain treatment after all. In these settings, we say that experiments encounter “noncompliance.” This section will describe the most common approach to the design and analysis of encouragement trials, and will point out potential pitfalls along the way.

Any time a data strategy entails contacting subjects in order to deliver a treatment like a bundle of information or some good, noncompliance is a potential problem. Emails go undelivered, unopened, and unread. Letters get lost in the mail. Phone calls are screened, text messages get blocked, direct messages on social media are ignored. People don’t come to the door when you knock, either because they aren’t home or they don’t trust strangers. Noncompliance can affect noninformational treatments as well: goods may be difficult to deliver to remote locations, subjects may refuse to participate in assigned experimental activities, or research staff might simply fail to respect the realized treatment schedule out of laziness or incompetence.

Experimenters who anticipate noncompliance should make compensating adjustments to their research designs (relative to the canonical two-arm design). These adjustments ripple through M, I, D, and A.

17.6.1 Changes to the model

The biggest change to M is developing beliefs about compliance types, also called “principal strata” (Frangakis and Rubin 2002). In a two-arm trial, subjects can be one of four compliance types, depending on how their treatment status responds to their treatment assignment. The four types are described in Table 17.3. \(D_i(Z = 0)\) is a potential outcome – it is the treatment status that unit \(i\) would express if assigned to control. Likewise, \(D_i(Z = 1)\) is the treatment status that unit \(i\) would express if assigned to treatment. These potential outcomes can each take on a value of 0 or 1, so their intersection allows for four types. For Always-takers, \(D_i\) is equal to 1 regardless of the value of \(Z\) – they always take treatment. Never-takers are the opposite – \(D_i\) is equal to 0 regardless of the value of \(Z\). For Always-takers and Never-takers, assignment to treatment does not change whether they take treatment.

Compliers are units that take treatment if assigned to treatment and do not take treatment if assigned to control. Their name “compliers” connotes that something about their disposition as subjects makes them “compliant” or otherwise docile, but this connotation is misleading. Compliance types are generated by the confluence of subject behavior and data strategy choices. Whether or not a subject answers the door when the canvasser comes calling is a function of many things, including whether the subject is at home and whether they open the door to canvassers. Data strategies that attempt to deliver treatment in the evenings and on weekends might generate more (or different) compliers than those that attempt treatment during working hours.

Table 17.3: Compliance types
Compliance Type | \(D_i(Z_i = 0)\) | \(D_i(Z_i = 1)\)
Never-taker | 0 | 0
Complier | 0 | 1
Defier | 1 | 0
Always-taker | 1 | 1

The last compliance type to describe is the defier. These strange birds refuse treatment when assigned to treatment, but find a way to obtain treatment when assigned to control. Whether or not “defiers” exist turns out to be a consequential assumption that must be made in the model. We have good reason to believe that defiers are rare – assignment to treatment almost always has a positive average effect on treatment take-up.

A unit’s compliance type is often not possible to observe directly. Subjects assigned to the control group who take treatment (\(D_i(0) = 1\)) could be defiers or always-takers. Subjects assigned to the treatment group who do not take treatment (\(D_i(1) = 0\)) could be defiers or never-takers. Our inability to be sure of compliance types is another facet of the fundamental problem of causal inference. Even though a subject’s compliance type (with respect to a given design) is a stable trait, it is defined by how the subject would act in multiple counterfactual worlds. We can’t tell what type a unit is because we would need to see whether they take treatment when assigned to treatment and also when assigned to control.
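The mapping from the two treatment-status potential outcomes to compliance types, and the ambiguity that arises once only one of them is revealed, can be written out as a short base R sketch. The function names `compliance_type` and `consistent_types` are hypothetical and exist only for this illustration.

```r
# Classify a unit from both treatment-status potential outcomes
compliance_type <- function(d0, d1) {
  ifelse(d0 == 0 & d1 == 0, "Never-taker",
  ifelse(d0 == 0 & d1 == 1, "Complier",
  ifelse(d0 == 1 & d1 == 0, "Defier", "Always-taker")))
}

# All four combinations of D_i(Z = 0) and D_i(Z = 1)
types <- expand.grid(d0 = c(0, 1), d1 = c(0, 1))
types$type <- compliance_type(types$d0, types$d1)
types

# Which types are consistent with what we actually observe for one unit,
# namely its assignment z and the treatment status d it revealed?
consistent_types <- function(z, d) {
  revealed <- if (z == 1) types$d1 else types$d0
  types$type[revealed == d]
}

consistent_types(z = 0, d = 1)  # assigned to control, took treatment
consistent_types(z = 1, d = 0)  # assigned to treatment, did not take it
```

Each observed combination of assignment and treatment status remains consistent with two types, which is why the type of an individual unit cannot be read off the data without further assumptions.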

17.6.2 Changes to the inquiry

The inclusion of noncompliance and compliance types in the model also necessitates changes to the inquiry. Always-takers and Never-takers present a real problem for causal inference. Even with the power to randomly assign, we can’t change what treatments these units take. As a result, we don’t get to learn about the effects of treatment among these groups. Even if our inquiry were the average effect of treatment among the never-takers, the experiment (as designed) would not be able to generate empirical estimates of it. Our inquiry has to fall back to the average effects among those units whose treatment status we can successfully encourage to change – the compliers.

This inquiry is called the complier average causal effect (the CACE). It is defined as \(\mathbb{E}[Y_i(1) - Y_i(0) | D_i(1) > D_i(0)]\). Just like the average treatment effect, it refers to an average over individual causal effects, but this average is taken over a specific subset of units, the compliers. Compliers are the only units for whom \(D_i(1) > D_i(0)\), because for compliers, \(D_i(1) = 1\) and \(D_i(0) = 0\). When assignments and treatments are binary, the CACE is mathematically identical to the local average treatment effect (LATE) described in Section 15.4 on instrumental variables. Whether we write CACE or LATE sometimes depends on academic discipline, with LATE being more common among economists. An advantage of “CACE” over “LATE” is that it is specific about which units the effect is “local” to – it is local to the compliers.

When experiments encounter noncompliance, the CACE is usually the most important inquiry for theory, since it refers to the average effect of the causal variable, at least for a subset of the units in the study. However, two other common inquiries are important to address here as well.

The first is the intention-to-treat (ITT) inquiry, which is defined as \(\mathbb{E}[Y_i(D_i(Z = 1), Z = 1) - Y_i(D_i(Z = 0), Z = 0)]\). The encouragement itself \(Z\) has a total effect on \(Y\) that is mediated in whole or in part by the treatment status. Sometimes the ITT is the policy-relevant inquiry, since it describes what would happen if a policy maker implemented the policy in the same way as the experiment, inclusive of noncompliance. Consider an encouragement design to study the effectiveness of a tax webinar on tax compliance. Even if the webinar is very effective among people willing to watch it (the CACE is large), the main trouble faced by the policy maker will be getting people to sit through the webinar. The ITT describes the average effect of inviting people to the webinar, which could be quite small if very few people are willing to join.

The second additional inquiry is the compliance rate, sometimes referred to as the \(\mathrm{ITT}_{\rm D}\). It describes the average effect of assignment on treatment, and is written \(\mathbb{E}[D_i(Z = 1) - D_i(Z = 0)]\). A small bit of algebra shows that the \(\mathrm{ITT}_{\rm D}\) is equal to the fraction of the sample that are compliers minus the fraction that are defiers.
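The algebra can be spelled out by averaging over the four compliance types in Table 17.3. The shares \(\pi_{\mathrm{NT}}\), \(\pi_{\mathrm{C}}\), \(\pi_{\mathrm{D}}\), and \(\pi_{\mathrm{AT}}\) (never-takers, compliers, defiers, and always-takers) are notation introduced here for illustration:

\[\begin{align*}
\mathrm{ITT}_{\mathrm{D}} &= \mathbb{E}[D_i(Z = 1) - D_i(Z = 0)] \\
&= \pi_{\mathrm{NT}} (0 - 0) + \pi_{\mathrm{C}} (1 - 0) + \pi_{\mathrm{D}} (0 - 1) + \pi_{\mathrm{AT}} (1 - 1) \\
&= \pi_{\mathrm{C}} - \pi_{\mathrm{D}}
\end{align*}\]

If there are no defiers, the \(\mathrm{ITT}_{\mathrm{D}}\) is simply the share of compliers.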

These three inquiries are tightly related. Under five very important assumptions (described below), we can write:

\[\begin{align*} \mathrm{CACE} = \frac{\mathrm{ITT}}{\mathrm{ITT}_{\mathrm{D}}} \end{align*}\]

A derivation of this relationship is given in Section 15.4 on instrumental variables. The five assumptions described in that section are identical to the assumptions required here. In an experimental setting, “exogeneity of the instrument” is guaranteed by features of the data strategy. Since we use random assignment, we know for sure that the “instrument” (the encouragement) is exogenous. Excludability of the instrument refers to the idea that the effect of the encouragement on the outcome is fully mediated by the treatment. This assumption could be violated if the mere act of encouragement changes outcomes. Stated differently, if never-takers or always-takers reveal different potential outcomes in treatment and control (\(Y_i(D_i(Z = 1), Z = 1) \neq Y_i(D_i(Z = 0), Z = 0)\)), it must be because encouragement itself changes outcomes. Non-interference in this setting means that units’ treatment status and outcomes do not depend on the assignment or treatment status of other units. In an experimental context, the assumption of monotonicity rules out the existence of defiers. This assumption is often made plausible by features of the data strategy (perhaps it is impossible for those who are not assigned to treatment to obtain treatment) or features of the model (“defiant” responses to encouragement are behaviorally unlikely). The final assumption – nonzero effect of the instrument on the treatment – can also be assured by features of the data strategy. In order to learn about the effects of treatment, data strategies must successfully encourage at least some units to take treatment.
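
As a brief reminder of how the assumptions enter, here is a sketch of the argument, writing \(\pi_{\mathrm{AT}}\), \(\pi_{\mathrm{NT}}\), \(\pi_{\mathrm{C}}\), and \(\pi_{\mathrm{DF}}\) for the shares of always-takers, never-takers, compliers, and defiers, and \(\mathrm{ITT}_{t}\) for the intention-to-treat effect within type \(t\) (all illustrative notation, not taken from Section 15.4). Excludability sets \(\mathrm{ITT}_{\mathrm{AT}} = \mathrm{ITT}_{\mathrm{NT}} = 0\), because assignment does not change treatment status for these types. Monotonicity sets \(\pi_{\mathrm{DF}} = 0\). For compliers, assignment moves treatment from 0 to 1, so \(\mathrm{ITT}_{\mathrm{C}}\) is the CACE:

\[\begin{align*} \mathrm{ITT} &= \pi_{\mathrm{AT}} \mathrm{ITT}_{\mathrm{AT}} + \pi_{\mathrm{NT}} \mathrm{ITT}_{\mathrm{NT}} + \pi_{\mathrm{C}} \mathrm{ITT}_{\mathrm{C}} + \pi_{\mathrm{DF}} \mathrm{ITT}_{\mathrm{DF}} = \pi_{\mathrm{C}} \times \mathrm{CACE} \\ \mathrm{ITT}_{\mathrm{D}} &= \pi_{\mathrm{C}} - \pi_{\mathrm{DF}} = \pi_{\mathrm{C}} \end{align*}\]

Dividing the first line by the second gives the ratio, which is well defined only when \(\pi_{\mathrm{C}} > 0\), the nonzero-effect assumption.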

17.6.3 Changes to the data strategy

When experimenters expect that noncompliance will be a problem, they should take steps to mitigate that problem in the data strategy. Sometimes doing so just means trying harder: investigating the patterns of noncompliance, attempting to deliver treatment on multiple occasions, or offering subjects incentives for participation. “Trying harder” is about turning more subjects into compliers by choosing a data strategy that encounters less noncompliance.

A second important change to the data strategy is the explicit measurement of treatment status as distinct from treatment assignment. For some designs, measuring treatment status is easy. We just record which units were treated and which were untreated. But in some settings, measuring compliance is trickier. For example, if treatments are emailed, we might never know if subjects read the email. Perhaps our email service will track read receipts, in which case one facet of this measurement problem is solved. We won’t know, however, how many subjects read the subject line – and if the subject line contains any treatment information, then even subjects who don’t click on the email may be “partially” treated. Our main advice is to measure compliance in the most conservative way: if treatment emails bounce altogether, then subjects are not treated.

In multi-arm trials or with continuous rather than binary instruments, noncompliance becomes a more complex problem to define and address through the data strategy and answer strategy. We must define complier types according to all of the (potentially infinite) possible treatment conditions. For multi-arm trials, the complier types for the first treatment may not be the same as those for the second treatment; in other words, units will comply at different rates with different treatments. Apparent differences in complier average treatment effects and intent-to-treat effects, as a result, may reflect not differences in treatment effects but different rates of compliance.

17.6.4 Changes to the answer strategy

Estimation of the CACE is not as straightforward as subsetting the analysis to compliers, because we do not know for sure who the compliers are. A plug-in estimator of the CACE with good properties takes the ratio of the \(\mathrm{ITT}\) estimate to the \(\mathrm{ITT}_{\mathrm{D}}\) estimate. Since, under monotonicity, the \(\mathrm{ITT}_{\mathrm{D}}\) must be a number between zero and one, this estimator “inflates” the \(\mathrm{ITT}\) by dividing it by the compliance rate. Another way of thinking about this is that the \(\mathrm{ITT}\) is deflated by all the never-takers and always-takers, among whom the \(\mathrm{ITT}\) is 0 under the excludability assumption, so instead of “inflating,” we are “re-inflating” the \(\mathrm{ITT}\) to the level of the CACE. Two-stage least squares in which we instrument the treatment with the random assignment is a numerically equivalent procedure when treatment and assignments are binary. Two-stage least squares has the further advantage of being able to seamlessly incorporate covariate information to increase precision.
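
The plug-in estimator can be sketched in a few lines of base R. This is a minimal illustration, not the book's declaration: the sample size, seed, and variable names are assumptions chosen for the example, and the potential outcomes borrow the type-specific intercepts and slopes used in Declaration 17.8 with no defiers. The last line computes the simple instrumental variables ratio `cov(Y, Z) / cov(D, Z)`, which matches the plug-in ratio when the assignment is binary.

```r
# Illustrative simulation (assumed values): one large experiment with two-sided noncompliance
set.seed(42)
N <- 100000
type <- sample(c("Always-Taker", "Never-Taker", "Complier"),
               size = N, replace = TRUE, prob = c(0.2, 0.2, 0.6))
U <- rnorm(N)

# Complete random assignment of the encouragement Z
Z <- sample(rep(c(0, 1), each = N / 2))

# Treatment status D depends on compliance type and assignment
D <- ifelse(type == "Always-Taker" | (type == "Complier" & Z == 1), 1, 0)

# Type-specific potential outcomes of Y with respect to D
intercept <- c("Always-Taker" = -0.25, "Never-Taker" = 0.75, "Complier" = 0.25)[type]
slope <- c("Always-Taker" = -0.50, "Never-Taker" = -0.25, "Complier" = 0.50)[type]
Y <- unname(intercept + slope * D + U)

# Plug-in estimator: ratio of the ITT estimate to the ITT_D estimate
itt_hat <- mean(Y[Z == 1]) - mean(Y[Z == 0])
itt_d_hat <- mean(D[Z == 1]) - mean(D[Z == 0])
cace_hat <- itt_hat / itt_d_hat

# Simple IV ratio, identical to the plug-in ratio for binary Z
iv_hat <- cov(Y, Z) / cov(D, Z)

c(ITT = itt_hat, ITT_D = itt_d_hat, CACE = cace_hat, IV = iv_hat)
```

In this setup the true CACE is 0.5 and the complier share is 0.6, so the ITT is about 0.3 before it is re-inflated.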

Two alternative answer strategies are biased and should be avoided. An “as-treated” analysis ignores the encouragement \(Z\) and instead compares units by their revealed treatment status \(D\). This procedure is prone to bias because those who come to be treated may differ systematically from those who do not. The “per-protocol” analysis is similarly biased. It drops any unit that fails to comply with its assignment, but those who take treatment in the treatment group (compliers and always-takers) may differ systematically from those who do not take treatment in the control group (compliers and never-takers). Both the “as-treated” and “per-protocol” answer strategies suffer from a special case of post-treatment bias, wherein conditioning on a post-assignment variable (treatment status) essentially de-randomizes the study.
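
To see the bias concretely, here is a small noise-free sketch in base R. It is an illustration under stated assumptions rather than a diagnosis: a hypothetical 20-unit population with 20% always-takers, 20% never-takers, and 60% compliers, an assignment that treats exactly half of each type, the type-specific outcome equations from Declaration 17.8, and no unobserved heterogeneity. The helper `diff_in_means()` is defined here for the example.

```r
# Hypothetical noise-free population: 4 always-takers, 4 never-takers, 12 compliers
type <- rep(c("Always-Taker", "Never-Taker", "Complier"), times = c(4, 4, 12))

# Alternating assignment, so exactly half of each type is encouraged
Z <- rep(c(1, 0), times = 10)

# Revealed treatment status
D <- as.numeric(type == "Always-Taker" | (type == "Complier" & Z == 1))

# Revealed outcomes, with U set to zero for clarity
Y <- ifelse(type == "Always-Taker", -0.25 - 0.50 * D,
            ifelse(type == "Never-Taker", 0.75 - 0.25 * D,
                   0.25 + 0.50 * D))

diff_in_means <- function(y, g) mean(y[g == 1]) - mean(y[g == 0])

# Ratio of ITT to ITT_D: recovers the complier effect of 0.5
wald <- diff_in_means(Y, Z) / diff_in_means(D, Z)

# As-treated: compare by revealed treatment status, ignoring Z
as_treated <- diff_in_means(Y, D)

# Per-protocol: drop units whose treatment status differs from their assignment
keep <- D == Z
per_protocol <- diff_in_means(Y[keep], D[keep])

c(wald = wald, as_treated = as_treated, per_protocol = per_protocol)
```

With these assumed values the ratio estimator equals 0.5, the as-treated comparison equals -0.3, and the per-protocol comparison equals 0, even though there is no sampling noise at all. The distortion comes entirely from comparing different mixes of compliance types.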

Declaration 17.8 elaborates the model to include the four compliance types, setting the share of defiers to zero to match the assumption of monotonicity. It imagines that the potential outcomes of the outcome \(Y\) with respect to the treatment \(D\) are different for each compliance type, reflecting the idea that compliance type could be correlated with potential outcomes. The declaration also links compliance type to the potential outcomes of the treatment \(D\) with respect to the randomized encouragement \(Z\). We then move on to declaring two inquiries (the CACE and the ATE) and three answer strategies (two-stage least squares, as-treated analysis, and per-protocol analysis).

Declaration 17.8

library(DeclareDesign)  # also attaches randomizr, fabricatr, and estimatr
library(dplyr)          # for case_when() and if_else()

design <-
  declare_model(
    N = 100,
    type = 
      rep(c("Always-Taker", "Never-Taker", "Complier", "Defier"),
          c(0.2, 0.2, 0.6, 0.0)*N),
    U = rnorm(N),
    # potential outcomes of Y with respect to D
    potential_outcomes(
      Y ~ case_when(
        type == "Always-Taker" ~ -0.25 - 0.50 * D + U,
        type == "Never-Taker" ~ 0.75 - 0.25 * D + U,
        type == "Complier" ~ 0.25 + 0.50 * D + U,
        type == "Defier" ~ -0.25 - 0.50 * D + U
      ),
      conditions = list(D = c(0, 1))
    ),
    # potential outcomes of D with respect to Z
    potential_outcomes(
      D ~ case_when(
        Z == 1 & type %in% c("Always-Taker", "Complier") ~ 1,
        Z == 1 & type %in% c("Never-Taker", "Defier") ~ 0,
        Z == 0 & type %in% c("Never-Taker", "Complier") ~ 0,
        Z == 0 & type %in% c("Always-Taker", "Defier") ~ 1
      ),
      conditions = list(Z = c(0, 1))
    )
  ) +
  declare_inquiry(
    ATE = mean(Y_D_1 - Y_D_0),
    CACE = mean(Y_D_1[type == "Complier"] - Y_D_0[type == "Complier"])) +
  declare_assignment(Z = conduct_ra(N = N)) +
  declare_measurement(D = reveal_outcomes(D ~ Z),
                      Y = reveal_outcomes(Y ~ D)) +
  # instrument the treatment D with the random assignment Z
  declare_estimator(
    Y ~ D | Z,
    model = iv_robust,
    inquiry = c("ATE", "CACE"),
    label = "Two stage least squares"
  ) +
  # compare units by revealed treatment status, ignoring Z
  declare_estimator(
    Y ~ D,
    model = lm_robust,
    inquiry = c("ATE", "CACE"),
    label = "As treated"
  ) +
  # drop units whose treatment status differs from their assignment
  declare_estimator(
    Y ~ D,
    model = lm_robust,
    inquiry = c("ATE", "CACE"),
    subset = D == Z,
    label = "Per protocol"
  )


Figure 17.10 represents the encouragement design as a DAG. No arrows lead into \(Z\), because the treatment was randomly assigned. The compliance type \(C\), the assignment \(Z\), and unobserved heterogeneity \(U\) conspire to set the level of \(D\). The outcome \(Y\) is affected by the treatment \(D\) of course, but also by compliance type \(C\) and unobserved heterogeneity \(U\). The required exclusion restriction that \(Z\) only affect \(Y\) through \(D\) is represented by the lack of an arrow from \(Z\) to \(Y\). The deficiencies of the as-treated and per-protocol analysis strategies can be learned from the DAG as well. \(D\) is a collider, so conditioning on it will open up back-door paths between \(Z\), \(C\), and \(U\), leading to bias of unknown direction and magnitude.

Figure 17.10: DAG of the encouragement design

The design diagnosis shows the sampling distributions of the three answer strategies and compares them to two potential inquiries: the complier average causal effect and the average treatment effect. Our preferred method, two-stage least squares, is biased for the ATE. Because we can’t learn about the effects of treatment among never-takers or always-takers, any estimate of the true ATE will necessarily be prone to bias, except in the happy circumstance that never-takers and always-takers happen to be just like compliers.

Two-stage least squares does a much better job of estimating the complier average causal effect. Even though the sampling distribution is wider than those for the per-protocol and as-treated analysis, it is at least centered on a well-defined inquiry. By contrast, the other two answer strategies are biased for either target.

diagnosis <- diagnose_design(design, sims = sims, bootstrap_sims = b_sims)

Figure 17.11: Sampling distribution of three estimators

17.7 Placebo-controlled experiments

In common usage, a placebo is a treatment that carries with it everything about the bona fide treatment – except the active ingredient. We’re used to thinking about placebos in terms of the “placebo effect” in medical trials. Some portion of the total effect of the actual treatment is due to the mere act of getting any treatment, so the administration of placebo treatments can difference this portion off. Placebo-controlled designs abound in the social sciences too, for similar purposes (see Porter and Velez (2021)). Media treatments often work through a bundle of priming effects and new information; a placebo treatment might include only the prime but not the information. The main use of placebos is to difference off the many small excludability violations involved in bundled treatments, the better to understand the main causal variable of interest.

In this chapter, we study the use of placebos for a different purpose: to combat the negative design consequences of noncompliance in experiments. As described in the previous chapter, a challenge for experiments that encounter noncompliance is that we do not know for sure who the compliers are. Compliers are units that would take treatment if assigned to treatment, but would not do so if assigned to control. Compliers are different from always-takers and never-takers in that assignment to treatment actually changes which potential outcome they reveal.

In the placebo-controlled design, we attempt to deliver a real treatment to the treatment group and a placebo treatment to the placebo group, then we conduct our analysis among those units that accept either treatment. This design solves two problems at once. First, it lets us answer a descriptive question: “Who are the compliers?” Second, it lets us answer a causal question: “What is the average effect of treatment among compliers?”

Employing a placebo control can seem like an odd design choice – you go to all the effort of contacting a unit but at the very moment you get in touch, you deliver a placebo message instead of the treatment message. It turns out that despite this apparent waste, the placebo-controlled design can often lead to more precise estimates than the standard encouragement design. Whether it does or not depends in large part on the underlying compliance rate.

Declaration 17.9 actually includes two separate designs. Here we’ll directly compare the standard encouragement design to the placebo controlled design. They have identical theoretical halves, so we’ll just declare those once, before declaring the specifics of the empirical strategies for each design.

Declaration 17.9

compliance_rate <- 0.2

# shared model and inquiry for both designs
MI <-
  declare_model(
    N = 400,
    type = sample(x = c("Never-Taker", "Complier"), 
                  size = N,
                  prob = c(1 - compliance_rate, compliance_rate),
                  replace = TRUE),
    U = rnorm(N),
    # potential outcomes of Y with respect to D
    potential_outcomes(
      Y ~ case_when(
        type == "Never-Taker" ~ 0.75 - 0.25 * D + U,
        type == "Complier" ~ 0.25 + 0.50 * D + U
      ),
      conditions = list(D = c(0, 1))
    ),
    # potential outcomes of D with respect to Z
    potential_outcomes(
      D ~ if_else(Z == 1 & type == "Complier", 1, 0),
      conditions = list(Z = c(0, 1))
    )
  ) +
  declare_inquiry(CACE = mean(Y_D_1[type == "Complier"] - Y_D_0[type == "Complier"]))

Here again are the data and answer strategies for the encouragement design (simplified from the previous chapter to focus on the one-sided compliance case). We conduct a random assignment among all units, then reveal treatment status and outcomes according to the potential outcomes declared in the model. The two-stage least squares estimator operates on all \(N\) units to generate estimates of the CACE.

encouragement_design <-
  MI +
  declare_assignment(Z = conduct_ra(N = N)) +
  declare_measurement(D = reveal_outcomes(D ~ Z),
                      Y = reveal_outcomes(Y ~ D)) +
  declare_estimator(
    Y ~ D | Z,
    model = iv_robust,
    inquiry = "CACE",
    label = "2SLS among all units"
  )

By contrast, here are the data and answer strategies for the placebo-controlled design. In a typical canvassing experiment setting, the expensive part is sending canvassing teams to each household, regardless of whether a treatment or a placebo message is delivered when the door opens. So in order to keep things “fair” across the placebo-controlled and encouragement designs, we’re going to hold fixed the number of treatment attempts – the sampling step subsets to the same \(N/2\) units that we will attempt to contact. Then among that subset, we conduct a random assignment to treatment or placebo. When we attempt to deliver the placebo or the treatment, we will either succeed or fail, which gives us a direct measure of whether a unit is a complier. This measurement is represented in the declare_measurement step, where an observable X now corresponds to compliance type. We conduct our estimation directly conditioning on the subset of the sample we have measured to be compliers.

placebo_controlled_design <-
  MI +
  declare_sampling(S = complete_rs(N)) +
  declare_assignment(Z = conduct_ra(N = N)) +
  # X records whether the contact attempt succeeded, revealing compliance type
  declare_measurement(X = if_else(type == "Complier", 1, 0),
                      D = reveal_outcomes(D ~ Z),
                      Y = reveal_outcomes(Y ~ D)) +
  declare_estimator(
    Y ~ Z,
    subset = X == 1,
    model = lm_robust,
    inquiry = "CACE",
    label = "OLS among compliers"
  )


We diagnose both the encouragement design and the placebo-controlled design over a range of possible levels of noncompliance, focusing on the standard deviation of the estimates (the standard error) as our main diagnosand. Figure 17.12 shows the results of the diagnosis. At high levels of compliance, the standard encouragement design actually outperforms the placebo-controlled design. But when compliance is low, the placebo controlled design is preferred. Which is preferable in any particular scenario will depend on the compliance rate as well as other design features like the total number of attempts and the fraction treated.
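
The precision comparison can be approximated with a small base R Monte Carlo. This is a rough sketch, not the diagnosis behind the figure: the function name `simulate_once()`, the seed, and the choice of 500 simulations are assumptions made for illustration, and the ratio estimator stands in for two-stage least squares. It uses the potential outcomes from Declaration 17.9 at a compliance rate of 0.2.

```r
set.seed(1)

# One draw of both designs from the same model (hypothetical helper)
simulate_once <- function(N = 400, compliance_rate = 0.2) {
  type <- sample(c("Never-Taker", "Complier"), N, replace = TRUE,
                 prob = c(1 - compliance_rate, compliance_rate))
  complier <- type == "Complier"
  U <- rnorm(N)
  # revealed outcome for a given vector of treatment statuses
  y_of <- function(D) ifelse(complier, 0.25 + 0.50 * D, 0.75 - 0.25 * D) + U

  # Encouragement design: assign all N units, estimate ITT / ITT_D
  Z <- sample(rep(c(0, 1), length.out = N))
  D <- as.numeric(Z == 1 & complier)
  Y <- y_of(D)
  encouragement <- (mean(Y[Z == 1]) - mean(Y[Z == 0])) /
    (mean(D[Z == 1]) - mean(D[Z == 0]))

  # Placebo-controlled design: attempt N / 2 units, analyze measured compliers
  S <- sample(N, N / 2)
  Zp <- sample(rep(c(0, 1), length.out = N / 2))
  Xp <- complier[S]
  Dp <- rep(0, N)
  Dp[S] <- as.numeric(Zp == 1 & Xp)
  Yp <- y_of(Dp)[S]
  placebo <- mean(Yp[Zp == 1 & Xp]) - mean(Yp[Zp == 0 & Xp])

  c(encouragement = encouragement, placebo = placebo)
}

estimates <- replicate(500, simulate_once())

# Standard deviation of the estimates under each design
sds <- apply(estimates, 1, sd)
sds
```

Changing `compliance_rate` in the call to `simulate_once()` traces out how the ranking of the two designs depends on compliance.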

Figure 17.12: Comparison of the placebo-controlled design to a standard encouragement design

17.8 Stepped-wedge experiments

We often face an ethical dilemma in allocating treatments to some units but not others, since we would rather not withhold treatment from anyone. However, practical constraints often make it impossible to allocate treatments to everyone at the same time. In these circumstances, a stepped-wedge experiment can help. Under a stepped-wedge design, we follow an allocation rule that randomly assigns a portion of units to treatment in each of one or more periods, and then in a final period, everyone is allocated treatment. We conduct posttreatment measurement after each period except for the last one. Figure 17.13 illustrates the allocation procedure. A common design allocates one third of units to treatment in the first period, an additional third in the second period, and the remaining third in the third and final period.

Declaration 17.10

Figure 17.13: Illustration of random assignment in a stepped-wedge design.


Our model describes unit-specific effects, time-specific effects, and time trends in the potential outcomes. In the data strategy, we allocate treatment for 100 units in three time periods, following the 1/3, 2/3, 3/3 allocation rule.

We assign treatment by randomly assigning the wave in which each unit will begin to receive treatment. We use cluster assignment at the unit level because the data are at the unit-period level. We then transform this wave variable into a unit-period treatment indicator that equals one if the time period is at or after the unit’s treatment wave.
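
The assignment logic can be sketched in base R without any packages. The snippet below is an illustration with assumed names (`wave`, `panel`, `analysis`) and an assumed seed; it mimics the 1/3, 2/3, 3/3 allocation rule for 100 units and three periods, and then drops the final period as the answer strategy does.

```r
set.seed(3)
n_units <- 100
n_periods <- 3

# Randomly assign each unit the wave in which it starts treatment (roughly equal thirds)
wave <- sample(rep(seq_len(n_periods), length.out = n_units))

# Expand to unit-period data and build the treatment indicator
panel <- expand.grid(unit = seq_len(n_units), time = seq_len(n_periods))
panel$wave <- wave[panel$unit]
panel$Z <- as.numeric(panel$time >= panel$wave)

# Share of units treated in each period
share_treated <- tapply(panel$Z, panel$time, mean)
share_treated

# Estimation uses only the periods before the last one
analysis <- panel[panel$time < n_periods, ]
nrow(analysis)
```

With 100 units the waves cannot be exactly equal, so the shares treated are 0.34, 0.67, and 1 across the three periods, and the analysis data contain 200 unit-period measurements.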

Our inquiry is the average treatment effect among time periods before the last period. In the stepped-wedge design, we don’t obtain information about the control potential outcome in the final period. Our answer strategy also only uses the data from the first two periods (in reality, we probably would not collect outcome data after the last period for this reason). We fit a two-way fixed effects regression model by periods and units with standard errors clustered at the unit level.

We show in the difference-in-differences design entry in Section 15.3 that under a very similar model, the two-way fixed effects estimator is biased for the average treatment effect on the treated in the presence of treatment effect by time interactions. The differences between the designs are twofold. Here, we randomize treatment, rather than using observational data with confounded treatment assignment, so we do not need to make the parallel trends assumption. Our diagnosis below will show no bias in estimating the average treatment effect with the two-way fixed effects estimator.

Declaration 17.11

# effect_size is assumed to be defined before the design is declared
design <-
  declare_model(
    units = add_level(
      N = 100, 
      U_unit = rnorm(N)
    ),
    periods = add_level(
      N = 3,
      time = 1:max(periods),
      U_time = rnorm(N),
      nest = FALSE
    ),
    unit_period = cross_levels(
      by = join(units, periods),
      U = rnorm(N),
      potential_outcomes(
        Y ~ scale(U_unit + U_time + time + U) + effect_size * Z
      )
    )
  ) +
  declare_assignment(
    # wave in which each unit starts treatment, assigned at the unit level
    wave = cluster_ra(clusters = units, conditions = 1:max(periods)),
    # treated in every period at or after the assigned wave
    Z = if_else(time >= wave, 1, 0)
  ) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0), subset = time < max(time)) + 
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, fixed_effects = ~ periods + units, 
                    clusters = units, 
                    subset = time < max(time), 
                    inquiry = "ATE", label = "TWFE")


17.8.1 When to use a stepped wedge experiment

Compared to the equivalent two-arm randomized experiment, a stepped-wedge experiment involves the same number of units, but more treatment (all versus half) and more measurement (all units are measured at least twice). The decision of whether to adopt the stepped-wedge design, then, rides on your budget, the relative costs of measurement and treatment, ethical and logistical constraints such as the imperative to treat all units, and your beliefs about effect sizes relative to measurement noise in your outcomes.

We compare the stepped-wedge design to a two-arm randomized experiment with varying sample sizes to assess these tradeoffs. In particular, we examine a study with the same number of units, which would be the relevant comparison if the main constraint is you cannot increase the number of units in the study. The second comparison is a two-arm experiment with double the number of units, which would be the right comparison if you can increase the number of units but have a fixed budget for measurement and treatment allocation. We summarize each design in terms of the number of study units, the number that are treated, and the number of unit measurements taken.

Table 17.4: Design parameters in the comparison between stepped-wedge and two-arm experimental designs.
Design           N (units)   m (treated)   n (measurements)
Stepped-wedge    100         100           200
Two-arm v1       100         50            100
Two-arm v2       200         100           200

We declare a comparable two-arm experimental design below, with the wrinkle being that the estimand is slightly different by necessity. In the stepped-wedge design, we target the average treatment effect averaging over all periods up to the penultimate one, because there is no information about the control group from the last period. In a single period design, by its nature, we cannot average over time. We would obtain a biased answer if we targeted an out-of-sample time period. The average treatment effect we target is the current-period ATE for the period that is chosen. We cannot extrapolate beyond that if treatment effects vary over time. If you expect time heterogeneity in effects, you may not want to use a stepped-wedge design but instead design a new experiment that efficiently targets the conditional average treatment effects within each period. Then you could describe both the average effect and how effects vary over time.

Declaration 17.12

# n_units and effect_size are assumed to be defined before the design is declared
design_single_period <-
  declare_model(
      N = n_units, 
      U_unit = rnorm(N),
      U = rnorm(N),
      effect_size = effect_size,
      potential_outcomes(Y ~ scale(U_unit + U) + effect_size * Z)
  ) +
  declare_assignment(Z = complete_ra(N, m = n_units / 2)) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) + 
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE", label = "TWFE")


We plot power curves for the three comparison designs in Figure 17.14. The top line (blue dashed) is the 200-unit study, which is preferred in terms of power, and by a considerable margin. That design involves the same amount of measurement and treatment as the stepped-wedge design, so it may cost the same. Under these beliefs about the model, the only reasons to adopt the stepped-wedge design would be ethical constraints, such as an imperative to treat all units, or logistical constraints, such as having only 100 eligible units to study. However, the stepped-wedge design here delivers much more power than a two-arm experiment with only 100 units. That two-arm design involves half as much measurement and treats half as many units, and it may in some cases be logistically less complicated.

Figure 17.14: Power analysis of three designs: stepped wedge with 100 units and 1/3-1/3-1/3 allocation, two-arm experiment with 100 units, and two-arm experiment with 200 units.

17.9 Crossover experiments

In empirical research, we can only ever access one potential outcome for any given unit, because only one can be revealed at a time. As a result, we cannot estimate the treatment effect for any individual, which would require knowing both the treatment potential outcome and the untreated potential outcome. We only get one. Experiments typically address this fundamental problem of causal inference by randomly assigning different units to reveal different potential outcomes (see Principle 3.5).

An appealing possibility is to do the same thing, but within units. If we could assign a unit to reveal its treated potential outcome in one period and its untreated potential outcome in the next period, we could subtract the two realized outcomes and obtain that unit’s individual treatment effect. With that effect in hand for all units, we could efficiently explore how treatment effects vary across units, as well as the average effect.

The crossover design is founded on that possibility. In the crossover design, we block randomize units to receive treatment and control over two periods (the blocks are the units). Each unit either receives treatment first then control or control first then treatment. We collect endline outcome data after the first period and after the second period. The treatment that is randomized in the design must be one in which once you receive the treatment, you can be untreated again in future periods – it is not sticky. A treatment that could be randomized is providing a time-delimited voucher in one period that you cannot use in the next period. There still could be effects of treatment in the first period that persist (“carryover” effects), but in the second period you are untreated. An example of a treatment that could not be randomized in this design would be a fixed asset like a plough; once you receive it, you have it, so in the second period you still have the plough and are in this sense still treated. For such sticky treatments, a stepped-wedge design might be more appropriate (Section 17.8).

In order to declare a crossover design, we need to redefine potential outcomes, because of the possibility of carryover effects. We allow a unit’s potential outcomes to be a function not only of whether it is treated now, in the current period, but also of whether it was treated in the past period. In particular, we consider three potential outcomes: the outcome a unit would have if untreated in the past period and untreated now, if untreated in the past but treated now, and if treated in the past but untreated now. The fourth cell of the two-by-two of current and past treatment statuses, treated in the past and treated now, is never revealed under this design and is not needed for the inquiry, so we don’t define it in the declaration. We define our potential outcomes such that there is a 0.2 treatment effect, and there is the possibility of a carryover effect on the outcomes of units that were treated in the preceding period. In the redesign, we explore no carryover effects (carryover = 0) up through large carryover effects representing the treatment effect remaining the same into the next period for treated units (carryover = 1).

We make several other changes for this design from a standard two-arm experiment. The potential outcome redefinition means that we need to redefine our average treatment effect inquiry too: which two outcomes are we differencing? We choose to focus on the two that involve an untreated period beforehand, and so the average difference is between if you are treated now and untreated now. We measure outcomes that are a function of both current and past treatment. Our answer strategy is an OLS regression with fixed effects for units and standard errors clustered on unit. We explore in the exercises why clustering is needed.
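
One way to build intuition for this answer strategy is to note that, with two periods and exactly one treated period per unit, the unit fixed-effects coefficient on the treatment is numerically the average within-unit difference between the treated-period and untreated-period outcomes. The base R sketch below checks this on simulated data. It is an illustration under assumptions: it uses `lm()` with unit dummies rather than the `lm_robust` call in the declaration, and the seed, the variable names, and `carryover = 1` are choices made for the example.

```r
set.seed(5)
n_units <- 100
carryover <- 1  # assumed value for illustration

# Two rows per unit: period 1 and period 2
unit <- rep(seq_len(n_units), each = 2)
time <- rep(1:2, times = n_units)
U_unit <- rep(rnorm(n_units, sd = 5), each = 2)
U_time <- rnorm(2)[time]
U <- rnorm(2 * n_units)

# Each unit is randomly assigned to be treated first or treated second
treated_first <- rep(rbinom(n_units, 1, 0.5), each = 2)
Z <- ifelse(time == 1, treated_first, 1 - treated_first)
Zlag <- as.numeric(time == 2 & Z == 0)

# Revealed outcomes as a function of current and past treatment
Y_untreated <- U_time + U_unit + U
Y <- Y_untreated + 0.2 * Z + 0.2 * carryover * Zlag

# Unit fixed-effects regression
fit <- lm(Y ~ Z + factor(unit))
fe_estimate <- coef(fit)[["Z"]]

# Average within-unit difference: treated period minus untreated period
within_diff <- tapply(seq_along(Y), unit,
                      function(i) Y[i][Z[i] == 1] - Y[i][Z[i] == 0])
mean_within_diff <- mean(within_diff)

c(fixed_effects = fe_estimate, within_unit = mean_within_diff)
```

The two numbers agree up to floating-point error, which makes it easier to reason about where carryover effects enter the estimate.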

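Declaration 17.13

# carryover is assumed to be defined before the design is declared
design <-
  declare_model(
    units = add_level(
      N = 100, 
      U_unit = rnorm(N, sd = 5)
    ),
    periods = add_level(
      N = 2, 
      time = 1:2, 
      U_time = rnorm(N), 
      nest = FALSE
    ),
    unit_periods = cross_levels(
      by = join(units, periods), 
      U = rnorm(N),
      # untreated before, untreated now
      Y_Z_0_Zlag_0 = U_time + U_unit + U,
      # untreated before, treated now
      Y_Z_1_Zlag_0 = Y_Z_0_Zlag_0 + 0.2,
      # treated before, untreated now
      Y_Z_0_Zlag_1 = Y_Z_0_Zlag_0 + 0.2 * carryover
    )
  ) +
  declare_inquiry(
    ATE_untreated_before = mean(Y_Z_1_Zlag_0 - Y_Z_0_Zlag_0)
  ) + 
  declare_assignment(
    # one treated and one untreated period within each unit
    Z = block_ra(blocks = units, prob = 0.5),
    # untreated in period 2 implies treated in period 1
    Zlag = if_else(time == 2 & Z == 0, 1, 0)
  ) + 
  declare_measurement(Y = reveal_outcomes(Y ~ Z + Zlag)) +
  declare_estimator(
    Y ~ Z, 
    clusters = units,
    fixed_effects = ~units,
    model = lm_robust, 
    inquiry = "ATE_untreated_before"
  )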


Figure 17.15: Bias in estimates of the average treatment effect if untreated before as a function of the magnitude of carryover effects from none (far left) to carryover effects the same size as the average treatment effect if untreated.

In Figure 17.15, we show the results of the redesign exercise. When there are no carryover effects, the design is unbiased for the average treatment effect if untreated before. However, as carryover effects increase, bias increases. The bias does not reach the magnitude of the estimand; in fact, at its largest it is about half that magnitude. The reason is that half the units are assigned to the control-first, treated-second sequence. For these units, there is no bias because in the first period they reveal the untreated-then-untreated potential outcome and in the second period they reveal the untreated-then-treated potential outcome. Thus, they have no risk of carryover effects. The bias comes from the other half of the units, which move from treated to untreated.
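
The "about half" result can be sketched algebraically. Write \(\tau = 0.2\) for the treatment effect, \(c\) for the carryover parameter, and \(\Delta_i\) for unit \(i\)'s treated-period outcome minus its untreated-period outcome. These symbols, and the two-argument potential outcome \(Y_{it}(Z, Z_{\mathrm{lag}})\), are introduced here for illustration. Let \(\delta_i = Y_{i2}(0, 0) - Y_{i1}(0, 0)\) be the change in the fully untreated outcome between periods.

\[\begin{align*} \text{Control first:} \quad \Delta_i &= Y_{i2}(1, 0) - Y_{i1}(0, 0) = \tau + \delta_i \\ \text{Treated first:} \quad \Delta_i &= Y_{i1}(1, 0) - Y_{i2}(0, 1) = \tau - c\tau - \delta_i \\ \mathbb{E}[\Delta_i] &= \tfrac{1}{2}(\tau + \mathbb{E}[\delta_i]) + \tfrac{1}{2}(\tau - c\tau - \mathbb{E}[\delta_i]) = \tau - \tfrac{c\tau}{2} \end{align*}\]

Because each sequence is assigned with probability one half, the period-to-period changes cancel in expectation and the bias is \(-c\tau/2\). At carryover = 1 this is \(-0.1\), half the size of the 0.2 estimand.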

17.10 Randomized saturation experiments

We study most treatments at an isolated, atomized, individualistic level. We define potential outcomes with respect to a unit’s own treatment status, ignoring the treatment status of all other units in the study. Accordingly, our inquiries tend to be averages of individual-level causal effects and our data strategies tend to assign treatments at the individual level as well. All of the experimental designs we have considered to this point have been of this flavor.

However, when the potential outcome that a unit reveals depends on the treatment status of other units, then we have to make adjustments to every part of the design. We have to redefine the model M to specify what potential outcomes are possible. Under a no-spillover model, we might only have the treated and untreated potential outcomes \(Y_i(1)\) and \(Y_i(0)\). But under spillover models, we have to expand the set of possibilities. For example, we might imagine that unit \(i\)’s potential outcomes can be written as a function of their own treatment status and that of their housemate, unit \(j\): \(Y_i(Z_i, Z_j)\). We have to redefine our inquiry I with respect to those reimagined potential outcomes. The average treatment effect is typically defined as \(\mathbb{E}[Y_i(1) - Y_i(0)]\), but if \(Y_i(1)\) and \(Y_i(0)\) no longer exist, we need to choose a new inquiry, like the average direct effect of treatment when unit \(j\) is not treated: \(\mathbb{E}[Y_i(1, 0) - Y_i(0, 0)]\). We have to alter our data strategy D so that the randomization procedure produces healthy samples of all of the potential outcomes involved in the inquiries, and we have to amend our answer strategy A to account for the important features of the randomization protocol.

We divide up our investigation of experimental designs to learn about spillovers into two sets. This chapter addresses randomized saturation designs, which are appropriate when we can exploit a hierarchical clustering of subjects into groups within which spillover can occur but across which spillover can’t occur. The next chapter addresses experiments over networks, which are appropriate when spillover occurs over geographic, temporal, or social networks.

The randomized saturation design (sometimes called the partial population design, as in Baird et al. (2018)) is purpose-built for scenarios in which we have good reason to imagine that a unit’s potential outcomes depend on the fraction of treated units within the same cluster. For example, we might want to consider the fraction of people within a neighborhood assigned to receive a vaccine: a person’s health outcomes could easily depend on whether two-thirds or one-third of neighbors have been treated.

In the model, we now have to define potential outcomes with respect to both the individual-level treatment and also the saturation level. We can imagine a variety of different kinds of potential outcomes functions. Consider the vaccine example, imagining a 100% effective vaccine against infection. Directly treated individuals never contract the illness, but the probability of infection for untreated units depends on the fraction who are treated nearby. If the treatment is a persuasive message to vote for a particular candidate, we might imagine that direct treatment is ineffective when only a few people around you hear the message, but becomes much more effective when many people hear the message at the same time. The main challenge in developing intuitions about complex interactions like this is articulating the discrete potential outcomes that each subject could express, then reasoning about the plausible values for each potential outcome.

The randomized saturation design is a factorial design of sorts, and like any factorial design can support a wide range of inquiries. We can describe the average effect of direct treatment at low saturation, at high saturation, the average of the two, or the difference between the two. Similarly, we could describe the average effect of high versus low saturation among the untreated, among the treated, the average of the two, or the difference between the two. In some settings, all eight of these inquiries might be appropriate to report, in others just a subset.
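As a sketch of how the eight inquiries relate to one another, the following base R code computes each from four average potential outcomes. The four cell means are hypothetical numbers chosen for illustration.

```r
# Hypothetical average potential outcomes in the four cells of the design;
# the values are illustrative assumptions, not estimates from any study
y_Z0_low  <- 0.1
y_Z1_low  <- 0.3
y_Z0_high <- 0.5
y_Z1_high <- 0.8

direct_low    <- y_Z1_low  - y_Z0_low
direct_high   <- y_Z1_high - y_Z0_high
sat_untreated <- y_Z0_high - y_Z0_low
sat_treated   <- y_Z1_high - y_Z1_low

inquiries <- c(
  direct_low            = direct_low,
  direct_high           = direct_high,
  direct_average        = (direct_low + direct_high) / 2,
  direct_difference     = direct_high - direct_low,
  saturation_untreated  = sat_untreated,
  saturation_treated    = sat_treated,
  saturation_average    = (sat_untreated + sat_treated) / 2,
  saturation_difference = sat_treated - sat_untreated
)
print(inquiries)
```

The two difference inquiries are algebraically identical: both equal the interaction between direct treatment and saturation, so they always take the same value.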

The design employs a two-stage data strategy. First, pre-defined clusters of units are randomly assigned to treatment saturation levels, for example 25% or 75%. Then, in each cluster, individual units are assigned to treatment or control with probabilities determined by their cluster’s saturation level. The main answer strategy complication is that now there are two levels of randomization that must be respected. The saturation of treatment varies at the cluster level, so whenever we are estimating saturation effects, we have to cluster standard errors at the level at which saturation was assigned. The direct treatments are assigned at the individual level, so we do not need to cluster when estimating direct effects.
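The two-stage procedure can be sketched in base R as follows. This is an illustration under assumed sizes (50 clusters of 20 units) and an arbitrary seed. It uses `sample()` directly rather than any particular randomization package, and the helper `assign_within_cluster()` is a hypothetical name.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

n_clusters   <- 50
cluster_size <- 20

# Stage 1: assign half of the clusters to each saturation level
cluster_saturation <- sample(rep(c(0.25, 0.75), each = n_clusters / 2))

# Stage 2: within a cluster, treat exactly saturation * size units
assign_within_cluster <- function(saturation, size) {
  n_treated <- round(saturation * size)
  sample(rep(c(1, 0), times = c(n_treated, size - n_treated)))
}

Z <- unlist(lapply(cluster_saturation, assign_within_cluster,
                   size = cluster_size))
cluster <- rep(seq_len(n_clusters), each = cluster_size)

# The realized share treated in each cluster equals its assigned saturation
realized <- tapply(Z, cluster, mean)
```

Because saturation is fixed within a cluster, comparisons across saturation levels have only 50 independent assignments behind them, while comparisons of treated and untreated units draw on all 1,000 individual assignments.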

Declaration 17.14 describes 50 groups of 20 individuals each. We imagine one source of unobserved variation at the group level (the group_shock) and another at the individual level (the individual_shock). We built potential outcomes in which the individual and saturation treatment assignments each have additive (non-interacting) effects, though more complex potential outcomes functions are of course possible. We choose two inquiries in particular: the conditional average effect of saturation among the untreated and the conditional average effect of treatment when saturation is low.

We can learn about the effects of the dosage of indirect treatment by comparing units with the same individual treatment status across the levels of dosage. For example, we could compare untreated units in the 25% saturation clusters with untreated units in the 75% saturation clusters. We can also learn about the direct effects of treatment at either saturation level, e.g., the effect of treatment when saturation is low. We use difference-in-means estimators of both inquiries, subsetted and clustered appropriately.

Declaration 17.14 \(~\)

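library(DeclareDesign)
library(dplyr)  # for if_else()

design <-
  # Model: 50 groups of 20 individuals, with shocks at both levels
  declare_model(
    group = add_level(N = 50, group_shock = rnorm(N)),
    individual = add_level(
      N = 20,
      individual_shock = rnorm(N),
      # One potential outcome for every combination of Z and S
      potential_outcomes(
        Y ~ 0.2 * Z + 0.1 * (S == "low") + 0.5 * (S == "high") +
          group_shock + individual_shock,
        conditions = list(Z = c(0, 1),
                          S = c("low", "high"))
      )
    )
  ) +
  declare_inquiries(
    CATE_S_Z_0 = mean(Y_Z_0_S_high - Y_Z_0_S_low),
    CATE_Z_S_low = mean(Y_Z_1_S_low - Y_Z_0_S_low)
  ) +
  # Two-stage assignment: saturation by cluster, then treatment within cluster
  declare_assignment(
    S = cluster_ra(clusters = group, 
                   conditions = c("low", "high")),
    Z = block_ra(blocks = group, 
                 prob_unit = if_else(S == "low", 0.25, 0.75))
  ) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z + S)) +
  # Saturation effect among the untreated: cluster at the level S was assigned
  declare_estimator(
    Y ~ S,
    model = difference_in_means,
    subset = Z == 0,
    term = "Shigh",
    clusters = group,
    inquiry = "CATE_S_Z_0",
    label = "Effect of high saturation among untreated"
  ) +
  # Direct effect at low saturation: Z was block-randomized within groups
  declare_estimator(
    Y ~ Z,
    model = difference_in_means,
    subset = S == "low",
    blocks = group,
    inquiry = "CATE_Z_S_low",
    label = "Effect of treatment at low saturation"
  )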

simulations <- simulate_design(design)

The diagnosis plot in Figure 17.16 shows the sampling distribution of the two estimators with the value of the relevant inquiry overlaid. Both estimators are unbiased for their targets, but the thing to notice from this plot is that the estimator of the saturation inquiry is far more variable than the estimator of the direct treatment inquiry. Saturation is by its nature a group-level treatment, so must be assigned at a group level. The clustered nature of the assignment to saturation level brings extra uncertainty. When designing randomized saturation experiments, researchers should be aware that we typically have much better precision for individually-randomized treatments than cluster-randomized treatments, and should plan accordingly.


Figure 17.16: Sampling distribution of two estimators


  1. Diagnose the declared design with respect to statistical power. What is the statistical power for each inquiry?

  2. Add an inquiry for the average effect of treatment, averaging over both high and low saturation. Add an associated estimator. What are the power, bias, and standard error diagnosands for this inquiry-estimator pair?

  3. We chose saturations of 25% and 75%. Write a short paragraph describing under what circumstances it would be preferable to choose saturations of 10% and 90%.

17.11 Experiments over networks

When experimental subjects are embedded in a network, units’ outcomes may depend on the treatment statuses of nearby units. In other words, treatments may spill over across the network. For example, in a geographic network, vote margin in one precinct may depend on outdoor advertisements in neighboring precincts. In a social network, information delivered to a treated subject might be shared with friends or followers. In a temporal network, treatments in the past might affect outcomes in the future.

This chapter describes the special challenges associated with experiments over networks. In the previous chapter, we discussed randomized saturation designs, which are appropriate when we can describe a hierarchy of units embedded in clusters, within which spillovers can occur but across which spillovers cannot occur. In other words, the randomized saturation design is appropriate when the network is composed of many disconnected network components (the clusters). But most networks are not disconnected; all or almost all units are connected in a vast web. This chapter describes how we need to modify the model M, inquiry I, data strategy D, and answer strategy A to learn from experiments over networks.

In the model, our main challenge is to define how far apart (in social, geographic, or temporal space) units have to be in order for unit \(i\)’s potential outcomes not to depend on unit \(j\)’s treatment status. We might say units within 5km matter but units further away do not. We might say that units within two friendship links matter but more distal connections do not. We might allow the treatment statuses of three, two, or one periods ago to impact present outcomes differently from one another. For example, we might stipulate that each unit has only four potential outcomes that depend on whether a unit is directly treated or indirectly treated by virtue of being adjacent to a directly treated unit.

Table 17.5: Example treatment conditions for an experiment over a network

Condition | Potential outcome
Pure control | \(Y_i(\mathrm{direct} = 0, \mathrm{indirect} = 0)\)
Direct only | \(Y_i(\mathrm{direct} = 1, \mathrm{indirect} = 0)\)
Indirect only | \(Y_i(\mathrm{direct} = 0, \mathrm{indirect} = 1)\)
Direct and indirect | \(Y_i(\mathrm{direct} = 1, \mathrm{indirect} = 1)\)
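A distance rule such as “units within two friendship links matter” can be made concrete with an adjacency matrix. The sketch below is base R only and uses a hypothetical five-unit chain network (1 - 2 - 3 - 4 - 5), which is an assumption chosen for illustration.

```r
# Hypothetical undirected friendship network: a chain 1 - 2 - 3 - 4 - 5
n <- 5
adj <- matrix(0, n, n)
for (i in 1:(n - 1)) {
  adj[i, i + 1] <- 1
  adj[i + 1, i] <- 1
}

# Units adjacent to each unit (one link away)
within_one <- adj > 0

# Units reachable in at most two links: nonzero entries of A + A^2,
# excluding each unit's link back to itself
within_two <- (adj + adj %*% adj) > 0
diag(within_two) <- FALSE

neighbors_of_1_one_hop <- which(within_one[1, ])
neighbors_of_1_two_hop <- which(within_two[1, ])
```

Under the one-link rule, unit 1’s outcomes can depend only on unit 2’s treatment; under the two-link rule, they can also depend on unit 3’s. The choice changes which potential outcomes the model allows.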

With potential outcomes defined, we can define inquiries. With four potential outcomes, there are six pairwise contrasts that we could contemplate. For example, the direct effect in the absence of indirect treatment is defined as \(\mathbb{E}[Y_i(\mathrm{direct} = 1, \mathrm{indirect} = 0) - Y_i(\mathrm{direct} = 0, \mathrm{indirect} = 0)]\) and the direct effect in the presence of indirect treatment is \(\mathbb{E}[Y_i(\mathrm{direct} = 1, \mathrm{indirect} = 1) - Y_i(\mathrm{direct} = 0, \mathrm{indirect} = 1)]\). We could similarly define indirect effects as \(\mathbb{E}[Y_i(\mathrm{direct} = 0, \mathrm{indirect} = 1) - Y_i(\mathrm{direct} = 0, \mathrm{indirect} = 0)]\) or \(\mathbb{E}[Y_i(\mathrm{direct} = 1, \mathrm{indirect} = 1) - Y_i(\mathrm{direct} = 1, \mathrm{indirect} = 0)]\). We may be interested in how direct and indirect treatments interact, which would require taking the difference between the two direct effect inquiries or taking the difference between the two indirect effect inquiries. Which inquiry is most appropriate will depend on the theoretical setting.
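The count of pairwise contrasts and the interaction inquiry can be checked with a few lines of base R. The four average potential outcomes used here are hypothetical numbers chosen only for illustration.

```r
conditions <- c("pure_control", "direct_only",
                "indirect_only", "direct_and_indirect")

# Every unordered pair of the four conditions is a candidate contrast
contrasts <- t(combn(conditions, 2))
n_contrasts <- nrow(contrasts)

# Hypothetical average potential outcomes, named Y_direct_indirect
Y_0_0 <- 0.50
Y_1_0 <- 0.52
Y_0_1 <- 0.51
Y_1_1 <- 0.55

direct_without_indirect <- Y_1_0 - Y_0_0
direct_with_indirect    <- Y_1_1 - Y_0_1
indirect_without_direct <- Y_0_1 - Y_0_0
indirect_with_direct    <- Y_1_1 - Y_1_0

# The interaction can be computed from either pair of effects
interaction_via_direct   <- direct_with_indirect - direct_without_indirect
interaction_via_indirect <- indirect_with_direct - indirect_without_direct
```

Both routes to the interaction give the same number, since each equals \(Y(1,1) - Y(1,0) - Y(0,1) + Y(0,0)\) averaged over units.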

The data strategy for an experiment over networks still involves random assignment. Typically, however, experimenters are in direct control of the direct treatment application only, and the resulting indirect exposures result from the natural channels through which spillovers occur. The mapping from a direct treatment vector to the assumed set of conditions is described by Aronow and Samii (2017) as an “exposure mapping.” The exposure mapping defines how the randomized treatment results in the exposures that reveal each potential outcome. The probabilities of assignment to each of the four conditions are importantly not constant across units, for the main reason that units with more neighbors are more likely to receive indirect treatment. Furthermore, exposures are dependent across units: if one unit is directly treated, then all adjacent units must be indirectly treated.
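The following base R sketch illustrates an exposure mapping and the unequal exposure probabilities it produces. It is not the implementation from Aronow and Samii (2017): the function `exposure_map()`, the six-unit line network, the choice of two treated units, and the seed are all assumptions made for illustration.

```r
set.seed(2)  # arbitrary seed so the sketch is reproducible

# Hypothetical network: six units on a line, each adjacent to its neighbors
n <- 6
adj <- matrix(0, n, n)
for (i in 1:(n - 1)) {
  adj[i, i + 1] <- 1
  adj[i + 1, i] <- 1
}

# Exposure mapping: direct treatment vector -> one of four conditions
exposure_map <- function(Z, adj) {
  indirect <- as.numeric(adj %*% Z > 0)  # at least one adjacent unit treated
  ifelse(Z == 1 & indirect == 1, "direct_and_indirect",
         ifelse(Z == 1, "direct_only",
                ifelse(indirect == 1, "indirect_only", "pure_control")))
}

# One assignment: only unit 2 is directly treated
example_exposure <- exposure_map(c(0, 1, 0, 0, 0, 0), adj)

# Approximate exposure probabilities by simulating many assignments
# in which exactly two of the six units are treated
n_sims <- 2000
draws <- replicate(
  n_sims,
  exposure_map(sample(rep(c(1, 0), times = c(2, n - 2))), adj)
)
prob_indirect_only <- rowMeans(draws == "indirect_only")
```

In this sketch, unit 3 (two neighbors) is in the indirect-only condition more often than unit 1 (one neighbor), which is the differential probability of assignment described above.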

Under Principle 3.7: Seek M:I::D:A parallelism, we need to adjust our answer strategy to compensate for the differential probabilities generated by this complex data strategy. As usual, we need to weight units by the inverse of the probability of assignment to the condition that they are in. In the networked setting we have to further account for dependence in treatment assignment probabilities. This dependence tends to increase sampling variability. For intuition, consider how clustering (an extreme form of across-unit dependence in treatment conditions) similarly tends to increase sampling variability. Aronow and Samii (2017) propose Hajek- and Horvitz-Thompson-style point and variance estimators that account for these complex joint probabilities of assignment, which are themselves estimated by simulating the exposures that would result from many thousands of possible random assignments.
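To illustrate the inverse probability weighting step, here is a minimal base R sketch of Horvitz-Thompson-style and Hajek-style estimates of the average outcome under a single exposure condition. The data are hypothetical, and the sketch covers point estimation only; it does not implement the variance estimators, which require joint probabilities of assignment.

```r
# Hypothetical data: observed outcomes, an indicator for being in the
# condition of interest, and each unit's probability of that condition
Y       <- c(10, 12, 9, 14, 11, 13)
in_cond <- c(1, 0, 1, 0, 1, 0)
prob    <- c(0.5, 0.5, 0.25, 0.25, 0.5, 0.5)

N <- length(Y)
weights <- in_cond / prob  # inverse probability weights, zero if not in condition

# Horvitz-Thompson style: divide the weighted total by the known N
ht_mean <- sum(weights * Y) / N

# Hajek style: divide by the sum of the weights instead
hajek_mean <- sum(weights * Y) / sum(weights)
```

In this example the weights sum to 8 rather than 6, so the Horvitz-Thompson-style estimate (13) falls outside the range of the outcomes it averages (9 to 11), while the Hajek-style estimate (9.75) stays inside it. Normalizing by the realized sum of weights is what keeps the Hajek-style estimate within that range.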

17.11.1 Example

Here we declare an experimental design to estimate the effects of lawn signs. The units are the lowest level at which we can observe vote margin, the voting precinct. In our model, we define four potential outcomes. Precincts can be both directly and indirectly treated, only directly treated, only indirectly treated, or neither. Indirect treatment occurs when a neighboring precinct is treated. This model could support many possible inquiries, but here we will focus on three: the direct effect of treatment when the precinct is not indirectly treated, the effect of indirect treatment when the precinct is not directly treated, and the total effect of direct and indirect treatment versus pure control. The data strategy will involve randomly assigning some units to direct treatment, which will in turn cause other units to be indirectly treated. We will need to learn via simulation the probabilities of assignment to conditions that this procedure produces. We’ll make use of two answer strategies: the Horvitz-Thompson and Hajek estimators proposed by Aronow and Samii (2017), along with their associated variance estimators, as implemented in the interference package.

Some features of this design must be described outside the design-as-declared, for the main reason that some of the matrices required by the design (the geographic adjacency matrix, the permutation matrix, and the probability matrices) have non-tidy data structures that must be handled outside of DeclareDesign.

First, we load the Fairfax County, Virginia voting precincts shapefile, removing one singleton voting precinct, and we plot the precincts in Figure 17.17.


Figure 17.17: Voting Precincts in Fairfax County, Virginia

Next, we obtain the adjacency matrix.

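library(sf)     # supplies the sf-to-Spatial conversion used by as()
library(spdep)  # poly2nb() and nb2mat()

# Binary adjacency matrix: precincts that share any boundary point are neighbors
adj_matrix <-
  fairfax %>%
  as("Spatial") %>%
  poly2nb(queen = TRUE) %>%
  nb2mat(style = "B", zero.policy = TRUE)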

The last bit of preparation we need to do is to create a permutation matrix of possible random assignments, from which probabilities of assignment to each condition can be calculated:

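# Direct treatment is assigned to the 238 precincts with probability 0.1
ra_declaration <- declare_ra(N = 238, prob = 0.1)

# Up to 10,000 possible assignment vectors drawn from that procedure
permutatation_matrix <- 
  ra_declaration %>%
  obtain_permutation_matrix(maximum_permutations = 10000) %>%
  t()  # transpose so each row is one assignment (assumed orientation for the helpers)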

Now we’re ready to declare the full design. We’ve hidden the get_exposure_AS and estimator_AS helper functions to reduce the amount of code to read through. The declare_model call builds in a dependence of potential outcomes on the length of each precinct’s perimeter to reflect the idea that outcomes are correlated with geography in some way. The declare_inquiry call describes the three inquiries in terms of potential outcomes. The declare_assignment call first conducts a random assignment according to the procedure described by declare_ra above, then obtains the exposures that the assignment generates. Finally, all the relevant information is fed into the Aronow and Samii estimation functions via estimator_AS.

Declaration 17.15 \(~\)

design <-
  # Potential outcomes are named Y_<direct>_<indirect>
  declare_model(
    data = select(fairfax, -geometry),
    Y_0_0 = pnorm(scale(SHAPE_LEN), sd = 3),
    Y_1_0 = Y_0_0 + 0.02,
    Y_0_1 = Y_0_0 + 0.01,
    Y_1_1 = Y_0_0 + 0.03
  ) +
  declare_inquiries(
    total_ATE = mean(Y_1_1 - Y_0_0),
    direct_ATE = mean(Y_1_0 - Y_0_0),
    indirect_ATE = mean(Y_0_1 - Y_0_0)
  ) +
  # Randomize direct treatment, then map it to exposure conditions
  declare_assignment(
    Z = conduct_ra(ra_declaration),
    exposure = get_exposure_AS(make_exposure_map_AS(adj_matrix, Z, hop = 1))
  ) +
  # Reveal the potential outcome that matches each unit's exposure
  declare_measurement(
    Y = case_when(
      exposure == "dir_ind1" ~ Y_1_1,
      exposure == "isol_dir" ~ Y_1_0,
      exposure == "ind1" ~ Y_0_1,
      exposure == "no" ~ Y_0_0
    )
  ) +
  declare_estimator(handler = estimator_AS, 
                    permutatation_matrix = permutatation_matrix, 
                    adj_matrix = adj_matrix)


The maps in Figure 17.18 show how this procedure generates differential probabilities of assignment to each exposure condition. Units that are in denser areas of the county are more likely to be in the Indirect Exposure Only and Direct and Indirect Exposure conditions than those in less dense areas.


Figure 17.18: Probabilities of assignment to condition

Figure 17.19 compares the performance of the Hajek and Horvitz-Thompson estimators. Both are approximately unbiased for their targets, but the Horvitz-Thompson estimator has much higher variance, suggesting that in many design settings, researchers will want to opt for the Hajek estimator.

simulations <- simulate_design(design)

Figure 17.19: Sampling distribution of two estimators for three inquiries. The vertical lines refer to the true values of the inquiries.


  1. Increase the fraction directly treated to 0.3, and describe how the bias and standard deviation of the sampling distributions for all six estimators change.