15 Observational : causal

In an observational design, researchers do not themselves set the conditions to which units are exposed; the natural proceses of the world are responsive for the observed treatment conditions. Observational causal inference is the art is inferring what the effects of the observed treatments were, or what the effects of treatments not applied by the work would have been had things been different.

In some accounts, “observational causal inference” is an oxymoron, but this is wrong. An “observational experiment” is an oxymoron, but observational causal inference is possible and necessary – just sometimes unflinchingly difficult. It’s hard because observational causal inference depends on the world generating just the right circumstances. Process tracing requires observation of just the right clues to the causal mystery. Section on observables requires measurement of a set of variables that collectively close all the back door paths from treatment to outcome. A difference-in-difference design requires a certain kind of overtime stability – that untreated potential outcomes move in parallel. Instrumental variables needs nature to randomly assign something we can measure. Regression discontinuity designs requires a cutoff that we can observe and understand.

These are the big five observational causal research designs as we see it. Many additional innovative designs have been developed to try to estimate causal quantities with observational data. These generally seek clever innovations in \(A\) in order to have as few assumptions on \(M\) as possible (Principle 3.3: Entertain many models). We refer readers to Angrist and Pischke (2008) and Dunning (2012) for excellent overviews of the theory behind many of these methods.

15.1 Process tracing

We declare a qualitative design in which researchers seek to learn about the effect of a cause on a single unit. The diagnosis helps evaluate the gains from different within-case data gathering strategies.

In qualitative research we are often interested in learning about the causal effect for a single unit. For example, for a country unit that underwent sanctions (an “event”), we want to know the causal effect of the sanctions on government behavior. To do so, we need to know what would have happened if the event did not happen, the counterfactual outcome that did occur as opposed to the factual outcome that did not. Due to the fundamental problem of causal inference, we cannot observe what would have happened if that counterfactual case had happened. We have to guess—or infer—what would have happened (Principle 3.5). Social scientists have developed a large array of tools for guessing missing counterfactual outcomes — what would have happened in the counterfactual case, if the event had not happened.14

A common inquiry in this setting is whether an outcome was due to a cause. This “causal attribution” inquiry or “Cause of Effect” inquiry can be written \(\mathrm{CoE}:=1-Y(0)| X=1 \mathrm{ \& } Y=1\). For a unit with \(X=1\) and \(Y=1\), \(\mathrm{CoE}=1\) if \(Y(0)=0\). This “Cause of Effect” inquiry is different from the treatment effect of the treated case (TET). The TET is the difference between the treated potential outcome and the control potential outcome in the posttreatment period for the treated unit of interest. The TET is defined regardless of whether the unit’s outcome equals 1 or 0, but the “cause of effect” inquiry is asked only for units whose treated outcome is equal to one.

“Process tracing” is a prominent strategy for assessing causes of effects (Bennett and Checkel 2015; Fairfield and Charman 2017). Here, following for example Humphreys and Jacobs (2017), we think of process tracing as a procedure in which researchers provide a theory, in the form of a causal model, that is rich enough to characterize the probability of observing ancillary data (“Causal process observations” (Brady 2004)) given underlying causal relations. If equipped with priors, such a model in turn lets one use Bayes’ rule to form posteriors over causal relations when these observations are observed.

For intuition, say we are interested in whether a policy caused a change in economic outcomes. We expect that for the policy to matter it at least had to be implemented and so if we find out that it was not implemented we infer that it did not matter. We make theory dependent inferences that are reasonable insofar as the theory is reasonable. In this example, if there are plausible channels through which a policy might have mattered even if not implemented, then our conclusion would not be warranted.

To illustrate design choices for a process tracing study we consider a setting in which we have observed \(X=1\) and \(Y=1\) and we are interested in figuring out whether \(Y=1\) because \(X=1\). More specifically we imagine a model with two ancillary variables, \(M\) and \(W\). We posit that \(X\) causes \(Y\) via \(M\)—with negative effects of \(X\) on \(M\) and of \(M\) and \(Y\) ruled out. And we posit that \(W\) is a cause of both \(M\) and \(Y\), specifically we posit that if \(W=1\) then \(X\) causes \(M\) and \(M\) causes \(Y\) for sure. Under this model \(M\) and \(W\) each serve as “clues” for the causal effect of \(X\) on \(Y\).

Using the language popularized by Van Evera (1997), \(M\) provides a “hoop” test: if you look for data on \(M\) and find that \(M=0\) then you infer that \(X\) did not cause \(Y\). If on the other hand you examine \(W\) and find that \(W=1\) then you have a “smoking gun” test: You infer that \(X\) did indeed cause \(Y\). However, if you find both \(M=0\) and \(W=1\), then you know your model is wrong.

The model can be described using the CausalQueries package thus:

model <- make_model("X -> M -> Y <- W -> M") %>%
  set_restrictions("(M[X=1] < M[X=0]) | (M[X=1, W=1] == M[X=0, W=1])") %>%
  set_restrictions("(Y[M=1] < Y[M=0]) | (Y[M=1, W=1] == Y[M=0, W=1])")


This model definition describes the DAG but also specifies a set of restrictions on causal relations. By default flat priors are then placed over all other possible causal relations, though of course other prior beliefs could also be specified.

We now have all we need to assess what inferences we might make given different sorts of observations using CausalQueries::query_model. This function lets you ask any causal question of of a model conditional on observed (or counterfactual) conditions. We can use it to calculate beliefs, likely data, and conditional inferences thus:

queries <- 
    query = list('Prob(CoE=1)' = "Y[X=1] > Y[X=0]",
                 'Prob(M=1)' = "M==1",
                 'Prob(CoE=1 | M=0)' = "Y[X=1] > Y[X=0]",
                 'Prob(CoE=1 | M=1)' = "Y[X=1] > Y[X=0]"),
    given = list("Y==1 & X==1",
                 "Y==1 & X==1",
                 "Y==1 & X==1 & M==0",
                 "Y==1 & X==1 & M==1"),
    using = "parameters") 
Table 15.1: Beliefs for a case with \(X=1, Y=1\). The first row gives the prior belief that \(X=1\) caused \(Y=1\). The second row gives the expecation that M=1. The last rows gives posterior beliefs that X=1 caused Y=1 after M is observed, depending on what is found.
Query Given mean
Prob(CoE=1) Y==1 & X==1 0.71
Prob(M=1) Y==1 & X==1 0.93
Prob(CoE=1 | M=0) Y==1 & X==1 & M==0 0.00
Prob(CoE=1 | M=1) Y==1 & X==1 & M==1 0.77

From the model specification, we can calculate directly the probability of observing \(M=0\) or \(M=1\) (or \(W=0\) or \(W=1\)) and what we would infer in each case. Table 15.1 shows an example of these values. From a table like this we can calculate directly the expected posterior variance associated with a process tracing data strategy and a Bayesian answer strategy. For instance, here our prior on CoE is 0.71 which implies a variance of 0.2. If we gather data on \(M\) however our posterior variance will be either 0.18 (with probability 0.93) or 0.

Thus we see that we can already imagine how our a design in which we seek data on \(M\) only will perform in expectation in terms of reducing our uncertainty. No simulation is required. Even still, we think it useful to fold these quantities into a design declaration so that users can access the data strategies and answer strategies in the same way as they would for any other problem.

Declaration 15.1 uses a custom data function that draws a value of the estimand (EoC) using the prior in the first row of Table 15.1 and then draws values of M, using values from the second row of Table 15.1 (and similarly for \(W\)). The estimation step also uses a custom function, which simply returns the posterior estimates—like those in the last rows of Table 15.1, but for both \(M\) and \(W\). No data strategy is provided because we imagine estimation given all possible data strategies (observation of \(M\), of \(W\), and of both).

Declaration 15.1 \(~\)

design <- 
  declare_model(data = data_function()) + 
  declare_inquiry(CoE = CoE) +
  declare_estimator(handler = my_estimator_function)


Given such a model, a case in which \(X=Y=1\), and limited resources, we now want to know whether we would be better gathering data on \(M\) or on \(W\) or both? The answers are given in Table 15.2. We see only modest declines in expected posterior variance from observation of the mediator \(M\) (consistent with the manual calculation above), but large declines from observing \(W\).

Table 15.2: Inference upon observation of M or \(W\) for cases in which \(X = 1\) and \(Y = 1\). We see that expected posterior variance (equivalently, expected mean squared error) falls modestly when M is observed but substantially when \(W\) is observed. The gains from observing \(W\) are modest if M is already observed.
Inquiry Estimator N Sims Bias Mse Mean Posterior Var Mean Estimate Mean Estimand
CoE Prior 1000 -0.01 0.20 0.20 0.71 0.72
(0.01) (0.01) (0.00) (0.00) (0.01)
CoE M only 1000 -0.01 0.16 0.17 0.72 0.72
(0.01) (0.01) (0.00) (0.01) (0.01)
CoE W only 1000 -0.00 0.06 0.06 0.72 0.72
(0.01) (0.00) (0.00) (0.01) (0.01)
CoE Both 1000 -0.00 0.05 0.05 0.72 0.72
(0.01) (0.00) (0.00) (0.01) (0.01)

Note here that the causal model used for M and I is the same model used for A (Principle 3.7). This does not have to be the case however. If instead we used different models for these two parts we could assess how well or poorly our strategy performs when our model is wrong. In this case, and unlike the results in table 15.2, we would find that the expected posterior variance (where the expectation is taken with respect to the model in M but posterior taken with respect to the model in A) will not be the same as the expected mean squared error.

15.2 Selection-on-observables

We declare a design in which a researcher tries to address concerns about confounding by conditioning on observable third variables. In the design the researcher has to specify how a third variable creates a risk of confounding and how exactly they will take account of these variables to minimize bias.

When we want to learn about causal effects – but treatments are allocated by the world and not by the researcher – we are sometimes stuck. It’s possible that a comparison of treated units to control units will generate biased inferences because of selection. If selection puts units with higher average potential outcomes into the treatment group, then a comparison with the control group will yield biased causal inferences.

Sometimes, however, we know enough about selection in order to condition on the variables that cause it. A selection-on-observables design stipulates a family of models \(M\) of the world that describe which variables are responsible for selection, then employs a data strategy that measures those variables, rendering them “observable.” In the answer strategy, we draw on substantive knowledge of the causal process to generate an “adjustment set”: a set of variables that predict selection into treatment, or in the language of DAGs, a set that when conditioned upon, closes all back door paths from the treatment to the outcome. We can condition in many ways, for example through regression adjustment, stratification, or matching.

The quality of causal inferences we draw comes down to whether our claims about selection into treatment are correct. If we’ve missed a cause (missed a back door path), then our inferences will be prone to bias.

Three roles for an observable variable X.

Figure 15.1: Three roles for an observable variable X.

However, it is not as simple as adjusting for any variable that will close back parhts. We face problems if we fail to adjust for an observable \(X\) under some models — but we will also face problems if we do adjust for \(X\) under other models. In Figure 15.1 we illustrate four of the possible roles for an observable variable \(X\): as a confounder of the relationship between \(D\) and \(Y\); as a collider, which is affected by both \(D\) and \(Y\); as a mediator of the relationship between \(D\) and \(Y\); and as a predictor of \(Y\) with no connection to \(D\). We set aside in these DAGs the possible roles of an unobservable variable \(U\) that would introduce additional problems of confounding.

If \(X\) is a confounder, failing to adjust for it in studying the relationship between the treatment \(D\) and outcome \(Y\) will lead to confounding bias. We often think of fourth DAG as the alternative to this, where \(X\) is a predictor of \(Y\) but has no link to \(D\). In this circumstance, we still want to adjust for \(X\) to seek efficiency improvements by soaking up additional variation in \(Y\), but failing to do so will not introduce bias. If the true model is definitely one of these two, first and fourth DAGs, we should clearly choose to adjust for \(X\) to ensure we avoid confounding bias in case the first is correct (and we will do no worse if instead it is the fourth).

However, the middle two DAGs present problems if we do adjust for \(X\). In the first, \(X\) is a collider: it is affected by both \(D\) and \(Y\). Adjusting for \(X\) if this is the true model introduces collider bias, because we open a backdoor path between \(D\) and \(Y\) through \(X\). We also introduce bias if we control for \(X\) if the mediator model (DAG 3) is the true model, wherein \(D\) affects \(X\) and \(Y\) and \(X\) affects \(Y\) (i.e., \(X\) is a mediator for the relationship between \(D\) and \(Y\)). But the reason here is different: controlling for a mediator adjusts away part of the true treatment effect.

In a selection-on-observables design, we must get many features of the model correct, not only about the factors that affect \(D\). We must be sure of all the arrows into \(X\), \(D\), and \(Y\) and the order of the causal arrows. In some cases, in natural experiments where selection processes are not randomized by researchers but are known, these assumptions can be sustained. In others, we will be making heroic assumptions.

We declare a design to explore these considerations, with a model defining the relationship between a causal factor of interest \(D\) and outcome \(Y\) and an observable confounder \(X\), the average treatment effect as the inquiry, a simple measurement strategy, and two estimators with and without adjustment for \(X\). We use exact matching as our adjustment strategy.

Declaration 15.2 \(~\)

exact_matching <- 
  function(data) { 
    matched <- matchit(D ~ X, method = "exact", data = data) 

design <-
    N = 100, 
    U = rnorm(N), 
    X = rbinom(N, 1, prob = 0.5),
    D = rbinom(N, 1, prob = 0.25 + 0.5 * X),
    Y_D_0 = 0.2 * X + U,
    Y_D_1 = Y_D_0 + 0.5
  ) + 
  declare_inquiry(ATE = mean(Y_D_1 - Y_D_0)) +
  declare_step(handler = exact_matching) +
  declare_measurement(Y = reveal_outcomes(Y ~ D)) +
  declare_estimator(Y ~ D,
                    weights = weights,
                    model = difference_in_means,
                    label = "adjusted") +
  declare_estimator(Y ~ D, 
                    model = difference_in_means, 
                    label = "unadjusted")


We declare beliefs about the selection process and how \(D\), \(Y\), and \(X\) are related. The model needs to include potential outcomes for the main outcome of interest (\(Y\)) and a specification of the assignment of the key causal variable (here, \(D\)). Here, we have defined the assignment process as a function of an observable variable \(X\). In fact, \(X\) is the only variable that affects selection into treatment: \(X\) is a binary variable (i.e., two groups), and the probability of treatment is 0.4 when \(X=0\) and 0.6 when \(X=1\). In addition, we define the potential outcomes for \(Y\), which invoke confounding by \(X\) because it affects both \(D\) and \(Y\). We only invoke one possible relationship between \(X\), \(Y\), and \(D\), and so do not consider the possibilities of colliders or mediators.

In our model \(U\) is not a confounder in this model because it affects \(Y\) but not \(D\); this is a strong excludability assumption on which our causal inferences depend. The assumption is strong because ruling out all unobservable confounders is typically impossible. Most causal factors in nature are affected by many variables, only some of which we can imagine and measure. The problem with unobserved confounding is bias. In the design declared above, the unadjusted answer strategy suffers from unobserved confounding, because we do not adjust for \(X\) which predicts both treatment and the outcome. Depending on the inquiry and your diagnosands, a biased answer strategy may be good enough — or it may be all you can get. If the decision you face is whether to adopt a treatment, you may simply want to know if the treatment is at least 10% effective, even if you can’t get a precise estimate of its effectiveness. With this goal in mind, a little bias that still allows you to correctly determine if the treatment effect is at least 0.1 may be okay.

Sensitivity analysis tackles this idea of how much bias is a problem. We can’t adjust for unobservable confounders, but we can reason about how much confounding would have to be present to draw the wrong conclusions from our data. How much confounding would there have to be to decide incorrectly that a treatment is at least 10% effective is a question sensitivity analysis may be able to answer. An answer sensitivity analysis could provide is that you would have to have a variable with a 20% correlation with the outcome and a 5% correlation with treatment to make the wrong decision. The substantive or theoretical judgement then about whether that amount of confounding is likely, takes us back to the model and whether we think that is plausible. We explore one implementation of the idea of sensitivity analysis in the exercises.

In addition to the model, we define our inquiry as the average treatment effect of \(D\) on \(Y\), a measurement strategy to collect observations on \(Y\), and two possible estimators with and without adjustment of \(X\). The first estimator, with adjustment, uses the weights from the exact matching estimator. The matching procedure adjusts for differences in the probability of selection into treatment according to \(X\). The second, unadjusted, estimator does not control for \(X\). We see in a diagnosis of the bias of the design for these two estimators that the first has no bias and the second substantial bias (about 12% of the average estimate).

We see in a diagnosis of the design that if we fail to adjust for the observable features of the selection process – that the probability of treatment depends on the value of the observable characteristic \(X\) — then we have biased answers. If we do control, we obtain unbiased answers. However, as highlighted earlier, these results depend on the believability of our model, which in this case involves observed confounding of the relationship between \(D\) and \(Y\) by \(X\). If there was unobservable confounding from \(U\), which we ruled out, or if \(X\) was a collider of mediator, there would be bias.

diagnosis <- diagnose_design(design, sims = sims, bootstrap_sims = b_sims)
Table 15.3: Bias and RMSE for two estimators from a selection-on-observables design
estimator bias
adjusted 0.009
unadjusted 0.113


In this exercise, we’ll investigate how to diagnose an observation research design that includes sensitivity as part of the answer strategy. Researchers use sensitivity analysis in settings when treatment effect estimates may be biased due to unobserved confounding. In such cases, we can’t condition on the variables that would deconfound the treatment effect estimates, because we don’t have access to them. In such cases, we turn to sensitivity analyses which essentially ask, “how bad would the unobserved confounding have to be” to overturn the substantive conclusions of the paper.

To diagnose sensitivity analysis, we’ll need to make use of a helper function:

sensitivity_helper <-
  function(model) {
    sensitivity <- sensemakr(model = model, treatment = "Z")
    sensitivity$sensitivity_stats[,c("estimate", "rv_qa")] 

Here’s a design:

design <-
  declare_population(N = 1000, 
                     U = rnorm(N, sd = 0.6),
                     X = correlate(rnorm, given = U, rho = 0.5),
                     Z = rbinom(N, 1, prob = pnorm(X)),
                     Y = 0.2 * Z + 0.5 * X + U) +
  declare_estimator(Y ~ Z + X, model = lm, model_summary = sensitivity_helper)
  1. Inspect the declare_population call. What is the true average treatment effect on Z on Y?

  2. Draw data from this design. Run a regression of Y on Z and X, using the lm function. Conduct a sensitivity analysis using the sensemakr function from the sensmakr package. Interpret the “Robustness Value, q = 1, alpha = 0.05” statistic. (Hint: run summary on the output from the sensemakr function)

  3. Diagnose the design with respect to bias relative to the true average treatment effect. In this design, is a regression of Y on Z, controlling for X unbiased for the true ATE? Why or why not?

  4. Now diagnose the design with respect to the robustness value. How frequently do we obtain a robustness value less than 0.10? less than 0.15?

  5. Recall that in this design, the treatment and outcome are not confounded by unobservables. In this light, interpret the diagnosis of the sensitivity analysis.

15.3 Difference-in-differences

We declare a differences in differences design in which the effect of a cause is assessed by comparing changes over time for a unit that gets treated to changes over time for a unit the does not get treated. The declaration and diagnosis helps clarify when effect heterogeneity threatens inference using this approach.

The difference-in-difference design does not rely on the assumption of a known assignment mechanism as in selection-on-observables. But it adds a new one instead: the parallel trends assumption. In order for the difference-in-difference design to work, the change between before and after treatment in the control potential outcomes must be equal (i.e., parallel). Because this assumption depends on the change in values of the unrealized (and thus unobservable) control potential outcome in the treated unit after treatment, it cannot be tested. A widely-used diagnostic is to look at the trends in outcomes between the treated and control unit before treatment; this is only an indirect test because parallel trends concerns the unobserved control trend of the actually treated unit.

The design has been famously used for analyses of two periods (before and after) and two groups (treated and untreated) such as a policy change in one state compared to another before and after the policy change. Today, it is most often used in many-period many-group settings with observational panel data. Here, the logic of the two-period two-group design is extended through analogy. Parallel trends between treated and control groups are assumed on average across treated groups and periods. Unfortunately, the analogy holds only under limited circumstances, a fact only recently discovered.

We declare a design for a 20-period, 20-unit design in which units are become treated at different times, a common setting in empirical social science often referred to as the staggered adoption design. The treatment effect of interest might be a state-level policy adopted in 20 states at some point within a 20-year period, so we draw on comparisons before and after policy adoption within states and across states that have and have not yet adopted treatment.

Declaration 15.3 \(~\)

N_units <- 20
N_time_periods <- 20

design <-
    units = add_level(
      N = N_units, 
      U_unit = rnorm(N), 
      D_unit = if_else(U_unit > median(U_unit), 1, 0),
      D_time = sample(1:N_time_periods, N, replace = TRUE)
    periods = add_level(
      N = N_time_periods,
      U_time = rnorm(N),
      nest = FALSE
    unit_period = cross_levels(
      by = join(units, periods), 
      U = rnorm(N),
        Y ~ U + U_unit + U_time + 
          D * (0.2 - 3 * (D_time - as.numeric(periods))), 
        conditions = list(D = c(0, 1))
      D = if_else(D_unit == 1 & as.numeric(periods) >= D_time, 1, 0),
      D_lag = lag_by_group(D, groups = units, n = 1, order_by = periods)
  ) + 
    ATT_switchers = mean(Y_D_1 - Y_D_0), 
    subset = D == 1 & D_lag == 0 & !is.na(D_lag)
  ) + 
  declare_measurement(Y = reveal_outcomes(Y ~ D)) + 
    Y ~ D, fixed_effects = ~ units + periods,
    model = lm_robust,
    inquiry = "ATT_switchers",
    label = "2wfe"
  ) + 
    Y = "Y", 
    G = "units", 
    T = "periods", 
    D = "D",
    handler = label_estimator(did_multiplegt_tidy),
    inquiry = "ATT_switchers",
    label = "chaisemartin"


We define hierarchical data with time periods nested within groups, such that each of the 20 units have 20 time periods from 1 to 20. We assign units to be treated at some point in the period (D_unit), and confound treatment assignment with unobservable unit-specific features U_unit. (If there was no confounding, we would not need the parallel pretrends assumption. This would be guaranteed by virtue of the randomization.) In addition, we assign the timing of treatment is randomly assigned (D_time), such that for treated units treatment turns on the randomly assigned time. D then is jointly determined by whether the unit is treated and whether the current period is after the assigned D_time. We allow for unit-specific variation U_unit and time-specific variation U_time that affects the outcome as well as unit-period characteristics U. Potential outcomes are a function of these unit-, time-, and unit-time-specific characteristics, and a treatment effect of \(0.2\) that varies according to when units are treated (more on the importance of this treatment effect heterogeneity below).

The difference-in-difference design lends itself to the average treatment effect on the treated (ATT) inquiry. We are seeking counterfactual comparisons for each treated unit in untreated units, whether those that are never treated or those not yet treated. We leverage over time comparisons within units to isolate the causal effect of treatment, and net out time-varying characteristics by subtracting off the change in untreated units. Unfortunately, except under extremely limited circumstances — exactly homogenous treatment effects — we will not be able to recover unbiased estimates of the ATT. We can, however, under some circumstances and with some estimators, recover the ATT for a specific subgroup: units that just switched from untreated to treated. We declare the ATT for these “switchers” as the inquiry.

We measure the outcome \(Y\), and define two answer strategies. First, we define the standard two-way fixed effects estimator with fixed effects by time and unit. The two-way fixed effects fits the empirical goal of difference-in-differences: the time fixed effects net out time-varying unit-invariant variables such as seasonality and time trends. The unit fixed effects net out unit-varying variables that are time-invariant like race or age-at-birth of individuals and longterm histories of violence of territories. Second, we define the newly-defined De Chaisemartin and d’Haultfoeuille (2020) estimator, which addresses bias in the staggered adoption difference-in-differences design. The De Chaisemartin and d’Haultfoeuille (2020) design addresses a problem in standard analyses of the staggered adoption difference-in-difference design: the two-way fixed effects estimator relies on comparisons between units that are treated and units that are not yet treated. When treatment effects differ across units depending on when they are treated (as they do in the design here), then those comparisons will be biased: part of the treatment effect will be subtracted out of the estimate.

15.3.1 Treatment timing

The first important design parameter is the pattern of treatment timing across units. Three patterns are common: all units are treated at one common point in time, and typically stay treated once treated; the staggered design as in the declaration above, in which units are treated for the first time at different times and once treated stay treated; and on-off patterns in which units are treated at different times and once treated become untreated again.

We declare these three possibilities as three different treatment variables D:

  D_sametime = 
    if_else(D_unit == 1 & as.numeric(periods) >= N_time_periods/2, 1, 0)
  D_staggered = 
    if_else(D_unit == 1 & as.numeric(periods) >= D_time, 1, 0)
  D_one_period = 
    if_else(D_unit == 1 & as.numeric(periods) == D_time, 1, 0)

The proportion of treated units that are treated at different points is also consequential, so in addition to specifying which of these general patterns holds we need to either get the specific timing pattern of treatment right or explore how the design fares under different possible patterns. For example, if half of units are treated within the first three periods and the other half in the last three periods of 20, the properties of the design may differ from a design in which the second half is treated in the fourth through sixth periods. In our staggered design above, we declared a uniform distribution of timing with D_time = sample(1:N_time_periods, N, replace = TRUE). When you know details of timing in advance, bring in that data directly to the declaration instead.

15.3.2 Treatment effect heterogeneity

Whether treatment effects vary across units and across time affects whether two-way fixed effects designs can recover treatment effects. We started with homogenous treatment effects in the declaration above. In this case, as long as a generalized form of parallel trends holds, the timing of treatment and other details of the potential outcomes will not affect whether you can recover unbiased estimates of the ATT for switchers.

When treatment effects differ depending on when units are treated, however, problems emerge for many treatment timing profiles. We declare homogenous treatment effects below, as well as two kinds of time-varying heterogenous treatment effects where effects get lower for units treated later (Y_later_lower) and where effects get higher for units treated later (Y_later_higher).

    Y_homogenous ~ U + U_unit + U_time + D * 0.2,
    conditions = list(D = c(0, 1))
    Y_later_lower ~ U + U_unit + U_time + 
      D * (0.2 + 0.5 * (D_time - as.numeric(periods))),
    conditions = list(D = c(0, 1))
    Y_later_higher ~ U + U_unit + U_time + 
      D * (0.2 - 0.5 * (D_time - as.numeric(periods))),
    conditions = list(D = c(0, 1))

To follow the Principle 3.3 “entertain many models” for a difference-in-differences design, we assess the performance of the design under all three possible kinds of heterogeneity. With some treatment profiles and some answer strategies proposed in recent years, unbiased estimates will be possible, but not with other combinations. The timing of treatment, nature of heterogeneous effects, and answer strategies interact to determine the properties of the design.

15.3.3 Answer strategies

So many answer strategies have been proposed in recent years to address bias in the difference-in-differences design that we cannot summarize them in this short entry. Instead, we illustrate how assess the properties of just one estimator under a range of conditions. We point in further reading to references for the alternative answer strategies, both estimation procedures and diagnostic tools.

In the declaration, we define the units as in the first declaration and adopt the same staggered treatment allocation with uniform timing, add all three sets of potential outcomes declared above (homogenous, increasing over time, and decreasing over time), and examine both the ATT and the ATT for switchers inquiries. We consider both the two-way fixed effects estimator and the De Chaisemartin and d’Haultfoeuille (2020) estimator.

Diagnoses of two generalized difference-in-differences estimators under three types of potential outcomes.

Figure 15.2: Diagnoses of two generalized difference-in-differences estimators under three types of potential outcomes.

15.4 Instrumental variables

We declare a design in which a researcher addresses confounding concerns by using an instrumental variable. Under the right conditions, the approach can generate unbiased estimates for treatment effects for units whose treatment status is sensitive to the instrument.

When we cannot credibly assert that we have controlled for all confounding variables in a selection-on-observables design (see Section 15.2) — blocked all back-door paths between a treatment \(D\) and an outcome \(Y\) — we might have to give up on our goal of drawing causal inferences.

But occasionally, the world yields an opportunity to sidestep unobserved confounding by generating as-if random variation in a variable that itself affects the treatment variable. We call the variable that is as-if randomly assigned by the world an “instrument.”

Instruments are special. They are variables that are randomly assigned by nature, the government, or other individuals or groups. Usually, genuine random assignments have to be crafted by researchers as part of a deliberate data strategy. Experimenters go to great lengths to randomly expose some units but not others to treatments. When the world provides bonafide random variation in a variable of interest, it’s a rare and valuable opportunity. By virtue of as-if random assignment, we can learn about the average causal effects of the instrument itself without any further consternation. Conditional on geography and season, weather conditions are as-if randomly assigned, so we can learn about the average effects of rainfall on many outcomes, like crop yields, voter turnout, and attendance at sporting events. Conditional on gender and birth year, draft numbers are randomly assigned by the government, so we can learn about the average effects of being drafted on educational attainment, future earnings, or public policy preferences.

We illustrate the logic of instrumental variables in the DAG below. An instrument \(Z\) affects an endogenous treatment \(D\) which subsequently affects an outcome \(Y\). Many other common causes, summarized in terms of unknown heterogeneity \(U\), affect both \(D\) and \(Y\). It is because of this unmeasured confounding that we cannot learn about the effect of \(D\) on \(Y\) directly using standard tools.

Directed acyclic graph (DAG) of an instrumental variables design.

Figure 15.3: Directed acyclic graph (DAG) of an instrumental variables design.

We can naturally imagine an inquiry that represents the effect of the instrument (\(Z\)) on an outcome of interest such as the effect of the draft number on future earnings. In the terminology of instrumental variables, this inquiry is called the “reduced form” or the or the “intention-to-treat” (ITT) effect. If it’s really true that the instrument is randomly assigned by the world, estimation of the reduced form effect is straightforward: we estimate the causal effect of \(Z\) on \(Y\) using, for example, a difference-in-means estimator. Sometimes, we are also interested in the “first-stage” effect of the instrument on the treatment variable. Similarly, we can target this inquiry by studying the causal effect of \(Z\) on \(D\), for example using the difference-in-means estimator again.

The trouble comes when we want to leverage the random assignment of the instrument (\(Z\)) to learn about the average causal effects of the treatment variable (\(D\)) on the outcome (\(Y\)), where \(D\) is not randomly assigned but rather affected by \(Z\) and also other variables. To do so, we need to understand better how the instrument affects whether you are treated and then how both affect the outcome of interest. We need to understand “compliance” with the instrument.

We can define the compliance types of units in terms of the combinations of values of the potential outcomes of \(D\) in terms of values of instrument \(Z\) takes on. The combinations are made up of the value of treatment if unit \(i\) is assigned to control (\(D_i(Z_i = 0)\)) and the value if it is assigned to treatment (\(D_i(Z_i = 1)\)). For a binary instrument and binary treatment, there are four possible potential outcomes values and thus four types, enumerated in Table 15.4. These compliance types are often known as “principal strata” (Frangakis and Rubin 2002).

Table 15.4: Compliance types
Type \(D_i(Z_i = 0)\) \(D_i(Z_i = 1)\)
Never-taker 0 0
Complier 0 1
Defier 1 0
Always-taker 1 1

Never-takers are those who never take the treatment, no matter what the value of the instrument. Compliers take exactly the value of the instrument they are assigned. Defiers do exactly the opposite. When the instrument is take the treatment, they refuse it, but when assigned to not take it, they take treatment. Like their name suggests, always-takers take the treatment regardless of the instrument condition nature assigns them to.

With these types in mind, we can now define a new inquiry that we will be able to target with instrumental variables under a special set of assumptions: a “local” average treatment effect (LATE) among compliers. The concept of “local” reflects the idea that the effect applies only to the specific group of units whose treatment status changes as a result of the instrument.15 The LATE is \(\mathbb{E}[Y_i(D_i = 1) - Y_i(D_i = 0) | D_i(Z_i = 1) > D_i(Z_i = 0)]\) – the average treatment effect among the group of people whose value of the treatment is higher as a result of the treatment. The LATE is different from the ATE because it does not average over those units whose value of the treatment does not depend on the instrument.

We can adopt the instrumental variables answer strategy — using two-stage least squares for example — to estimate the LATE. But to do so without bias, we need to invoke new assumptions on top of those for randomized experiments. In the case of a binary instrument and a binary treatment, the five assumptions are:

  1. Exogeneity of the instrument: \(Y_i(D_i = 1), Y_i(D_i = 0), D_i(Z_i = 1), D_i(Z_i = 0) \perp \!\!\! \perp Z_i | X_i\). Substantively, this assumption requires that (possibly conditional on observed covariates in \(X\)) the instrument is as-if randomly assigned, so it is jointly independent of the treated and untreated potential outcomes as well as the potential values the treatment variable \(D\) would take on depending on the values of the instrument \(Z\). Exogeneity is usually justified on the basis of qualitative knowledge of why as-if random assignment is a reasonable assumption to make. The assumption can be bolstered — but not directly tested — with design checks like the balance on pre-treatment values of the covariates according to the levels of the instrument.

  2. Excludability of the instrument: \(Y_i(D_i(Z_i), Z_i) = Y_i(D_i(Z_i))\). We can “exclude” the instrument from the potential outcomes function \(Y_i()\) – the only relevant argument is the value of the treatment variable. Substantively, this means that \(Z\) has exactly no effect on \(Y\) except by changing the value of \(D\), or in other words a “total mediation assumption.” Under the exclusion restriction, the effect of the instrumental variable is wholly mediated by the treatment variable. The validity of the exclusion restriction cannot be demonstrated empirically and typically must be asserted on qualitative grounds. Since the reduced form of the instrument can be estimated for many different outcome variables, one piece of evidence that can bolster the exclusion restriction is to show that the instrument does not affect other plausible causal precedents of the outcome variable. If it does affect other variables that might, in turn, affect the outcome, doubt may be cast on the exclusion restriction.

  3. Non-interference: \(Y(D_i(Z_i), D_{-i}, Z_{-i}) = Y(D_i(Z_i))\). Like any non-interference assumption, here we assert that for any particular unit, other units’ values of the instrument or the treatment do not affect the outcome.

  4. Monotonicity: \(D_i(Z_i = 1) \geq D_i(Z_i = 0), \forall_i\). This assumption states that the effect of the instrument on the treatment is either zero or is positive for all units. Monotonicity rules the defier complier type who would have \(D = 1\) if \(Z = 0\) but would have \(D = 0\) if \(Z = 1\). Monotonicity is usually quite plausible (it’s tough to imagine a person who would serve in the military if not drafted but would serve if drafted!), but it’s not possible to affirm empirically. An empirical test that demonstrates a positive effect of the instrument on the treatment for one group but a negative effect for a different group could, however, falsify the monotonicity assumption.

  5. Non-zero effect of the instrument on the treatment. If the instrument does not affect the treatment, then it is useless for learning about the effects of the treatment on the outcome, simply because it generates no compliers. If there are no compliers, the LATE itself is undefined.

If all five of these assumptions are met, it can be shown that the inquiry \(\mathrm{LATE} = \frac{\mathrm{ATE}_{ \mathrm{Z\rightarrow Y} }}{\mathrm{ATE}_{Z\rightarrow D}} = \frac{\mathrm{Reduced~Form}}{\mathrm{First~Stage}}\). It is the quotient of the causal effect of the instrument on the outcome divided by the causal effect of the instrument on the received treatment. This expression underlines the importance of assumption 5: if the instrument doesn’t affect the treatment, the first stage is equal to zero and the ratio is undefined.

Going back to the DAG above, many of the five assumptions is represented in the DAG. The exogeneity of the instrument is represented by the exclusion of a causal effect of \(U\) on \(Z\). The omission of an arrow from \(Z\) to \(Y\) directly invokes the exclusion restriction. Non-interference is typically not directly represented in DAGs. Monotonicity and the non zero effect of the instrument on the treatment is represented by the causal arrow from \(Z\) to \(D\).

Moving to answer strategy, a plug-in estimator of the LATE is the difference-in-means of the outcome according to the instrument divided by the difference-in-means of the treatment according to the instrument. Equivalently, we can use two-stage least squares, which will yield the identical answer as the ratio of the difference-in-means estimates when no covariates are included in the regression.

With the four elements of the design in hand, we declare the model in a simple form below, invoking each of the five assumptions in doing so:

Declaration 15.4 \(~\)

design <-
    N = 100, 
    U = rnorm(N),
    potential_outcomes(D ~ if_else(Z + U > 0, 1, 0), 
                       conditions = list(Z = c(0, 1))), 
    potential_outcomes(Y ~ 0.1 * D + 0.25 + U, 
                       conditions = list(D = c(0, 1))),
    complier = D_Z_1 == 1 & D_Z_0 == 0
  ) + 
  declare_inquiry(LATE = mean(Y_D_1 - Y_D_0), subset = complier == TRUE) + 
  declare_assignment(Z = complete_ra(N, prob = 0.5)) +
  declare_measurement(D = reveal_outcomes(D, Z),
                      Y = reveal_outcomes(Y, D)) + 
  declare_estimator(Y ~ D | Z, model = iv_robust, inquiry = "LATE") 


The exclusion restriction is invoked in omitting a causal direct effect of \(Z\) in the \(Y\) potential outcomes. We invoke the monotonicity and non-zero effect of \(Z\) on \(D\) in the \(D\) potential outcomes, which have an effect of \(Z\) of 1. We invoke the non-interference assumption by excluding effects of the instrument values from other units in the definition of the \(D\) and \(Y\) potential outcomes. And we invoke the ignorability of \(Z\) by randomly assigning it in the assignment step.

Many instrumental variables designs involve historical data, such that most remaining design choices are in the answer strategy. Declaring and diagnosing the design can yield insights about how to construct standard errors and confidence intervals and the implications of analysis procedures in which data-dependent tests are run before fitting instrumental variables models — and only fit if the tests pass. But other instrumental variables papers involve prospective data collection, in which an instrument is identified and outcome data (and possibly instrument and treatment data) are collected anew. In such settings, the tools design diagnosis and redesign are useful just as in any prospective research design, to help select sampling and measurement procedures from the sample size to the number of outcomes to be collected. The key in this case is to build in the features of the instrument, the potential outcomes of treatment receipt, and the potential outcomes of the ultimate outcome of interest and to explore violations of the five assumptions in the model.

The instrumental variables setup is perfectly analogous to a randomized experiment with noncompliance (see Section 17.6). The instrument is equivalent to the random assignment. If some units do not comply with their assignment (some in the treatment group don’t take treatment or some in the control group do take treatment), then a comparison of groups according to the treatment variable will be biased by unobserved confounding. The best we can do under noncompliance is to redefine the estimand to be the complier average causal effect (equivalent to the LATE), then estimate it via two-stage least squares. The required excludability assumption is that the assignment to treatment can’t affect the outcome except through the realized treatment variable, which may or may not hold in a given experimental setting.

15.5 Regression discontinuity designs

We declare a design in which the assignment of a treatment is determined by whether a unit exceeds an arbitrary threshold. We demonstrate through diagnosis the bias-variance tradeoff at the heart of the choice of answer strategies that target the local average treatment effect inquiry, defined right at the cutoff.

Regression discontinuity designs exploit substantive knowledge that treatment is assigned in a particular way: everyone above a threshold is assigned to treatment and everyone below it is not. Even though researchers do not control the assignment, substantive knowledge about the threshold serves as a basis for a strong causal identification claim.

Thistlewhite and Campbell introduced the regression discontinuity design in the 1960s to study the impact of scholarships on academic success. Their insight was that students with a test score just above a scholarship cutoff were plausibly comparable to students whose scores were just below the cutoff, so any differences in future academic success could be attributed to the scholarship itself.

Just like instrumental variables, regression discontinuity designs identify a local average treatment effect: in this case, the average effect of treatment exactly at the cutoff. The main trouble with the design is that there is vanishingly little data exactly at the cutoff, so any answer strategy needs to use data that is some distance away from the cutoff. The further away from the cutoff we move, the larger the threat of bias.

Regression discontinuity designs have four components: A running variable \(X\), a cutoff, a treatment variable \(D\), and an outcome \(Y\). The cutoff determines which units are treated depending on the value of the running variable. The running variable might be the Democratic party’s margin of victory at time \(t-1\); and the treatment, \(D\), might be whether the Democratic party won the election in time \(t-1\). The outcome, \(Y\), might be the Democratic vote margin at time \(t\).

A major assumption required for regression discontinuity is that the conditional expectation functions — a function that defines the expected value of the outcome at every level of the running variable — for both treatment and control potential outcomes are continuous at the cutoff.16

The regression discontinuity design is closely related to the selection-on-observables design in that exact knowledge of the assignment mechanism is needed. In Figure 15.4, we illustrate the DAG for the RD design, which includes a treatment with a known assignment mechanism (exactly that units with \(X>\mathrm{cutoff}\) are treated and those below are not). We highlight that the exogenous variable \(X\), the running variable, is a common cause both of treatment and the outcome, and so must be adjusted for (dashed line) in order to identify the effect of treatment on the outcome and avoid confounding bias. However, confounding bias remains far from the cut-off, even after adjustment, because of the possibly different functional forms linking \(X\) to \(Y\).

Directed acyclic graph (DAG) for the regression discontinuity design.

Figure 15.4: Directed acyclic graph (DAG) for the regression discontinuity design.

We illustrate major features of the model for the regression discontinuity using simulated data in Figure 15.5. The raw data is plotted as dots with untreated data (blue dots) to the left of the cutoff (vertical line) and treated data (red dots) to the right of the cutoff. The true conditional expectation functions for both the treated potential outcome and the control potential outcome, which are fourth-order polynomials, are displayed as colored lines, dashed where they are unobservable and solid where they are observable. The discontinuity at the cutoff, above zero, is visible from the raw data alone: the red dots just to the right of the cutoff have higher average values than the blue dots to the left of it. The true treatment effect at the cutoff is illustrated by the dark vertical line, which is the difference between the two polynomials from which the data are drawn. There is a positive treatment effect of about 0.15.

Data from a regression discontinuity design. On the x-axis is the running variable, which defines the cut off between control (here, below 0 on the running variable) and treatment (above 0). The outcome Y is on the y-axis. We illustrate smooth potential outcomes by showing the observed data curve (solid) and the counterfactual potential outcome (dashed).

Figure 15.5: Data from a regression discontinuity design. On the x-axis is the running variable, which defines the cut off between control (here, below 0 on the running variable) and treatment (above 0). The outcome Y is on the y-axis. We illustrate smooth potential outcomes by showing the observed data curve (solid) and the counterfactual potential outcome (dashed).

The inquiry in a regression discontinuity design is the effect of the treatment exactly at the cut-off. Formally, it is the difference in the conditional expectation functions of the control and treatment potential outcomes when the running variable is exactly zero. The black vertical line in the plot shows this difference.

The data strategy is to collect the outcome data for all units within a interval (bandwidth) around the cut-off; there must be data above and below. The choice of bandwidth, which we explore further below, involves a tradeoff between bias and variance. The wider we make the bandwidth, the more data we have, and thus the more precision we may be able to achieve. However, the claim of causal identification comes from the comparability of units just above and just below the threshold. Units far from the cut-off are not likely to be comparable: at a minimum they differ in their values of the running variable and if that is correlated with other unobserved heterogeneity then they differ in that too. Candidates who just win reelection because they get a vote share just over 50%, at 50.01% for example, are likely to be comparable to those who get just below, such as 49.99%, in terms of resources and favorability. Candidates who win in blowout elections with 65% of the vote are likely to have more resources and favorability from those who lose with 35% of the vote. Thus, the more units we compare further from the threshold, the more likely we are to induce bias in our estimates of the treatment effect. However, as we describe now, optimal bandwidth selectors tradeoff bias and variance in the selection of the bandwidth and other parameters, and so collecting data more expansively if inexpensive will still enable subsetting to an appropriate bandwidth at the analysis stage.

Regression discontinuity answer strategies approximate the treated and untreated conditional expectation functions to the left and right of the cutoff. The choice of answer strategy involves two key choices: the bandwidth of data around the threshold that is used for estimation, and the model used to fit that data. We declare the local polynomial regression discontinuity estimator with robust bias-corrected inference procedures described in Calonico, Cattaneo, and Titiunik (2014), which fits a nonparametric model to the data and chooses an optimal bandwidth that minimizes bias. This robust estimator is now widely used in practice, because it navigates the bias-variance tradeoff in selecting bandwidths and statistical model specifications we describe below in a data-driven manner and removes the need for researchers to select these parameters themselves.

Declaration 15.5 \(~\)

cutoff <- 0.5
control <- function(X) {
  as.vector(poly(X, 4, raw = TRUE) %*% c(.7, -.8, .5, 1))}
treatment <- function(X) {
  as.vector(poly(X, 4, raw = TRUE) %*% c(0, -1.5, .5, .8)) + .15}

design <-
    N = 100,
    U = rnorm(N, 0, 0.1),
    X = runif(N, 0, 1) + U - cutoff,
    D = 1 * (X > 0),
    Y_D_0 = control(X) + U,
    Y_D_1 = treatment(X) + U
  ) +
  declare_inquiry(LATE = treatment(0.5) - control(0.5)) +
  declare_measurement(Y = reveal_outcomes(Y ~ D)) + 
    Y, X, c = 0, vce = "hc2",
    term = "Bias-Corrected",
    handler = label_estimator(rdrobust_helper),
    inquiry = "LATE",
    label = "optimal"


To illustrate the tradeoffs in choosing bandwidths and statistical models more generally, we consider a broader class of models holding bandwidth fixed and then explore varying the bandwidth.

A simple statistical modeling strategy would be a fully-saturated model with a one-degree polynomial for the running variable interacted with treatment. Controlling for the running variable through the polynomial in addition to the treatment variable is a recognition that units far to the left or far to the right of the cutoff may have different potential outcomes (adjusting closes the back-door path through the running variable). We illustrate data from the regression discontinuity design in Figure 15.6 (top left panel) with a one degree polynomial (linear regression) interacted with treatment, such that there are different slopes and intercepts to the left and right of the cut-off. The next panels in the top of the figure illustrate the same data with three other choices of model fit: a second-, third-, and fourth-degree polynomial. The dark black interval indicates the estimated treatment effect; it is the vertical distance between the predicted value of the treatment polynomial at the cut-off and the predicted value for the control polynomial at the cut-off. We can see that each estimates a slightly different value for the treatment effect: smallest for the one-degree polynomial.

Intuitively, the more degrees in the polynomial, the better the fit with the data. If the data is in fact drawn from a model with a lower-degree polynomial, the higher-order statistical model fit reduces to a lower one (estimating zero for some parameters). However, choosing a higher-order polynomial is less efficient than a lower-degree polynomial. We can see this in the sampling distribution of the estimates from each polynomial fit across draws of the data in the lower panel of Figure 15.6. On the left, the the one-degree polynomial model exhibits low variability in estimates; the fourth-degree polynomial estimates are much more variable. However, the center of the fourth-order polynomial is much closer to the estimand (vertical red dashed line) than is the one-degree polynomial, which is highly biased. In short, there is a bias-variance tradeoff in the selection of polynomial: you may improve model fit with higher-order polynomials, but this will introduce variability in the estimates.

Polynomial degree choice.

Figure 15.6: Polynomial degree choice.

The second central choice in a regression discontinuity estimator is the bandwidth: how far from the threshold will data be used to fit the model. The choice also amounts to a bias-variance tradeoff: the wider the bandwidth and thus the more data used, the more efficient the estimator is; but the wider the bandwidth, the further from the point at which units are guaranteed to be similar due to the assumption that nothing differs at the limit just above and just below the threshold but the difference in treatment status. In other words, the further away from the threshold, the bigger the differences there are in the potential outcomes from those at the cut-off. With units with very different potential outcomes, the model fit may be biased.

We illustrate in Figure 15.7 this tradeoff in bandwidth selection for a one-degree polynomial (linear interaction). We see the bias-variance tradeoff directly: the bias increases the wider the bandwidth and the more we are relying on data further from the threshold whereas the variance (here, the standard deviation of the estimates) is decreasing the more data we add by widening the bandwidth.

Bias-variance tradeoff in bandwidth selection for two regression discontinuity design estimators.

Figure 15.7: Bias-variance tradeoff in bandwidth selection for two regression discontinuity design estimators.


  1. Gelman and Imbens (2017) point out that higher order polynomial regression specifications lead to extreme regression weights. One approach to obtaining better estimates is to select a bandwidth, \(h\), around the cutoff, and run a linear regression. Declare a sampling procedure that subsets the data to a bandwidth around the threshold, as well as a first order linear regression specification, and analyze how the power, bias, RMSE, and coverage of the design vary as a function of the bandwidth.

  2. The rdrobust estimator in the rdrobust package implements a local polynomial estimator that automatically selects a bandwidth for the RD analysis and bias-corrected confidence intervals. Declare another estimator using the rdrobust function and add it to the design. How does the coverage and bias of this estimator compare to the regression approaches declared above?

  3. Reduce the number of polynomial terms of the treatment() and control() functions and assess how the bias of the design changes as the potential outcomes become increasingly linear as a function of the running variable.

  4. Redefine the population function so that units with higher potential outcome are more likely to locate just above the cutoff than below it. Assess whether and how this affects the bias of the design.