18 Complex designs

In the designs we presented thus far, the aim was generally to learn about the level of some variable or some particular causal effect. In most cases, a single set of observations was collected and an answer strategy was applied directly to this data to answer a causal or descriptive inquiry.

Not many published studies have exactly this simple form. Most draw in complex ways on a series of research designs, each targeted at a different inquiry, that when brought together answer the deeper theoretical question at the heart of the research. Research studies are complex in this way.

But studies can also be “complex” in multiple ways. For instance, although we have assumed researchers start with well-defined inquiries, some studies focus on first figuring out what question to ask and then proceed to ask and answer it. Such a study engages in model building and then reports the results of a research design targeted at the questions posed by the new model.

Some studies seek not to learn about levels and effects but search explicitly for a model of a phenomenon, asking for instance “what causes \(Y\)?” or “what model, in some class, best accounts for observed data?” These studies have complex inquiries. Other studies have complex data and answer strategies, for instance, mixing qualitative and quantitative inference strategies or gathering together findings from multiple sub-studies in order to arrive at an overall conclusion.

18.1 Mixed methods

We declare a design in which a researcher combines qualitative and quantitative data to draw integrated inferences about causal effects. We imagine that a researcher has access to experimental data on \(X\) and \(Y\) from many cases but can also invest in gathering “process” data on a mediator \(M\) from a small number of cases. Overall inferences are made based on patterns in both sets of data. Diagnosis helps researchers figure out the expected gains from process data and helps determine which kinds of cases are most useful for maximizing expected learning.

In section 15.1 we described a qualitative strategy for making case-level inferences by combining within-case observations with a background causal model. In section 17.1 we described standard approaches for making population-level inferences from experimental data. While these approaches appear to use very different types of data, ask different questions, and employ different inferential strategies, it is possible to combine the two to form integrated inferences. This is one way of doing “mixed methods” analysis (Humphreys and Jacobs 2017).

M: We imagine a setting in which a binary treatment \(X\) is randomized. Outcomes for \(Y\), also binary, are recorded. We imagine that researchers can also gather data on a presumed mediator, \(M\), for the effect of \(X\) on \(Y\). Our model involves a few assumptions. First, we imagine that any effect of \(X\) on \(Y\) must go through \(M\). Second, we imagine that \(X\) does not negatively affect \(M\) and \(M\) does not negatively affect \(Y\). For instance, we might imagine that providing a drug might cure a disease, but only if the drug is actually consumed. We assume that \(X\) is randomized, but we do not assume “sequential ignorability”; in particular, we do not assume that \(M\) is as-if random given \(X\). Thus the kinds of assumptions we impose here are similar to those invoked when doing estimation using instrumental variables (see section 15.4).

We set up a causal model using the CausalQueries package on binary nodes and impose an assumption that \(X\) does not negatively affect \(M\) and \(M\) does not negatively affect \(Y\). This model (summarized by the DAG \(X \rightarrow M \rightarrow Y\)) can be parameterized by a Bernoulli distribution giving the probability that \(X=1\), a Categorical distribution over the three ways that \(M\) can respond to \(X\) (\(M\) responds positively to \(X\), and \(M=1\) or \(M=0\) regardless), and a Categorical distribution over the three ways that \(Y\) can respond to \(M\) (\(Y\) responds positively to \(M\), and \(Y=1\) or \(Y=0\) regardless) given each of the three ways that \(M\) responds to \(X\): this is equivalent to a joint distribution over types for \(M\) and types for \(Y\), allowing, for instance, for the possibility that \(M\) affecting \(Y\) is more likely for cases in which \(X\) affects \(M\).

We assume that we know that \(X\) is assigned with 50% probability, but otherwise provide a “flat” Dirichlet prior probability distribution over these parameter sets by stipulating \(\alpha = 1\) for each parameter set. This prior is used twice in the design: first, the prior characterizes the model set \(M\), and a draw \(m\) is a draw from the induced prior; second, the prior is used again in the answer step as part of Bayesian updating. We emphasize, however, that the same distribution does not have to be used for these two purposes: we can assess the performance of a design using a distribution on \(M\) that is quite different from that employed by the researcher when generating answers.

causal_model <- make_model("X -> M -> Y; M <-> Y") %>%
  # Concentrated prior on P(X = 1): assignment probability is known to be 50%
  set_priors(node = "X", alpha = c(100, 100)) %>%
  # Monotonicity: rule out X negatively affecting M and M negatively affecting Y
  set_restrictions(labels = list(M = "10",  Y = "10"))

Table 18.1 shows parameters corresponding to the ways that \(M\) can respond to \(X\) and \(Y\) to \(M\) given \(M\)’s type. Each “run” of the model involves one draw of the parameters from each parameter set.

Table 18.1: Parameters correspond to causal types for \(M\) and \(Y\), with \(\lambda^V_{ab}\) meaning that \(V\) takes value \(a\) when \(V\)’s parent is 0 and \(b\) when \(V\)’s parent is 1. There are 12 parameters for \(M\) and \(Y\) in all and 8 degrees of freedom.
Parameter set Parameter Prior (\(\alpha\))
M \(\lambda^M_{00}\) 1
M \(\lambda^M_{01}\) 1
M \(\lambda^M_{11}\) 1
Y | \(M_{00}\) \(\lambda^Y_{00|00}\) 1
Y | \(M_{00}\) \(\lambda^Y_{01|00}\) 1
Y | \(M_{00}\) \(\lambda^Y_{11|00}\) 1
Y | \(M_{01}\) \(\lambda^Y_{00|01}\) 1
Y | \(M_{01}\) \(\lambda^Y_{01|01}\) 1
Y | \(M_{01}\) \(\lambda^Y_{11|01}\) 1
Y | \(M_{11}\) \(\lambda^Y_{00|11}\) 1
Y | \(M_{11}\) \(\lambda^Y_{01|11}\) 1
Y | \(M_{11}\) \(\lambda^Y_{11|11}\) 1

I: We focus on two inquiries: the average treatment effect and an “attribution” estimand: the probability that \(X=1\) caused \(Y=1\) in a case with \(X=1\) and \(Y=1\). The estimands are calculated by querying the causal model given the parameter vector that has been drawn.

D: The quantitative part of the data strategy involves selecting a set of units, randomly assigning \(X\), and measuring the outcome \(Y\). The qualitative part involves selecting a set of cases for which we get to observe \(M\). Recall that in section 8.1 we described case selection strategies to learn about the effects of \(X\) on \(Y\) in which researchers first assessed global relations between \(X\) and \(Y\) and then selected cases with particular known values of \(X\) and \(Y\) for further investigation. That discussion supposed that the quantitative data structure was fixed. In contrast, the design here lets us simultaneously examine the gains from large-\(n\) decisions and different case selection strategies. In practice we will examine strategies that involve 20 case studies and compare a strategy that selects from the regression line (\(X=Y=0\) and \(X=Y=1\) cases) against a strategy that selects positive cases only (\(X=Y=1\) cases). One wrinkle in the design is that we specify the within-case analysis strategy before we know the \(X\), \(Y\) pattern. This gives rise to the problem that we might want to examine 20 cases on the regression line (for instance) but have only 5 to choose from. To address this in a way consistent with Principle 3.6, we specify that we will gather data on as many cases as possible given the observed data.

# Quantitative data: draw X, M, Y data from the causal model given the parameter draw
data_handler <- function(data, n)
  CausalQueries::make_data(causal_model,
                           parameters = data$parameters,
                           n = n)

# Case selection data targets within X, Y data combinations
strategy <- c("00" = 10, "01" = 0, "10" = 0, "11" = 10)
strategy_names <- names(strategy)

# strata_n is flexible to take account of the possibility of no data in some strata
strata_n <- function(strategy, strata) 
  sapply(1:4, function(i) min(strategy[i], sum(strata == strategy_names[i])))[strategy_names %in% strata]
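To see how this capping works, consider a hypothetical draw in which the strategy requests 10 cases each from the “00” and “11” strata but the realized data contain only five “11” cases (the definitions above are restated so the sketch is self-contained):

```r
# Case selection strategy: 10 cases each from the "00" and "11" strata
strategy <- c("00" = 10, "01" = 0, "10" = 0, "11" = 10)
strategy_names <- names(strategy)

# Cap requested case counts at the number of cases actually available
strata_n <- function(strategy, strata)
  sapply(1:4, function(i)
    min(strategy[i], sum(strata == strategy_names[i])))[strategy_names %in% strata]

# Hypothetical realized strata: twelve "00" cases but only five "11" cases
strata <- c(rep("00", 12), rep("11", 5))
strata_n(strategy, strata)  # 10 cases from "00", but only 5 from "11"
```

The empty strata (“01” and “10”) drop out entirely, so only the strata present in the data receive targets.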

A: The answer strategy involves updating the causal model given the observed data. Bayesian updating is implemented in Stan via the CausalQueries package, using the multinomial likelihood implied by the causal model. The updated model is then queried using the same function used to generate the estimand, illustrating the application of Principle 3.7 (seek parallelism) in Bayesian analysis.

estimation_handler <- function(data) 
  causal_model %>% 
  update_model(data = data) %>%
  query_model(query = "Y[X=1] - Y[X=0]", 
              using = "posteriors", 
              given = c(TRUE, "X==1 & Y==1")) %>%
  rename(estimate = mean) %>%
  select(estimate, sd) %>% 
  mutate(Inquiry = c("ATE", "POC"))

The full design is then declared as follows.

n <- 50

design <-

  declare_model(
    data.frame(parameters = CausalQueries::get_parameters(causal_model, param_type = "prior_draw"))) +
  declare_inquiry(ATE = CausalQueries::query_model(causal_model, "Y[X=1] - Y[X=0]", 
    parameters = parameters, using = "parameters")$mean) +   
  declare_inquiry(POC = CausalQueries::query_model(causal_model, "Y[X=1] - Y[X=0]", given = "X==1 & Y==1",  
    parameters = parameters, using = "parameters")$mean) +
  declare_measurement(handler = data_handler, n = n) +
  declare_measurement(
    strata = paste0(X, Y),
    M_observed = strata_rs(strata = paste0(X, Y), strata_n = strata_n(strategy, strata)),
    M = ifelse(M_observed == 1, M, NA)) +
  declare_estimator(handler = label_estimator(estimation_handler), inquiry = c("ATE", "POC")) 

We extract as our key diagnosand the expected posterior variance over runs. This quantity tells us how uncertain we expect to be after implementing the study. If the expected posterior variance were 0, we would expect to be certain of whatever answer we get, no matter what data we see. If the expected posterior variance is the same as the prior variance, then we do not expect to learn anything, regardless of what we see. The expected posterior variance also has a nice interpretation as a measure of the expected squared error of the posterior mean (with expectations taken over the prior).
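The last interpretation can be seen directly. Writing \(\hat\theta(D) = E[\theta \mid D]\) for the posterior mean of estimand \(\theta\) given data \(D\),

\[
E_D\left[\mathrm{Var}(\theta \mid D)\right]
= E_D\left[E\left[(\theta - \hat\theta(D))^2 \mid D\right]\right]
= E\left[(\theta - \hat\theta(D))^2\right],
\]

where the outer expectation is over data drawn from the prior predictive distribution.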

mixed_diagnosands <- 
  declare_diagnosands(mean_estimate = mean(estimate),
                      sd_estimate = sd(estimate),
                      bias = mean(estimate - estimand),
                      posterior_variance = mean(sd^2))
diagnosis <- design %>%
  redesign(n = c(50, 100, 150), 
           strategy = list(c(0,0,0,0), c(10,0,0,10), c(0,0,0,20))) %>% 
  diagnose_design(sims = 2, diagnosands = mixed_diagnosands)

Diagnoses for a range of quantitative and qualitative data strategies are shown in Figure 18.1 for both estimands. The figure shows first that there is learning from case studies for the average treatment effect inquiry. Learning is greatest when cases are drawn from the regression line, but there is also some learning even when there is no variation in \(X\) and \(Y\) across the cases. For both inquiries, however, we see that we reduce variance by about the same amount by collecting data on \(M\) within 20 cases as by gathering data on \(X\) and \(Y\) for an additional 20 cases. With similar marginal gains, whether marginal resources should be allocated to going “wide” or going “deep” likely depends on the relative costs of these strategies.


Figure 18.1: Expected posterior variance

As with the Bayesian analysis in Section 15.1, the design declared here uses the same model and priors for the reference model set (M) and for the answer strategy. This is not required, however. Instead, you can assess the performance of the answer strategy with respect to reference model sets that differ from the one used in the answer strategy in at least three ways: in the assumed causal structure, in the monotonicity assumptions, and in the prior distribution over parameters.

18.2 Discovery using causal forests

We declare a design in which a researcher examines a large set of continuous covariates to assess (i) which covariate best explains heterogeneity in the effects of a treatment and (ii) the effect for the subjects for whom the treatment effect is weakest or strongest. The design declaration clarifies the inquiries in a discovery analysis and can be used to guide design decisions regarding how best to split data into training and testing sets.

In most designs we have discussed, researchers have a clear idea of what they are looking for when they begin the research. How big is some population? What is the size of some effect? But some research involves questions that are much more open in nature. We focus here on discovery research that has two types of open question. The first inquiry poses an open question of the form “what matters?” (rather than the more common closed question of the form “what is the effect of this on that?”). The second inquiry poses a question about a not-yet-identified group: who are the people for whom effects are especially strong or weak, and what is the effect for those people?

We imagine a setting in which a researcher has access to a large group of covariates \(X\) and has conducted an experiment to assess the effects of \(Z\) on \(Y\). The researcher is interested in the heterogeneity of effects of \(Z\) as a function of variables in \(X\) and in particular asks:

  • Which covariate best “explains” variation in the effect of \(Z\)?
  • What combinations of covariates describe individuals for whom effects are particularly weak?

M: We first need to describe a background model, or set of models. For this we imagine a complex function linking \(Z\) to \(Y\) in which the effect of \(Z\) depends nonlinearly on some but not all variables in \(X\). As always, this model can and should be altered to help understand whether the diagnosis depends on the particular type of structure assumed.

covariates <- paste0("X.", 1:10)

f_Y <- function(z, X.1, X.2, X.3, X.4, u) {
  z * X.1 + z * X.2 ^ 2 + z * exp(X.3) + z * X.3 * X.4 + u
}
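Because \(u\) enters additively, the unit-level treatment effect implied by this function does not depend on \(u\): it is \(X.1 + X.2^2 + \exp(X.3) + X.3 \cdot X.4\), so heterogeneity depends on \(X.1\) through \(X.4\) only. A quick check with illustrative covariate values (hypothetical, chosen only for the arithmetic):

```r
f_Y <- function(z, X.1, X.2, X.3, X.4, u) {
  z * X.1 + z * X.2 ^ 2 + z * exp(X.3) + z * X.3 * X.4 + u
}
# tau = X.1 + X.2^2 + exp(X.3) + X.3 * X.4; the disturbance u cancels
tau <- f_Y(1, X.1 = 0.5, X.2 = 1, X.3 = 0, X.4 = 2, u = 10) -
       f_Y(0, X.1 = 0.5, X.2 = 1, X.3 = 0, X.4 = 2, u = 10)
tau  # 0.5 + 1 + 1 + 0 = 2.5
```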

I: For the first inquiry we need to be specific about what we mean by “best explains.” We will imagine asking which \(X\) produces the lowest conditional variance of the treatment effect, \(E_x[\mathrm{Var}(Y(1)- Y(0)\mid X=x)]\). Specifically, we will partition each covariate into bins and take the average variance in treatment effects across the bins. We will call this the best_predictor inquiry and calculate it using a fixed effects model applied to potential outcomes data.
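Formally, with \(\tau_i = Y_i(1) - Y_i(0)\) and \(b_j(\cdot)\) a function assigning covariate \(X_j\) to bins, the inquiry can be written as

\[
\text{best\_predictor} = \arg\min_j \; E\left[\mathrm{Var}\left(\tau_i \mid b_j(X_{ij})\right)\right].
\]

Since the total variance of \(\tau\) is fixed across candidate covariates, minimizing the average within-bin variance is equivalent to maximizing the \(R^2\) from a regression of \(\tau\) on the bin indicators.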

The following handler function calculates this estimand.

best_predictor <- function(data) {
  data.frame(
    inquiry = "best_predictor",
    estimand = lapply(covariates, function(j) {
      lm_robust(tau ~ cut(data[[j]], 20), data = data)$r.squared
    }) %>%
      unlist() %>% which.max()
  )
}

There is a simple and a more complex understanding of the second inquiry. The simple understanding is that we are interested in the average effect among the units whose effects are in the bottom 20% (say) of all effects. We will call this the worst_effects inquiry. This is a natural notion of the worst affected, but it is a very difficult quantity to estimate.

Another approach is to examine realized data and do our best to identify a set of units (say of size 20%) that we think will have weak effects, and with this set identified we will return to the model and ask what is the average effect for this set. We will call this the weak_effects inquiry, to acknowledge that the effects for this group may not be the worst effects.

D: We assume the data strategy is the same as a simple two-arm trial (Section 17.1).

A: The answer strategy employs a “causal forests” algorithm. The approach randomly partitions the data into a training group and a testing group. Within the training group it generates many “trees” by repeatedly partitioning the covariate space (generating “branches”) to identify subgroups (“leaves”) with similar treatment effects. At each step, partitioning is implemented to yield the minimum estimated variance in treatment effects, and unit-level predictions of treatment effects are generated by combining estimates of effects for units over different trees. See Wager and Athey (2018) for full details of the approach. Our estimate of the best_predictor is based on the variable that is most frequently partitioned to reduce variance. We have two estimates of the weak effects, one based on the training set and one based on the test set. We will assess the performance of these against both the weak_effects inquiry and the worst_effects inquiry.

We note that the “most common variable” indicator is not designed to capture the variable that induces the greatest reduction in heterogeneity. Indeed it is not very clear which inquiry corresponds to this measure, and, perhaps for this reason, Wager and Athey (2018) do not emphasize this quantity. Including it here, however, allows us to illustrate the ability of diagnosis to assess the performance of an estimator for a task for which it was not designed.

The estimates are generated by a causal forest handler that makes use of the grf package. In addition to generating these estimates, the handler computes the groupings that define weak_effects, our post-estimation estimand.

causal_forest_handler <- function(data, ...) {

  # Fit the forest on the training half
  X <- as.matrix(data %>% select(all_of(covariates)))
  train <- data$train
  cf <- causal_forest(X = X[train, ], Y = data$Y[train], W = data$Z[train], ...) 
  # Predictions: out-of-bag for training units, out-of-sample for test units
  data$pred <- NA
  data$pred[train]  <- predict(cf, estimate.variance = FALSE)$predictions
  data$pred[!train] <- predict(cf, newdata = X[!train, ], estimate.variance = FALSE)$predictions

  data %>%
    mutate(var_imp = variable_importance(cf) %>% which.max,
           low_test  = (!train & (pred < quantile(pred[!train], .2))),
           high_test = (!train & (pred > quantile(pred[!train], .8))),
           low_all = pred < quantile(pred, .2))
}

take_1 <-
  function(data) {
    data %>%
      slice(1) %>%
      mutate(estimate = var_imp) %>%
      select(estimate)
  }

Declaration 18.1 \(~\)

N <- 1000

design <- 
  declare_model(
    N = N,  
    X = matrix(rnorm(10 * N), N),
    u = rnorm(N),
    Z = sample(0:1, N, replace = TRUE)) + 
  declare_measurement(handler = fabricate,
    Y_1 = f_Y(1, X.1, X.2, X.3, X.4, u),
    Y_0 = f_Y(0, X.1, X.2, X.3, X.4, u), 
    tau = Y_1 - Y_0,
    Y = f_Y(Z, X.1, X.2, X.3, X.4, u), 
    train = simple_rs(N) == 1) +
  declare_inquiry(handler = best_predictor, label = "custom") +
  declare_step(handler = causal_forest_handler) +
  declare_inquiry(
    worst_effects = mean(tau[tau <= quantile(tau, .2)]),
    weak_effects = mean(tau[low_test]),
    weak_all = mean(tau[low_all]),
    strong_effects = mean(tau[high_test])) +
  declare_estimator(Y ~ Z, model = lm_robust, subset = low_test, 
                    inquiry = c("weak_effects", "worst_effects"), label = "lm_weak") +
  declare_estimator(Y ~ Z, model = lm_robust, subset = low_all, 
                    inquiry = "weak_all", label = "lm_weak_all") +
  declare_estimator(Y ~ Z, model = lm_robust, subset = high_test, 
                    inquiry = "strong_effects", label = "lm_strong") +
  declare_estimator(handler = label_estimator(take_1),
                    inquiry = "best_predictor", label = "cf") 


Declaration 18.1 is relatively straightforward, though we point out that the causal forests estimation is introduced as a general step and not specifically as an estimation step. The reason is that the procedure produces predictions for each observation, so its output is naturally added to the primary data frame. Estimates are then defined using this output.

Before turning to diagnosis, we can check the performance of the causal forest model by comparing the predicted effects for each unit generated by the model with the actual unit-level treatment effects generated by M. Figure 18.2 illustrates.


Figure 18.2: One draw from the causal forests design

We see that we do reasonably well but also that the range of the predictions is narrower than the range of the true effects, which will mean that the average effects in the groups with the best or worst predicted effects will generally not be the same as the effects for the groups with the best and worst actual effects.

For the diagnosis we need to take account of the fact that the answer to one of the inquiries (“Which X accounts for the most variation in effects?”) should be treated as categorical. For this reason, rather than relying on the average estimate and average estimand we report the modal estimate and estimand; and rather than relying on bias we calculate the probability that we get the correct answer.

Table 18.2: Causal forests design diagnosis
Inquiry Estimator Correct Bias Mean Estimate Modal Estimate Mean Estimand Modal Estimand
best_predictor cf 0.99 -0.01 2.99 3.00 3.00 3.00
(0.00) (0.00) (0.00) (0.00) (0.00) (0.00)
strong_effects lm_strong 0.00 0.00 6.01 5.90 6.01 5.90
(0.00) (0.02) (0.03) (0.12) (0.02) (0.11)
weak_all lm_weak_all 0.00 0.03 0.07 0.10 0.04 -0.00
(0.00) (0.01) (0.01) (0.07) (0.01) (0.00)
weak_effects lm_weak 0.00 0.01 0.04 -0.00 0.03 0.00
(0.00) (0.01) (0.01) (0.04) (0.01) (0.02)
worst_effects lm_weak 0.00 0.41 0.04 -0.00 -0.37 -0.40
(0.00) (0.01) (0.01) (0.04) (0.00) (0.01)

We see that we do very well in identifying the most powerful predictor of heterogeneity, getting the correct answer nearly all of the time. (We are never exactly “correct” for the continuous estimands, but we would never expect to be.) Our in-sample (full sample) estimate of effects for the weak group is biased. Our out-of-sample subgroup estimates, however, recover the average effects for the “weak” and “strong” groups without bias. Finally, we see that the out-of-sample estimate for the weak group does not recover the effects for the “worst” group. This highlights the fact that the procedure can get unbiased estimates for a group that does poorly, but not for the group that does most poorly.

Overall the approach fares very well and through diagnosis we get clarity over which quantities are well estimated.

Modifications of this design let you assess how sensitive performance is to the type of model stipulated and what the best division of the data between training and test sets is for any stipulated model.

18.3 Structural estimation

We declare a design in which a researcher wants to use experimental data to estimate the parameters of a game theoretic model. The promise of the approach is that, if the underlying theory is correct, parameter estimation allows the researcher to make interesting external inferences to different types of treatment effect in other settings. The risk is that if the model is wrong these inferences may be invalid. Diagnosis helps assess the gains and risks of the approach.

Structural estimation is used in situations where researchers have a general model in mind for how processes work and their goal is to fit the parameters of the model. With the fitted model they might then estimate levels of unobserved variables, treatment effects, or other quantities. They might even extrapolate to estimate counterfactual quantities, such as the effects of interventions that have not been implemented (Reiss and Wolak 2007).

We illustrate using an example of a model of bargaining between pairs of players, drawing on an example used in Wilke and Humphreys (2020).

Wilke and Humphreys (2020) imagine a bargaining game with payments from customer \(i\) to a taxi driver given by:

\[\pi_i = \theta_i(z_iy + (1-z_i)(1-y)) + (1-\theta_i)\chi\]

Here \(y = \sum_{j = 2}^n(-1)^{j}\delta^{j-1}\) is the equilibrium offer made by the first mover as predicted by the Rubinstein (1982) alternating-offers bargaining model with \(n\) possible bargaining rounds given discount factor \(\delta\). The customer’s payoff depends on whether she goes first (\(z_i = 1\)) or second (\(z_i = 0\)). Non-rational customers (\(\theta_i = 0\)) do not engage in bargaining but successfully invoke a norm, insisting on giving the driver some share \(\chi\) of their endowment, irrespective of whether they go first or second. We let \(q\) denote the probability that \(\theta = 1\).
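As a check on this formula, the sum reduces to familiar expressions for the two game lengths used in this design:

\[
n = 2:\quad y = (-1)^2\delta^{2-1} = \delta; \qquad
n = \infty:\quad y = \delta - \delta^2 + \delta^3 - \cdots = \frac{\delta}{1+\delta}.
\]

These are the two payment schedules that appear in the potential outcomes of the declaration below.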

To allow for a disturbance we assume that measured payments are a draw from a Beta distribution with expectation given by the expected payment and variance parameterized by \(\kappa\).
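A standard mean–precision parameterization of the Beta distribution, used here, sets the shape parameters to \(\kappa\mu\) and \(\kappa(1-\mu)\) for an expected payment \(\mu\):

\[
\pi^{obs}_i \sim \mathrm{Beta}\left(\kappa\mu_i, \kappa(1-\mu_i)\right), \qquad
E[\pi^{obs}_i] = \mu_i, \qquad
\mathrm{Var}(\pi^{obs}_i) = \frac{\mu_i(1-\mu_i)}{\kappa + 1},
\]

so larger values of \(\kappa\) imply less dispersion around the expected payment.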

We imagine that \(Z\) is randomly assigned and that we have access to data on payments, \(\pi\). We will also assume we know the price \(\chi\). Our goal, however, is not simply to measure the effect of \(Z\) on \(\pi\) but to estimate the model parameters \(q\), \(\delta\), and \(\kappa\), which themselves can be used to estimate this effect and other counterfactual quantities (if we assume the model is true).

M. In this declaration we will assume that the data is indeed produced by a process similar to that assumed at the estimation stage.

I. Our inquiries will include parameters k, d, and q corresponding to \(\kappa\), \(\delta\), and \(q\) in the model (we will treat \(\chi\) as known). In addition, we will be interested in the effect of \(Z\) on payments in a two-period game and in the game of indefinite duration.

D. First mover position, \(Z\), is randomly assigned. Payments are measured with some error (in this model however we cannot distinguish between measurement error and decision making error). We imagine that we have access to data from games with \(n=2\) and games with \(n=\infty\) in order to compare the performance of estimators in both conditions.

A. The analysis is implemented using maximum likelihood to identify which parameter values are most consistent with the data (which collection of parameter values produces the observed data with greatest likelihood). In this declaration the model employed in A is the same as that employed in M. We will report analysis results from both differences in means and structural estimation, generated both from the \(n=2\) data (DIM_two, Struc_two) and from the \(n=\infty\) data (DIM_inf, Struc_inf).

The most complex part of the design is the specification of the estimator, shown next:

structural_estimator <- function(data, pi, y, chi = 3/4){
  # Define negative log likelihood as a function of k, d, and q
  LL  <- function(k, d, q) {
    m <- with(data, y(Z, d))
    R <- q * dbeta(data[pi][[1]], k * chi, k * (1 - chi)) +
      (1 - q) * dbeta(data[pi][[1]], k * m, k * (1 - m))
    - sum(log(R))
  }
  # Estimation
  M <- mle2(
    LL,
    method = "L-BFGS-B",
    start = list(k = 2, d = 0.50,  q = 0.50),
    lower = c(k = 1,    d = 0.01,  q = 0.01),
    upper = c(k = 1000, d = 0.99,  q = 0.99)
  )
  # Format output from estimation
  out <- data.frame(coef(summary(M)), outcome = pi)
  names(out) <- c("estimate", "std.error", "statistic", "p.value", "outcome")
  # Use estimates of q and delta to predict average treatment effects (ATEs)
  # Predicted ATE for n=2
  out[4, 1] <- (1 - out["q", "estimate"]) * (2 * out["d", "estimate"] - 1)
  # Predicted ATE for n=infinity
  out[5, 1] <- (1 - out["q", "estimate"]) * (2 * out["d", "estimate"] /
                                               (1 + out["d", "estimate"]) - 1)
  out
}

The design makes use of this estimator to estimate parameter values as well as treatment effects. It is accompanied by a simpler difference-in-means estimator of treatment effects.

An attraction of structural estimation is that, with a fitted model, one can generate estimates of effects of treatments that have not been implemented. In this case, the same parameters that describe the equilibrium outcomes in a two round game are sufficient to describe outcomes in the infinitely repeated game. So if you understand the effects of a treatment in one case you understand it in the other. At least if the model is correct. In Declaration 18.2, we generate such estimates for the effects of unimplemented treatments.
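To make the extrapolation logic concrete, the mapping from structural parameters to treatment effects can be computed directly. Using the true parameter values stipulated in Declaration 18.2 (\(d = 0.8\), \(q = 0.5\)):

```r
# Implied average treatment effects given structural parameters
q <- 0.5   # share of norm-following ("behavioral") types
d <- 0.8   # discount factor
ate_two <- (1 - q) * (2 * d - 1)            # n = 2 game
ate_inf <- (1 - q) * (2 * d / (1 + d) - 1)  # n = infinity game
round(c(ate_two, ate_inf), 3)  # 0.300 and -0.056
```

These values match the mean estimands for ATE_two and ATE_inf in the diagnosis below: a fitted model from one game length is enough to predict the effect in the other.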

Declaration 18.2 \(~\)

d = 0.8       # True delta (unknown)
k = 6         # Parameter governing variance (unknown)
q = 0.5       # Share of behavioral types in the population (unknown)
chi = 0.75    # Price paid by norm-following ("behavioral") customers (known)

design <- 
  declare_model(
    # Define the population: indicator for behavioral type (norm = 1)
    N = 500, norm = rbinom(N, 1, q),
    # Define mean potential outcomes for n = 2 
    potential_outcomes(
      pi_two ~ norm*chi + (1-norm)*(Z*d + (1-Z)*(1-d))),
    # Define mean potential outcomes for n = infinity
    potential_outcomes(
      pi_inf ~ norm*chi + (1-norm)*(Z*d/(1+d) + (1-Z)*(1-d/(1+d))))
  ) +
  declare_inquiry(ATE_two = mean(pi_two_Z_1 - pi_two_Z_0), # ATE n = 2
                  ATE_inf = mean(pi_inf_Z_1 - pi_inf_Z_0), # ATE n = infinity
                  k = k,                                   # kappa
                  d = d,                                   # delta
                  q = q) +                                 # q
  declare_assignment(Z = complete_ra(N)) +

  declare_measurement(
    pi_two = reveal_outcomes(pi_two ~ Z),
    pi_inf = reveal_outcomes(pi_inf ~ Z),
    # Get draws from beta distribution given means for n = 2 and n = infinity
    pi_two_obs = rbeta(N, pi_two*k, (1-pi_two)*k),      
    pi_inf_obs = rbeta(N, pi_inf*k, (1-pi_inf)*k)
  ) +
  # Difference-in-means for n = 2
  declare_estimator(pi_two_obs ~ Z, inquiry = "ATE_two", label = "DIM_two") +
  # Difference-in-means for n = infinity
  declare_estimator(pi_inf_obs ~ Z, inquiry = "ATE_inf", label = "DIM_inf") +
  # MLE for n = 2
  declare_estimator(handler = tidy_estimator(structural_estimator), 
                    pi = "pi_two_obs", 
                    y = function(Z, d) Z * d + (1 - Z) * (1 - d), 
                    inquiry = c("k","d", "q", "ATE_two", "ATE_inf"), 
                    label = "Struc_two") +
  # MLE for n = infinity
  declare_estimator(handler = tidy_estimator(structural_estimator),
                    pi = "pi_inf_obs", 
                    y = function(Z, d) Z*d/(1+d) +  (1-Z)*(1-d/(1+d)),
                    inquiry = c("k","d","q","ATE_two", "ATE_inf"), 
                    label = "Struc_inf") 


Now for the diagnosis.

Table 18.3: Structural estimation design diagnosis
Design Inquiry Estimator Bias RMSE Power Mean Estimate SD Estimate Mean Estimand
design ATE_inf DIM_inf 0.00 0.02 0.80 -0.05 0.02 -0.06
  (0.00) (0.00) (0.01) (0.00) (0.00) (0.00)
design ATE_two DIM_two 0.00 0.02 1.00 0.30 0.02 0.30
  (0.00) (0.00) (0.00) (0.00) (0.00) (0.00)
design ATE_inf Struc_inf 0.00 0.01 NA -0.05 0.02 -0.06
  (0.00) (0.00) NA (0.00) (0.00) (0.00)
design ATE_two Struc_inf 0.00 0.05 NA 0.30 0.05 0.30
  (0.00) (0.00) NA (0.00) (0.00) (0.00)
design d Struc_inf 0.00 0.05 1.00 0.80 0.05 0.80
  (0.00) (0.00) (0.00) (0.00) (0.00) (0.00)
design k Struc_inf 0.05 0.44 1.00 6.05 0.44 6.00
  (0.01) (0.01) (0.00) (0.01) (0.01) (0.00)
design q Struc_inf -0.00 0.04 1.00 0.50 0.04 0.50
  (0.00) (0.00) (0.00) (0.00) (0.00) (0.00)
design ATE_inf Struc_two -0.00 0.01 NA -0.06 0.01 -0.06
  (0.00) (0.00) NA (0.00) (0.00) (0.00)
design ATE_two Struc_two 0.00 0.02 NA 0.30 0.02 0.30
  (0.00) (0.00) NA (0.00) (0.00) (0.00)
design d Struc_two -0.00 0.01 1.00 0.80 0.01 0.80
  (0.00) (0.00) (0.00) (0.00) (0.00) (0.00)
design k Struc_two 0.05 0.46 1.00 6.05 0.45 6.00
  (0.01) (0.01) (0.00) (0.01) (0.01) (0.00)
design q Struc_two -0.00 0.04 1.00 0.50 0.04 0.50
  (0.00) (0.00) (0.00) (0.00) (0.00) (0.00)

We see here that we do a good job of recovering parameter values, and we also recover treatment effects. When using two-period data, the estimate for ATE_two is just as good whether obtained from the structural or the design-based approach. With data from the \(n=\infty\) games, however, the estimate from the structural model is less precise, though unbiased; by contrast, we have no design-based estimate at all for this inquiry. Finally, \(\delta\), we see, is better estimated using data from the two-period games; the estimate for \(\kappa\) is biased, though the bias is small.

The basic structure used here can be used for a wide range of structural models: write down your theory and specify the implied likelihood function—which reports the probability of observing different types of data given model parameters. This might require adding noise to your model so that all data that is seen in practice can be seen in theory. Then identify the parameters that have the greatest likelihood of producing the data that you see. The same fundamental approach can be used for estimation via Bayesian methods or the method of moments. As suggested here, the payoffs from this approach can be great. The risks are large too, however. Insofar as inferences are model dependent, model misspecification can lead to faulty conclusions. When doing structural estimation, apply Principle 3.9 liberally.
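To make the recipe concrete, here is a minimal sketch in base R of the same likelihood-based workflow, using a deliberately simple "theory" (a linear outcome model with normal noise) rather than the game-theoretic model above. All names (`b`, `sigma`, `log_lik`) are illustrative, not from the chapter.

```r
set.seed(42)

# Simulated "observed" data from a simple theory: Y = b * X + noise
n <- 500
b_true <- 0.8
sigma_true <- 1.5
X <- rnorm(n)
Y <- b_true * X + rnorm(n, sd = sigma_true)

# The likelihood implied by the theory: the probability of the observed
# data given candidate parameter values (b, sigma)
log_lik <- function(par) {
  b <- par[1]
  sigma <- exp(par[2])  # exp() keeps sigma positive during optimization
  sum(dnorm(Y, mean = b * X, sd = sigma, log = TRUE))
}

# Find the parameters with the greatest likelihood of producing the data
fit <- optim(c(0, 0), function(par) -log_lik(par), method = "BFGS")
b_hat <- fit$par[1]
sigma_hat <- exp(fit$par[2])
```

The structural designs in this section follow the same pattern, only with a likelihood derived from the behavioral model rather than from a linear regression.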

18.4 Meta-analysis

We declare a design for a meta-analytic study in which a researcher seeks to combine findings from a collection of existing studies into an overall conclusion. Declaration helps clarify the estimand in such studies. Diagnosis highlights risks associated with a common estimator used in meta-analysis, the fixed effect estimator.

In a meta-analysis, inquiries are summaries of the findings of past research studies. In most designs in this book, observations in the data we analyze represent people, places, businesses, or governments. In a meta-analysis, an observation represents one estimate from a past empirical study.

Meta-analyses can tell us about the empirical consensus on a particular inquiry, often represented as the average value of an inquiry across studies. The mean value of the inquiry might not be especially interesting, however, if the true effect size varies across studies. True effects might vary for two reasons: true effect heterogeneity across studies of different units or contexts with the same research design, and differences in treatments, outcomes, and bias induced by variations in research design. In fact, at least small differences in research design almost always creep in, making the case of effect homogeneity uncommon. As a result, a second common inquiry in meta-analysis is how much true effects vary across studies, often represented by the variance in effect sizes.

Not every estimate from past literature is created equal. Some come from large studies with credible research designs; others come from small studies whose designs introduce bias. We can use two features of past studies to guide how to treat their estimates. The research design — its MIDA — can tell us how much bias the design may induce; we may want to exclude high-bias studies entirely. The estimated standard error reported in the past study also provides a clue to how informative the study will be. An estimate that comes from an unbiased design with a very low standard error — with a lot of precision — is more informative than one with a high standard error. Most meta-analysis techniques incorporate the estimated standard error directly by weighting each estimate in proportion to its precision (the inverse of its squared standard error).
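A minimal sketch of this precision-weighting logic in base R, with made-up estimates and standard errors (not drawn from any real studies):

```r
# Three hypothetical study estimates with their standard errors
estimates <- c(0.10, 0.30, 0.25)
std_errors <- c(0.05, 0.20, 0.10)

# Precision weights: the inverse of the squared standard error,
# so the precise first study dominates the noisy second one
weights <- 1 / std_errors^2

# Precision-weighted pooled estimate and its standard error
pooled <- sum(weights * estimates) / sum(weights)
pooled_se <- sqrt(1 / sum(weights))
```

Note that the pooled estimate lands much closer to the precisely estimated 0.10 than a simple average of the three numbers would.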

We can also learn what we don’t know from a meta-analysis. Analysis may reveal that the confidence interval on our estimate of the average value of the inquiry is very wide. Or we may find that few or no studies can deliver unbiased results. The model describes where our sample of studies comes from and the statistics we collect from them. Data collection for a meta-analysis involves searching for relevant studies, filtering for eligibility on topical and research design grounds, extracting estimates and uncertainty measures, and in some cases postprocessing to standardize estimates across studies. We declare a design with 100 studies with a mean true effect size (\(\mu\)) of 0.2 and estimated standard errors between 0.025 and 0.6.

We consider two common true models of the studies, which are commonly known (confusingly!) as random effects and fixed effect. In the fixed effect model—perhaps better called the “common effect” model so as to avoid confusion with fixed effects regression models—we start with the true effect \(\mu\), and estimated effects are drawn from a normal distribution centered on \(\mu\) with standard deviation set to the estimated standard error extracted from the study. For each of our 100 studies, the true effect in this setup is assumed to be \(\mu\), and we obtain noisy estimates of that true effect from studies with higher and lower precision depending on features of those individual studies such as sample size. We summarize the precision of past studies using the estimated standard error. In the random effects model, we relax the assumption that there is a single true effect for all studies: instead, effects may vary due to contextual or design differences across studies. It may be that there are heterogeneous effects across studies due to different political institutions or economic conditions, or that the studies used slightly different measurement tools and so target slightly different inquiries. Variation in true effects across studies is represented in the model by drawing effect estimates from that same normal distribution, but centered on a true study effect \(\theta\) which is itself a draw from a normal distribution centered on \(\mu\). Thus, the average of true effects across studies is still \(\mu\), but the true effects vary across contexts.

The two most common inquiries for a meta-analysis are the mean true effect across studies, often labeled \(\mu\), and the variance of true effects, \(\tau^2\). When true effects do not vary across studies, \(\tau^2 = 0\) and all true study effects are equal to \(\mu\). However, this is not always the case. Thus, whether the mean is a good single summary of the past evidence depends on the value of \(\tau^2\), suggesting we should always estimate it and assess both statistics. A common goal in meta-analysis is to assess what the effect would be for a new study being contemplated, or for a context in which a policy studied in past literature is being implemented. In both cases, we want to predict from past evidence what the effect for this current setting will be. To do so, we need to know how much true effects vary and what their average is. We also may want to predict more formally using characteristics that predict effect heterogeneity. A simple example of this prediction would be to estimate \(\mu\) and \(\tau^2\), observe that the mean is around \(0.2\) but that there is substantial variance of \(0.2\), and so look for a past study with characteristics similar to the new study in terms of political institutions, economic conditions, and measurement tools.
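As a rough sketch of how \(\mu\) and \(\tau^2\) combine when predicting the effect in a new context, an approximate 95% prediction interval adds the between-study variance to the uncertainty about the mean. The numbers below are illustrative, echoing the 0.2 mean and 0.2 variance just mentioned; they are assumptions, not output of the declared design.

```r
# Illustrative estimates (assumed, echoing the discussion above)
mu_hat <- 0.2       # estimated mean of true effects
tau_sq_hat <- 0.2   # estimated variance of true effects
se_mu <- 0.05       # standard error of mu_hat

# Approximate 95% prediction interval for the true effect in a new context:
# between-study variance and uncertainty about the mean both contribute
half_width <- 1.96 * sqrt(tau_sq_hat + se_mu^2)
pred_interval <- c(mu_hat - half_width, mu_hat + half_width)
```

With heterogeneity this large, the interval spans zero even though the mean effect is positive, which is exactly why the mean alone can be a poor summary of past evidence.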

We estimate the average true effect \(\mu\) and the variance in true effects \(\tau^2\) using a frequentist random effects statistical model. Importantly, the random effects statistical model does not impose variance in true effects but rather estimates \(\tau^2\) and allows the effects to vary if it is estimated to be above zero. When the variance is estimated to be zero, then the random effects statistical model reduces to the fixed effect statistical model in which study estimates are drawn from a common distribution across sites.
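The declaration below fits the random effects model via REML. As a self-contained illustration of the same logic, here is the simpler DerSimonian-Laird moment estimator in base R (a stand-in, not the estimator used in the declaration; the estimates and standard errors are hypothetical):

```r
# Hypothetical study-level estimates and standard errors
estimates <- c(0.15, 0.32, 0.18, 0.25, 0.10)
std_errors <- c(0.08, 0.12, 0.05, 0.10, 0.07)
k <- length(estimates)

# Fixed effect (inverse-variance) pooled mean
w <- 1 / std_errors^2
mu_fe <- sum(w * estimates) / sum(w)

# Cochran's Q and the moment estimator of tau^2, truncated at zero
Q <- sum(w * (estimates - mu_fe)^2)
tau_sq <- max(0, (Q - (k - 1)) / (sum(w) - sum(w^2) / sum(w)))

# Random effects weights add tau^2 to each study's sampling variance
w_re <- 1 / (std_errors^2 + tau_sq)
mu_re <- sum(w_re * estimates) / sum(w_re)
```

With these particular numbers, Q falls below its expectation, so tau_sq truncates to zero and the random effects mean coincides with the fixed effect mean: precisely the sense in which the random effects estimator reduces to the fixed effect estimator when no heterogeneity is detected.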

Declaration 18.3 \(~\)

model <- "random-effects"  # design parameter, varied below via redesign()

design <-
  declare_model(
    N = 100,
    site = 1:N,
    mu = 0.2,
    tau = case_when(model == "random-effects" ~ 1,
                    model == "fixed-effects" ~ 0),
    std.error = pmax(0.1, abs(rnorm(N, mean = 0.8, sd = 0.5))),
    eta = rnorm(N),
    theta = mu + tau * eta, # note when tau = 0, theta = mu 
    estimate = rnorm(N, mean = theta, sd = std.error)
  ) + 
  declare_inquiry(mu = first(mu), tau_sq = first(tau^2)) + 
  declare_estimator(
    yi = estimate, sei = std.error, method = "REML",
    model = rma_estimator, model_summary = rma_mu_tau,
    term = c("mu", "tau_sq"), inquiry = c("mu", "tau_sq"),
    label = "random-effects")


designs <- redesign(design, model = c("random-effects", "fixed-effects"))

simulations <- simulate_design(designs, sims = sims)

To illustrate the properties of the fixed effect estimator in some settings, we add it as a second estimator:

design <- design +
  declare_estimator(
    yi = estimate, sei = std.error, method = "FE",
    model = rma_estimator, model_summary = rma_mu_tau,
    term = c("mu", "tau_sq"), inquiry = c("mu", "tau_sq"),
    label = "fixed-effects")

We explore the bias, efficiency (root mean-squared error), and coverage of the two estimators under each model, both for the mean effects inquiry \(\mu\) and for the variance of effects inquiry \(\tau^2\). The random effects estimator performs best across the possible models of how effects are realized. Whether the true model is fixed effect or random effects, it estimates both parameters without bias. Coverage is approximately nominal, but the standard errors are slightly biased for \(\tau^2\) when the fixed effect model is the right one (i.e., when \(\tau^2 = 0\)). However, the bias is in a conservative direction (confidence intervals are wider than they should be), which may be preferable. The fixed effect estimator has two problems. When the random effects model is correct, the estimator gets the variance in true effects wrong, because it assumes that variance is zero; when it is not zero, the estimator is biased. This leads to a second problem: the estimator mistakes variability in true study effects for variability in estimation of the study effects, so the standard errors it produces are highly biased (coverage is about \(0.2\)). Therefore, if we are even a little unsure of the true distribution, we should not trust the uncertainty estimates from the fixed effect model. And, of course, we know that there may be variation in true effects that we assumed away by adopting this estimator.

Table 18.4: Bias, RMSE, and coverage from the random effects and fixed effect estimators under the models assumed by the estimators.
| Estimator | Inquiry | Model | Bias | RMSE | Coverage |
|-----------|---------|-------|------|------|----------|
| fixed-effects  | mu     | fixed-effects  | 0.00  | 0.03 | 0.95 |
| fixed-effects  | mu     | random-effects | 0.00  | 0.26 | 0.19 |
| fixed-effects  | tau_sq | fixed-effects  | 0.00  | 0.00 |      |
| fixed-effects  | tau_sq | random-effects | -1.00 | 1.00 |      |
| random-effects | mu     | fixed-effects  | 0.00  | 0.03 | 0.96 |
| random-effects | mu     | random-effects | 0.00  | 0.13 | 0.95 |
| random-effects | tau_sq | fixed-effects  | 0.00  | 0.01 | 0.98 |
| random-effects | tau_sq | random-effects | 0.02  | 0.22 | 0.95 |

18.5 Multi-site studies

We declare a design for a coordinated research project in which multiple research teams combine data gathering and analysis strategies to address a common inquiry. Diagnosis clarifies the benefits and risks of coordinating treatments across contexts versus tailoring them to each context.

Nearly all social science is produced atomistically: individual scientists identify a question and a research strategy for answering it, apply the research strategy, and try to publish the results of that one study. Increasingly, there is a realization that, though this promotes discovery, it may not be the most efficient way to accumulate general insights. One reason is that scientists are opportunistic in the contexts and units they study. If inquiry values in the contexts and units that scientists choose to study differ from those in the population of contexts and units of general interest, then we may not be learning general insights. One response to this problem of generalizability has been to promote replication of past studies in new contexts or with new populations. Another has been to develop methods for extrapolating from single studies, relying on variation within studies in unit characteristics and effect heterogeneity. A third, which we explore here, is multi-site studies, in which a study is by design conducted in multiple contexts in order to produce general insights by averaging effects across contexts and exploring how effects vary by context.

Working in more than one site is expensive. Fixed implementation and data collection costs may be duplicated in each site, and typically outcome measurement and other implementation details must be coordinated across sites. To justify conducting a study in multiple sites, at a minimum it should be clear what inferential gains are to be had relative to a study of the same size in a single site. Concentrating resources in the single site is likely to yield lower costs for the same total sample size.

The scientific value of studying multiple sites that could outweigh cost considerations comes from the design’s ability to improve our understanding of the generalizability of estimates. If we find a treatment has an effect of around 0.1 in five contexts, we have learned more about how effective that treatment generally is than if we had learned only that the treatment effect is 0.1 in the first site. In the single-site case, we might wonder whether the effect was larger, smaller, or nonexistent in the other four sites.

How much we can learn about the generalizability of effects depends on how many sites we study, how we select the sites, our beliefs about effect heterogeneity, and the level of coordination across studies. The coordination level is important because in its absence we might not be learning about the generalizability of our answers to a single inquiry, but rather about one answer to each of many distinct inquiries.

We declare a multi-site study with five sites and 500 subjects per site. We define in our model 100 possible sites we could work in, in order to be able to assess the generalizability of the answers our design yields. The model reflects the fact that there is a true study effect (unknown, of course) for each site between 0 and 0.2 (note the 0.1 bump in the potential outcomes below).

We explicitly define sampling of sites — here complete random sampling — to call attention to the fact that how you sample sites shapes whether answers are generalizable. If the researchers select only feasible contexts to work in, we can learn generalizable insights from the design — but likely only about feasible contexts. We then sample individuals at random within sites. Treatment assignment is blocked at the site level, which is the same as complete random assignment within each site.
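The equivalence between site-blocked assignment and complete random assignment within each site can be seen in a small base R sketch (the declaration below uses `block_ra()` from randomizr; this stand-in makes the mechanics explicit with made-up site labels):

```r
set.seed(3)

# Thirty subjects spread evenly across three hypothetical sites
sites <- rep(c("A", "B", "C"), each = 10)

# Blocked assignment: within each site, permute an exactly half-and-half
# vector of treatment indicators, i.e., complete random assignment per site
Z <- ave(rep(0, length(sites)), sites,
         FUN = function(x) sample(rep(c(0, 1), length(x) / 2)))

treated_per_site <- tapply(Z, sites, sum)  # exactly 5 in every site
```

However the randomness falls, each site ends up with exactly half its subjects treated, which is what blocking on site guarantees.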

Estimation proceeds in two steps: first we calculate site-level effects, just as we do for a meta-analysis, then we meta-analyze those site-level effects. We use here a random effects model (see meta-analysis design library entry for a discussion of alternative modeling strategies), reflecting our expectation that true site effects differ.

Declaration 18.4 \(~\)

estimate_study_effects <- function(formula, data) {
  data %>%
    group_by(sites) %>%
    do(tidy(lm_robust(formula, data = .))) %>%
    filter(term == "Z")
}

n_study_sites <- 5
n_subjects_per_site <- 500

design <- 
  declare_model(
    sites = add_level(
      N = 100, 
      study_effect = seq(from = -0.1, to = 0.1, length.out = N)
    ),
    subjects = add_level(
      N = n_subjects_per_site, 
      U = rnorm(N),
      potential_outcomes(Y ~ Z * (0.1 + study_effect) + U)
    )
  ) +
  declare_inquiry(PATE = mean(Y_Z_1 - Y_Z_0),
                  tau_sq = var(study_effect)) +  # variance of true site effects
  declare_sampling(S = cluster_rs(clusters = sites, n = n_study_sites)) + 
  declare_sampling(S = strata_rs(strata = sites, n = n_subjects_per_site)) + 
  declare_assignment(Z = block_ra(blocks = sites, prob = 0.5)) + 
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) + 
  declare_step(Y ~ Z, handler = estimate_study_effects) + 
  declare_estimator(yi = estimate, sei = std.error, method = "REML", 
                    model = rma_estimator, model_summary = rma_mu_tau, inquiry = "PATE", 
                    label = "random-effects")


18.5.1 Coordinating treatments

Without coordination of research designs across sites in a multi-site study, researchers risk answering different questions in different sites. If the goal is to understand a common inquiry and how values of the inquiry vary across contexts, then each site must be providing answers to (close to) the same inquiry. To do so, researchers coordinate measurement strategies so as to collect the same outcome measures and coordinate the details of treatment procedures to ensure they are studying the effects of the same intervention.22

To fix ideas we might imagine an information campaign in which the common treatment uses radio dissemination of messages, but in some sites a lower tech strategy might be more effective (e.g. community meetings) and in others a higher tech strategy (e.g. Facebook ads) might be.

The disadvantage of coordinating is that researchers end up answering a different question than they started out with. In the worst case, they end up answering one common question across all sites that is interesting in none. We explore these tradeoffs by declaring a modified multi-site design in which different treatments have different effects in different sites. There is a treatment that is expected to perform best on average, which researchers are interested in assessing. However, qualitative insight and local ownership can enable researchers to select questions that are responsive to local conditions. We might imagine, for instance, that contextual research identifies the treatment that is most likely to be effective in a given context. This is the version that would be selected absent coordination (this type of self-selection is often associated with the “Roy model”). However, it could also be that local selection results in the least effective treatment being chosen: for example, if local politicians have a hand in determining which anti-corruption intervention to implement.

We consider different possible designs that vary in terms of the level of coordination and the type of self selection if coordination fails. With no coordination, the selected treatment may be more or less effective than the researcher selected treatment; with full coordination there is no self selection and so suboptimal treatments might be imposed in different settings.

In this setting we can define multiple distinct estimands, such as the effect of the treatment chosen by the researcher, the effect of the best or worst treatment for each site, the average effect across all treatments or across sites, or the effect of the treatment that would be chosen via self-selection in the absence of any coordination.

Our data strategy takes account of how self-selection plays out if coordination is imperfect (either accidentally or deliberately). In the design below a single variable \(Z\) indicates treatment (randomly assigned) but the outcome observed, \(Y\), depends on which version of treatment is employed.

Declaration 18.5 \(~\)

coordination <- 1
prob_select_pos <- .2
n_subjects_per_site <- 2500

design <- 
  declare_model(
    sites = add_level(
      N = 5, 
      tau_1 = rnorm(N, .1),
      tau_2 = rnorm(N),
      tau_3 = rnorm(N),
      compliant = runif(N) < coordination,
      selects_pos = simple_rs(N, prob_select_pos)),
    subjects = add_level(
      N = n_subjects_per_site, 
      tau_mean =  (tau_1 + tau_2 + tau_3)/3,
      tau_max  =  pmax(tau_1, tau_2, tau_3),
      tau_min  =  pmin(tau_1, tau_2, tau_3),
      tau_selected = tau_max*selects_pos + tau_min*(1-selects_pos),
      U = rnorm(N))) +
  declare_inquiry(
    PATE_1 = mean(tau_1),
    PATE_average = mean(tau_mean),
    PATE_best = mean(tau_max),
    PATE_worst = mean(tau_min),
    PATE_selected = mean(tau_selected)
    ) + 
  declare_assignment(Z = block_ra(blocks  = sites)) +
  declare_measurement(
    Y   = Z*(compliant*tau_1 + (!compliant)*tau_selected) + U
    ) +
  declare_step(Y ~ Z, handler = estimate_study_effects) +
  declare_estimator(
    yi = estimate,
    sei = std.error,
    method = "REML",
    model = rma_estimator,
    model_summary = rma_mu_tau,
    inquiry = c("PATE_1", "PATE_average", "PATE_best",
                "PATE_worst", "PATE_selected"),
    label = "random-effects")

designs <- redesign(design, coordination = (0:4)/4, prob_select_pos = (0:3)/3)
diagnoses <- diagnose_design(designs, sims = sims, bootstrap_sims = 100)

Figure 18.3 shows the distribution of estimands as a function of the level of coordination. The estimands for the (unknown) “best” treatment and for the researcher-controlled treatment do not vary with coordination. However, the value of the self-selected treatment does, in the obvious way: if units self-select into treatments with low or negative effects, then the researcher-controlled treatment trumps the self-selected treatment. The opposite is true if units self-select into treatments with more positive effects; in those cases, weaker control can put the focus on an estimand with stronger effects.


Figure 18.3: Estimands.

Figure 18.4 describes the bias associated with different strategies given imperfect coordination.


Figure 18.4: Bias for the different estimands as a function of coordination and the probability of self-selection into treatments with more positive effects.

The take-home message is that if you can ensure full coordination, you can get unbiased estimates for the treatment you select. If there is no coordination at all, you can still get unbiased estimates, but for the self-selected treatment. For intermediate levels of coordination, however, you do not get unbiased estimates for any of these estimands: you can claim neither to have estimated effects for a particular treatment, nor for whatever treatment a site would self-select into.