5 Declaring designs

In Chapter 2, we gave a high-level overview of our framework for describing research designs in terms of their models, inquiries, data strategies, and answer strategies, our process for diagnosing their properties, and a general-purpose approach for improving them to better fit research goals. In this chapter, we place our approach on a firmer formal footing. We employ elements from Pearl’s (2009) approach to causal modeling, which provides a syntax for mapping design inputs to design outputs. We also use the potential outcomes framework as presented, for example, in Imbens and Rubin (2015), which many social scientists use to clarify their inferential targets.

Describing a research design in the MIDA framework allows us to see the fundamental symmetries across the theoretical (M and I) and empirical (D and A) halves of a research design. A recurring theme of our book is that research designs tend to be stronger when the relationship of M to I is mirrored by the relationship of D to A; the aim of this chapter is to make this somewhat abstract claim more concrete.

5.1 Definition of research designs

Research designs are defined by four elements: a model M, an inquiry I, a data strategy D, and an answer strategy A. Describing a research design entails “declaring” each of these four elements.

M is a set of possible models of how the world works. Following Pearl’s definition of a probabilistic causal model, a model in M contains three core elements. The first is the “signature,” or the specification of the variables $$X$$ about which research is being conducted, including the endogenous and exogenous variables ($$V$$ and $$U$$ respectively) and their ranges. The second element ($$F$$) is a specification of how each endogenous variable depends on other variables. These dependencies can be considered functional relations or, as in Imbens and Rubin (2015), potential outcomes because they describe what would happen under different possible conditions. The third and final element is a probability distribution over exogenous variables, written as $$P(U)$$. Sometimes it is useful to think of the draws from $$U$$ as implying distinct models of their own, in which case we might think of M as a family of models that fully specifies what would happen under all conditions and a particular model $$m$$ as an element of M that describes one of those conditions. We eschew the phrase “data generating process” when referring to M (since data are generated by the data strategy) and instead use the phrase “event generating process.”

The inquiry I is a summary of the variables $$X$$. All inquiries are either descriptive or causal. Descriptive inquiries are those that do not involve comparisons across counterfactual worlds, for example, the average value of an outcome Y over all N units in the population: $$\frac{\sum_{i=1}^N Y_i}{N}$$. Causal inquiries do involve comparisons across counterfactuals, as in the average treatment effect: $$\frac{\sum_{i=1}^N \left(Y_i(Z_i = 1) - Y_i(Z_i = 0)\right)}{N}$$.
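To make the distinction concrete, the following sketch computes one inquiry of each kind for a tiny population. The five units and their potential outcomes are invented for illustration, and the column names `Y_Z_0` and `Y_Z_1` are a labeling assumption made here, not something the framework requires.

```r
# Hypothetical five-unit population; all values are made up for illustration.
m <- data.frame(
  Y_Z_0 = c(0, 1, 0, 1, 0),  # outcome each unit would show if untreated
  Y_Z_1 = c(1, 1, 0, 1, 1),  # outcome each unit would show if treated
  Z     = c(1, 0, 0, 1, 0)   # treatment each unit actually receives
)

# Revealed outcome: Y_i(Z_i = 1) for treated units, Y_i(Z_i = 0) otherwise.
m$Y <- ifelse(m$Z == 1, m$Y_Z_1, m$Y_Z_0)

# Descriptive inquiry: the average outcome over all N units.
mean_Y <- mean(m$Y)             # 0.6

# Causal inquiry: the average treatment effect, which compares two
# counterfactual states for every unit.
ate <- mean(m$Y_Z_1 - m$Y_Z_0)  # 0.4
```

The descriptive inquiry uses only the revealed column `Y`, whereas the causal inquiry needs both potential outcome columns, which are never jointly observed for the same unit outside of a model.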

We let $$a_m$$ denote the answer to I under the model. Conditional on the model, $$a_m$$ is the value of the estimand, the quantity that the researcher wants to learn about, or would want to learn about if the world were like the model. The connection of $$a_m$$ to the model is given by: $$a_m = I(m)$$.

As the saying goes, models are wrong but some may be useful. We denote the true causal process as $$m^*$$: the process that generates events in the real world. The right answer, then, is $$a_{m^*} = I(m^*)$$. The answer under a reference model $$a_m$$ may be close to or far from the true value $$a_{m^*}$$, which is to say it could be wrong. If the model $$m$$ is far from $$m^*$$, then of course $$a_m$$ need not be correct. Moreover, $$a_{m^*}$$ might even be undefined, since inquiries can only be stated in terms of theoretical models. If the theoretical model is wrong enough (for instance, if it conditions on events that could not arise), then the inquiry might be nonsensical. For example, “what is the ideological slant of a speech that is not given” is an inquiry that is undefined.

A data strategy D generates data $$d$$. Data $$d$$ arises under model M with probability $$P_M(d|D)$$. The data strategy includes sampling, assignment, and measurement strategies. Nearly all data strategies sample and measure, but not all assign treatments. Whether or not the data strategy includes assignment is the defining distinction between experimental and observational studies. When applied in the real world, the data strategy operates on $$m^*$$ to produce the realized data: $$D(m^*) = d^*$$. When we simulate research designs, the data strategy operates on a simulated model draw $$m$$ to produce fabricated data: $$D(m) = d$$.

Finally, the answer strategy A generates answers using data. When applied to realized data, the answer strategy returns the empirical answer: $$A(d^*) = a_{d^*}$$. When applied to simulated data, it returns a simulated answer: $$A(d) = a_{d}$$.
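The four elements can be read as four functions that compose: $$a_m = I(m)$$ on the theoretical side and $$a_d = A(D(m))$$ on the empirical side. The sketch below mimics that structure in plain R, without DeclareDesign. The function names (`draw_model`, `inquiry`, `data_strategy`, `answer_strategy`) and the toy model are hypothetical, chosen only to mirror M, I, D, and A. Because $$m^*$$ is never available to us, only the simulated quantities appear.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# M: draws one model m, here a population of N units (toy model, assumed).
draw_model <- function(N = 1000) {
  U <- rnorm(N)                          # exogenous variable
  data.frame(U = U, Y = as.numeric(U > 0))  # Y is a function of U
}

# I: a summary of the model draw, here the population mean of Y.
inquiry <- function(m) mean(m$Y)

# D: sampling only (no assignment), a simple random sample of n units.
data_strategy <- function(m, n = 100) {
  m[sample(nrow(m), n), , drop = FALSE]
}

# A: a summary of the data, here the sample mean of Y.
answer_strategy <- function(d) mean(d$Y)

m   <- draw_model()        # a simulated model draw
a_m <- inquiry(m)          # the answer under the model (the estimand)
d   <- data_strategy(m)    # simulated data
a_d <- answer_strategy(d)  # a simulated answer (an estimate)
```

Comparing `a_d` with `a_m` across many repetitions of these four lines is the basic move behind simulating a design.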

Table 5.1 provides a concise description of each element of a research design and relates them to some common terms. We flag here that the term estimand has a slightly different meaning in our framework than elsewhere. We say that an estimand $$a_m$$ is the value of an inquiry $$I$$, whereas in some traditions “estimand” can refer to the inquiry $$I$$ or to an intermediate parameter that happens to be targeted by an estimator.

Table 5.1: Elements of research design.

| Notation | Description | Related terms |
|---|---|---|
| M | a stipulated collection of causal models | |
| $$m$$ | a single model in $$M$$, represented by events | a hypothetical data generating process |
| $$m^*$$ | the true model | true data generating process |
| I | the inquiry | estimand, quantity of interest |
| $$a_m = I(m)$$ | the answer under the model, an estimand | |
| $$a_{m^*} = I(m^*)$$ | the true answer, the estimand | quantity of interest |
| D | the data strategy | |
| $$d = D(m)$$ | fabricated data; simulated data | |
| $$d^* = D(m^*)$$ | realized data | |
| A | the answer strategy | data analysis, estimator, method |
| $$a_{d} = A(d)$$ | a simulated answer, an estimate | a hypothetical estimate |
| $$a_{d^*} = A(d^*)$$ | the empirical answer, the estimate | the observed estimate |

The full set of causal relationships between M, I, D, and A, with respect to $$m$$ and $$m^*$$, $$a_m$$ and $$a_{m^*}$$, $$d$$ and $$d^*$$, and $$a_d$$ and $$a_{d^*}$$ can be seen in the schematic representation of a research design given in Figure 5.1. The figure illustrates how a research design involves a correspondence between $$I(m) = a_m$$ and $$A(d) = a_d$$. The theoretical half of a research design produces an answer to the inquiry in theory. The empirical half of a research design produces an empirical estimate of the answer to the inquiry. Neither answer is necessarily close to the truth $$a_{m^*}$$, of course. And, as shown in the figure, the truth is not directly accessible to us either in theory or in empirics. Our gamble in empirical research, however, is that our theoretical models are close enough to the truth, that is, that the truth resembles the set of models we imagine. If the models in $$M$$ do not contain $$m^*$$ or are too different from it, then even seemingly strong research designs could yield incorrect answers.

Figure 5.1 reveals a striking analogy between the M, I relationship and the D, A relationship. The answer we aim for (the estimand) is obtained by applying I to a draw from M. The answer we have access to (the estimate) is obtained by applying A to a draw from D. Our hope, usually, is that these two answers are quite similar. In some cases, this parallelism suggests that the function A should be “like” the function I. For instance, if we are interested in the mean of a population and we have access to a random sample, the data available to us from D is like the ideal data we would have if we could observe the nodes and edges in M directly. In that case, the natural choice for A is the sample mean, the same averaging operation that I applies to the population.
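The population mean example can be written out explicitly. The following is a sketch under the assumption that D draws a simple random sample of a fixed number $$n$$ of the $$N$$ units; the sampling indicator $$S_i$$ (equal to 1 if unit $$i$$ is sampled and 0 otherwise) is notation introduced here for illustration.

$$
I(m) = \frac{1}{N}\sum_{i=1}^{N} Y_i,
\qquad
A(d) = \frac{1}{n}\sum_{i=1}^{N} S_i Y_i,
\qquad
n = \sum_{i=1}^{N} S_i
$$

Here A applies to the sampled units the same operation that I applies to all units. Under this sampling assumption each unit is sampled with probability $$n/N$$, so the expectation of $$A(d)$$ over repeated samples equals $$I(m)$$.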

Finally, in Figure 5.1 no arrows go into M, I, D, or A, since they are not caused by any of the other nodes. We could have included a node for the research designer, who deliberately sets the details of M, I, D, and A, but we omit it for clarity.

5.2 Declaration in code

Table 5.2 illustrates these different quantities through DeclareDesign. We stipulate a model, $$M$$, in which $$Y$$ depends on $$X$$. We define an inquiry, $$I$$: what is the average value of $$Y$$ when $$X=1$$? We calculate what the value of our inquiry (the estimand) would be under one of our simulated models, $$a_m = I(m)$$. We also imagine we could describe how the world in fact is, $$m^*$$, and calculate what the right answer would be in that case, $$a_{m^*}$$. We then apply the data strategy $$D$$ to produce realized data $$d^*$$ and use an answer strategy $$A$$ to calculate the answer $$a_{d^*}$$ we would get given $$d^*$$.

For each of these steps we show DeclareDesign code in the first column. In the second column, we show the simulated $$m$$, $$m^*$$, and $$d^*$$ datasets, along with the values of $$a_m$$, $$a_{m^*}$$, and $$a_{d^*}$$ for one run of the simulation.

Description Draw
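library(DeclareDesign)

# M: stipulated model in which Y depends on U and X
M <- declare_model(N = 1000,
                   U = rnorm(N),
                   X = rbinom(N, 1, prob = pnorm(U)),
                   Y = rbinom(N, 1, prob = pnorm(U + X)))
# m: one draw from M
m <- M()
# I: average value of Y among units with X == 1
I <- declare_inquiry(Ybar = mean(Y[X == 1]))
# a_m: the answer under the model (the estimand)
a_m <- I(m)
# m*: an imagined true world, in which Y depends on U but not on X
mstar <- fabricate(N = 1000,
                   U = rnorm(N),
                   X = rbinom(N, 1, prob = pnorm(U)),
                   Y = rbinom(N, 1, prob = pnorm(U)))
# a_m*: the true answer
a_mstar <- I(mstar)
# D: sample each unit independently with probability 0.1
D <- declare_sampling(
  S = simple_rs(N, prob = 0.1))
# d*: realized data
dstar <- D(mstar)
# A: intercept-only regression among sampled units with X == 1
A <- declare_estimator(
  Y ~ 1, .method = lm_robust,
  subset = X == 1, inquiry = "Ybar")
# a_d*: the empirical answer (the estimate)
a_dstar <- A(dstar)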

As described in the getting started guide in Chapter 4, we concatenate the design steps into a full design declaration using the + operator:

Declaration 5.1 Example declaration

declaration_5.1 <-
  declare_model(
    N = 1000,
    U = rnorm(N),
    X = rbinom(N, 1, prob = pnorm(U)),
    Y = rbinom(N, 1, prob = pnorm(U + X))
  ) +
  declare_inquiry(Ybar = mean(Y[X == 1])) +
  declare_sampling(S = simple_rs(N, prob = 0.1)) +
  declare_estimator(Y ~ 1,
                    .method = lm_robust,
                    subset = X == 1,
                    inquiry = "Ybar")

This design declaration includes a specification of all four design elements: model, inquiry, data strategy, and answer strategy. The next four chapters will describe each of these four design elements in great detail. For now, notice that the declaration does not include a specification of $$m^*$$ (the true causal model), only $$M$$, a model we entertain for research planning purposes.
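Once declared, a design can be run from end to end. The following is a minimal sketch, assuming the DeclareDesign package is installed in a recent version (one that accepts the `.method` and `inquiry` arguments used above). The seed is arbitrary, and the declaration is repeated only so that the block runs on its own.

```r
library(DeclareDesign)

set.seed(42)  # arbitrary seed, chosen only for reproducibility

# Declaration 5.1, repeated so that this block is self-contained.
declaration_5.1 <-
  declare_model(
    N = 1000,
    U = rnorm(N),
    X = rbinom(N, 1, prob = pnorm(U)),
    Y = rbinom(N, 1, prob = pnorm(U + X))
  ) +
  declare_inquiry(Ybar = mean(Y[X == 1])) +
  declare_sampling(S = simple_rs(N, prob = 0.1)) +
  declare_estimator(Y ~ 1,
                    .method = lm_robust,
                    subset = X == 1,
                    inquiry = "Ybar")

# Simulated data d = D(m): one model draw passed through the data strategy.
d <- draw_data(declaration_5.1)

# One full pass through the design: the output pairs an estimand a_m with
# an estimate a_d computed from the same model draw.
one_run <- run_design(declaration_5.1)
one_run
```

Every quantity produced here is simulated under $$M$$. Nothing in the run touches $$m^*$$, which is why the output speaks to how the design would behave if the world were like the model rather than to the true answer.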