5 Declaration
In Chapter 2, we gave a high-level overview of our framework for describing research designs in terms of their models, inquiries, data strategies, and answer strategies, our process for diagnosing their properties, and a general-purpose approach for improving them to better fit research tasks. In this chapter, we place our approach on a firmer formal footing. To do so, we employ elements from Pearl’s (2009) approach to causal modeling, which provides a syntax for mapping design inputs to design outputs. We also use the potential outcomes framework as presented, for example, in Guido W. Imbens and Rubin (2015), which many social scientists use to clarify their inferential targets.
Describing a research design as a directed acyclic graph (DAG) helps us to see the fundamental symmetries across the theoretical (M and I) and empirical (D and A) halves of a research design. A recurring theme of our book is that research designs tend to be stronger when the relationship of M to I is mirrored by the relationship of D to A; the aim of this chapter is to make this somewhat abstract claim more concrete.
5.1 Definition of research designs
Research designs are defined by four elements: a model M, an inquiry I, a data strategy D, and an answer strategy A. Describing a research design entails “declaring” each of these four elements.
M is a set of possible models of how the world works. Following Pearl’s definition of a probabilistic causal model, a model in M contains three core elements. The first is a specification of the variables \(X\) about which research is being conducted. This includes endogenous and exogenous variables (\(V\) and \(U\) respectively) and the ranges of these variables. In the formal literature, this is sometimes called the signature of a model (Halpern 2000). The second element (\(F\)) is a specification of how each endogenous variable depends on other variables. These can be considered functional relations or, as in Guido W. Imbens and Rubin (2015), potential outcomes because they describe what would happen under different possible conditions. The third and final element is a probability distribution over exogenous variables, written as \(P(U)\). Sometimes it is useful to think of the draws from \(U\) as implying distinct models of their own, in which case we might think of M as a family of models and a particular model \(m\) as an element of M that fully specifies what would happen under all conditions. We avoid the phrase “data generating process” to refer to \(m\) (since data are generated by the data strategy) and instead use the phrase “event generating process.”
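The three elements of a model can be sketched in a few lines of code. The following is a minimal illustration in Python, not DeclareDesign code; the variable names (`u_x`, `u_y`), the functional forms, and the effect size of 0.5 are all assumptions made up for this example.

```python
import random

# Signature (hypothetical): exogenous U = {u_x, u_y}; endogenous V = {X, Y}.

def draw_u(rng):
    """P(U): draw the exogenous variables for one unit."""
    return {"u_x": rng.random(), "u_y": rng.gauss(0, 1)}

# F: one structural function per endogenous variable.
def f_x(u):
    return 1 if u["u_x"] < 0.5 else 0

def f_y(x, u):
    # Value Y would take if X were x: a potential outcome.
    return 0.5 * x + u["u_y"]

def draw_model(n, seed):
    """Return one model m: what would happen to n units under all conditions."""
    rng = random.Random(seed)
    m = []
    for _ in range(n):
        u = draw_u(rng)
        x = f_x(u)
        m.append({
            "X": x,
            "Y_X0": f_y(0, u),  # potential outcome under X = 0
            "Y_X1": f_y(1, u),  # potential outcome under X = 1
            "Y": f_y(x, u),     # outcome under the X that actually arises
        })
    return m

m = draw_model(n=5, seed=1)
for unit in m:
    print(unit)
```

Each call to `draw_model` with a different seed yields a different \(m\) from the same family M, which is one way to read the idea that draws from \(U\) imply distinct models.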
The inquiry I is a summary of the variables \(X\), perhaps given interventions on some variables. An inquiry might be the average value of an outcome \(Y\): \(\mathbb{E}[Y] = \sum_y \left(y \times \Pr(Y=y)\right)\), or the average value of the outcome conditional on the value of a treatment \(Z\): \(\mathbb{E}[Y \mid Z=1] = \sum_y \left(y \times \Pr(Y=y \mid Z=1)\right)\). Using Pearl’s notation we can distinguish between descriptive inquiries and causal inquiries. Causal inquiries are those that summarize distributions that would arise under interventions, as indicated by the \(\mathrm{do}()\) operator, e.g., \(\Pr(Y \mid \mathrm{do}(Z = 1))\). Descriptive inquiries summarize distributions that arise without intervention, such as \(\Pr(Y \mid Z = 1)\). This is the difference between the average outcome if you “set” \(Z\) to 1 and the average outcome when \(Z\) so happens to be 1. To use an example of the form found in Pearl (2009), it is the difference between the probability that it is raining when you make people put up umbrellas (low) and the probability that it is raining when people have umbrellas up (high).
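The umbrella example can be worked through exactly. The sketch below, in Python, assumes a toy model in which it rains with probability 1/5 and people put up umbrellas with probability 9/10 when it rains and 1/20 when it does not; these numbers are invented for illustration. The `do_z` argument is our own hypothetical stand-in for the \(\mathrm{do}()\) operator: it replaces the umbrella equation with a constant.

```python
from fractions import Fraction

# Assumed toy probabilities (not from the text).
P_RAIN = Fraction(1, 5)
P_UMBRELLA_GIVEN_RAIN = {1: Fraction(9, 10), 0: Fraction(1, 20)}

def joint(do_z=None):
    """Return {(rain, umbrella): probability}.

    If do_z is given, the umbrella variable is set to do_z by
    intervention, regardless of whether it rains.
    """
    dist = {}
    for rain in (0, 1):
        p_rain = P_RAIN if rain == 1 else 1 - P_RAIN
        for z in (0, 1):
            if do_z is None:
                p_z1 = P_UMBRELLA_GIVEN_RAIN[rain]
                p_z = p_z1 if z == 1 else 1 - p_z1
            else:
                p_z = Fraction(1) if z == do_z else Fraction(0)
            dist[(rain, z)] = p_rain * p_z
    return dist

def pr_rain_given_umbrella(dist):
    """Pr(rain = 1 | umbrella = 1) under the supplied joint distribution."""
    p_z1 = dist[(0, 1)] + dist[(1, 1)]
    return dist[(1, 1)] / p_z1

descriptive = pr_rain_given_umbrella(joint())        # Pr(rain | Z = 1)
causal = pr_rain_given_umbrella(joint(do_z=1))       # Pr(rain | do(Z = 1))
print(descriptive, causal)  # 9/11 1/5
```

Under these assumptions, seeing umbrellas raises the probability of rain from 1/5 to 9/11, while forcing umbrellas up leaves it at 1/5.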
We let \(a_m\) denote the answer to I under the model. Conditional on the model, \(a_m\) is the value of the estimand, the quantity that the researcher wants to learn about, or would want to learn about if the world were like the model. The connection of \(a_m\) to the model is given by: \(a_m = I(m)\).
As the saying goes, models are wrong but some may perhaps be useful. We denote the true causal process as \(m^*\): the process that generates events in the real world. The right answer, then, is \(a_{m^*} = I(m^*)\). The answer under a reference model \(a_m\) may be close to or far from the true value \(a_{m^*}\), which is to say it could be wrong. If the model \(m\) is far from \(m^*\), then of course \(a_m\) need not be correct. Moreover, \(a_{m^*}\) might even be undefined, since inquiries can only be stated in terms of theoretical models. If the theoretical model is wrong enough (for instance, if it conditions on events that do not in fact arise), then the inquiry might be nonsensical when applied to the real world. For example, “what is the ideological slant of a speech that is not given” is an inquiry that is undefined.
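To make the distinction between \(a_m\) and \(a_{m^*}\) concrete, here is a small Python sketch built on the speech example. All values are invented, and the two `m_star_*` objects are hypothetical stand-ins: in practice \(m^*\) is never available to us. The inquiry returns `None` when it is undefined.

```python
import statistics

def inquiry(model):
    """I: average slant of the speeches that are given; None when undefined."""
    slants = [unit["slant"] for unit in model if unit["speech_given"]]
    return statistics.mean(slants) if slants else None

# Reference model m: both legislators give a speech (made-up values).
m = [
    {"speech_given": True, "slant": 0.25},
    {"speech_given": True, "slant": 0.75},
]

# A pretend truth that resembles the reference model.
m_star_close = [
    {"speech_given": True, "slant": 0.5},
    {"speech_given": True, "slant": 0.75},
]

# A pretend truth in which no speech is ever given.
m_star_far = [
    {"speech_given": False, "slant": None},
    {"speech_given": False, "slant": None},
]

a_m = inquiry(m)
a_m_star_close = inquiry(m_star_close)
a_m_star_far = inquiry(m_star_far)

print(a_m, a_m_star_close, a_m_star_far)  # 0.5 0.625 None
```

In the first pretend truth, \(a_m\) is merely somewhat off; in the second, the inquiry has no answer at all because the model conditions on events that never occur.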
A data strategy D generates data \(d\). Data \(d\) arises under model M with probability \(P_M(d \mid D)\). The data strategy includes sampling, assignment, and measurement strategies. Nearly all data strategies sample and measure, but not all assign treatments. Whether or not the data strategy includes assignment is the defining distinction between experimental and observational studies. When applied in the real world, the data strategy operates on \(m^*\) to produce the realized data: \(D(m^*) = d^*\). When we simulate research designs, the data strategy operates on a simulated model draw \(m\) to produce fabricated data: \(D(m) = d\).
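The three components of a data strategy can be sketched as a single function that maps a model draw to a dataset. This Python illustration assumes simple random sampling, complete random assignment of half the sample, and error-free measurement; the model, the treatment name `Z`, and the function names are hypothetical.

```python
import random

def make_model(n):
    """A made-up model draw m with both potential outcomes for each unit."""
    return [
        {"id": i, "Y_Z0": float(i % 3), "Y_Z1": float(i % 3) + 1.0}
        for i in range(n)
    ]

def data_strategy(m, n_sample, seed):
    """D: sample units, assign treatment, measure the revealed outcome."""
    rng = random.Random(seed)

    # Sampling: simple random sample of n_sample units.
    sampled = rng.sample(m, n_sample)

    # Assignment: complete random assignment of half the sample to Z = 1.
    ids = [unit["id"] for unit in sampled]
    treated_ids = set(rng.sample(ids, n_sample // 2))

    # Measurement: record only the potential outcome that Z reveals.
    d = []
    for unit in sampled:
        z = 1 if unit["id"] in treated_ids else 0
        y = unit["Y_Z1"] if z == 1 else unit["Y_Z0"]
        d.append({"id": unit["id"], "Z": z, "Y": y})
    return d

m = make_model(100)
d = data_strategy(m, n_sample=10, seed=7)
for row in d:
    print(row)
```

Dropping the assignment step, and instead measuring a `Z` that arises inside the model, would turn this sketch of an experimental data strategy into an observational one.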
Finally, the answer strategy A generates answers using data. When applied to realized data, the answer strategy returns the empirical answer: \(A(d^*) = a_{d^*}\). When applied to simulated data, it returns a simulated answer: \(A(d) = a_{d}\).
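An answer strategy is simply a function of data. As a minimal Python sketch, we assume a difference-in-means estimator and two tiny made-up datasets, one standing in for realized data \(d^*\) and one for simulated data \(d\); neither the estimator nor the numbers come from the text.

```python
import statistics

def answer_strategy(d):
    """A: difference in mean outcomes between treated and control units."""
    treated = [row["Y"] for row in d if row["Z"] == 1]
    control = [row["Y"] for row in d if row["Z"] == 0]
    return statistics.mean(treated) - statistics.mean(control)

# Stand-in for realized data d* (made-up values).
d_star = [
    {"Z": 1, "Y": 3.0},
    {"Z": 1, "Y": 5.0},
    {"Z": 0, "Y": 1.0},
    {"Z": 0, "Y": 2.0},
]

# Stand-in for simulated data d (made-up values).
d_sim = [
    {"Z": 1, "Y": 2.0},
    {"Z": 1, "Y": 4.0},
    {"Z": 0, "Y": 1.0},
    {"Z": 0, "Y": 3.0},
]

a_d_star = answer_strategy(d_star)  # the empirical answer
a_d = answer_strategy(d_sim)        # a simulated answer

print(a_d_star, a_d)  # 2.5 1.0
```

The same function is applied in both cases; only the provenance of the data differs.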
Table 5.1 provides a concise description of each element of a research design and relates them to some common terms. We flag here that the term estimand has a slightly different meaning in our framework than elsewhere. We say that an estimand \(a_m\) is the value of an inquiry \(I\), whereas in some traditions “estimand” can refer to the inquiry \(I\) or to an intermediate parameter that happens to be targeted by an estimator.
| Notation | Description | Related terms |
|---|---|---|
| M | a stipulated collection of causal models | |
| \(m\) | a single model in \(M\), represented by events | a hypothetical data generating process |
| \(m^*\) | the true model | true data generating process |
| I | the inquiry | estimand; quantity of interest |
| \(a_m = I(m)\) | the answer under the model; an estimand | |
| \(a_{m^*} = I(m^*)\) | the true answer; the estimand | estimand; quantity of interest |
| D | the data strategy | |
| \(d = D(m)\) | fabricated data; simulated data | |
| \(d^* = D(m^*)\) | realized data | |
| A | the answer strategy | data analysis; estimator |
| \(a_{d} = A(d)\) | a simulated answer; an estimate | the observed estimate |
| \(a_{d^*} = A(d^*)\) | the empirical answer; the estimate | the observed estimate |
The full set of causal relationships among M, I, D, and A, with respect to \(m\) and \(m^*\), \(a_m\) and \(a_{m^*}\), \(d\) and \(d^*\), and \(a_d\) and \(a_{d^*}\), can be seen in the schematic representation of a research design given in Figure 5.1. The figure illustrates how a research design involves a correspondence between \(I(m) = a_m\) and \(A(d) = a_d\). The theoretical half of a research design produces an answer to the inquiry in theory. The empirical half of a research design produces an empirical estimate of the answer to the inquiry. Neither answer is necessarily close to the truth \(a_{m^*}\), of course. And, as shown in the figure, the truth is not directly accessible to us either in theory or in empirics. Our gamble in empirical research, however, is that our theoretical models are close enough to the truth: that the truth is like the set of models we imagine. If the models in \(M\) do not contain \(m^*\) or are too different from the truth, then the research design process could, ex post, prove to have led researchers to select poor designs.
Figure 5.1 also reveals a striking analogy between the M, I relationship and the D, A relationship. The answer we aim for is obtained by applying I to a draw from M. But the answer we have access to is obtained by applying A to a draw from D. And our hope, usually, is that these two answers are quite similar. In some cases, this suggests that the function A should be “like” the function I. For instance, if we are interested in the mean of a population and we have access to a random sample, the data available to us from D is like the ideal data we would have if we could observe the nodes and edges in M directly. This mirroring across the two halves of a research design is the root of Principle 3.7: Seek M:I::D:A parallelism.
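The population-mean example lends itself to a short simulation. In the Python sketch below, the inquiry and the answer strategy are literally the same function, applied to the model on one side and to a random sample on the other. The population of 100 units with outcomes 0 through 99, the sample size of 50, and the 2,000 repetitions are arbitrary choices for illustration.

```python
import random
import statistics

def mean_y(rows):
    """Average of Y over whatever rows are supplied."""
    return statistics.mean([row["Y"] for row in rows])

inquiry = mean_y          # I summarizes the model
answer_strategy = mean_y  # A summarizes the data in the same way

def data_strategy(model, n, rng):
    """D: a simple random sample of n units, with Y measured without error."""
    return rng.sample(model, n)

# A made-up model draw m: a population of 100 units.
m = [{"Y": float(i)} for i in range(100)]
a_m = inquiry(m)

# Repeatedly draw d = D(m) and compute a_d = A(d).
rng = random.Random(42)
estimates = [answer_strategy(data_strategy(m, 50, rng)) for _ in range(2000)]
average_estimate = statistics.mean(estimates)

print("a_m:", a_m)
print("average a_d over simulations:", round(average_estimate, 2))
```

Because D delivers a random subset of the very units that I summarizes, applying the same function on both sides yields estimates that are centered near \(a_m\) in this setup.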
Finally, in Figure 5.1 no arrows go into M, I, D, or A, since they are not caused by any of the other nodes in the DAG. We could have included a node for the research designer, who deliberately sets the details of M, I, D, and A, but we omit it for clarity.
5.2 Declaration in code
Table 5.2 illustrates these different quantities. We stipulate a set of event generating processes, \(M\), in which \(Y\) depends on \(X\). We define a question, \(I\): what is the average value of \(Y\) when \(X=1\)? We calculate what the right answer to our inquiry would be under one of our stipulated event generating processes, \(a_m = I(m)\). We also imagine we could describe how the world in fact is, \(m^*\), and calculate what the right answer would be in that case, \(a_{m^*}\). We then apply the data strategy \(D\) to produce \(d^*\) and use an answer strategy \(A\) to calculate the answer we would get given \(d^*\), namely \(a_{d^*}\).
For each of these steps we show DeclareDesign code in the first column. In the second column, we show the simulated \(m\), \(m^*\), and \(d^*\) datasets, along with the values of \(a_m\), \(a_{m^*}\), and \(a_{d^*}\) for one run of the simulation.
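For readers who want to trace the same sequence of quantities outside of DeclareDesign, the following Python sketch walks through analogous steps. It is not the DeclareDesign code referred to above. The slopes (0.5 for the stipulated model, 0.3 for the pretend truth), the sample sizes, and the seeds are assumptions, and we read the inquiry descriptively as the mean of \(Y\) among units with \(X=1\). In real research \(m^*\) is unknowable; we fabricate one here only to show where each quantity comes from.

```python
import random
import statistics

def draw_model(n, slope, seed):
    """One event generating process in which Y depends on X through `slope`."""
    rng = random.Random(seed)
    units = []
    for i in range(n):
        x = rng.randint(0, 1)
        y = slope * x + rng.gauss(0, 1)
        units.append({"id": i, "X": x, "Y": y})
    return units

def mean_y_when_x1(rows):
    """Average Y among rows with X = 1; None if there are no such rows."""
    ys = [row["Y"] for row in rows if row["X"] == 1]
    return statistics.mean(ys) if ys else None

inquiry = mean_y_when_x1          # I
answer_strategy = mean_y_when_x1  # A

def data_strategy(model, n_sample, seed):
    """D: simple random sample of units, then measure X and Y."""
    rng = random.Random(seed)
    sampled = rng.sample(model, n_sample)
    return [{"X": u["X"], "Y": u["Y"]} for u in sampled]

m = draw_model(n=200, slope=0.5, seed=1)       # m: a draw from M
a_m = inquiry(m)                               # a_m = I(m)

m_star = draw_model(n=200, slope=0.3, seed=2)  # pretend truth m*
a_m_star = inquiry(m_star)                     # a_{m*} = I(m*)

d_star = data_strategy(m_star, n_sample=50, seed=3)  # d* = D(m*)
a_d_star = answer_strategy(d_star)                   # a_{d*} = A(d*)

print("a_m      =", a_m)
print("a_m_star =", a_m_star)
print("a_d_star =", a_d_star)
```

The three printed values will generally differ: \(a_m\) and \(a_{m^*}\) differ because the stipulated model is not the truth, and \(a_{d^*}\) differs from \(a_{m^*}\) because the data are a sample.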
[Table 5.2. Columns: Description, Draw. Cell contents are missing from this copy.]
Further reading
 Guido W. Imbens and Rubin (2015) on potential outcomes
 Halpern (2000) on causal models