6 Specifying the model
Models are theoretical abstractions we use to make sense of the world and organize our understanding of it. They play many critical roles in research design. First and foremost, models describe the units, conditions, and outcomes that define inquiries. Without well-specified models, we cannot pose well-specified inquiries. Second, models provide a framework to evaluate the sampling, assignment, and measurement procedures that form the data strategy. Models encode our beliefs about the kinds of information that might result when we conduct empirical observations. Third, they guide the selection of answer strategies: what variables should we condition on, what variables should we not condition on, how flexible or rigid should our estimation procedure be? Whenever we rely on assumptions in the model—for example, normality of errors, conditional independencies, or latent scores—we are betting that the real causal model \(m^*\) has these properties.
We need to imagine models in order to declare and diagnose research designs. This need often generates discomfort among students and researchers who are new to thinking about research design this way. In order to compute the root-mean-squared error, bias, or statistical power of a design, we need to write down more than we know for sure in the model. We have to describe joint distributions of covariates, treatments, and outcomes, which entails making guesses about the very means, covariances, and effect sizes (among many other things) that the empirical research design is supposed to measure. “What do you mean, write down the potential outcomes—that’s what I’m trying to learn about!”
The discomfort arises because we do not know the true causal model of the world—what we referred to as \(m^*\) in Figure 5.1 We are uncertain about which of the many plausible models of the world we entertain is the correct one. In fact we can be fairly certain that none of them is really correct.
The good news is that they do not have to be correct. The M in MIDA refers to these possible models, which we call “reference models.” M is a set of reference models. Their role is to provide a stipulation of how the world works, which allows us to answer some questions about our research design. If the reference model were true, what then would the value of the inquiry be? Would the estimator generate unbiased estimates? How many units would we need to achieve an RMSE of 0.5? Critically, whether a design is good or bad depends on the reference models. A data and analysis strategy might fare very well under one model of the world but poorly under another. Thus to get to the point where we can assess a design we need to make the family of reference models explicit. Our hope is that when we apply A and I to the real event generating processes we will have a similar relation between the answer we seek and the answers we get as we have between conjectured estimands and simulated estimates. Beyond that, we don’t have to actually believe any of the models in M.
6.1 Elements of models
Models are characterized by three elements: the signature, the functional relationships, and a probability distribution over exogenous variables (Halpern 2000). We’ll describe each in turn.
6.1.1 Signature
The signature of the model describes the variables in the model and their ranges. The signature comprises two basic kinds of variables: exogenous variables and endogenous variables. Exogenous means “generated from without” and endogenous means “generated from within.” Stated more plainly, exogenous variables are not caused by other variables in the model because they are randomly assigned by nature or by human intervention. Endogenous variables result as a consequence of exogenous variables; they are causally downstream from exogenous variables.
What kinds of variables are exogenous? Typically, we think of explicitly randomly assigned variables as exogenous: the treatment assignment variable in a randomized experiment is exogenous. We’ll often use the variable letter \(Z\) to refer to assignments that were explicitly randomized. We also often characterize the set of unobserved causes of observed variables as exogenous. We summarize the set of unobserved causes of an observed variable with the letter \(U\). These unobserved causes are exogenous in the sense that, whatever the causes of \(U\) may be, they do not cause other endogenous variables in a model.
What kinds of variables are endogenous? Everything else: covariates, mediators, moderators, and outcome variables. We’ll often use the letter \(X\) when describing covariates or moderators, the letter \(M\) when describing mediators, and the letter \(Y\) when describing outcome variables. Each of these kinds of variables is the downstream consequence of exogenous variables, whether those exogenous variables are observed or not.
Critically the signature of a model is itself a part of the design: we as designers must choose the variables of interest. We do not, however, get to decide the functional relations between variables—those are set by \(m^*\).
6.1.2 Functional relations
The second element of the model is the set of functions that produce endogenous variables. The output of these functions are always endogenous variables and the inputs can be either exogenous variables or other endogenous variables. We embrace two different, but ultimately compatible, ways of thinking about these functional relationships: structural causal models and the potential outcomes model.
The structural causal model account of causality is often associated with directed acyclic graphs or DAGs (Pearl 2009). Each node on a graph is a variable and the edges that connect them represent possible causal effects. An arrow from a “parent” node to a “child” node indicates that the value of the parent sometimes influences the outcome of the child. More formally: the parent’s value is an argument in a functional equation determining the child’s outcome. DAGs emphasize a mechanistic notion of causality. When the exposure variable changes, the outcome variable changes as a result, possibly in different ways for different units.
DAGs represent nonparametric structural causal models. The qualifier “nonparametric” means that DAGs don’t show how variables are related, just that they are related. This is no criticism of DAGs — they just don’t encode all of our causal beliefs about a system. We illustrate these ideas using a DAG to describe a model for an abstract research design in which we will collect information about \(N\) units. We will assign a treatment \(Z\) at random, and collect an outcome \(Y\). We know there are other determinants of the outcome beyond \(Z\), but we don’t know much about them. All we’ll say about those other determinants \(U\) is that they are causally related to \(Y\), but not to \(Z\), since \(Z\) will be randomly assigned by us.
This nonparametric structural causal model can be written like this:
\[\begin{align*} Y &= f_Y(Z, U). \end{align*}\]
Here, the outcome \(Y\) is related to \(Z\) and \(U\) by some function \(f_Y\), but the details of the function \(f_Y\)—whether \(Z\) has a positive or negative effect on \(Y\), for example—are left unstated in this nonparametric model. The DAG in Figure 6.1 encodes this model in graphical form. We use a blue circle around the treatment assignment to indicate that \(Z\) is randomly assigned as part of the data strategy.
To assess many properties of a research design we often need to make the leap from nonparametric models to parametric structural causal models. We need to enumerate beliefs about effect sizes, correlations between variables, intra-cluster correlations (ICCs), specific functional forms, and so forth. Since any particular choice for these parameters could be close or far from the truth, we will typically consider a range of plausible values for each model parameter.
One possible parametric model is given by the following:
\[\begin{align*} Y &= 0.5 \times Z + U. \end{align*}\]
Here, we have specified the details of the function that relates \(Z\) and \(U\) to \(Y\). In particular, it is a linear function in which \(Y\) is equal to the unobserved characteristic \(U\) in the control condition, but is 0.5 higher in the treatment condition. We could also consider a more complicated parametric model in which the relationship between \(Z\) and \(Y\) depends on an interaction with the unobservables in \(U\):
\[\begin{align*} Y &= -0.25 \times Z - 0.05\times Z \times U + U \end{align*}\]
Both of these parameterizations are equally consistent with the DAG in Figure 6.1, which underlines the powerful simplicity of DAGs, but also their theoretical sparsity. The two parametric models are theoretically quite different from one another. In the first, the effects of the treatment \(Z\) are positive and the same for all units; in the second, the effects may be negative and quite different from unit to unit, depending on the value of \(U\). If both of these reference models are plausible, we’ll want to include them both in M, to ensure that our design is robust to both possibilities. Here we have a small instance of Principle Principle 3.2: Design agnostically. We want to consider a wide range of plausible parameterizations since we are ignorant of the true causal model (\(m^*\)).
The potential outcomes formalization emphasizes a counterfactual notion of causality. \(Y_i(Z_i = 0)\) is the outcome for unit \(i\) that would occur were the causal variable \(Z_i\) were set to 0 and \(Y_i(Z_i = 1)\) is the outcome that would occur if \(Z_i\) were set to 1. The difference between them defines the effect of the treatment on the outcome for unit \(i\). Since at most only one potential outcome can ever be revealed, at least one of the two potential outcomes is necessarily counterfactual, meaning not observable. Usually, the potential outcomes notation \(Y_i(Z_i)\) reports how outcomes depend on one feature, \(Z_i\), ignoring all other determinants of outcomes. That’s not to say those other causes don’t matter—they might—they are just not the focus. In a sense, they are contained in the subscript \(i\) since the units carry with them all relevant features other than \(Z_i\). We can generalize to settings where we want to consider more than one cause, in which case we use expressions of the form \(Y_i(Z_i = 0, X_i = 0)\) or \(Y_i(Z_i = 0, X_i = 1)\).
The potential outcomes version of the first structural causal model might be written for \(i \in \{1,2,\dots, n\}\) as:
\[\begin{align*} Y_i(0) &= U_i \\ Y_i(1) &= 0.5 + U_i. \end{align*}\]
The potential outcomes under the second model would be written:
\[\begin{align*} Y_i(0) &= U_i \\ Y_i(1) &= -0.25 - 0.05 \times U_i + U_i. \end{align*}\]
Despite what might be inferred from the sometimes heated disagreements between scholars who prefer one formalization to the other, structural causal models and potential outcomes are compatible systems for thinking about causality. Potential outcome distributions can also be described using Pearl’s \(\pearldo()\) operator: \(\Pr(Y|\pearldo(Z = 1))\) is the probability distribution of the treated potential outcome. We could use only the language of structural causal models or we could use only the language of potential outcomes, since a theorem in one is theorem in the other (Pearl 2009, 243). We choose to use both languages because they are useful for expressing different facets of research design. We use structural causal models to describe the web of causal interrelations in a concise way (writing out the potential outcomes for every relationship in the model is tedious). We use potential outcomes when the inquiry involves comparisons across conditions and to make fine distinctions between inquiries that apply to different sets of units.
6.1.3 Probability distributions over exogenous variables
The final element of a model is a description of the probability distribution of exogenous variables. For example, we might describe the distribution of the treatment assignment as Bernoulli distribution with \(p\) = 0.1 for “coin flip” random assignment with a 10% chance of a unit being assigned to treatment. We might stipulate that the unobserved characteristics \(U\) are normally distributed with a mean of 1 and a standard deviation of 2. The distributions of the exogenous variables then ramify through to the distributions of the endogenous variables through the functional relations.
In general, multiple distributions can behave equivalently in a model. For instance, if we specify that \(u\) is distributed normally with mean 0 and standard deviation \(\sigma\) and that \(Y = 1\) if and only if \(u \geq 0\), then the distribution induced on \(Y\) will be the same regardless of the choice of \(\sigma\). Indeed the same distribution on \(Y\) could be generated by countless other distributions on \(u\). The focus then is not on getting these distributions “right,” but on selecting distributions in light of their implications about the distributions of endogeneous nodes.
6.2 Types of variables in models
Any particular causal model can be a complex web of exogenous and endogenous variables woven together via a set of functional relationships. Despite the heterogeneity across models, we can describe the roles variables play in a research design with reference to the roles they play in structural causal models. There are seven roles:
- Outcomes: Variables whose level or responses we want to understand, generally referred to as \(Y\), as in Figure 6.2. Variously described as “dependent variables,” “endogenous variables,” “left-hand side variables,” or “response variables.”
- Treatments: Variables that affect outcomes. We will most often use \(D\) to refer to the main causal variable of interest in a particular study. Sometimes labeled as “independent variables,” or “right-hand side variables.”
- Moderators: Variables that condition the effects of treatment variables on outcomes: depending on the level of a moderator, treatments might have stronger or weaker effects on outcomes. Nonparametric structural causal models (like those represented in DAGs) represent moderators as additional causes of Y. See, for example, \(X2\) in Figure 6.2. Be warned however that the fact that moderators are represented as additional causes of an outcome does not imply that every additional cause of an outcome is necessarily a moderator.
- Mediators: Variables “along the path” from treatment variables to outcomes. \(M\) is an example of a mediator in Figure 6.2. Mediators are often studied to assess “how” or “why” \(D\) causes \(Y\).
- Confounders: Variables that introduce a non-causal dependence between \(D\) and \(Y\). In Figure 6.2, \(X1\) is a confounder because it causes both \(D\) and \(Y\) and could introduce a dependence between them even if \(D\) did not cause \(Y\).
- Instruments: An instrumental variable is an exogenous variable that affects a treatment variable which itself causes an outcome variable. We give a much more detailed treatment of these variables in Section 16.4. We reserve the letter \(Z\) for instruments. Random assignments are instruments in the sense that the assignment is the instrument and the actual treatment received is the treatment variable.
- Colliders: Colliders are variables that are caused by two other variables. Colliders can be important because conditioning on a collider introduces a non-causal relationship between all parents of the collider. The intuition is that if you learn that a child is tall, then learning that one parent is small makes you more likely to believe that the other parent is tall. In Figure 6.2, \(K\) is a collider that can create a non-causal dependence between \(D\) and \(Y\) (via \(U\)) if conditioned upon.
These labels reflect the researcher’s interest as much as their position in a model. Another researcher examining the same graph might, for instance, label \(M\) as their treatment variable or \(K\) as their outcome of interest.
6.2.1 What variables are needed?
Our models of the world can be more or less complex, or at least articulated at higher or lower levels of generality. How specific and detailed we need to be in our specification of possible models depends on the other features of the research design: the inquiry, the data strategy, and the answer strategy. At a minimum, we need to describe the variables required for each of these research design elements.
Inquiry: In order to reason about whether the model is sufficient to define the inquiry, we need to define the units, conditions, and outcomes used to construct our inquiry. If the inquiry is an average causal effect among a subgroup, we need to specify the relevant potential outcomes, the treatment, and the covariate that describes the subgroup.
Data strategy: If a sampling procedure involves stratification or clustering, then in the model, we need to define the variables that will be used to stratify and cluster. Similarly, treatment assignment might be blocked or clustered; correspondingly, the variables that are used to construct blocks or clusters must be defined in the model. Finally, all of the variables that will be measured should also be defined in the model. When we measure latent variables imperfectly, the model describes the latent trait and how measured responses may deviate from it. If you expect to encounter complexities in D that you need to take account of in A, such as missing data, non compliance or attritition, then possible drivers of these should also be included in M.
Answer strategy: Any measured variable that will be used in the answer strategy should be included in the model. This requirement clearly includes the observed outcomes and treatments, but also any covariates that are used to address confounding or to increase precision.
The variables required by the inquiry, data strategy, and answer strategy are necessary components of the model, but they are not always sufficient. For example, we might be worried about an unobserved confounder. Such a confounder would not be obviously included in any of the other research design elements, but is clearly important to include in the model. Ultimately, we need to specify all variables that are required for “diagnosand completeness” (see Chapter 10), which is achieved when research designs are described in sufficient detail that diagnosands can be calculated.
6.3 How to specify models
To this point, we have described formal considerations, but we have not described substantive considerations, for including particular variables or stipulating particular relations between them. The justification for the choice of reference models will depend on the purpose of the design. Broadly, we distinguish between two desiderata for selecting reference models: reality tracking and agnosticism.
6.3.1 Reality tracking
In stipulating reference models, we have incentives to focus on models that we think track reality (\(m^*\)) as well as possible. Why waste time and effort stipulating processes we don’t think will happen?
The justification for a claim that a model is reality tracking typically comes from two places: past literature and qualitative research. Past theoretical work can guide the set of variables that are relevant and how they relate to one another. Past empirical work can provide further insight into the distributions and dependencies across variables. However, when past research is thin, there is no substitute for insights gained through qualitative data collection: focus groups and interviews with key informants who know aspects of the model that are hidden from the researcher, archival investigations to understand a causal process, or immersive participant observation to see with your own eyes how social actors behave. Fenno (1978) memorably describes this process as “soaking and poking.” This mode of model development is separate from the qualitative research designs that provide answers to well-specified inquiries (see the process tracing entry, Section 16.1 for an example of such a design). Instead, qualitative insights such as this, which Lieberman (2005) labels “model-building” case studies, do not aim to answer a question, but rather yield a new theoretical model. Quantitative research is often seen as distinct from qualitative research, but the model building phase in both is itself qualitative.
The next step — selecting statistical distributions and their parameters to describe exogenous variables and the functional forms of endogenous variables — is often more uncomfortable. We do not know the magnitude of the effect of an intervention or the correlation between two outcomes before we do the research; that’s why we are conducting the study. However, we are not fully in the dark in most cases and can make educated guesses about most parameters.
We can conduct meta-analyses of past relevant studies on the same topic to identify the range of plausible effect sizes, intra-cluster correlations, correlations between variables, and other model parameters. Conducting such a meta-analysis might be as simple as collecting the three papers that measured similar outcomes in the past and calculating the intra-cluster correlations in each of them. How informative past studies are for your research setting depends on the similarity of units, treatments, and outcomes. Except in the case of pure replication studies, we are typically studying a (possibly new) treatment in a new setting, with new participants, or with new outcomes, so there will not be perfect overlap. However, the variation in effects across contexts and these other dimensions will help structure the range of our guesses specified in the model.
When there are past studies that are especially close to our own, we may want our model to match the observed empirical distribution from that past study as closely as possible. To do so, we can resample or bootstrap from the past data in order to simulate realistic data. Where there are no past studies that are sufficiently similar in some dimensions, we can collect new data through pilot studies (see Section 21.4) or baseline surveys to serve a similar purpose.
It is also possible to use the distribution over quantities in a model to represent our uncertainty over some quantities and in this way let our diagnostics integrate over our uncertainty. We did this already in Declaration 2.1, where we let the treatment effect be a random draw from a stipulated distribution. In doing this, the success function represents expected success with respect to this distribution over treatment effects.
Since it excludes cases we deem improbable, a focus on reality tracking models seems to contradict Principle 3.2: Design agnostically (see also below, Section 6.3.2). However, by focusing on reality-tracking models, we aim to contain the smallest set of plausible models that are needed to capture the essentials of the true process. In practice, of course, we might never represent \(m^*\) accurately. For instance, we might contemplate a set of worlds in which an effect lies between 0 and 1 yet not include the true value of 2. This is not necessarily a cause for concern. The lessons learned from a diagnosis do not depend on the realized world \(m^*\) being among the set of possible draws of M, the relevant question is only whether the kinds of inferences one might draw given stipulated reference models would also hold reasonably well for the true event generating process. For instance, if our aim is to assess whether an analysis strategy generates an unbiased estimate of a treatment effect we may go to pains to make sure to model treatment assignment carefully, but modeling the size of a treatment effect correctly may not be important. The idea is that what you learn from the models that you do study is sufficient for inferences about a broader class of models within which the true event generating process might lie.
6.3.2 Agnosticism
For some purposes, the reference model might be developed not to track reality, as you see it, but rather to reflect assumptions in a scholarly debate. For instance, the purpose might be to question whether a given conclusion is valid under the assumptions maintained by some scholarly community. Indeed, it is possible that a reference model is used specifically because the researcher thinks it is inaccurate, allowing them to show that even if they are wrong about some assumptions about the world in M, their analysis will produce useful answers.
One useful exercise is to return to and question the assumptions you have built into your model. Many of these can be read directly from a DAG, but some are more subtle. In a directed acyclic graph, every arrow indicates a possible relation between a cause and an outcome. The big assumptions in these models, however, are seen not in the arrows but in the absence of arrows: every missing arrow represents a claim that an outcome is not affected by a possible cause. Answer strategies often depend upon such assumptions. Even when arrows are included, functional relations might presuppose particular features that are important for inference. For instance, a researcher using instrumental variables analysis (see Section 16.4) will generally assume that \(Z\) causes \(Y\) through \(D\) but not through other paths. This “excludability” assumption is about absent arrows. The same analysis might also assume that \(Z\) never affects \(D\) negatively. That “monotonicity” assumption is about functional forms. An agnostic reference model might loosen these assumptions to allow for possible violations of the excludability or monotonicity assumptions.
When we are agnostic, we admit we don’t know whether the truth is in the set of models we consider reasonable—so we entertain a wider set than we might think plausible. We suggest three guides for choosing these ranges: the logical minimum and maximum bounds of a parameter, a meta-analytic summary of past studies, or best- and worst-case bounds, based on the substantive interpretations of previous work. A design that performs well in terms of power and bias under many such ranges might be labeled “robust to multiple models.”
A separate goal is assessing the performance of a research design under different models implied by alternative theories. A good design will provide probative evidence no matter the treue event generating process. A poor design might provide reliable answers only for specific types of event generating processes.
An important example is the performance of a research design under a “null model,” where the true effect size is zero. A good research design should report with a high probability that there is insufficient evidence to reject a null effect. That same research design, under an alternative model with a large effect size, should with a high probability return evidence rejecting the null hypothesis of zero effect. The example makes clear that in order to understand whether the research design is strong, we need to understand how it performs under different models.
6.4 Summary
If this section left you spinning from the array of choices we have to make in declaring a model, in some ways that was our goal. Inside every power calculator and bespoke design simulation code is an array of assumptions. Some crucially determine design quality and others are unimportant. The salve to the dizziness is in Principle 3.2: Design agnostically. Where you are uncertain, explore whether both options produce similar diagnosands. The goal is for our data and answer strategies to hold up under many models.