6 Specifying the model

Models are theoretical abstractions we use to make sense of the world and organize our understanding of it. They play many critical roles in research design. First and foremost, models describe the units, conditions, and outcomes that define inquiries. Without well-specified models, we cannot ask well-specified questions. Second, models provide a framework to evaluate the sampling, assignment, and measurement procedures that form the data strategy. Models encode our beliefs about the kinds of information that will result when we conduct empirical observations. Third, models provide a framework to evaluate answer strategies: which variables should we condition on, which variables should we not condition on, and how flexible or rigid should our estimation procedure be? Whenever we rely on assumptions in the model – for example, normality of errors, conditional independencies, or latent scores – we are betting that the real causal model \(m^*\) has these properties.

We need to imagine models in order to declare and diagnose research designs. This need often generates discomfort among students and researchers who are new to thinking about research design this way. In order to compute the root mean squared error, bias, or statistical power of a design, we need to write down more than we know for sure in the model. We have to describe joint distributions of covariates, treatments, and outcomes, which entails making guesses about the very means, covariances, and effect sizes (among many other things) that the empirical research design is supposed to measure. “What do you mean, write down the potential outcomes – that’s what I’m trying to learn!”

The discomfort arises because we do not know the true causal model of the world – what we referred to as \(m^*\) in Figure 5.1. We are uncertain about which of the many plausible models of the world we entertain is the correct one. The M in MIDA refers to these possible models, which we call “reference models.” M is a set of reference models, with typical element \(m\), though in some cases M may contain just a single model. We have no particular theoretical commitment to reference models. Their role is to provide a stipulation of how the world works, which allows us to answer some questions about our research design. If the reference model were true, what would the value of the inquiry be? Would my estimator be unbiased? How many units would I need to achieve an RMSE of 0.5? Critically, whether a design is good or bad depends on the reference model. A data and analysis strategy might fare very well under one model of the world but poorly under another. Thus, to get to the point where we can assess a design, we need to make the family of reference models explicit. This chapter is about how to go about this difficult task.

6.1 Elements of models

Models are characterized by three elements: the signature, the functional relationships, and a probability distribution over exogenous variables. We’ll describe each in turn.

6.1.1 Signature

The signature of the model describes the variables in the model and their ranges. The signature comprises two basic kinds of variables: exogenous variables and endogenous variables. Exogenous means “generated from without” and endogenous means “generated from within.” Stated more plainly, exogenous variables are not caused by other variables in the model; their values are assigned from outside the model, whether by nature or by human intervention. Endogenous variables result as a consequence of exogenous variables; they are causally downstream from exogenous variables.

What kinds of variables are exogenous? Typically, we think of explicitly randomly assigned variables as exogenous: the treatment assignment variable in a randomized experiment is exogenous. We’ll often use the variable letter \(Z\) to refer to assignments that were explicitly randomized. We also often characterize the set of unobserved causes of observed variables as exogenous. We summarize the set of unobserved causes of an observed variable with the letter \(U\). These unobserved causes are exogenous in the sense that, whatever the causes of \(U\) may be, they do not cause other endogenous variables in a model.

What kinds of variables are endogenous? Everything else: covariates, mediators, moderators, and outcome variables. We’ll often use the letter \(X\) when describing covariates or moderators, the letter \(M\) when describing mediators, and the letter \(Y\) when describing outcome variables. Each of these kinds of variables is the downstream consequence of exogenous variables, whether those exogenous variables are observed or not.

Critically, the signature of a model is itself a part of the design. We get to choose which variables are of interest and even their scales. We do not, however, get to decide the functional relations between variables – those are set according to \(m^*\).

6.1.2 Functional relations

The second element of the model is the set of functions that produce endogenous variables. The outputs of these functions are always endogenous variables and the inputs can be either exogenous variables or other endogenous variables. We embrace two different, but ultimately compatible, ways of thinking about these functional relationships: structural causal models and the potential outcomes model.

The structural causal model account of causality is often associated with directed acyclic graphs (DAGs). Each node on a graph is a variable and the edges that connect them represent possible causal effects. An arrow from a “parent” node to a “child” node indicates that the value of the parent sometimes influences the outcome of the child. More formally: the parent’s value is an argument in a functional equation determining the child’s outcome. DAGs emphasize a mechanistic notion of causality. When the exposure variable changes, the outcome variable changes as a result, possibly in different ways for different units.

DAGs represent nonparametric structural causal models. This means that they don’t show how variables are related, just that they are related. This is no criticism of DAGs — they just don’t encode all of our causal beliefs about a system. We illustrate these ideas using a simple DAG to describe a model with an abstract research design in which we will collect information about \(N\) units. We will assign a treatment \(Z\) at random, and collect an outcome \(Y\). We know there are other determinants of the outcome beyond \(Z\), but we don’t know much about them. All we’ll say about those other determinants \(U\) is that they are causally related to \(Y\), but not to \(Z\), since \(Z\) will be randomly assigned by us.

This nonparametric structural causal model can be written like this:

\[\begin{align*} Y &= f_Y(Z, U) \end{align*}\]

Here, the outcome \(Y\) is related to \(Z\) and \(U\) by some function \(f_Y\), but the details of what the function \(f_Y\) is – whether \(Z\) has a positive or negative effect on \(Y\), for example – are left unstated in this nonparametric model. The DAG in Figure 6.1 encodes this model in graphical form. We use a blue circle around the treatment assignment to reflect the idea that \(Z\) is randomly assigned as part of the data strategy.


Figure 6.1: Simple DAG. U is unobserved.

To assess many properties of a research design we will often need to make the leap from nonparametric models to parametric structural causal models. We need to enumerate beliefs about effect sizes, correlations between variables, intra-class correlations (ICCs), specific functional forms, and so forth. Since any particular choice for these parameters could be close or far from the truth, we will typically consider a range of plausible values for each model parameter.

One possible parametric model is given by the following:

\[\begin{align*} Y &= 0.5 \times Z + U \end{align*}\]

Here, we have specified the details of the function that relates \(Z\) and \(U\) to \(Y\). In particular, it is a linear function in which \(Y\) is equal to the unobserved disturbance \(U\) in the control condition, but is 0.5 higher in the treatment condition. We could also consider a more complicated parametric model in which the relationship between \(Z\) and \(Y\) depends on an interaction with the unobservables in \(U\):

\[\begin{align*} Y &= -0.25 \times Z - 0.05\times Z \times U + U \end{align*}\]

Both of these parameterizations are consistent with the DAG in Figure 6.1, which underlines the powerful simplicity of DAGs but also their theoretical sparsity. The two parametric models are theoretically quite different from one another. In the first, the effects of the treatment \(Z\) are positive and the same for all units; in the second, the effects are negative and quite different from unit to unit, depending on the value of \(U\). If both of these reference models are plausible, we’ll want to include them both in M, to ensure that our design is robust to both possibilities. This is a small instance of Principle 3.3 (Entertain many models): we want to consider a wide range of plausible parameterizations since we are ignorant of the true causal model \(m^*\).
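
To preview the declaration syntax introduced in Section 6.4, the two parameterizations could be written as two model steps. This is a sketch only: the sample size of 100 and the standard normal distribution for \(U\) are illustrative assumptions, not part of either reference model.

library(DeclareDesign)

M_constant_effect <- declare_model(
  N = 100,
  U = rnorm(N),
  potential_outcomes(Y ~ 0.5 * Z + U)                   # constant, positive effect of Z
)

M_heterogeneous_effect <- declare_model(
  N = 100,
  U = rnorm(N),
  potential_outcomes(Y ~ -0.25 * Z - 0.05 * Z * U + U)  # negative, unit-varying effect of Z
)

If both are plausible, both declarations would be included in M.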

By contrast with structural causal models, the potential outcomes formalization emphasizes a counterfactual notion of causality. \(Y_i(Z = 0)\) is the outcome for unit \(i\) that would occur if the causal variable \(Z\) were set to zero and \(Y_i(Z = 1)\) is the outcome that would occur if \(Z\) were set to one. The difference between them defines the effect of the treatment on the outcome for unit \(i\). Since at most one potential outcome can ever be revealed, at least one of the two potential outcomes is necessarily counterfactual. Usually, the potential outcomes notation \(Y_i(Z)\) reports how outcomes depend on one feature, \(Z\), ignoring all other determinants of outcomes. This is not to say that these don’t matter—they do—they are just not the focus. In a sense, they are contained in the subscript \(i\) since the units carry with them all relevant features other than \(Z\). We can generalize to settings where we want to consider more than one cause, in which case we use expressions of the form \(Y_i(Z = 0, X = 0)\) or \(Y_i(Z = 0, X = 1)\).

Under the first structural model we consider, the potential outcomes (for \(Y\)) might be written, for \(i \in \{1,2,\dots, n\}\) as:

\[\begin{align*} Y_i(0) &= u_i \\ Y_i(1) &= 0.5 + u_i \end{align*}\]

The potential outcomes under the second model would be written:

\[\begin{align*} Y_i(0) &= u_i \\ Y_i(1) &= -0.25 + 0.95 \times u_i \end{align*}\]

Despite what you may have inferred from the sometimes heated disagreements between scholars who prefer one formalization to the other, structural causal models and potential outcomes are compatible systems for thinking about causality. Potential outcome distributions can also be described using Pearl’s \(\mathrm{do}()\) operator: \(Pr(Y|\mathrm{do}(Z = 1))\) is the probability distribution of the treated potential outcome. We could use only the language of structural causal models or we could use only the language of potential outcomes, since a theorem in one is a theorem in the other (Pearl 2009, 243). We choose to use both languages because they are useful for expressing different facets of research design. We use structural causal models to describe the web of causal interrelations in a concise way (writing out the potential outcomes for every relationship in the model is tedious). We use potential outcomes when the inquiry involves comparisons across conditions and to make fine distinctions between inquiries that apply to different sets of units.

6.1.3 Probability distribution over exogenous variables

The final element of a model is a description of the probability distribution of exogenous variables. For example, we might describe the distribution of the treatment assignment as a Bernoulli distribution with \(p\) = 0.1 to describe “coin flip” random assignment with a 10% chance of a unit being assigned to treatment. We might stipulate that the unobserved disturbances \(U\) are normally distributed with a mean of 1 and a standard deviation of 2. The distributions of the exogenous variables then propagate to the distributions of the endogenous variables through the functional relations.
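
A sketch of these stipulations in code (the sample size of 100 and the linear functional relation are illustrative assumptions; the distributions are the ones just described):

M <- declare_model(
  N = 100,
  U = rnorm(N, mean = 1, sd = 2),      # unobserved disturbances
  Z = rbinom(N, size = 1, prob = 0.1), # "coin flip" assignment with a 10% chance of treatment
  Y = 0.5 * Z + U                      # the endogenous variable inherits these distributions
)

In a full declaration, the assignment would usually be generated by the data strategy (declare_assignment); it appears here only to illustrate the distributional stipulation.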

How we draw from these probability distributions has consequences for what we think of as the set of units about whom we are drawing inferences. There are three distinct ways of thinking about this problem.

The first, and possibly most common way of thinking is the “finite population” setting. Here we enumerate a fixed population of units about whom we seek to draw inferences. We sample from this finite population in the data strategy in such a way that we can use the sample to draw inferences about the population. The probability distribution over the exogenous variables simply enumerates the values that these variables take on in the population. Any randomness in the design is generated by the sampling and assignment procedures, not in the values of the exogenous variables.

A second, and closely related, framework is the finite sample setting. Here the population is the sample, and we don’t contemplate any extrapolation from the sample to the population. Finite sample inference is common in research designs that involve random assignment of treatments. The only source of randomness in the finite sample setting is the random assignment itself.

A third class is the superpopulation setting, in which we imagine that any particular population is just a draw from an infinite population. In this case, we can conceive of the randomness in the design as being fundamental – every unit is a random draw from the superpopulation.
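
As a sketch of the contrast between the first and third framings (the names and numbers are illustrative): in the finite population setting the exogenous values are fixed once and for all, while in the superpopulation setting they are redrawn in every simulation.

# Finite population: U is drawn once and held fixed; randomness comes
# only from sampling and assignment in the data strategy
fixed_population <- data.frame(ID = 1:1000, U = rnorm(1000))
M_finite <- declare_model(data = fixed_population)

# Superpopulation: every simulation draws a fresh population of 1,000 units
M_super <- declare_model(N = 1000, U = rnorm(N))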

6.2 Lexicon of common variable types

Any particular causal model will be a complex web of exogenous and endogenous variables woven together via a set of functional relationships. Despite the heterogeneity across models, we can describe the roles variables play in a research design with reference to the roles they play in structural causal models. There are seven:

  1. Outcomes: the variable whose level or responses we want to understand, generally referred to as \(Y\), as in Figure 6.2. Variously described as “dependent variables,” “endogenous variables,” “left-hand side variables,” or “response variables.”
  2. Treatments: the main variable or variables that affect outcome variables under study. We will use \(D\) most often to refer to the main causal variable of interest in a particular study.
  3. Moderators: variables that condition the effects of treatment variables on outcomes. See, for example, \(X2\) in Figure 6.2. The figure indicates that \(X2\) is a cause of \(Y\) but does not explicitly indicate that there is an interaction between \(D\) and \(X2\). One justification is that, as a general matter, if two variables both cause an outcome, it would be surprising if they did not interact in some way.
  4. Mediators: variables “along the path” from treatment variables to outcomes. \(M\) is an example of a mediator in this figure. Mediators are often studied to assess “how” or “why” \(D\) causes \(Y\).
  5. Confounders: variables that introduce a noncausal correlation between \(D\) and \(Y\). In the figure, \(X1\) is a confounder because it causes both \(D\) and \(Y\) and could introduce a correlation between them even if \(D\) did not cause \(Y\).
  6. Instruments: an instrumental variable is an exogenous variable that affects a treatment variable and can help us figure out the relationship between the treatment variable and the outcome. Such variables are studied instrumentally and not for their own sake. We give a much more detailed treatment of these variables in Section 15.4. We reserve the letter \(Z\) for instruments. Random assignments are instruments in the sense that the assignment is the instrument and the actual treatment received is the treatment variable.
  7. Colliders: variables that are caused by two other variables. Colliders can be important because conditioning on a collider introduces a noncausal relationship between the causes of the collider. In Figure 6.2, \(K\) is a collider that can create a noncausal correlation between \(D\) and \(Y\) (via \(U\)) if conditioned upon.

These labels reflect the researcher’s interest as much as their position in a model. Another researcher examining the same graph might, for instance, label \(M\) as their treatment variable or \(K\) as their outcome of interest.


Figure 6.2: A DAG with an explanatory variable of interest (D), an outcome of interest (Y), a mediator (M), a confounder (X1), a moderator (X2), an instrument (Z), and a collider (K).
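
One parametric model consistent with the roles in Figure 6.2 might be declared as follows. This is a hedged sketch: every coefficient, distribution, and the sample size are illustrative placeholders, not claims about \(m^*\).

M_example <- declare_model(
  N = 100,
  U = rnorm(N),                                        # unobserved cause of Y and K
  X1 = rnorm(N),                                       # confounder: causes both D and Y
  X2 = rnorm(N),                                       # moderator of the effect of D on Y
  Z = rbinom(N, size = 1, prob = 0.5),                 # instrument: affects Y only through D
  D = rbinom(N, size = 1, prob = pnorm(0.5 * Z + X1)), # treatment, caused by Z and X1
  M = 0.5 * D + rnorm(N),                              # mediator on the path from D to Y
  Y = 0.2 * D + 0.4 * M + 0.1 * D * X2 + 0.3 * X1 + U, # outcome
  K = D + U + rnorm(N)                                 # collider: conditioning on K links D and Y via U
)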

6.2.1 What variables are needed?

Our models of the world can be more or less complex, or at least articulated at higher or lower levels of generality. How specific and detailed we need to be in our specification of possible models depends on the other features of the research design: the inquiry, the data strategy, and the answer strategy. At a minimum, we need to describe the variables required for each of these research design elements.

Inquiry: In order to reason about whether the model is sufficient to define the inquiry, we need to define the units, conditions, and outcomes used to construct our inquiry. If the inquiry is an average causal effect among a subgroup, we need to specify the relevant potential outcomes and the covariate that describes the subset.

Data strategy: Sampling procedures often involve stratification or clustering, so in the model, we need to define the variables that will be used to stratify and cluster. Similarly, treatment assignment might be blocked or clustered; the variables that are used to construct blocks or clusters must be defined in the model. Finally, all of the variables that will be measured should also be defined in the model. When we measure latent variables imperfectly, the model describes the latent trait and how measured responses may deviate from it.

Answer strategy: Any measured variable that will be used in the answer strategy should be included in the model. This clearly includes the observed outcomes and treatments, but also the covariates that are used to address confounding or to increase precision.
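
As a sketch, a single model covering all three sets of requirements might include a stratification variable for sampling, a subgroup covariate used by both the inquiry and the answer strategy, and a latent trait alongside its imperfectly measured counterpart. All names and numbers here are illustrative assumptions.

M <- declare_model(
  villages = add_level(
    N = 50,
    region = sample(c("north", "south"), N, replace = TRUE)  # stratum used by the sampling procedure
  ),
  citizens = add_level(
    N = 20,
    X = rbinom(N, size = 1, prob = 0.5),            # subgroup covariate for the inquiry and answer strategy
    Y_latent = rnorm(N),                            # latent trait
    Y = rbinom(N, size = 1, prob = pnorm(Y_latent)) # measured response that deviates from the latent trait
  )
)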

The variables required by the inquiry, data strategy, and answer strategy are necessary components of the model, but they are not always sufficient. For example, we might be worried about an unobserved confounder. Such a confounder would not be obviously included in any of the other research design elements but is clearly important to include in the model. Ultimately, we need to specify all variables that are required for diagnosand completeness (see Section 10.5), which is achieved when research designs are described in sufficient detail that diagnosands can be calculated.

6.3 Substantive justifications for choices

To this point, we have described formal considerations but we have not described substantive considerations for including particular variables or stipulating particular relations between them. The justification for your choice of reference model will depend on the purpose of your design. Broadly we distinguish two desiderata: reality tracking and agnosticism.

6.3.1 Reality tracking

In stipulating reference models, we have incentives to focus on models that we think track reality (\(m^*\)) as well as possible. Why waste resources stipulating processes that differ from those you expect to be in operation?

The justification for reality tracking typically comes from two places: reading the past literature and qualitative research. Past theoretical work can guide the set of variables that are relevant and how they relate to one another. Past empirical work can provide further insight into the distributions and dependencies across variables. However, when past research is thin on a topic, there is no substitute for insights gained through qualitative data collection: focus groups and interviews with key informants who know aspects of the model that are hidden from the researcher, archival investigations to understand a causal process, or immersive participant observation to see with your own eyes how social actors behave. Fenno (1978) calls this “soaking and poking.” This mode of model development is separate from the qualitative research designs that provide an answer to an inquiry deductively; we examine those throughout the book. Qualitative work of this kind, which Lieberman (2005) labels “model-building” case studies, does not aim to answer a question but rather yields a new theoretical model. Quantitative research is often seen as distinct from qualitative research, but the model-building phase of both is itself qualitative.

The next step — selecting statistical distributions and their parameters to describe exogenous variables and the functional forms of endogenous variables — is often more uncomfortable. We do not know the magnitude of the effect of an intervention or the correlation between two outcomes before we do the research – that is why we are conducting the study. However, in most cases we are not fully in the dark and can make educated guesses about most parameters.

We can conduct meta-analyses of past relevant studies on the same topic to identify the range of plausible effect sizes, intraclass correlations, correlations between variables, and other model parameters. Conducting such a meta-analysis might be as simple as collecting the three papers that measured similar outcomes in the past and calculating the intraclass correlations across the three. How informative past studies are for your research setting may depend on the similarity of units, treatments, and outcomes across contexts. Except in the case of pure replication studies, we are typically studying a (possibly new) treatment in a new setting, with new participants, or with new outcomes, so there will not be perfect overlap. However, the variation in effects across contexts and these other dimensions will help structure the range of our guesses specified in the model.
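
A minimal sketch of such a summary: a fixed-effect (precision-weighted) average of three hypothetical past estimates and their standard errors.

past_estimates <- c(0.15, 0.30, 0.22)         # hypothetical effect estimates from three past studies
past_ses <- c(0.08, 0.10, 0.06)               # their standard errors
precision <- 1 / past_ses^2                   # precision weights
weighted.mean(past_estimates, w = precision)  # summary effect around which to center the model's guesses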

When there is a past study that is especially close to our own, we may want our model to match its observed empirical distribution as closely as possible. To do so, we can resample or bootstrap from the past data in order to simulate realistic data. Where no past studies are sufficiently similar along these dimensions, we can collect new data through pilot studies (see Section 21.5) or baseline surveys to serve a similar purpose.
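
One way to implement the bootstrap approach in code, assuming a hypothetical pilot_data data frame and that resample_data (the bootstrap helper from the fabricatr package, which loads with DeclareDesign) can serve as the model step’s handler:

pilot_data <- data.frame(Y = rnorm(50), X = rbinom(50, size = 1, prob = 0.5))  # stand-in for real past data

# one bootstrap draw of 500 rows, preserving the empirical joint distribution
bootstrap_draw <- resample_data(pilot_data, N = 500)

# a model step that redraws the bootstrap on every simulation
M <- declare_model(data = pilot_data, N = 500, handler = resample_data)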

Since it excludes cases we deem improbable, a focus on reality-tracking models seems to contradict Principle 3.3: Entertain many models. However, by focusing on reality-tracking models, we aim to construct the smallest set of plausible models that contains the true one. In practice, of course, we might never include \(m^*\). For instance, we might contemplate a set of worlds in which an effect lies between 0 and 1 yet not include the true value of 2. This is not necessarily a cause for concern. The lessons learned from a diagnosis do not depend on the realized world \(m^*\) being among the set of possible draws from M; the relevant question is only whether the kinds of inferences one might draw given the stipulated reference models would also hold reasonably well for the true data generating process. For instance, if your aim is to assess whether an analysis strategy generates an unbiased estimate of a treatment effect, you may go to pains to make sure that you model treatment assignment carefully, but modeling the size of a treatment effect correctly may not be important. The idea is that what you learn from the models you do study is sufficient for inferences about a broader class of models within which the true data generating process might lie.

6.3.2 Agnosticism

For some purposes, the reference model might be developed not to track reality, as you see it, but rather to reflect assumptions in a scholarly debate. For instance, the purpose might be to question whether a given conclusion is valid under the assumptions maintained by some scholarly community. Indeed it is possible that a reference model is used specifically because the researcher thinks it is inaccurate, allowing them to show that even if they are wrong about some assumptions about the world in M, their analysis will produce useful answers.

In a directed acyclic graph, every arrow indicates a possible relation between a cause and an outcome. The big assumptions in these models, however, are not seen in the arrows but in the absence of arrows: every missing arrow represents a claim that an outcome is not affected by a possible cause. Analysis strategies often depend upon such assumptions. Even when arrows are included, functional relations might presuppose particular features important for inference. For instance, a researcher using instrumental variables analysis (see Section 15.4) will generally assume that \(Z\) causes \(Y\) through \(D\) but not through other paths. This “excludability” assumption is about absent arrows. The same analysis might also assume that \(Z\) never affects \(D\) negatively. That “monotonicity” assumption is about functional forms. An agnostic reference model might loosen these assumptions to allow for possible violations of the excludability or monotonicity assumptions.
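
A sketch of such an agnostic reference model: the 10% defier share and the 0.05 direct path from \(Z\) to \(Y\) are illustrative knobs, not quantities taken from any study.

M <- declare_model(
  N = 100,
  U = rnorm(N),
  Z = rbinom(N, size = 1, prob = 0.5),       # instrument
  defier = rbinom(N, size = 1, prob = 0.1),  # monotonicity loosened: some units respond to Z in reverse
  D = ifelse(defier == 1, 1 - Z, Z),         # treatment uptake
  Y = 0.5 * D + 0.05 * Z + U                 # excludability loosened: a small direct path from Z to Y
)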

If we are agnostic, it is because we don’t know whether the truth is in the set of models we consider – so we entertain a wider set than the one we might otherwise consider plausible. We suggest three guides for choosing these wider ranges of parameter values: the logical minimum and maximum bounds of a parameter, a meta-analytic summary of past studies, or best- and worst-case bounds, based on the substantive interpretations of previous work. A design that performs well in terms of power and bias under many such ranges might be labeled “robust to multiple models.”

A separate goal is assessing the performance of a research design under different models implied by alternative theories. A good design will provide probative evidence about which model is correct no matter the underlying state. A poor design might only affirm one model when it is true but fail to provide support for an alternative when it is true.

An important example is assessing the performance of a research design under a “null model” in which the true effect size is zero. Under that null model, a good research design should, with high probability, report that there is insufficient evidence to reject the null hypothesis of no effect. That same research design, under an alternative model with a large effect size, should with high probability return evidence rejecting the null hypothesis of zero effect. The example makes clear that in order to understand whether the research design is strong, we need to understand how it performs under the models implied by alternative theoretical understandings of the world.
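
In code, the null model and an alternative large-effect model might be declared side by side, with the rest of the design diagnosed under each. The 0.5 effect size and the normal disturbance are illustrative assumptions.

M_null <- declare_model(N = 100, U = rnorm(N), potential_outcomes(Y ~ 0.0 * Z + U))
M_large <- declare_model(N = 100, U = rnorm(N), potential_outcomes(Y ~ 0.5 * Z + U))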

6.4 Declaring models in code

In this section, we describe how to declare models in practice in the DeclareDesign code language. We start with declarations of units and the hierarchical structures that contain them, then move on to declarations of the characteristics of units. An important feature of models is the set of potential outcomes associated with each unit, so we spend some time describing a few approaches for thinking about them. This section is meant as a reference guide so it covers common settings and a few uncommon ones as well.

6.4.1 Units

The model is first defined by the units under study. If there are 1,000 people who live in a city that you wish to learn about, but you don’t know anything else about them, you can declare:

M <- declare_model(N = 1000)

Units often sit within multiple, sometimes overlapping geographic and social hierarchies. Households live on blocks that make up neighborhoods. Workers have jobs at firms and also often are represented by unions that sometimes represent workers in multiple firms (a nonnested hierarchy). These hierarchies are important to declare in the model as they often form the basis for why units are similar and different. Within declare_model, we define hierarchy using add_level, which adds a new level of hierarchy. This model declaration creates 100 households of varying size, then creates the appropriate number of individuals within households.

M <- declare_model(
  households = add_level(
    N = 100, 
    N_members = sample(c(1, 2, 3, 4), N, 
                       prob = c(0.2, 0.3, 0.25, 0.25), replace = TRUE)
  ),
  individuals = add_level(
    N = N_members, 
    age = sample(18:90, N, replace = TRUE)
  )
)

Panel data have a different structure. For example, in a country-year panel dataset, we observe every country in every year. To create data with this structure, we first declare a country-level dataset, then a years-level dataset, then we join them. The join is accomplished in the cross_levels call, which defines the variables we join by with by = join(countries, years). In cross_levels, we also create the observation-level outcome variable, which is a function in this case of a country shock, a year shock, an observation shock, and a time trend.

M <- declare_model(
  countries = add_level(
    N = 196, 
    country_shock = rnorm(N)
  ),
  years = add_level(
    N = 100, 
    time_trend = 1:N,
    year_shock = runif(N, 1, 10), 
    nest = FALSE
  ),
  observation = cross_levels(
    by = join(countries, years),
    observation_shock = rnorm(N),
    Y = 0.01 * time_trend + country_shock + year_shock + observation_shock 
  )
)

6.4.2 Unit characteristics

We can describe the characteristics of units in two ways: we can use existing data or we can simulate.

Here is an example of the simulation approach. We imagine 100 units with a characteristic X that is uniformly distributed between 0 and 100.

M <- 
  declare_model(
    N = 100, 
    X = runif(N, min = 0, max = 100)
  )

You can use any of the enormous number of data simulation functions available in R for this purpose. Here we gather six functions we tend to use in our own declarations, but they are by no means exhaustive. Each function has arguments that govern exactly how the data are created; we chose arbitrary values here to show how they work. Figure 6.3 shows what these six look like for a 1,000 unit model.

M <-
  declare_model(
    N = 1000,
    X1 = rnorm(N, mean = 5, sd = 2),
    X2 = runif(N, min = 0, max = 5),
    X3 = rbinom(N, size = 1, prob = 0.5),
    X4 = rbinom(N, size = 5, prob = 0.5),
    X5 = rlnorm(N, meanlog = 0, sdlog = 1),
    X6 = sample(c(1, 2, 3, 4, 5), N, replace = TRUE)
  ) 

Figure 6.3: Six kinds of characteristics of units

Binary variables are very important in social science, but they can be particularly tricky to make, so we’ll spend a little more time on them. In a common way of thinking, binary variables are translations from a latent, continuous variable into observed binary outcomes.

We can draw binary outcomes in one of three ways. First, we can simulate a binary outcome using the rbinom function, as we have already seen. The function rbinom(N, size = 1, prob = 0.5) flips one coin for each of the N subjects with a constant latent probability of success across units.

M1 <- 
  declare_model(
    N = 1000, 
    Y = rbinom(N, 1, prob = 0.5)
  )

If you believe the latent probability varies across units, you might want to set up a latent variable before the call to rbinom. A major reason to do it this way is to build in correlation between the binary outcome Y and some other variable, like Y2 in this model declaration.

M2 <- 
  declare_model(
    N = 1000, 
    latent = runif(N, min = 0, max = 1),
    Y = rbinom(N, 1, prob = latent),
    Y2 = latent + rnorm(N)
  )

A third way to create a binary variable skips the call to rbinom altogether and instead assesses whether the latent variable exceeds some threshold, here 0.75. A major reason to do this is to control the sources of randomness. The latent variable is random because of the call to the runif function. If we pass the resulting latent probabilities to rbinom as in M2, then we add a second layer of randomness, the coin flips. If one layer is enough, then M3 might be a more appropriate model declaration.

M3 <- 
  declare_model(
    N = 1000, 
    latent = runif(N, min = 0, max = 1), 
    Y = if_else(latent > 0.75, 1, 0)
  )

6.4.2.1 Building in correlations between simulated variables

Most social science variables are interrelated. The correlations between variables may affect the quality of a research design in many ways. If we control for a variable in a regression to improve power, how much power we gain depends on the correlation between that variable and the outcome.

Here we walk through two main ways to create correlated variables. The first way is simply to make one variable a function of another:

M1 <- 
  declare_model(
    N = 1000,
    X1 = rnorm(N),
    X2 = X1 + rnorm(N)
  )

The second way is to draw explicitly from a multivariate distribution. The mvrnorm function in the MASS package generates draws from the multivariate normal distribution. We have to give it two means (mu) and a variance-covariance matrix (Sigma). The draw_multivariate function is a neat wrapper that makes these functions (that return more than one column of data!) play nicely with declare_model.

M2 <- 
  declare_model(
    draw_multivariate(c(X1, X2) ~ MASS::mvrnorm(
      N, mu = c(0, 0),
      Sigma = matrix(c(1, 0.3, 0.3, 1), nrow = 2))
    )
  )

6.4.2.2 Building in correlations within clusters

A second important form of correlation is correlation within clusters, described by the intraclass correlation coefficient (ICC). When the ICC is low, units within a cluster are about as different from one another as units in different clusters. When it is high, clusters are more homogeneous within themselves and more heterogeneous across themselves. For more on clustered designs for surveys and for experiments, see Section 14.2 and Section 17.3.

We can introduce ICC using the draw_normal_icc function:

M <-
  declare_model(households = add_level(N = 1000),
                individuals = add_level(
                  N = 4,
                  X = draw_normal_icc(
                    mean = 0,
                    clusters = households,
                    ICC = 0.65
                  )
                ))

6.4.2.3 Unknown heterogeneity

Most declarations include unobserved variables, or unknown heterogeneity. These Us represent the lurking variables that confound our inferences and the variation in outcomes not correlated with observed values. In virtually every declaration in the book, we include a U term to represent these unobserved values.

But now that we are at the point of declaration, of actually writing down a design in code, we face problems immediately: how much heterogeneity do we introduce, where, and of what kind?

M1 shows how to make rather benign unknown heterogeneity. U is normally distributed and only affects Y. The observed binary variable X is independent of U and affects Y in its own way.

M1 <- declare_model(
  N = 100,
  U = rnorm(N),
  X = rbinom(N, size = 1, prob = 0.5),
  Y = 0.1 * X + U
)

M2 is more worrisome. U now affects X as well, by affecting the probability of X equaling one. See Section 15.2 for designs that face this sort of problem.

M2 <- declare_model(
  N = 100,
  U = rnorm(N),
  X = rbinom(N, size = 1, prob = pnorm(U)),
  Y = 0.1 * X + U
)

A further question for U is how much it should vary. This question is tougher to answer. U is an important source of variation in outcomes, so it needs to be calibrated in a way that the simulated data look like the data you expect to generate or encounter in the world. So our best advice here is to follow Principle 3.3: entertain many types of unknown heterogeneity and figure out which ones seem reasonable.
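
One way to check a calibration is to compare the variation in simulated outcomes under different scales of U against the variation you expect to see in real data. This sketch uses fabricate from the fabricatr package (the data simulation engine behind declare_model); the standard deviations of 0.5 and 2 are illustrative.

sim_small_U <- fabricate(N = 1000, U = rnorm(N, sd = 0.5), X = rbinom(N, size = 1, prob = 0.5), Y = 0.1 * X + U)
sim_large_U <- fabricate(N = 1000, U = rnorm(N, sd = 2.0), X = rbinom(N, size = 1, prob = 0.5), Y = 0.1 * X + U)

sd(sim_small_U$Y)  # compare each to the outcome variation you expect to encounter
sd(sim_large_U$Y)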

6.4.3 Potential outcomes

We declare “potential” outcomes when describing counterfactual quantities.

The most straightforward way to declare potential outcomes is to make one variable per potential outcome. Here the two potential outcomes come from coin flip distributions with different success probabilities.

M <- 
  declare_model(N = 100, 
                Y_Z_0 = rbinom(N, size = 1, prob = 0.5),
                Y_Z_1 = rbinom(N, size = 1, prob = 0.6)
                )

The potential_outcomes function can do the same thing, but using R’s “formula” syntax, allowing us to write down potential outcomes in a “regression-like” way.

M <- 
  declare_model(N = 100, 
                potential_outcomes(Y ~ rbinom(N, size = 1, prob = 0.1 * Z + 0.5))
  )

By default, potential_outcomes imagines you are making two potential outcomes with respect to a treatment variable Z that can take on two values, 0 and 1.

Table 6.1: One draw of two potential outcomes
ID Y_Z_0 Y_Z_1
001 0 0
002 1 0
003 0 0
004 1 1
005 0 0

But we can extend this to multiple treatment conditions:

M <- 
  declare_model(
    N = 100, 
    potential_outcomes(
      Y ~ rbinom(N, 1, prob = 0.1 * (Z == 1) + 0.2 * (Z == 2)), 
      conditions = list(Z = c(0, 1, 2))
    )
  )
Table 6.2: One draw of three potential outcomes
ID Y_Z_0 Y_Z_1 Y_Z_2
001 0 0 0
002 0 0 1
003 0 0 0
004 0 0 0
005 0 0 0

Or to multiple treatment factors (see Section 17.5 on factorial experiments):

M <- 
  declare_model(
    N = 100, 
    potential_outcomes(
      Y ~ rbinom(N, 1, prob = 0.1 * Z1 + 0.2 * Z2 + 0.1 * Z1 * Z2), 
      conditions = list(Z1 = c(0, 1), Z2 = c(0, 1))
    )
  )
Table 6.3: One draw of four potential outcomes
ID Y_Z1_0_Z2_0 Y_Z1_1_Z2_0 Y_Z1_0_Z2_1 Y_Z1_1_Z2_1
001 0 0 0 1
002 0 0 0 1
003 0 0 0 1
004 0 0 0 1
005 0 0 0 1

6.4.3.1 Effect sizes

We often want to consider a range of plausible effect sizes, for example when estimating the minimum detectable effect of a design. A strategy we commonly use is to sample the treatment effect in the model declaration itself, and then draw the potential outcomes using that single number. When we diagnose many times, we will get many different treatment effects (here, tau), which we can then summarize in the diagnosis. Section 10.8 describes how to create a plot of the power across values of tau.

M <-
  declare_model(
    N = 100, 
    tau = runif(1, min = 0, max = 1), 
    U = rnorm(N), 
    potential_outcomes(Y ~ tau * Z + U)
  )

Where do our expectations about effect sizes, and the minimal plausible effect size, come from? We may conduct a meta-analysis of past studies when there is more than one sufficiently relevant estimate of the same effect, or a systematic review or literature review when past studies are less comparable but we still want to find a range (see Section 18.4 for a discussion of meta-analysis). We may be tempted to conduct a pilot study to estimate the effect size. As we discuss in Section 21.5, we need to be careful about what to infer and how much to update from small pilot studies, but they can often shrink our uncertainty. In the absence of pilot studies or past studies to draw on, we need to make educated guesses and understand under which true effect sizes our design will perform well and under which it will not, following Principle 3.10 (Diagnose to break designs).

6.4.3.2 Effect heterogeneity

Sometimes, the inquiry centers on treatment effect heterogeneity by subgroups. This heterogeneity has to be present in the model in order for the simulation to pick it up. Here we declare effect heterogeneity according to a binary covariate X. This example really shows off the utility of the formula syntax in the potential_outcomes function. We can write in our expectations about heterogeneity as if they were regression coefficients. Here, the “interaction term” is equal to 0.1.

M <- 
  declare_model(
    N = 100, 
    U = rnorm(N), 
    X = rbinom(N, 1, prob = 0.5),
    potential_outcomes(Y ~  0.3 * Z + 0.2*X + 0.1*Z*X + U)
  )

6.4.3.3 Correlation between potential outcomes

Treated and untreated potential outcomes are typically highly correlated. When there is no treatment effect heterogeneity, there is often perfect correlation between the two; the only difference is the shift due to the treatment effect. Rarely are potential outcomes negatively correlated, but it can occur. The sign and magnitude of the correlation especially affect the standard errors of estimates for causal effects (for more discussion of the correlation of potential outcomes in experiments and also standard error estimators, see Section 17.1.1).

We described some complexities of generating binary variables above. They transfer over to the generation of correlated potential outcomes in special ways. In the declarations below, M1 generates uncorrelated potential outcomes, because the draws from rbinom are independent of one another. In M2, we still use rbinom, but with a transformation of the normally-distributed variable into a probability via pnorm. This allows the potential outcomes to be correlated because they are both influenced by the same latent variable. Finally, in M3, we generate highly correlated potential outcomes because we peel off the layer of randomness introduced by rbinom.

M1 <- 
  declare_model(
    N = 100, 
    potential_outcomes(Y ~ rbinom(N, 1, prob = 0.2))
  )

M2 <- 
  declare_model(
    N = 100,
    latent = rnorm(N), 
    potential_outcomes(Y ~ rbinom(N, 1, prob = pnorm(latent + 0.2 * Z)))
  )

M3 <- 
  declare_model(
    N = 100, 
    latent = rnorm(N), 
    potential_outcomes(Y ~ if_else(latent + 0.2 * Z > 0.5, 1, 0))
  )

6.4.4 Summary

If this section left you spinning from the array of choices we have to make in declaring a model, in some ways that was our goal. Inside every power calculator and every piece of bespoke design simulation code is an array of assumptions. Some crucially determine design quality. Others are unimportant. The salve to the dizziness is Principle 3.3: entertain many models. Where you are uncertain between two choices, explore whether both produce the same diagnosands. The goal is for your data and answer strategies to hold up under many models, and to find out whether they do, you often need to build the many options into your model.