10  Diagnosing designs

Research design diagnosis is the process of evaluating the properties of a research design. We invent the term “diagnosand” to refer to those properties of a research design we would like to diagnose. Many diagnosands are familiar. Power is the probability of obtaining a statistically significant result. Bias is the average deviation of estimates from the estimand. Other diagnosands are more exotic, like the Type-S error rate, which is the probability the estimate has the incorrect sign, conditional on being statistically significant (Gelman and Carlin 2014). This chapter focuses mainly on the use of Monte Carlo computer simulations to estimate diagnosands, though we touch on analytic design diagnosis as well.

Research designs are strong when the empirical answer \(a_{d^*}\) generated by the design is guaranteed to be close to the true answer \(a_{m^*}\). In general, we cannot know either \(a_{d^*}\) or \(a_{m^*}\) in advance. This problem means we have to assess the performance of our designs, not with respect to the real world, but instead with respect to our unverified and unverifiable beliefs about the world. We can do this using simulation, assessing the properties of research designs by comparing the simulated answer \(a_d\) to the answers under the model \(a_m\), over many possible realizations of the design.

Figure 10.1 below is similar to Figure 5.1, which we described when defining the components of a research design. To recapitulate the main points of that discussion: The theoretical half of a research design defines an inquiry I in terms of a model M; the true answer to that question is the inquiry applied to the real world \(I(m^*) = a_{m^*}\); the empirical half of a research design applies the data strategy to the real world to generate a dataset (\(D(m^*) = d^*\)), then applies the answer strategy to the realized dataset to generate an answer, (\(A(d^*) = a_{d^*}\)). The asterisks reflect the fact that reality plays an important role in empirical research designs. We commit to the notion that there are real answers to the questions we have posed. Our empirical answers are tethered to reality because the data strategy generates information that depends on the real world.

When we simulate and diagnose designs, however, this tether to reality is snipped, and we find ourselves in the bottom half of Figure 10.1. When we simulate, the theoretical half of the design entertains many possibilities in the model set M, which we label \(m_1\), \(m_2\), …, \(m_k\), indexed by \(j\). Each of the \(k\) possibilities generates a different answer to the inquiry: \(a_{m_1}\), \(a_{m_2}\), …, \(a_{m_k}\). Importantly, the simulated research design does not have access to the true answer \(a_{m^*}\), but has access only to answers under the \(k\) considered models.

We pause to address one confusing aspect of our notation. The set of models \(m_1\), \(m_2\), …, \(m_k\) could refer to a set of \(k\) separate theoretical perspectives. Or it could refer to \(k\) draws from the same basic model in which only the values of the exogenous variables (following the specific meaning described in Section 6.1.1) differ. In other words, our notation doesn’t draw a formal distinction between large differences between benchmark models and small ones.

When we simulate, the empirical half of the design is similarly dissociated from reality. We apply the data strategy \(D\) to each of the model draws \(m_j\) to produce simulated datasets indexed as \(d_j\). These fabricated datasets may be similar to or different from the true dataset \(d^*\) that would result if the design were realized (the more similar the better). We apply the answer strategy A to each simulated dataset \(d_j\) in order to produce simulated answers \(a_{d_j}\).

A comparison of the top row to the bottom three rows of Figure 10.1 shows how actual research designs differ from simulated research designs. Actual research designs are influenced by reality: we learn about the real world by conducting empirical research. We don’t learn about the real world from simulated research designs. Instead, we learn about research designs themselves. We learn how the research design would behave if the real world \(m^*\) were like the model draw \(m_1\)—or like \(m_{100}\). The diagnosis of the design is then a summary of the performance of the design over the multiple models in M. Of course, we only diagnose under the possibilities we consider in \(M\). If the world \(m^*\) is not in that set of possibilities, we won’t have evaluated the design under the most important setting, i.e., in reality. We want to follow Principle 3.2 to “design agnostically,” or in other words, to consider such a wide range of possibilities that some \(a_{m_j}\) will be close enough to \(a_{m^*}\).

Figure 10.1: Simulations in the MIDA framework

With this definition of a diagnosis in hand, we can now sharpen our thinking on when we have a complete design declaration. A design declaration is “diagnosand-complete” when the declaration contains enough information to calculate a specific diagnosand, or a statistic about the design. The “bias” diagnosand requires that all four of M, I, D, and A are specified in sufficient detail because we need to be able to compare the answer under the model (\(a_m\)) to the simulated empirical answer (\(a_d\)). The “statistical power” diagnosand often does not require the inquiry \(I\) to be specified, since we can mechanistically conduct null hypothesis significance tests without reference to inquiries. Neither the “bias” nor “statistical power” diagnosands require any details of confidence interval construction, but without those details, the declaration is not diagnosand-complete for “coverage,” which assesses the quality of the standard error estimator by checking how often the confidence intervals cover the true value.

Every design we have ever declared is diagnosand-complete for some diagnosands, but not for others. Diagnosand-completeness for every common diagnosand is not a goal for design declaration. We want to be diagnosand-complete for the set of diagnosands that matter most in a particular research setting.

10.1 Elements of diagnoses

10.1.1 Diagnosand statistics

Diagnosands are summaries of the distributions of diagnostic statistics. We’ll start by defining diagnostic statistics, then move on to describing how to generate the distribution of diagnostic statistics, then how to summarize those distributions in order to estimate diagnosands.

A diagnostic statistic \(\phi_j\) is itself a function of \(a_{m_j}\) and \(a_{d_j}\): \(\phi_j = g(a_{m_j}, a_{d_j})\). We elide here two important notational annoyances. First, for the purpose of the following discussion of diagnostic statistics and diagnosands, \(a_{d_j}\) can refer either to an estimate, like a difference-in-means estimate, or to an associated uncertainty statistic, like a standard error or a \(p\)-value. Second, diagnostic statistics can be functions that depend on \(a_{m_j}\) only, on \(a_{d_j}\) only, or on both.

Each diagnostic statistic is the result of a different function \(g\). For example, the “error” diagnostic statistic is produced by this function: \(g(a_{m_j}, a_{d_j}) = a_{d_j} - a_{m_j}\). The “significant” diagnostic statistic is \(g(a_{d_j}) = \mathbf{I}[p(a_{d_j}) \leq \alpha]\), where \(p()\) is a function of the empirical answer that returns a \(p\)-value, \(\alpha\) is a significance cutoff, and \(\mathbf{I}\) is an indicator function that returns 1 when the \(p\)-value is below the cutoff and \(0\) when it is above. A diagnostic statistic is something that can be calculated on the basis of a single run of the design.
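
To make these functions concrete, here is a toy sketch in base R; the numbers are invented for illustration and are not drawn from any design declared in this chapter.

a_m   <- 0.20   # answer under the model for this hypothetical run (e.g., the ATE)
a_d   <- 0.28   # empirical answer from the simulated data (e.g., difference-in-means)
p     <- 0.19   # p-value associated with the estimate
alpha <- 0.05   # significance cutoff

error       <- a_d - a_m   # the "error" diagnostic statistic
significant <- p <= alpha  # the "significant" diagnostic statistic (TRUE/FALSE)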

Typically, diagnostic statistics will be different from simulation to simulation. In other words, \(\phi_1\) will be different from \(\phi_2\), \(\phi_2\) will be different from \(\phi_3\), and so on. These differences arise partially from the variation in M: \(m_1\) is different from \(m_2\), \(m_2\) is different from \(m_3\), and so on. Differences can also arise from the explicitly random procedures in \(D\): sampling, assignment, and measurement can all include stochastic elements that will ramify through to the diagnostic statistics. As a result of these sources of variation, a diagnostic statistic is a random variable \(\Phi\).

10.1.2 Diagnosands

A diagnosand is a summary of the distribution of \(\Phi\), written \(f(\Phi)\). For example, the expectation function \(\mathbb{E}[\Phi]\) reports the expectation (the mean) of \(\Phi\).

Let’s now work through two concrete examples of common diagnosands: bias and power (see Section 10.2 for a more exhaustive list).

Consider the diagnosand “bias” in the context of a two-arm randomized trial where the inquiry is the average treatment effect, the data strategy entails complete random assignment, and the answer strategy is the difference-in-means. Bias is the average difference between the estimate and the estimand. Under a single realization \(m_j\) of the model M, the value of the ATE will be a particular number, which we call \(a_{m_j}\). We simulate a random assignment and measurement of observed outcomes, then apply the difference-in-means estimator. The diagnostic statistic is the error \(a_{d_j} - a_{m_j}\); this error is a random variable both because each \(m_j\) differs slightly and because the random assignment differs from run to run. The bias diagnosand is the expectation of this random variable, \(\mathbb{E}[a_{d_j} - a_{m_j}]\), where the expectation is taken over the randomization distribution implied by M and D (or over distinct regions of the randomization distribution).

Now consider the diagnosand “power.” Like bias, statistical power is an expectation, this time of the “significant” diagnostic statistic \(\mathbf{I}(p(a_{d_j}) \leq 0.05)\). Power describes how frequently the answer strategy will return a statistically significant result. Some textbooks define statistical power as one minus the Type II error rate, where a Type II error is the failure to reject the null hypothesis, given that the null hypothesis is false. This definition is accurate, but hard to understand. The phrase “given that the null hypothesis is false” could refer to any number of model possibilities (\(a_{m_j}\)’s) in which the null hypothesis does not hold. Our definition of power is instead “the probability of getting a statistically significant result, conditional on a set of beliefs about the model.”

Uncertainty statistics like standard errors can be thought of as empirical estimates of diagnosands describing the quality of your design under some set of models.

Suppose we report that the estimate is equal to 15 and carries with it a standard error of 6. Here we are communicating that we now think our design has the property that, if we were to run it over and over, the standard deviation of our estimates would be 6. That is, if we were to fix the design at a particular \(m\) and combine it with I, D, and A, the standard deviation of the estimates across simulations of the design would be 6.

Likewise, \(p\)-values can also be thought of as estimates of a peculiar diagnosand: the probability of obtaining a test statistic of a particular value, under a maintained null model. That is, if we were to write down a null model \(m_{0}\) under which the estimand were in fact zero, then combined it with I, D, and A, we could diagnose that design to learn the fraction of estimates that are as large as or larger than 15—a \(p\)-value.

Thinking about \(p\)-values as diagnosands can be seen most directly in the randomization inference approach to hypothesis testing. Randomization inference often considers “sharp” null hypotheses in which we impute the missing potential outcomes with exactly their observed values, then simulate all the ways the design could come out holding I, D, and A constant. The \(p\)-value is the fraction of simulations in which the null model generates estimates that are more extreme than the observed estimate, with variation in estimates arising from different assignments to treatment. See Section 9.3.2.1 for a worked example of this approach.
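
To fix ideas, here is a minimal base R sketch of a randomization inference \(p\)-value under the sharp null hypothesis of no effect for any unit (our own toy example, not the worked example in Section 9.3.2.1).

set.seed(1)
N <- 100
Z <- sample(rep(c(0, 1), each = N / 2))   # one complete random assignment
Y <- 0.2 * Z + rnorm(N)                   # observed outcomes
observed_dim <- mean(Y[Z == 1]) - mean(Y[Z == 0])

# Under the sharp null, outcomes are unaffected by treatment, so we can
# re-run the assignment step many times and recompute the estimate each time
null_dims <- replicate(2000, {
  Z_sim <- sample(Z)
  mean(Y[Z_sim == 1]) - mean(Y[Z_sim == 0])
})

# Two-sided p-value: share of null estimates at least as extreme as observed
mean(abs(null_dims) >= abs(observed_dim))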

The payoff from thinking about uncertainty statistics as estimates of diagnosands is in drawing attention to the fact that uncertainty statistics can be poor estimators of diagnosands and to the fact that their performance can be assessed. For example, social scientists criticize one another’s choice of standard error—classical or robust, clustered or not, bootstrapped, jackknifed, and on and on. The reason for this debate is that when the procedures we follow to form uncertainty statistics are inappropriate for the design setting, we can be falsely confident in our answers.

Figure 10.2: Why robust standard errors are preferred to classical standard errors

Figure 10.2 illustrates what’s at stake in the age-old contest between classical standard errors and robust standard errors. Here we are in the context of a two-arm trial under three settings: the control potential outcomes have higher variance, the two groups have the same variance, or the treated potential outcomes have higher variance. The calculation for classical standard errors pools the data from both groups when estimating a variance, thereby assuming “homoskedasticity,” from the Greek for “same spread.” This estimation choice leads to poor performance. Depending on the fraction of the sample that receives treatment, the estimate of the standard error (the spread of the sampling distribution) can be upwardly biased (conservative) or downwardly biased (anti-conservative or falsely confident). We’re usually worried about standard error estimates being too small because anti-conservatism is probably worse for scientific communication than conservatism.

Robust standard errors are “heteroskedasticity-robust,” which means that the estimation does not assume that the two groups have the same error variance. Samii and Aronow (2012) show that a particular variant of robust standard errors (HC2) is exactly equal to the Neyman variance estimator described in Section 11.3.1, which is why HC2 robust standard errors are the default in the lm_robust function. The bottom row of Figure 10.2 shows that the robust standard error estimator hews closely to the true value of the diagnosand in all of the design settings considered. We prefer robust SEs because they do not impose extra model assumptions like homoskedasticity that are not grounded in any particular design feature.
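
A small simulation along these lines, our own sketch rather than the exact code behind Figure 10.2, illustrates the problem: with unequal group sizes and unequal variances, the average classical standard error misses the true standard error, while the average HC2 robust standard error tracks it closely.

library(estimatr)

sims <- 2000
estimate <- classical_se <- robust_se <- rep(NA, sims)

for (i in 1:sims) {
  # 25 of 100 units treated; treated outcomes have higher variance
  dat <- data.frame(Z = sample(rep(c(0, 1), c(75, 25))))
  dat$Y <- ifelse(dat$Z == 1, rnorm(100, sd = 3), rnorm(100, sd = 1))
  estimate[i]     <- coef(lm(Y ~ Z, data = dat))["Z"]
  classical_se[i] <- summary(lm(Y ~ Z, data = dat))$coefficients["Z", "Std. Error"]
  robust_se[i]    <- lm_robust(Y ~ Z, data = dat, se_type = "HC2")$std.error["Z"]
}

sd(estimate)       # the "true" standard error under this model
mean(classical_se) # classical estimates are too small in this setting
mean(robust_se)    # HC2 estimates are close to the true value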

10.2 Types of diagnosands

As described above, a diagnostic statistic is any summary function of \(a_m\) and \(a_d\), and a diagnosand is any summary of the distribution of diagnostic statistics. As a result, there are a great many diagnosands researchers may consider. In Table 10.1, we introduce a nonexhaustive set of diagnostic statistics, and in Table 10.2 a nonexhaustive set of diagnosands.

Table 10.1: Examples of diagnostic statistics.

| Diagnostic statistic | Definition |
|----------------------|------------|
| Estimate | \(a_{d_j}\) |
| Estimand under the model | \(a_{m_j}\) |
| \(p\)-value | \(p(a_{d_j})\) |
| \(p\)-value is no greater than \(\alpha\) | \(\mathbf{I}(p \leq \alpha)\) |
| Confidence interval | \(\mathrm{CI}_{1-\alpha}\) |
| Confidence interval covers the estimand under the model | \(\mathrm{covers}_{a_m} \equiv \mathbb{I}\{ a_m \in \mathrm{CI}_{1-\alpha} \}\) |
| Estimated standard error | \(\widehat\sigma(A)\) |
| Cost | \(\mathrm{cost}\) |
| Proportion of subjects harmed | \(\Pr(\mathrm{harm}) \equiv \frac{1}{n} \sum_i \mathrm{harm}_i\) |
Table 10.2: Examples of diagnosands.

| Diagnosand | Description | Definition |
|------------|-------------|------------|
| Average estimate | | \(\mathbb{E}(a_d)\) |
| Average estimand | | \(\mathbb{E}(a_m)\) |
| Power | Probability of rejecting null hypothesis of no effect | \(\mathbb{E}(I(p \leq \alpha))\) |
| Null risk | Probability of failing to reject null hypothesis of no effect | \(\mathbb{E}(I(p > \alpha))\) |
| Bias | Expected difference between estimate and estimand | \(\mathbb{E}(a_d - a_m)\) |
| Variance of the estimates | | \(\mathbb{V}(a_d)\) |
| True standard error | | \(\sqrt{\mathbb{V}(a_d)}\) |
| Average estimated standard error | | \(\mathbb{E}(\widehat\sigma(A))\) |
| Root mean-squared-error (RMSE) | | \(\sqrt{\mathbb{E}[(a_d - a_m)^2]}\) |
| Coverage | Probability confidence interval overlaps estimand | \(\Pr(\mathrm{covers}_{a_m})\) |
| Bias-eliminated coverage | Probability confidence interval overlaps average estimate (Morris, White, and Crowther 2021) | \(\Pr(\mathrm{covers}_{a_d})\) |
| Type-S error rate | Probability estimate has an incorrect sign, if statistically significant (Gelman and Carlin 2014) | \(\Pr(\mathrm{sgn}(a_d) \neq \mathrm{sgn}(a_m) \mid p \leq \alpha)\) |
| Exaggeration ratio | Expected ratio of absolute value of estimate to estimand, if statistically significant (Gelman and Carlin 2014) | \(\mathbb{E}(|a_d| / a_m \mid p \leq \alpha)\) |
| Type I error | Rejecting the null hypothesis when it is true | \(\Pr(p \leq \alpha \mid a_m = a^0)\) |
| Type II error | Failure to reject the null hypothesis when it is false | \(\Pr(p > \alpha \mid a_m \neq a^0)\) |
| Sampling bias | Expected difference between sample average treatment effect and population average treatment effect (Imai, King, and Stuart 2008) | \(\mathbb{E}(a_{m_{\mathrm{sample}}} - a_{m_{\mathrm{population}}})\) |
| Expected maximum cost | Maximum cost across possible realizations of the study | \(\max(\mathrm{cost})\) |
| Bayesian learning | Difference between prior and posterior guess of the value of the estimand | \(a_{m_{\mathrm{post}}} - a_{m_{\mathrm{pre}}}\) |
| Value for money | Probability that a decision based on estimated effect yields net benefits | |
| Success | Qualitative assessment of the success of a study | |
| Minimum detectable effect (MDE) | Smallest effect size for which the power of the design reaches a nominal level (e.g., powered at 80%) | \(\min\{a_m : \Pr(p \leq \alpha \mid a_m) \geq 0.8\}\) |
| Robustness | Joint probability of rejecting the null hypothesis across multiple tests | |
| Maximum proportion of subjects harmed | | \(\max(\Pr(\mathrm{harm}))\) |

10.3 Estimation of diagnosands

10.3.1 Analytic design diagnosis

Diagnosis can be done with analytic, pencil-and-paper methods. In an analytic design diagnosis, we typically derive a formula that returns the value of a diagnosand under a stipulated set of beliefs about the model, inquiry, data strategy, and answer strategy. For example, research design textbooks often contain analytic design diagnoses for statistical power. Gerber and Green (2012) write:

“To illustrate a power analysis, consider a completely randomized experiment where \(N/2\) of \(N\) units are selected into a binary treatment. The researcher must now make assumptions about the distributions of outcomes for treatment and for control units. In this example, the researcher assumes that the control group has a normally distributed outcome with mean \(\mu_c\), the treatment group has a normally distributed outcome with mean \(\mu_t\), and both groups’ outcomes have a standard deviation \(\sigma\). The researcher must also choose \(\alpha\), the desired level of statistical significance (typically 0.05). Under this scenario, there exists a simple asymptotic approximation for the power of the experiment (assuming that the significance test is two-tailed): \[ \beta = \Phi \bigg(\frac{|\mu_t - \mu_c| \sqrt{N}}{2\sigma} - \Phi^{-1} \left(1 - \frac{\alpha}{2}\right) \bigg) \] where \(\beta\) is the statistical power of the experiment, \(\Phi(\cdot)\) is the normal cumulative distribution function (CDF), and \(\Phi^{-1}(\cdot)\) is the inverse of the normal CDF.”

This power formula is usually motivated by a set of assumptions about M, I, D, and A. Under M, it usually assumes that both potential outcomes are normally distributed with group specific means and a common variance. Under I, it usually assumes the average treatment effect. Under D, it usually assumes a particular randomization strategy (simple random assignment). Under A, it usually assumes a particular hypothesis testing approach (equal variance \(t\)-test with \(N - 2\) degrees of freedom). Whether this set of assumptions is “close enough” will depend on the research setting.
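
As a quick illustration (our own sketch, not code from Gerber and Green), the formula is easy to compute directly. With an effect of 0.2, a standard deviation of 1, and \(N = 100\), the same values used in the declarations below, it gives power of roughly 0.17, in line with the simulated power reported later in this chapter.

# Asymptotic power approximation quoted above (two-tailed test)
analytic_power <- function(mu_t, mu_c, sigma, N, alpha = 0.05) {
  pnorm(abs(mu_t - mu_c) * sqrt(N) / (2 * sigma) - qnorm(1 - alpha / 2))
}

# Example: effect of 0.2, standard deviation of 1, 100 subjects
analytic_power(mu_t = 0.2, mu_c = 0, sigma = 1, N = 100)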

Analytic design diagnosis can be hugely useful, since the results cover large families of designs that meet the scope criteria. For example, consider the “standard error” diagnosand \(\sqrt{\mathbb{E}[(a_{d_j} - \mathbb{E}[a_{d_j}])^2]}\) for the Sample Average Treatment Effect estimate in a two-arm trial with complete random assignment and estimation via difference-in-means. Statisticians have worked out this quantity to be \(\sqrt{\frac{1}{N-1}\left\{\frac{m\mathbb{V}(Y_i(0))}{N-m} + \frac{(N-m)\mathbb{V}(Y_i(1))}{m} + 2\mathrm{Cov}(Y_i(0), Y_i(1))\right\}}\), where \(N\) is the number of units and \(m\) is the number assigned to treatment. The Neyman standard error estimator targets this quantity (see Equation 18.2). Many advances in our understanding of the properties of research designs come from analytic design diagnosis. For example, Middleton (2008) and Imai, King, and Nall (2009) show that difference-in-means estimates from cluster randomized trials with heterogeneous cluster sizes are biased for the ATE, which leads to the design recommendation that clustered trials should block on cluster size. These lessons apply broadly, more broadly perhaps than the lessons learned about a specific design in a specific Monte Carlo simulation.
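
For instance, given full knowledge of both potential outcomes under a model draw, this true standard error can be computed directly. The sketch below is ours; conventions for the variance denominator differ slightly across textbook treatments, so treat it as illustrative.

# True standard error of the difference-in-means under complete random
# assignment, computed from the full schedule of potential outcomes
true_se_dim <- function(Y0, Y1, m) {
  N <- length(Y0)
  pvar <- function(x) mean((x - mean(x))^2)                    # denominator-N variance
  pcov <- function(x, y) mean((x - mean(x)) * (y - mean(y)))   # denominator-N covariance
  sqrt((1 / (N - 1)) * (m * pvar(Y0) / (N - m) +
                          (N - m) * pvar(Y1) / m +
                          2 * pcov(Y0, Y1)))
}

# Example: a constant effect of 0.2 added to a noisy control outcome
U <- rnorm(100)
true_se_dim(Y0 = U, Y1 = U + 0.2, m = 50)
# roughly 0.20, consistent with the simulated true standard error in Diagnosis 10.3 below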

That said, scholars conducting analytic design diagnosis have only worked out a few diagnosands for a limited set of designs. Since designs are so heterogeneous and can vary on so many dimensions, computer simulation is often the only feasible way to diagnose. We learn a lot from analytic design diagnosis—what the important parameters to consider are, what the important inferential problems are—but they often cannot provide direct answers to practical questions like, how many subjects do I need for my conjoint experiment? For that reason, we turn to design diagnosis via Monte Carlo simulation.

10.3.2 Design diagnosis by simulation

Research design diagnosis by simulation occurs in two steps. First we simulate research designs repeatedly, collecting diagnostic statistics from each run of the simulation. Second, we summarize the distribution of the diagnostic statistics in order to estimate the diagnosands.

Monte Carlo simulations of research designs can be written in any programming language. To illustrate the most common way of simulating a design—a for loop—we’ll write a concise simulation in base R code. Diagnosis 10.1 conducts the simulation 500 times, each time collecting the \(p\)-value associated with a regression estimate of the average effect of a treatment on an outcome.

Loops like this one can be implemented in any language, and they remain a good tool for design diagnosis. We think, however, that writing simulations in this way obscures which parts of the design refer to the model, the inquiry, the data strategy, or the answer strategy. We might imagine that Z and Y are generated by a data strategy that randomly assigns Z, then measures Y (though without a language for linking code to design steps, it’s a bit unclear). The answer strategy involves running a regression of Y on Z, then conducting a hypothesis test against the null hypothesis that the coefficient on Z is 0. The model and inquiry are left entirely implicit. The inquiry might be the ATE, but it might also be a question of whether the ATE is equal to zero. The model might include only two potential outcomes for each subject, or it might have more; we don’t know.

Diagnosis 10.1 By-hand diagnosis example.

sims <- 500
p.values <- rep(NA, sims)

for(i in 1:sims){
  Z <- rbinom(100, 1, 0.5)
  U <- rnorm(100)
  Y <- 0.2 * Z + U
  p.values[i] <- summary(lm(Y ~ Z))$coefficients[2, 4]
}

power <- mean(p.values <= 0.05)
power
Table 10.3: By-hand design diagnosis with a for-loop
power
0.16

For this reason, we advocate for explicit description of all four research design components. Explicit design declaration can also occur in any programming language, but DeclareDesign is purpose-built for this task. We begin with the explicit declaration of the same two-arm trial as above. We have 100 subjects with a constant response to treatment of 0.2 units. Our inquiry is the average difference between the treated and untreated potential outcomes—the ATE. We assign treatment using complete random assignment and estimate treatment effects using the difference-in-means.

Declaration 10.1 Example of declaration using DeclareDesign

declaration_10.1 <-
  declare_model(
    N = 100,
    U = rnorm(N),
    potential_outcomes(Y ~  0.2 * Z + U)
  ) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE")

Diagnosis 10.2 Diagnosis in DeclareDesign

We can now diagnose the declared design. For comparison with the loop, we calculate just one diagnosand: statistical power. Both approaches return approximately the same answer. Any difference between the two can be attributed to simulation error, which is why DeclareDesign by default returns a bootstrapped standard error for each diagnosand estimate. If we were to increase the number of simulations for both approaches dramatically (100,000 simulations would be plenty), the differences would vanish.

diagnosands <- declare_diagnosands(power = mean(p.value <= 0.05))

diagnosis <- 
  diagnose_design(declaration_10.1, 
                  diagnosands = diagnosands)

diagnosis
Table 10.4: Design diagnosis using DeclareDesign
power se(power) n_sims
0.1615 0.0083482 2000

Backing up, we now construct that same diagnosis step-by-step. We build up from a single simulation run of the design to the distribution of simulations, to summaries of that distribution.

set.seed(343)

simulation_10.1 <- 
  simulate_design(declaration_10.1, sims = 10) |>
  mutate(significant = p.value <= 0.05)

We can run this simulation a single time with run_design:

run_design(declaration_10.1)
Table 10.5: One simulation draw
estimand estimate std.error df p.value conf.low conf.high
0.2 0.278 0.209 98 0.186 -0.136 0.692

Figure 10.3 shows the information we might obtain from a single run of the simulation. The filled point is the estimate \(a_{d_j}\). The open triangle is the estimand \(a_{m_j}\). The bell-shaped curve is our normal-approximation-based estimate of the sampling distribution. The standard deviation of this estimated distribution is our estimated standard error, which expresses our uncertainty. The confidence interval around the estimate is another expression of our uncertainty: we’re not sure where \(a_{m_j}\) is, but if things are going according to plan, confidence intervals constructed this way will bracket \(a_{m_j}\) 95% of the time.

Figure 10.3: Visualization of one draw from a design diagnosis.

From this single draw, we can’t yet estimate diagnosands, but we can calculate diagnostic statistics. The estimate was higher than the estimand in this draw, so the error is 0.278 - 0.20 = 0.078. Likewise, the squared error is (0.278 - 0.20)\(^2\) = 0.006. The \(p\)-value is 0.186, which corresponds to the sum of the two shaded areas in the right panel. It is above the threshold of 0.05, so the “statistical significance” diagnostic statistic is equal to FALSE. The confidence interval stretches from -0.136 to 0.692, and since the value of the estimand (0.20) is between those bounds, the “covers” diagnostic statistic is equal to TRUE.

To calculate the distributions of the diagnostic statistics, we have to simulate designs not just once, but many times over. The bias diagnosand is the average error over many runs of the simulation. The statistical power diagnosand is the fraction of runs in which the estimate is significant. The coverage diagnosand is the frequency with which the confidence interval covers the estimand.

Figure 10.4 visualizes 10 runs of the simulation. We can see that some of the draws produce statistically significant estimates (the shaded areas are small and the confidence intervals don’t overlap with zero), but not all. We get a sense of the true standard error (the standard deviation of the estimates over multiple runs of the design) by seeing how the point estimates bounce around. We get a feel for the difference between the estimates of the standard error and the true standard error. Design diagnosis is the process of learning about many ways the study might come out, not just the one way that it will.

Figure 10.4: Visualization of ten draws from a design diagnosis.
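
We can also compute rough summaries by hand from the ten runs in simulation_10.1; this is just a sketch, since with only ten simulations the estimates are extremely noisy. The column names follow the simulate_design output shown above.

simulation_10.1 |>
  summarize(
    mean_error  = mean(estimate - estimand),  # small-sample estimate of bias
    true_se     = sd(estimate),               # sd of estimates across runs
    mean_est_se = mean(std.error),            # average estimated standard error
    power       = mean(significant)           # share of significant runs
  )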

In Diagnosis 10.3, one line of code does all of these steps at once. We simulate the design (500 times by default) to calculate the diagnostic statistics, then we summarize them in terms of bias, the true standard error, power, and coverage.

Diagnosis 10.3 Design diagnosis in one step

diagnosis_10.3 <-
  diagnose_design(
    declaration_10.1,
    diagnosands = declare_diagnosands(
      bias = mean(estimate - estimand),
      true_se = sd(estimate),
      power = mean(p.value <= 0.05),
      coverage = mean(estimand <= conf.high &
                        estimand >= conf.low)
    )
  )
Table 10.6: Diagnosand estimates with bootstrapped standard errors in parentheses
bias true se power coverage
-0.00 (0.00) 0.20 (0.00) 0.17 (0.01) 0.96 (0.00)

10.3.3 Simulation error

We use Monte Carlo simulation to estimate diagnosands. We only get estimates—not the exact value—of diagnosands under a particular model because of simulation error. Simulation error declines as we conduct more simulations. When we conduct many thousands of simulations, we’re relatively certain of the value of the diagnosand under the model. If we conduct just tens of simulations, we are much more uncertain.

We can characterize our uncertainty regarding the diagnosand estimate by calculating a standard error. If the Monte Carlo standard error is large relative to the estimated diagnosand, then we need to increase the number of simulations. Unlike in empirical settings, where additional data collection can be very expensive, getting more observations of diagnostic statistics is easy: we can just increase the number of simulations. Computation time isn’t free—but it’s cheap.

The standard error of diagnosand estimates can be calculated using standard formulas. If the diagnosand can be written as a population mean, and the simulations are fully independent, then we can estimate the standard error as \(\frac{\mathrm{sd}(\phi)}{\sqrt{k}}\), where \(k\) is the number of simulations. The power diagnosand summarizes a binary diagnostic statistic (is the estimate significant or not). Binary variables that take a value of 1 with probability \(p\) have variance \(p(1-p)\). So with 1,000 independent simulations, the standard error for the mean of a binary diagnostic statistic is \(\frac{\sqrt{p(1-p)}}{\sqrt{1000}}\). This level of simulation uncertainty is often acceptable (in the worst case, when \(p = 0.5\), the 95% confidence interval is approximately 4 * 0.016 = 6.4 percentage points wide), but if it isn’t, you can always increase the number of simulations.
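
As a quick sketch of the arithmetic:

# Monte Carlo standard error for a simulated power estimate
k     <- 1000   # number of simulations
p_hat <- 0.17   # estimated power: share of significant simulations

mc_se <- sqrt(p_hat * (1 - p_hat) / k)
mc_se                                          # about 0.012
c(p_hat - 1.96 * mc_se, p_hat + 1.96 * mc_se)  # approximate 95% interval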

Since some diagnosands are more complex than a population mean (i.e., we can’t characterize the estimation error with simple formulas), the diagnose_design() function uses the nonparametric bootstrap.

10.4 How to diagnose designs

10.4.1 Exploring design tradeoffs

Choosing among designs means choosing which diagnosands are important—and not all diagnosands are equally important for every study. For example, in a descriptive study whose goal is to estimate the fraction of people in France who are left-handed, statistical power is likely irrelevant. A hypothesis test against the null hypothesis that zero percent of the people in France are left-handed is preposterous (though we could of course have other policy-relevant hypotheses). A much more important diagnosand for this study would be RMSE (root mean-squared error), which measures how far the estimates fall from the estimand on average and so incorporates both bias and variance.

Often, we need to look at several diagnosands in order to understand what might be going wrong. If our design exhibits “undercoverage” (e.g., a procedure for constructing “95%” confidence intervals only covers the estimand 50% of the time), that might be because the standard errors are too small or because the point estimator is biased, or some combination of the two. In really perverse instances, we might have a biased point estimator which, thanks to overly wide confidence intervals, just happens to cover the estimand 95% of the time.

Many research design decisions involve trading off bias and variance. In trade-off settings, we may need to accept higher variance in order to decrease bias. Likewise, we may need to accept a bit of bias in order to achieve lower variance. The trade-off is captured by mean-squared error, which is the average squared distance between \(a_d\) and \(a_m\). Of course, we would ideally like to have as low a mean-squared error as possible, that is, we would like to achieve low variance and low bias simultaneously.
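
When the answer under the model is held fixed, a standard identity makes this trade-off explicit: mean-squared error is the variance of the estimates plus the square of the bias,

\[
\mathbb{E}\big[(a_d - a_m)^2\big] = \mathbb{V}(a_d) + \big(\mathbb{E}[a_d] - a_m\big)^2,
\]

so a design choice that introduces a little bias can still lower mean-squared error if it reduces variance by enough.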

To illustrate, consider the following three designs as represented by three targets. The inquiry is the bullseye of the target. The data and answer strategies combine to generate a process by which arrows are shot toward the target. On the left, we have a very bad archer: even though the estimates are unbiased in the sense that they hit the bullseye “on average”, very few of the arrows are on target. In the middle, we have an excellent shot: they are both on target and low variance. On the right, we have an archer who is very consistent (low variance) but biased. The mean squared error is highest on the left and lowest in the middle.

Figure 10.5: Visualization of the bias and variance of three ‘estimators’ of the bullseye

The archery metaphor is common in research design textbooks because it effectively conveys the difference between variance and bias, but it does elide an important point. It really matters which target the archer is shooting at. Figure 10.6 shows a bizarre double-target representing two inquiries. The empirical strategy is unbiased and precise for the left inquiry, but it is clearly biased for the right inquiry. When we are describing the properties of an answer strategy, we have to be clear about which inquiry it is associated with.

Figure 10.6: The bias, variance, and RMSE of an answer strategy depend on the inquiry.

MSE is an exactly equal weighting of variance and bias (squared). Yet many other weightings of these two diagnosands are possible, and different researchers will vary in their weightings.

In evaluating a research design through diagnosis, we must construct a weighting of all relevant diagnosands. We can think of this as our research design utility function. Our utility function describes how important it is to study big questions, to shift beliefs in a research field, to overturn established findings, to obtain unbiased answers, and to get the sign of the inquiry right. When we compare across empirical designs, we compare them on this utility function that aggregates all the diagnosands according to their importance.

For example, we too often consider the diagnosand power on its own. This diagnosand is the probability of getting a statistically significant result, which of course depends on many things about your design, including, crucially, the unknown magnitude of the parameter to be estimated. But considering power alone is also misleading: no researcher wants to design a study that is 80% powered but returns highly biased estimates. Another way of saying this is that researchers always care about both power and bias. How much they care about each feature determines the weight of power and bias in their utility function.

Diagnosands need not be about hypothesis testing or even statistical analysis of the data at all. We often trade off how much we learn from a research design with its cost in terms of money and our time. We have financial and time budgets that provide hard constraints to our designs. Many researchers wish to select cheaper (or shorter) designs in order to carry out more studies or finish their degree sooner. Time and cost are also diagnostic statistics! We may wish to explore the maximum cost of a study or the average amount of time it would take.

Ethical considerations also often enter the process of assessing research designs, if implicitly. We can explicitly incorporate them into our utility function by caring about harm to subjects and the degree of informed consent. When collecting data, researchers often encounter a trade-off between informing subjects about the purpose of the study (an ethical consideration, or a requirement of the IRB) on the one hand and the bias that comes from Hawthorne or demand effects on the other. We can incorporate these considerations in a research design diagnosis by specifying diagnostic statistics related to the amount of disclosure about the purposes of research or the number of subjects harmed in the research. See Section 21.1.1 for an example.

10.4.2 Diagnosis under model uncertainty

We are always uncertain about the model in M. If we were certain of M (or there was no real dispute about it), there would be no need to conduct new empirical research. Research design diagnosis can incorporate this uncertainty by evaluating the performance of the design under alternative models. For example, if we are unsure of the exact value of the intra-cluster correlation (ICC), we can simulate the design under a range of plausible ICC values. If we are unsure of the true average treatment effect, we can diagnose the power of the study over a range of plausible effect sizes. Uncertainty over model inputs like the means, variances, and covariances in data that will eventually be collected is a major reason to simulate under a range of plausible values.

We illustrate diagnosis under model uncertainty with the declaration below. Here we have a 200-unit two-arm trial in which we explicitly describe our uncertainty over the value of the true average treatment effect. In the potential_outcomes call, we have Y ~ runif(1, 0, 0.5) * Z + U, which indicates that the treatment effect in each run of the simulation is a single draw from a uniform distribution between 0.0 and 0.5.

Declaration 10.2 Declaration including model uncertainty

declaration_10.2 <-
  declare_model(
    N = 200, U = rnorm(N),
    # this runif(n = 1, min = 0, max = 0.5) 
    # generates 1 random ATE between 0 and 0.5
    potential_outcomes(Y ~ runif(n = 1, min = 0, max = 0.5) * Z + U)) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N, prob = 0.5)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE")

Diagnosis 10.4 Diagnosis under model uncertainty.

Figure 10.7 shows how the statistical power of this design varies over the set of model possibilities. We plot the possible effect sizes we entertain in the model on the horizontal axis and the statistical power on the vertical axis. We also plot a loess curve that flexibly smooths over the full distribution of effect sizes. We see that at low levels of the effect size, statistical power is quite poor. With only 200 subjects, we couldn’t achieve 80% statistical power unless the true effect size were approximately 0.45 standard deviations. The effect size at which a design achieves 80% power is often referred to as the minimum detectable effect size (MDE). This exercise shows how the MDE diagnosand is a summary of a design that admits uncertainty over the effect size. We return to this example when we redesign this study to learn how the MDE changes at different sample sizes.

What is “the” power of this design? Under one view, the true power of the design is whichever value of power is associated with the true effect size. But under an agnostic view, the ex ante “power” of the design is a weighted average of all these power values, weighted by the researcher’s prior beliefs over the distribution of possible effect sizes.

Figure 10.7: Diagnosing an experiment over uncertainty about the true effect size
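
A figure along the lines of Figure 10.7 can be produced from the raw simulations. The sketch below is ours and assumes DeclareDesign, dplyr, and ggplot2 are loaded, as elsewhere in this chapter.

simulations_10.2 <-
  simulate_design(declaration_10.2, sims = 2000) |>
  mutate(significant = as.numeric(p.value <= 0.05))

# loess-smooth the 0/1 significance indicator over the realized effect sizes
ggplot(simulations_10.2, aes(x = estimand, y = significant)) +
  geom_smooth(method = "loess") +
  labs(x = "True effect size (estimand)", y = "Statistical power")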

10.4.3 Adjudicating between competing models

The principle to design agnostically extends to competing models. Imagine that you believe \(M1\) is true but that your scholarly rival believes \(M2\) is true. Suppose that under \(M1\), the treatment affects \(Y1\) but not \(Y2\). Your rival posits the reverse: under \(M2\), the treatment affects \(Y2\) but not \(Y1\). In the spirit of scientific progress, the two of you engage in an “adversarial collaboration.” You design a study together. The design you choose should, first, demonstrate that \(M1\) is true if it is true and, second, demonstrate that \(M2\) is true if it is true. In order to come to an agreement about the properties of the design, you will need to simulate the design under both models.

Declaration 10.3 Declaration of competing models

M1 <-
  declare_model(
    N = 200,
    U = rnorm(N),
    potential_outcomes(Y1 ~ 0.2 * Z + U),
    potential_outcomes(Y2 ~ 0.0 * Z + U)
  )

M2 <-
  declare_model(
    N = 200,
    U = rnorm(N),
    potential_outcomes(Y1 ~ 0.0 * Z + U),
    potential_outcomes(Y2 ~ 0.2 * Z + U)
  )

IDA <- 
  declare_inquiry(ATE1 = mean(Y1_Z_1 - Y1_Z_0),
                  ATE2 = mean(Y2_Z_1 - Y2_Z_0)) +
  declare_assignment(Z = complete_ra(N)) +
  declare_measurement(Y1 = reveal_outcomes(Y1 ~ Z),
                      Y2 = reveal_outcomes(Y2 ~ Z)) +
  declare_estimator(Y1 ~ Z, inquiry = "ATE1", label = "DIM1") +
  declare_estimator(Y2 ~ Z, inquiry = "ATE2", label = "DIM2")

declaration_10.3a <- M1 + IDA
declaration_10.3b <- M2 + IDA

Diagnosis 10.5 Diagnosis of competing models

To diagnose these alternative models, we count how frequently each perspective receives support. If we define support by statistical significance (other metrics are of course possible), then your model is supported if the effect of treatment on \(Y1\) is significant but the effect on \(Y2\) is not significant. If the reverse pattern is obtained, your rival can claim victory. Two kinds of split decisions are possible: neither estimate is significant, or both are. By simulation, we can estimate the rates of these four possibilities, under both your beliefs and those of your theoretical adversary.

diagnosis_10.5 <-
  diagnose_design(
    list(declaration_10.3a, declaration_10.3b)
  )
Table 10.7: Design diagnosis under two alternative theories
design Only theory 1 supported Only theory 2 supported Both supported Neither supported
design_1 0.2610 0.026 0.0260 0.687
design_2 0.0255 0.260 0.0255 0.689

The diagnosis shows that the study is responsive to the truth. When theory 1 is correct, the design is more likely to yield empirical evidence in favor of it; the reverse holds when theory 2 is correct. That said, the major concern facing this adversarial collaboration is that the study is too small to resolve the dispute. About two thirds of the time—regardless of who is right!—neither theory receives support. This problem can be ameliorated either by elaborating more tests of each theoretical perspective or by increasing the size of the study.
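
Because the default diagnosands are computed separately for each estimator, joint quantities like the support rates in Table 10.7 can also be computed by hand from the raw simulations. The sketch below is ours; column names such as design, sim_ID, and estimator follow current DeclareDesign conventions and may differ across package versions.

simulations_10.3 <-
  simulate_design(declaration_10.3a, declaration_10.3b, sims = 500) |>
  mutate(significant = p.value <= 0.05)

simulations_10.3 |>
  group_by(design, sim_ID) |>
  summarize(sig_1 = significant[estimator == "DIM1"],
            sig_2 = significant[estimator == "DIM2"],
            .groups = "drop") |>
  group_by(design) |>
  summarize(only_theory_1 = mean(sig_1 & !sig_2),
            only_theory_2 = mean(!sig_1 & sig_2),
            both          = mean(sig_1 & sig_2),
            neither       = mean(!sig_1 & !sig_2))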

10.5 Summary

Diagnosis is the process of estimating the properties of research designs under uncertain beliefs about the world. We simulate designs under many alternative models because we want to choose designs that are strong under a large range of model possibilities. Each run of a design generates a diagnostic statistic. We learn about the distribution of the diagnostic statistic by running the design repeatedly. Diagnosands are summaries of the distribution of diagnostic statistics. Which diagnosands are most important will vary from study to study and depends on the relative weight you place on one diagnosand versus another. Analytic design diagnosis is possible and can be quite powerful—we nevertheless recommend full simulation of research designs in order to learn about a range of diagnosands.