10 Diagnosing designs
Research design diagnosis is the process of evaluating the properties of a research design. We invent the term “diagnosand” to refer to those properties of a research design we would like to diagnose. Many diagnosands are familiar. Power is the probability of obtaining a statistically significant result. Bias is the average deviation of estimates from the estimand. Other diagnosands are more exotic, like the Type-S error rate, which is the probability the estimate has the incorrect sign, conditional on being statistically significant (Gelman and Carlin 2014). This chapter focuses mainly on the use of Monte Carlo computer simulations to estimate diagnosands, though we touch on analytic design diagnosis as well.
Research designs are strong when the empirical answer they produce is likely to be close to the true answer to the inquiry; diagnosis is how we assess whether that is likely to be so before the study is run.
Figure 10.1 below is similar to Figure 5.1, which we described when defining the components of a research design. To recapitulate the main points of that discussion: The theoretical half of a research design defines an inquiry I in terms of a model M; the true answer to that question is the inquiry applied to the real world.
When we simulate and diagnose designs, however, this tether to reality is snipped, and we find ourselves in the bottom half of Figure 10.1. When we simulate, the theoretical half of the design entertains many possibilities in the model set M, each of which we label with a lowercase m to distinguish it from the real world.
We pause to address one confusing aspect of our notation. The set of models M contains many possible models m, even though at most one of them can describe the real world; when we diagnose, we treat each of these models in turn as if it were true.
When we simulate, the empirical half of the design is similarly dissociated from reality. We apply the data strategy D not to the real world but to simulated worlds drawn from the model, producing simulated data to which we then apply the answer strategy A.
A comparison of the top row to the bottom three rows of Figure 10.1 shows how actual research designs differ from simulated research designs. Actual research designs are influenced by reality: we learn about the real world by conducting empirical research. We don’t learn about the real world from simulated research designs. Instead, we learn about research designs themselves. We learn how the research design would behave if the real world worked in the ways described by the models in M.
With this definition of a diagnosis in hand, we can now sharpen our thinking on when we have a complete design declaration. A design declaration is “diagnosand-complete” when the declaration contains enough information to calculate a specific diagnosand, or a statistic about the design. The “bias” diagnosand requires that all four of M, I, D, and A are specified in sufficient detail, because we need to be able to compare the answer under the model (the estimand) to the answer produced from the simulated data (the estimate).
Every design we have ever declared is diagnosand-complete for some diagnosands, but not for others. Diagnosand-completeness for every common diagnosand is not a goal for design declaration. We want to be diagnosand-complete for the set of diagnosands that matter most in a particular research setting.
10.1 Elements of diagnoses
10.1.1 Diagnosand statistics
Diagnosands are summaries of the distributions of diagnostic statistics. We’ll start by defining diagnostic statistics, then move on to describing how to generate the distribution of diagnostic statistics, then how to summarize those distributions in order to estimate diagnosands.
A diagnostic statistic is a summary of a single run of a design: for example, the estimate, the estimand under the model, the p-value, or an indicator for whether the confidence interval covers the estimand.
Each diagnostic statistic is the result of a different function applied to the output of a single simulated run of the design.
Typically, diagnostic statistics will be different from simulation to simulation. In other words, diagnostic statistics have a distribution across runs of the design, induced by the stochastic elements of the model and the data strategy.
10.1.2 Diagnosands
A diagnosand is a summary of the distribution of diagnostic statistics, such as a mean, a standard deviation, or a probability.
Let’s now work through two concrete examples of common diagnosands: bias and power (see Section 10.2 for a more exhaustive list).
Consider the diagnosand “bias” in the context of a two-arm randomized trial where the inquiry is the average treatment effect, the data strategy entails complete random assignment, and the answer strategy is the difference-in-means. Bias is the average difference between the estimand and the estimate. Under a single realization of the design, the difference between the estimate and the estimand is the error; bias is the average of that error across repeated runs of the design.
Now consider the diagnosand “power.” Like bias, statistical power is an expectation, this time of the “significant” diagnostic statistic, which equals one when the p-value falls below the chosen significance level (conventionally 0.05) and zero otherwise. Power is the average of this statistic over repeated runs of the design: the probability of obtaining a statistically significant result.
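Written compactly, using notation chosen here for illustration ($\hat{a}$ for the estimate, $a_m$ for the estimand under the model, and $p$ for the p-value), with expectations taken over repeated runs of the design:

$$
\text{Bias} = \mathbb{E}\!\left[\hat{a} - a_m\right], \qquad
\text{Power} = \mathbb{E}\!\left[\mathbb{1}\{p \leq 0.05\}\right] = \Pr\left(p \leq 0.05\right).
$$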
Uncertainty statistics like standard errors can be thought of as empirical estimates of diagnosands describing the quality of your design under some set of models.
Suppose we report that the estimate is equal to 15 and carries with it a standard error of 6. Here we are communicating that we now think our design has the property that, if we were to run it over and over, the standard deviation of our estimates would be 6. That is, if we were to fix the design at a particular model and repeatedly implement the data and answer strategies, the estimates we obtain would have a standard deviation of about 6.
Likewise, estimated confidence intervals can be thought of as estimates of a coverage diagnosand: a “95%” confidence interval procedure claims that, over repeated runs of the design, 95% of the intervals it produces would cover the estimand.
Thinking about uncertainty statistics in this way recasts statements about our uncertainty as statements about the properties of our design.
The payoff from thinking about uncertainty statistics as estimates of diagnosands is in drawing attention to the fact that uncertainty statistics can be poor estimators of diagnosands and to the fact that their performance can be assessed. For example, social scientists criticize one another’s choice of standard error—classical or robust, clustered or not, bootstrapped, jackknifed, and on and on. The reason for this debate is that when the procedures we follow to form uncertainty statistics are inappropriate for the design setting, we can be falsely confident in our answers.
Figure 10.2 illustrates what’s at stake in the age-old contest between classical standard errors and robust standard errors. Here we are in the context of a two-arm trial under three settings: the control potential outcomes are higher variance, the groups have the same variances, or the treated outcomes have higher variance. The calculation for classical standard errors pools the data from both groups when estimating a variance, thereby assuming “homoskedasticity,” a Greek word for having the “same spread.” This estimation choice leads to poor performance when the groups in fact have different variances. Depending on the fraction of the sample that receives treatment, the classical standard error estimate can be upwardly biased (conservative) or downwardly biased (anti-conservative, or falsely confident). We’re usually worried about standard error estimates being too small because anti-conservatism is probably worse for scientific communication than conservatism.
Robust standard errors are “heteroskedasticity-robust,” which means that the estimation does not assume that the two groups have the same error variance. Samii and Aronow (2012) show that a particular variant of robust standard errors (HC2) is exactly equal to the Neyman variance estimator described in Section 11.3.1, which is why HC2 robust standard errors are the default in the lm_robust
function. The bottom row of Figure 10.2 shows that the robust standard error estimator hews closely to the true value of the diagnosand in all of the design settings considered. We prefer robust SEs because they do not impose extra model assumptions like homoskedasticity that are not grounded in any particular design feature.
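As a small illustration of the contrast (a sketch in the spirit of Figure 10.2, not its exact code), we can compare classical and HC2 robust standard errors in a simulated two-arm trial where the treated outcomes are noisier than the control outcomes:

library(estimatr)

set.seed(60637)
N <- 100
Z <- sample(rep(c(0, 1), each = N / 2))               # complete random assignment
Y <- 0.2 * Z + rnorm(N, sd = ifelse(Z == 1, 2, 0.5))  # treated group has higher variance

summary(lm(Y ~ Z))   # classical standard errors assume homoskedasticity
lm_robust(Y ~ Z)     # HC2 robust standard errors (the estimatr default)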
10.2 Types of diagnosands
As described above, a diagnostic statistic is any summary function of a single run of the design. The tables below list common diagnostic statistics and common diagnosands that summarize their distributions.
Diagnostic statistic | Definition
---|---
Estimate | The answer produced by applying the answer strategy to the realized data
Estimand under the model | The answer produced by applying the inquiry to the model draw
Confidence interval | The interval estimate returned by the answer strategy
Confidence interval covers the estimand under the model | An indicator for whether the estimand lies within the confidence interval
Estimated standard error | The answer strategy’s estimate of the standard deviation of the sampling distribution of the estimates
Cost | The cost of carrying out the realized design
Proportion of subjects harmed | The fraction of subjects harmed in the realized design
Diagnosand | Description
---|---
Average estimate | Expected value of the estimate over runs of the design
Average estimand | Expected value of the estimand under the model
Power | Probability of rejecting the null hypothesis of no effect
Null risk | Probability of failing to reject the null hypothesis of no effect
Bias | Expected difference between estimate and estimand
Variance of the estimates | Variance of the sampling distribution of the estimates
True standard error | Standard deviation of the sampling distribution of the estimates
Average estimated standard error | Expected value of the estimated standard error
Root mean-squared-error (RMSE) | Square root of the expected squared difference between estimate and estimand
Coverage | Probability the confidence interval overlaps the estimand
Bias-eliminated coverage | Probability the confidence interval overlaps the average estimate (Morris, White, and Crowther 2021)
Type-S error rate | Probability the estimate has an incorrect sign, if statistically significant (Gelman and Carlin 2014)
Exaggeration ratio | Expected ratio of the absolute value of the estimate to the estimand, if statistically significant (Gelman and Carlin 2014)
Type I error | Rejecting the null hypothesis when it is true
Type II error | Failing to reject the null hypothesis when it is false
Sampling bias | Expected difference between the population average treatment effect and the sample average treatment effect (Imai, King, and Stuart 2008)
Expected maximum cost | Maximum cost across possible realizations of the study
Bayesian learning | Difference between the prior and posterior guesses of the value of the estimand
Value for money | Probability that a decision based on the estimated effect yields net benefits
Success | Qualitative assessment of the success of a study
Minimum detectable effect (MDE) | Smallest effect size for which the design achieves a given power target (e.g., 80%)
Robustness | Joint probability of rejecting the null hypothesis across multiple tests
Maximum proportion of subjects harmed | Largest proportion of subjects harmed across possible realizations of the study
10.3 Estimation of diagnosands
10.3.1 Analytic design diagnosis
Diagnosis can be done with analytic, pencil-and-paper methods. In an analytic design diagnosis, we typically derive a formula that returns the value of a diagnosand under a stipulated set of beliefs about the model, inquiry, data strategy, and answer strategy.
For example, in a two-arm trial with $N$ subjects split evenly between treatment and control, an outcome standard deviation of $\sigma$ in each arm, and a true effect of $\tau$, the standard error of the effect estimate is approximately:

$$SE \approx \frac{2\sigma}{\sqrt{N}},$$

and the critical value for a two-sided test at significance level $\alpha$ is:

$$z^{*} = \Phi^{-1}(1 - \alpha/2),$$

where $\Phi$ is the standard normal cumulative distribution function. We can then approximate the statistical power as:

$$\text{Power} \approx \Phi\!\left(\frac{|\tau|}{SE} - z^{*}\right).$$
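This approximation is easy to implement as a small R function (an illustrative sketch of the formulas above, not code from the book):

analytic_power <- function(tau, N, sigma = 1, alpha = 0.05) {
  se <- 2 * sigma / sqrt(N)        # approximate standard error in a balanced two-arm trial
  z_star <- qnorm(1 - alpha / 2)   # critical value for a two-sided test
  pnorm(abs(tau) / se - z_star)    # approximate power
}

analytic_power(tau = 0.2, N = 100) # about 0.17 for a 0.2 SD effect with 100 subjects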
Analytic design diagnosis can be hugely useful, since the results cover large families of designs that meet the scope conditions of the derivation. For example, consider the “standard error” diagnosand: a formula for the standard error of the difference-in-means applies to any two-arm trial that satisfies its assumptions, regardless of the substantive setting.
That said, scholars conducting analytic design diagnosis have only worked out a few diagnosands for a limited set of designs. Since designs are so heterogeneous and can vary on so many dimensions, computer simulation is often the only feasible way to diagnose. We learn a lot from analytic design diagnosis—what the important parameters to consider are, what the important inferential problems are—but it often cannot provide direct answers to practical questions like: how many subjects do I need for my conjoint experiment? For that reason, we turn to design diagnosis via Monte Carlo simulation.
10.3.2 Design diagnosis by simulation
Research design diagnosis by simulation occurs in two steps. First we simulate research designs repeatedly, collecting diagnostic statistics from each run of the simulation. Second, we summarize the distribution of the diagnostic statistics in order to estimate the diagnosands.
Monte Carlo simulations of research designs can be written in any programming language. To illustrate the most common way of simulating a design—a for loop—we’ll write a concise simulation in base R code. Diagnosis 10.1 conducts the simulation 500 times, each time collecting the p-value from a test of the null hypothesis that the treatment effect is zero.
Loops like the one in Diagnosis 10.1 can be implemented in any language, and they remain a good tool for design diagnosis. We think, however, that writing simulations in this way obscures which parts of the design refer to the model, the inquiry, the data strategy, or the answer strategy. We might imagine that Z and Y are generated by a data strategy that randomly assigns Z, then measures Y (though without a language for linking code to design steps, it’s a bit unclear). The answer strategy involves running a regression of Y on Z, then conducting a hypothesis test against the null hypothesis that the coefficient on Z is 0. The model and inquiry are left entirely implicit. The inquiry might be the ATE, but it might also be the question of whether the ATE is equal to zero. The model might include only two potential outcomes for each subject, or it might include more; we don’t know.
Diagnosis 10.1 By-hand diagnosis example.
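A minimal sketch of the kind of loop described above (our reconstruction; the book’s exact code may differ):

# simulate a 100-unit two-arm trial 500 times, collecting the p-value each time
set.seed(42)
n_sims <- 500
p_values <- rep(NA, n_sims)

for (i in 1:n_sims) {
  N <- 100
  U <- rnorm(N)                            # unobserved heterogeneity
  Z <- sample(rep(c(0, 1), each = N / 2))  # random assignment of treatment
  Y <- 0.2 * Z + U                         # measured outcome with a 0.2 treatment effect
  p_values[i] <- summary(lm(Y ~ Z))$coefficients["Z", "Pr(>|t|)"]
}

# statistical power: the fraction of simulations with p <= 0.05
mean(p_values <= 0.05)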
For this reason, we advocate for explicit description of all four research design components. Explicit design declaration can also occur in any programming language, but DeclareDesign
is purpose-built for this task. We begin with the explicit declaration of the same two-arm trial as above. We have 100 subjects with a constant response to treatment of 0.2 units. Our inquiry is the average difference between the treated and untreated potential outcomes—the ATE. We assign treatment using complete random assignment and estimate treatment effects using the difference-in-means.
Declaration 10.1 Example of declaration using DeclareDesign
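A sketch of this declaration in DeclareDesign code (our reconstruction of declaration_10.1 from the description above; details may differ from the book’s):

library(DeclareDesign)

declaration_10.1 <-
  declare_model(
    N = 100, U = rnorm(N),
    # constant treatment effect of 0.2 for every subject
    potential_outcomes(Y ~ 0.2 * Z + U)) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N, prob = 0.5)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  # difference-in-means of Y by Z (numerically identical to lm_robust with
  # HC2 standard errors in this two-arm design)
  declare_estimator(Y ~ Z, inquiry = "ATE")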
Diagnosis 10.2 Diagnosis in DeclareDesign
We can now diagnose the declared design. For comparison with the loop, we calculate just one diagnosand: statistical power. Both approaches return approximately the same answer. Any difference between the two can be attributed to simulation error, which is why DeclareDesign
by default returns a bootstrapped standard error for each diagnosand estimate. If we were to increase the number of simulations for both approaches dramatically (100,000 simulations would be plenty), the differences would vanish.
diagnosands <- declare_diagnosands(power = mean(p.value <= 0.05))
diagnosis <-
diagnose_design(declaration_10.1,
diagnosands = diagnosands)
diagnosis
Backing up, we now construct that same diagnosis step-by-step. We build up from a single simulation run of the design to the distribution of simulations, to summaries of that distribution.
set.seed(343)
simulation_10.1 <-
simulate_design(declaration_10.1, sims = 10) |>
mutate(significant = p.value <= 0.05)
We can run this simulation a single time with run_design:
run_design(declaration_10.1)
Figure 10.3 shows the information we might obtain from a single run of the simulation. The filled point is the estimate, plotted along with its 95% confidence interval; the estimand is shown for comparison.
From this single draw, we can’t yet estimate diagnosands, but we can calculate diagnostic statistics. The estimate was higher than the estimand in this draw, so the error is 0.45 - 0.20 = 0.25. Likewise, the squared error is (0.45 - 0.20)^2 = 0.0625. The confidence interval stretches from 0.11 to 0.79, and since the value of the estimand (0.20) is between those bounds, the “covers” diagnostic statistic is equal to TRUE.
To calculate the distributions of the diagnostic statistics, we have to simulate designs not just once, but many times over. The bias diagnosand is the average error over many runs of the simulation. The statistical power diagnosand is the fraction of runs in which the estimate is significant. The coverage diagnosand is the frequency with which the confidence interval covers the estimand.
Figure 10.4 visualizes 10 runs of the simulation. We can see that some of the draws produce statistically significant estimates (the shaded areas are small and the confidence intervals don’t overlap with zero), but not all. We get a sense of the true standard error (the standard deviation of the estimates over multiple runs of the design) by seeing how the point estimates bounce around. We get a feel for the difference between the estimates of the standard error and true standard error. Design diagnosis is the process of learning about many ways the study might come out, not just the one way that it will.
In Diagnosis 10.3, one line of code does all of these steps at once. We simulate the design (500 times by default) to calculate the diagnostic statistics, then we summarize them in terms of bias, the true standard error, RMSE, power, and coverage.
Diagnosis 10.3 Design diagnosis in one step
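A sketch of that one-step diagnosis (the diagnosands here mirror those just listed; the book’s exact code may differ):

diagnosis_10.3 <-
  diagnose_design(
    declaration_10.1,
    diagnosands = declare_diagnosands(
      bias = mean(estimate - estimand),
      true_se = sd(estimate),
      rmse = sqrt(mean((estimate - estimand)^2)),
      power = mean(p.value <= 0.05),
      coverage = mean(estimand <= conf.high & estimand >= conf.low)
    )
  )

diagnosis_10.3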
10.3.3 Simulation error
We use Monte Carlo simulation to estimate diagnosands. We only get estimates—not the exact value—of diagnosands under a particular model because of simulation error. Simulation error declines as we conduct more simulations. When we conduct many thousands of simulations, we’re relatively certain of the value of the diagnosand under the model. If we conduct just tens of simulations, we are much more uncertain.
We can characterize our uncertainty regarding the diagnosand estimate by calculating a standard error. If the Monte Carlo standard error is large relative to the estimated diagnosand, then we need to increase the number of simulations. Unlike empirical settings where additional data collection can be very expensive, in order to get more observations of diagnostic statistics, we can just increase the number of simulations. Computation time isn’t free—but it’s cheap.
The standard error of diagnosand estimates can be calculated using standard formulas. If the diagnosand can be written as a population mean, and the simulations are fully independent, then we can estimate its standard error as the standard deviation of the simulated diagnostic statistics divided by the square root of the number of simulations.
Since some diagnosands are more complex than a population mean (i.e., we can’t characterize the estimation error with simple formulas), the diagnose_design()
function uses the nonparametric bootstrap.
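For a diagnosand that is a simple mean, like power, the calculation can be done by hand (a sketch, reusing the power example from above):

simulations <- simulate_design(declaration_10.1, sims = 500)
significant <- simulations$p.value <= 0.05

power_estimate <- mean(significant)                          # estimated diagnosand
power_mc_se <- sd(significant) / sqrt(length(significant))   # Monte Carlo standard error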
10.4 How to diagnose designs
10.4.1 Exploring design tradeoffs
Choosing among designs means choosing which diagnosands are important—and not all diagnosands are equally important for every study. For example, in a descriptive study whose goal is to estimate the fraction of people in France who are left-handed, statistical power is likely irrelevant. A hypothesis test against the null hypothesis that zero percent of the people in France are left-handed is preposterous (though we could of course have other policy-relevant hypotheses). A much more important diagnosand for this study would be root mean-squared error (RMSE), which measures how far the estimates tend to fall from the estimand and incorporates both bias and variance.
Often, we need to look at several diagnosands in order to understand what might be going wrong. If our design exhibits “undercoverage” (e.g., a procedure for constructing “95%” confidence interval only covers the estimand 50% of the time), that might be because the standard errors are too small or because the point estimator is biased, or some combination of the two. In really perverse instances, we might have a biased point estimator which, thanks to overly wide confidence intervals, just happens to cover the estimand 95% of the time.
Many research design decisions involve trading off bias and variance. In trade-off settings, we may need to accept higher variance in order to decrease bias. Likewise, we may need to accept a bit of bias in order to achieve lower variance. The trade-off is captured by mean-squared error (MSE), the average squared distance between the estimate and the estimand.
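Writing $\hat{a}$ for the estimate and $a_m$ for the estimand (the same illustrative notation as above), the standard decomposition shows that MSE is the sum of the variance and the squared bias:

$$
\mathrm{MSE} = \mathbb{E}\!\left[(\hat{a} - a_m)^2\right]
= \mathrm{Var}(\hat{a}) + \left(\mathbb{E}[\hat{a}] - a_m\right)^2
= \text{variance} + \text{bias}^2.
$$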
To illustrate, consider the following three designs as represented by three targets. The inquiry is the bullseye of the target. The data and answer strategies combine to generate a process by which arrows are shot toward the target. On the left, we have a very bad archer: even though the estimates are unbiased in the sense that they hit the bullseye “on average”, very few of the arrows are on target. In the middle, we have an excellent shot: they are both on target and low variance. On the right, we have an archer who is very consistent (low variance) but biased. The mean squared error is highest on the left and lowest in the middle.
The archery metaphor is common in research design textbooks because it effectively conveys the difference between variance and bias, but it does elide an important point. It really matters which target the archer is shooting at. Figure 10.6 shows a bizarre double-target representing two inquiries. The empirical strategy is unbiased and precise for the left inquiry, but it is clearly biased for the right inquiry. When we are describing the properties of an answer strategy, we have to be clear about which inquiry it is associated with.
MSE is an exactly equal weighting of variance and bias (squared). Yet many other weightings of these two diagnosands are possible, and different researchers will vary in their weightings.
In evaluating a research design through diagnosis, we must construct a weighting of all relevant diagnosands. We can think of this as our research design utility function. Our utility function describes how important it is to study big questions, to shift beliefs in a research field, to overturn established findings, to obtain unbiased answers, and to get the sign of the inquiry right. When we compare across empirical designs, we compare them on this utility function that aggregates all the diagnosands according to their importance.
For example, we too often consider the diagnosand power on its own. This diagnosand is the probability of getting a statistically significant result, which of course depends on many things about your design, including, crucially, the unknown magnitude of the parameter to be estimated. But considering power alone is also misleading: no researcher wants to design a study that is 80% powered but returns highly biased estimates. Another way of saying this is that researchers always care about both power and bias. How much they care about each feature determines the weight of power and bias in their utility function.
Diagnosands need not be about hypothesis testing or even statistical analysis of the data at all. We often trade off how much we learn from a research design with its cost in terms of money and our time. We have financial and time budgets that provide hard constraints to our designs. Many researchers wish to select cheaper (or shorter) designs in order to carry out more studies or finish their degree sooner. Time and cost are also diagnostic statistics! We may wish to explore the maximum cost of a study or the average amount of time it would take.
Ethical considerations also often enter the process of assessing research designs, if implicitly. We can explicitly incorporate them into our utility function by caring about harm to subjects and the degree of informed consent. When collecting data, researchers often encounter a trade-off between informing subjects about the purpose of the study (an ethical consideration, or a requirement of the IRB) on the one hand and the bias that comes from Hawthorne or demand effects on the other. We can incorporate these considerations in a research design diagnosis by specifying diagnostic statistics related to the amount of disclosure about the purposes of research or the number of subjects harmed in the research. See Section 21.1.1 for an example.
10.4.2 Diagnosis under model uncertainty
We are always uncertain about the model in M. If we were certain of M (or there was no real dispute about it), there would be no need to conduct new empirical research. Research design diagnosis can incorporate this uncertainty by evaluating the performance of the design under alternative models. For example, if we are unsure of the exact value of the intra-cluster correlation (ICC), we can simulate the design under a range of plausible ICC values. If we are unsure of the true average treatment effect, we can diagnose the power of the study over a range of plausible effect sizes. Uncertainty over model inputs like the means, variances, and covariances in data that will eventually be collected is a major reason to simulate under a range of plausible values.
We illustrate diagnosis under model uncertainty with the declaration below. Here we have a 200-unit two-arm trial in which we explicitly describe our uncertainty over the value of the true average treatment effect. In the potential_outcomes call, we have Y ~ runif(n = 1, min = 0, max = 0.5) * Z + U, which indicates that the treatment effect in each run of the simulation is a single draw from a uniform distribution between 0.0 and 0.5.
Declaration including model uncertainty
declaration_10.2 <-
declare_model(
N = 200, U = rnorm(N),
# this runif(n = 1, min = 0, max = 0.5)
# generates 1 random ATE between 0 and 0.5
potential_outcomes(Y ~ runif(n = 1, min = 0, max = 0.5) * Z + U)) +
declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
declare_assignment(Z = complete_ra(N, prob = 0.5)) +
declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
declare_estimator(Y ~ Z, inquiry = "ATE")
Diagnosis 10.4 Diagnosis under model uncertainty.
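A sketch of the simulation behind this diagnosis (our reconstruction; Figure 10.7 is built from output like this):

library(ggplot2)

simulations_10.2 <- simulate_design(declaration_10.2, sims = 500)

# each run records the drawn effect size (the estimand) and the p-value,
# so we can smooth the probability of significance over effect sizes
ggplot(simulations_10.2, aes(x = estimand, y = as.numeric(p.value <= 0.05))) +
  geom_smooth(method = "loess") +
  labs(x = "True effect size (ATE)", y = "Statistical power")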
Figure 10.7 shows how the statistical power of this design varies over the set of model possibilities. We plot the possible effect sizes we entertain in the model on the horizontal axis and the statistical power on the vertical axis. We also plot a loess curve that flexibly smooths over the full distribution of effect sizes. We see that at low levels of the effect size, statistical power is quite poor. With only 200 subjects, we couldn’t achieve 80% statistical power unless the true effect size were approximately 0.45 standard deviations. The effect size at which a design achieves 80% power is often referred to as the minimum detectable effect size (MDE). This exercise shows how the MDE diagnosand is a summary of a design that admits uncertainty over the effect size. We return to this example when we redesign this study to learn how the MDE changes at different sample sizes.
What is “the” power of this design? Under one view, the true power of the design is whichever power value is associated with the true (but unknown) effect size. But under an agnostic view, the ex ante “power” of the design is a weighted average of all of these power values, weighted by the researcher’s prior beliefs over the distribution of possible effect sizes.
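Under the agnostic view, the ex ante power can be read off the same simulations, since each run draws an effect size from the prior (using the simulations_10.2 object sketched above):

# average power, weighted by the prior over effect sizes encoded in the model
mean(simulations_10.2$p.value <= 0.05)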
10.4.3 Adjudicating between competing models
The principle to design agnostically extends to competing models. Imagine that you believe the treatment affects outcome Y1 but not Y2, while a colleague believes the reverse: the treatment affects Y2 but not Y1. The declarations below encode the two competing models, M1 and M2, and combine each with a shared inquiry, data strategy, and answer strategy.
Declaration of competing models.
M1 <-
declare_model(
N = 200,
U = rnorm(N),
potential_outcomes(Y1 ~ 0.2 * Z + U),
potential_outcomes(Y2 ~ 0.0 * Z + U)
)
M2 <-
declare_model(
N = 200,
U = rnorm(N),
potential_outcomes(Y1 ~ 0.0 * Z + U),
potential_outcomes(Y2 ~ 0.2 * Z + U)
)
IDA <-
declare_inquiry(ATE1 = mean(Y1_Z_1 - Y1_Z_0),
ATE2 = mean(Y2_Z_1 - Y2_Z_0)) +
declare_assignment(Z = complete_ra(N)) +
declare_measurement(Y1 = reveal_outcomes(Y1 ~ Z),
Y2 = reveal_outcomes(Y2 ~ Z)) +
declare_estimator(Y1 ~ Z, inquiry = "ATE1", label = "DIM1") +
declare_estimator(Y2 ~ Z, inquiry = "ATE2", label = "DIM2")
declaration_10.3a <- M1 + IDA
declaration_10.3b <- M2 + IDA
Diagnosis 10.5 Diagnosis of competing models
To diagnose these alternative models, we count how frequently each perspective receives support. If we define support by statistical significance (other metrics are of course possible), then your model is supported if the estimated effect of treatment on Y1 is statistically significant and the estimated effect on Y2 is not; your colleague’s model is supported in the reverse case. If both or neither estimate is significant, neither model receives clear support.
diagnosis_10.5 <-
diagnose_design(
list(declaration_10.3a, declaration_10.3b)
)
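To tabulate how often each theory receives support, we can classify each simulation run by which estimates are significant (a sketch; column names assume a recent version of DeclareDesign):

library(dplyr)
library(tidyr)

simulate_design(declaration_10.3a, sims = 500) |>
  mutate(significant = p.value <= 0.05) |>
  select(sim_ID, estimator, significant) |>
  pivot_wider(names_from = estimator, values_from = significant) |>
  summarize(
    theory_1_supported = mean(DIM1 & !DIM2),
    theory_2_supported = mean(DIM2 & !DIM1),
    neither_supported = mean(!DIM1 & !DIM2)
  )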
The diagnosis shows that the study is responsive to the truth. When theory 1 is correct, the design is more likely to yield empirical evidence in favor of it; the reverse holds when theory 2 is correct. That said, the major concern facing this adversarial collaboration is that the study is too small to resolve the dispute. About two thirds of the time—regardless of who is right!—neither theory receives support. This problem can be ameliorated either by elaborating more tests of each theoretical perspective or by increasing the size of the study.
10.5 Summary
Diagnosis is the process of estimating the properties of research designs under uncertain beliefs about the world. We simulate under many alternative designs because we want to choose designs that are strong under a large range of model possibilities. Each run of a design generates diagnostic statistics. We learn about the distribution of the diagnostic statistics by running the design repeatedly. Diagnosands are summaries of the distribution of diagnostic statistics. Which diagnosands matter most varies from study to study and depends on the relative weight you place on one diagnosand versus another. Analytic design diagnosis is possible and can be quite powerful—we nevertheless recommend full simulation of research designs in order to learn about a range of diagnosands.