# 10 Diagnosis

Research design diagnosis is the process of evaluating the properties of a research design. We invent the term “diagnosand” to refer to those properties of a research design we would like to diagnose. Many diagnosands are familiar. Power is the probability of obtaining a statistically significant result. Bias is the average deviation of estimates from the estimand. Other diagnosands are more exotic, like the Type-S error rate, which is the probability that the estimate has the incorrect sign, conditional on being statistically significant. This chapter focuses mainly on the use of Monte Carlo computer simulations to estimate diagnosands, though we touch on analytic design diagnosis as well.

Research designs are strong when the empirical answer $$a_{d^*}$$ generated by the design is close to the true answer $$a_{m^*}$$. We can never know $$a_{m^*}$$ with certainty – some combination of the fundamental problems of generalization, causal inference, and descriptive inference conspire to hide the truth from us. This problem means we have to assess the performance of our designs, not with respect to the real world, but instead with respect to our unverified and unverifiable beliefs about the world. In other words, we assess the properties of research designs by comparing the simulated answer $$a_d$$ to the answers under the model $$a_m$$, over many possible realizations of the design.

Figure 10.1 below is similar to Figure 5.1, which we described when defining the components of a research design. To recapitulate the main points of that discussion: The theoretical half of an actual research design defines an inquiry I in terms of a model M. The true answer to that question is the inquiry applied to the real world $$I(w) = a_{m^*}$$. The empirical half of a research design applies the data strategy to the real world to generate a dataset ($$D(m^*) = d^*$$), then applies the answer strategy to the realized dataset to generate an answer: ($$A(d^*) = a_{d^*}$$). All these stars reflect the fact that reality plays an important role in empirical research designs. We commit to the notion that there are real answers to our theoretical questions. Our empirical answers are tethered to reality because the data strategy generates information that depends on the real world. A fundamental premise of empirical research is that there is in fact a real truth to learn.

When we simulate and diagnose designs, however, this tether to reality is snipped, and we find ourselves in the bottom half of Figure 10.1. When we simulate, the theoretical half of the design entertains many possibilities in the model M, which we label $$m_1$$, $$m_2$$, …, $$m_k$$. Each of the $$k$$ possibilities generates a different answer to the inquiry: $$a_{m_1}$$, $$a_{m_2}$$, …, $$a_{m_k}$$. Importantly, the simulated research design does not have access to the true answer $$a_{m^*}$$, only to the answers under the $$k$$ considered models.

We pause to address one confusing aspect of our notation. The set of models $$m_1$$, $$m_2$$, …, $$m_k$$ could refer to a set of $$k$$ separate theoretical perspectives. Or it could refer to $$k$$ draws from the same basic model in which the values of the exogenous variables (following the specific meaning described in Section 6.1.1) differ slightly. In other words, our notation doesn’t draw a deep distinction between large differences between theoretical models and small ones.

When we simulate, the empirical half of the design is similarly dissociated from reality. We apply the data strategy $$D$$ to each of the model draws $$m_k$$ to produce simulated datasets indexed as $$d_k$$. These fabricated datasets may be similar to or different from the true dataset $$d^*$$ that would result if the design were realized (the more similar the better). We apply the answer strategy A to each simulated dataset $$d_k$$ in order to produce simulated answers $$a_{d_k}$$.

A comparison of the top row to the bottom three rows of Figure 10.1 shows how actual research designs differ from simulated research designs. Actual research designs are influenced by reality – we learn about the real world by conducting empirical research. We don’t learn about the real world from simulated research designs. Instead, we learn about research designs themselves. We learn how the research design would behave if the real world $$m^*$$ were like the model draw $$m_1$$ – or like $$m_{100}$$. We can only evaluate designs under the possibilities we consider in $$M$$. If the world $$m^*$$ is not in that set of possibilities, we won’t have evaluated the design under the most important setting, i.e., what will happen when we actually apply the design in reality. We want to follow research design Principle 3.3 to “entertain many models,” in other words, to consider such a wide range of possibilities that some $$a_{m_k}$$ will be close enough to $$a_{m^*}$$.

## 10.1 Diagnostic statistics and diagnosands

Diagnosands are summaries of the distributions of diagnostic statistics. We’ll start by defining diagnostic statistics, then move on to describing how to generate the distribution of diagnostic statistics, then how to summarize those distributions in order to estimate diagnosands.

A diagnostic statistic $$\phi_k$$ is itself a function of $$a_{m_k}$$ and $$a_{d_k}$$: $$\phi_k = g(a_{m_k}, a_{d_k})$$. We elide here two important notational annoyances. For the purpose of the following discussion of diagnostic statistics and diagnosands, $$a_{d_k}$$ can refer to an estimate like a difference-in-means estimate or an associated uncertainty statistic like a standard error or a $$p$$-value. Importantly, diagnostic statistics can be functions that depend on $$a_{m_k}$$ only, $$a_{d_k}$$ only, or both together.

Each diagnostic statistic is the result of a different function $$g$$. For example, the “error” diagnostic statistic is produced by this function: $$g(a_{m_k}, a_{d_k}) = a_{d_k} - a_{m_k}$$. The “significant” diagnostic statistic is $$g(a_{d_k}) = \mathbf{I}[p(a_{d_k}) \leq \alpha]$$, where $$p()$$ is a function of the empirical answer that returns a $$p$$-value, $$\alpha$$ is a significance cutoff, and $$\mathbf{I}$$ is an indicator function that returns 1 when the $$p$$-value is at or below the cutoff and 0 otherwise. A diagnostic statistic is something that can be calculated on the basis of a single run of the design.
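The two diagnostic statistics just defined can be computed from a single run. A minimal sketch in R, using hypothetical values for the estimate, estimand, and $$p$$-value:

```r
# Hypothetical values from a single run of a design
a_m <- 0.20  # answer under the model (the estimand)
a_d <- 0.45  # simulated empirical answer (the estimate)
p   <- 0.01  # p-value associated with the estimate

error <- a_d - a_m                    # "error" diagnostic statistic
significant <- as.numeric(p <= 0.05)  # "significant" diagnostic statistic

c(error = error, significant = significant)
```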

Typically, diagnostic statistics will be different from simulation to simulation. In other words, $$\phi_1$$ will be different from $$\phi_2$$, $$\phi_2$$ will be different from $$\phi_3$$, and so on. These differences arise partially from the variation in M: $$m_1$$ is different from $$m_2$$, $$m_2$$ is different from $$m_3$$, and so on. Differences can also arise from the explicitly random procedures in $$D$$: sampling, assignment, and measurement can all include stochastic elements that will ramify through to the diagnostic statistics. As a result of these sources of variation, a diagnostic statistic is a random variable $$\Phi$$.

A diagnosand is a summary of the random variable $$\Phi$$, written $$f(\Phi)$$, where $$f()$$ is a statistical functional that summarizes the random variable. A statistical functional is a function of a random variable that is not itself a random variable. For example, the expectation function $$\mathbb{E}[X]$$ summarizes the random variable $$X$$ with its expectation, the mean, while the variance function summarizes the expectation of the squared deviation of a random variable from its mean.

Let’s back up a moment to work through two concrete examples of common diagnosands: bias and power (see Section 10.4 for a more exhaustive list).

Consider the diagnosand “bias” in the context of a two-arm randomized trial where the inquiry is the average treatment effect, the data strategy entails complete random assignment, and the answer strategy is the difference-in-means. Bias is the average difference between the estimate and the estimand. Under a single realization $$m_k$$ of the model M, the value of the ATE will be a particular number, which we call $$a_{m_k}$$. We simulate a random assignment and measurement of observed outcomes, then apply the difference-in-means estimator. The diagnostic statistic is the error $$a_{d_k} - a_{m_k}$$; this error is a random variable because each $$m_k$$ differs slightly and because the random assignment varies from run to run. The bias diagnosand is the expectation of this random variable, $$\mathbb{E}[a_{d_k} - a_{m_k}]$$, where the expectation is taken over the randomization distribution implied by M and D.

Now consider the diagnosand “power.” Like bias, statistical power is an expectation, this time of the “significant” diagnostic statistic $$\mathbf{I}(p(a_{d_k}) \leq 0.05)$$. Power describes how frequently the answer strategy will return a statistically significant result. Some textbooks define statistical power as one minus the Type II error rate, where a Type II error is the failure to reject the null hypothesis, given that the null hypothesis is false. This definition is accurate, but hard to understand. The phrase “given that the null hypothesis is false” refers to model possibilities ($$a_{m_k}$$’s) in which the null hypothesis does not hold. Our definition of power is instead, “the probability of getting a statistically significant result, conditional on a set of beliefs about the model.”

## 10.2 Estimating diagnosands analytically

Diagnosis can be done with analytic, pencil-and-paper methods. In an analytic design diagnosis, we typically derive a formula that returns the value of a diagnosand under a stipulated set of beliefs about the model, inquiry, data strategy, and answer strategy. For example, research design textbooks often contain analytic design diagnoses for statistical power. One such treatment reads:

“To illustrate a power analysis, consider a completely randomized experiment where $$N/2$$ of $$N$$ units are selected into a binary treatment. The researcher must now make assumptions about the distributions of outcomes for treatment and for control units. In this example, the researcher assumes that the control group has a normally distributed outcome with mean $$\mu_c$$, the treatment group has a normally distributed outcome with mean $$\mu_t$$, and both groups’ outcomes have a standard deviation $$\sigma$$. The researcher must also choose $$\alpha$$, the desired level of statistical significance (typically 0.05). Under this scenario, there exists a simple asymptotic approximation for the power of the experiment (assuming that the significance test is two-tailed): $\beta = \Phi \bigg(\frac{|\mu_t - \mu_c| \sqrt{N}}{2\sigma} - \Phi^{-1} \left(1 - \frac{\alpha}{2}\right) \bigg)$ where $$\beta$$ is the statistical power of the experiment, $$\Phi(\cdot)$$ is the normal cumulative distribution function (CDF), and $$\Phi^{-1}(\cdot)$$ is the inverse of the normal CDF.”

This power formula makes detailed assumptions about M, I, D, and A. Under M, it assumes that both potential outcomes are normally distributed with group-specific means and a common variance. Under I, it assumes the average treatment effect. Under D, it assumes a particular randomization strategy (complete random assignment). Under A, it assumes a particular hypothesis testing approach (equal variance $$t$$-test with $$N - 2$$ degrees of freedom). Whether this set of assumptions is “close enough” will depend on the research setting.
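As a quick check on the logic of the formula, it can be coded directly. A sketch in R; the function name is ours:

```r
# Asymptotic power approximation for a two-arm trial, following the
# formula quoted above (two-tailed test, equal-sized arms, common sd)
analytic_power <- function(mu_t, mu_c, sigma, N, alpha = 0.05) {
  pnorm(abs(mu_t - mu_c) * sqrt(N) / (2 * sigma) - qnorm(1 - alpha / 2))
}

# A 0.2 standard deviation effect with 100 subjects yields power of
# about 0.17, close to the simulation results later in the chapter
analytic_power(mu_t = 0.2, mu_c = 0, sigma = 1, N = 100)
```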

Analytic design diagnoses can be hugely useful, since they cover large families of designs that meet the scope criteria. For example, the “standard error” diagnosand $$\sqrt{\mathbb{E}[(a_{d_k} - \mathbb{E}[a_{d_k}])^2]}$$ of a standard two-arm trial has been worked out by statisticians to be $$\sqrt{\frac{1}{N-1}\left\{\frac{m\mathbb{V}(Y_i(0))}{N-m} + \frac{(N-m)\mathbb{V}(Y_i(1))}{m} + 2\mathrm{Cov}(Y_i(0), Y_i(1))\right\}}$$, where $$N$$ is the number of units and $$m$$ the number treated (see Section 17.1.1 for details). This standard error is accurate for any completely randomized design with stable potential outcomes and a difference-in-means estimator. Many, if not most, advances in our understanding of the properties of research designs come from analytic design diagnosis. For example, analytic work has shown that cluster randomized trials with heterogeneous cluster sizes are not unbiased for the ATE, which leads to the design recommendation that clustered trials should block on cluster size. These lessons apply broadly, more broadly perhaps than the lessons learned about a specific design in a specific Monte Carlo simulation.
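When both potential outcomes are known, as they are in simulation, this standard error can be computed directly. A sketch in R; the function name is ours:

```r
# True standard error of the difference-in-means under complete random
# assignment of m of N units, following the formula above. It requires
# both potential outcomes, so it is computable only in simulation.
analytic_se <- function(Y0, Y1, m) {
  N <- length(Y0)
  sqrt((m * var(Y0) / (N - m) + (N - m) * var(Y1) / m +
          2 * cov(Y0, Y1)) / (N - 1))
}

# Example: constant effect of 0.2 with standard normal control outcomes
U <- rnorm(100)
analytic_se(Y0 = U, Y1 = U + 0.2, m = 50)  # approximately 0.2
```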

That said, scholars conducting analytic design diagnosis have only worked out a few diagnosands for a limited set of designs. Since designs are so heterogeneous and can vary on so many dimensions, computer simulation is often the only feasible way to diagnose. We learn a lot from analytic design diagnosis – what are the important parameters to consider, what are the important inferential problems – but they often cannot provide direct answers to practical questions like, how many subjects do I need for my conjoint experiment? For that reason, we turn to design diagnosis via Monte Carlo simulation.

## 10.3 Estimating diagnosands via simulation

Research design diagnosis by simulation occurs in two steps. First we simulate research designs repeatedly, collecting diagnostic statistics from each run of the simulation. Second, we summarize the distribution of the diagnostic statistics in order to estimate the diagnosands.

Monte Carlo simulations of research designs can be written in any programming language. To illustrate the most common way of simulating a design – a for loop – we’ll write a concise simulation in base R code. We conduct the simulation 500 times, each time collecting the $$p$$-value associated with a regression estimate of the average effect of a treatment on an outcome.

Loops like this one can be implemented in any language, and they remain a good tool for design diagnosis. We think, however, that writing simulations in this way obscures which parts of the design refer to the model, the inquiry, the data strategy, or the answer strategy. We might imagine Z and Y are generated by a data strategy that randomly assigns Z, then measures Y (though without a language for linking code to design steps, it’s a bit unclear). The answer strategy involves running a regression of Y on Z, then conducting a hypothesis test against the null hypothesis that the coefficient on Z is 0. The model and inquiry are left entirely implicit. The inquiry might be the ATE, but it might also be the question of whether the ATE is equal to zero. The model might include only two potential outcomes for each subject, or it might include more; we don’t know.

```r
sims <- 500
p.values <- rep(NA, sims)

for (i in 1:sims) {
  Z <- rbinom(100, 1, 0.5)
  U <- rnorm(100)
  Y <- 0.2 * Z + U
  p.values[i] <- summary(lm(Y ~ Z))$coefficients[2, 4]
}

power <- mean(p.values <= 0.05)
power
```

Table 10.1: By-hand design diagnosis with a for-loop

| power |
|-------|
| 0.15  |

For this reason, we advocate for explicit description of all four research design components. Explicit design declaration can also occur in any programming language, but DeclareDesign is purpose-built for this task. We begin with the explicit declaration of the same two-arm trial as above. We have 100 subjects with a constant response to treatment of 0.2 units. Our inquiry is the average difference between the treated and untreated potential outcomes – the ATE. We assign treatment using complete random assignment and estimate treatment effects using difference-in-means.

Declaration 10.1

```r
design <-
  declare_model(
    N = 100,
    U = rnorm(N),
    potential_outcomes(Y ~ 0.2 * Z + U)
  ) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE")
```


We can now diagnose the declared design. For comparison with the loop, we calculate just one diagnosand: statistical power. Both approaches return the same answer (within simulation error) of 16% statistical power.

```r
diagnosands <- declare_diagnosands(power = mean(p.value <= 0.05))

diagnosis <-
  diagnose_design(design,
                  diagnosands = diagnosands,
                  sims = 500)
diagnosis
```

Table 10.2: Design diagnosis using DeclareDesign

| power | se(power) | n_sims |
|-------|-----------|--------|
| 0.158 | 0.0154331 | 500    |

### 10.3.1 Breaking down diagnosis

In this section, we break down the diagnosis process from start to finish. We build up from a single simulation run of the design to the distribution of simulations, to summaries of that distribution.

We can run this simulation a single time with run_design:

```r
run_design(design)
```

Table 10.3: One simulation draw

| estimand | estimate | std.error | df | p.value | conf.low | conf.high |
|----------|----------|-----------|----|---------|----------|-----------|
| 0.2      | 0.448    | 0.171     | 98 | 0.01    | 0.108    | 0.789     |

Figure 10.2 shows the information we might obtain from a single run of the simulation. The filled point is the estimate $$a_{d_k}$$. The open triangle is the estimand $$a_{m_k}$$. The bell-shaped curve is our normal-approximation based estimate of the sampling distribution. The standard deviation of this estimated distribution is our estimated standard error, which expresses our uncertainty. The confidence interval around the estimate is another expression of our uncertainty: We’re not sure where $$a_{d_k}$$ is, but if things are going according to plan, confidence intervals constructed this way will bracket $$a_{m_k}$$ 95% of the time.

From this single draw, we can’t yet estimate diagnosands, but we can calculate diagnostic statistics. The estimate was higher than the estimand in this draw, so the error is 0.45 - 0.20 = 0.25. Likewise, the squared error is (0.45 - 0.20)$$^2$$ = 0.0625. The $$p$$-value is 0.01, which is well below the threshold of 0.05, so the “significant” diagnostic statistic is equal to TRUE. The confidence interval stretches from 0.11 to 0.79, and since the value of the estimand (0.20) is between those bounds, the “covers” diagnostic statistic is equal to TRUE as well.

To calculate the distributions of the diagnostic statistics, we have to simulate designs not just once, but many times over. The bias diagnosand is the average error over many runs of the simulation. The statistical power diagnosand is the fraction of runs in which the estimate is significant. The coverage diagnosand is the frequency with which the confidence interval covers the estimand.
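These over-many-runs summaries can be computed by hand from a simulations data frame. A sketch using DeclareDesign’s simulate_design, assuming the design object declared above:

```r
# Simulate the design many times, then summarize the diagnostic
# statistics into the three diagnosands described above
simulations <- simulate_design(design, sims = 500)

with(simulations, c(
  bias     = mean(estimate - estimand),
  power    = mean(p.value <= 0.05),
  coverage = mean(estimand >= conf.low & estimand <= conf.high)
))
```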

Figure 10.3 visualizes 10 runs of the simulation. We can see that some of the draws produce statistically significant estimates (the shaded areas are small and the confidence intervals don’t overlap zero), but not all. We get a sense of the true standard error by seeing how the point estimates bounce around. We get a feel for the difference between the estimated standard errors and the true standard error. Design diagnosis is the process of learning about many ways the study might come out, not just the one way that it will.

The code below does it all in one step. We simulate the design 1000 times to calculate the diagnostic statistics, then we summarize them in terms of bias, the true standard error, power, and coverage.

```r
diagnosis <-
  diagnose_design(
    design, sims = 1000,
    diagnosands = declare_diagnosands(
      bias = mean(estimate - estimand),
      true_se = sd(estimate),
      power = mean(p.value <= 0.05),
      coverage = mean(estimand <= conf.high & estimand >= conf.low)
    )
  )
```

Table 10.4: Diagnosand estimates with bootstrapped standard errors

| bias        | true se     | power       | coverage    |
|-------------|-------------|-------------|-------------|
| 0.00 (0.01) | 0.20 (0.00) | 0.17 (0.01) | 0.94 (0.01) |

### 10.3.2 Simulation error

We use Monte Carlo simulation to estimate diagnosands. We only get estimates – not the exact value – of diagnosands under a particular model because of simulation error. Simulation error declines as we conduct more simulations. When we conduct many thousands of simulations, we’re relatively certain of the value of the diagnosand under the model. If we conduct just tens of simulations, we are much more uncertain.

We can characterize the uncertainty attending the diagnosand estimate by calculating a standard error. If the Monte Carlo standard error is large relative to the estimated diagnosand, then we need to increase the number of simulations. Unlike empirical settings, where additional data collection can be very expensive, here getting more observations of diagnostic statistics just requires running more simulations. Computation time isn’t free – but it’s cheap.

The standard error of diagnosand estimates can be calculated using standard formulas. If the diagnosand can be written as a population mean, and the simulations are fully independent, then we can estimate the standard error as $$\frac{\mathrm{sd}(\phi)}{\sqrt{k}}$$, where $$k$$ is the number of simulations. The power diagnosand summarizes a binary diagnostic statistic (is the estimate significant or not). Binary variables exhibit maximum variability when the probability of a 1 is 0.5. So with 1,000 independent simulations, the standard error for the mean of a binary diagnostic statistic is $$\frac{\sqrt{0.5 * 0.5} }{\sqrt{1000}} = 0.016$$. This level of simulation uncertainty is often acceptable (the 95% confidence interval is approximately 4 * 0.016 = 6.4 percentage points wide), but if it isn’t, you can always increase the number of simulations.
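That back-of-the-envelope calculation can be expressed as a one-line helper; the function name is ours:

```r
# Monte Carlo standard error for a diagnosand that is the mean of a
# binary diagnostic statistic with success probability p over k sims
mc_se <- function(p, k) sqrt(p * (1 - p) / k)

mc_se(p = 0.5, k = 1000)  # worst case with 1,000 sims: about 0.016
```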

Because some diagnosands are more complex than a population mean (i.e., we can’t characterize the estimation error with simple formulas), DeclareDesign performs nonparametric bootstrapping by default whenever the diagnose_design() function is called.
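The idea behind that bootstrap can be sketched by hand: resample the diagnostic statistics with replacement and recompute the diagnosand many times. This is an illustration of the procedure, not DeclareDesign’s internal code:

```r
# Bootstrapped standard error of a diagnosand estimate
boot_se <- function(phi, f = mean, reps = 500) {
  sd(replicate(reps, f(sample(phi, replace = TRUE))))
}

# Example: significance indicators from 1,000 hypothetical runs
significant <- rbinom(1000, 1, 0.17)
boot_se(significant)  # close to the analytic value of about 0.012
```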

## 10.4 Types of diagnosands

As described above, a diagnostic statistic is any summary function of $$a_m$$ and $$a_d$$, and a diagnosand is any summary of the distribution of diagnostic statistics. As a result, there are a great many diagnosands researchers may consider. In Table 10.5, we introduce a nonexhaustive set of diagnostic statistics, and in Table 10.6 a nonexhaustive set of diagnosands.

Table 10.5: Diagnostic statistics.

| Diagnostic statistic | Definition |
|---|---|
| Estimate | $$a_{d_k}$$ |
| Estimand under the model | $$a_{m_k}$$ |
| $$p$$-value | $$p(a_{d_k})$$ |
| $$p$$-value is no greater than $$\alpha$$ | $$\mathbf{I}(p \leq \alpha)$$ |
| Confidence interval | $$\mathrm{CI}_{1-\alpha}$$ |
| Confidence interval covers the estimand under the model | $$\mathrm{covers}_{a_m} \equiv \mathbb{I}\{ a_m \in \mathrm{CI}_{1-\alpha} \}$$ |
| Estimated standard error | $$\widehat\sigma(A)$$ |
| Cost | $$\mathrm{cost}$$ |
| Proportion of subjects harmed | $$\Pr(\mathrm{harm}) \equiv \frac{1}{n} \sum_i \mathrm{harm}_i$$ |
Table 10.6: Diagnosands.

| Diagnosand | Description | Definition |
|---|---|---|
| Average estimate | | $$\mathbb{E}(a_d)$$ |
| Average estimand | | $$\mathbb{E}(a_m)$$ |
| Power | Probability of rejecting the null hypothesis of no effect | $$\mathbb{E}(\mathbf{I}(p \leq \alpha))$$ |
| Bias | Expected difference between estimate and estimand | $$\mathbb{E}(a_d - a_m)$$ |
| Variance of the estimates | | $$\mathbb{V}(a_d)$$ |
| True standard error | | $$\sqrt{\mathbb{V}(a_d)}$$ |
| Average estimated standard error | | $$\mathbb{E}(\widehat\sigma(A))$$ |
| Root mean-squared error (RMSE) | | $$\sqrt{\mathbb{E}[(a_d - a_m)^2]}$$ |
| Coverage | Probability the confidence interval covers the estimand | $$\Pr(\mathrm{covers}_{a_m})$$ |
| Bias-eliminated coverage | Probability the confidence interval covers the average estimate | $$\Pr(\mathrm{covers}_{\mathbb{E}(a_d)})$$ |
| Type-S error rate | Probability the estimate has an incorrect sign, conditional on statistical significance | $$\Pr(\mathrm{sgn}(a_d) \neq \mathrm{sgn}(a_m) \mid p \leq \alpha)$$ |
| Exaggeration ratio | Expected ratio of the absolute value of the estimate to the estimand, conditional on statistical significance | $$\mathbb{E}(|a_d| / a_m \mid p \leq \alpha)$$ |
| Type I error rate | Probability of rejecting the null hypothesis when it is true | $$\Pr(p \leq \alpha \mid a_m = a^0)$$ |
| Type II error rate | Probability of failing to reject the null hypothesis when it is false | $$\Pr(p > \alpha \mid a_m \neq a^0)$$ |
| Sampling bias | Expected difference between the sample average treatment effect and the population average treatment effect | $$\mathbb{E}(a_{m_{\mathrm{sample}}} - a_{m_{\mathrm{population}}})$$ |
| Maximum cost | Maximum cost across possible realizations of the study | $$\max(\mathrm{cost})$$ |
| Bayesian learning | Difference between prior and posterior guesses of the value of the estimand | $$a_{m_{\mathrm{post}}} - a_{m_{\mathrm{pre}}}$$ |
| Value for money | Probability that a decision based on the estimated effect yields net benefits | |
| Success | Qualitative assessment of the success of a study | |
| Minimum detectable effect (MDE) | Smallest effect size for which the design achieves nominal power (e.g., 80%) | $$\min\{a_m : \Pr(p \leq \alpha \mid a_m) \geq 0.8\}$$ |
| Robustness | Joint probability of rejecting the null hypothesis across multiple tests | |
| Maximum proportion of subjects harmed | | $$\max(\Pr(\mathrm{harm}))$$ |

## 10.5 Diagnosand completeness

A design declaration is “diagnosand-complete” when the declaration contains enough information to calculate the diagnosand. The “bias” diagnosand requires that all four of M, I, D, and A are specified in sufficient detail because we need to be able to compare the answer under the model ($$a_m$$) to the simulated empirical answer ($$a_d$$). The “statistical power” diagnosand often does not require the inquiry $$I$$ to be specified, since we can mechanistically conduct null hypothesis significance tests without reference to estimands. Neither the “bias” nor “statistical power” diagnosands require any details of confidence interval construction, but without those details, the declaration is not diagnosand-complete for “coverage.”

Every design we have ever declared is diagnosand-complete for some diagnosands but not for others. Attaining diagnosand-completeness for each of the diagnosands in Table 10.6 in a single design declaration would be challenging and cumbersome, though of course not impossible. Total diagnosand-completeness is not even a goal for design declaration. We want to be diagnosand-complete for the set of diagnosands that matter most in a particular research setting, which is the topic of the next section.

## 10.6 Choosing diagnosands to explore design tradeoffs

Principle 3.8 says we should consider all important diagnosands – but not all diagnosands are important for every study. For example, in a descriptive study whose goal is to estimate the fraction of people in France who are left-handed, statistical power is irrelevant. A hypothesis test against the null hypothesis that zero percent of the people in France are left-handed is preposterous. We know for sure that the fraction is not zero, we just don’t know its precise value. A much more important diagnosand for this study would be RMSE (root-mean-squared-error), which is a measure of how well-estimated the estimand is that incorporates both bias and variance.

Often, we need to look at several diagnosands in order to understand what might be going wrong. If your design exhibits “undercoverage” (e.g., a procedure for constructing “95%” confidence intervals covers the estimand only 50% of the time), that might be because your standard errors are too small, because your point estimator is biased, or some combination of the two. In really perverse instances, you might have a biased point estimator that, thanks to overly wide confidence intervals, just happens to cover 95% of the time.

Many research design decisions involve trading off bias and variance. In trade-off settings, we may need to accept higher variance in order to decrease bias. Likewise, we may need to accept a bit of bias in order to achieve lower variance. The tradeoff is captured by mean-squared error, which is the average squared distance between $$a_d$$ and $$a_m$$. Of course, we would ideally like to have as low a mean-squared error as possible. We would like to achieve low variance and low bias simultaneously.
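This tradeoff can be made precise. Treating the answer under the model $$a_m$$ as fixed, mean-squared error decomposes exactly into variance plus squared bias:

$$\mathbb{E}\left[(a_d - a_m)^2\right] = \underbrace{\mathbb{V}(a_d)}_{\text{variance}} + \underbrace{\left(\mathbb{E}[a_d] - a_m\right)^2}_{\text{bias}^2}$$

A design change that lowers one term can be worthwhile even if it raises the other, so long as the sum falls.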

To illustrate, consider the following three designs as represented by three targets. The inquiry is the bullseye of the target. The data and answer strategies combine to generate a process by which arrows are shot towards the target. On the left, we have a very bad archer: even though the estimates are unbiased in the sense that they hit the bullseye “on average,” very few of the arrows are on target. In the middle, we have an excellent shot: they are both on target and low variance. On the right, we have an archer who is very consistent (low variance) but biased. The mean squared error is highest on the left and lowest in the middle.

The archery metaphor is common in research design textbooks because it effectively conveys the difference between variance and bias, but it does elide an important point. It really matters which target your archer is shooting at. Figure 10.5 shows a bizarre double-target representing two inquiries. The empirical strategy is unbiased and precise for the left inquiry, but it is clearly biased for the right inquiry. When we are describing the properties of an answer strategy, we have to be clear about which inquiry it is associated with.

MSE is an exactly equal weighting of variance and bias (squared). Yet many other weightings of these two diagnosands are possible, and different researchers will vary in their weightings.

In evaluating a research design diagnosis, we need to understand the weighting of all relevant diagnosands. We can think of this weighting as our research design utility function. Our utility function describes how important it is to study big questions, to shift beliefs in a research field, to overturn established findings, to obtain unbiased answers, and to get the sign of the inquiry right. Your utility function evaluated for a given design yields a utility, and these utilities can be compared across empirical designs (we call this process of comparison redesign, described in detail in the next section).

We often consider the diagnosand power on its own. This diagnosand is the probability of getting a statistically significant result, which of course depends on many things about your design including, crucially, the unknown magnitude of the parameter to be estimated. You can think of statistical power as the probability of a success, where success is defined as getting significant results. The conventional power target is 80% power. One could imagine redefining statistical power as “null risk,” or the probability of obtaining a null result. In these terms, the conventional power target entails a 20% null risk, or a one in five chance of “failure.” Those odds aren’t great, so we recommend designing studies with lower null risk. But considering power alone is also misleading: no researcher wants to design a study that is 80% powered but returns highly biased estimates. Another way of saying this is that researchers always care about both power and bias. How much they care about each feature determines the weight of power and bias in their utility function.

Diagnosands need not be about hypothesis testing or even about statistical analysis of the data at all. We often trade off how much we learn from a research design against its cost in terms of money and our time. We have financial and time budgets that place hard constraints on our designs, but at the margin, many researchers also wish to select cheaper (or shorter) designs in order to carry out more studies or finish their degrees sooner. Time and cost are diagnostic statistics too! We may wish to explore the maximum cost of a study or the maximum amount of time it would take.

Ethical considerations also often enter the process of assessing research designs, if implicitly. We can explicitly incorporate them into our utility function by valuing minimizing harm and maximizing the degree of informed consent requested of subjects. When collecting data, researchers often believe that they face a tradeoff between informing subjects about the purpose of the data collection (an ethical consideration, or a requirement of the IRB) on the one hand and the bias that comes from Hawthorne or demand effects on the other. We can incorporate these considerations into a research design diagnosis by specifying diagnostic statistics related to the amount of disclosure about the purposes of research or the number of subjects harmed in the research.

## 10.7 Diagnosis under model uncertainty

We are always uncertain about the model in M (Principle 3.3). If we were certain of M (or there was no real dispute about it), there would be no need to conduct new empirical research. Research design diagnosis can incorporate this uncertainty by evaluating the performance of the design under alternative models. For example, if we are unsure of the exact value of the intra-class correlation (ICC), we can simulate the design under a range of plausible ICC values. If we are unsure of the true average treatment effect, we can diagnose the power of the study over a range of plausible effect sizes. Uncertainty over model inputs like the means, variances, and covariances in data that will eventually be collected is a major reason to simulate under a range of plausible values.

We illustrate diagnosis under model uncertainty with the declaration below. Here we have a 200 unit two-arm trial in which we explicitly describe our uncertainty over the value of the true average treatment effect. In the potential_outcomes call, we write Y ~ runif(1, 0, 0.5) * Z + U, which indicates that the treatment effect in each run of the simulation is one draw from a uniform distribution between 0.0 and 0.5.

Declaration 10.2 $$~$$

design <-
  declare_model(
    N = 200,
    U = rnorm(N),
    # runif(1, 0, 0.5) generates one random ATE between 0 and 0.5
    potential_outcomes(Y ~ runif(1, 0, 0.5) * Z + U)
  ) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N, prob = 0.5)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE")

$$~$$

Figure 10.6 shows how the statistical power of this design varies over the set of model possibilities. We plot the possible effect sizes we entertain in the model on the horizontal axis and the statistical power on the vertical axis. We also plot a loess curve that flexibly smooths over the full distribution of effect sizes. We see that at low effect sizes, statistical power is quite poor. With only 200 subjects, we couldn’t achieve 80% statistical power unless the true effect size were approximately 0.45 standard deviations. The effect size at which a design achieves 80% power is often referred to as the minimum detectable effect size (MDE). This exercise shows how the MDE diagnosand is a summary of a design that admits uncertainty over the effect size. We return to this example in Section 11.4, when we redesign this study to learn how the MDE changes at different sample sizes.
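As a rough analytic cross-check on the simulated MDE, base R's power.t.test can solve for the effect size that yields 80% power. This approximation assumes a simple difference-in-means comparison with unit-variance outcomes, so it need not match the simulation-based figure exactly.

```r
# Solve for the effect size (delta) at which a two-arm trial with
# 100 subjects per arm attains 80% power at alpha = 0.05
mde <- power.t.test(n = 100, sd = 1, sig.level = 0.05, power = 0.8)$delta
round(mde, 2)
```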

What is “the” power of this design? Under one view, the true power of the design is whichever power value is associated with the true effect size. But under an agnostic view, the ex ante power of the design is a weighted average of all these power values, weighted by the researcher’s prior beliefs over the distribution of possible effect sizes.
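The agnostic view can be sketched analytically: draw effect sizes from the uniform prior used in Declaration 10.2 and average the resulting power values, here using power.t.test as a stand-in for full simulation of the design.

```r
set.seed(42)

# Prior beliefs over the true ATE, as in the declaration above
prior_draws <- runif(1000, 0, 0.5)

# Analytic power at each drawn effect size (100 subjects per arm)
powers <- sapply(prior_draws, function(delta) {
  power.t.test(n = 100, delta = delta, sd = 1, sig.level = 0.05)$power
})

# Ex ante power: the prior-weighted average of the power curve
mean(powers)
```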

### 10.7.1 Adjudicating between competing models

Principle 3.3 says we should entertain many models. The principle extends to competing models. Imagine that you believe $$M1$$ is true but your scholarly rival believes $$M2$$. Suppose that under $$M1$$, the treatment affects $$Y1$$ but not $$Y2$$. Your rival posits the reverse: under $$M2$$, the treatment affects $$Y2$$ but not $$Y1$$. In the spirit of scientific progress, the two of you engage in an “adversarial collaboration” and design a study together. The design you choose should, first, demonstrate that $$M1$$ is true if it is true and, second, demonstrate that $$M2$$ is true if it is true. To come to an agreement about the properties of the design, you will need to simulate the design under both models.

Declaration 10.3 $$~$$

M1 <-
  declare_model(
    N = 200,
    U = rnorm(N),
    potential_outcomes(Y1 ~ 0.2 * Z + U),
    potential_outcomes(Y2 ~ 0.0 * Z + U)
  )

M2 <-
  declare_model(
    N = 200,
    U = rnorm(N),
    potential_outcomes(Y1 ~ 0.0 * Z + U),
    potential_outcomes(Y2 ~ 0.2 * Z + U)
  )

IDA <-
  declare_inquiry(ATE1 = mean(Y1_Z_1 - Y1_Z_0),
                  ATE2 = mean(Y2_Z_1 - Y2_Z_0)) +
  declare_assignment(Z = complete_ra(N)) +
  declare_measurement(Y1 = reveal_outcomes(Y1 ~ Z),
                      Y2 = reveal_outcomes(Y2 ~ Z)) +
  declare_estimator(Y1 ~ Z, inquiry = "ATE1", label = "DIM1") +
  declare_estimator(Y2 ~ Z, inquiry = "ATE2", label = "DIM2")

design1 <- M1 + IDA
design2 <- M2 + IDA

$$~$$

We simulate the design, then count up how frequently each perspective receives support. If we define support by statistical significance (other metrics are of course possible), then you are correct if the effect of treatment on $$Y1$$ is significant but the effect on $$Y2$$ is nonsignificant. If the reverse pattern is obtained, your rival can claim victory. Two kinds of split decisions are possible: neither estimate is significant, or both are. By simulation, we can estimate the rates of these four possibilities, both under your beliefs and your rival’s.

Table 10.7: Design diagnosis under two alternative theories

| design  | Only theory 1 supported | Only theory 2 supported | Both supported | Neither supported |
|---------|-------------------------|-------------------------|----------------|-------------------|
| design1 | 0.26                    | 0.030                   | 0.028          | 0.682             |
| design2 | 0.03                    | 0.262                   | 0.028          | 0.680             |

The diagnosis shows that the study is responsive to the truth. When theory 1 is correct, the design is more likely to yield empirical evidence in favor of it; the reverse holds when theory 2 is correct. That said, the major concern facing this adversarial collaboration is that the study is too small to resolve the dispute. About two-thirds of the time – regardless of who is right! – neither theory receives support. This problem can be ameliorated either by elaborating more tests of each theoretical perspective or by increasing the size of the study.
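The mapping from a pair of p-values to the four columns of Table 10.7 can be written as a small helper function (the name classify_support and the 0.05 threshold are our own illustrative choices):

```r
# Classify one simulated study into one of the four adjudication outcomes
classify_support <- function(p1, p2, alpha = 0.05) {
  sig1 <- p1 < alpha # evidence that treatment affects Y1 (theory 1)
  sig2 <- p2 < alpha # evidence that treatment affects Y2 (theory 2)
  if (sig1 && !sig2) {
    "Only theory 1 supported"
  } else if (!sig1 && sig2) {
    "Only theory 2 supported"
  } else if (sig1 && sig2) {
    "Both supported"
  } else {
    "Neither supported"
  }
}

classify_support(p1 = 0.01, p2 = 0.40) # supports theory 1 only
```

Applying this function to each simulation run and tabulating the four outcomes yields proportions like those reported in Table 10.7.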

## 10.8 Diagnosing designs in code

Once a design is declared in code, diagnosing it is usually the easy part: diagnose_design handles all the details and bookkeeping for you. In this section, we outline how to simulate a design and learn about its properties by hand. We don’t provide built-in plotting features in DeclareDesign because every design diagnosis is a little bit different, but plotting simulations is a great way to get acquainted with a design and to explore design variations.

As an example, we declare a two-arm randomized experiment and simulate the design 100 times.

Declaration 10.4 $$~$$

effect_size <- 0.1
design <-
  declare_model(
    N = 100,
    U = rnorm(N),
    X = rnorm(N),
    potential_outcomes(Y ~ effect_size * Z + X + U)
  ) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE", label = "unadjusted") +
  declare_estimator(Y ~ Z + X, inquiry = "ATE", label = "adjusted")

$$~$$

simulations_df <- simulate_design(design, sims = 100)

You can diagnose your design on your own using dplyr tools. The basic pipeline below computes all of the default diagnosands; with the simulations data in hand, you can modify them, add your own, or group the data more flexibly.

simulations_df %>%
  group_by(design, inquiry, estimator, term) %>%
  summarize(
    bias = mean(estimate - estimand),
    rmse = sqrt(mean((estimate - estimand)^2)),
    power = mean(p.value < 0.05),
    coverage = mean(estimand <= conf.high & estimand >= conf.low),
    mean_estimate = mean(estimate),
    sd_estimate = sd(estimate),
    mean_se = mean(std.error),
    type_s_rate =
      mean((sign(estimate) != sign(estimand))[p.value < 0.05]),
    mean_estimand = mean(estimand),
    .groups = "drop"
  )

Many diagnosands are summaries of the sampling distribution of the estimates. To get a deeper sense of what the estimates look like, we can plot them as a histogram. In this design, we have two estimators, so we use facet_wrap to create two facets, one for each estimator’s sampling distribution. We overlay each sampling distribution with the value of the estimand (which is the same in both facets) so that you can see whether the sampling distribution is centered on the estimand; if it is not, the estimator is biased. The width of the sampling distribution indicates the precision of the estimates.

# first create summary for vertical lines
summary_df <-
  simulations_df %>%
  group_by(estimator) %>%
  summarize(estimand = mean(estimand))

# then plot simulations
ggplot(simulations_df) +
  geom_histogram(aes(estimate),
                 bins = 40, fill = "lightblue") +
  geom_vline(data = summary_df,
             aes(xintercept = estimand),
             lty = "dashed", color = "red") +
  annotate("text", y = 80, x = 0, label = "Estimand",
           color = "red", hjust = 1) +
  facet_wrap(~ estimator) +
  labs(x = "Estimate", y = "Count of simulations")

Diagnosing under model uncertainty is a crucial part of diagnosis. We want to understand when our design performs well and when it does not. A classic example in wide use is the power curve, which displays the power of a design (the probability of achieving statistical significance) across different possible effect sizes.

design <-
  declare_model(
    N = 200,
    U = rnorm(N),
    potential_outcomes(Y ~ runif(1, 0.0, 0.5) * Z + U)
  ) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE")

simulations_df <-
  simulate_design(design, sims = 500) %>%
  mutate(significant = if_else(p.value <= 0.05, 1, 0))

ggplot(simulations_df) +
  stat_smooth(aes(estimand, significant),
              method = "loess", formula = "y ~ x",
              color = "blue", fill = "lightblue") +
  geom_hline(yintercept = 0.8, color = "red", linetype = "dashed") +
  annotate("text", x = 0, y = 0.85, hjust = 0, color = "red",
           label = "Conventional power threshold = 0.8") +
  scale_y_continuous(breaks = seq(0, 1, 0.2)) +
  coord_cartesian(ylim = c(0, 1)) +
  theme(legend.position = "none") +
  labs(x = "Model parameter: true effect size",
       y = "Diagnosand: statistical power")

## 10.9 Summary

Diagnosis is the process of estimating the properties of research designs under uncertain beliefs about the world. We simulate under many alternative designs because we want to choose designs that are strong under a large range of model possibilities. Each run of a simulation generates a diagnostic statistic. We learn about the distribution of the diagnostic statistic by running the simulation repeatedly. Diagnosands are summaries of the distribution of diagnostic statistics. Which diagnosands are most important will vary from study to study and depends on the relative weight you place on one diagnosand versus another. Analytic design diagnosis is possible and can be quite powerful; we nevertheless recommend full simulation of research designs in order to learn about a range of diagnosands.