# 11 Redesigning

Redesign is the process of choosing the single empirical design to be implemented from a large family of possible designs. To make this choice, we systematically vary aspects of the data and answer strategies to understand their impact on the most important diagnosands. Redesign entails diagnosing many possible empirical designs over the range of plausible theoretical models, and comparing them.

A sample size calculation is the prototypical example of a redesign. Holding the model, inquiry, and answer strategy constant, we vary the “sample size” feature of the data strategy in order to understand how a diagnosand like the width of the confidence interval changes as we change $$N$$.

Not surprisingly, most designs get stronger as we allocate more resources to them. The expected width of a confidence interval could always be tighter, if only we had more subjects. Standard errors could always be smaller, if only we took more pre-treatment measurements. At some point, though, the gains are not worth the increased costs, so we settle for an affordable design that meets our scientific goals well enough. (Of course, if the largest affordable design has poor properties, no version of the study is worth implementing.) The knowledge-expense tradeoff is a problem that every empirical study faces. The purpose of redesign is to explore this and other tradeoffs in a systematic way.

## 11.1 Redesigning over data strategies

A redesign over a data strategy choice can be summarized with a “power curve.” We want to learn the power of a test at many sample sizes, either so we can learn the price of precision, or so we can learn what sample size is required for a minimum level of statistical power.

We start with a minimal design declaration: we draw samples of size $$N$$ and measure a single binary outcome $$Y$$, then conduct a test against the null hypothesis that the true proportion of successes is equal to 0.5.
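As a rough analytic cross-check on the simulation that follows, the power of this test can be approximated with the normal approximation to the binomial. The sketch below is an illustration under stated assumptions, not part of the declaration: it assumes a true proportion of 0.55 (the value used in Declaration 11.1), a two-sided test at the 0.05 level, and no continuity correction. The helper name approx_power is hypothetical.

```r
# Hypothetical helper: normal-approximation power of a two-sided,
# one-sample test of H0: p = p_null when the true proportion is p_true.
# Ignores the continuity correction that prop.test() applies by default.
approx_power <- function(N, p_true = 0.55, p_null = 0.5, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha / 2)
  se_null <- sqrt(p_null * (1 - p_null) / N)  # SE under the null
  se_true <- sqrt(p_true * (1 - p_true) / N)  # SE under the assumed truth
  shift <- p_true - p_null
  # Probability that the estimate lands in either rejection region
  pnorm((shift - z_crit * se_null) / se_true) +
    pnorm((-shift - z_crit * se_null) / se_true)
}

sample_sizes <- seq(100, 1000, 100)
power_curve <- data.frame(N = sample_sizes, power = approx_power(sample_sizes))
power_curve

# Smallest sample size on the grid that reaches 80% power
smallest_adequate_n <- min(power_curve$N[power_curve$power >= 0.8])
smallest_adequate_n
```

On this grid the approximation first crosses 0.8 at N = 800. Because it ignores the continuity correction and simulation error, it is best read as a sanity check on a simulated power curve, not a substitute for one.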

Declaration 11.1 A baseline declaration intended to be redesigned over $$N$$.

library(DeclareDesign) # declare_*(), redesign(), diagnose_designs()
library(broom) # assumed to be needed for tidy() on the prop.test() result

N <- 100

declaration_11.1 <-
  declare_model(N = N) +
  declare_measurement(Y = rbinom(n = N, size = 1, prob = 0.55)) +
  declare_test(handler =
    label_estimator(function(data) {
      # two-sided test of the null that the proportion equals 0.5
      test <- prop.test(x = table(data$Y), p = 0.5)
      tidy(test)
    }))

Diagnosis 11.1 Diagnosing over a redesign

To construct a power curve, we redesign our baseline declaration over values of $$N$$ that vary from 100 to 1000.

diagnosis_11.1 <-
  declaration_11.1 |>
  redesign(N = seq(100, 1000, 100)) |>
  diagnose_designs()

Redesigns are often easiest to understand graphically, as in Figure 11.1. At each sample size, we learn the associated level of statistical power. We might then choose the least expensive design (sample size 800) that meets a minimum power standard (0.8).

Figure 11.1: Redesigning with respect to sample size

#### 11.1.0.1 Redesigning under model uncertainty

When we diagnose studies, we do so over the many theoretical possibilities we entertain in the model. Through diagnosis, we learn how the values of the diagnosands change depending on model parameters. When we redesign, we explore a range of empirical strategies over the set of model possibilities. Redesign might indicate that one design is optimal under one set of assumptions, but that a different design would be preferred if a different set holds.

We illustrate this idea with an analysis of the minimum detectable effect (MDE) and how it changes at different sample sizes. The MDE diagnosand is complex. Whereas most diagnosands can be calculated with respect to a single possible model in $$M$$, the MDE is defined over a range of possible models. It is obtained by calculating the statistical power of the design over a range of possible effect sizes (holding the empirical design constant), then reporting the effect size that is associated with (typically) 80% statistical power.

MDEs can be a useful heuristic for thinking about the multiplicity of possibilities in the model. If the minimum detectable effect of a study is enormous (a one standard deviation effect, say), then we don’t have to think much harder about our beliefs about the true effect size.
Whatever our priors over the true effect size are, they probably put most of their weight on effects smaller than 1.0 SDs, so we can immediately conclude that the design is too small.

Declaration 11.2 contains uncertainty over the true effect size. This uncertainty is encoded in the runif(n = 1, min = 0, max = 0.5) command, which corresponds to our uncertainty over the ATE. It could be as small as 0.0 SDs or as large as 0.5 SDs, and we are equally uncertain about all the values in between. We redesign over three values of $$N$$: 100, 500, and 1000, then simulate each design. Each simulation run features a different true ATE somewhere between 0.0 and 0.5.

Declaration 11.2 Uncertainty over effect size design

N <- 100

declaration_11.2 <-
  declare_model(
    N = N,
    U = rnorm(N),
    # this runif(n = 1, min = 0, max = 0.5) generates 1 random ATE between 0 and 0.5
    potential_outcomes(Y ~ runif(n = 1, min = 0, max = 0.5) * Z + U)) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N, prob = 0.5)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE")

Diagnosis 11.2 Redesigning under uncertainty

diagnosis_11.2 <-
  declaration_11.2 |>
  redesign(N = c(100, 500, 1000)) |>
  diagnose_designs()

Figure 11.2 summarizes the simulations by smoothing over effect sizes: the loess curves describe the fraction of simulations that are significant at each effect size. The MDEs for each sample size can be read off the plot by examining the intersection of each curve with the dotted line at 80% statistical power. At N = 1000, the MDE is approximately 0.175 SDs. At N = 500, the MDE is larger, at approximately 0.225 SDs. If the design only includes 100 units, the MDE is some value higher than 0.5 SDs. We could of course expand the range of effect sizes considered in the diagnosis, but if effect sizes above 0.5 SDs are theoretically unlikely, we don’t even need to: we’ll need a design larger than 100 units in any case.

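A back-of-the-envelope MDE calculation is sometimes useful alongside the simulation. The sketch below assumes a two-arm trial with equal-sized arms, an outcome standard deviation of 1 in both arms, a two-sided test at the 0.05 level, and a normal approximation. The helper approx_mde is hypothetical, and its results need not match values read off a smoothed simulation curve exactly.

```r
# Hypothetical helper: approximate minimum detectable effect (in SD units)
# for a difference-in-means with N units split evenly across two arms.
approx_mde <- function(N, power = 0.8, alpha = 0.05, sd = 1) {
  se <- sd * sqrt(1 / (N / 2) + 1 / (N / 2))  # SE of the difference in means
  (qnorm(1 - alpha / 2) + qnorm(power)) * se
}

mde_by_n <- data.frame(N = c(100, 500, 1000))
mde_by_n$mde <- approx_mde(mde_by_n$N)
mde_by_n
```

Under these assumptions the MDE shrinks in proportion to one over the square root of N, so quadrupling the sample size halves it, and at N = 100 it exceeds 0.5 SDs.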
This diagnosis and redesign shows how our decisions about the data strategy depend on beliefs in the model. If we think the true effect size is likely to be 0.225 SDs, then a design with 500 subjects is a reasonable choice, but if it is smaller than that, we’ll want a larger study. Small differences in effect size have large consequences for design. Researchers who arrive at a plot like Figure 11.2 through redesign should be inspired to sharpen up their prior beliefs about the true effect size, either through literature review, meta-analysis of past studies, or through piloting (see Section 21.4).

Figure 11.2: Redesigning an experiment over model uncertainty about the true effect size

#### 11.1.0.2 Redesigning over two data strategy parameters

Sometimes, we have a fixed budget (in terms of financial resources, creative effort, or time), so the redesign question isn’t about how much to spend, but how to spend it across competing demands. For example, we might want to find the sample size N and the fraction of units to be treated prob that minimize a design’s error subject to a fixed budget. Data collection costs $2 per unit and treatment costs $20 per treated unit. We need to choose how many subjects to sample and how many to treat. We might rather add an extra 11 units to the control group (additional cost $2 * 11 = $22) than add one extra unit to the treatment group (additional cost $2 + $20 = $22).

We solve the optimization problem:

\begin{align*} & \underset{N, N_t}{\text{argmin}} & & E_M(L(a^{d} - a^{m}|D_{N, N_t})) \\ & \text{s.t.} & & 2 N + 20 N_t \leq 5000 \end{align*}

where $$L$$ is a loss function, increasing in the difference between $$a^{d}$$ and $$a^{m}$$, $$N$$ is the number of sampled units, and $$N_t$$ is the number of treated units.
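For intuition about where the optimum lies, a simplified version of this problem can be solved directly. The sketch below assumes squared-error loss, an unbiased difference-in-means estimator, unit outcome variance in both arms, and that the full $5,000 budget is spent. Under those assumptions the expected loss reduces to the sampling variance, one over the number treated plus one over the number of controls. All variable and function names are illustrative and not part of DeclareDesign.

```r
budget <- 5000
cost_control <- 2        # data collection only
cost_treated <- 2 + 20   # data collection plus treatment

# Sampling variance of the difference in means when the whole budget is
# spent and a share `prob` of units is treated (unit outcome variance assumed)
variance_at_budget <- function(prob) {
  N <- budget / (cost_control * (1 - prob) + cost_treated * prob)
  1 / (N * prob) + 1 / (N * (1 - prob))
}

# Numerical minimum over treated shares between 1% and 99%
numeric_optimum <- optimize(variance_at_budget, interval = c(0.01, 0.99))$minimum

# Closed-form square-root allocation rule, for comparison
closed_form <- 1 / (1 + sqrt(cost_treated / cost_control))

round(c(numeric = numeric_optimum, closed_form = closed_form), 3)
```

Both approaches put the variance-minimizing treated share at roughly 0.23, well below one half. Simulation-based redesign remains the general tool, since it does not rely on these simplifying assumptions.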

We can explore this optimization with a bare-bones declaration of a two-arm trial that depends on two separate data strategy parameters, N and prob:

Declaration 11.3 Bare-bones two-arm trial

N <- 100
prob <- 0.5 # assumed baseline value; redesign() overrides it below

declaration_11.3 <-
  declare_model(N = N, U = rnorm(N),
                potential_outcomes(Y ~ 0.2 * Z + U)) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N = N, prob = prob)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE")

Diagnosis 11.3 Redesigning over two parameters

We redesign, varying those two parameters over reasonable ranges: 100 to 1000 subjects, with probabilities of assignment from 0.1 to 0.5. The redesign function generates designs with all combinations of the two parameters. We want to consider the consequences of these data strategy choices for two diagnosands: cost and a very common loss function, root mean squared error (RMSE).

diagnosands <-
  declare_diagnosands(
    cost = unique(N * 2 + prob * N * 20),
    rmse = sqrt(mean((estimate - estimand) ^ 2)))

diagnosis_11.3 <-
  declaration_11.3 |>
  redesign(N = seq(100, 1000, 25),
           prob = seq(0.1, 0.5, 0.2)) |>
  diagnose_designs(diagnosands = diagnosands)
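To get a feel for what this redesign covers before running it, the grid of parameter combinations and their costs can be previewed in base R. This is only an illustration of the crossing, assuming the cost structure stated above ($2 per sampled unit plus $20 per treated unit). It does not call DeclareDesign, and design_grid is a hypothetical name.

```r
# All combinations of the two data strategy parameters
design_grid <- expand.grid(
  N = seq(100, 1000, 25),
  prob = seq(0.1, 0.5, 0.2)
)

# Cost of each design: $2 per sampled unit plus $20 per treated unit
design_grid$cost <- design_grid$N * 2 + design_grid$prob * design_grid$N * 20

# 37 sample sizes crossed with 3 assignment probabilities
nrow(design_grid)

# Cost of one particular design (tolerance guards against floating point error)
subset(design_grid, N == 600 & abs(prob - 0.3) < 1e-8)
```

Each row corresponds to one design that would be simulated, so the size of this grid also gives a sense of how long the diagnosis will take to run.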

The diagnosis is represented in Figure 11.3. The top panel shows the cost of empirical designs at three probabilities of assignment over many sample sizes. The bottom panel shows the RMSE of each design. According to this diagnosis, the best combination that can be achieved for less than $5,000 is N = 600 with prob = 0.3. This conclusion is in mild tension with the common design advice that under many circumstances, balanced designs are preferable (see Section 10.3.1 in the design library for an in-depth discussion of this point). Here, untreated subjects are so much less expensive than treated subjects that we want to tilt the design towards having a larger control group. How far to tilt depends on model beliefs as well as the cost structure of the study.

Figure 11.3: Redesigning an experiment with respect to RMSE and subject to a budget constraint

## 11.2 Redesigning over answer strategies

Redesign can also take place over possible answer strategies. An inquiry like the average treatment effect could be estimated using many different estimators: difference-in-means, logistic regression, covariate-adjusted ordinary least squares, the stratified estimator, doubly robust regression, targeted maximum likelihood regression, regression trees. The list of possibilities is long. Redesign is an opportunity to explore how well many alternative analysis approaches work.

A key tradeoff in the choice of answer strategy is the bias-variance tradeoff. Some answer strategies exhibit higher bias but lower variance while others have lower bias but higher variance. Choosing which side of the bias-variance tradeoff to take is complicated and the process for choosing among alternatives must be motivated by the scientific goals at hand.

A common heuristic for trading off bias and variance is the mean squared error (MSE) diagnosand. Mean squared error is equal to the square of bias plus variance, which is to say MSE weighs bias and variance equally.
Typically, researchers choose among alternative answer strategies by minimizing MSE. If in your scientific context, bias is more important than variance, you might choose an answer strategy that accepts slightly more variance in exchange for a decrease in bias.

To illustrate the bias-variance tradeoff, Declaration 11.4 describes a setting in which the goal is to estimate the conditional expectation of some outcome variable Y with respect to a covariate X. The true conditional expectation function (produced by the custom dip function) is not smooth, but we estimate it with smooth polynomial functions of increasing order.

Declaration 11.4 Conditional expectation function design

library(tidyverse) # purrr, dplyr, tidyr, stringr, and tibble are all used below

# the true CEF: rises linearly up to X = 1, then follows a parabola
dip <- function(x) (x <= 1) * x + (x > 1) * (x - 2) ^ 2 + 0.2

x_range <- seq(from = 0, to = 3, length.out = 50)
polynomial_degrees <- 1:6

declaration_11.4 <-
  declare_model(N = 100, X = runif(N, 0, 3)) +
  declare_inquiry(
    X = x_range,
    inquiry = str_c("X_", X),
    estimand = dip(X),
    data = NULL,
    handler = tibble
  ) +
  declare_measurement(Y = dip(X) + rnorm(N, 0, .5)) +
  declare_estimator(handler = function(data) {
    # fit one polynomial regression per degree, then predict over x_range
    map(polynomial_degrees, ~lm(Y ~ poly(X, .), data = data)) |>
      set_names(nm = str_c("A", polynomial_degrees)) |>
      map_dfc(~predict(., newdata = tibble(X = x_range))) |>
      bind_cols(tibble(X = x_range)) |>
      mutate(inquiry = str_c("X_", X)) |>
      pivot_longer(cols = starts_with("A"),
                   names_to = "estimator",
                   values_to = "estimate")
  })

Figure 11.4 shows one draw of this design: the predictions of the CEF made by polynomial regressions of increasing flexibility. A polynomial of order 1 is just a straight line, a polynomial of order 2 is a quadratic, order 3 is a cubic, etc. Aronow and Miller (2019) show (Theorem 4.3.3) that even nonlinear CEFs can be approximated to an arbitrary level of precision by increasing the order of the polynomial regression used to estimate them, given enough data. The figure provides some intuition for why.
As the order of the polynomial increases, the line becomes more flexible and can accommodate unexpected twists and turns in the CEF.

Figure 11.4: Estimating a CEF with polynomials of increasing order

Diagnosis 11.4 Conditional expectation function diagnosis

Increasing the order of the polynomial decreases bias, but this decrease comes at the cost of variance. Figure 11.5 shows how, when the order increases, bias goes down while variance goes up. Mean squared error is one way to trade these two diagnosands off against one another. Here, MSE is minimized with a polynomial of order 3. If we were to care much more about bias than variance, perhaps we would choose a polynomial of even higher order.

diagnosis_11.4 <- diagnose_design(declaration_11.4)

Figure 11.5: The bias-variance tradeoff when choosing the flexibility of polynomial approximations to the CEF

#### 11.2.0.1 Redesigning over estimators: Logit, Probit, or OLS?

A perennial debate among social scientists is whether to use a binary choice model like logit or probit when the outcome is binary, or if the workhorse OLS estimator is preferable. Unsurprisingly, who is right in this debate depends on other features of the research design. For example, in an observational descriptive study in which the inquiry is a prediction for the probability of success among a particular group of units, explicitly accounting for the binary nature of the outcome variable is important; OLS can generate predictions that lie outside the theoretically permissible zero to one range. However, in an experimental causal study in which the inquiry is the average treatment effect, it is not possible for a comparison of treatment and control group means (as estimated by OLS) to generate a nonsense treatment effect estimate in the sense of being outside the theoretically permissible -100 to +100 percentage point range.
By contrast, binary choice models like logit or probit can generate theoretically impossible average treatment effect estimates once average marginal effect estimates are calculated. This problem is demonstrated in the following declaration, diagnosis, and redesign:

Declaration 11.5 Choosing logit, probit, or OLS

library(margins)

# tidy the average marginal effects from a fitted glm
tidy_margins <- function(x) {
  tidy(margins(x, data = x$data), conf.int = TRUE)
}

N <- 10

declaration_11.5 <-
  declare_model(
    N = N,
    U = rnorm(N),
    potential_outcomes(Y ~ rbinom(N, 1, prob = 0.2 * Z + 0.6))) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N, prob = 0.5)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z,
                    inquiry = "ATE",
                    term = "Z",
                    label = "OLS") +
  declare_estimator(
    Y ~ Z,
    .method = glm,
    family = binomial("logit"),
    .summary = tidy_margins,
    inquiry = "ATE",
    term = "Z",
    label = "logit"
  ) +
  declare_estimator(
    Y ~ Z,
    .method = glm,
    family = binomial("probit"),
    .summary = tidy_margins,
    inquiry = "ATE",
    term = "Z",
    label = "probit"
  )

Diagnosis 11.5 Redesigning alternative estimators over sample sizes

diagnosis_11.5 <-
declaration_11.5 |>
redesign(N = seq(10, 100, by = 10)) |>
diagnose_designs()

Figure 11.6 displays the distribution of average treatment effect estimates from this design, over three answer strategies (OLS, logit, and probit) and 10 sample sizes. As sample size increases, the differences across these three approaches vanish. Eventually, all three sampling distributions center on the ATE. However, at small sample sizes, the three approaches do differ in an important respect. The estimates from OLS are always constrained to the theoretical minimum and maximum average treatment effects of -100 percentage points to 100 percentage points, whereas logit and probit sometimes generate estimates outside this theoretical range.

This example underlines how the choice among estimators depends so deeply on the inquiry. When the inquiry is a predicted probability, OLS has a bad property of sometimes generating predictions outside the zero-one range. When the inquiry is an average treatment effect, the linear model works just fine, but the nonlinear binary choice models can get tripped up when calculating average marginal effects.

Figure 11.6: Sampling distribution of three estimators, varying sample size
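The point about predicted probabilities is easy to see in a small example. The sketch below uses made-up data (ten observations of a binary outcome and a continuous covariate, chosen purely for illustration) and base R's lm and glm. It concerns predicted probabilities only, not the average marginal effects computed in Declaration 11.5.

```r
# Made-up illustrative data: binary outcome, continuous covariate
toy <- data.frame(
  X = 1:10,
  Y = c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)
)

ols_fit <- lm(Y ~ X, data = toy)
logit_fit <- glm(Y ~ X, family = binomial("logit"), data = toy)

# Predicted probabilities at the observed covariate values
predictions <- data.frame(
  X = toy$X,
  ols = predict(ols_fit),
  logit = predict(logit_fit, type = "response")
)

# OLS predictions dip below 0 and exceed 1 at the extremes;
# logit predictions stay strictly inside the unit interval
range(predictions$ols)
range(predictions$logit)
```

With these data the fitted OLS line crosses zero and one inside the observed range of X, while the logit link keeps every prediction between zero and one by construction.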

## 11.3 Summary

We use the redesign process to learn about design tradeoffs. Most often, the tradeoff is some measure of design quality like power against cost. We want to trade quality off against cost until we find a good enough study for the budget. Sometimes the tradeoff is across design parameters: holding budget fixed, should we sample more clusters or should we sample more people within clusters? Sometimes the tradeoff is across diagnosands: more flexible answer strategies may exhibit lower bias but higher variance. Minimizing RMSE weighs bias and variance equally, but other weightings are possible. Tradeoffs across diagnosands are implicit in many design decisions, but design diagnosis and redesign can help make those tradeoffs explicit.
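The equal weighting of bias and variance in MSE can be checked numerically. The sketch below is a stripped-down base R analogue of the polynomial example in Section 11.2, not the declaration itself. It reuses the dip function and the noise level from Declaration 11.4, but the evaluation point (X = 1), the number of simulations, the seed, and the helper names are assumptions made for illustration.

```r
set.seed(42)

dip <- function(x) (x <= 1) * x + (x > 1) * (x - 2)^2 + 0.2
x_eval <- 1                 # point at which the CEF is estimated (illustrative)
estimand <- dip(x_eval)

# One simulated study: fit a polynomial of the given degree, predict at x_eval
simulate_estimate <- function(degree, N = 100) {
  sim_data <- data.frame(X = runif(N, 0, 3))
  sim_data$Y <- dip(sim_data$X) + rnorm(N, 0, 0.5)
  fit <- lm(Y ~ poly(X, degree), data = sim_data)
  unname(predict(fit, newdata = data.frame(X = x_eval)))
}

# Diagnosands for one polynomial degree, based on 500 simulated studies
diagnose_degree <- function(degree, sims = 500) {
  estimates <- replicate(sims, simulate_estimate(degree))
  bias <- mean(estimates) - estimand
  variance <- mean((estimates - mean(estimates))^2)  # 1/n denominator
  mse <- mean((estimates - estimand)^2)
  c(degree = degree, bias_sq = bias^2, variance = variance, mse = mse)
}

results <- as.data.frame(t(sapply(1:6, diagnose_degree)))
results

# MSE equals squared bias plus variance, each entering with weight one
all.equal(results$mse, results$bias_sq + results$variance)
```

One possible alternative weighting would replace mse with a weighted sum such as w * bias_sq + (1 - w) * variance, choosing w above one half when bias matters more than variance.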