11 Redesign

Redesign is the process of choosing the single empirical design you will implement from a very large family of possible designs. To make this choice, you systematically vary aspects of the data and answer strategies to understand their impact on the most important diagnosands. Redesign entails diagnosing many possible empirical designs over the range of plausible theoretical models, and comparing them.

A sample size calculation is the prototypical example of a redesign. Holding the model, inquiry, and answer strategy constant, we vary the “sample size” feature of the data strategy in order to understand how a diagnosand like the width of the confidence interval changes as we change \(N\).

Not surprisingly, most designs get stronger as we allocate more resources to them. The expected width of a confidence interval could always be tighter, if only we had more subjects. Standard errors could always be smaller, if only we took more pre-treatment measurements. At some point, though, the gains are not worth the increased costs, so we settle for an affordable design that meets our scientific goals well enough. (Of course, if the largest affordable design has poor properties, no version of the study is worth implementing.) The knowledge-expense tradeoff is a problem that every empirical study faces. The purpose of redesign is to explore this and other tradeoffs in a systematic way.

11.1 Power curve example

A power curve is a common tool for redesign. We want to learn the power of a test at many sample sizes, either so we can learn the price of precision, or so we can learn what sample size is required for a minimum level of statistical power.

We start with a minimal design declaration: we draw samples of size \(N\) and measure a single binary outcome \(Y\), then conduct a test against the null hypothesis that the true proportion of successes is equal to 0.5.

Declaration 11.1 Power curve design

N <- 100  # baseline sample size; varied below via redesign

design <-
  declare_model(N = N) +
  declare_measurement(Y = rbinom(n = N, size = 1, prob = 0.55)) +
  declare_test(
    handler =
      function(data) {
        test <- prop.test(x = table(data$Y), p = 0.5)
        tidy(test)
      }
  )


To construct a power curve, we redesign over values of \(N\) that vary from 100 to 1000.

diagnosis <- 
  design %>%
  redesign(N = seq(100, 1000, 100)) %>%
  diagnose_designs()

Redesigns are often easiest to understand graphically, as in Figure 11.1. At each sample size, we learn the associated level of statistical power. We might then choose the least expensive design (sample size 800) that meets a minimum power standard (0.8).

Figure 11.1: Redesigning a sample survey with respect to power
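A power curve like this one can be drawn directly from the diagnosis object. The sketch below is one way to do it, assuming ggplot2 is available and that the default power diagnosand and the redesigned N appear as columns of the diagnosands data frame:

library(ggplot2)

# one row per redesigned value of N
power_df <- diagnosis$diagnosands_df

ggplot(power_df, aes(x = N, y = power)) +
  geom_point() +
  geom_line() +
  geom_hline(yintercept = 0.8, linetype = "dashed") +
  labs(x = "Sample size (N)", y = "Statistical power")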

11.2 Redesign over multiple design parameters

Sometimes, we have a fixed budget (in terms of financial resources, creative effort, or time), so the redesign question isn't how much to spend, but how to spend it across competing demands. For example, we might want to find the sample size N and the fraction of units to be treated prob that minimize a design's error subject to a fixed budget. Suppose data collection costs $2 per unit and treatment costs $20 per treated unit. We need to choose how many subjects to sample and how many to treat. We might rather add 11 extra units to the control group (additional cost: 11 × $2 = $22) than add one extra unit to the treatment group (additional cost: $2 + $20 = $22).

We solve the optimization problem:

\[\begin{align*} & \underset{N, N_t}{\text{argmin}} & & E_M\left[L\left(a^{d} - a^{m} \mid D_{N, N_t}\right)\right] \\ & \text{s.t.} & & 2 N + 20 N_t \leq 5000 \end{align*}\]

where \(L\) is a loss function that is increasing in the difference between \(a^{d}\) and \(a^{m}\), and \(N_t\) is the number of treated units.

We can explore this optimization with a bare-bones declaration of a two-arm trial that depends on two data strategy parameters, N and prob:

Declaration 11.2 Bare-bones two arm trial

N <- 100     # baseline values; varied below via redesign
prob <- 0.5

design <-
  declare_model(N = N, U = rnorm(N),
                potential_outcomes(Y ~ 0.2 * Z + U)) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N = N, prob = prob)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE")


We redesign, varying those two parameters over reasonable ranges: 100 to 1,000 subjects, with probabilities of assignment from 0.1 to 0.5. The redesign function generates designs with all combinations of the two parameters. We then consider the consequences of these data strategy choices for two diagnosands: cost and root mean squared error (RMSE), a very common loss function.

diagnosands <-
  declare_diagnosands(cost = unique(N * 2 + prob * N * 20),
                      rmse = sqrt(mean((estimate - estimand) ^ 2)))

diagnosis <-
  design %>%
  redesign(N = seq(100, 1000, 50),
           prob = seq(0.1, 0.5, 0.2)) %>%
  diagnose_designs(diagnosands = diagnosands)

The diagnosis is represented in Figure 11.2. The top panel shows the cost of the empirical designs, at three probabilities of assignment over many sample sizes. The bottom panel shows the RMSE of each design. According to this diagnosis, the best combination that can be achieved for less than $5,000 is N = 600 with prob = 0.3. This conclusion is in mild tension with the common design advice that, under many circumstances, balanced designs are preferable (see Section 17.1.1 in the design library for an in-depth discussion of this point). Here, untreated subjects are so much less expensive than treated subjects that we want to tilt the design toward a larger control group. How far to tilt depends on model beliefs as well as the cost structure of the study.

Figure 11.2: Redesigning a two-arm trial over sample size and probability of assignment
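One way to pull the winning design out of the diagnosis is to filter the diagnosands data frame directly. A minimal sketch, assuming dplyr, the diagnosand names declared above, and that N and prob appear as columns:

library(dplyr)

diagnosis$diagnosands_df %>%
  filter(cost <= 5000) %>%   # keep affordable designs only
  arrange(rmse) %>%          # lowest error first
  select(N, prob, cost, rmse) %>%
  head()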

11.3 Redesign over answer strategies

Redesign can also take place over possible answer strategies. An inquiry like the average treatment effect could be estimated using many different estimators: difference-in-means, logistic regression, covariate-adjusted ordinary least squares, the stratified estimator, doubly robust regression, targeted maximum likelihood estimation, regression trees – the list of possibilities is long. Redesign is an opportunity to explore how well the alternative analysis approaches work.

A key tradeoff in the choice of answer strategy is the bias-variance tradeoff. Some answer strategies exhibit higher bias but lower variance, while others have lower bias but higher variance. Choosing which side of the bias-variance tradeoff to take is complicated, and the choice among alternatives must be motivated by the scientific goals at hand.

A common heuristic for trading off bias and variance is the mean squared error (MSE) diagnosand. Mean squared error is equal to squared bias plus variance, which is to say that MSE weights bias and variance equally. Typically, researchers choose among alternative answer strategies by minimizing MSE. If, in your scientific context, bias is more important than variance, you might choose an answer strategy that accepts slightly more variance in exchange for a decrease in bias.
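In the notation used above, with \(a^{d}\) the answer produced by the design and \(a^{m}\) the answer under the model, and holding \(a^{m}\) fixed, the decomposition is:

\[\mathrm{MSE}(a^{d}) = E_M\left[(a^{d} - a^{m})^2\right] = \underbrace{\left(E_M[a^{d}] - a^{m}\right)^2}_{\text{bias}^2} + \underbrace{V_M[a^{d}]}_{\text{variance}}\]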

To illustrate the bias-variance tradeoff, we consider a setting in which the goal is to estimate the conditional expectation of some outcome variable Y with respect to a covariate X. The design declaration below depends on three user-defined functions (dip, cef_inquiry, and cef_estimator) that we have hidden so as not to clog the narrative flow of the section.

Declaration 11.3 Conditional expectation function design

design <-
  declare_model(
    N = 100,
    X = runif(N, 0, 3)) +
  declare_inquiry(handler = cef_inquiry) +
  declare_measurement(Y = dip(X) + rnorm(N, 0, .5)) +
  declare_estimator(handler = cef_estimator)

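For readers who want to tinker, here is one hypothetical way such helpers could be written. This is a sketch, not the hidden code used here: the functional form of dip, the grid of X values, and the range of polynomial orders are all assumptions.

# A nonlinear CEF with a "dip" in it (assumed functional form)
dip <- function(x) sin(2 * x) + 0.5 * x

# Grid of X values at which the CEF is defined and estimated (assumed)
x_grid <- seq(0.1, 2.9, by = 0.2)

# Inquiry handler: the true CEF evaluated on the grid
cef_inquiry <- function(data, ...) {
  data.frame(inquiry = paste0("cef_", x_grid),
             estimand = dip(x_grid))
}

# Estimator handler: polynomial regressions of order 1 through 9,
# each predicting the CEF at the same grid points
cef_estimator <- function(data, ...) {
  do.call(rbind, lapply(1:9, function(order) {
    fit <- lm(Y ~ poly(X, order), data = data)
    data.frame(estimator = paste0("polynomial_", order),
               inquiry = paste0("cef_", x_grid),
               estimate = predict(fit, newdata = data.frame(X = x_grid)))
  }))
}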

Figure 11.3 shows one draw of this design – the predictions of the CEF made by nine regressions of increasing flexibility. A polynomial of order 1 is just a straight line, a polynomial of order 2 is a quadratic, order 3 is a cubic, and so on. Aronow and Miller (2019) show (Theorem 4.3.3) that, given enough data, even nonlinear CEFs can be approximated to an arbitrary level of precision by increasing the order of the polynomial regression used to estimate them. The figure provides some intuition for why. As the order of the polynomial increases, the line becomes more flexible and can accommodate unexpected twists and turns in the CEF.

Figure 11.3: Estimating a CEF with polynomials of increasing order

Increasing the order of the polynomial decreases bias, but this decrease comes at the cost of variance. Figure 11.4 shows how, as the order increases, bias goes down while variance goes up. Mean squared error is one way to trade these two diagnosands off against one another. Here, MSE is minimized with a polynomial of order 3. If we were to care much more about bias than variance, perhaps we would choose a polynomial of even higher order.

Figure 11.4: Bias, variance, and mean squared error of polynomial estimators of increasing order

11.4 Redesign under model uncertainty

When we diagnose studies, we do so over the many theoretical models we entertain in \(M\). Through diagnosis, we learn how the values of the diagnosands change depending on model parameters. When we redesign, we explore a range of empirical designs over the set of model possibilities. Redesign might indicate that one design is optimal under one set of assumptions, but that a different design would be preferred if a different set holds.

We illustrate this idea with an analysis of the minimum detectable effect (MDE) and how it changes at different sample sizes. The MDE diagnosand is complex. Whereas most diagnosands can be calculated with respect to a single possible model in \(M\), the MDE is defined over a range of possible models. It is obtained by calculating the statistical power of the design over a range of possible effect sizes (holding the empirical design constant), then reporting the smallest effect size that is associated with (typically) 80% statistical power.

MDEs can be a useful heuristic for thinking about the multiplicity of possibilities in the model. If the minimum detectable effect of a study is enormous – a one standard deviation effect, say – then we don’t have to think much harder about our beliefs about the true effect size. Whatever our priors over the true effect size are, they are probably smaller than 1.0 SDs, so we can immediately conclude that the design is too small.

The declaration below contains uncertainty over the true effect size. This uncertainty is encoded in the runif(n = 1, min = 0, max = 0.5) call, which corresponds to our uncertainty over the ATE. It could be as small as 0.0 SDs or as large as 0.5 SDs, and we are equally uncertain about all the values in between. We redesign over three values of \(N\): 100, 500, and 1000, then simulate each design. Each run of the simulation features a different true ATE somewhere between 0.0 and 0.5.

Declaration 11.4 Uncertainty over effect size design

N <- 100
design <-
  declare_model(N = N, U = rnorm(N),
                # runif(n = 1, min = 0, max = 0.5) draws one random ATE between 0 and 0.5
                potential_outcomes(Y ~ runif(n = 1, min = 0, max = 0.5) * Z + U)) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N, prob = 0.5)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE")


designs <- redesign(design, N = c(100, 500, 1000))
simulations <- simulate_designs(designs, sims = 500)

Figure 11.5 summarizes the simulations by smoothing over effect sizes: the loess curves describe the fraction of simulations that are statistically significant at each effect size. The MDEs for each sample size can be read off the plot by examining the intersection of each curve with the dotted line at 80% statistical power. At N = 1000, the MDE is approximately 0.175 SDs. At N = 500, the MDE is larger, at approximately 0.225 SDs. If the design only includes 100 units, the MDE is some value higher than 0.5 SDs. We could of course expand the range of effect sizes considered in the diagnosis, but if effect sizes above 0.5 SDs are theoretically unlikely, we don't even need to; we'll need a design larger than 100 units in any case.
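A figure of this kind can be built from the simulations data frame. The sketch below is one way to do it, assuming ggplot2 and dplyr are available and using the columns produced by the default estimator's tidy output (estimand, p.value) along with the redesigned N:

library(ggplot2)
library(dplyr)

simulations %>%
  mutate(significant = as.numeric(p.value <= 0.05)) %>%
  ggplot(aes(x = estimand, y = significant, color = factor(N))) +
  geom_smooth(method = "loess", se = FALSE) +       # power at each effect size
  geom_hline(yintercept = 0.8, linetype = "dotted") +
  labs(x = "True effect size (SDs)", y = "Statistical power", color = "N")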

This diagnosis and redesign exercise shows how our decisions about the data strategy depend on beliefs in the model. If we think the true effect size is likely to be 0.225 SDs, then a design with 500 subjects is a reasonable choice, but if it is smaller than that, we'll want a larger study. Small differences in effect size have large consequences for design. Researchers who arrive at a plot like Figure 11.5 through redesign should be inspired to sharpen their prior beliefs about the true effect size, either through literature review, meta-analysis of past studies, or piloting (see Section 21.5).

Figure 11.5: Redesigning an experiment over uncertainty about the true effect size

11.5 Redesigning in code

This chapter has already displayed the main computational approaches to redesign. The basic principle is that we create a list of designs, which is then passed to simulate_designs or diagnose_designs. (The plural versions of these functions are identical to their singular counterparts; we provide both so the code can speak for itself a little more easily.)

You can make lists of designs to redesign across directly with list:

designs <- list(design1, design2)

More often, you’ll vary designs over a parameter with redesign. Here, we imagine we have already declared a design with an N parameter that we allow to take three values.

designs <- redesign(design, N = c(100, 200, 300))
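If the design has a second parameter, say prob, redesign crosses the supplied values and returns every combination, as in Section 11.2. A hypothetical example (assuming prob was defined when the design was declared):

# two parameters yield all 3 x 2 = 6 combinations
designs <- redesign(design, N = c(100, 200, 300), prob = c(0.25, 0.5))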

Whichever way you create the list of designs, you can then diagnose all of them with:

diagnosis <- diagnose_designs(designs)

11.6 Summary

This section has explored some of the ways that you can use the redesign process to learn about design tradeoffs. Most often, the tradeoff is between some measure of design quality, like power, and cost. We want to trade quality off against cost until we find a good enough study for the budget. Sometimes the tradeoff is across design parameters – should I sample more clusters, or should I sample more people within clusters, holding costs constant? Sometimes the tradeoff is across diagnosands – more flexible answer strategies may exhibit lower bias but higher variance. Minimizing RMSE weights bias and variance equally, but other weightings are possible. Tradeoffs across diagnosands are implicit in many design decisions, but design diagnosis and redesign can help make those tradeoffs explicit.