9 Choosing an answer strategy

Your answer strategy is your plan for what you will do with the information gathered from the world in order to generate an answer to the inquiry. Qualitative and quantitative methods courses overflow with advice about which answer strategies to choose. Under what conditions should you use ordinary least squares, when should you use logit? When is a machine learning algorithm the appropriate choice and when would a comparative case study be more informative? When is no answer strategy worth pursuing because of the fundamental limitations of the data strategy?

Evaluation of an answer strategy depends on your ultimate goals: what is the answer to be used for? A perfect answer is generally elusive in empirical research and so in practice we often need to select among strategies that might all be appropriate, but which come with different strengths and weaknesses. For instance some might suffer less from bias while others might be more precise. In other words, which answer strategy is best depends on what diagnosands you care about.

This chapter first describes four types of answer strategies: point estimators, tests, Bayesian posteriors, and interval estimation. Second, we describe how to provide estimates of uncertainty along with answers. Uncertainty estimates can often be thought of as empirical estimates of a diagnosand like the true standard error, or the probability of obtaining an estimate larger than a specified value under a null model. Third, we revisit several design principles that guide how to choose an answer strategy. Principle 3.6 emphasizes that answer strategies are procedures: they describe what analyses will be undertaken under different contingencies. Principle 3.7 says we should “seek M:I::D:A parallelism.” We want to choose answer strategies that strengthen the analogy between the empirical and theoretical halves of the research. How best to choose A in order to strengthen the analysis depends, of course, on the other three elements of research design: the model, the inquiry, and the data strategy. Principle 3.10 is a reminder to diagnose holistically: we can’t choose answer strategies in isolation from the other design elements because the influence of the choice of estimator on diagnosands depends on those other elements.

9.1 Types of answer strategies

9.1.1 Point estimation

The most familiar class of answer strategies are point-estimators that produce estimates of scalar parameters. The sample mean of an outcome, the difference-in-means estimate, the coefficient on a variable in a logistic regression, and the estimated number of topics in a text corpus are all examples of point estimators.

To illustrate point estimation in general, we’ll try to estimate the average age of the citizens of a small village in Italy. Our model is straightforward – the citizens of the small village all have ages – and the inquiry is the average of them. In our data strategy, we randomly sample three citizens whose ages are then measured via survey to be 5, 15, and 25. Our answer strategy is the sample mean estimator, so our estimate of the population average age is a point estimate of 15.

Is this a good answer? It is almost certainly wrong in the sense that the population average age in the small village is probably not fifteen (Italy’s population is aging!), but we don’t know how wrong because, of course, we don’t actually know the value of the inquiry under study. We have instead to evaluate the properties of the procedure. Under a random sampling design – even an egregiously stingy random sampling design that only selects three citizens! – we can justify the approach on the basis of the “bias” diagnosand. The average of all the answers you would get if you repeated the data strategy (random sampling) and the answer strategy (taking the sample average) over and over would correspond to the correct answer.

Though the bias diagnosand is zero, the variance diagnosand is not. Over many repeated draws, the estimates bounce around dramatically. The design in Declaration 9.1 can be used to generate a view of what answers we might get for a particular distribution when we choose just three subjects for our sample. We use a linear regression of the age variable on a constant to estimate the sample mean. Using OLS in this way is a neat trick for estimating sample means along with various uncertainty statistics, discussed in more detail in the next section.

Declaration 9.1 Italian village design

design <-
  declare_model(N = 100, age = sample(0:80, size = N, replace = TRUE)) +
  declare_inquiry(mean_age = mean(age)) +
  declare_sampling(S = complete_rs(N = N, n = 3)) +
  declare_estimator(age ~ 1, model = lm_robust) 

\(~\)

simulations <- simulate_design(design, sims = sims)

Figure 9.1 shows that we are right on average but wrong a lot. The average estimate lies right on top of the true value of the estimand (40), but the estimates range enormously widely, from close to zero to close to 80 in some draws. The answer strategy – the sample mean estimator – is just fine, the problem here lies in the data strategy that generates tiny samples.

Figure 9.1: Distribution of the point estimator answer strategy

9.1.2 Tests

Tests are an elemental kind of answer strategy. Tests yield binary yes/no answers to a binary yes/no inquiry. In some qualitative traditions, hoop tests, straw-in-the-wind tests, smoking-gun tests, and doubly-decisive tests are common. These tests are procedures for making analysis decisions in a structured way. Similarly, many forms of quantitative tests have been developed. Sign tests assess whether a test statistic is positive, negative, or zero. Null hypothesis significance tests assess whether a parameter is different than a null value, such as zero. Equivalence tests assess whether a parameter falls within a range, rather than comparing to a fixed value. Many procedures for conducting tests are also available, with different assumptions about the null hypothesis, the distributions of variables, and the data strategy.

A typical null hypothesis test proceeds by imagining a null model \(M_{0}\) and imagining the sampling distribution of the empirical answer \(a_{d_0}\) under a hypothetical design M\(_{0}\)IDA. That sampling distribution enumerates all the ways the design could have come out if the null model \(M_{0}\) were the correct one. For a null hypothesis test, we entertain the null model and consider its implications. We ask, under \(M_{0}\), how frequently would we obtain an answer as large or larger than the empirical answer \(a_d\) (or other test statistic)? That frequency is known as a \(p\)-value.9 The last step of the test is to turn the \(p\)-value into a binary significance choice. The typical threshold in the social sciences is 0.05: hypothesis tests with \(p\)-values less than 0.05 indicate statistical significance. This threshold is arbitrary, reflecting the inertia of the scientific community much more than some a priori scientific standard. The appropriate threshold value for statistical significance is a matter of furious debate, with some authors calling for the threshold to be lowered to \(0.005\) to guard against false positives (Benjamin et al. 2018).

We’ll illustrate the idea of a hypothesis test in general with the Italian village example. Here, we test against the hypothesis that the average age is 20. If we have strong evidence against this hypothesis, we will reject it. If we have weak evidence against the hypothesis, we will fail to reject it. For instance, we might reject 10 and 70 but fail to reject 35 and 45.

Declaration 9.2 Italian village design continued

design <- 
  design +
  declare_test(age ~ 1, 
               linear_hypothesis = "(Intercept) = 20", 
               model = lh_robust, label = "test")

\(~\)

Here’s one run of that design. The output can be confusing. By default, most statistical software tests against the null hypothesis that the true parameter value is zero – so the p.value in the first row refers to that null hypothesis test. The second row is the test against the hypothesis that the mean is equal to twenty. The “estimate” in the second row is the difference of the observed estimate from 20. The p.value in the second row is the one we care about when testing against the null hypothesis that the average age is 20.

Table 9.1: Estimates from the test of the mean age equalling 20 from Italian villages example

| estimator | estimate | p.value |
|-----------|----------|---------|
| estimator | 37       | 0.05    |
| test      | 17       | 0.19    |

Figure 9.2 shows how frequently we reject the null that the average age is 20. When the estimate is close to 20, we rarely reject the null, but when the estimate is far from 20, we are more likely to reject. Again, this simulation comes from a design with a weak data strategy of sampling only three citizens at a time. We need to see estimates breaking 60 before the testing answer strategy reliably rejects this (false) null hypothesis.

Figure 9.2: Distribution of the test answer strategy
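
The overall rejection rate of this test can also be computed directly by diagnosis. Here is a minimal sketch, assuming the design from Declaration 9.2 and the conventional 0.05 threshold; the diagnosand name rejection_rate is ours.

# how often does the answer strategy reject at the 0.05 level?
rejection_rate <- declare_diagnosands(
  rejection_rate = mean(p.value <= 0.05)
)

diagnose_design(design, diagnosands = rejection_rate, sims = 1000)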

9.1.2.1 Randomization inference

Randomization inference describes a large class of testing procedures that merit special attention. Randomization inference tests leverage known features of the randomization procedure to implement tests of many kinds (see Gerber and Green (2012), chapter 3, for an introduction to randomization inference). In a common case, a randomization inference test proceeds by stipulating a null model under which the counterfactual outcomes of each unit are exactly equal to the observed outcomes, the so-called “sharp null hypothesis of no effect.” Under this null hypothesis, the treated and untreated potential outcomes are exactly equal for each unit, reflecting a model in which the treatment has exactly zero effect for each unit.

As described above, a \(p\)-value is an answer to the question: what is the probability the null model would generate estimates as large or larger in absolute value than the observed estimate? We can answer this question by diagnosing the design under the sharp null model. We take the example of exercise 3.6 from Gerber and Green (2012), which prompts students to evaluate the sharp null hypothesis of no effect in the context of Clingingsmith, Khwaja, and Kremer (2009)’s study of the effect of being randomly assigned to go on Hajj on tolerance of foreigners.

Here is the observed estimate, which is positive, indicating that our best guess is that going on Hajj increases tolerance.

clingingsmith_etal <- read_csv("data/clingingsmith_et_al_2009.csv")

observed_estimate <-
  difference_in_means(views ~ success, data = clingingsmith_etal)

observed_estimate
Table 9.2: Results from Clingingsmith, Khwaja, and Kremer (2009) study

| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|---|---|---|---|---|---|---|---|---|
| success | 0.47 | 0.16 | 2.92 | 0 | 0.16 | 0.79 | 954.3 | views |

Here we declare the null model and add it to the data and answer strategies:

Declaration 9.3 Randomization inference under the sharp null

null_model <- 
  declare_model(data = clingingsmith_etal,
                potential_outcomes(Y ~ 0 * Z + views))

null_design <-
  null_model +
  declare_assignment(Z = complete_ra(N = N, m = sum(success))) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z)

\(~\)

We diagnose the null design with respect to the p.value diagnosand: what fraction of simulations under the null model exceed the observed estimate? We find that the p.value is low, suggesting that the observed estimate is unlikely to have been generated under the null model.

p.value <-
  declare_diagnosands(
    p.value = mean(abs(estimate) >= abs(observed_estimate$coefficients))
  )

nhst <-
  diagnose_design(null_design, diagnosands = p.value, sims = 1000)

get_diagnosands(nhst)
Table 9.3: Diagnosands of randomization inference example

| design | estimator | term | p.value | se(p.value) | n_sims |
|---|---|---|---|---|---|
| null_design | estimator | Z | 0 | 0 | 1000 |

9.1.3 Bayesian formalizations

Bayesian answer strategies sometimes target the same inquiries as classical approaches, but rather than seeking a point estimate, they try to generate rational beliefs over possible values of the estimand. Rather than trying to provide a single best guess for the average age in a village, a Bayesian answer would try to figure out how likely different answers are given the data. To do so they need to know how likely different age distributions are before seeing the data—the priors—and the likelihood of different types of data for each possible age distribution. A Bayesian would likely not be very impressed by the answer of 15 given by the point estimator in Section 9.1.1 because, prior to seeing any data, they would likely expect that the answer had to be bigger than this. Bayesians would chalk the answer “15” up to an unusual draw.

A Bayesian answer strategy might look like this. Here we need to modify the default model summary option to deal with the fact that the estimates are returned in log form by default.

The Bayesian answer strategy specifies a prior distribution (here distributed normal centered on 50 to reflect a prior that Italian villages skew older) as well as a log normal distribution for ages. Here we retain the (median) posterior estimates for average age alongside a standard error based on the posterior variance. In the model_summary argument we ask the tidier to exponentiate the coefficient estimate and standard error before returning them.

Declaration 9.4 \(~\)

library(rstanarm)
library(broom.mixed)

design <-
  declare_model(N = 100, age = sample(0:80, size = N, replace = TRUE)) +
  declare_inquiry(mean_age = mean(age)) +
  declare_sampling(S = complete_rs(N = N, n = 3)) +
  declare_estimator(
    age ~ 1,
    model = stan_glm,
    family = gaussian(link = "log"),
    prior_intercept = normal(50, 5),
    model_summary = ~tidy(., exponentiate = TRUE),
    inquiry = "mean_age"
  )

\(~\)

We can then simulate this design in the same way and examine the distribution of estimates we might get.

simulations <- simulate_design(design, sims = sims)

What we see in Figure 9.3 is that using the same (poor) data strategy as before, a Bayesian answer strategy gets us a somewhat tighter distribution on our answer, but exhibits greater bias: the average estimate is higher than the estimand. We might accept higher bias for lower variance if overall, the root-mean-squared error is lower for the Bayesian approach. See Section 10.6 for a further discussion of RMSE. The main difference between the Bayesian and classical approaches is the handling of prior beliefs, which carry a lot of weight in the Bayesian estimation but no weight in the classical approach.

Figure 9.3: Sampling distribution of the Bayesian answer strategy
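
The diagnosands behind this comparison can be declared explicitly. Here is a minimal sketch; the names bias, sd_estimate, and rmse are ours, and the diagnosis call assumes the design from Declaration 9.4 (which is slow to simulate, since each run fits a Stan model).

# diagnosands capturing the bias-variance tradeoff discussed above
bias_variance_rmse <- declare_diagnosands(
  bias = mean(estimate - estimand),
  sd_estimate = sd(estimate),
  rmse = sqrt(mean((estimate - estimand)^2))
)

diagnose_design(design, diagnosands = bias_variance_rmse, sims = 500)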

Bayesian approaches are also used by qualitative researchers drawing case level inferences from causal process observations. Recent developments in qualitative methods have sought to take Bayes’ rule “from metaphor to analytic tool” (Bennett 2015). This approach characterizes qualitative inference as one in which prior beliefs about the world can be specified numerically and then are updated on the basis of evidence observed. At a minimum, writing down such an answer strategy on a computer requires specifying beliefs, expressed as probabilities, about the likelihood of seeing certain kinds of evidence under different hypotheses. We provide an example of such a strategy in the design library in Section 15.1. Herron and Quinn (2016) provide one approach to formalizing a qualitative answer strategy that focuses on understanding an average treatment effect. Humphreys and Jacobs (2015) provide an approach that can be used to formalize answer strategies targeting both causal effect and causal attribution inquiries, while Fairfield and Charman (2017) formalize a Bayesian approach that approaches causal attribution as a problem of attaching a posterior probability to competing alternative hypotheses. Abell and Engel (2019) suggest the use of “supra-Bayesian” methods to aggregate multiple participant-provided narratives in ethnographic studies targeting causal attribution estimands.

9.1.4 Interval estimation

In many circumstances, the details of the data strategy alone are insufficient to pin down a parameter value exactly, in which case the parameter is not “point-identified,” which means we can’t generate a point estimate without further assumptions. The standard approach is to simply make those further assumptions and move on to reporting point estimates. Under an agnostic approach – we don’t know if those assumptions are right because they aren’t grounded in the data strategy – we can turn to interval estimation instead.

One way to handle settings in which parameters are not point-identified is to generate “extreme value bounds.” These bounds report the best and worst possibilities according to the logical extrema of the outcome variable.

We illustrate interval estimation back in our Italian village where we have learned the ages of three of the 100 citizens. Suppose we did not know whether the data strategy used random sampling, so we can’t rely on the guarantee that, under random sampling, the sample mean is unbiased for the population mean. Now we have to reason about best and worst case scenarios. Let’s agree that the youngest a person can be is zero and the oldest is 110. Starting with an estimate of 15 among three citizens, we can generate lower and upper bound estimates for the average age of the entire 100-person village like this:

lower_bound <- (3 * 15 + 97 * 0)/100
upper_bound <- (3 * 15 + 97 * 110)/100
c(lower_bound, upper_bound)
Table 9.4: Extreme value bounds estimate

| Lower bound | Upper bound |
|-------------|-------------|
| 0.45        | 107.15      |

This procedure generates enormously wide bounds – we learn only that the average age is somewhere between 0.45 and 107.15 years, barely better than what we knew before collecting any data. But consider if we had data on 90 of the 100 citizens and among those 90, the average is 44. Now when we generate the bounds, they are still wide but not ridiculously so – the bounds put the average age somewhere between about 40 and 51.

lower_bound <- (90 * 44 + 10 * 0)/100
upper_bound <- (90 * 44 + 10 * 110)/100
c(lower_bound, upper_bound)
Table 9.5: Tighter extreme value bounds estimate with more data

| Lower bound | Upper bound |
|-------------|-------------|
| 39.6        | 50.6        |

Extreme value bounds and variations on the idea can be applied when experiments encounter missingness or when we want to estimate effects among subgroups that only reveal themselves in some but not all treatment conditions (see Aronow, Baron, and Pinson (2019) or Coppock (2019) for examples). The extreme value bound approach can also be used in qualitative settings in which we can impute some but not all of the missing potential outcomes using qualitative information; the bounds reflect our uncertainty about those missing values (Coppock and Kaur 2021).

9.2 Uncertainty

Answers should usually be accompanied by uncertainty estimates. We want to communicate to others not just what our answer is, but also how certain we are of it. Our uncertainty about our answers stems from the properties of a design: when designs are stronger, our uncertainty is smaller. One reason to communicate the uncertainty associated with your estimates is to communicate the strength of your design.

For point and interval estimators, uncertainty is often expressed as a standard error estimate or confidence interval. Many approaches to standard error estimation are available. Indeed, just like point estimators for inquiries, we have point estimators for standard errors. You might choose classical standard errors or cluster-robust standard errors; you might bootstrap your standard errors or use the jackknife. Similarly, many approaches to confidence interval construction are available. Most often, confidence intervals are built from variance estimates under an appeal to sampling theory. Alternatively, a confidence interval can be formed by “inverting the test,” i.e. finding the range of null hypotheses we fail to reject. Whether any particular approach to uncertainty estimation is appropriate in a context will depend on the full set of design parameters and we encourage you to diagnose your uncertainty estimates as well.
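
To make one of these options concrete, here is a minimal sketch of a nonparametric bootstrap standard error for the sample mean of the three observed ages; the number of bootstrap draws is arbitrary.

# resample the three observed ages with replacement many times and take
# the standard deviation of the resampled means as the standard error
observed_ages <- c(5, 15, 25)
boot_means <- replicate(10000, mean(sample(observed_ages, replace = TRUE)))
sd(boot_means)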

Here is the output of the answer strategy from Declaration 9.1, applied to the realized data set, rounded to the nearest whole number. We see the sample mean estimate of 15, the standard error estimate of 6, and the confidence interval from -10 to 40. These numbers communicate that our answer is 15 – but also that we know that number is shaky. We’re uncertain because the tool we used to answer the inquiry is high variance: it could bounce around a lot depending on which three people we happened to sample.

three_italian_citizens <- fabricate(N = 3, age = c(5, 15, 25))
answer_strategy <- declare_estimator(age ~ 1)
answer_strategy(three_italian_citizens)
Table 9.6: One draw of the answer strategy

| estimate | std.error | statistic | p.value | conf.low | conf.high |
|----------|-----------|-----------|---------|----------|-----------|
| 15       | 5.77      | 2.6       | 0.12    | -9.84    | 39.84     |

For tests, uncertainty is expressed by describing the properties of a procedure in terms of error rates. A test is an answer strategy that returns a binary answer to a binary estimand. The result of a test is an error if the empirical answer \(a_d\) does not equal the truth \(a_{m^*}\). Conventionally, a Type I error occurs when \(a_d = 1\) but \(a_{m^*} = 0\) and a Type II error occurs when \(a_d = 0\) but \(a_{m^*} = 1\). A perfect test (i.e., a test about which we are fully certain) has a Type I error rate of 0% and a Type II error rate of 0% as well. A test about which we are less certain might return \(a_d = 1\) 40% of the time when \(a_{m^*} = 0\) (a Type I error rate of 40%) and might return \(a_d = 1\) 90% of the time when \(a_{m^*} = 1\) (a Type II error rate of 10%).

The test reported by the answer_strategy function is obscured by its presentation, but because this sort of presentation is so common in social science, it’s worth showing raw. The test implicit in the output is a null hypothesis significance test against the null hypothesis that the average age in this Italian village is equal to exactly zero. The test returns “yes” if we reject the null and “no” if we fail to reject it.10 If we use the standard significance threshold of \(\alpha = 0.05\), we fail to reject the null model because the p.value reported in the table is 0.12.

Our uncertainty about the decision we made in the hypothesis test to fail to reject is not represented by the information in the table. Importantly, the \(p\)-value does not represent the probability that the null model is correct. The p.value is the probability that with our data and answer strategy, draws from the null model would lead to estimates of 15 or larger. According to our calculations, draws from the null model will do so 12 percent of the time. We use this probability along the way to making a decision about whether to reject the null model, but amazingly, a p.value does not describe our certainty about the significance test!

What does characterize our uncertainty about a significance test? The Type I and Type II error rates of the test. The Type I error rate is controlled by the significance threshold. A Type I error occurs if we reject the null model when it is true. If we use \(\alpha = 0.05\) and the test correctly accounts for all design elements, then a Type I error should only happen 5% of the time. Type II error rates are harder to learn about. In our case, we failed to reject the null model. To characterize our uncertainty about the test, we also want to calculate the probability that a design like this one would generate Type II errors. To do so, we have to imagine what it means for the null model to be false, since it can be false in many ways. One approach is to imagine how the test would perform under a series of non-null models.

Figure 9.4 describes the Type II error rate over a range of non-null models. If the true population mean is around 25 or lower, we fail to reject the null 75% of the time or more. With this comically small sample size, even if the true mean were 75, we would still fail to reject 20% of the time. We are rightly uncertain about this test – it may have a low enough Type I error rate (set as \(\alpha = 0.05\)), but the Type II error rates are far too high.

Declaration 9.5 \(~\)

design <-
  declare_model(N = 100, 
                age = round(rnorm(N, mean = true_mean, sd = 23))) +
  declare_inquiry(mean_age = mean(age)) +
  declare_sampling(S = complete_rs(N = N, n = 3)) +
  declare_estimator(age ~ 1, model = lm_robust) 

\(~\)

designs <- redesign(design, true_mean = seq(0, 100, length.out = 10))
Figure 9.4: Type II error rates of Italian village design

9.2.1 Uncertainty estimates as diagnosands

In this section, we encourage you to think of uncertainty statistics as empirical estimates of diagnosands describing the quality of your design under some set of models.

Suppose we report that the estimate is equal to 15 and carries with it a standard error of 6. Here we are communicating that we now think our design has the property that, if we were to run it over and over, the standard deviation of our estimates would be 6. That is, if we were to declare a design over a large class of models that all have the same outcome variance, combine it with I, D, and A, and diagnose it, the standard deviation of the estimates under all those models would be 6.
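
To see this concretely, here is a minimal, self-contained sketch that diagnoses the Italian village design from Declaration 9.1 and compares the average standard error estimate to the true standard deviation of the estimates; the diagnosand names are ours.

village_design <-
  declare_model(N = 100, age = sample(0:80, size = N, replace = TRUE)) +
  declare_inquiry(mean_age = mean(age)) +
  declare_sampling(S = complete_rs(N = N, n = 3)) +
  declare_estimator(age ~ 1, model = lm_robust)

# the standard error is A's estimate of the standard deviation diagnosand
se_as_estimate <- declare_diagnosands(
  mean_std.error = mean(std.error),
  sd_estimate = sd(estimate)
)

diagnose_design(village_design, diagnosands = se_as_estimate, sims = 1000)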

Likewise, estimated \(p\)-values can also be thought of as estimates of a peculiar diagnosand: the probability of obtaining a test statistic of a particular value, under a maintained null model. That is, if we were to write down a null model \(m_{0}\) under which the estimand were in fact zero, then combined it with I, D, and A, we could diagnose that design to learn the fraction of estimates that are as large or larger than 15 – a \(p\)-value.

Thinking about \(p\)-values as diagnosands can be seen most directly in the randomization inference approach to hypothesis testing. Randomization inference often considers a “sharp” null hypothesis in which we impute the missing potential outcomes with exactly their observed values, then simulate all the ways the design could come out holding I, D, and A constant. The \(p\)-value is the fraction of simulations in which the null model generates estimates that are more extreme than the observed estimate. See Section 9.1.2.1 for a worked example of this approach.

Even frequentist confidence intervals, with their notoriously confusing interpretation (a 95% confidence interval is an interval that has the goal of covering the estimand 95% of the time) can be viewed as estimates of the 2.5th and 97.5th quantiles of the sampling distribution under the model that the estimand equals the estimate.

The payoff from thinking about uncertainty statistics as estimates of diagnosands is that uncertainty statistics can be poor estimators of diagnosands. For example, social scientists criticize one another’s choice of standard error – classical or robust, clustered or not, bootstrapped, jackknifed, and on and on. The reason for this debate is that when the procedures we follow to form uncertainty statistics are inappropriate for the design setting, we can be falsely confident in our answers.

Figure 9.5 illustrates what’s at stake in the age-old contest between classical standard errors and robust standard errors. Here we are in the context of a two-arm trial under three settings: the control potential outcomes have higher variance, the groups have the same variances, or the treated outcomes have higher variance. The calculation for classical standard errors pools the data from both groups when estimating a variance, thereby assuming “homoskedasticity,” a Greek word for having the “same spread.” This estimation choice leads to poor performance. Depending on the fraction of the sample that receives treatment, the standard error estimate can be upwardly biased (conservative) or downwardly biased (anti-conservative or falsely confident). We’re usually worried about standard error estimates being too small because anti-conservatism is probably worse for scientific communication than conservatism.

Robust standard errors are “heteroskedasticity-robust,” which means that the calculation does not pool the data from the two groups. Samii and Aronow (2012) show that a particular variant of robust standard errors (HC2) is exactly equal to the Neyman variance estimator described in Section 17.1.1, which is why HC2 robust standard errors are the default in the lm_robust function. The bottom row of Figure 9.5 shows that the robust standard error estimator hews closely to the true value of the diagnosand in all of the design settings considered. We prefer robust SEs because they do a better job of estimating the standard deviation diagnosand in more settings.

Figure 9.5: Why robust standard errors are preferred to classical standard errors
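
Here is a sketch of the kind of comparison behind Figure 9.5, with parameters that are illustrative rather than those used for the figure: treated outcomes have higher variance and only 20% of units are treated, a configuration in which classical standard errors tend to be too small while HC2 robust standard errors track the standard deviation diagnosand.

heteroskedastic_design <-
  declare_model(
    N = 100,
    U = rnorm(N),
    # treated potential outcomes have three times the standard deviation
    potential_outcomes(Y ~ 0.5 * Z + U * (1 + 2 * Z))
  ) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N = N, prob = 0.2)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, model = lm_robust, se_type = "classical",
                    label = "classical") +
  declare_estimator(Y ~ Z, model = lm_robust, se_type = "HC2",
                    label = "HC2 robust")

se_performance <- declare_diagnosands(
  true_sd = sd(estimate),
  mean_se_estimate = mean(std.error)
)

diagnose_design(heteroskedastic_design, diagnosands = se_performance, sims = 1000)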

9.3 Applying design principles

Now that we have discussed all four research design elements in detail, we can return to some of the design principles laid out in Chapter 3.

9.3.1 Declare data and answer strategies as functions

Both the data and answer strategies are functions. What we mean by this is that they respond to their inputs. If the events generated by the world had been different, the data produced by the data strategy would be different too. If the data produced by the data strategy had been different, the answers rendered by the answer strategy would be different too. These design elements are procedures and we want to understand the properties of those procedures over many possible ways the world could be. We can’t do that unless we declare data and answer as functions.

When declaring answer strategies as functions, we have to think about more than just the single estimation function that ends up in the final paper. To see this, consider an estimator that is selected through an exploratory procedure in which multiple estimators are compared on the basis of fit statistics. The answer strategy is not this final estimator – it is this entire multi-step if-then procedure.

The reason to declare the procedure rather than the final estimator is that the diagnosis of the design may differ. The procedure may be more powerful if, for example, we assess multiple sets of covariate controls and select the specification with the lowest estimated standard error. But that procedure would also exhibit poor coverage, since the confidence interval produced by the final estimator does not account for these multiple bites at the apple.
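
A minimal sketch of what declaring such a procedure could look like: a custom estimator function that runs two specifications and reports whichever has the smaller estimated standard error. The covariate X, the inquiry label, and the decision rule are all illustrative.

pick_smallest_se <- function(data) {
  # fit the unadjusted and covariate-adjusted specifications
  fit_1 <- tidy(lm_robust(Y ~ Z, data = data))
  fit_2 <- tidy(lm_robust(Y ~ Z + X, data = data))
  est_1 <- fit_1[fit_1$term == "Z", ]
  est_2 <- fit_2[fit_2$term == "Z", ]
  # keep whichever coefficient on Z has the smaller standard error
  if (est_1$std.error <= est_2$std.error) est_1 else est_2
}

declare_estimator(handler = label_estimator(pick_smallest_se),
                  inquiry = "ATE", label = "specification_search")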

Answer strategies can become multi-stage procedures in unexpected ways. For example, sometimes a planned-on maximum likelihood estimator won’t converge when executed on the realized data. In these cases, analysts switch estimators (or sometimes inquiries!). The full set of steps — a decision tree, depending on what is estimable — is the answer strategy we want to declare and compare to alternative decision trees.

This principle extends to settings in which analysts run diagnostic tests, like falsification or placebo tests. If we learn from a sensitivity test that a mediation estimate is very sensitive to unobserved confounding, we might choose not to present it at all. By this logic, the answer strategy includes the sensitivity test, the decision made on the basis of the test, and the resulting distribution of mediation estimates, some of which are undefined.

Writing down the full set of if-then choices you might make in the answer strategy depending on revealed data is hard to do. It’s hard because we readily imagine answer strategies for when things go well but spend less imagination on elaborating what might happen when things go wrong. When things do go wrong — missing data, noncompliance, suspension of the data collection — answer strategies will change. One way to guard against over-correcting to the revealed data is to write down a standard operating procedures document that systematizes these procedures in advance (Green and Lin 2016).

9.3.2 Seek M:I::D:A parallelism

The model and the inquiry form the theoretical half of the design, and the data and answer strategies make up the empirical half. Research designs that have parallel theoretical and empirical halves tend to be strong (though not all strong designs need be parallel in this way). This principle is motivated by the intersection of two ideas from statistics: the “plug-in principle” and “analyze as you randomize.”

The plug-in principle refers to the idea that sometimes, the answer strategy function and the inquiry function are very similar in form. The estimand, \(I(m) = a_m\), can often be estimated by choosing an A that is very similar to I and then “plugging-in” the realized data \(d\) that result from the data strategy for the unobserved data \(m\), i.e. \(A(d) = a_d\).

More formally, Aronow and Miller (2019) describe a plug-in estimator as:

For i.i.d. random variables \(X_1, X_2, \ldots, X_n\) with common CDF \(F\), the plug-in estimator of \(\theta = T(F)\) is: \(\widehat\theta = T(\widehat F)\).

For example, suppose that our inquiry is the average treatment effect among the \(N\) units in the population.

\(I(m) = \frac{1}{N}\sum_1^N[Y_i(1) - Y_i(0)] = \frac{1}{N}\sum_1^NY_i(1) - \frac{1}{N}\sum_1^NY_i(0) = \mathrm{ATE}\)

We can develop a plug-in ATE estimator by replacing the population means — \(\frac{1}{N}\sum_1^N Y_i(1)\) and \(\frac{1}{N}\sum_1^N Y_i(0)\) — with sample analogues:

\(A(d) = \frac{1}{m}\sum_1^m{Y_i} - \frac{1}{N - m}\sum_{m+1}^N{Y_i}\),

where units 1 though \(m\) reveal their treated potential outcomes and the remainder reveal their untreated potential outcomes.

Following the plug-in principle only yields good answer strategies under some circumstances. Those circumstances are determined by the data strategy. In order to seek M:I::D:A parallelism, we need data strategies that sample units, assign treatment conditions, and measure outcomes such that the revealed data can indeed be “plugged in” to the inquiry function. Whether this plug-in ATE estimator is a good answer strategy depends on features of the data strategy. It’s a good estimator when units are assigned to treatment with equal probabilities, but it’s a bad estimator if the probabilities differ.

When the data strategy introduces distortions like differential probabilities of assignment, the answer strategy function should not equal the inquiry function: we can no longer just plug in the observed data. In order to seek parallelism, we should adjust for those distortions, reversing them to reestablish parallelism.

This idea can be summarized as “analyze as you randomize,” a dictum attributed to R.A. Fisher. We use known features of the data strategy to adjust the answer strategy. We can undo the distortion introduced by differential probabilities of assignment by weighting units by the inverse of the probability of being in the condition that they are in. If we use an inverse-probability weighted (IPW) estimator, we restore parallelism because even though A no longer equals I, the relationship of D to A once again parallels the relationship of M to I.

Declaration 9.6 illustrates this idea. We declare the theoretical half of the design as MI then consider the intersection of two data strategies with two answer strategies. D1 has constant probabilities of assignment and D2 has differential probabilities of assignment. A1 is the plug-in estimator and A2 is the IPW estimator with the inverse probability weights generated by the D2 randomization protocol.

Declaration 9.6 Seeking parallelism design

MI <-
  declare_model(
    N = 100,
    X = rbinom(N, size = 1, 0.5),
    U = rnorm(N),
    potential_outcomes(Y ~ 0.5 * Z - 0.5 * X + 0.5 * X * Z + U)
  ) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0))

D1 <- declare_assignment(Z = complete_ra(N = N)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z))
D2 <- declare_assignment(Z = block_ra(blocks = X, block_prob = c(0.1, 0.8))) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z))


A1 <- declare_estimator(Y ~ Z, label = "Unweighted")
A2 <-
  declare_step(
    handler = fabricate,
    ipw = 1 / obtain_condition_probabilities(
      assignment = Z,
      blocks = X,
      block_prob = c(0.1, 0.8)
    )
  ) +
  declare_estimator(Y ~ Z, weights = ipw, label = "Weighted")

designs <- list(MI + D1 + A1,
                MI + D1 + A2,
                MI + D2 + A1,
                MI + D2 + A2)

\(~\)

We diagnose the bias of all four designs. Figure 9.6 shows that when the answer strategy and the data strategy match (D1 + A1 and D2 + A2), we have no bias. When they do not match (D1 + A2 and D2 + A1), we do. In this case, seeking parallelism in the choice of answer strategy improves the design. Of course, an alternative answer strategy, call it A3, that uses whatever weights the data strategy implies would be unbiased under both D1 and D2.

Figure 9.6: When data and answer strategies are mismatched, we obtain bias.

This principle applies most clearly to the bias diagnosand, but it applies to others as well. For example, Abadie et al. (2017) recommend that answer strategies should include clustered standard errors at the level of sampling or assignment, whichever is higher. The data strategies that include clustering introduce a dependence among units that was not present in the model; clustered standard errors account for this dependence. If we did not do so, our estimated standard error would be a bad estimate of the “standard deviation” diagnosand.
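
A sketch of this kind of matching, with illustrative variable names: assignment is clustered by village in the data strategy, so the answer strategy clusters standard errors by village as well.

clustered_design <-
  declare_model(
    villages = add_level(N = 20, village_shock = rnorm(N)),
    citizens = add_level(N = 10, U = rnorm(N),
                         potential_outcomes(Y ~ 0.2 * Z + village_shock + U))
  ) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = cluster_ra(clusters = villages)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, model = lm_robust, clusters = villages,
                    label = "clustered_ses")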

9.3.3 Entertain many models

One of the main reasons to “entertain many models” is that we want to choose answer strategies that are “robust” to models. By robust, we mean that the answer strategy should produce good answers under a wide range of models. Selecting answer strategies that are robust to multiple models ensures that we not only get good answers when our model is spot on — which is rare! — but under many possible circumstances.

Understanding whether the choices over answer strategies — logit or probit or OLS — depend on the model being a particular way is crucial to making a choice. For example, many people have been taught that whenever the outcome variable is binary, OLS is inappropriate and that they must use a binary choice model like logit instead. When the inquiry is the probabilities of success for each unit and we use covariates to model them, how much better logit performs at estimating probabilities depends on the model. When probabilities are all close to 0.5, the two answer strategies both perform well. When the probabilities spread out from 0.5, OLS is less robust and logit beats it. In the same breath, however, we can consider these same two estimators in the context of a randomized experiment with a binary outcome. Here, OLS is just as strong as logit, no matter the distribution of the potential outcomes. In this setting, when we entertain many models, we find that both estimators are robust.
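
A small sketch of the first scenario, using simulated data rather than a full design declaration; all parameters here are illustrative. When the true probabilities are spread far from 0.5, logit’s fitted probabilities track the truth more closely than OLS’s.

set.seed(1)
N <- 1000
X <- rnorm(N)
p <- plogis(3 * X)  # true probabilities spread out toward 0 and 1
Y <- rbinom(N, 1, p)

p_ols   <- fitted(lm(Y ~ X))
p_logit <- fitted(glm(Y ~ X, family = binomial("logit")))

# root-mean-squared error of the fitted probabilities under each strategy
c(ols = sqrt(mean((p_ols - p)^2)), logit = sqrt(mean((p_logit - p)^2)))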

Entertaining many models has something in common with robustness checks: both share the motivation that we have fundamental uncertainty about the true model. A robustness check is an alternative answer strategy that changes some model assumption that the main answer strategy depends on. Presenting three estimates of the same parameter under different answer strategies (logit, probit, and OLS) and making a joint decision based on the set of estimates about whether the main analysis is “robust” is a procedure for assessing “model dependence.” But robustness checks are just answer strategies themselves, and we should declare them and diagnose them to understand whether they are good answer strategies. We want to understand the properties of the robustness check, e.g., under what models and how frequently does it correctly describe the main answer strategy as “robust.”

9.4 Declaring answer strategies in code

An answer strategy is a function that provides answers to an inquiry. Declaring one in code involves selecting that function and linking the answer or answers it returns to one or more inquiries.

The functions we declare in DeclareDesign for answer strategies differ from those for the other elements of research design to reflect the two-step nature of many answer strategies. Often, first a statistical model (e.g., a linear regression) is fit to data, and then summaries of that model fit (e.g., the coefficient on a variable X, its standard error, t-statistic, p-value, and confidence interval) are combined to form an answer and its associated measures of uncertainty.

9.4.1 Statistical modeling functions

In DeclareDesign, we call these two steps model and model_summary. (This usage is a little confusing, since this “model” refers to the “statistical modeling function” and not the research design element, but so be it.) The model argument in declare_estimator can take almost any modelling function in R (e.g., lm for linear regression) and model_summary consists of a summary function that calculates statistics from the fit such as tidy or glance. You can write your own model or model summary. When your answer strategy does not fit this two-step structure, you can, as with all declare_ functions, write your own handler.

We break down each part of a standard answer strategy declaration using the example of a linear regression of the effect of a variable Z on an outcome Y in Declaration 9.7. The first argument in our declare_estimator step defines the model we will use, here lm_robust, the function in the estimatr package for running linear regressions with robust standard errors. The second is the main argument for lm_robust, the formula for the regression specification, in this case Y on Z with no controls.

Declaration 9.7 Linear regression design

design <- 
  declare_model(
    N = 100, U = rnorm(N), potential_outcomes(Y ~ 0.2 * Z + U)
  ) + 
  declare_assignment(Z = complete_ra(N)) + 
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) + 
  declare_estimator(model = lm_robust, 
                    formula = Y ~ Z, 
                    model_summary = tidy, 
                    term = "Z", 
                    inquiry = "ATE", 
                    label = "lm_no_controls")

\(~\)

draw_estimates(design)
Table 9.7: The estimate from one draw of the linear regression design

| estimator | term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome | inquiry |
|---|---|---|---|---|---|---|---|---|---|---|
| lm_no_controls | Z | 0.02 | 0.18 | 0.13 | 0.89 | -0.34 | 0.39 | 98 | Y | ATE |

9.4.2 Tidying statistical modelling function output

Let’s unpack the model_summary argument. In this case we supplied the tidy function from the broom package (the default). Understanding what tidy does opens a window into the way we match estimates and estimands. The tidy function takes many model fit objects and returns a data frame in which rows represent estimates and columns represent statistics about that estimate. The columns typically include the estimate itself (estimate), an estimated standard error (std.error), a test statistic of some kind reported by the model function such as a t-statistic or Z-statistic (statistic), a p-value based on the test statistic (p.value), a confidence interval (conf.low, conf.high), and the degrees of freedom of the test statistic if relevant (df).

A key column in the output of tidy is term, which represents which coefficient (term) is being described in that row. The term column uniquely identifies the row. We will often need to use the term column in conjunction with the name of the estimator to link estimates to estimands when there are more than one. If in the regression we pull out two coefficients (e.g., for treatment indicator 1 and for treatment indicator 2), we need to be able to link those to separate inquiries representing the true effect of treatment 1 and the true effect of treatment 2. The term column is our tool for doing so, as the sketch below illustrates. The default is for term to pick the first coefficient that is not the intercept, so for the regression Y ~ Z there will be an intercept and then the coefficient on Z, which is what will be picked.
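
For example, a sketch with two treatment indicators (the variable and inquiry names are illustrative): one regression, two coefficients, each linked to its own inquiry via term.

estimator_Z1 <- declare_estimator(Y ~ Z1 + Z2, model = lm_robust,
                                  term = "Z1", inquiry = "ATE_Z1",
                                  label = "effect_of_Z1")
estimator_Z2 <- declare_estimator(Y ~ Z1 + Z2, model = lm_robust,
                                  term = "Z2", inquiry = "ATE_Z2",
                                  label = "effect_of_Z2")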

The inquiry argument defines which inquiry or inquiries the estimates will be linked to. In this case, we link to a single inquiry, the ATE. You can also declare an estimator that shoots at multiple inquiries: declare_estimator(Y ~ Z, model = lm_robust, term = "Z", inquiry = c("ATE", "ATT")), which is useful for learning how well an estimator does for different targets. When we run the answer strategy on data, we get two additional pieces of information tacked on to the model summary data frame: the name of the estimator, which comes from the label argument, and the inquiry. The unit of analysis of a diagnosis is the inquiry-estimator pair, so if you link an estimator to multiple inquiries, then there will be a row for each inquiry.

Tidy is not the only commonly-used model summary function. glance will provide model fit statistics such as the r-squared:

A <- declare_estimator(Y ~ Z, 
                       model = lm_robust, 
                       model_summary = glance)
A(draw_data(design))
Table 9.8: The model fit statistics from one draw of the linear regression design

| r.squared | adj.r.squared | statistic | p.value | df.residual | nobs | se_type |
|---|---|---|---|---|---|---|
| 0.01 | 0 | 1.05 | 0.31 | 98 | 100 | HC2 |

When neither tidy nor glance works well for your answer strategy, you can write your own model summary function. Below, we slowly build up a tidy function for the lm model. (One is already built-in to the broom package, but we do so here to illustrate how you can write your own for a function that does not already have one.) Before you start to write your own summary function, check whether one exists on the Broom Web site.

There are three steps for a tidy function:

  1. Pull out statistics from the model fit object. You can extract out any statistics and transform them in any relevant way.

  2. Return a data frame (or tibble).

  3. Name your estimates in a common format that works across all tidy functions. The estimate column should be called “estimate,” the standard error column “std.error,” etc., as described earlier. However, if you want to add statistics from the model fit that you will diagnose, you can and you can name them whatever you want.

tidy_lm <- function(fit) {
  # calculate estimates by grabbing the coefficients from the model fit
  estimate <- coef(fit)
  
  # get the names of the coefficients (e.g., "(Intercept)", "Z")
  #   we will call these "term" to represent regression terms
  term <- names(estimate)
  
  # calculate the standard error by grabbing the variance-covariance
  #   matrix, then pulling the diagonal elements of it and taking the 
  #   square root to transform from variances to standard errors
  std.error <- sqrt(diag(vcov(fit)))
  
  # return a tibble with term, estimate, and std.error
  tibble(term = term, estimate = estimate, std.error = std.error)
}

declare_estimator(
  Y ~ Z,
  model = lm,
  model_summary = tidy_lm
)

In other cases, you may want to build on functions that interoperate with the broom functions to do specialized summary tasks like calculating marginal effects or predicted values. The margins function from the margins package calculates marginal effects and the prediction function from the prediction package calculates predicted values; both are especially useful and work well with the tidy workflow. To calculate marginal effects, run margins and then tidy as your model summary:

tidy_margins <- function(x) {
  tidy(margins(x, data = x$data), conf.int = TRUE)
}

declare_estimator(
  Y ~ Z + X,
  model = glm,
  family = binomial("logit"),
  model_summary = tidy_margins,
  term = "Z"
) 

9.4.3 Custom answer strategies

If your answer strategy does not use a model function, you’ll need to provide a function that takes data as an input and returns a data frame with the estimate. Set the handler to be label_estimator(your_function_name) to take advantage of DeclareDesign’s mechanism for matching inquiries to estimators. When you use label_estimator, you can provide an inquiry, and DeclareDesign will keep track of which estimates match each inquiry. (It simply adds a column to your tidy estimates data frame for the name of the estimator and the inquiry.) For example, to calculate the mean of an outcome, you could write your own estimator in this way:

my_estimator <- function(data) {
  data.frame(estimate = mean(data$Y))
}
declare_estimator(handler = label_estimator(my_estimator),
                  label = "mean",
                  inquiry = "Y_bar")

Often you may want to construct a test as part of your answer strategy that does not target an inquiry. Our declare_test function works just like declare_estimator except you need not include an inquiry. The label_test infrastructure works just like label_estimator for custom test functions.
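
As a sketch, a custom test that does not target an inquiry could be declared like this; the covariates X1 and X2 and the choice of an F-test for balance are illustrative.

# an F-test of whether covariates jointly predict treatment assignment
balance_test <- function(data) {
  fit <- lm(Z ~ X1 + X2, data = data)
  f <- summary(fit)$fstatistic
  data.frame(p.value = pf(f[1], f[2], f[3], lower.tail = FALSE))
}

declare_test(handler = label_test(balance_test), label = "balance_test")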