# 2 What is a research design?

At its heart, a research design is a procedure for generating empirical answers to theoretical questions. Research designs can be strong or weak. Assessing whether a design is strong requires having a clear sense of what the question is and knowing whether the answers a study is likely to deliver are reliable. This book offers a language for describing research designs and an algorithm for selecting among them. In other words, it provides a set of tools for weighing and describing the dozens of choices we make in our research activities that together determine whether we can provide useful answers to our questions.

We show that the same basic language can be used to describe research designs whether they target causal or descriptive questions, whether they are focused on theory testing or inductive learning, and whether they use quantitative, qualitative, or mixed methods. We can select a strong design by applying a simple algorithm: declare-diagnose-redesign. Once a design is declared in simple enough language that a computer can understand it, its properties can be diagnosed through simulation. We can then engage in redesign, comparing the design against a range of neighboring alternatives. The same language we use to talk to the computer can be used to talk to others. Reviewers, advisors, students, funders, journalists, and the public need to know four basic things to understand your design.

## 2.1 The four elements of research design

Empirical research designs share in common that they all have an inquiry *I*, a data strategy *D*, and an answer strategy *A*. Less obviously perhaps, these three elements presuppose a model *M* of how the world works. The four together, which we refer to as *MIDA*, represent both a conceptual framework for your inquiries and a description of the choices you will make as a researcher to intervene in and learn about the world.

Figure 2.1 shows how these four elements of a design relate to one another, how they relate to real world quantities, and how they relate to simulated quantities. We will unpack this figure in the remainder of this chapter and highlight especially the important parallelisms between actual processes and simulated processes, and between the theoretical (*M*, *I*) and the empirical (*D*, *A*) halves of a design.

### 2.1.1 Model

The set of models in *M* comprises varied speculations about what causes what and how. It includes guesses about how important variables are generated, how things are correlated, and the sequences of events.

The *M* in *MIDA* does not necessarily represent your beliefs about how the world works. Instead, it describes a set of possible worlds in enough detail that you can assess how your design would perform *if* the real world worked like those in *M*. For this reason we sometimes refer to *M* as a set of “reference” models. Assessment of the quality of a design is carried out with reference to the models of the world that you provide in *M*. In other contexts, you might see *M* described as the “data generating process.” We would prefer to describe *M* as the “event generating process” to honor the fact that data are produced or gathered via a data strategy – and the resulting data are measurements taken of the events generated by the true causal model of the world.

Defining the model can feel like an odd exercise. Since researchers presumably want to learn about the world, declaring a model in advance may seem to beg the question. The discomfort we feel when writing down a model is real, because we simply don’t know exactly how the real world works. We nevertheless have to declare models, because designing research requires us to imagine how the design would perform under the many possible ways the world might work. In practice, declaring models about which we are uncertain is already familiar to any researcher who has calculated the statistical power of a design across a range of effect sizes.

#### 2.1.1.1 What’s in a model?

The model has two responsibilities. First, the model provides a setting within which a question can be answered. The inquiry *I* should be answerable *under the model*. If the inquiry is the average difference between two potential outcomes, those two potential outcomes should be described in the model. Second, the model governs what data will be produced by any given data strategy *D*. The data that will be produced by a data strategy *D* should be foreseeable under the model. If the data strategy includes random sampling of units from a population and measurement of an outcome, the model should describe the outcome variable for all units in that population.

These responsibilities in turn determine what needs to be in the model. In general, the model defines a set of units that we wish to study. Often, this set of units is larger than the set of units that we will actually study empirically, but we can nevertheless define this larger set about which we seek to make inferences. The units might be all of the citizens in Lagos, Nigeria, or every police beat in New Delhi. The set may be restricted to the mayors of cities in California or the catchment areas of schools in rural Poland. The model also includes information about the baseline characteristics of those units: how many of each kind of unit there are and how features of the units may be correlated.

For descriptive and causal questions alike, models are *causal* models. Even if questions are fundamentally descriptive, we pose them in the context of causal models, because our causal theories have implications for the level of one variable, or the correlation between two others.

Causal models include a set of outcome variables that may be functions of baseline characteristics and the effects of treatments. These treatments might be delivered naturally by the world or may be set by researchers. The values that an outcome variable takes depending on the level of a treatment are called *potential* outcomes. In the simplest case of a binary treatment, the treated potential outcome arises when a unit is treated; the untreated potential outcome when it is not. The causal effect of a particular treatment is usually defined as the difference between the treated and untreated potential outcomes.
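The potential outcomes logic can be made concrete in a few lines of code. The sketch below (in Python, with made-up values) gives each unit both potential outcomes, something only possible in a simulated world, and computes unit-level and average effects:

```python
import random

random.seed(1)

# Each unit carries two potential outcomes; only one is ever observed
# in real data, but a simulated model lets us write down both.
units = []
for i in range(6):
    y0 = random.gauss(0, 1)  # untreated potential outcome
    y1 = y0 + 0.5            # treated potential outcome (true effect: 0.5)
    units.append((y0, y1))

# The unit-level causal effect is the difference between the two
# potential outcomes; averaging gives the average treatment effect.
effects = [y1 - y0 for (y0, y1) in units]
average_effect = sum(effects) / len(effects)  # 0.5, up to rounding
```

In real data we observe only one of the two potential outcomes for each unit, which is why causal effects must be inferred rather than seen; in a declared model both columns are available.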

#### 2.1.1.2 *M* is a set

In Figure 2.1, we describe *M* as “the worlds you’ll consider.” The reason for this is that we are uncertain about how the world works. We don’t know the “right” model. For this reason we have to think through how our design will play out under different possible models, including ones we think likely and ones we think less likely. For instance, the correlation between two variables might be large and positive, but it could just as well be zero. We might believe that, conditional on some background variables, a treatment has been as-if randomly assigned by the world — but we might be wrong about that too. In the figure we use \(m^*\) to denote the “right” model, or the actual, unknown, event generating process. We do not have access to \(m^*\), but our hope is that \(m^*\) is sufficiently well-represented in *M* so that we can reasonably imagine what will happen when our design is applied in the real world.

How can you construct a sufficiently varied model of the world? For this difficult piece of theoretical work, you can draw on existing data, such as baseline surveys, or on new information gathered from pilot studies. Reducing uncertainty over the set of possible models is a core purpose of theoretical reflection, literature review, meta-analysis, and formative research. If there are important known features of your context, it generally makes sense to include them in \(M\).

Examples of models

*Contact theory.* When two members of different groups come into contact under specific conditions, they learn more about each other, which reduces prejudice, which in turn reduces discrimination.

*Prisoner’s dilemma.* When facing a collective problem, each of two people will choose non-cooperative actions independent of what the other will do.

*Health intervention with externalities.* When individuals receive deworming medication, school attendance rates increase for them and for their neighbors, leading to improved labor market outcomes in the long run.

### 2.1.2 Inquiry

The inquiry is a question stated in terms of the model. For example, the inquiry might be the average causal effect of one variable on another, the descriptive distribution of a third variable, or a prediction about the value of a variable in the future. We refer to “the” inquiry when talking about the main research question. But our theories are rich, so we may seek to learn about many inquiries in a single research study.

Many people use the word “estimand” to refer to an inquiry, and we do too when talking about research informally. When we are formally describing research designs, however, we distinguish between inquiries and estimands, and Figure 2.1 shows why. The inquiry *I* is the function that operates on the events generated by the world \(m^*\). The estimand is the value of that function: \(a^{m^*}\). In other words, we use “inquiry” to refer to the question and “estimand” to refer to the answer to the question.

Inquiries are defined with respect to units, conditions, and outcomes: they are summaries of outcomes of units in or across conditions. Inquiries may be causal, as in the sample average treatment effect (SATE). The SATE is the average difference in the outcome variable across the treatment condition and the control condition among units in a sample. Inquiries may also be descriptive, as in a population average of an outcome. While it may seem that descriptive inquiries do not involve conditions, they always do, since description of outcomes must take place under a particular set of circumstances, often set by the world and not the researcher.

Figure 2.1 shows that when *I* is applied to a model \(m\), it produces an answer \(a^m\). This set of relationships forces discipline on both *M* and *I*: *I* needs to be able to return an answer using information available from *M* and in turn *M* needs to provide enough information so that *I* can do its job.
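This discipline can be seen in code. The sketch below (Python, with arbitrary parameters) declares a small model as a full schedule of potential outcomes and then applies the sample average treatment effect inquiry to it as a function, returning the answer \(a^m\):

```python
import random

random.seed(2)

# A model m: a full schedule of potential outcomes for 100 units,
# with unit-level effects drawn around a hypothetical average of 0.3
N = 100
Y0 = [random.gauss(0, 1) for _ in range(N)]
Y1 = [y + random.gauss(0.3, 0.2) for y in Y0]

# The inquiry I is a *function* of the model's events ...
def sate(Y0, Y1):
    return sum(y1 - y0 for y0, y1 in zip(Y0, Y1)) / len(Y0)

# ... and applying I to m yields the answer a_m, the estimand
a_m = sate(Y0, Y1)
```

If the model omitted one of the two potential outcome schedules, the `sate` function could not return an answer: the inquiry forces the model to provide enough information, just as the text describes.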

We might think of the model and inquiry as forming the theoretical half of a research design. Together, they describe notional processes and quantities that we don’t observe directly. The data strategy and the answer strategy form the empirical half of the design, mirroring the theoretical half.

Examples of inquiries

- What proportion of voters live with limited exposure to voters from another party in their neighborhood?

- Does gaining political office make divorce more likely?

- What types of people will benefit most from a vaccine?

### 2.1.3 Data strategy

The data strategy is the full set of procedures we use to gather information from the world. The three basic elements of data strategies parallel the three features of inquiries: units are sampled, conditions are assigned, and variables are measured.

All data strategies involve sampling in the sense that no empirical strategy is comprehensive: some units are sampled into the study and some units aren’t. Seemingly comprehensive research designs like a population census have a sampling strategy in that they don’t sample respondents in different years or different countries.

Assignment procedures describe how researchers generate variation in the world. If you ask some subjects one question, but other subjects a different question, you’ve generated variation on the basis of an assignment procedure. We think of assignment procedures most often when they are randomized, as in a randomized experiment. Yet other kinds of research designs draw on assignment procedures that are not randomized, as in a pre-post design.

Measurement procedures are the ways in which researchers reduce the complex and multidimensional social world into a parsimonious set of data. These data need not be quantitative data in the sense of being numbers or values on a pre-defined scale; qualitative data are data too. Measurement is the vexing but necessary reduction of reality to a few choice representations. Measured values always contain measurement error, because this reduction is hard.

Figure 2.1 shows how the data strategy is applied to both the imagined worlds in *M* and to the real world. In practice, the application of *D* to the real world (\(m^*\)) might look quite different from the application of *D* to the worlds you imagine in *M*. We represent it in this way, however, to encourage you to make the representation of your data strategy as realistic as possible. Include in it not just the idealized elements, but also the challenges you might encounter in an uncooperative world, such as nonresponse and noncompliance.
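A data strategy can be declared with the same concreteness. The following sketch (Python, all numbers hypothetical) walks through the three elements in order: sampling units, assigning conditions, and measuring an outcome, while building in the kind of nonresponse wrinkle recommended above:

```python
import random

random.seed(3)

# A population of 1,000 units imagined under the model
population = list(range(1000))

# Sampling: draw 200 units at random
sample = random.sample(population, 200)

# Assignment: randomize half of the sampled units to treatment
treated = set(random.sample(sample, 100))

# Measurement, with a realistic wrinkle: 10% of units never respond
data = []
for unit in sample:
    if random.random() < 0.10:
        continue  # nonresponse: no outcome recorded for this unit
    z = 1 if unit in treated else 0
    y = random.gauss(0.5 * z, 1)  # measured outcome, with noise
    data.append((unit, z, y))
```

Declaring the nonresponse step explicitly, rather than assuming a perfectly cooperative world, makes later diagnosis honest about how much data the study will actually produce.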

Examples of data strategies

*Sampling procedures.*

Random digit dial sampling of 500 voters in the Netherlands

Respondent-driven sampling of people who are HIV positive, starting from a sample of HIV-positive individuals

“Mall intercept” convenience sampling of men and women present at the mall on a Saturday

*Treatment assignment procedures.*

Random assignment of free legal assistance intervention for detainees held in illegal pretrial detention

Nature’s assignment of the sex of a child

*Measurement procedures.*

Voting behavior gathered from survey responses

Administrative data indicating voter registration

Measurement of stress using cortisol readings

### 2.1.4 Answer strategy

The answer strategy is how we summarize the data produced by the data strategy. Just like the inquiry summarizes a part of the model, the answer strategy summarizes a part of the data. We can’t just “let the data speak” because complex, multidimensional datasets don’t speak for themselves — they need to be summarized and explained. Answer strategies are the procedures we follow to do so.

Answer strategies are functions that take in data and return answers. For some research designs, this is a literal function like the R function `lm_robust` that estimates an ordinary least squares (OLS) regression with robust standard errors. For some research designs, the function is embodied by the researchers themselves when they read documents and summarize their meanings in a case study.

The answer strategy is more than the choice of an estimator. It includes the full set of procedures that begins with cleaning the dataset and ends with answers in words, tables, and graphs. These activities include data cleaning, data transformation, estimation, plotting, and interpretation. Not only do we define our choice of OLS as the estimator, we also specify that we will focus attention on a particular coefficient estimate, assess uncertainty using a 95% confidence interval, and construct a coefficient plot to visualize the inference. The answer strategy also includes all of the if-then procedures that researchers implicitly or explicitly take depending on initial results and features of the data. For example, in a stepwise regression procedure, the answer strategy is not the final regression that results from iterative model selection, but that whole procedure itself.

*D* and *A* impose a discipline on each other in the same way as we saw with *M* and *I*.
Just like the model needs to provide the events that are summarized by the inquiry, the data strategy needs to provide the data that are summarized by the answer strategy. Declaring each of these parts in detail reveals the dependencies across the design elements.

*A* and *I* also enjoy a tight connection stemming from the more general parallelism between (*M*, *I*) and (*D*, *A*). We elaborate the principle of parallel answer strategies in the next chapter and in Section 9.3.2. For now, though, we highlight that the nature of the question you ask can determine whether the answer strategy does or does not require inference. If the question requires inference, then the design requires an inferential strategy.

Table \@ref(tab:questiontypes02) shows three types of questions—descriptive, causal, general—and examples in which these questions do or do not require inference. Some descriptive questions can in principle be addressed without inference from the measurement to the thing being measured (though we grant some philosophers would doubt even this). But descriptive questions typically do require an inferential strategy that gives us confidence that our measures align with the quantities we care about. Causal questions *always* require inference. This is what is meant by the “fundamental problem of causal inference”: you cannot *see* causal effects, you *have* to infer them. Last, some questions are about general claims. In theory, a general claim could be answered without inference through exhaustive measurement. For instance, “Are all swans white?” could be answered without inference if we find one nonwhite swan. But general claims generally do require inference, including claims whose scope extends into the future. This type of inference is sometimes called “inductive inference” (Fisher 1935), but we’ll refer to it as “generalization inference.” Finally, answering some questions requires facing all three types of inferential challenge at once, for instance general claims about the causal effects of a treatment on latent outcomes.

Inquiry type | Answerable without inference | Requires inference | Sample inferential strategy | Challenge
---|---|---|---|---
Descriptive inquiry | The winner got 7 votes | The president was angry | Seek observable implications | Descriptive inference
Causal inquiry | No examples possible | Human activity caused global warming | Random assignment, instrumental variables, controlled comparisons | Causal inference
General inquiry | All current British MPs are men | First past the post systems usually have two parties | Sample from a population, sample from history, make theory dependent claim | Generalization inference

Figure 2.1 shows how the same answer strategy *A* gets applied both to the expected data \(d\) and also to the data that you will ultimately gather \(d^*\). We know that in practice, however, the *A* applied to the real data differs somewhat from the *A* applied to the data we plan for via simulation. Designs sometimes drift in response to data, but too much drift and the inferences we draw can become misleading. The *MIDA* framework encourages you to think through what the real data will actually look like, and adjust *A* accordingly *before* data strategies are implemented.

Examples of answer strategies

Multilevel modeling and poststratification

Bayesian process tracing

Difference-in-means estimation

## 2.2 Declaration, diagnosis, redesign

With the core elements of a design described, we are ready to work through the process of declaration, diagnosis, and redesign.

### 2.2.1 Declaration

Declaring a design entails figuring out which parts of your design belong in *M*, *I*, *D*, and *A*. The declaration process can be a challenge because mapping your ideas and excitement about your project into *MIDA* is not always straightforward, but it is rewarding. When you can express your research design in terms of these four components, you are newly able to think about its properties.

Designs can be declared in words, but declarations often become much more specific when carried out in code. You can declare a design in any statistical programming language: Stata, R, Python, Julia, SPSS, SAS, Mathematica, or many others. Design declaration is even possible – though somewhat awkward – in Excel. We wrote the companion software, DeclareDesign, in R because of the availability of other useful tools in R and because it is free, open-source, and high-quality. We have designed the book so that you can read it even if you do not use R, but you will have to translate the code into your own language of choice. On our website, we have pointers for how you might declare designs in Stata, Python, and Excel. In addition, we link to a “Design wizard” that lets you declare and diagnose variations of standard designs via a point-and-click web interface. Chapter 4 provides an introduction to DeclareDesign in R.

### 2.2.2 Diagnosis

Once you’ve declared your design, you can diagnose it. Design diagnosis is the process of simulating your research design in order to understand the range of ways the study could turn out. Each run of the design comes out differently because different units are sampled, or the randomization allocated different units to treatment, or outcomes were measured with different error. We let computers do the simulations for us because imagining the full set of possibilities is – to put it mildly – cognitively demanding.

Diagnosis is the process of assessing the properties of designs, and represents an opportunity to write down what would make the study a success. For a long time, researchers have classified studies as successful or not based on statistical significance. If significant, the study “worked”; if not, it is a failed “null.” Accordingly, statistical power (the probability of a statistically significant result) has been the most front-of-mind design property when researchers plan studies. As we learn more about the pathologies of relying on statistical significance, we learn that features beyond power are more important. For example, the “credibility revolution” throughout the social sciences has trained a laser-like focus on the bias that may result from omitted or “lurking” variables.

Design diagnosis relies on two new concepts: diagnostic statistics and diagnosands.

A “diagnostic statistic” is a summary statistic generated from a single simulation of a design. For example, the statistic \(e\) refers to the difference between the estimate and the estimand. The statistic \(s\) refers to whether the estimate was deemed statistically significant at the 0.05 level.

A “diagnosand” is a summary of the distribution of a diagnostic statistic across many simulations of the design. The bias diagnosand is defined as the average value of the \(e\) statistic and the power diagnosand is defined as the average value of the \(s\) statistic. Other diagnosands include quantities like root-mean-squared-error (RMSE), Type I and Type II error rates, whether any subjects were harmed, and average cost. We describe these diagnosands in much more detail in Chapter 10.
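To make the two concepts concrete, the sketch below (Python, with illustrative effect size, sample size, and simulation count) simulates a simple two-arm experiment many times, computes the diagnostic statistics \(e\) and \(s\) in each run, and averages them into the bias and power diagnosands:

```python
import random
import statistics

random.seed(4)

def simulate_once(n=100, effect=0.5):
    # Model + data strategy: a two-arm experiment with a true effect of 0.5
    z = [i % 2 for i in range(n)]
    random.shuffle(z)
    y = [random.gauss(effect * zi, 1) for zi in z]
    # Answer strategy: difference in means with a simple standard error
    y1 = [yi for yi, zi in zip(y, z) if zi == 1]
    y0 = [yi for yi, zi in zip(y, z) if zi == 0]
    estimate = statistics.mean(y1) - statistics.mean(y0)
    se = (statistics.variance(y1) / len(y1) +
          statistics.variance(y0) / len(y0)) ** 0.5
    return estimate, abs(estimate / se) > 1.96

estimand = 0.5  # the true effect, known inside the simulation
runs = [simulate_once() for _ in range(1000)]
e = [est - estimand for est, _ in runs]  # diagnostic statistic e
s = [sig for _, sig in runs]             # diagnostic statistic s

bias = sum(e) / len(e)    # diagnosand: average of e across simulations
power = sum(s) / len(s)   # diagnosand: average of s across simulations
```

Because the simulation knows the estimand, it can compute \(e\) directly; in the real world \(e\) is unobservable, which is precisely why diagnosands must be computed over simulated designs.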

One especially important diagnosand is the “success rate,” which is the average value of the “success” diagnostic statistic. As the researcher, you get to decide what would make your study a success. What matters most in your research scenario? Is it statistical significance? If so, optimize your design with respect to power. Is what matters most whether the answer has the correct sign or not? Then diagnose how frequently your answer strategy yields an answer with the same sign as your inquiry. Diagnosis involves articulating what would make your study a success and then figuring out, through simulation, how often you obtain that success. Success is often a multidimensional aggregation of diagnosands, such as the joint achievement of high statistical power, manageable costs, and low ethical harms.

We diagnose studies over the range of possibilities in the model, since we want to learn the value of diagnosands under many possible scenarios. A clear example of this is the power diagnosand over many possible values of the true effect size. For each effect size that we entertain in the model, we can calculate statistical power. The minimum detectable effect size is a summary of this power curve, usually defined as the smallest effect size at which the design reaches 80% statistical power. This idea extends well beyond statistical power. Whatever the set of important diagnosands, we want to ensure that our design performs well across all model possibilities.
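A power curve of this kind can be traced by diagnosing the same design under a range of entertained effect sizes. The sketch below (Python, illustrative parameters, and a known-variance test for simplicity) computes power at each effect size and reads off a minimum detectable effect at the 80% threshold:

```python
import random

random.seed(5)

def power_at(effect, n=100, sims=500):
    # Simulate a two-arm experiment `sims` times at a given true effect
    hits = 0
    for _ in range(sims):
        y1 = [random.gauss(effect, 1) for _ in range(n // 2)]
        y0 = [random.gauss(0, 1) for _ in range(n // 2)]
        estimate = sum(y1) / len(y1) - sum(y0) / len(y0)
        se = (2 / (n // 2)) ** 0.5  # known-variance standard error
        hits += abs(estimate / se) > 1.96
    return hits / sims

# Trace the power curve over the effect sizes entertained in the model
effects = [0.0, 0.2, 0.4, 0.6, 0.8]
curve = {d: power_at(d) for d in effects}

# Minimum detectable effect: smallest entertained effect with >= 80% power
mde = min(d for d in effects if curve[d] >= 0.80)
```

Note that power at an effect of zero is just the false positive rate, which is why the whole curve, not a single point, is the informative object.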

Computer simulation is not the only way to do design diagnosis. Designs can be declared in writing or mathematical notation and then diagnosed using analytic formulas. Enormous theoretical progress in the study of research design has been made with this approach. Methodologists across the social sciences have described diagnosands such as bias, power, and root-mean-squared-error for large classes of designs. Not only can this work provide closed-form solutions for many diagnosands, it can also yield insights about the pitfalls to watch out for when constructing similar designs. That said, pen-and-paper diagnosis is challenging for the majority of social science research designs, first because many designs as actually implemented have idiosyncratic features that are hard to incorporate and second because the analytic formulas for many diagnosands have not yet been worked out by statisticians. For this reason, we usually depend on simulation.

Even when using simulation, design diagnosis doesn’t solve every problem and, like any tool, can be misused. We outline two main concerns. The first is the worry that the diagnoses are plain wrong. Given that design declaration includes conjectures about the world, it is possible to choose inputs such that a design passes any diagnostic test set for it. For instance, a simulation-based claim of unbiasedness that incorporates all features of a design is still only good with respect to the precise conditions of the simulation. In contrast, analytic results, when available, may extend over general classes of designs. Still worse, simulation parameters might be chosen opportunistically. A power analysis may be useless if implausible parameters are chosen to raise power artificially. While our framework may encourage more principled declarations, it does not guarantee good practice. As ever, garbage-in, garbage-out.

The second concern is the risk that research may be evaluated on the basis of a narrow or inappropriate set of diagnosands. Statistical power is often invoked as a key design feature, but well-powered studies that are biased are of little theoretical use. The importance of particular diagnosands can depend on the values of others in complex ways, so researchers should take care to evaluate their studies along many dimensions.

### 2.2.3 Redesign

Once your design has been declared, and you have diagnosed it with respect to the most important diagnosands, the last step is redesign.

Redesign entails fine-tuning features of the data and answer strategies to understand how they change your diagnosands. Most diagnosands depend on features of the data strategy. We can redesign the study by varying the sample size to determine how big it needs to be to achieve a target diagnosand: 90% power, say, or an RMSE of 0.02. We could also vary an aspect of the answer strategy, for example, the choice of covariates used to adjust a regression model. Sometimes the changes to the data and answer strategies interact. For example, if we want to use covariates that increase the precision of the estimates in the answer strategy, we have to collect that information as a part of the data strategy. The redesign question now becomes: is it better to collect pretreatment information from all subjects, or is the money better spent on increasing the total number of subjects and measuring only posttreatment?
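A minimal redesign loop can look like the following sketch (Python; the effect size, candidate sample sizes, and 90% target are all illustrative): sweep the sample size and keep the smallest design that clears the power target.

```python
import random

random.seed(6)

def power(n, effect=0.5, sims=500):
    # Power of a two-arm experiment with n subjects, known-variance test
    hits = 0
    for _ in range(sims):
        y1 = [random.gauss(effect, 1) for _ in range(n // 2)]
        y0 = [random.gauss(0, 1) for _ in range(n // 2)]
        estimate = sum(y1) / (n // 2) - sum(y0) / (n // 2)
        se = (2 / (n // 2)) ** 0.5
        hits += abs(estimate / se) > 1.96
    return hits / sims

# Redesign: sweep sample sizes, keep the smallest that clears 90% power
for n in [50, 100, 150, 200, 250]:
    if power(n) >= 0.90:
        chosen_n = n
        break
```

The same loop structure applies to any design knob and any diagnosand: replace the sample sizes with candidate covariate sets, numbers of treatment arms, or measurement budgets, and replace power with whatever diagnosand defines success.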

The redesign process is mainly about optimizing research designs given ethical, logistical, and financial constraints. If diagnosands such as total harm to subjects, total researcher hours, or total project cost exceed acceptable levels, the design is not feasible. We want to choose the best design we can among the feasible set. If the designs remaining in the feasible set are underpowered, biased, or are otherwise scientifically inadequate, the project may need to be abandoned.

In our experience, it’s during the redesign process that designs become *simpler*. We learn that our experiment has too many arms or that the expected level of heterogeneity is too small to be detected by our design. We learn that in our theoretical excitement, we’ve built a design with too many bells and too many whistles. Some of the complexity needs to be cut, or the whole design will be a muddle. The upshot of many redesign sessions is that our designs pose fewer questions, but obtain better answers.

## 2.3 Example: October surprise

Political pollsters forecast the outcomes of elections by taking samples of eligible voters and asking which candidate they will vote for. A key data strategy choice pollsters have to make is how best to spend their limited budget. They would like to forecast the outcome of the election far in advance, so one possibility is to draw one large sample a few weeks before the election, ask voters which candidate they prefer, then forecast the winner of the election on the basis of which candidate most subjects prefer.

But it’s possible that public preferences over the candidates change in the run-up to the election. American national elections are held in November, so a late-breaking scandal or event that shakes up the race is called an “October surprise.” Even a large poll conducted a few weeks ahead of the election might miss the October surprise, so pollsters might want to consider an alternative data strategy: spreading their limited resources over three smaller polls in the last few weeks of the campaign instead of one large poll.

In this example, we show how the declare, diagnose, redesign algorithm can help us think through the designs that will help us maximize one main diagnosand: the frequency of making the correct election prediction.

### 2.3.1 A “steady race” design

Here we declare a “steady race” design.

Suppose we believe the race is steady. We include this belief in the model by stipulating that candidate A is favored by 51% of the population and candidate B by 49%. We think the state of the race will stay steady through the election, so the probability that a respondent at time 1, time 2, or time 3 prefers candidate A remains constant at 51%.

The inquiry is the final vote share in the actual election at time 4. We imagine that the actual result will be centered on 51 percent, sometimes a little higher, sometimes a little lower, with a standard deviation of 1 percentage point. We imagine that this variation is not due to changes in the preferences of the electorate, but instead idiosyncratic features of election day, like the weather or voting machine problems. Because the actual result is a random draw from this distribution, candidate A wins more often than candidate B, but not always.

In the data strategy, we conduct a poll at time 1, measuring preferences as they stood a few weeks before the election. The answer strategy takes the mean of the observed data using a neat regression trick: regressing the outcome on a constant (`Y_obs ~ 1`) returns the sample mean.

**Declaration 2.1 **Steady race design

```
prefs_for_A <- 0.51
trend <- 0 # steady!
N <- 1000

# declaring the model and inquiry
steady_race <-
  declare_model(
    N = N,
    Y_time_1 = rbinom(N, size = 1, prob = prefs_for_A + 0 * trend),
    Y_time_2 = rbinom(N, size = 1, prob = prefs_for_A + 1 * trend),
    Y_time_3 = rbinom(N, size = 1, prob = prefs_for_A + 2 * trend)
  ) +
  declare_inquiry(
    A_voteshare =
      rnorm(n = 1, mean = prefs_for_A + 3 * trend, sd = 0.01))

# declaring the data and answer strategies
single_poll <-
  declare_measurement(Y_obs = Y_time_1) +
  declare_estimator(Y_obs ~ 1, model = lm_robust)

# declaring the full design
steady_race_design <- steady_race + single_poll
```

We diagnose this design with respect to the “Correct Call Rate.” Under this model, using this data and answer strategy, how frequently will we make the right decision? Since the sample consists of only 1,000 people, estimating which side of 50% we’re on is no easy task. Under these conditions, we make the right call about 68% of the time.

```
diagnosands <-
  declare_diagnosands(
    correct_call_rate = mean((estimate > 0.5) == (estimand > 0.5))
  )

diagnosis <-
  diagnose_design(design = steady_race_design,
                  diagnosands = diagnosands)
```

| Correct Call Rate |
|---|
| 0.68 |
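The diagnosis itself is nothing more than simulation. As a language-neutral illustration (a sketch in Python, separate from the DeclareDesign code above), we can draw many estimates and estimands under the steady race model and compute how often they land on the same side of 50%:

```python
import numpy as np

rng = np.random.default_rng(42)

prefs_for_A = 0.51  # share preferring candidate A under the steady race model
N = 1000            # poll sample size
sims = 10_000       # number of simulated studies

# Data and answer strategy: the estimate is the sample mean of a single
# poll of N respondents, each preferring A with probability 0.51.
estimates = rng.binomial(N, prefs_for_A, size=sims) / N

# Inquiry: election-day vote share, centered on 0.51 with a standard
# deviation of one percentage point.
estimands = rng.normal(prefs_for_A, 0.01, size=sims)

# Diagnosand: the fraction of simulations in which the poll and the
# election land on the same side of 50%.
correct_call_rate = np.mean((estimates > 0.5) == (estimands > 0.5))
```

With enough simulations, the rate settles near the value reported in the table above.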

### 2.3.2 An “October surprise” design

But what about under other circumstances? Let’s imagine an alternative model: the October surprise. Now opinion is shifting beneath our feet and candidate A is losing ground fast. At time 1, the fraction preferring A was 51%; by time 2 it is 50%, and by time 3, 49%. What we care about – the inquiry – is the fraction preferring candidate A at time 4, the time of the election, when support for candidate A will have fallen to 48%. Now if we were to run the election over and over, candidate B would win much more often than candidate A.

If we’re concerned that opinion is shifting, we might want to spread our 1,000 subjects over all three periods in the run-up to the election to track the shift. This alternative data strategy randomly samples a third of the units at time 1, a third at time 2, and a third at time 3. In the answer strategy, we run an ordinary least squares regression of the outcome on time to estimate the average opinion trend, then use the estimated slope to predict average opinion at time 4, the time of the election.

**Declaration 2.2** October surprise design

```
library(modelr)

trend <- -0.01 # "surprising" trend away from candidate A

# Declaring the new model and inquiry
october_surprise <-
  declare_model(
    N = N,
    Y_time_1 = rbinom(N, size = 1, prob = prefs_for_A + 0 * trend),
    Y_time_2 = rbinom(N, size = 1, prob = prefs_for_A + 1 * trend),
    Y_time_3 = rbinom(N, size = 1, prob = prefs_for_A + 2 * trend)
  ) +
  declare_inquiry(
    A_voteshare =
      rnorm(n = 1, mean = prefs_for_A + 3 * trend, sd = 0.01)
  )

# Declaring the new data and answer strategies
three_polls <-
  declare_assignment(time = complete_ra(N, conditions = 1:3)) +
  declare_measurement(Y_obs = reveal_outcomes(Y ~ time)) +
  declare_estimator(
    Y_obs ~ time,
    model = lm_robust,
    model_summary = ~add_predictions(model = .,
                                     data = data.frame(time = 4),
                                     var = "estimate")
  )

october_surprise_design <- october_surprise + three_polls
```
```

Now we diagnose the October surprise design with respect to the same “Correct Call Rate.” The diagnosis shows that with this new design, the correct call rate is again 68%.

```
diagnosis <-
  diagnose_design(design = october_surprise_design,
                  diagnosands = diagnosands)
```

| Correct Call Rate |
|---|
| 0.68 |
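To see the mechanics of the three polls answer strategy stripped of the DeclareDesign machinery, here is a minimal sketch in Python of a single simulated study under the October surprise model, with `numpy.polyfit` standing in for `lm_robust` plus `add_predictions`:

```python
import numpy as np

rng = np.random.default_rng(7)

prefs_for_A = 0.51
trend = -0.01  # support for A slips one percentage point per period
N = 999        # respondents, split evenly across three poll waves

# Data strategy: poll a third of the sample at each of times 1, 2, 3;
# preferences drift downward by one point per period.
times = np.repeat([1, 2, 3], N // 3)
y_obs = rng.binomial(1, prefs_for_A + (times - 1) * trend)

# Answer strategy: OLS of the outcome on time, extrapolated to time 4.
slope, intercept = np.polyfit(times, y_obs, 1)
prediction = intercept + slope * 4
```

The extrapolated prediction hovers around 48% on average, but any single study is noisy: the slope is estimated from only three poll waves.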

### 2.3.3 Choosing between empirical designs

If the steady race theory is right and we do a single poll, we make the right call about 68% of the time. And if the October surprise theory is right, and we do three polls, we also make the right call about 68% of the time. We don’t know which theory is right, so which empirical strategy should we implement – one poll or three?

To answer this question, we engage in redesign: we cross theoretical models with empirical strategies, considering what happens when we apply the single poll design to the October surprise theory, or the three polls design to the steady race theory. Redesign can be accomplished by putting all four designs into `diagnose_design`:

```
diagnosis <-
  diagnose_design(
    design_1 = steady_race + single_poll,
    design_2 = steady_race + three_polls,
    design_3 = october_surprise + single_poll,
    design_4 = october_surprise + three_polls,
    diagnosands = diagnosands
  )
```

Figure 2.2 shows that each empirical design does best when it matches the corresponding theoretical model of the race, which is to be expected. But which empirical design does better under the *other* theory? If the October surprise hits, the single poll design is terrible: it makes the right call only 25% of the time, because it relies on preferences measured before the big scandal changes political attitudes. How does the three polls design fare under the steady race theory? It is not as strong as the single poll design in this setting, but it still makes the correct call 56% of the time.

If we are *uncertain* about whether there will be an October surprise (and we should be; surprising things happen), we have to weigh the upsides and downsides of each approach. If we think both theoretical models are equally likely, then we should conduct three polls, since that strategy does well when there is a surprise, and moderately well when there isn’t.
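When we place equal weight on the two models, the weighing can be made explicit with a line of arithmetic. As a sketch (in Python rather than in R, using the four correct call rates reported in the text):

```python
# Correct call rates from the four diagnoses reported in the text
rates = {
    ("single poll", "steady race"): 0.68,
    ("single poll", "october surprise"): 0.25,
    ("three polls", "steady race"): 0.56,
    ("three polls", "october surprise"): 0.68,
}

# Expected rate for each strategy under equal prior weight on the models
expected = {
    strategy: 0.5 * rates[(strategy, "steady race")]
            + 0.5 * rates[(strategy, "october surprise")]
    for strategy in ("single poll", "three polls")
}
```

Under equal weights, the single poll strategy’s expected correct call rate is 0.465 while the three polls strategy’s is 0.62, which is why we prefer three polls when we are genuinely uncertain between the two models.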

This example illustrates the value of entertaining many theoretical models when declaring research designs. Here we found that neither empirical strategy beats the other across both theoretical settings, but one did well under more circumstances. We could entertain many more models than these: we could vary, for example, how close the race is, how many candidates are running, or the slope of the trend. And we could of course imagine much more complex data and answer strategies that would outperform either of these two under a much wider range of circumstances than we considered. Elaborations like these will be explored in detail in Part II, then applied across common designs in Part III.

## 2.4 Putting designs to use

The two pillars of our approach are the language for describing research designs (*MIDA*) and the algorithm for selecting high-quality designs (declare, diagnose, redesign). Together, these two ideas can shape research design decisions throughout the lifecycle of a project. The full set of implications is drawn out in Part IV but we emphasize the most important ones here.

Broadly speaking, the lifecycle of an empirical research project has four phases: brainstorming, planning, realization, and integration. Planning, realization, and integration describe what happens before, during, and after the implementation of a research design. The inclusion of a pre-phase (brainstorming) reflects the idea that research doesn’t begin with planning; it begins with some spark of inspiration.

The inspiration for a good research project can come from many sources: frustration with an article you’re reading, a golden opportunity with a potential research partner, a conversation with a colleague (or adversary!). The spark of an idea might be some bit of a model, perhaps an inquiry in particular, maybe a portion of a data strategy, or just an itch to apply a new answer strategy you learned about. Wherever that kernel of an idea starts, the purpose of brainstorming is to develop each element (*M*, *I*, *D*, and *A*) such that the design becomes a coherent whole, *MIDA*.

After an idea has been fleshed out sufficiently, it’s time to start planning. Planning entails some or all of the following steps, depending on the design: conducting an ethical review, seeking IRB approval, gathering criticism from colleagues and mentors, running pilot studies, and preparing pre-analysis documents. The design as encapsulated by *MIDA* will go through many iterations and refinements during this period. Planning is the time when frequent re-application of the declare, diagnose, redesign algorithm will pay the highest dividends. How should you investigate the ethics of a study? By casting the ethical costs and benefits as diagnosands. How should you respond to criticism, constructive or not? By re-interpreting the feedback in terms of *M*, *I*, *D*, and *A*. How can you convince funders and partners that your research project is worth investing in? By credibly communicating your study’s diagnosands: its statistical power, its unbiasedness, its high chance of success, however the partner or funder defines it. What belongs in a pre-analysis plan? You guessed it – a specification of the model, inquiry, data strategy, and answer strategy.

Realization is the phase of research in which all those plans are executed. You implement your data strategy in order to gather information from the world. Once that’s done, you follow your answer strategy in order to finally generate answers to the inquiry. Of course, that’s only if things go exactly according to plan, which has never happened once in our own careers. Survey questions don’t work as we imagined, partner organizations lose interest in the study, subjects move or become otherwise unreachable. A critic or a reviewer may insist you change your answer strategy, or may think a different inquiry altogether is theoretically appropriate. You may yourself change how you think of the design as you embark on writing up the research project. It is inevitable that some features of *MIDA* will change during the realization phase. Some design changes have very bad properties, like sifting through the data ex post, finding a statistically significant result, then back-fitting a new *M* and a new *I* to match the new *A*. Indeed, if we declare and diagnose this actual answer strategy (sifting through data ex post), we can show through design diagnosis that it is badly biased for any of the inquiries it could end up choosing. Other changes made along the way may help the design quite a bit. If the planned design did not include covariate adjustment, but a friendly critic suggests adjusting for the pre-treatment measure of the outcome, the “standard error” diagnosand might drop nicely. The point here is that design changes during the implementation process, whether necessitated by unforeseen logistical constraints or required by the review process, can be understood in terms of *M*, *I*, *D*, and *A* by reconciling the planned design with the design as implemented.

A happy realization phase concludes with the publication of results. But the research design lifecycle is not finished: the study and its results must be integrated into the broader community of scientists, decisionmakers, and the public. Studies should be archived, along with design information, to prepare for reanalysis. Future scholars may well want to reanalyze your data in order to learn more than is represented in the published article or book. Good reanalysis of study data requires a full understanding of the design as implemented, so archiving design information along with code and data is critical. Not only may your design be reanalyzed, it may also be replicated with fresh data. Ensuring that replication studies answer the same theoretical questions as original studies requires explicit design information without which replicators and original study authors may simply talk past one another. Indeed, as your study is integrated into the scientific literature and beyond, you should anticipate disagreement over your claims. Resolving disputes is very difficult if parties do not share a common understanding of the research design. Finally, you should anticipate that your results will be formally synthesized with others’ work via meta-analysis. Meta-analysts need design information in order to be sure they aren’t inappropriately mixing together studies that ask different questions or answer them too poorly to be of use.