# 3 Research design principles

The declare, diagnose, redesign framework suggests a set of eleven principles that can guide the design process. Not all principles are equally important in all cases, but we think all are worth considering when developing and assessing a design. This section offers succinct discussions of the eleven principles. We will discuss the implications of these principles for specific design choices throughout the book, which is just to say that not everything we mean to communicate with these principles will be immediately obvious on a first read.

**Design principles**

- Design early
- Design often
- Entertain many models
- Select answerable inquiries
- Include strategies for descriptive, causal, and inductive inference
- Declare data and answer strategies as functions
- Seek M:I::D:A parallelism
- Specify diagnosands as a function of research goals
- Diagnose to break designs
- Diagnose whole designs
- Design to share

**Principle 3.1** Design early

Designing an empirical project entails declaring, diagnosing, and redesigning the components of a research design: its model, inquiry, data strategy, and answer strategy. The design phase yields the biggest gains when we design early. By frontloading design decisions, we can learn about the properties of a design while there is still time to improve it. Once data strategies are implemented – units sampled, treatments assigned, and outcomes measured – there’s no going back. While applying the answer strategy to the revealed dataset, you might well wish you’d gathered data differently, or asked different questions. Post-hoc, we always wish our previous selves had planned ahead.

A deeper reason for designing early, beyond avoiding regret, is that the declaration, diagnosis, and redesign process inevitably changes designs, almost always for the better. Revealing how the four design elements are interconnected yields improvements to each. If the answer strategy and inquiry are mismatched, the designer faces a choice to change one or the other. If the units sampled in the data strategy are theoretically inappropriate, alternative participants might be selected. Models may reveal assumptions that require defense through additional data collection. Better inquiries, with greater theoretical leverage over the model, may be identified. Inquiries that cannot be answered may be replaced.

Designs are fine-tuned through redesign, which entails diagnosing across the feasible combinations of design parameters and selecting from the best-performing combinations. Redesign usually focuses on envisioning changes to the data strategy: alternative sampling procedures, assignment probabilities, or measurement techniques. Redesign can also consider changes to the answer strategy such as variations to the estimation or inferential procedures. These choices are almost always better made before any data are collected or analyzed.^{2}

A common objection to planning ahead is that inevitably, plans change. Empirical researchers encounter empirical problems: missing data, archival documents that cannot be traced, noncompliance with treatment assignments, evidence of spillovers, and difficulties recontacting subjects. Insofar as these are predictable problems, it can be useful to think of them as *parts* of your design not *deviations* from your design. Answer strategies can be developed that anticipate these problems, and account for them, including if-then plans for handling each potential problem. More fundamentally, anticipated failures themselves can be included in your model so that you can diagnose the properties of different strategies, in advance, given risks of different kinds.

**Principle 3.2** Design often

Designing early does not mean being inflexible. In practice, unforeseen circumstances may change the set of feasible data and answer strategies. Implementation failures due to nonresponse, noncompliance, spillovers, inability to link datasets, funding contractions, or logistical errors are common ways the set of feasible designs might contract. The set of feasible designs might expand if new data sources are discovered, additional funding is secured, or if you learn about a new piece of software. Whether the set expands or contracts, we should once again declare, diagnose, and redesign given the new realities.

“Designing often” usually happens in the middle of implementation. The output of that process is usually a modification to the data strategy and any compensating changes to the answer strategy. Sometimes, the unexpected event necessitates a change to the inquiry itself. We need to diagnose over the new feasible set of designs in order to make a new best choice.

The principle that we should design often also extends beyond the implementation. A colleague may suggest an alternative answer strategy; whether or not the suggestion is a good one is a design question that is often best settled through explicit declaration and diagnosis of the proposed alternative. A critic may charge that the model is theoretically ill-specified. Assessing the consequences of this contention requires us to diagnose the alternative designs, holding the inquiry, data strategy, and answer strategy constant while varying the models. Designing often means engaging in declaration, diagnosis, and redesign all throughout the research design lifecycle.

**Principle 3.3** Entertain many models

When we design a research study, we have in mind a model of how the world works. But really, this model is not just one model; it’s a family of possibilities. We think a set of variables are related, but we are uncertain in what ways and how closely related they are. Our family of possibilities includes plausible ranges for the parameters about which we are uncertain. The principle that we should entertain many models suggests that designers should expand the set of possible models they have in mind when deciding how to conduct research. We want to entertain many models because we want to be sure that the true model – how the world really works – is represented in the set we consider.

One way to entertain many models is to explicitly consider threats to inference. Randomized experiments can generate unbiased estimates of average causal effects under some models, but not others. Experiments can be biased for a population average causal effect if the sample is not representative, if the assignment affects outcomes via paths that are not mediated by the treatment, if the outcomes of one unit depend on the treatment status of others, and in many other settings besides. Threats to inference like these represent possible models in the family of possibilities that we entertain.

When we entertain many models, we learn the circumstances in which our designs perform well and poorly. Regression-like approaches for observational causal inference work well when the key “selection-on-observables” assumption holds and less well otherwise. “Doubly-robust” estimation is so named because it performs well when we correctly guess the outcome model, the selection model, or both, but poorly when we are wrong about both. “Design-based,” “nonparametric,” or “agnostic” approaches to inference enjoy the property that they often work well under a larger class of models than “model-based” approaches, since model-based approaches rely on the assumption that the stipulated model is the correct one.
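To make this concrete, here is a minimal simulation sketch in Python (the parameter values and setup are our own toy choices, not an example drawn from this book’s software): a difference-in-means answer strategy recovers the true effect under a model in which treatment take-up is unrelated to an unobserved confounder, and is badly biased under a model in which it is not.

```python
import random
import statistics

random.seed(1)
TRUE_EFFECT = 1.0

def simulate(confounded, n=5000):
    """One draw of the data under a model with or without an unobserved confounder."""
    data = []
    for _ in range(n):
        u = random.gauss(0, 1)  # unobserved confounder
        # Take-up depends on u only under the confounded model
        z = 1 if random.gauss(0, 1) + (u if confounded else 0) > 0 else 0
        y = TRUE_EFFECT * z + u + random.gauss(0, 1)
        data.append((z, y))
    return data

def diff_in_means(data):
    treated = [y for z, y in data if z == 1]
    control = [y for z, y in data if z == 0]
    return statistics.mean(treated) - statistics.mean(control)

for confounded in (False, True):
    estimates = [diff_in_means(simulate(confounded)) for _ in range(100)]
    print(f"confounded={confounded}: mean estimate {statistics.mean(estimates):.2f}"
          f" (truth {TRUE_EFFECT})")
```

Running the same answer strategy over both members of the model family shows in which worlds the design performs well and in which it does not.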

The core idea is that your design gets stronger if it continues to perform well in many circumstances. Entertaining many models leads us to choose these more robust empirical designs.

**Principle 3.4** Select answerable inquiries

This principle has two components.

First, you should *have* an inquiry. Oddly, it is possible to carry out data analysis — for instance, running a regression of \(Y\) on \(X\) — and get something that looks like an answer without specifying any question in particular.

Second, your inquiry should be answerable. That’s trickier than it sounds. We can think of an inquiry as being answerable in theory and in practice.

An inquiry is answerable “in theory” if you can write down a model such that, if that model were the true model, and you knew the features of the model, you could answer the question. Simple-sounding questions like “Did Germany cause the Second World War?” or “Did New Zealand do well against Covid-19 because Prime Minister Jacinda Ardern was a woman?” can turn out to be difficult to ask and answer. This difficulty, we think, hints at when a question is poorly posed. We must be able to describe *some* world such that the inquiry has a precise answer. In our framework, an inquiry is answerable in theory if for some \(m\) and \(I\), \(I(m) = a_m\) exists.

An inquiry is answerable “in practice” if the inquiry could be answered with data generated by a feasible data strategy, even if difficult-to-execute. That is, an inquiry is answerable in practice if for some \(D\) and \(A\), \(A(D = d^*) = a_{d^*}\) is an estimate of \(a_{m^*}\).

Selecting answerable inquiries means choosing inquiries that are answerable both in theory and in practice. Some inquiries that are answerable in theory are not answerable in practice, except under a knife-edge subset of models.

**Principle 3.5** Include strategies for descriptive, causal, and inductive inference

Empirical research designs can face three inferential challenges. Research designs that seek to draw causal inferences encounter the fundamental problem of causal inference: we can observe a unit in its treated state or in its untreated state, but not both. Designs that seek to measure latent variables using measured variables face the challenge of descriptive inference: measurements differ from the concepts they measure. Designs that seek to draw inferences about non-study units from study units face the challenge of generalization inference: non-study units might differ from study units.

These three inferential challenges can all be thought of as missing data problems. Some information we would like to have is out of reach. For causal inference, we would like to observe counterfactual outcomes, but they are not observable. We can observe \(Y_i(\mathrm{treated} = 1)\) or \(Y_i(\mathrm{treated} = 0)\) but not both. For descriptive inferences, we would like to observe latent values, but we have to content ourselves with measurements. We can observe \(Y_i\), but not \(Y_i^*\). For generalization, we would like to observe nonstudy units, but by definition, they are not available for observation. We can observe \(Y_i(\mathrm{sampled} = 1)\) but we can never observe \(Y_j(\mathrm{sampled} = 0)\).
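A few lines of Python make the missingness concrete (the potential outcomes and the constant effect of 1.5 are hypothetical numbers chosen purely for illustration): whichever condition a unit is assigned to, the other potential outcome is missing.

```python
import random

random.seed(2)

# The model: every unit has both potential outcomes, Y(0) and Y(1)
units = []
for i in range(6):
    y0 = round(random.gauss(10, 2), 1)               # untreated potential outcome
    units.append({"Y0": y0, "Y1": round(y0 + 1.5, 1)})  # constant effect of 1.5

# The data strategy: assignment reveals exactly one potential outcome per unit
for i, unit in enumerate(units):
    z = random.randint(0, 1)
    observed = unit["Y1"] if z == 1 else unit["Y0"]
    missing = "Y(0)" if z == 1 else "Y(1)"
    print(f"unit {i}: Z={z}, observed Y={observed}, {missing} is missing")
```

In the simulation we get to see both columns of the potential outcomes table; in real data, one column is always out of reach.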

Confronting these problems requires strong research designs with inferential strategies targeted to the inferential challenges you face. Many such strategies exist. For instance, even though we can’t observe the counterfactual outcome for any particular unit, we can design studies to yield good estimates of *average* treated and untreated outcomes, from which we can construct estimates of average causal effects. Even though we don’t observe latent variables directly, we can sometimes triangulate their values through the aggregation of multiple measures or detailed understanding of the measurement process. Even though we don’t observe any nonstudy units in particular, we can employ sampling designs that license stronger inferences about the average outcomes of some nonstudy units. For all of these strategies, there are analytic results we can draw on for justification, and in all cases we can demonstrate that our design works well at least within the model set we consider.

**Principle 3.6** Declare data and answer strategies as functions

The data strategy is the function that, when applied to the real world, produces the realized dataset. The answer strategy is the function that, when applied to the realized data set, yields the empirical answer to the inquiry.

The distinction between the data strategy and the realized data is important. The realized data — what we will ultimately download onto our computer — represent just one draw of the data that can be generated by a data strategy. Under different randomizations, different units would be treated or sampled into the study, each resulting in a different realized dataset. Even with nonrandomized data strategies, we can think of the observed data as being one draw from an underlying event generating process that could have been different.
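As an illustration (in Python, with a stand-in “world” of our own invention), declaring the data strategy as a function makes plain that the realized dataset is one draw among many:

```python
import random

def data_strategy(population, n=4, seed=None):
    """A data strategy is a function: applied to the world, it yields one dataset."""
    rng = random.Random(seed)
    return sorted(rng.sample(population, n))

world = list(range(100))  # a toy stand-in for the real world

# The realized dataset is just one draw; other draws were possible
print(data_strategy(world, seed=1))
print(data_strategy(world, seed=2))
print(data_strategy(world, seed=3))
```

Each call is a different “possible study”; the one we happen to run is a single realization of the same function.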

Similarly, we have to distinguish between the answer strategy and the answer that is produced. If the realized data were different, the empirical answer produced would be different. Thinking of the answer strategy as a function underlines how the empirical answer is just an estimate of the truth, not the truth itself. In general, we can’t know if \(a_{d^*}\) equals \(a_{m^*}\) exactly.

Critically, the data and answer functions should be able to provide outputs for a wide variety of inputs: they should have a wide domain. In the case of the data strategy we should be able to envision what data will be produced for a variety of worlds, including worlds we have not imagined. In the case of the answer strategy we should be able to envision what answer we will get for different types of data that we might find, including data we have not imagined. Often setting up functions in this way requires the functions to operate as *procedures* that are responsive to inputs.

Adaptive random assignment schemes are an example of a data strategy as a procedure: in each round, more effective treatments are assigned to more and more units — a change to the assignment probabilities each round. Similarly, the decision whether to include a control variable in a regression might depend on how well the regression specification fits the data. The three-step procedure — fit regressions, assess fit, report the coefficient from the best-fitting specification — is what should then be declared as the answer strategy.
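The idea of declaring the whole procedure, rather than only the specification that happens to get reported, can be sketched with a deliberately toy decision rule (our own invention, not an example from this book): the answer strategy fits two candidate summaries, assesses their fit, and reports the better-fitting one.

```python
import statistics

def answer_strategy(data):
    """A procedure, not a fixed specification: fit candidates, assess fit, report.
    The fit criterion (total absolute error) is an illustrative toy rule."""
    y = [row["y"] for row in data]
    # Step 1: fit the candidate summaries
    candidates = {"mean": statistics.mean(y), "median": statistics.median(y)}
    # Step 2: assess fit of each candidate
    def total_abs_error(center):
        return sum(abs(v - center) for v in y)
    best = min(candidates, key=lambda name: total_abs_error(candidates[name]))
    # Step 3: report the estimate from the best-fitting candidate
    return best, candidates[best]

data = [{"y": v} for v in [1, 2, 2, 3, 40]]  # an outlier-prone draw
print(answer_strategy(data))                 # → ('median', 2)
```

Declaring `answer_strategy` itself, rather than “the median,” is what lets diagnosis account for the data-dependent choice.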

The advantage of specifying flexible functions that can handle diverse inputs is that diagnosis takes account of the full procedure and not just the final result. A diagnosis of a design that includes only the final specification that happened to be used will be wrong, because it does not properly capture the distribution of what would have happened under different circumstances.

**Principle 3.7** Seek *M*:*I*::*D*:*A* parallelism

The model and inquiry form the theoretical half of a research design and the data and answer strategies form the empirical half. Designs in which the relationship of *M* to *I* is parallel to the relationship of *D* to *A* are often strong, precisely because of the tight correspondence across the theoretical and empirical halves. Some intuition for this idea can be read off our formalism that the theoretical answer can be written \(I(m^*) = a_{m^*}\) and the empirical answer as \(A(D = d^*) = a_{d^*}\). In words, if the data strategy produces data that are “like” the events generated by a model, and if *A* is like *I*, the estimate will be like the estimand. When the theoretical and empirical halves of the design are parallel, then we can write the design as an analogy: *M* is to *I* as *D* is to *A*.

This idea is a version of the “plug-in principle”: under many data strategies, we can “plug in” the sample analogue of the inquiry as the answer strategy. For example, if we are interested in estimating the population mean, we draw a sample from the population and estimate it using the sample mean estimator.
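A minimal Python sketch of the plug-in principle (population size, mean, and sample size are illustrative numbers of our own choosing):

```python
import random
import statistics

random.seed(3)

# M: a simulated population standing in for the world
population = [random.gauss(50, 10) for _ in range(100_000)]

inquiry = statistics.mean(population)   # I(m): the population mean
sample = random.sample(population, 500)  # D: simple random sampling
answer = statistics.mean(sample)         # A: plug in the sample analogue of I

print(f"estimand {inquiry:.1f}, estimate {answer:.1f}")
```

Because the sample is drawn the way the inquiry is defined (a simple random draw, summarized by the same function), *A* stands to *D* as *I* stands to *M*.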

Parallelism could break down if the data strategy does not produce data like the model produces events. When data strategies introduce distortions, we can make compensating changes to the answer strategy. We restore parallelism by seeking an *A’* such that \(A'(D' = d') \approx A(D = d)\). This idea underpins the maxim “analyze as you randomize” (Fisher 1937): estimators should account for differential probabilities of assignment and other known features of the randomization procedure. Even outside randomized studies, parallelism is served when answer strategies respect known features of the data strategies.
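A Python sketch of “analyze as you randomize” (block sizes, probabilities, and baselines invented for illustration): when two blocks have different treatment probabilities, the raw difference in means is badly biased, while an inverse-probability-weighted estimator that respects the randomization procedure is not.

```python
import random
import statistics

random.seed(4)
EFFECT = 1.0

def simulate():
    """Two blocks with different baselines and different treatment probabilities."""
    rows = []
    for p, baseline in ((0.8, 10.0), (0.2, 0.0)):
        for _ in range(500):
            z = 1 if random.random() < p else 0
            y = baseline + EFFECT * z + random.gauss(0, 1)
            rows.append((p, z, y))
    return rows

def naive(rows):
    """Raw difference in means, ignoring the assignment probabilities."""
    treated = [y for _, z, y in rows if z == 1]
    control = [y for _, z, y in rows if z == 0]
    return statistics.mean(treated) - statistics.mean(control)

def ipw(rows):
    """Weight each unit by the inverse of its probability of being in its condition."""
    n = len(rows)
    treated = sum(y / p for p, z, y in rows if z == 1) / n
    control = sum(y / (1 - p) for p, z, y in rows if z == 0) / n
    return treated - control

sims = [simulate() for _ in range(100)]
print("naive:", round(statistics.mean(naive(s) for s in sims), 2))
print("ipw:  ", round(statistics.mean(ipw(s) for s in sims), 2))
```

The naive estimator mixes the block baseline difference into the effect estimate; the weighted estimator compensates for the known distortion in the data strategy.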

**Principle 3.8** Specify diagnosands as a function of research goals

When designing research, we should give careful thought to our diagnosands, the criteria by which we evaluate the qualities of candidate designs. Too often, researchers focus on a narrow set of diagnosands, and often consider them in isolation. Is the estimator unbiased? Do I have statistical power? The evaluation of a design nearly always requires balancing multiple criteria: scientific precision, logistical constraints, policy goals, as well as ethical considerations. Each of these goals can be specified as a function of an implementation of the design. Cost is a function that translates the number of units and the amount of time it took to collect and analyze data into a total cost. Scientific goals may be represented in a number of ways, such as the root mean-squared error, statistical power, or, most directly, the amount of learning from before to after the study. Ethical goals may also be translated into functions. An ethical diagnosand might be the number of minutes of time taken from participants of the study or whether any participants were harmed.

A diagnosis of designs across multiple criteria provides a multidimensional value statement for each design, from which we can select the best feasible design. Specifying diagnosands intentionally forces us to provide a weighting scheme across possibly competing ethical, logistical, and scientific values. Making this weighting scheme explicit is important: finding the best design in this high-dimensional space is difficult, and separately evaluating designs on each dimension and then weighting across dimensions can help.
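Each diagnosand can be written literally as a function of simulated runs of the design. A Python sketch (a toy two-arm design; the effect size, per-unit cost, and simulation counts are invented for illustration):

```python
import random
import statistics

random.seed(5)
TRUTH = 0.2  # illustrative true effect

def one_run(n):
    """Simulate one study of total size n and return its effect estimate."""
    treated = [TRUTH + random.gauss(0, 1) for _ in range(n // 2)]
    control = [random.gauss(0, 1) for _ in range(n // 2)]
    return statistics.mean(treated) - statistics.mean(control)

def diagnose(n, sims=500, cost_per_unit=5):
    """Diagnosands as functions of the simulated runs (and of the budget)."""
    estimates = [one_run(n) for _ in range(sims)]
    se = 2 / n ** 0.5  # known sampling sd of the estimate in this toy design
    return {
        "bias": statistics.mean(estimates) - TRUTH,
        "rmse": statistics.mean((e - TRUTH) ** 2 for e in estimates) ** 0.5,
        "power": statistics.mean(1.0 if abs(e) / se > 1.96 else 0.0 for e in estimates),
        "cost": cost_per_unit * n,
    }

for n in (100, 400, 1600):
    print(n, diagnose(n))
```

The output is the multidimensional value statement: larger designs buy power and precision at a cost, and the weighting across these columns is the researcher’s explicit choice.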

**Principle 3.9** Diagnose to break designs

A corollary principle to “entertain many models” is that we should know the models under which our design performs well and those under which it performs poorly. We want to diagnose over many models to find where designs break.

Our design might assume, for instance, that one variable is not affected by another, and the validity of our answer might depend on the extent to which this is true. A design that contains a set of models including violations of this assumption can be used to assess how much the assumption matters, how bad a violation has to be to produce misleading results, and which assumptions are critical for inference and which are not. In short, we want to diagnose over a model set that includes the worlds in which our design works and the worlds in which we run into problems.
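One way to find where a design breaks is to sweep the magnitude of an assumption violation. A Python sketch (our own toy example: the complete-case mean of a survey outcome when response may depend on the outcome itself, violating a missing-at-random assumption):

```python
import math
import random
import statistics

random.seed(6)
TRUE_MEAN = 0.0

def complete_case_mean(mnar_strength, n=4000):
    """Mean among respondents when responding may depend on the outcome itself."""
    observed = []
    for _ in range(n):
        y = random.gauss(0, 1)
        # Response probability; at strength 0 the assumption holds exactly
        p_respond = 1 / (1 + math.exp(-(0.5 - mnar_strength * y)))
        if random.random() < p_respond:
            observed.append(y)
    return statistics.mean(observed)

# Sweep the violation from "assumption holds" to "badly violated"
for strength in (0.0, 0.5, 1.0, 2.0):
    bias = statistics.mean(complete_case_mean(strength) for _ in range(50)) - TRUE_MEAN
    print(f"violation strength {strength}: bias of complete-case mean = {bias:+.2f}")
```

The sweep shows not just *that* the design can break, but how severe the violation must be before the answer becomes misleading.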

All designs break under some models, so the fact that a design ever breaks is no criticism. As research designers, we just want to know which models pose problems and which do not.

**Principle 3.10** Diagnose whole designs

When we diagnose, we evaluate designs in their entirety. Too often, researchers evaluate parts of their designs in isolation: is this a good question? Is this a good estimator? What’s the best way to sample? Design diagnosis requires knowing how each part of the design fits together. If you say, “my design has 80% power,” we want to know, “power for what?” The power diagnosand could refer to an average effect estimator or to a subgroup effect estimator. If we ask, “What’s your research design?” and you respond “It’s a regression discontinuity design,” we’ve learned what class your answer strategy might be, but we don’t have enough information to decide whether it’s a strong design until we learn about the model, inquiry, data strategy, and other parts of the answer strategy.

In practice, we do this by declaring the entire design and asking how it performs, from start to finish, with respect to specified diagnosands. This process requires a sufficiently complete design declaration. Indeed, the ability to run through a design to the point where a diagnosis can be undertaken (“diagnosand-completeness”) is a good indicator of an adequately declared design.

**Principle 3.11** Design to share

The declaration, diagnosis, and redesign process can improve the quality of the research designs you implement. This same process can also help you communicate your work, justify your decisions, and contribute to the scientific enterprise. Formalizing design declaration makes this sharing easier. By coding up a design as an object that can be run, diagnosed, and redesigned, you help other researchers see, understand, and question the logic of your research.

We urge you to keep this sharing function in mind as you write code, explore alternatives, and optimize over designs. An answer strategy that is hard-coded to capture your final decisions might break when researchers try to modify parts. Alternatively, designs can be created specifically to make it easier to explore neighboring designs, let others see why you chose the design you chose, and give them a leg up in their own work. In our ideal world, when you create a design, you contribute it to a design library so others can check it out and build on your good work.