21 Planning
We list “design early” among our research design principles (Principle 3.4) to emphasize the gains from early planning. Research projects are very long journeys: going from the kernel of an idea to a published paper typically takes multiple years. Once the data strategy has been implemented, you’re stuck with it, so mindful planning beforehand is important. “Measure twice, cut once.”
The planning process changes designs. We work out designs that meet ethical as well as scientific standards, accommodate the needs of research partners, and operate within financial and logistical constraints. When we are insufficiently sure of key inputs to design declarations, we can run pilots, but we need to be careful about how we incorporate what we learn from them. Finally, writing up a declaration, or a PAP that includes one, is a useful moment to get feedback from our peers to improve the design. We discuss each of these steps in this chapter.
21.1 Ethics
As research designers, we have ethical obligations beyond the requirements of national laws and the regulations of institutional review boards.
For a long time, thinking about research ethics has been guided by the ideas in the Belmont Report, which emphasizes respect for persons, beneficence, and justice. More recently, attention has broadened beyond the care of human subjects to include the well-being of collaborators and partners and the broader social impact of research. Social scientific professional associations have developed principles and guidelines to help think through these issues. Key references include:
- American Political Science Association ethics guidelines
- American Sociological Association Code of Ethics
- American Psychological Association Ethical Principles of Psychologists and Code of Conduct
The considerations at play vary across context and methods. For example, Teele (2021) describes ethical considerations in field experimentation, Humphreys (2015) focuses on development settings, Slough (2020) considers the ethics of field experimentation in the context of elections, and Wood (2006) and Baron and Young (2021) consider ethical challenges specific to field research in conflict settings.
However, a common meta-principle underlying many of these contributions is the injunction to consider the ethical dimensions of your work ex ante and to report on ethical implications ex post. Lyall (2022) specifically connects ethical reflection to ex ante design considerations.
We encourage you to engage with ethical considerations in this way, early in the research design lifecycle. Some design declarations and diagnoses elide ethical considerations. For instance, a declaration that is diagnosand-complete for statistical power may tell you little about the level of care and respect accorded to subjects. Many declarations are diagnosand-complete for bias, but obtaining an unbiased treatment effect estimate is not always the highest goal.
In principle, ethical diagnosands can be directly incorporated into the declare-diagnose-redesign framework. Diagnosands could include the total cost to participants, how many participants were harmed, the average level of informed consent measured by a survey about comprehension of study goals, or the risks of adverse events. More complex ethical diagnosands may be possible as well: Slough (2020) provides a formal analysis of the “aggregate electoral impact” diagnosand for experiments that take place in the context of elections. To illustrate, we describe two specific ethical diagnosands here, costs and potential harms, though many others may apply in particular research scenarios.
Costs. A common ethical concern is that measurement imposes a cost on subjects, if only by wasting their time. Subjects’ time is a valuable resource they often donate willingly to the scientific enterprise by participating in a survey or other measurement. Although subjects’ generosity is sometimes repaid with financial compensation, in many scenarios direct payments are not feasible. Regardless of whether subjects are paid, the costs to subjects should be top of mind when designing the study and can be explicitly specified as a diagnosand.
Potential harms. Different realizations of the data from the same data strategy may differ in their ethical implications. Ex post, a study may not have ended up harming subjects, but ex ante, there may have been a risk of harm (Baron and Young 2021). The design’s ethical status depends on judgments about potential harms and potential participants: not only what did happen, but what could have happened. The potential harm diagnosand might be formalized as the maximum harm that could eventuate under any realization of the data strategy. Researchers could then follow (for example) a minimax redesign procedure to find the design that minimizes this maximum potential harm.
When the design is diagnosed, we can characterize the ethical status of possible realizations of the design as well as the ethical status of the distribution of these realizations. Is the probability of harm minimal “enough”? Is the degree of informed consent sufficient? Given that these characteristics vary across designs and across realizations of the same design, writing down concretely both the measure of the ethical status and the ethical threshold can help structure thinking. These diagnoses and the considerations that inspire them can be shared in funding proposals, preanalysis plans, or other reports. Articulating them in a design may help clarify whether proper account was taken of risks ex ante, or, more usefully, remind researchers to be sure to take account of them.
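To fix ideas, here is a minimal sketch in Python of how ethical diagnosands like these might be computed by simulating many realizations of a data strategy and then applying a minimax rule. The cost-per-subject figure, harm probability, and candidate sample sizes are illustrative assumptions, not values from any real study.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_realization(n_subjects, minutes_per_subject=10, p_adverse=0.001):
    """One hypothetical realization of the data strategy:
    every subject contributes time; adverse events occur at random."""
    time_cost = n_subjects * minutes_per_subject / 60   # hours of subjects' time
    n_harmed = rng.binomial(n_subjects, p_adverse)       # subjects experiencing harm
    return time_cost, n_harmed

def ethical_diagnosands(n_subjects, sims=5_000):
    draws = [simulate_realization(n_subjects) for _ in range(sims)]
    costs, harms = map(np.array, zip(*draws))
    return {
        "expected_time_cost_hours": costs.mean(),
        "prob_any_harm": (harms > 0).mean(),
        "max_potential_harm": harms.max(),   # worst case over simulated realizations
    }

# Minimax-style redesign: among candidate sample sizes, pick the one whose
# worst-case harm is smallest (ties broken by expected cost to subjects).
candidates = {n: ethical_diagnosands(n) for n in (200, 500, 1000)}
minimax_choice = min(
    candidates,
    key=lambda n: (candidates[n]["max_potential_harm"],
                   candidates[n]["expected_time_cost_hours"]),
)
print(candidates)
print("minimax design:", minimax_choice)
```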
21.1.1 Illustration: Estimating expected costs and expected learning
We illustrate how researchers can weigh the trade-offs between the value of research and its ethical costs with a hypothetical audit study of discrimination in government hiring. We imagine that three characteristics of applicants to a municipal government job are randomized and whether the applicant receives a callback is recorded. The three candidate characteristics are race (Black or White), area of residence (urban or suburban), and high school attended (East or West).
Suppose we could rank the scientific importance of measuring discrimination along all three dimensions, and we judged race-based discrimination to be of high importance and discrimination on the basis of residence or high school attended to be of medium and low importance, respectively.
The value of the research is then a function of the importance of the inquiries, of course, but also of how much we learn about them. We proxy for the learning from the experiment with sample size: the higher the \(N\), the more we learn, but with decreasing marginal returns (it is much better to have a sample of 100 than 10; it matters less whether it is 1,010 or 1,100). Figure 21.1 shows the three expected learning curves, labeled by the importance of the inquiry.
We turn now to the expected cost side of the calculation. Because the job applicants are fictitious but appear real, a primary ethical concern in audit experiments (alongside concerns about deception) is the time hiring managers waste reviewing applications that are not real. In the case of government hiring, it is public money spent on that review. Suppose the cost to participants is linear in the number of applications, with each application taking about ten minutes to review. We represent the cost to participants as the purple line in Figure 21.1.
We have placed the costs to participants on the same scale as the value of the research by assigning a societal value both to the research and to the administrative time. Doing so requires taking positions that can be read off from the definitions of the diagnosands. When benefits exceed costs (upper blue region), we decide to move forward with the research; when costs exceed benefits (lower gray region), we do not conduct the study.
The key takeaway from the graph is that there is a region at low sample sizes where the cost to participants exceeds the benefits of the research, because the answers we would get are very imprecise. We don’t learn enough about the inquiry, despite its importance, to make it worth the hiring managers’ time. In the medium-importance case there is a “Goldilocks” range: a region of the benefits curve (highlighted in blue) where it is worth doing the study, flanked by two regions (highlighted in gray) where it is not. In the left gray region the sample is too small, so the value of the research is low both because the inquiry is of only medium importance and because we do not learn enough about it. In the right gray region we would learn a lot about the inquiry, but the cost in hiring managers’ time is too high to justify what we learn, because the inquiry is not important enough.
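A minimal sketch of the underlying calculation: learning is proxied here by statistical power and cost by reviewers’ time, and the dollar values, effect size, and functional forms are illustrative assumptions rather than the values used to draw Figure 21.1.

```python
import numpy as np
from scipy.stats import norm

n = np.arange(50, 5001, 50)              # candidate numbers of fictitious applications

# Learning proxied by power to detect an assumed difference in callback rates.
effect_size = 0.2
power = norm.cdf(effect_size / (2 / np.sqrt(n)) - 1.96)

# Value of the research: an assumed dollar value per inquiry, scaled by how much we learn.
importance = {"race (high)": 40_000, "residence (medium)": 15_000, "high school (low)": 4_000}

# Cost: ten minutes of a hiring manager's time per application at an assumed hourly value.
cost = n * (10 / 60) * 50                # dollars

for label, value in importance.items():
    net = value * power - cost           # benefit minus cost at each sample size
    best = int(n[np.argmax(net)])
    verdict = "proceed" if net.max() > 0 else "do not run"
    print(f"{label}: best N = {best}, net benefit = {net.max():,.0f} ({verdict})")
```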
In short, ethical decisions can in principle be informed by diagnosis of both how much we learn and how much the study costs participants (along with other ethical costs). The key benefit of this approach is a sharper understanding of the gains from design decisions on the research side; the problem of placing a value on those gains is not, of course, solved by the framework, but using it makes clearer what positions researchers are taking on these issues.
Calculations of this form, however, always carry the risk that researchers overestimate the importance of their research or are blind to ethical costs (in this case, for instance, an additional cost might arise if, because of the study, subjects or other officials became more skeptical of future real applicants). In practice, strategies that maximize autonomy for the implicated actors can make these choices easier. Here, for instance, prior consultation with members of the subject population about how they weigh the benefits of the research against its costs, and how they assess the costs of this type of deception, would go a long way toward allaying ethical concerns.
21.1.2 Institutional review boards
Researchers based at universities in the United States must have their research approved by the university’s institutional review board (IRB) under the federal regulation known as the “Common Rule.” Similar research review bodies exist at universities worldwide and at many independent research organizations and think tanks. Though these boards are commonly thought of as judging research ethics, they have a second function: protecting their institution from liability for research gone awry (Schrag 2010). Insofar as institutional and ethical goals are aligned, IRBs help ensure responsible research practices, but as a general matter, institutional approval is not a substitute for ethical engagement by researchers.
21.2 Partners
Partnering with third-party organizations in research entails cooperating to intervene in the world or to measure outcomes. Researchers seek to produce (and publish) scientific knowledge; they work with political parties, government agencies, nonprofit organizations, and businesses to learn more than they could if they worked independently. These groups join the collaborations to learn about how to achieve their own organizational goals. Governments may want to expand access to healthcare, corporations to improve their ad targeting, and nonprofits to demonstrate program impact to funding organizations.
In the best-case scenario, the goals of the researchers and the partner organization are aligned. When the scientific question to be answered is the same as the practical question the organization cares about, the gains from cooperation are clear. The research team gains access to the organization’s financial and logistical capacity to act in the world, and the partner organization gains access to the researchers’ scientific expertise. Finding the right research partner almost always amounts to finding an organization with a common – or at least not conflicting – goal. Selecting a research design amenable to both parties requires understanding each partner’s private goals. Research design declaration and diagnosis can help with this problem by formalizing trade-offs between the two sets of goals.
One frequent divergence between partner and researcher goals is that partner organizations often want to learn, but they care most about their primary mission (Levine 2021). This dynamic is sometimes referred to as the “learning versus doing” trade-off. (In business settings, it goes by names like “learning versus earning” or “exploration versus exploitation.”) An aid organization cares about delivering its program to as many people as possible. Learning whether the program has the intended effects on the outcomes of interest is of course also important, but resources spent on evaluation are resources not spent on program delivery.
Diagnosis 21.1 Learning versus doing diagnosis
Research design diagnosis can help navigate the learning versus doing trade-off. One instance of the trade-off is that the proportion of units that receive a treatment represents the rate of “doing,” but this rate also affects the amount of learning. In the extreme, if all units are treated, we can’t measure the effect of the treatment. The trade-off here is represented in Figure 21.2, which shows the study’s power versus the proportion treated (top facet) and the partner’s utility (bottom facet). The researchers have a power cutoff at the standard 80% threshold. The partner also has a strict cutoff: they need to treat at least 2/3 of the sample to fulfill a donor requirement.
A researcher might use this graph together with the partner to jointly select the design with the highest power among those that treat a high enough proportion of units to meet the partner’s needs. This set is represented by the “zone of agreement” in gray: in this region, the design has at least 80% power and at least two-thirds of the sample are treated. Deciding within this region involves a trade-off between power (which is decreasing in the proportion treated here) and the partner’s utility (which is increasing in the proportion treated). The diagnosis surfaces the zone of agreement and clarifies the choice among designs in that region. Unfortunately, some partnerships simply will not work out: the zone of agreement may be empty.
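A minimal sketch of how such a zone of agreement could be computed, assuming a two-arm design analyzed by a difference in means; the total sample size, effect size, and outcome standard deviation below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

N = 1000                        # total sample size (assumed)
effect, sd = 0.2, 1.0           # assumed effect size and outcome standard deviation
props = np.arange(0.05, 1.0, 0.01)

# Normal-approximation power for a difference in means with unequal arm sizes.
se = sd * np.sqrt(1 / (N * props) + 1 / (N * (1 - props)))
power = norm.cdf(effect / se - norm.ppf(0.975))

# Zone of agreement: researcher needs power >= 0.8, partner needs >= 2/3 treated.
zone = props[(power >= 0.8) & (props >= 2 / 3)]
if zone.size:
    print(f"zone of agreement: treat between {zone.min():.2f} and {zone.max():.2f} of the sample")
else:
    print("zone of agreement is empty -- redesign (e.g., a larger N) or no partnership")
```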
Choosing the proportion treated is one example of integrating partner constraints into research designs. A second common problem is that there are a set of units that must be treated or that must not be treated for ethical or political reasons (e.g., the home district of a government partner must receive the treatment). If these constraints are discovered after treatment assignment, they lead to noncompliance, which may substantially complicate the analysis of the experiment and even prevent providing an answer to the original inquiry. Gerber and Green (2012) recommend, before randomizing treatment, exploring possible treatment assignments with the partner organization and using this exercise to elicit the set of units that must or cannot be treated. King et al. (2007) describe a “politically-robust” design, which uses pair-matched block randomization. In this design, when any unit is dropped due to political constraints, the whole pair is dropped from the study.1
A major benefit of working with partners is their deep knowledge of the substantive area. For this reason, we recommend involving them in the design declaration and diagnosis process. How can we develop intuitions about the means, variances, and covariances of the variables to be measured? Ask your partner for their best guesses, which may be far more educated than your own. For experimental studies, solicit your partner’s beliefs about the magnitude of the treatment effect on each outcome variable, subgroup by subgroup if possible. Engaging partners in the declaration process improves design – and it very quickly sharpens the discussion of key design details. Pro-tip: Share your design diagnoses and mock analyses before the study is launched to quickly build consensus around the study’s goals.
21.3 Funding
Higher quality designs usually come with higher costs. Collecting original data is more expensive than analyzing existing data, but collecting new data may be more or less costly depending on the ease of contacting subjects or conducting measurements. As a result, including cost diagnosands in research design diagnosis can directly aid data strategy decision-making. These diagnosands may usefully include both average cost and maximum cost. Researchers may make different decisions about cost: some will select the “best” design in terms of research design quality subject to a budget constraint; others will choose the cheapest among similar-quality designs to save money for future research. Diagnosis can help identify both sets of designs and decide among them.
To relax the budget constraint, researchers apply for funding. Funding applications have to communicate important features of the proposed research design. Funders want to know why the study would be useful, important, or interesting to scholars, the public, or policymakers. They also want to ensure that the research design provides credible answers to the question and that the research team is capable of executing the design. Since it’s their money on the line, funders also care that the design provides good value for money.
Researchers and funders face an information problem. Applicants wish to obtain as large a grant as possible for their design but have difficulty credibly communicating its quality, given the subjectivity of the exercise. On the flip side, funders wish to get the most value for money from the set of proposals they fund but have difficulty assessing the quality of proposed research. Design declaration and diagnosis provide a partial solution to this information problem: a common language for describing the proposed design and its properties can convey the value of the research under assumptions that funders can understand and interrogate.
Funding applications could usefully include a declaration and diagnosis of the proposed design. In addition to common diagnosands such as bias and efficiency, two special diagnosands may be valuable: cost and value for money. The cost can be included for each design variant as a function of design features such as sample size, the number of treated units, and the duration of survey interviews. Simulating the design across possible realizations of each variant explains how costs vary with choices the researcher makes. Value for money is a diagnosand that is a function of cost and the amount learned from the design.
In some cases, funders ask applicants to provide multiple options at multiple price points, or to make clear how a design could be altered so that it could be funded at a lower level. Redesigning over differing sample sizes communicates how the researcher conceptualizes these options and gives the funder an understanding of the trade-offs between the amount of learning and cost across design variants. Applicants can use the redesign process to justify the high cost of their request directly in terms of the amount learned.
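As an illustration, a short sketch of how cost and a value-for-money diagnosand might be tabulated across candidate sample sizes; the cost model (a fixed overhead plus a per-interview fee) and the assumed effect size are placeholders for illustration only.

```python
from scipy.stats import norm
import numpy as np

# Hypothetical cost model: fixed overhead plus a per-interview cost.
fixed_cost, cost_per_interview = 10_000, 25

def power(n, effect=0.2):
    """Proxy for the amount learned: power to detect an assumed 0.2 SD effect."""
    return norm.cdf(effect * np.sqrt(n) / 2 - norm.ppf(0.975))

print(f"{'N':>6} {'cost ($)':>10} {'power':>7} {'power per $1k':>14}")
for n in (500, 1000, 2000, 4000):
    cost = fixed_cost + cost_per_interview * n
    pw = power(n)
    print(f"{n:>6} {cost:>10,} {pw:>7.2f} {1000 * pw / cost:>14.3f}")
```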
Ex ante power analyses are required by an increasing number of funders. Current practice, however, illustrates the crux of the misaligned incentives between applicants and funders. Online power calculators have difficult-to-interrogate assumptions built in and cannot accommodate the specifics of many common designs (Blair et al. 2019). As a result, existing power analyses can demonstrate that almost any design is “sufficiently powered” simply by changing the expected effect sizes and variances. Design declaration is a partial solution to this problem: by making the assumptions encoded in the design explicit, applicants can link the power analysis clearly to the specifics of the design setting.
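By contrast, a simulation-based power analysis puts every assumption in plain view. The sketch below (a generic simulation, not any particular software’s implementation) makes the assumed effect size, outcome noise, assignment procedure, and test explicit, so a funder can interrogate or change any of them.

```python
import numpy as np
from scipy import stats

def simulate_pvalue(n, effect=0.2, sd=1.0, rng=None):
    """One simulated run of a two-arm experiment analyzed with a t-test.
    The effect size, noise, and test are explicit, criticizable assumptions."""
    rng = rng or np.random.default_rng()
    z = rng.integers(0, 2, n)                 # random assignment to treatment (1) or control (0)
    y = effect * z + rng.normal(0, sd, n)     # assumed outcome model
    return stats.ttest_ind(y[z == 1], y[z == 0]).pvalue

def power(n, sims=1000, seed=1):
    rng = np.random.default_rng(seed)
    return np.mean([simulate_pvalue(n, rng=rng) < 0.05 for _ in range(sims)])

for n in (200, 400, 800):
    print(n, power(n))
```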
Finally, design declarations can, in principle, help funders compare applications on standard scales: root-mean-squared error, bias, and power. Moving design considerations onto a common scale takes some of the guesswork out of the process and reduces reliance on researcher claims about properties of designs.
21.4 Piloting
Designing a research study always entails relying on a set of beliefs, what we’ve referred to as the set of possible models in M. Choices like how many subjects to sample, which covariates to measure, or which treatments to allocate all depend on beliefs about treatment effects, the correlations of the covariates with the outcome, and the variance of the outcome.
We may have reasonably educated guesses about these parameters from past studies or theory. Our beliefs about the nodes and edges in the causal graph of M, expected effect sizes, the distribution of outcomes, feasible randomization schemes, and many other features are often drawn directly from past research or from a review of the literature.
Even so, we remain uncertain about these values. One reason for the uncertainty is that our research context and inquiries often differ subtly from previous work. Even when replicating an existing study as closely as possible, difficult-to-intuit features of the research setting may have serious consequences for the design. Moreover, our uncertainty about a design parameter is often the very reason for conducting a study. We run experiments because we are uncertain about the average treatment effect. Frustratingly, we always have to design under model uncertainty.
The main goal of pilot studies is to reduce this uncertainty. We would like to learn which models in M are more likely, so that the main study can be designed under beliefs that are closer to the truth. Pilots take many forms: focus groups, small-scale tests of measurement tools, even miniature versions of the main study. We want to learn things like the distribution of outcomes, how covariates and outcomes might be correlated, or how feasible the assignment, sampling, and measurement strategies are.
Almost by definition, pilot studies are inferentially weaker than main studies. We turn to them in response to constraints on our time, money, and capacity. If we were not constrained, we would run a first full-size study, learn what is wrong with our design, then run a corrected full-size study. Since running multiple full studies is too expensive or otherwise infeasible, we instead run smaller mini-studies or test only a subset of the elements of our planned design. As a result, the diagnosands of a pilot design will not measure up to those of the main design: pilots have much lower statistical power and may suffer from higher measurement error and less generalizability. The goal of pilot studies should therefore not be to obtain a preliminary answer to the main inquiry, but to learn the information that will make the main study a success.
Like main studies, pilot studies can be declared and diagnosed – but importantly, the diagnosands for main and pilot studies need not be the same. Statistical power for an average treatment effect may be an essential diagnosand for the main study, but owing to their small size, power for pilot studies will typically be abysmal. Pilot studies should be diagnosed with respect to the decisions they imply for the main study.
Figure 21.3 shows the relationship between effect size and the sample size required to achieve 80% statistical power for a two-arm trial using simple random assignment. Uncertainty about the true effect size has enormous design consequences. If the effect size is 0.17, we need about 1,100 subjects to achieve 80% power. If it’s 0.1, we need 3,200.
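These figures follow, approximately, from the standard normal-approximation sample size formula for a two-arm trial with equal-sized arms and outcomes measured in standard-deviation units; a quick sketch under those simplifying assumptions:

```python
from scipy.stats import norm

def total_n_for_power(effect, power=0.80, alpha=0.05, sd=1.0):
    """Total sample size for a two-arm trial with equal arms, normal approximation."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    n_per_arm = 2 * (z * sd / effect) ** 2
    return 2 * n_per_arm

for effect in (0.30, 0.17, 0.10):
    print(f"effect size {effect:.2f}: about {total_n_for_power(effect):,.0f} subjects")
```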
Suppose we have prior beliefs about the effect size that can be summarized as a normal distribution centered at 0.3 with a standard deviation of 0.1, as in the bottom panel of Figure 21.3. We could choose a design that corresponds to this best guess, the average of our prior belief distribution. If the true effect size is 0.3, then a study with 350 subjects will have 80% power.
However, redesigning the study to optimize for the “best guess” is risky because the true effect could be much smaller than 0.3. Suppose we adopt the redesign heuristic of powering the study for an effect size at the 10th percentile of our prior belief distribution, which works out here to be an effect size of 0.17. Following this rule, we would select a design with 1,100 subjects.
Now suppose the true effect size is in fact only 0.1, so we would need to sample 3,200 subjects for 80% power. The power of our chosen 1,100-subject design is a mere 38%. Here we see the consequences of having incorrect prior beliefs: our ex ante guess of the effect size was too optimistic. Even with what we thought was a conservative choice – the 10th percentile redesign heuristic – we ended up with too small a study.
A pilot study can help researchers update their priors about important design parameters. If we run a small-scale pilot with 100 subjects, we will get a noisy but unbiased estimate of the true effect size. We can update our prior beliefs by taking a precision-weighted average of the prior and the pilot estimate, where the weights are the inverse of the variance of each guess. Our posterior beliefs will be closer to the truth, and our posterior uncertainty will be smaller. If we then follow the heuristic of powering the study for the 10th percentile of our (now posterior) beliefs about the effect size, we will come closer to correctly powering the study. Figure 21.4 shows how large the main study would be under the 10th percentile decision rule, depending on how the pilot study comes out. On average, the pilot leads us to design the main study with 1,800 subjects, sometimes more and sometimes less.
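A minimal sketch of this updating step, assuming outcomes with unit standard deviation (so a 100-subject, two-arm pilot yields a standard error of roughly 0.2); the pilot estimate plugged in below is one hypothetical draw.

```python
import numpy as np
from scipy.stats import norm

prior_mean, prior_sd = 0.30, 0.10            # prior beliefs about the effect size
pilot_n = 100
pilot_se = 2 / np.sqrt(pilot_n)              # about 0.2 with unit-SD outcomes, equal arms
pilot_estimate = 0.12                        # one hypothetical pilot result

# Precision-weighted average of the prior and the pilot estimate.
w_prior, w_pilot = 1 / prior_sd**2, 1 / pilot_se**2
post_mean = (w_prior * prior_mean + w_pilot * pilot_estimate) / (w_prior + w_pilot)
post_sd = np.sqrt(1 / (w_prior + w_pilot))

# Power the main study for the 10th percentile of the posterior beliefs.
effect_10th = post_mean + norm.ppf(0.10) * post_sd
z = norm.ppf(0.975) + norm.ppf(0.80)
main_n = 2 * 2 * (z / effect_10th) ** 2
print(f"posterior: N({post_mean:.2f}, {post_sd:.2f}); design for effect {effect_10th:.2f} "
      f"-> about {main_n:,.0f} subjects")
```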
This exercise reveals that a pilot study can be quite valuable. Without a pilot study, we would choose to sample 1,100 subjects, but since the true effect size is only 0.1 (not our best guess of 0.3), the experiment would be underpowered. The pilot study helps us correct our diffuse and incorrect prior beliefs. However, since the pilot is small, we don’t update our priors all the way to the truth. We still end up with a main study that is on average too small (1,800), with a corresponding power of 56%. That said, a 56% chance of finding a statistically significant result is better than a 38% chance.
In summary, pilots are most useful when we are uncertain – or outright wrong – about important design parameters. Much of this uncertainty can often be reduced without running a pilot by meta-analyzing past empirical studies. But some things are hard to learn by reading others’ work; pilot studies are especially useful tools for learning about those things.
21.5 Criticism
A vital part of the research design process is gathering criticism and feedback from others. Timing is delicate here. Asking for comments on an underdeveloped project can lead to brainstorming sessions about what research questions one might look into; such unstructured sessions can be quite useful, but they essentially restart the research design lifecycle from the beginning. Sharing work only after a full draft has been produced is worse, since the data strategy will have already yielded the realized data and the investigators may have become attached to favored answer strategies and interpretations. While critics can always suggest changes to I and A after data collection, an almost-finished project is fundamentally constrained by the data strategy as it was implemented.
The best moments to seek advice come before registering preanalysis plans or, if not writing a PAP, before implementing major data strategy elements. The point is not to seek advice exclusively on sampling, assignment, or measurement procedures; the important thing is that there’s still time to modify those design elements (Principle 3.4: Design early). Feedback about the design as a whole can inform changes to the data strategy before it is set in stone.
Feedback will come in many forms. Sometimes the comments are directly about diagnosands. The critic may think the design has too many arms and won’t be well powered for many inquiries. Or they may be concerned about bias due to excludability violations or selection issues. These comments are especially useful because they can easily be incorporated in design diagnosis and redesign exercises.
Other comments are harder to pin down. A fruitful exercise in such cases is to understand how the criticism fits into M, I, D, and A. Comments like “I’m concerned about external validity here” might seem to be about the data strategy. If the units were not randomly sampled from some well-specified population, we can’t generalize from the sample to the population. But if the inquiry is not actually a population quantity, then this inability to use sample data to estimate a population quantity is irrelevant. The question then becomes whether knowing the answer to your sample inquiry helps make theoretical progress or whether we need to generalize – to switch the inquiry to the population quantity to make headway. Critics will not usually be specific about how their criticism relates to each element of design, so it is up to the criticism-seeker to understand the implications for design.
Sometimes we seek feedback from smart people, but they do not immediately understand the design setting. If the critic hasn’t absorbed or taken into account important features of the design, their recommendations and amendments may be off-base. For this reason, it’s important to communicate the design features – the model, inquiry, data strategy, and answer strategy – at a high enough level of detail that the critic is up to speed before passing judgment.
21.6 Preanalysis Plan
In many research communities, it is becoming standard practice to publicly register a preanalysis plan (PAP) before implementing some or all of the data strategy (Miguel et al. 2014). PAPs serve many functions, but most importantly, they clarify which design choices were made before data collection and which were made after. Sometimes – perhaps every time! – we conduct a research study and aspects of M, I, D, and A shift along the way. The concern is that they shift in ways that invalidate the apparent conclusions of the study. For example, “p-hacking” is the shady practice of trying out many regression specifications until the p-value associated with an important test attains statistical significance. PAPs protect researchers by communicating to skeptics when design decisions were made: if the regression specification was detailed in a PAP posted before any data were collected, the test could not have been the result of a p-hack.
PAPs are sometimes misinterpreted as a binding commitment to report all pre-registered analyses and nothing but. This view is unrealistic and unnecessarily rigid. While we think that researchers should report all preregistered analyses somewhere (see Section 22.2 on “populated PAPs”), study write-ups inevitably deviate in some way from the PAP – and that’s a good thing. Researchers learn more by conducting research. This learning can and should be reflected in the finalized answer strategy. One guardrail against extensive post-PAP design changes can be a set of standard operating procedures that lays out what to do when circumstances change (Green and Lin 2016).
Our hunch is that the main consequence of actually writing PAPs is that research designs improve. Just like design declaration forces us to think through the details of our model, inquiry, data strategy, and answer strategy, describing those choices in a publicly posted document surely causes deeper reflection about the design. In this way, the main audience for a PAP is the study authors themselves.
What belongs in a PAP? Recommendations for the set of decisions that should be specified in a PAP remain remarkably unclear and inconsistent across research communities. PAP templates and checklists are proliferating, and the number of items they suggest ranges from 9 to 60. PAPs themselves are becoming longer and more detailed; some in the American Economic Association and Evidence in Governance and Politics (EGAP) study registries reach hundreds of pages as researchers seek to be ever more comprehensive. Some registries emphasize the registration of the hypotheses to be tested, while others emphasize the registration of the tests that will be used. In a review of many PAPs, Ofosu and Posner (2022) find considerable variation in how often analytically relevant pieces of information appear in posted plans.
In our view, a PAP should center on a design declaration. Currently, most PAPs focus on the answer strategy A: what estimator to use, what covariates to condition on, and what subsets of the data to include. But of course, we also need to know the details of the data strategy D: how units will be sampled, how treatments will be assigned, and how outcomes will be measured. We need these details to assess the properties of the design and to gauge whether the analysis respects the sampling, treatment assignment, and measurement procedures. We need to know about the inquiry I because we need to know the target of inference. A significant concern is “outcome switching,” wherein the eventual report focuses on different outcomes than initially intended. When we switch outcomes, we switch inquiries! And we need enough of the model M in the plan to describe I in sufficient detail. In short, a design declaration is what belongs in a PAP, because a design declaration specifies all of the analytically relevant design decisions.
In addition to a design declaration, a PAP can usefully include mock analyses conducted on simulated data. If the design declaration is made formally in code, creating simulated data that resemble the eventually realized data is straightforward. We think researchers should run their answer strategy on the mock data, creating mock figures and tables that will ultimately be made with real data. In our experience, this is the step that really causes researchers to think hard about all aspects of their design.
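As an illustration, here is a minimal sketch of running a pre-specified answer strategy on simulated mock data; the variable names, effect sizes, and regression specification are placeholders for whatever the declared design actually specifies.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate mock data that resemble the data the design will eventually produce.
rng = np.random.default_rng(42)
n = 500
mock = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "Z": rng.integers(0, 2, n),              # treatment assignment
})
mock["Y"] = 0.2 * mock["Z"] + 0.01 * mock["age"] + rng.normal(0, 1, n)  # assumed outcome model

# Run the pre-specified answer strategy on the mock data to produce a mock table.
mock_fit = smf.ols("Y ~ Z + age", data=mock).fit(cov_type="HC2")
print(mock_fit.summary().tables[1])          # this table's layout previews the real one
```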
PAPs can, optionally, include design diagnoses in addition to declarations, since it can be informative to describe why a particular design was chosen. For this reason, a PAP might include estimates of diagnosands like power, root-mean-squared error, or bias. If a researcher writes in a PAP that the power to detect a very small effect is large, then if the study comes back null, the eventual write-up can much more credibly rule out “low precision” as an explanation for the null.
21.6.1 Example Preanalysis Plan
In Figure 21.5, we provide an example preanalysis plan (see the appendix for the document itself) for Bonilla and Tillery (2020), a study of the effects of alternative framings of Black Lives Matter on support for the movement. The authors of that study posted a preanalysis plan to the AsPredicted registry. The study authors are models of research transparency: they prominently link to the PAP in the published article, they conduct no non-preregistered analyses except those requested during the review process, and their replication archive includes all materials required to confirm their analyses, all of which we were able to reproduce exactly with minimal effort. Our goal with this alternative PAP is to show how design declaration can supplement and complement existing planning practices.
We show in Section 22.2 how to “populate” this PAP once the data have been realized and collected.
This procedure is at risk of bias for the average treatment effect among the “politically feasible” units if, within some pairs, one unit is treatable but the other is not.