22 Realization

Realization, the implementation of a study, starts from the design declaration. Implementing the data strategy means sampling the units as planned, allocating treatments according to the randomization procedure, and executing the measurement protocol. Implementing the answer strategy means applying the planned summary functions to the realized data. Of course, implementation is much easier said than done. Inevitably, some portion of the design fails to go according to plan: subjects do not comply with treatments, others cannot be located to answer survey questions, or governments interfere with the study as a whole. Sometimes, the answer strategies are discovered to be biased or imprecise or otherwise wanting. Declared designs can be adapted as the study changes, both to help make choices when you need to pivot and so that at the end there is a “realized design” to compare to the “planned design.”

When implementation is complete, the design preregistered in an analysis plan can be “populated” to report on analyses as planned, and the realized design can be reconciled with the planned design. In writing up the study, the design forms the center: it is why we should believe the answers that we report. The declared design can be used in the writeup to convince reviewers of the study’s quality, and also as a tool to assess the impact of reviewer suggestions on the design.

22.1 Pivoting

When something goes wrong or you learn that things work differently from how you expected, you need to pivot. You face two decisions: go or no-go, and if go, whether to alter your research design to account for the new reality. Redesigning the study and diagnosing the possible new designs can help you make these decisions. Your design declaration is a living document that you can keep updated and use as a tool to guide you along the research path, not just a document to write at the beginning of the study and revisit when you are writing up. Keeping the declaration updated as you make changes along the way will make it easier to reconcile the planned design with the implemented design.

We illustrate two real choices we made, one in which we abandoned the study and one in which we changed the design radically. We link to design declarations and diagnoses of the studies pre- and post-pivot.

One of us was involved with a get-out-the-vote canvassing-experiment-gone-wrong during the 2014 Senate race in New Hampshire. We randomly assigned 4,230 of 8,530 subjects to treatment. However, approximately two weeks before the election, canvassers had only attempted 746 subjects (17.6% of the treatment group) and delivered treatment to just 152 subjects (3.6%). In essence, the implementer was overly optimistic about the number of subjects they would be able to contact in time for the election. Upon reflection, the organization estimated that they would only be able to attempt to contact 900 more voters and believed that their efforts would be best spent on voters with above-median vote propensities.

We faced a choice: should we (1) spend the remaining organizational capacity on treating 900 of the 3,484 remaining unattempted treatment group subjects, or should we (2) conduct a new random assignment among above-median propensity voters only? The inquiry for both designs is a complier average causal effect (CACE), but who is classified as a never-taker or a complier differs across the two designs. The organization successfully contacts approximately 20% of those it attempts to contact. In the first design, those who are never even attempted are never-takers (through no fault of their own!), and further deflate the intention-to-treat effect. We can’t just drop them from design 1, because we don’t know which units in the control group wouldn’t have been attempted, had they been in the treatment group. In design 2, we conduct a brand-new assignment and the treatment group is only as large as the organization thinks it can handle. A design diagnosis reveals a clear course of action. Even though it decreases the overall sample size, restricting the study to the above-median propensity voters substantially increases the precision of the design. This conclusion follows the logic of the placebo-controlled design described in Section 17.7. Our goal is to restrict the experimental sample to compliers only.
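The logic of the precision comparison can be roughed out with a back-of-the-envelope calculation. The sketch below is a simplification under stated assumptions, not the full design diagnosis: it uses the first-order approximation that the standard error of the CACE is the standard error of the intention-to-treat estimate divided by the compliance rate, it assumes the same outcome standard deviation in both designs, and it assumes a hypothetical design-2 pool equal to half of the original sample. The subject counts are those reported above.

```python
from math import sqrt

# Counts reported in the text.
N_TOTAL = 8530          # subjects in the original experiment
N_TREAT = 4230          # originally assigned to treatment
N_ATTEMPTED = 746       # attempted before the pivot
N_CONTACTED = 152       # actually treated before the pivot
N_EXTRA_ATTEMPTS = 900  # remaining organizational capacity

# Assumptions (not from the text): outcome SD of 1 in both designs, and a
# design-2 pool of above-median voters equal to half of the original sample.
OUTCOME_SD = 1.0
N_POOL_DESIGN_2 = N_TOTAL // 2

contact_rate = N_CONTACTED / N_ATTEMPTED  # share of attempts that succeed


def cace_se(n_treat, n_control, compliance_rate, sd=OUTCOME_SD):
    """Approximate SE of the CACE: SE of the ITT divided by the compliance rate."""
    itt_se = sd * sqrt(1 / n_treat + 1 / n_control)
    return itt_se / compliance_rate


# Design 1: keep the original assignment. Treatment-group subjects who are
# never attempted count as never-takers, which lowers the compliance rate.
compliance_1 = contact_rate * (N_ATTEMPTED + N_EXTRA_ATTEMPTS) / N_TREAT
se_design_1 = cace_se(N_TREAT, N_TOTAL - N_TREAT, compliance_1)

# Design 2: re-randomize, assigning only as many subjects as can be attempted.
compliance_2 = contact_rate
se_design_2 = cace_se(
    N_EXTRA_ATTEMPTS, N_POOL_DESIGN_2 - N_EXTRA_ATTEMPTS, compliance_2
)

print(f"Design 1: compliance {compliance_1:.3f}, approx. SE {se_design_1:.3f}")
print(f"Design 2: compliance {compliance_2:.3f}, approx. SE {se_design_2:.3f}")
```

Under these assumptions, design 2 has the smaller approximate standard error despite its smaller sample, because the compliance rate in the denominator is much higher. A simulation-based diagnosis would also account for differences in outcome variance and contact rates between the two populations.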

Another of us faced another kind of noncompliance problem in a study in Nigeria: failure to deliver the correct treatment. We launched a cluster-randomized placebo-controlled 2x2 factorial trial of a film treatment and a text message blast treatment. A few days after treatment delivery began, we noticed that the number of replies was extremely similar in treatment and placebo communities, counter to our expectation. We discovered that our research partner, the cell phone company, delivered the treatment message to all communities, so placebo communities received the wrong treatment. But by that time, treatments had been delivered to 106 communities (about half the sample).

We faced the choice to abandon the study or pivot and adapt the study. We quickly agreed that we could not continue research in the 106 communities, because they had received at least partial treatment. We were left with 109 from our original sample of 200 plus 15 alternates that were selected in the same random sampling process. We determined we could not retain all four treatment conditions and the pure control. We decided that at most we could have two conditions, with about 50 units in each. But which ones? We were reluctant to lose the text message or the film treatments, as they tested two distinct theoretical mechanisms for how to encourage prosocial behaviors. We decided to drop the pure control group, the fifth condition, as well as the placebo text message condition. In this way, we could learn about the effect of the film (compared to placebo) and about the effect of the text messages (compared to none).

22.2 Populated preanalysis plan

A preanalysis plan describes how study data will eventually be analyzed, but those plans may change during the process of producing a finished report, article, or book. Inevitably, authors of preanalysis plans fail to anticipate how the data generated by the study will eventually be analyzed. Some reasons for discrepancies were discussed in the previous section on pivoting, but others intervene as well. A common reason is that PAPs promise too many analyses. In writing a concise paper, some analyses are dropped, others are combined, and still others are added during the writing and revision process. In the next section, we’ll describe how to reconcile analyses-as-planned with analyses-as-implemented, but this present section is about what to do with your analysis plan immediately after getting the data back.

We echo proposals made in Banerjee et al. (2020) and Alrababa’h et al. (2020) that researchers should produce short reports that fulfill the promises made in their PAPs. Banerjee et al. (2020) emphasize that writing PAPs is difficult and usually time-constrained, so it is natural that the final paper will reflect further thinking about the full set of empirical approaches. A “populated PAP” serves to communicate the results of the promised analyses. Alrababa’h et al. (2020) cite the tendency of researchers to abandon the publication of studies that return null results. To address the resulting publication bias, they recommend “null results reports” that share the results of the pre-registered analyses.

We recommended in Section 21.7 that authors include mock analyses in their PAPs using simulated data. Doing so has the significant benefit of being specific about the details of the answer strategy. A further benefit comes when it is time to produce a populated PAP, since the realized data can quite straightforwardly be swapped in for the mock data. Given the time invested in building simulated analyses for the PAP, writing up a populated PAP takes only as much effort as is needed to clean the data (which will need to be done in any case).
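The swap from mock to realized data can be illustrated with a minimal sketch. Everything in it is a placeholder: the difference-in-means answer strategy, the simulated effect size, and the four "realized" rows are invented for illustration and are not taken from any study. The point is only that the analysis function is written once and called twice.

```python
import random
from statistics import mean


def answer_strategy(rows):
    """Difference in mean outcomes between treated (Z == 1) and control (Z == 0) rows."""
    treated = [r["Y"] for r in rows if r["Z"] == 1]
    control = [r["Y"] for r in rows if r["Z"] == 0]
    return mean(treated) - mean(control)


def simulate_mock_data(n, effect, seed=1):
    """Mock data for the PAP: random assignment plus a made-up treatment effect."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.randint(0, 1)
        rows.append({"Z": z, "Y": effect * z + rng.gauss(0, 1)})
    return rows


# At PAP time: run the planned analysis on mock data.
mock_estimate = answer_strategy(simulate_mock_data(n=500, effect=0.2))

# After data collection: the same function, with realized data swapped in.
# These four rows are placeholders standing in for the cleaned study data.
realized_data = [
    {"Z": 1, "Y": 0.9},
    {"Z": 1, "Y": 0.7},
    {"Z": 0, "Y": 0.6},
    {"Z": 0, "Y": 0.4},
]
populated_estimate = answer_strategy(realized_data)

print(f"Mock estimate: {mock_estimate:.3f}")
print(f"Populated estimate: {populated_estimate:.3f}")
```

Because the answer strategy is a function of the data alone, the only new work at the populated-PAP stage is getting the realized data into the shape the function expects.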

22.2.1 Example

In Section 21, we declared the design for Bonilla and Tillery (2020) following their preanalysis plan. In doing so, we declared an answer strategy in code. In our populated PAP, we can run that same answer strategy code, but swap out the simulated data for the real data collected during the study. We present first the regression table and then the coefficient plot in Figure 22.1.

Statistical models

| | Model 1 | Model 2 | Model 3 | Model 4 |
|---|---|---|---|---|
| (Intercept) | 0.84*** | 0.41*** | 0.61*** | 0.54*** |
| | (0.02) | (0.04) | (0.06) | (0.07) |
| Znationalism | -0.01 | -0.00 | 0.02 | 0.09 |
| | (0.02) | (0.02) | (0.08) | (0.09) |
| Zfeminism | -0.04 | -0.01 | 0.01 | 0.02 |
| | (0.02) | (0.02) | (0.08) | (0.09) |
| Zintersectional | -0.04 | -0.03 | -0.08 | -0.01 |
| | (0.02) | (0.02) | (0.08) | (0.10) |
| female | | 0.03* | | |
| lgbtq | | 0.02 | | |
| age | | -0.00 | | |
| religiosity | | -0.01 | | |
| income | | -0.00 | | |
| college | | -0.02 | | |
| linked_fate | | 0.27*** | 0.30*** | |
| | | (0.03) | (0.07) | |
| blm_familiarity | | 0.07*** | | 0.10*** |
| | | (0.01) | | (0.02) |
| Znationalism:linked_fate | | | -0.04 | |
| Zfeminism:linked_fate | | | -0.05 | |
| Zintersectional:linked_fate | | | 0.05 | |
| Znationalism:blm_familiarity | | | | -0.03 |
| Zfeminism:blm_familiarity | | | | -0.01 |
| Zintersectional:blm_familiarity | | | | -0.01 |
| R² | 0.00 | 0.20 | 0.14 | 0.09 |
| Adj. R² | 0.00 | 0.19 | 0.13 | 0.09 |
| Num. obs. | 849 | 849 | 849 | 849 |
| RMSE | 0.23 | 0.20 | 0.21 | 0.22 |

*** p < 0.001; ** p < 0.01; * p < 0.05

Figure 22.1: Coefficient plot from Bonilla and Tillery design based on the study’s realized data.

22.3 Reconciliation

Research designs as implemented will differ in some way from research designs as planned. Treatments cannot be executed as conceived, some people cannot be found to interview, and sometimes what we learn from baseline measures informs how we measure later. Understanding how your research design changed from conception to implementation is crucial to understanding what was learned from the study.

Suppose the original design described a three-arm trial: one control and two treatments, but the design as implemented drops all subjects assigned to the second treatment. Sometimes, this is an entirely appropriate and reasonable design modification. Perhaps the second treatment was simply not delivered due to an implementation failure. Other times, these modifications are less benign. Perhaps the second treatment effect estimate did not achieve statistical significance, so the author omitted it from the analysis.

For this reason, we recommend that authors reconcile the design as planned with the design as implemented. A reconciliation can be a plain description of the deviations from the PAP, with justifications where appropriate. A more involved reconciliation would include a declaration of the planned design, a declaration of the implemented design, and a list of the differences. This “diff” of the designs can be automated through the declaration of both designs in computer code, then comparing the two design objects line-by-line (see the function compare_designs() in DeclareDesign).
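The idea of a line-by-line "diff" can be sketched outside of DeclareDesign with a generic text comparison. The example below uses Python's standard `difflib` module and is not the `compare_designs()` function. The two plain-text design summaries are hypothetical, loosely modeled on the three-arm example above, and the sample sizes are invented.

```python
import difflib

# Hypothetical plain-text summaries of the two designs (sample sizes invented).
planned_design = """\
model: 3 arms (control, treatment 1, treatment 2)
inquiry: ATE of each treatment versus control
data strategy: complete random assignment, N = 900
answer strategy: difference-in-means
""".splitlines()

implemented_design = """\
model: 3 arms (control, treatment 1, treatment 2)
inquiry: ATE of treatment 1 versus control
data strategy: complete random assignment, N = 600
answer strategy: difference-in-means
""".splitlines()

# Unified diff: lines starting with "-" are planned only, "+" implemented only.
diff = list(
    difflib.unified_diff(
        planned_design,
        implemented_design,
        fromfile="planned",
        tofile="implemented",
        lineterm="",
    )
)
print("\n".join(diff))

# Keep only the changed lines, skipping the "---" and "+++" file headers.
changed = [
    line
    for line in diff
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
```

Each changed line is a deviation that the reconciliation should describe and, where appropriate, justify.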

In some cases, reconciliation will lead to additional learning beyond what can be inferred from the final design itself. When some units refuse to be included in the study sample or some units refuse measurement, we learn important information about those units. Understanding sample exclusions, noncompliance, and attrition may not only inform future research design planning choices but also contribute substantively to our understanding of the social setting.

22.3.1 Example

In Section 21, we described the preanalysis plan registered by Bonilla and Tillery (2020). In Table 22.1, we reconcile the set of conditional average treatment effect (CATE) analyses planned in that PAP, the analyses reported in the paper, and those reported in the appendix at the request of reviewers. In column two, we see that the authors planned four CATE estimations: effects by familiarity with Black Lives Matter, by gender, by LGBTQ status, and by linked fate. Only two of those are reported in the paper; the others may have been excluded for space reasons. Another way to handle these uninteresting results would be to present them in a populated PAP posted on their Web site or in the paper’s appendix.

In their appendix, the authors report on a set of analyses requested by reviewers. We see this as a perfect example of transparently presenting the set of planned analyses and highlighting the analyses that were added afterward and why they were added. They write:

We have been asked to consider other pertinent moderations beyond gender and LGBTQ+ status. They are contained in the four following sections.

This small table describes the heterogeneous effects analyses the researchers planned, those reported in the paper, and those reported in the appendix at the request of reviewers.

Table 22.1: Reconciliation of Bonilla and Tillery preanalysis plan.
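| Covariate | In the preanalysis plan | In the paper | In the appendix (at the request of reviewers) |
|---|---|---|---|
| Familiarity with BLM | X | | |
| Gender | X | X | |
| LGBTQ status | X | X | |
| Linked fate | X | | |
| Religiosity | | | X |
| Region | | | X |
| Age | | | X |
| Education | | | X |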
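The reconciliation in Table 22.1 amounts to a few set comparisons. The sketch below encodes the three columns as Python sets, following the description in the text (four planned analyses, two reported in the paper, four added in the appendix); the variable names are ours and purely illustrative.

```python
# Moderators by where they appear, following Table 22.1.
planned = {"Familiarity with BLM", "Gender", "LGBTQ status", "Linked fate"}
in_paper = {"Gender", "LGBTQ status"}
in_appendix = {"Religiosity", "Region", "Age", "Education"}

reported = in_paper | in_appendix

# Planned in the PAP but reported nowhere: candidates for a populated PAP.
planned_not_reported = sorted(planned - reported)

# Reported but not planned: analyses whose origin should be flagged.
added_after_plan = sorted(reported - planned)

print("Planned but not reported:", planned_not_reported)
print("Added after the plan:", added_after_plan)
```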

22.4 Writing

When writing up an empirical paper, authors have two sets of goals. First, they want to convince reviewers and readers that the research question they are tackling is important and that their research design provides useful answers to that question. Second, they want to influence not only scholars but also decisionmakers, including policymakers, businesses, and the public, who may make choices about what to believe and what to do on the basis of the study.

A common model for social science empirical papers has five sections: introduction, theory, design, results, and discussion. We discuss each in turn.

The introduction section should highlight each aspect of MIDA in brief. The reader is brought quickly up to speed on the whole research design, as well as expectations and actual findings.

The theory, evidence review, and hypotheses section is particularly important for the second goal of empirical papers, integration into a research literature and decisionmaking. The theory and hypotheses clarify many elements of the study’s model M and also its inquiry I. The theory and review of past evidence on the same and related inquiries will be used to structure prior beliefs about the question and related questions, and also to identify which part of past scholarship the study’s inquiry speaks to. Without explicitly linking the present inquiry to those of past studies, we can’t explain to readers how the study updates our understanding over previous work.

With the theoretical relationship of the present inquiry I to the inquiries in past work clarified, reviews of the findings of those past studies represent research designs unto themselves. We should try to prevent general research design issues such as selection on the dependent variable by ensuring we discuss all literature and not only present views consonant with our hypotheses. A meta-analysis or systematic review of past evidence could provide a systematic summary of past answers, or an informal literature review could be offered. In summarizing the literature, the research designs of past studies should be accounted for. Meta-analyses often formally account for the quality of the research designs of past studies by weighting by their precision, that is, the inverse of their variance (upweighting more informative studies and down-weighting less informative ones). Literature reviews may do so informally. However, this accounts only for the variance of past studies, not potential biases, which should be accounted for in how studies are selected (filtering out biased research designs).
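As a concrete illustration of precision weighting, the sketch below computes a fixed-effect (inverse-variance weighted) pooled estimate. The three estimate and standard error pairs are hypothetical, and a fixed-effect model is only one of several ways a meta-analysis might combine studies.

```python
from math import sqrt

# Hypothetical (estimate, standard error) pairs for three past studies.
past_studies = [(0.30, 0.20), (0.10, 0.05), (0.25, 0.10)]

# Each study's weight is its precision, the inverse of its sampling variance.
weights = [1 / se**2 for _, se in past_studies]

# Precision-weighted average of the estimates.
pooled_estimate = sum(
    w * est for w, (est, _) in zip(weights, past_studies)
) / sum(weights)

# The pooled precision is the sum of the study precisions.
pooled_se = sqrt(1 / sum(weights))

print(f"Pooled estimate: {pooled_estimate:.3f} (SE {pooled_se:.3f})")
```

The most precise study (standard error 0.05) dominates the pooled estimate, which is why weighting addresses the variance of past studies but does nothing about their biases.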

The research design section should lay out the details of D and A, building on the description of M and I in the preceding section. The section could refer to a design declared in code found in an appendix.

The results and discussion sections report on the realized answer and clarify what inferences from this result can be drawn. These sections implement the answer strategy of the study, but not only in obvious ways. Of course, regression tables and visualizations of the data report on the application of estimation procedures to the realized data. But the text in a discussion section is also part of the answer strategy: it is the application of a strategy for translating numerical and visual results into a qualitative description of the findings. What this translation function is may depend on how the data turn out, which is good and bad. Good in that we should learn as much as we can from our data, and some tests may not be obvious to us before. Bad in that it is hard to imagine what discursive procedure we would use under alternative realizations of the data. It may be helpful to write out the interpretations you would give to plausible ways the study could come out.

In the conclusion section, we turn back to the second goal of writing up the paper: influencing future scholarship and decisionmaking. The conclusion section should, formally where possible, integrate the new findings into past findings and leave readers with the authors’ view of what is now known on the inquiry. Bayesian integration could take the form of updating priors, formed on the basis of a meta-analysis of past studies, using a likelihood function and the results from the present study as the new data. Informal integration could follow this strategy qualitatively, assessing what was known and how confident we were, what we learned in this study, and how confident we are in the findings. An additional way a conclusion section can be written to influence future scholarship is to provide new research designs (new MIDAs) that future scholars can implement. By providing a statement of the posterior beliefs of the study and new research designs that address specific empirical threats to the results, later scholars can move forward in an informed way.
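One simple version of Bayesian integration is a normal-normal update, in which a normal prior is combined with a normally distributed estimate. The sketch below assumes this conjugate model and uses hypothetical numbers for the prior (standing in for a meta-analytic summary of past studies) and for the present study's estimate. It is one possible formalization, not a general recipe.

```python
def normal_update(prior_mean, prior_sd, estimate, se):
    """Posterior mean and SD for a normal prior and a normal likelihood."""
    prior_precision = 1 / prior_sd**2
    data_precision = 1 / se**2
    posterior_precision = prior_precision + data_precision
    # The posterior mean is a precision-weighted average of prior and data.
    posterior_mean = (
        prior_precision * prior_mean + data_precision * estimate
    ) / posterior_precision
    posterior_sd = (1 / posterior_precision) ** 0.5
    return posterior_mean, posterior_sd


# Hypothetical inputs: the prior summarizes past studies; the estimate and
# standard error come from the present study.
posterior_mean, posterior_sd = normal_update(
    prior_mean=0.10, prior_sd=0.05, estimate=0.20, se=0.05
)
print(f"Posterior: mean {posterior_mean:.3f}, SD {posterior_sd:.3f}")
```

When the prior and the new study are equally precise, as here, the posterior mean sits halfway between them and the posterior is tighter than either source alone.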

22.4.1 Example

In Figure 22.2, we annotate Mousa (2020) by highlighting where in the article each design component is discussed. The study reports on the results of a randomized experiment in which Iraqi Christians were assigned either to an all-Christian soccer team or a team in which they would play alongside Muslims. The experiment tested whether being on a mixed team affected intergroup attitudes and behaviors, both among teammates and back at home after the games were over. We highlight areas discussing the model M in yellow, the inquiry I in green, the data strategy D in blue, and the answer strategy A in pink.

Figure 22.2: Paper with MIDA elements highlighted (Mousa 2020).

The model and the inquiry largely appear in the abstract and introductory portion of the paper, though aspects of the model are discussed later on. Much of the first three pages is devoted to the data strategy, while the answer strategy appears only briefly. This division makes sense: in this paper, the action is all in the experimental design, whereas the answer strategy follows straightforwardly from it. The paper mostly describes M and D, with only a small amount of text devoted to I and A. Finally, it is notable that the data strategy is interspersed with aspects of the model. The reason is that the author is justifying choices about randomization and measurement using features of the model.

22.5 Publication

Publication in a peer-reviewed journal is a major goal of many (but not all) research projects. The advice we gave in the previous section on writing papers was to build the case for your findings by grounding your conclusions in the specifics of the research design. By detailing M, I, D, and A in ways that leave little room for confusion or ambiguity, you greatly improve the chances that reviewers and, later, readers will understand your paper.

Ideally, studies would be selected for publication on the basis of design rather than on the basis of results. This ideal can be hard to achieve. Reviewers and editors must decide whether to devote scarce journal space and editing bandwidth to publishing a paper. Criteria may include the topic fit with the journal, the importance of the question, and how much was learned from the research. The publication filter problem – publishing only studies with statistically-significant or splashy results – has long been recognized as a cause both of “false” findings making their way into the literature and publication bias due to missing null findings.

One major barrier to fixing this problem is that design quality is hard to convey to reviewers, so they substitute their own judgments of design based on the results. When the estimate turns out to be statistically significant, reviewers infer that the design must have been well-enough-powered to discover a statistically-significant result. The trouble with this approach is that a significant result might come from a study with 80% power or it might be a lucky draw from a study with just 10% power. Whether an estimate is statistically significant is a poor measure of design quality.
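To see why significance says little about design quality, it helps to compute power directly. The sketch below assumes a normally distributed estimator and a two-sided test at the 0.05 level; the effect sizes and standard errors are hypothetical, chosen to give roughly 80% and 10% power.

```python
from statistics import NormalDist

Z = NormalDist()               # standard normal distribution
CRITICAL = Z.inv_cdf(0.975)    # critical value for a two-sided 0.05 test


def power(effect, se):
    """Probability of a significant estimate when the estimator is Normal(effect, se)."""
    ratio = effect / se
    # Probability of exceeding the critical value in either tail.
    return (1 - Z.cdf(CRITICAL - ratio)) + Z.cdf(-CRITICAL - ratio)


# Hypothetical designs with the same standard error but different true effects.
well_powered = power(effect=0.28, se=0.10)
under_powered = power(effect=0.066, se=0.10)

print(f"Well-powered design: {well_powered:.2f}")
print(f"Under-powered design: {under_powered:.2f}")
```

Both designs can return a significant estimate; the second simply does so far less often. A reviewer who sees only the significant result cannot tell which design produced it, which is the case for reporting diagnosands alongside results.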

Formal design declaration and diagnosis is one way to communicate study quality separately from results. The theory and design sections of a paper should describe M, I, D, and A in enough detail that reviewers can understand the empirical thrust of the study. If, in addition to this information, authors provide diagnostic information about the ability of the study to generate credible inferences, we may be able to induce reviewers to evaluate studies on the basis of design, not results.

A study’s research design is not set in stone until the final version is posted on a journal Web site or published in print. Until then, journal editors and reviewers may ask for design changes that would, in their view, improve the paper. If the changes improve the design, then adopting their suggestions is easy. Some changes are irrelevant to the design (like reporting the reviewer’s preferred descriptive statistics), so why not comply? The trouble comes when reviewers propose changes that actively undermine the design. When this happens, diagnosing the reviewer’s alternative design can effectively demonstrate that the proposed changes would harm the research design. The design context also helps editors, who have to resolve the dispute one way or the other.