23 Integration

After publication, research studies leave the hands of their authors and enter the public domain.

Most immediately, authors share their findings with the public through the media and with decisionmakers. Design information is useful for helping journalists to emphasize design quality rather than splashy findings. Decisionmakers may act on evidence from studies, and researchers who want to influence policymaking and business decisions may wish to consider diagnosands about the decisions these actors make.

Researchers can prepare for the integration of their studies into scholarly debates through better archiving practices and better reporting of research designs in the published article. Future researchers may build on the results of a past study in three ways. First, they may reanalyze the original data. Reanalysts must be cognizant of the original data strategy D when working with the realized data \(d\). Changes to the answer strategy A must respect D, regardless of whether the purpose of the reanalysis is to answer the original inquiry I or to answer a different inquiry \(I'\). Second, future researchers may replicate the design. Typically, replicators provide a new answer to the same I with new data, possibly improving elements of D and A along the way. If the inquiry of the replication is too different from the inquiry of the original study, the fidelity of the replication study may be compromised. Lastly, future researchers may meta-analyze the study's answer along with those of other past studies. Meta-analysis is most meaningful when all of the included studies target a similar enough inquiry and when all studies rely on credible designs. Otherwise, the procedure produces a meta-analytic average that is difficult to interpret.

All three of these activities depend on an accurate understanding of the study design. Reanalysts, replicators, and meta-analysts all need access to the study data and materials, of course. They also need to be sure of the critical design information in M, I, D, and A. Later in this section, we outline how archiving procedures that preserve study data and study design can enable new scientific purposes and describe strategies for doing each of these three particular integration tasks.

23.1 Communicating

The findings from studies are communicated to other scholars through academic publications. But some of the most important audiences – policymakers, businesses, journalists, and the public at large – do not read academic journals. These audiences learn about the study in other ways. Authors write op-eds, blog posts, and policy reports that translate research for nonspecialist audiences. Press offices pitch research studies for coverage by the media. Researchers present findings directly to decisionmakers and to their research partners.

These new outputs are for different audiences, so they are necessarily diverse in their tone and approach. Some things don't change: we still need to communicate the quality of the research design and what we learn from the study. But some things do: we need to translate specialist language about the substance of the study for a nonspecialist audience and describe the features of the research design in terms nonspecialists can understand.

Too often, design information is a casualty of translating the study from academic to other audiences. When researchers write for popular blogs or give interviews, emphasis is placed on the study results, not on the reasons why the results of the study are to be believed. In sharing research with nonspecialist audiences, we revert to saying that the findings are true rather than explaining why we know they are true. Explaining why we know requires explaining the research design, which in our view ought to be part of any public-facing communication about research.

Of course, even when authors do emphasize design, journalists do not always care. Science reporting is commonly criticized for ignoring study design when picking which studies to publicize, so weak studies are not appropriately filtered out of coverage. Furthermore, journalists emphasize results they believe will drive people to pick up a newspaper or click on a headline. Flashy, surprising, or pandering findings receive far more attention than deserved, with the result that boring but correct findings are crowded out of the media spotlight.

In a review we conducted of recent studies published in The New York Times Well section on health and fitness, we found that two dimensions of design quality were commonly ignored. First, experimental studies of new fitness regimens with tiny samples, sometimes fewer than 10 units, are commonly highlighted. When both academic journals and reporters promote tiny studies, the likely result is that the published record is full of statistical flukes driven by noise, not new discoveries. Second, observational studies that compare millions of dieters and non-dieters receive outsize attention. These designs are prone to bias from confounding, but these concerns are swept under the rug.

This state of affairs is not entirely or even mostly the journalists’ fault, since, in the absence of design information, it can be challenging to separate the weak designs from the strong ones. Statistical significance and the stamp of approval from peer review are too-easy heuristics to follow.

How can we improve this state of affairs? The market incentives for both journalists and authors reward flash over substance, and any real solution would require addressing those incentives. Short of that, we recommend that authors who wish to communicate the high quality of their designs to the media do so by providing the design information in M, I, D, and A in lay terms. Science communicators should clearly state the research question (I) and explain why applying the data and answer strategies is likely to yield a good answer to that question. The actual result is, of course, also important to communicate, but why it is a credible answer to the research question is just as important to share (Principle 3.11). Building confidence in scientific results requires building confidence in scientific practice.

Here is an example of how we could cite a (hypothetical) study in a way that conveys at least some design information: "Using a randomized experiment, the researchers (Authors, Year) found that donating to a campaign causes a large increase in the number of subsequent donation requests from other candidates, which is consistent with theories of party behavior that predict intra-party cooperation."

Citations can't convey the entirety of MIDA in one sentence, but they can give an inkling. The citation explains that the data strategy included some kind of randomized experiment (we don't know how many treatment arms or subjects, among other details), and that the answer strategy probably compared the counts of donation requests from any campaign (email requests, or phone, we don't know) among the groups of subjects that were assigned to donate to a particular campaign. The citation mentions the models described in an unspecified area of the scientific literature on party politics, which all predict cooperation like the sharing of donor lists. We can reason that, if the inquiry, "Is the population average treatment effect of donating to one campaign on the number of donation requests from other campaigns positive?" were put to each of these theories, they would all respond "Yes." The citation serves as a useful shorthand for the reader of what the claim of the paper is and why they should think it's credible. By contrast, a citation like "The researchers found that party members cooperate (Author, Year)." doesn't communicate any design information at all.

23.2 Decisionmaking

Policymakers, businesses, humanitarian organizations, and individuals make decisions based on social science research. Research designs, however, are often constructed without considering who will be informed by the evidence and how they will use evidence in decisions. We can optimize our designs for both scientific publication and decisionmaking. The first step is eliciting the inquiries decisionmakers have, and the second is diagnosing how their decisions change depending on the results. How often would the decisionmaker make the right decision, with and without the study? A design that exhibits high statistical power and a high rate of making the right decision will influence not only the scientific literature but also the decisions these actors make.

We illustrate this process by declaring an experimental design that compares a status quo policy with an alternative. As the researcher, your inquiry is the average treatment effect, but the policymaker has a subtly different inquiry. The policymaker would like to know which policy to implement: the status quo or the alternative. Imagine you meet with the policymaker and ask how they would use the evidence you plan to produce. The policymaker says that they would like to switch to the alternative if it is better than the status quo. However, they face a switching cost to adopt the new policy, so for now, they would like to adopt the alternative only if it is at least 0.1 standard deviations better than the status quo.

In your design declaration, you add two new components to assess the probability that the policymaker makes the right decision. First, you add a new inquiry: is the treatment at least 0.1 standard deviations better than the control condition? Second, you add a statistical test that targets this inquiry by testing whether the treatment effect is larger than 0.1; that is, it tests the null hypothesis that \(\widehat\tau - 0.1 = 0\).

Declaration 23.1

# compare status quo to a new proposed policy, 
# given cost of switching 
N <- 100
effect_size <- 0.1

design <-
  declare_model(N = N,
                U = rnorm(N),
                potential_outcomes(Y ~ effect_size * Z + U)) +
  declare_inquiry(
    ATE = mean(Y_Z_1 - Y_Z_0),
    alternative_better_than_sq = ATE > 0.1
  ) +
  declare_assignment(Z = complete_ra(N)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z,
                    model = difference_in_means,
                    inquiry = "ATE",
                    label = "dim") +
  declare_estimator(Y ~ Z,
                    model = lh_robust,
                    linear_hypothesis = "Z - 0.1 = 0",
                    inquiry = "alternative_better_than_sq",
                    label = "decision")


In addition to power, we set up a diagnosand for the proportion of times the policymaker will make the right decision given the evidence you provide. We redesign to consider alternative sample sizes and diagnose under different possible true effect sizes: some negative (in which case the policymaker should retain the status quo), no difference (the status quo should again be retained, because the effect is not a big enough improvement to justify the switching cost), and positive effects of varying sizes.
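
A minimal sketch of this diagnosis step follows. It is not the authors' exact code: the diagnosand definitions, the grid of sample sizes and effect sizes passed to redesign(), and the number of simulations are our assumptions. The correct_decision diagnosand also assumes that the "decision" estimator returns one row per simulation with the estimate and p-value for the linear hypothesis, and that it is linked to the alternative_better_than_sq inquiry so that estimand records whether switching is truly the right call; it is only meaningful for the "decision" estimator rows.

# Sketch, not the authors' exact code: diagnosands for power and for the
# rate of correct policy decisions. The policymaker switches when the test
# of the linear hypothesis rejects and the estimated difference is positive.
decision_diagnosands <- declare_diagnosands(
  power = mean(p.value <= 0.05),
  correct_decision = mean((p.value <= 0.05 & estimate > 0) == estimand)
)

# Vary the sample size and the true effect size (values are illustrative)
designs <- redesign(design,
                    N = c(100, 500, 1000, 1500, 2000),
                    effect_size = c(-0.1, 0, 0.1, 0.25))

diagnosis <- diagnose_design(designs,
                             diagnosands = decision_diagnosands,
                             sims = 500)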

In Figure 23.1, we show the probability of retaining the status quo policy (left facet) and the probability of switching to the treatment (right) by different true effect sizes. On the left, we see that there is a very high probability of selecting the right policy when the true effect size is very low. This pattern occurs because when the effect size is low, we are likely to fail to reject the null. With small sample sizes, we are likely to also select the status quo even when we should not, because of the imprecision of the estimates. Looking at the right graph, even when the true effect size is large (i.e., 0.25), we need to have a large sample size, about 1500, to achieve 80% probability of correctly choosing the treatment.

The sample size we might choose based on this analysis of the policymaker's decision differs from the one we would choose if we only considered statistical power: the decision curve reaches 80% only at about 1,500 subjects, while power reaches 80% at just over 500. The reason for the divergence is that making the correct decision requires evidence that the treatment effect is greater than 0.1, whereas statistical power only asks whether the effect differs from zero.

Figure 23.1: Research design diagnosis for a study of the effectiveness of a policy change compared to the status quo, where a policymaker wishes to switch to the treatment policy only if it is at least 0.1 standard deviations better than the status quo. On the left, we display the statistical power of the study to detect an effect in either direction. On the right, we display the rate of making the right decision to switch policies or not.

23.3 Archiving

One of the biggest successes in the push for greater research transparency has been changing norms surrounding the sharing of data and analysis code after studies have been published. It has become de rigueur at many journals to post these materials at publicly available repositories like the OSF or Dataverse. This development is undoubtedly a good thing. In older manuscripts, data or analyses are sometimes described as being "available upon request," but of course, such requests are sometimes ignored. Furthermore, a century from now, study authors will no longer be with us even if they wanted to respond to such requests. Public repositories have a much better chance of preserving study information for the future.

What belongs in a replication archive? Enough documentation, data, and design detail that those who wish to reanalyze, replicate, and meta-analyze results can do so without contacting the authors.

Data. First, the realized data itself. Sometimes this is the raw data. Sometimes it is only the “cleaned” data that is actually used by analysis scripts. Where ethically possible, we think it is preferable to post as much of the raw data as possible after removing information like IP addresses and geographic locations that could be used to identify subjects. The output of cleaning scripts – the cleaned data – should also be included in the replication archive.

Reanalyses often reexamine and extend studies by exploring the use of alternative outcomes, varying sets of control variables, and new ways of grouping data. As a result, replication data ideally includes all data collected by the authors even if the variables are not used in the final published results. Often authors exclude these to preserve their own ability to publish on these other variables or because they are worried alternative analyses will cast doubt on their results. We hope norms will change such that study authors instead want to enable future researchers to build on their research by being expansive in what information is included.

Analysis code. Replication archives also include the answer strategy A, or the set of functions that produce results when applied to the data. We need the actual analysis code because the natural-language descriptions of A that are typically given in written reports are imprecise. As a small example, many articles describe their answer strategies as "ordinary least squares" but do not fully describe the set of covariates included or the particular approach to variance estimation. These choices can substantively affect the quality of the research design – and nothing makes these choices explicit like the actual analysis code. Analysis code is needed not only for reanalysis but also for replication and meta-analysis. Replication practice today involves inferring most of these details from descriptions in text; reanalyses may directly reuse or modify analysis code, and replication projects need to know the exact details of analyses to ensure they can implement the same analyses on the data they collect. Meta-analysis authors may take the estimates from past studies directly, so understanding the exact analysis procedure is important. Other times, meta-analyses reanalyze data to ensure comparability in estimation. Conducting analyses with and without covariates, with clustering when it was appropriate, or with a single statistical model when models vary across studies all require having the exact analysis code.

Data strategy materials. Increasingly, replication archives include the materials needed to implement treatments and measurement strategies. Without the survey questionnaires in their original languages and formats, we cannot exactly replicate them in future studies, which hinders our ability to build on and adapt them. The treatment stimuli used in the study should also be included. Data strategies are needed for reanalyses and meta-analyses too: answer strategies should respect data strategies, so understanding the details of sampling, treatment assignment, and measurement can shape reanalysts’ decisions and meta-analysis authors’ decisions about what studies to include and which estimates to synthesize.

Design declaration. While typical replication archives include the data and code, we think that future replication archives should also have a design declaration that fully describes M, I, D, and A. This should be done in code and words. A diagnosis can also be included, demonstrating the properties as understood by the author and indicating the diagnosands that the author considered in judging the quality of the design.
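
As a sketch of what this could look like in practice (our suggestion, not a required convention), the declared design object and its diagnosis can simply be saved as files in the archive alongside the data and code; here `design` stands in for the declared design of the study and the file paths are hypothetical.

# Sketch (our suggestion): save the declared design and a diagnosis in the
# replication archive, assuming `design` is the declared design object
library(DeclareDesign)

diagnosis <- diagnose_design(design, sims = 500)

saveRDS(design, file = "design/declared_design.rds")
saveRDS(diagnosis, file = "design/design_diagnosis.rds")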

Design details help future scholars not only assess, but also replicate, reanalyze, and extend the study. Reanalysts need to understand the answer strategy in order to modify or extend it, and the data strategy in order to ensure that their new analysis respects the details of the sampling, treatment assignment, and measurement procedures. Data and analysis sharing enables reanalysts to adopt or adapt the analysis strategy, but a declaration of the data strategy would help further. The same is true of meta-analysis authors, who need to understand the designs' details to make good decisions about which studies to include and how to analyze them. Replicators who wish to exactly replicate, or even just provide an answer to the same inquiry, need to understand the inquiry, data strategy, and answer strategy.

Without this information, disputes often arise after the replication is sent out for peer review: the original authors may disagree with the inferences the replicators made about the inquiry, data strategy, or answer strategy. To protect both the original authors and the replicators, including a research design declaration that specifies each of these elements resolves these issues, so that replication and extension can focus on the substance of the research question and on innovation in research design.

Figure 23.2 below shows the file structure for an example replication archive. Our view on replication archives shares much in common with the TIER protocol. The archive includes raw data in a platform-independent format (.csv) and cleaned data in a language-specific format (.rds, a format for R data files). Data features like labels, attributes, and factor levels are preserved when imported by the analysis scripts. The analysis scripts are labeled by the outputs they create, such as figures and tables. A master script is included that runs the cleaning and analysis scripts in the correct order. The documents folder contains the paper, the supplemental appendix, the pre-analysis plan, the populated analysis plan, and codebooks that describe the data. A README file explains each part of the replication archive. We also suggest that authors include a script that contains a design declaration and diagnosis.

Figure 23.2: File structure for archiving

Further Reading

  • Peer, Orr, and Coppock (2021) propose that researchers should “actively maintain” their replication archives by checking that they still run and making updates to obsolete code. In this way, the information about A that is contained in the replication archive stays current and scientifically useful.

  • Elman, Kapiszewski, and Lupia (2018) argue that the benefits of data transparency in political science outweigh its costs.

  • Bowers (2011) describes how good archiving is like collaborating with your future self.

  • Alvarez, Key, and Núñez (2018) provide guidance on how to create good replication archives.

23.4 Reanalysis

A reanalysis of an existing study is a follow-up study that reuses the original realized data for some new purpose. The reanalysis is a study with a research design that can be described in terms of M, I, D, and A. Reanalyses are fundamentally constrained by the data strategy of the original study. The data strategy D and the resulting data are set in stone – but reanalysts can make changes to the answer strategy A and sometimes also to the model M or inquiry I.

We can learn from reanalyses in several ways. First, we can fix errors in the original answer strategy. Reanalyses can correct simple mathematical errors, typos in data transcription, or failures to account for features of the data strategy when analyzing the data. These reanalyses show whether the original results do or do not depend on these corrections. Second, we can reassess the study in light of new information about the world learned after the original study was published. That is, sometimes M changes in ways that color our interpretation of past results. Perhaps we learned about new confounders or alternative causal channels that undermine the original design's credibility. Demonstrating that the results do (or do not) change when new model features are incorporated improves our understanding of the inquiry. Third, reanalyses may aim to answer new questions that were not considered by the original study but for which the realized data can provide useful answers.

Lastly, many reanalyses show that original findings are not “robust” to alternative answer strategies. These are better conceptualized as claims about robustness to alternative models: one model may imply one answer strategy, and a different model, with another confounder, suggests another. If both models are plausible, a good answer strategy should be robust to both and even help distinguish between them. A reanalysis could uncover robustness to these alternative models or lack thereof.

Reanalyses are themselves research designs. Just like any design, whether a reanalysis is a strong research design depends on possible realizations of the data (as determined by the data strategy), not just the realized data. Because the realized data is fixed in a reanalysis, analysts are often instead tempted to judge the reanalysis based on whether it overturns or confirms the original study's results. In this way of thinking, a successful reanalysis demonstrates that the original results are not robust to other plausible models by showing that they change under an alternative answer strategy.

This way of thinking can lead to incorrect assessments of reanalyses. We need to consider what answers we would obtain under the original answer strategy A and the reanalysis strategy A’ under many possible realizations of the data. A good reanalysis strategy reveals with high probability the set of models of the world under which we can make credible claims about the inquiry. Whether or not the results change under the answer strategies A and A’ tells us little about this probability because the realized data is only one draw.

23.4.1 Example

In this section, we illustrate the flaw in assessing reanalyses based on the changing significance of results alone. We demonstrate how to assess the properties of reanalysis plans by comparing original answer strategies to proposed reanalysis answer strategies.

The design we consider is an observational study with a binary treatment \(Z\) that may or may not be confounded by a covariate \(X\). Suppose that the original researcher had in mind a model in which \(Z\) is not confounded by \(X\):

# X is not a confounder and is measured pretreatment
model_1 <- 
  declare_model(
    N = 100,
    U = rnorm(N),
    X = rnorm(N),
    Z = rbinom(N, 1, prob = plogis(0.5)),
    potential_outcomes(Y ~ 0.1 * Z + 0.25 * X + U),
    Y = reveal_outcomes(Y ~ Z)
  ) 

The reanalyst has in mind a different model. In this second model, \(X\) confounds the relationship between \(Z\) and \(Y\):

# X is a confounder and is measured pretreatment
model_2 <- 
  declare_model(
    N = 100,
    U = rnorm(N),
    X = rnorm(N),
    Z = rbinom(N, 1, prob = plogis(0.5 + X)),
    potential_outcomes(Y ~ 0.1 * Z + 0.25 * X + U),
    Y = reveal_outcomes(Y ~ Z)
  ) 

The original answer strategy A is a regression of the outcome \(Y\) on the treatment \(Z\). The reanalyst collects the covariate \(X\) and proposes to control for it in a linear regression; call that strategy A_prime.

A <- declare_estimator(Y ~ Z, model = lm_robust, label = "A")
A_prime <- declare_estimator(Y ~ Z + X, model = lm_robust, label = "A_prime")

Applying the two answer strategies, we get differing results. The treatment effect estimate is significant under A but not under A_prime. Commonly, reanalysts would infer from this that the answer strategy A_prime is preferred and that the original result was incorrect.

draw_estimates(model_2 + A + A_prime)
estimator    estimate    std.error    p.value
A               0.385        0.176      0.031
A_prime         0.219        0.188      0.246

As we show now, these claims depend on the validity of the model and should be assessed with design diagnosis. Consider a third model in which \(X\) is affected by both \(Z\) and \(Y\). (In many observational settings, it can be difficult to know with certainty which variables are causally prior or posterior to others.) We now diagnose both answer strategies under all three models.

# X is not a confounder and is measured posttreatment
model_3 <- 
  declare_model(
    N = 100,
    U = rnorm(N),
    Z = rbinom(N, 1, prob = plogis(0.5)),
    potential_outcomes(Y ~ 0.1 * Z + U),
    Y = reveal_outcomes(Y ~ Z),
    X = 0.1 * Z + 5 * Y + rnorm(N)
  ) 

I <- declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0))

design_1 <- model_1 + I + A + A_prime
design_2 <- model_2 + I + A + A_prime
design_3 <- model_3 + I + A + A_prime
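
The diagnosis reported in Table 23.1 can be produced by diagnosing all three designs together; bias is among the default diagnosands, and the number of simulations below is arbitrary.

# Diagnose all three designs at once; bias is a default diagnosand
diagnosis <- diagnose_design(design_1, design_2, design_3, sims = 500)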

What we see in the diagnosis below is that A_prime is only preferred if we know for sure that \(X\) is measured pretreatment. In design 3, where \(X\) is measured posttreatment, A is preferred, because controlling for \(X\) leads to posttreatment bias. This diagnosis indicates that the reanalyst needs to justify their beliefs about the causal ordering of \(X\) and \(Z\) to claim that A_prime is preferred to A. The reanalyst should not conclude on the basis of the realized estimates only that their answer strategy is preferred.

Table 23.1: Diagnosis of the reanalysis design under alternative models
design      estimator    bias
design_1    A            -0.003
design_1    A_prime      -0.002
design_2    A             0.218
design_2    A_prime       0.013
design_3    A             0.008
design_3    A_prime      -0.116

Three principles emerge from the idea that changing A to A’ should be justified by diagnosis, not the comparison of the realized results of the two answer strategies.

  1. Home ground dominance. Holding the original M constant (i.e., the home ground of the original study), if you can show that a new answer strategy A' yields better diagnosands than the original A, then A' can be justified by home ground dominance. In the example above, model 1 is the "home ground," and the reanalyst's A' is preferred to A on this home ground.

  2. Robustness to alternative models. A second justification for a change in answer strategy is that you can show that a new answer strategy is robust to both the original model M and a new, also plausible, M'. In observational studies, we are uncertain about many features of the model, such as the existence of unobserved confounders. In the example above, A' is robust to models 1 and 2 but is not robust to model 3. By contrast, A is robust to models 1 and 3 but not to model 2.

  3. Model plausibility. If the diagnosands for a design with A' are worse than those with A under M but better under M', then the switch to A' can only be justified by a claim or demonstration that M' is more plausible than M. As we saw in the example, neither A nor A' was robust to all three alternative models. A claim about model plausibility would have to be invoked to justify controlling for \(X\). Such a claim could be made on the basis of substantive knowledge or additional data. For example, the reanalyst could demonstrate that data collection of \(X\) took place before the treatment was realized in order to rule out model 3.

23.5 Replication

After your study is completed, it may one day be replicated. Replication differs from reanalysis in that a replication study involves collecting new data to study the same inquiry. A new model, data strategy, or answer strategy may also be proposed.

So-called "exact" replications hold key features of I, D, and A fixed, but draw a new dataset from the data strategy and apply the same answer strategy A to the new data to produce a fresh answer. Replications are said to "succeed" when the new and old answers are similar and to "fail" when they are not. Dichotomizing replication attempts into successes and failures is usually not that helpful, and it would be better to simply characterize how similar the old and new answers are. Literally exact replication is impossible: at least some elements of M have changed between the first study and the replication. Specifying how they might have changed, e.g., how outcomes vary with time, will help in judging the differences observed between old and new answers.

Replication studies can benefit enormously from the knowledge gains produced by the original studies: we learn a great deal about the model M and about the value of the inquiry from the original study, and the M of the replication study can and should incorporate this new information. For example, if we learn from the original study that the estimand is positive but possibly small, the replication study could respond by changing D to increase the sample size. Design diagnosis can help you decide how to change the replication study's design in light of the original research.
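
For instance, a replication designer could take a declaration of the original design, substitute the smaller effect size suggested by the original results, and diagnose power at several candidate sample sizes. The sketch below reuses the `design` object from Declaration 23.1 purely as a stand-in for a declared version of the original study; the effect size and sample sizes are illustrative.

# Sketch: rediagnose a declared design under a smaller assumed effect size
# to choose the replication sample size; `design` is a stand-in and the
# values are illustrative
replication_designs <- redesign(design,
                                effect_size = 0.1,
                                N = c(500, 1000, 2000, 4000))

diagnose_design(replication_designs, sims = 500)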

When changes to the data strategy D or answer strategy A can be made to produce more informative answers about the same inquiry I, exact replication may not be preferred. Holding the treatment and outcomes constant may be required to provide an answer to the same I, but increasing the sample size, sampling individuals rather than villages, or making other changes may be preferable to exact replication. Replication designs can also take advantage of new best practices in research design.

So-called “conceptual” replications alter both M and D, but keep I and A as similar as possible. That is, a conceptual replication tries to ascertain whether a relationship in one context also holds in a new context. The trouble and promise of conceptual replications lie in the designer’s success at holding I constant. Too often, a conceptual replication fails because in changing M, too much changes about I, muddying the “concept” under replication.

A summary function is needed to interpret the difference between the original answer and the replication answer. The function might keep the new answer and discard the old one if the original design was poor; it might take a simple average of the two; or it might take a precision-weighted average. Specifying this function ex ante may be useful so that the choice of summary does not depend on the replication results. This summary function will be reflected in A and in the discussion section of the replication paper.

23.5.1 Example

Here we have an original study with a sample size of 1,000. The original study's true sample average treatment effect (SATE) is 0.2 because the original authors happened to study a very treatment-responsive population. We seek to replicate the original results, whatever they may be, and we want to characterize the probability of concluding that we "failed" to replicate them. We consider four alternative metrics for assessing replication failure.

  1. Are the original and replication estimates statistically significantly different from each other? If yes, we conclude that we failed to replicate the original results, and if no, we conclude that the study replicated.

  2. Is the replication estimate within the original 95% confidence interval?

  3. Is the original estimate within the replication 95% confidence interval?

  4. Do we fail to affirm equivalence between the replication and original estimates, using a tolerance of 0.2? (A sketch of such an equivalence test appears after this list.)
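
The fourth metric can be computed with a two one-sided tests (TOST) procedure applied to the difference between the original and replication estimates. The sketch below is our own illustration, not the authors' code; it uses a normal approximation, and the estimates and standard errors plugged in at the end are hypothetical.

# Sketch of an equivalence (TOST) check on the difference between the
# original and replication estimates, with tolerance 0.2. Normal
# approximation; the inputs in the example call are hypothetical.
affirm_equivalence <- function(est_orig, se_orig, est_rep, se_rep,
                               tolerance = 0.2, alpha = 0.05) {
  diff <- est_rep - est_orig
  se_diff <- sqrt(se_orig^2 + se_rep^2)
  # two one-sided tests against the lower and upper equivalence bounds
  p_lower <- pnorm((diff + tolerance) / se_diff, lower.tail = FALSE)
  p_upper <- pnorm((diff - tolerance) / se_diff, lower.tail = TRUE)
  # equivalence is affirmed only if both one-sided tests reject
  max(p_lower, p_upper) <= alpha
}

affirm_equivalence(est_orig = 0.20, se_orig = 0.08, est_rep = 0.12, se_rep = 0.04)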

Figure 23.3 shows that no matter how big we make the replication study, the rate at which we conclude that the difference in SATEs is nonzero is only about 10%. Similarly, the replication estimate is rarely outside of the original confidence interval, because it is rare to be more extreme than a wide confidence interval. The relatively high variance of the original estimate means it is so uncertain that it is hard to distinguish it from any particular number.

If we turn to the third metric, we become more and more likely to conclude that the study fails to replicate as the replication study grows. At very large sample sizes, the replication confidence interval becomes extremely narrow, so in the limit it will always exclude the original study estimate.

The last metric, equivalence testing, has the nice property that as the sample size grows, we get closer to the correct answer – the true SATEs are indeed within 0.2 standard units of each other. However, again because the original study is so noisy, it is difficult to affirm its equivalence with anything, even when the replication study is quite large.

The upshot of this exercise is that, curiously, when original studies are weak (in that they generate imprecise estimates), it becomes harder to conclusively affirm that they did not replicate. This set of incentives is somewhat perverse: designers of original studies benefit from a lack of precision if it means they can’t “fail to replicate.”

Figure 23.3: Rates of ‘Failure to Replicate’ according to four diagnosands

23.6 Meta-analysis

One of the last stages of the lifecycle of a research design is its eventual incorporation into our common scientific understanding of the world. Research findings are synthesized into our broader scientific understanding through systematic reviews and meta-analyses. In this section, we describe how a meta-analysis project itself comprises a new research design, whose properties we can investigate through declaration and diagnosis.

Research synthesis takes two basic forms. The first is meta-analysis, in which a series of estimates are analyzed together in order to better understand features of the distribution of answers obtained in the literature (see Section 18.4). Studies can be averaged together in ways that are better and worse. Sometimes the answers are averaged together according to their precision. A precision-weighted average gives more weight to precise estimates and less weight to studies that are noisy. Sometimes studies are “averaged” by counting up how many of the estimates are positive and significant, how many are negative and significant, and how many are null. This is the typical approach taken in a literature review. Regardless of the averaging approach, the goal of this kind of synthesis is to learn as much as possible about a particular inquiry I by drawing on evidence from many studies.
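
As a simple illustration of precision weighting (with toy numbers of our own, not drawn from any particular literature), a fixed-effect, inverse-variance-weighted average can be computed as follows.

# Toy example of a fixed-effect, precision-weighted (inverse-variance)
# meta-analytic average; the estimates and standard errors are hypothetical
estimates <- c(0.12, 0.30, 0.05)
std_errors <- c(0.05, 0.15, 0.10)

weights <- 1 / std_errors^2
meta_estimate <- sum(weights * estimates) / sum(weights)
meta_std_error <- sqrt(1 / sum(weights))

c(estimate = meta_estimate, std_error = meta_std_error)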

A second kind of synthesis is an attempt to bring together the results of many designs, each of which targets a different inquiry about a common model. This is the kind of synthesis that takes place across an entire research literature. Different scholars focus on different nodes and edges of the common model, so a synthesis needs to incorporate these diverse sources of evidence.

How can you best anticipate how your research findings will be synthesized? For the first kind of synthesis – meta-analysis – you must keep a commonly understood I in mind. You want to select inquiries not for their novelty, but because of their commonly understood importance. We want many studies on the effects of having women rather than men as elected officials on public goods provision because we want to understand this particular I in great detail and specificity. While the specifics of the model M might differ from study to study, the fact that the Is are all similar enough to be synthesized allows for a specific kind of knowledge accumulation.

For the second kind of synthesis – literature-wide progress on a full causal model – even greater care is required. Specific studies cannot make up bespoke models M but instead must understand how the specific M adopted in the study is a special case of some broader M that is in principle agreed to by a wider research community. The unending proliferation of study-specific theories is a threat to this kind of knowledge accumulation. In a telling piece, McPhetres et al. (2020) document that in a decade of research articles published in Psychological Science, 359 specific theories were named; 70% were named only once and a further 12% just twice.

Since either kind of synthesis is a research design of its own, declaring it and diagnosing its properties can be informative. The data strategy for any research synthesis includes the process of collecting past studies. Search strategies are sampling strategies, and they can be biased in the same ways as convenience samples of individuals. Conducting a full census of the past literature on a topic is usually not possible, since not all research is made public, but selecting only published studies may reinforce publication biases. Proactively collecting working papers and soliciting unpublished or abandoned research on a topic are strategies to mitigate these risks. The choice of answer strategy for research synthesis depends on model assumptions about how studies are related; the model for a research synthesis thus might include assumptions not only about how studies reach you as the synthesizer, but also about how the contexts and units were selected in those original studies. Three common inquiries for meta-analysis are the average effect across contexts, the extent to which effects vary across contexts, and the best estimate of effects in specific contexts. Diagnosis can help assess the conditions under which your analysis strategies will provide unbiased, efficient estimates of true effects, either in the subset of contexts that were studied or in a broader population.
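
To make this concrete, here is a minimal declaration of a meta-analysis as a research design. It is our own sketch, not drawn from any particular synthesis: the number of candidate studies, the distributions of true effects and standard errors, and the sampling step standing in for which studies reach the synthesizer are all assumptions.

# Sketch: a meta-analysis declared as a research design. True study-level
# effects vary across contexts, only some studies reach the synthesizer,
# and the answer strategy is a precision-weighted average targeting the
# mean effect across all contexts.
library(DeclareDesign)

meta_design <-
  declare_model(
    N = 50,                                      # candidate studies
    true_effect = rnorm(N, mean = 0.2, sd = 0.1),
    std_error = runif(N, 0.05, 0.25),
    estimate = rnorm(N, mean = true_effect, sd = std_error)
  ) +
  declare_inquiry(mean_effect = mean(true_effect)) +
  declare_sampling(S = complete_rs(N, n = 20)) + # which studies you find
  declare_estimator(
    handler = label_estimator(function(data) {
      w <- 1 / data$std_error^2
      data.frame(estimate = sum(w * data$estimate) / sum(w),
                 std.error = sqrt(1 / sum(w)))
    }),
    inquiry = "mean_effect",
    label = "precision_weighted"
  )

diagnose_design(meta_design, sims = 500)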

Further Reading

  • McPhetres et al. (2020) on the proliferation of theories in psychology.
  • Samii (2016) on the role of “causal empiricists,” as distinct from the role of theorists.