21 Planning

We list “design early” first among our research design principles (Principle 3.1) because it suffuses our point of view about how to construct strong research designs. “Measure twice, cut once.” Research projects are very long journeys: going from the kernel of an idea to a published paper typically takes multiple years. Once the data strategy has been implemented, you’re stuck with it, so mindful planning beforehand is important.

The planning process changes designs. We work out designs that meet ethical as well as scientific standards, accommodate the needs of research partners, and operate within financial and logistical constraints. When we are insufficiently sure of key inputs to design declarations, we can run pilots, but we need to be careful about how we incorporate what we learn from them. Finally, when we write up a declaration or a PAP with a declaration, this can be a useful moment to get feedback from our peers to improve the design. We discuss each of these steps in this chapter.

21.1 Ethics

As researchers, we have ethical obligations beyond the requirements of national laws and the regulations of institutional review boards.

For a long time, thinking about research ethics has been guided by the ideas in the Belmont Report, which emphasizes respect for persons, beneficence, and justice. Recently, more attention has been given to principles that extend beyond care for human subjects to include considerations for the well-being of collaborators and partners and the broader social impact of research. Social scientific professional associations have developed principles and guidelines to help think through these issues. Key references include:

The considerations at play vary across context and methods. For example, Teele (2021) describes ethical considerations in field experimentation, Humphreys (2015) focuses on development settings, Slough (2020) considers the ethics of field experimentation in the context of elections, and Wood (2006) and Baron and Young (2020) consider ethical challenges specific to field research in conflict settings.

However, a common meta-principle underlying many of these contributions is the injunction to give prominent consideration to ethical issues: reflect on ethical dimensions of your work ex ante and report on ethical implications ex post. Lyall (2020) specifically connects ethical reflection to ex ante design considerations.

We encourage you to engage with ethical considerations in this way, early in the research design lifecycle. Some design declarations and diagnoses elide ethical considerations. For instance, a declaration that is diagnosand-complete for statistical power may tell you little about the level of care and respect accorded to subjects. Many declarations are diagnosand-complete for bias, but obtaining an unbiased treatment effect estimate is not always the highest goal.

Ethical diagnosands can be directly incorporated into the declare-diagnose-redesign framework. Diagnosands could include the total cost to participants, how many participants were harmed, the average level of informed consent measured by a survey about comprehension of study goals, or the risks of adverse events. More complex ethical diagnosands may be possible as well: Slough (2020) provides a formal analysis of the “aggregate electoral impact” diagnosand for experiments that take place in the context of elections. We consider two specific ethical diagnosands here, costs and potential harms, though many others may apply in particular research scenarios.

Costs. A common concern is that measurement imposes a cost on subjects, if only by wasting their time. Subjects’ time is a valuable resource they often donate willingly to the scientific enterprise by participating in a survey or other measurement. Although subjects’ generosity is sometimes repaid with financial compensation, in many scenarios direct payments are not feasible. Regardless of whether subjects are paid, the costs to subjects should be top of mind when designing the study.

Potential harms. Different realizations of the data from the same data strategy may differ in their ethical status. Ex-post, a study may not have ended up harming subjects, but ex-ante, there may have been a risk of harm (Baron and Young 2020). The project’s ethical status depends on judgments about potential harms and potential participants: not only what did happen, but what could have happened. The potential harm diagnosand might be formalized as the maximum harm that could eventuate under any realization of the data strategy. Researchers could then follow a minimax redesign procedure to find the design that minimizes this maximum potential harm.
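The minimax redesign procedure can be sketched in a few lines. This is only an illustration under stated assumptions: the three candidate data strategies, the harm values attached to their possible realizations, and the assumption that realizations are equally likely are all hypothetical.

```python
from statistics import mean

# Hypothetical harm (e.g., number of subjects harmed) under each possible
# realization of three candidate data strategies. Illustrative numbers only;
# realizations are assumed to be equally likely.
harm_by_design = {
    "phone_survey": [0, 0, 1, 2],
    "in_person_survey": [0, 0, 0, 6],
    "administrative_data": [1, 1, 1, 1],
}

def diagnose_harm(realized_harms):
    """Return two ethical diagnosands for one design."""
    return {
        "expected_harm": mean(realized_harms),
        "max_potential_harm": max(realized_harms),
    }

diagnosis = {name: diagnose_harm(h) for name, h in harm_by_design.items()}

# Minimax redesign: choose the design whose worst case is least bad.
minimax_choice = min(diagnosis, key=lambda d: diagnosis[d]["max_potential_harm"])

# For comparison: the design with the lowest expected harm.
lowest_mean_choice = min(diagnosis, key=lambda d: diagnosis[d]["expected_harm"])

print("Minimax choice:", minimax_choice)
print("Lowest expected harm:", lowest_mean_choice)
```

With these made-up inputs the two rules disagree: the design with the lowest expected harm still carries a worse worst case than the minimax choice. That gap is the kind of tradeoff an ethical diagnosis is meant to surface.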

When the design is diagnosed, we can characterize the ethical status of possible realizations of the design as well as the ethicality of the distribution of these realizations. Is the probability of harm minimal “enough?” Is the degree of informed consent sufficient? Given that these characteristics vary across designs and across realizations of the same design, writing down concretely both the measure of the ethical status and the ethical threshold can help structure thinking. These diagnoses and the considerations that inspire them can be shared in funding proposals, preanalysis plans, or other reports. Articulating them in a design may help clarify whether proper account was taken of risks ex ante, or, more usefully, remind researchers to be sure to take account of them.

Often, once an ethical threshold is met, we select among feasible designs based on research design criteria such as statistical power and bias. This approach has appeal since we should only implement designs that meet the relevant research community’s ethical standards. However, dichotomizing designs into “ethical” and “unethical” is a difficult task in general. Instead, we should continue to assess ethical considerations alongside the quality of the research design. Even among ethical designs, we still face tradeoffs between how much time is asked of subjects and the risk of harm. We should select designs that appropriately weight these considerations against other desiderata and be able to articulate and justify the weighting used. When obtaining a credible answer would come at too high an ethical cost, the study may need to be scrapped altogether.

21.2 Approvals

When researchers sit at universities in the United States, research must be approved by the university’s institutional review board (IRB) under the federal regulation known as the “Common Rule.” Similar research review bodies exist at universities worldwide and at many independent research organizations and think tanks. Though these boards are commonly thought to judge research ethics, in fact, they mainly exist to protect their institution from liability for research gone awry (King and Sands 2015). Accordingly, a researcher’s obligation to consider their study’s ethics is neither constrained nor checked by IRBs. Instead, a set of idiosyncratic rules and practices specific to each institution is checked (Schrag 2010). The researcher, as a result, remains responsible for their own ethical decision about whether or not to move forward with the research. That said, the IRB process is not necessarily without benefit. In some cases, useful discussions can be had with IRB members about study decisions, and the approval itself may protect the researcher from some kinds of liability.

Beyond the IRB, laws and regulations at the country, state or province, or municipality level may also govern research on human subjects. Many countries require human subjects approval, especially for health research, in addition to the approvals researchers must seek from their home institutions. These approvals serve a similar purpose to the home institution IRB, but by virtue of their authority coming from the context in which the research is conducted rather than from faraway bureaucrats, they may serve to more directly protect human subjects.

Though these bodies’ goals differ from the broader ethical aims social scientists hold, design diagnosis may also be useful here. Many IRBs ask researchers to describe tradeoffs between the costs and benefits to research subjects. In some cases, researchers are asked to defend research design choices that provide benefits to science, but where the only direct effects on participants are costs with no immediate benefits. Defining the costs and benefits to participants in terms of their time and money and the compensation provided by researchers, if any, can both simplify communication with IRBs and provide tools for researchers to more easily clarify these tradeoffs for themselves. The expected benefit and expected cost can be diagnosands across possible realizations of the design. The design diagnosis can highlight tradeoffs between the value to participants and the scientific value in the form of standard diagnosands. Rather than being argued about in the abstract, these quantities can be simulated and described formally through declaration and diagnosis.

21.3 Partners

Partnering with third-party organizations in research entails cooperating to intervene in the world or to measure outcomes. Researchers seek to produce (and publish) scientific knowledge; they work with political parties, government agencies, nonprofit organizations, and businesses to learn more than they could if they worked independently. These groups work with researchers to learn about how to achieve their own organizational goals. Governments may want to expand access to healthcare, corporations to improve their ad targeting, and nonprofits to demonstrate program impact to funding organizations.

In the best-case scenario, the goals of the researchers and partner organizations are aligned. When the scientific question to be answered is the same as the practical question the organization cares about, the gains from cooperation are clear. The research team gains access to the organization’s financial and logistical capacity to act in the world, and the partner organization gains access to the researchers’ scientific expertise. Finding the right research partner almost always amounts to finding an organization with a common – or at least not conflicting – goal. Selecting a research design amenable to both parties requires understanding each partner’s private goals. Research design declaration and diagnosis can help with this problem by formalizing tradeoffs between the two sets of goals.

One frequent divergence between partner and researcher goals is that partner organizations often want to learn, but they care most about their primary mission. This dynamic is sometimes referred to as the “learning versus doing” tradeoff. (In business settings, this tradeoff goes by names like “learning versus earning” or “exploration versus exploitation.”) An aid organization cares about delivering its program to as many people as possible. Learning whether the program has the intended effects on the outcomes of interest is obviously also important, but resources spent on evaluation are resources not spent on program delivery.

Research design diagnosis can help navigate the learning versus doing tradeoff. One instance of the tradeoff is that the proportion of units that receive a treatment represents the rate of “doing,” but this rate also affects the amount of learning. In the extreme, if all units are treated, we can’t measure the effect of the treatment. The tradeoff here is represented in Figure 21.1, which shows the study’s power versus the proportion treated (top facet) and the partner’s utility (bottom facet). The researchers have a power cutoff at the standard 80% threshold. The partner also has a strict cutoff: they need to treat at least 2/3 of the sample to fulfill a donor requirement.

Researchers might simply ignore the proportion treated and select the design with the highest power in the absence of partners. With a partner organization, the researcher might use this graph in conversation with the partner to jointly select the design that has the highest power among those with a sufficiently high proportion treated to meet the partner’s needs. This is represented in the “zone of agreement” in gray: in this region, the design has at least 80% power and at least two-thirds of the sample are treated. Deciding within this region involves a tradeoff between power (which is decreasing in the proportion treated here) and the partner’s utility (which is increasing in proportion treated). The diagnosis surfaces the zone of agreement and clarifies the choice between designs in that region.

Figure 21.1: Navigating research partnerships.
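The zone of agreement can be computed directly. The sketch below uses a normal approximation to the power of a two-arm difference-in-means test. The sample size of 500, the effect size of 0.3 standard deviations, and the unit outcome variance are hypothetical and are not the inputs behind Figure 21.1.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def power_two_arm(n, effect, prop_treated, alpha=0.05):
    """Approximate power of a difference-in-means test (outcome SD = 1).

    Uses the normal approximation and ignores the negligible chance of
    rejecting in the wrong direction.
    """
    se = 1 / sqrt(n * prop_treated * (1 - prop_treated))
    return Z.cdf(effect / se - Z.inv_cdf(1 - alpha / 2))

N, EFFECT = 500, 0.3            # hypothetical design parameters
MIN_POWER = 0.80                # researchers' cutoff
MIN_TREATED_PERCENT = 200 / 3   # partner's cutoff: at least 2/3 treated

# Scan designs that treat 1% to 99% of the sample and keep those that
# satisfy both the researchers' and the partner's requirements.
zone_of_agreement = [
    pct for pct in range(1, 100)
    if pct >= MIN_TREATED_PERCENT
    and power_two_arm(N, EFFECT, pct / 100) >= MIN_POWER
]

print(f"Zone of agreement: {zone_of_agreement[0]}% to {zone_of_agreement[-1]}% treated")
```

Under these made-up inputs the zone runs from 67% to 77% treated. Researchers would prefer the lower end (more power) and the partner the upper end (more units served), so the remaining choice is a negotiation within that interval.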

Choosing the proportion treated is one example of integrating partner constraints into research designs. A second common problem is that there is a set of units that must be treated or that must not be treated for ethical or political reasons (e.g., the home district of a government partner must receive the treatment). If these constraints are discovered after treatment assignment, they lead to noncompliance, which may substantially complicate the analysis of the experiment and even prevent providing an answer to the original inquiry. Gerber and Green (2012) recommend, before randomizing treatment, exploring possible treatment assignments with the partner organization and using this exercise to elicit the set of units that must or cannot be treated. King et al. (2007) describe a “politically-robust” design, which uses pair-matched block randomization. In this design, when any unit is dropped due to political constraints, the whole pair is dropped from the study.
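The logic of dropping whole pairs can be illustrated with a short sketch. It is not the implementation used by King et al. (2007): the unit labels are hypothetical, the pairs are assumed to have been matched already, and the helper names are ours.

```python
import random

def assign_within_pairs(pairs, seed=0):
    """Randomly treat exactly one unit in each matched pair."""
    rng = random.Random(seed)
    assignment = {}
    for a, b in pairs:
        treated = rng.choice((a, b))
        assignment[a] = int(a == treated)
        assignment[b] = int(b == treated)
    return assignment

def drop_constrained_pairs(pairs, constrained_units):
    """Drop every pair that contains a politically constrained unit."""
    constrained = set(constrained_units)
    return [pair for pair in pairs if not constrained & set(pair)]

# Hypothetical matched pairs of districts.
pairs = [("A", "B"), ("C", "D"), ("E", "F")]
assignment = assign_within_pairs(pairs)

# Suppose the partner later insists that district C must be treated.
kept = drop_constrained_pairs(pairs, {"C"})
analysis_assignment = {u: assignment[u] for pair in kept for u in pair}

print("Pairs kept for analysis:", kept)
print("Assignment among kept units:", analysis_assignment)
```

Because assignment was random within each pair, the remaining pairs still contain one treated and one control unit each, so the comparison within the kept pairs is unaffected by the dropped pair.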

A major benefit of working with partners is their deep knowledge of the substantive area. For this reason, we recommend involving them in the design declaration and diagnosis process. How can we develop intuitions about the means, variances, and covariances of the variables to be measured? Ask your partner for their best guesses, which may be far more educated than your own. For experimental studies, solicit your partner’s beliefs about the magnitude of the treatment effect on each outcome variable, subgroup by subgroup if possible. Engaging partners in the declaration process improves design – and it very quickly sharpens the discussion of key design details. Sharing your design diagnoses and mock analyses before the study is launched can help to build a consensus around the study’s goals.

21.4 Funding

Higher quality designs usually come with higher costs. Collecting original data is more expensive than analyzing existing data, but collecting new data may be more or less costly depending on the ease of contacting subjects or conducting measurements. As a result, including cost diagnosands in research design diagnosis can directly aid data strategy decision-making. These diagnosands may usefully include both average cost and maximum cost. Researchers may make different decisions about cost: in some cases, the researcher will select the “best” design in terms of research design quality subject to a budget constraint. Others will choose the cheapest among similar quality designs to save money for future research. Diagnosis can help identify each set and decide among them.

To relax the budget constraint, researchers apply for funding. Funding applications have to communicate important features of the proposed research design. Funders want to know why the study would be useful, important, or interesting to scholars, the public, or policymakers. They also want to ensure that the research design provides credible answers to the question and that the research team is capable of executing the design. Since it’s their money on the line, funders also care that the design provides good value-for-money.

Researchers and funders have an information problem. Applicants wish to obtain as large a grant as possible for their design but have difficulty credibly communicating the quality of their design given the subjectivity of the exercise. On the flip side, funders wish to get the most value-for-money in the set of proposals they decide to fund and have difficulty assessing the quality of proposed research. Design declaration and diagnosis provide a partial solution to the information problem. A common language for communicating the proposed design and its properties can communicate the value of the research under design assumptions that can be understood and interrogated by funders.

Funding applications should include a declaration and diagnosis of the proposed design. In addition to common diagnosands such as bias and efficiency, two special diagnosands may be valuable: cost and value-for-money. The cost can be included for each design variant as a function of design features such as sample size, the number of treated units, and the duration of survey interviews. Simulating the design across possible realizations of each variant explains how costs vary with choices the researcher makes. Value-for-money is a diagnosand that is a function of cost and the amount learned from the design.

In some cases, funders request applicants to provide multiple options and multiple price points or make clear how a design could be altered so that it could be funded at a lower level. Redesigning over differing sample sizes communicates how the researcher conceptualizes these options and provides the funder with an understanding of tradeoffs between the amount of learning and cost in these design variants. Applicants could use the redesign process to justify the high cost of their request directly in terms of the amount learned.
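A minimal sketch of cost and value-for-money diagnosands across sample sizes follows. The cost parameters, the effect size of 0.2 standard deviations, and the definition of value-for-money as power per thousand dollars are all assumptions made for illustration. Other definitions (for example, precision per dollar) would be equally defensible.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def power_two_arm(n, effect, prop_treated=0.5, alpha=0.05):
    """Approximate power of a difference-in-means test (outcome SD = 1)."""
    se = 1 / sqrt(n * prop_treated * (1 - prop_treated))
    return Z.cdf(effect / se - Z.inv_cdf(1 - alpha / 2))

def cost(n, prop_treated=0.5, fixed=5000, per_interview=10, per_treatment=25):
    """Hypothetical budget: fixed costs, one interview per subject,
    and a delivery cost for each treated subject."""
    return fixed + per_interview * n + per_treatment * n * prop_treated

# Redesign over sample sizes and record both diagnosands for each variant.
rows = []
for n in (250, 500, 1000, 2000):
    c = cost(n)
    p = power_two_arm(n, effect=0.2)
    rows.append({"n": n, "cost": c, "power": p, "power_per_1000": 1000 * p / c})

for row in rows:
    print(f"N={row['n']:>5}  cost=${row['cost']:>8,.0f}  "
          f"power={row['power']:.2f}  power per $1,000={row['power_per_1000']:.3f}")

# The cheapest variant that clears the conventional 80% power threshold.
cheapest_adequate = min((r for r in rows if r["power"] >= 0.80), key=lambda r: r["cost"])
print("Cheapest adequately powered design: N =", cheapest_adequate["n"])
```

A table like this lets a funder see what each additional dollar buys, and lets an applicant justify a request at a given price point in terms of the amount learned.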

Ex-ante power analyses are required by an increasing number of funders. Current practice, however, illustrates the crux of the misaligned incentives between applicants and funders. Power calculators online have difficult-to-interrogate assumptions built in and cannot accommodate the specifics of many common designs (Blair et al. 2020). As a result, existing power analyses can demonstrate that almost any design is “sufficiently powered” by changing expected effect sizes and variances. Design declaration is a partial solution to this problem. By clarifying the assumptions of the design in code, applicants can more clearly link the assumptions of the power analysis to the specifics of the design setting.

Finally, design declarations can also help funders compare applications on standard scales: root mean-squared-error, bias, and power. They also want to weigh considerations like importance and fit. Moving design considerations onto a common scale takes some of the guesswork out of the process and reduces reliance on researcher claims about properties.

21.5 Piloting

Designing a research study always entails relying on a set of beliefs, what we’ve referred to as the set of possible models in M. Choices like how many subjects to sample, which covariates to measure, and which treatments to allocate depend on beliefs about treatment effects, the correlations of the covariates with the outcome, and the variance of the outcome.

We may have reasonably educated guesses about these parameters from past studies or theory. Our beliefs about the nodes and edges in the causal graph of M, expected effect sizes, the distribution of outcomes, feasible randomization schemes, and many other features are drawn directly from past research or based on a review of past studies.

Even so, we remain uncertain about these values. One reason for the uncertainty is that our research context and inquiries often differ subtly from previous work. Even when replicating an existing study as closely as possible, difficult-to-intuit features of the research setting may have serious consequences for the design. Moreover, our uncertainty about a design parameter is often the very reason for conducting a study. We run experiments because we are uncertain about the average treatment effect. If we knew the ATE for sure, there would be no need to run the study. Frustratingly, we always have to design using parameters whose values we are unsure of.

The main goal of pilot studies is to reduce this uncertainty over the possible models in M so that the main study can be designed, taking into account design parameters closer to the true values. Pilots take many forms: focus groups to learn how to ask survey questions, small-scale tests of measurement tools, even miniature versions of the main study on a smaller scale. We want to learn things like the distribution of outcomes, how covariates and outcomes might be correlated, or how feasible the assignment, sampling, and measurement strategies are.

Almost by definition, pilot studies are inferentially weaker than main studies. We turn to them in response to constraints on our time, money, and capacity. If we were not constrained, we would run a first full-size study, learn what is wrong with our design, then run a corrected full-size study. Since running multiple full studies is too expensive or otherwise infeasible, we run either smaller mini-studies or test out only a subset of the elements of our planned design. Accordingly, the diagnosands of a pilot design will not measure up to those of the main design. Pilots have much lower statistical power and may suffer from higher measurement error and less generalizability. For this reason, the goal of pilot studies should not be to obtain a preliminary answer to the main inquiry, but instead to learn the information that will make the main study a success.

Like main studies, pilot studies can be declared and diagnosed – but importantly, the diagnosands for main and pilot studies need not be the same. Statistical power for an average treatment effect may be an essential diagnosand for the main study, but owing to their small size, power for pilot studies will typically be abysmal. Pilot studies should be diagnosed with respect to the decisions they imply for the main study.

Figure 21.2 shows the relationship between effect size and the sample size required to achieve 80% statistical power for a two-arm trial using simple random assignment. Uncertainty about the true effect size has enormous design consequences. If the effect size is 0.17, we need about 1100 subjects to achieve 80% power. If it’s 0.1, we need 3200.

Figure 21.2: Minimum required sample sizes and uncertainty over effect size
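The relationship between effect size and required sample size can be approximated in closed form. The sketch below assumes effect sizes are in standard deviation units, equal-sized arms, and a two-sided test at the 0.05 level. It uses the normal approximation N = 4(z_{1-α/2} + z_{power})² / d², so its answers are close to, but not identical to, the rounded values quoted in the text.

```python
from math import ceil
from statistics import NormalDist

def required_n(effect, power=0.80, alpha=0.05):
    """Approximate total N for a two-arm trial with equal-sized arms.

    Assumes an outcome SD of 1, so `effect` is a standardized effect size,
    and relies on the normal approximation to the test statistic.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(4 * (z_alpha + z_power) ** 2 / effect ** 2)

for d in (0.3, 0.17, 0.1):
    print(f"effect size {d}: about {required_n(d)} subjects")
```

Because the required sample size scales with 1/d², cutting the assumed effect size to a third multiplies the required sample by nine, which is why uncertainty over effect size matters so much.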

Suppose we have prior beliefs about the effect size that can be summarized as a normal distribution centered at 0.3 with a standard deviation of 0.1, as in the bottom panel of Figure 21.2. We could choose a design that corresponds to this best guess, the average of our prior belief distribution. If the true effect size is 0.3, then a study with 350 subjects will have 80% power.

However, redesigning the study to optimize for the “best guess” is risky because the true effect could be much smaller than 0.3. Suppose we adopt the redesign heuristic of powering the study for an effect size at the 10th percentile of our prior belief distribution, which works out here to be an effect size of 0.17. Following this rule, we would select a design with 1100 subjects.

Now suppose the true effect size is, in actuality, only 0.1, so we would need to sample 3200 subjects for 80% power. The power of our chosen 1100-subject design is a mere 38%. Here we see the consequences of having incorrect prior beliefs: our ex-ante guess of the effect size was too optimistic. Even taking what we thought of as a conservative choice – the 10th percentile redesign heuristic – we ended up with too small a study.
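The two numbers in this example, the 10th-percentile effect size and the realized power, can be checked with a few lines. The sketch assumes standardized effect sizes, equal-sized arms, and a normal approximation that ignores the negligible chance of rejecting in the wrong direction.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def power_two_arm(n, effect, alpha=0.05):
    """Approximate power with n/2 subjects per arm and outcome SD = 1."""
    se = 2 / sqrt(n)
    return Z.cdf(effect / se - Z.inv_cdf(1 - alpha / 2))

# Prior beliefs about the effect size, as described in the text.
prior = NormalDist(mu=0.3, sigma=0.1)
effect_p10 = prior.inv_cdf(0.10)

# Power of the 1100-subject design if the true effect is only 0.1.
realized_power = power_two_arm(1100, effect=0.1)

print(f"10th percentile of the prior: {effect_p10:.2f}")
print(f"Power of the 1100-subject design at a true effect of 0.1: {realized_power:.2f}")
```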

A pilot study can help researchers update their priors about important design parameters. If we do a small-scale pilot with 100 subjects, we’ll get a noisy but unbiased estimate of the true effect size. We can update prior beliefs by taking a precision-weighted average of our priors and the estimate from the pilot, where the weights are the inverse of the variance of each guess. Our posterior beliefs will be closer to the truth, and our posterior uncertainty will be smaller. If we then follow the heuristic of powering for the 10th percentile of our (now posterior) beliefs about effect size, we will have come closer to correctly powering our study. Figure 21.3 shows how large the studies would be, depending on how the pilot study came out if we were to follow the 10th percentile decision rule. On average, the pilot leads us to design the main study with 1800 subjects, sometimes more and sometimes less.

This exercise reveals that a pilot study can be quite valuable. Without a pilot study, we would choose to sample 1100 subjects, but since the true effect size is only 0.1 (not our best guess of 0.3), the experiment would be underpowered. The pilot study helps us correct our diffuse and incorrect prior beliefs. However, since the pilot is small, we don’t update our priors all the way to the truth. We still end up with a main study that is on average too small (1800), with a corresponding power of 56%. That said, a 56% chance of finding a statistically significant result is better than a 38% chance.
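The precision-weighted update and the resulting sample size choice can be sketched for a few possible pilot outcomes. This is an illustration, not the simulation behind Figure 21.3: it assumes a pilot standard error of 2/√100 = 0.2 (equal arms, outcome SD of 1), normal prior and sampling distributions, and the normal-approximation sample size formula, so it will not reproduce the averages reported in the text exactly.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()

def update(prior_mean, prior_sd, estimate, std_error):
    """Precision-weighted average of the prior and the pilot estimate."""
    w_prior, w_pilot = 1 / prior_sd ** 2, 1 / std_error ** 2
    post_mean = (w_prior * prior_mean + w_pilot * estimate) / (w_prior + w_pilot)
    post_sd = sqrt(1 / (w_prior + w_pilot))
    return post_mean, post_sd

def required_n(effect, power=0.80, alpha=0.05):
    """Approximate total N for a two-arm trial (equal arms, outcome SD = 1)."""
    return ceil(4 * (Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)) ** 2 / effect ** 2)

def post_pilot_n(pilot_estimate, pilot_n=100, prior_mean=0.3, prior_sd=0.1):
    """Apply the 10th-percentile rule to the posterior after a pilot."""
    pilot_se = 2 / sqrt(pilot_n)
    post_mean, post_sd = update(prior_mean, prior_sd, pilot_estimate, pilot_se)
    effect_p10 = NormalDist(post_mean, post_sd).inv_cdf(0.10)
    # The rule breaks down if the 10th percentile is at or below zero.
    return required_n(effect_p10) if effect_p10 > 0 else None

for estimate in (0.0, 0.1, 0.3):
    print(f"pilot estimate {estimate}: main study N = {post_pilot_n(estimate)}")
```

Because the pilot is small, its estimate receives only one-fifth of the weight (25 versus 100 in precision terms), which is why the posterior moves only part of the way from the prior toward the truth.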

Figure 21.3: Distribution of post-pilot sample size choices

In summary, pilots are most useful when we are uncertain – or outright wrong – about important design parameters. This uncertainty can often be shrunk by quite a bit without running pilot studies by meta-analyzing past empirical studies. Some things are hard to learn by reading others’ work; pilot studies are especially useful tools for learning about those things.

21.6 Criticism

A vital part of the research design process is gathering criticism and feedback from others. Timing is delicate here. Asking for comments on an underdeveloped project can sometimes lead to brainstorming sessions about what research questions one might look into. Such unstructured sessions can be quite useful but essentially restart the research design lifecycle from the beginning. Sharing work only after a full draft has been produced is worse since the data strategy will have already yielded the realized data. The investigators may have become attached to favored answer strategies and interpretations. While critics can always suggest changes to I and A post-data collection, an almost-finished project is fundamentally constrained by the data strategy as it was implemented.

The best moments to seek advice come before registering preanalysis plans or, if not writing a PAP, before implementing major data strategy elements. The point is not to seek advice exclusively on sampling, assignment, or measurement procedures; the important thing is that there’s still time to modify those design elements (Principle 3.1). Feedback about the design as a whole can inform changes to the data strategy before it is set in stone.

Feedback will come in many forms. Sometimes the comments are directly about diagnosands. The critic may think the design has too many arms and won’t be well-powered for many inquiries. Or they may be concerned about bias due to excludability violations or selection issues. These comments are especially useful because they can easily be incorporated in design diagnosis and redesign exercises.

Other comments are harder to pin down. A fruitful exercise in such cases is to understand how the criticism fits into M, I, D, and A. Comments like, “I’m concerned about external validity here” might seem to be about the data strategy. If the units were not randomly sampled from some well-specified population, we can’t generalize from the sample to the population. But if the inquiry is not actually a population quantity, then this inability to use sample data to estimate a population quantity is irrelevant. The question then becomes whether knowing the answer to your sample inquiry helps make theoretical progress or whether we need to generalize – to switch the inquiry to the population quantity – to make headway. Critics will not usually be specific about how their criticism relates to each element of design, so it is up to the criticism-seeker to understand the implications for design.

Sometimes we seek feedback from smart people, but they do not immediately understand the design setting. If the critic hasn’t absorbed or taken into account important features of the design, their recommendations and amendments may be off-base. For this reason, it’s important to communicate the design features – the model, inquiry, data strategy, and answer strategy – at a high enough level of detail that the critic is up to speed before passing judgment.

21.7 Preanalysis Plan

In many research communities, it is becoming standard practice to publicly register a preanalysis plan (PAP) before implementing some or all of the data strategy. PAPs serve many functions, but most importantly, they clarify which design choices were made before data collection and which were made after. Sometimes – perhaps every time! – when we conduct a research study, aspects of M, I, D, and A shift along the way. A concern is that they shift in ways that invalidate the apparent conclusions of the study. For example, “p-hacking” is the shady practice of trying out many regression specifications until the p-value associated with an important test attains statistical significance. PAPs protect researchers by communicating to skeptics when design decisions were made. If the regression specification was detailed in a PAP posted before any data were collected, the test could not be the result of a p-hack.

PAPs are sometimes misinterpreted as a binding commitment to report all pre-registered analyses and nothing but. This view is unrealistic and unnecessarily rigid. While we think that researchers should report all pre-registered analyses somewhere (see Section 22.2 on “populated PAPs”), study writeups inevitably deviate in some way from the PAP – and that’s a good thing. Researchers learn more by conducting research. This learning can and should be reflected in the finalized answer strategy.

Our hunch is that the main consequence of actually writing a PAP is improving the research design itself. Just as research design declaration forces us to think through the details of our model, inquiry, data strategy, and answer strategy, describing those choices in a publicly posted document surely causes deeper reflection about the design. In this way, the main audience for a PAP is the study authors themselves.

What belongs in a PAP? Recommendations for the set of decisions that should be specified in a PAP remain remarkably unclear and inconsistent across research communities. PAP templates and checklists are proliferating, and the number of items they suggest ranges from nine to sixty. PAPs themselves are becoming longer and more detailed. Some in the American Economic Association and Evidence in Governance and Politics (EGAP) study registries reach hundreds of pages as researchers seek to be ever more comprehensive. Some registries emphasize the registration of the hypotheses to be tested, while others emphasize the registration of the tests that will be used. In a review of many PAPs, Ofosu and Posner (2021) find considerable variation in how often analytically-relevant pieces of information appear in posted plans.

In our view, a PAP should center on a design declaration. Currently, most PAPs focus on the answer strategy A: what estimator to use, what covariates to condition on, and what subsets of the data to include. But of course, we also need to know the details of the data strategy D: how units will be sampled, how treatments will be assigned, and how the outcomes will be measured. We need these details to assess the properties of the design and gauge whether the principles of analysis respecting sampling, treatment assignment, and measurement procedures are being followed. We need to know about the inquiry I because we need to know the target of inference. A significant concern is “outcome switching,” wherein the eventual report focuses on different outcomes than initially intended. When we switch outcomes, we switch inquiries! We need enough of the model M in the plan to describe I in sufficient detail. In short, a design declaration is what belongs in a PAP because a design declaration specifies all of the analytically-relevant design decisions.

In addition to a design declaration, a PAP should include mock analyses conducted on simulated data. If the design declaration is made formally in code, creating simulated data that resemble the eventually realized data is straightforward. We think researchers should run their answer strategy on the mock data, creating mock figures and tables that will ultimately be made with real data. In our experience, this is the step that really causes researchers to think hard about all aspects of their design.
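As a rough illustration of what a mock analysis involves, the base R sketch below simulates a small dataset and runs a planned regression on it. The variable names, sample size, and effect sizes are all hypothetical placeholders, not taken from any actual study.

```r
# Minimal mock-analysis sketch in base R.
# All names and numbers here are hypothetical placeholders.
set.seed(42)

n <- 200

# Simulated "mock" data standing in for the data the study will collect
mock <- data.frame(
  Z = rbinom(n, 1, 0.5),  # hypothetical binary treatment
  X = rnorm(n)            # hypothetical pre-treatment covariate
)
mock$Y <- 0.3 * mock$Z + 0.5 * mock$X + rnorm(n)  # assumed outcome model

# Run the planned answer strategy on the mock data
fit <- lm(Y ~ Z + X, data = mock)

# Mock regression table with the same layout the real table will have
mock_table <- round(summary(fit)$coefficients, 3)
print(mock_table)
```

Once real data arrive, only the data frame needs to be swapped out; the analysis code and table layout stay as planned.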

PAPs can, optionally, include design diagnoses in addition to declarations, since it can be informative to describe why a particular design was chosen. For this reason, a PAP might include estimates of diagnosands like power, root-mean-squared-error, or bias. If a researcher writes in a PAP that the power to detect a very small effect is large, then if the study comes back null, the eventual writeup can much more credibly rule out “low power” as an explanation for the null.
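To make the diagnosands concrete, here is a minimal Monte Carlo sketch in base R for a hypothetical two-arm experiment. The sample size, true effect, and outcome model are assumptions chosen only for illustration; a formal design declaration would let dedicated diagnosis tools do this work instead.

```r
# Monte Carlo estimates of bias, RMSE, and power for a hypothetical
# two-arm design. All parameter values are illustrative assumptions.
set.seed(1)

true_effect <- 0.2

simulate_once <- function(n = 100, effect = true_effect) {
  Z <- sample(rep(c(0, 1), each = n / 2))  # complete random assignment
  Y <- effect * Z + rnorm(n)               # assumed outcome model
  fit <- summary(lm(Y ~ Z))$coefficients
  c(estimate = fit["Z", "Estimate"], p_value = fit["Z", "Pr(>|t|)"])
}

# One row per simulated study
sims <- t(replicate(500, simulate_once()))

diagnosands <- c(
  bias  = mean(sims[, "estimate"] - true_effect),
  rmse  = sqrt(mean((sims[, "estimate"] - true_effect)^2)),
  power = mean(sims[, "p_value"] <= 0.05)
)
print(round(diagnosands, 3))
```

Reporting numbers like these in the PAP documents, before any data are seen, what the design could and could not be expected to detect.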

21.7.1 Example

In this section, we provide an example of how to supplement a PAP with a design declaration. We follow the actual PAP for Bonilla and Tillery (2020), which was posted to the As Predicted registry. The study’s goal is to estimate the causal effects of alternative framings of Black Lives Matter (BLM) on support for the movement among Black Americans overall and among subsets of the Black community. These study authors are models of research transparency: they prominently link to the PAP in the published article, they conduct no non-preregistered analyses except those requested during the review process, and their replication archive includes all materials required to confirm their analyses, all of which we were able to reproduce exactly with minimal effort. Our goal with this section is to show how design declaration can supplement and complement existing planning practices.

Model

The authors write in their PAP:

We hypothesize that: H1: Black Nationalist frames of the BLM movement will increase perceived effectiveness of BLM among African American test subjects. H2: Feminist frames of the BLM movement will increase perceived effectiveness of BLM among African American women, but decrease perceived effectiveness in male subjects. H3: LGBTQ and Intersectional frames of the BLM movement will have no effect (or a demobilizing effect) on the perceived effectiveness of BLM African American subjects.

These hypotheses reflect a model of coalition politics that emphasizes the tensions induced by overlapping group identities. Framing the BLM movement as feminist or pro-LGBTQ may increase support among Black women or Black LGBTQ identifiers, but that increase may come at the expense of support among Black men or Black Americans who do not identify as LGBTQ. Similarly, this model predicts that subjects with stronger attachment to their Black identity will have a larger response to a Black nationalist framing of BLM than those with weaker attachments.

The model also includes beliefs about the distributions of gender, LGBTQ status, and Black identity strength. In the data strategy, Black identity was measured with the standard linked fate measure. Other background characteristics that may be correlated with BLM support include age, religiosity, income, education, and familiarity with the movement, so these are included in M as well.

The study’s focus will be on the causal effects of nationalism, feminism, and intersectional frames relative to a general description of the Black Lives Matter movement. Model beliefs about treatment effect heterogeneity are embedded in the model declaration. The effect of the nationalism treatment is hypothesized to be stronger, the greater subjects’ sense of linked fate; the effect of the feminism treatment should be negative for men but positive for women; the effect of the intersectionality treatment should be positive for LGBTQ identifiers, but negative for non-identifiers.

library(DeclareDesign)

# Rescale a numeric vector to the unit interval
rescale <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Cut a latent score into a five-point Likert response
likert_cut <- function(x) {
  as.numeric(cut(x, breaks = c(-100, 0.1, 0.3, 0.6, 0.8, 100), labels = 1:5))
}

model <- 
  declare_model(
    N = 800,
    female = rbinom(N, 1, prob = 0.51),
    lgbtq = rbinom(N, 1, prob = 0.05),
    linked_fate = sample(1:5, N, replace = TRUE, 
                         prob = c(0.05, 0.05, 0.15, 0.25, 0.5)),
    age = sample(18:80, N, replace = TRUE),
    religiosity = sample(1:6, N, replace = TRUE),
    income = sample(1:12, N, replace = TRUE),
    college = rbinom(N, 1, prob = 0.5),
    blm_familiarity = sample(1:4, N, replace = TRUE),
    U = runif(N),
    blm_support_latent = rescale(
      U + 0.1 * blm_familiarity + 
        0.45 * linked_fate + 
        0.001 * age + 
        0.25 * lgbtq + 
        0.01 * income + 
        0.1 * college + 
        -0.1 * religiosity),
    # potential outcomes
    # baseline framing: latent support with no treatment shift
    blm_support_Z_general = 
      likert_cut(blm_support_latent),
    blm_support_Z_nationalism = 
      likert_cut(blm_support_latent + 0.01 + 
                   0.01 * linked_fate + 
                   0.01 * blm_familiarity),
    blm_support_Z_feminism = 
      likert_cut(blm_support_latent - 0.02 + 
                   0.07 * female + 
                   0.01 * blm_familiarity),
    blm_support_Z_intersectional = 
      likert_cut(blm_support_latent - 0.05 + 
                   0.15 * lgbtq + 
                   0.01 * blm_familiarity)
  )

Inquiry

The inquiries for this study naturally include the average effects of all three treatments relative to the “general” framing, as well as the differences in average effects for subgroups. When describing their planned analyses, the authors write:

We will also look at differences in responses between those indicating a pre-treatment familiarity BLM (4-Extensive knowledge to 1-Never heard of BLM), gender (particularly on the Feminist treatment), linked fate (particularly on the Nationalist treatment), and LGBT+ affiliation (particularly on the LGBT+ treatment), though we are not necessarily expecting these moderations to have a strong effect because samples may lack adequate representation.

In the code below, we specify how each treatment effect changes with its corresponding covariate \(X\) using \(\frac{\mathrm{cov}(\tau_i, X)}{\mathbb{V}(X)}\). For the binary covariates (female and lgbtq), this quantity is identical to the difference-in-differences. For linked_fate and blm_familiarity, which we treat as quasi-continuous here, it is the slope of the best linear predictor of how the effect changes over the range of the covariate.
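As an aside before the declaration code, the equivalence for binary covariates can be checked numerically. The base R sketch below uses made-up unit-level effects (the variable `tau` and its parameters are hypothetical) and compares the covariance ratio with the difference in subgroup average effects.

```r
# Numerical check: for a binary covariate, cov(tau, x) / var(x) equals
# the difference in average effects between the two subgroups.
# The simulated effects below are hypothetical.
set.seed(7)

slope <- function(y, x) cov(y, x) / var(x)

n <- 1000
female <- rbinom(n, 1, 0.5)                     # binary covariate
tau <- 0.1 + 0.4 * female + rnorm(n, sd = 0.2)  # unit-level effects

# Difference in average effects between the two subgroups
did <- mean(tau[female == 1]) - mean(tau[female == 0])

# The two quantities agree up to floating-point error
print(c(slope = slope(tau, female), did = did))
```

The agreement is exact because the slope from a regression on a single binary variable is the difference in group means.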

# Slope of the best linear predictor of y given x
slope <- function(y, x) { cov(y, x) / var(x) }

inquiry <-  
  declare_inquiries(
    # Average effects
    ATE_nationalism = 
      mean(blm_support_Z_nationalism - blm_support_Z_general),
    ATE_feminism = 
      mean(blm_support_Z_feminism - blm_support_Z_general),
    ATE_intersectional = 
      mean(blm_support_Z_intersectional - blm_support_Z_general),
    # Overall heterogeneity w.r.t. blm_familiarity
    DID_nationalism_familiarity = 
      slope(blm_support_Z_nationalism - blm_support_Z_general, 
            blm_familiarity),
    DID_feminism_familiarity = 
      slope(blm_support_Z_feminism - blm_support_Z_general, 
            blm_familiarity),
    DID_intersectional_familiarity = 
      slope(blm_support_Z_intersectional - blm_support_Z_general, 
            blm_familiarity),
    # Treatment-specific heterogeneity
    DID_nationalism_linked_fate = 
      slope(blm_support_Z_nationalism - blm_support_Z_general, 
            linked_fate),
    DID_feminism_gender = 
      slope(blm_support_Z_feminism - blm_support_Z_general,
            female),
    DID_intersectional_lgbtq = 
      slope(blm_support_Z_intersectional - blm_support_Z_general, 
            lgbtq)
  )

Data strategy

This study’s subjects are 800 Black Americans recruited by the survey firm Qualtrics using a quota sampling procedure. We omit this sampling step in our declaration: 800 subjects are described in the model declaration above. The reason is that, as is common practice in the analysis of survey experiments on convenience samples, the authors do not formally extrapolate from their data to make generalizations about the population of Black Americans. The inquiries they study are sample average effects. If the authors had used a different sampling strategy, such as using random sampling through random digit dialing, we would have defined the population from which they were sampling and the random sampling procedure.

After subjects’ background characteristics were measured, they were assigned to one of four treatment conditions. Since the survey was conducted on Qualtrics, we assume that the authors used the built-in randomization tools, which use simple (Bernoulli) random assignment.
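To see what simple random assignment implies for group sizes, the base R sketch below contrasts it with complete random assignment. It is only an illustration of the two procedures and makes no claim about how the survey platform implements randomization internally.

```r
# Simple vs. complete random assignment to four conditions (illustrative).
set.seed(3)

conditions <- c("general", "nationalism", "feminism", "intersectional")
n <- 800

# Simple random assignment: each subject is assigned independently,
# with equal probability for each condition
Z_simple <- factor(sample(conditions, n, replace = TRUE), levels = conditions)

# Complete random assignment (for contrast): exactly n / 4 per condition
Z_complete <- factor(sample(rep(conditions, each = n / 4)), levels = conditions)

print(table(Z_simple))    # group sizes vary from draw to draw
print(table(Z_complete))  # group sizes are fixed by construction
```

Under simple random assignment the number of subjects in each arm is itself random, which is why the declaration sets the assignment procedure explicitly rather than assuming equal group sizes.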

data_strategy <- 
  declare_assignment(
    # simple (Bernoulli) random assignment to one of four framings
    Z = simple_ra(
      N = N, 
      conditions = 
        c("general", "nationalism", "feminism", "intersectional"), 
      simple = TRUE
    )
  ) + 
  declare_measurement(blm_support = reveal_outcomes(blm_support ~ Z))

Answer strategy

The authors write:

We will run an OLS regression predicting the support for, effectiveness of, and trust in BLM on each treatment condition. […] We will also look at differences in responses between those indicating a pre-treatment familiarity BLM (4-Extensive knowledge to 1-Never heard of BLM), gender (particularly on the Feminist treatment), linked fate (particularly on the Nationalist treatment), and LGBT+ affiliation (particularly on the LGBT+ treatment), though we are not necessarily expecting these moderations to have a strong effect because samples may lack adequate representation. We plan to conduct analyses without controls. As we will check for between group balance, we may also run OLS analyses with demographic controls (age, linked fate, gender, sexual orientation, religiosity, income, education, and ethnic or multi-racial backgrounds), and will report differences in OLS results.

In DeclareDesign, this corresponds to six estimators, with two shooting at the ATEs and four shooting at the differences-in-differences. We use OLS for all six. The majority of the code is bookkeeping to ensure we match the right regression coefficient with the appropriate inquiry.
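The term labels used in that bookkeeping follow R's convention for naming regression coefficients: a factor's name is pasted onto each non-reference level, and interactions are joined with a colon. The sketch below uses base R's `lm()` as a stand-in fitter on a hypothetical toy dataset with a placeholder outcome, just to show where names such as `Zfeminism:female` come from.

```r
# Where the coefficient names come from (toy data, placeholder outcome).
set.seed(5)

conditions <- c("general", "nationalism", "feminism", "intersectional")
n <- 400

toy <- data.frame(
  # "general" is the first factor level, so it is the reference category
  Z = factor(sample(conditions, n, replace = TRUE), levels = conditions),
  female = rbinom(n, 1, 0.5),
  blm_support = sample(1:5, n, replace = TRUE)
)

fit <- lm(blm_support ~ Z * female, data = toy)
coef_names <- names(coef(fit))
print(coef_names)
```

Because the reference category is absorbed into the intercept, each `Z...` coefficient is a contrast against the general framing, which is the comparison the inquiries are defined on.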

answer_strategy <-
  # Average effects, no covariate adjustment
  declare_estimator(
    blm_support ~ Z,
    term = c("Znationalism", "Zfeminism", "Zintersectional"),
    inquiry = 
      c("ATE_nationalism", "ATE_feminism", "ATE_intersectional"),
    label = "OLS") +
  # Average effects, with covariate adjustment
  declare_estimator(
    blm_support ~ Z + age + female + as.factor(linked_fate) + lgbtq,
    term = c("Znationalism", "Zfeminism", "Zintersectional"),
    inquiry = 
      c("ATE_nationalism", "ATE_feminism", "ATE_intersectional"),
    label = "OLS with controls") +
  # Heterogeneity by familiarity, all three treatments
  declare_estimator(
    blm_support ~ Z * blm_familiarity,
    term = c("Znationalism:blm_familiarity", 
             "Zfeminism:blm_familiarity", 
             "Zintersectional:blm_familiarity"),
    inquiry = c("DID_nationalism_familiarity", 
                "DID_feminism_familiarity", 
                "DID_intersectional_familiarity"),
    label = "DID_familiarity") +
  # Treatment-specific heterogeneity
  declare_estimator(
    blm_support ~ Z * linked_fate,
    term = "Znationalism:linked_fate",
    inquiry = "DID_nationalism_linked_fate",
    label = "DID_nationalism_linked_fate") +
  declare_estimator(
    blm_support ~ Z * female,
    term = "Zfeminism:female",
    inquiry = "DID_feminism_gender",
    label = "DID_feminism_gender") +
  declare_estimator(
    blm_support ~ Z * lgbtq,
    term = "Zintersectional:lgbtq",
    inquiry = "DID_intersectional_lgbtq",
    label = "DID_intersectional_lgbtq")

Mock analysis

Putting it all together, we can declare the complete design and draw mock data from it.

Declaration 21.1

design <- model + inquiry + data_strategy + answer_strategy
mock_data <- draw_data(design)


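Table 21.1: Mock analysis from Bonilla and Tillery design.

| ID | female | lgbtq | linked_fate | age | religiosity | income | college | blm_familiarity | U | blm_support_latent | blm_support_Z_general | blm_support_Z_nationalism | blm_support_Z_feminism | blm_support_Z_intersectional | Z | blm_support |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 001 | 0 | 0 | 3 | 27 | 1 | 11 | 1 | 4 | 0.976 | 0.758 | 4 | 5 | 4 | 4 | intersectional | 4 |
| 002 | 0 | 0 | 4 | 44 | 6 | 4 | 1 | 1 | 0.274 | 0.429 | 3 | 3 | 3 | 3 | feminism | 3 |
| 003 | 1 | 0 | 3 | 78 | 2 | 9 | 1 | 2 | 0.349 | 0.491 | 3 | 3 | 3 | 3 | nationalism | 3 |
| 004 | 0 | 0 | 5 | 23 | 5 | 4 | 1 | 3 | 0.929 | 0.841 | 5 | 5 | 5 | 5 | feminism | 5 |
| 005 | 0 | 0 | 4 | 69 | 5 | 1 | 0 | 4 | 0.351 | 0.540 | 3 | 4 | 3 | 3 | feminism | 3 |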

The table below shows a mock analysis of average effects (estimated with and without covariate adjustment) as well as the heterogeneous effects analyses with respect to the quasi-continuous moderators.

Mock regression table from Bonilla and Tillery design.

| | Model 1 | Model 2 | Model 3 | Model 4 |
|---|---|---|---|---|
| (Intercept) | 3.629*** | 1.283*** | 1.383*** | 3.284*** |
| | (0.059) | (0.104) | (0.146) | (0.140) |
| Znationalism | 0.339*** | 0.305*** | 0.101 | -0.056 |
| | (0.089) | (0.048) | (0.199) | (0.217) |
| Zfeminism | 0.284*** | 0.208*** | 0.543** | 0.361 |
| | (0.084) | (0.049) | (0.207) | (0.198) |
| Zintersectional | -0.073 | -0.041 | 0.169 | 0.031 |
| | (0.087) | (0.050) | (0.230) | (0.208) |
| female | | 0.071* | | |
| lgbtq | | 0.279* | | |
| age | | 0.000 | | |
| religiosity | | -0.147*** | | |
| income | | 0.013** | | |
| college | | 0.165*** | | |
| linked_fate | | 0.549*** | 0.561*** | |
| | | (0.015) | (0.035) | |
| blm_familiarity | | 0.173*** | | 0.141** |
| | | (0.017) | | (0.052) |
| Znationalism:linked_fate | | | 0.067 | |
| Zfeminism:linked_fate | | | -0.075 | |
| Zintersectional:linked_fate | | | -0.061 | |
| Znationalism:blm_familiarity | | | | 0.147 |
| Zfeminism:blm_familiarity | | | | -0.034 |
| Zintersectional:blm_familiarity | | | | -0.040 |
| R2 | 0.038 | 0.701 | 0.567 | 0.081 |
| Adj. R2 | 0.034 | 0.697 | 0.563 | 0.073 |
| Num. obs. | 800 | 800 | 800 | 800 |
| RMSE | 0.882 | 0.494 | 0.593 | 0.864 |

*** p < 0.001; ** p < 0.01; * p < 0.05

Figure 21.4: Mock coefficient plot from Bonilla and Tillery design.

Design diagnosis

Finally, while a design diagnosis is not a necessary component of a preanalysis plan, it can be useful to show readers why a particular design was chosen over others. This diagnosis indicates that the design produces unbiased estimates but is better powered for some inquiries than for others (under the above assumptions about effect size, which were our own and not the original authors’). We are well-powered for the average effects, and the power increases when we include covariate controls. The design is probably too small for most of the heterogeneous effect analyses, which is a point directly conceded in the authors’ original PAP.

Table 21.2: Design diagnosis for Bonilla and Tillery design.

| Estimand | Estimator | Bias | Power |
|---|---|---|---|
| ATE feminism | OLS | -0.002 | 0.527 |
| ATE feminism | OLS with controls | -0.003 | 0.851 |
| ATE intersectional | OLS | -0.005 | 0.174 |
| ATE intersectional | OLS with controls | -0.005 | 0.326 |
| ATE nationalism | OLS | -0.001 | 0.962 |
| ATE nationalism | OLS with controls | -0.002 | 1.000 |
| DID feminism familiarity | DID familiarity | 0.003 | 0.091 |
| DID feminism gender | DID feminism gender | 0.004 | 0.461 |
| DID intersectional familiarity | DID familiarity | -0.001 | 0.091 |
| DID intersectional lgbtq | DID intersectional lgbtq | -0.011 | 0.359 |
| DID nationalism familiarity | DID familiarity | -0.002 | 0.077 |
| DID nationalism linked fate | DID nationalism linked fate | -0.045 | 0.053 |

Further reading.