4 Software primer

This chapter serves as a brief introduction to the DeclareDesign package for the R programming language. DeclareDesign is a software implementation of every step of the declare-diagnose-redesign process. While you can declare, diagnose, and redesign using nearly any programming language, DeclareDesign is structured to make it easy to mix-and-match design elements while handling the tedious simulation bookkeeping behind the scenes.

4.1 Installing R

You can download R for free from CRAN. We also recommend the free program RStudio, which provides a friendly interface to R. Both R and RStudio are available on Windows, Mac, and Linux.

Once you have R and RStudio installed, open up RStudio and install DeclareDesign and its related packages. These include three packages that enable specific steps in the research process: fabricatr for simulating social science data, randomizr for random sampling and random assignment, and estimatr for design-based estimators. You can also install DesignLibrary, which gets standard designs up and running in one line. To install them all, copy the following code into your R console:

install.packages(c(
  "DeclareDesign",
  "fabricatr",
  "randomizr",
  "estimatr",
  "DesignLibrary"
))

We also recommend that you install and get to know the tidyverse set of packages for data analysis, which we will use throughout the book:

install.packages("tidyverse")

For introductions to R and the tidyverse we especially recommend the free resource R for Data Science.

4.2 Declaring research design elements

Designs are declared through a concatenation of design elements. Almost all elements take a dataset as an input and return a dataset as their output. We will imagine an input dataset of 100 voters in Los Angeles. The research project involves randomly assigning voters to receive (or not) a knock on their door from a canvasser. Our data look like this:

Table 4.1: Example data
ID age sex party precinct
001 66 M REP 9104
002 54 F DEM 8029
003 18 M GRN 8383
004 42 F DEM 2048
005 27 M REP 5210

The data strategy is a function that takes this dataset, implements a random assignment, adds it to the dataset, and then returns the resulting dataset.

You could write your own function to do that, but you can also use one of the declare_* functions in DeclareDesign. Each one of these functions is a kind of “function factory”: it takes a set of parameters about your research design as inputs, and returns a function as its output.

Here is an example of a declare_assignment element:

simple_random_assignment <- 
  declare_assignment(Z = simple_ra(N = N, prob = 0.6))

The big idea here is that the object we created, simple_random_assignment, is not a particular assignment. Instead, it is a function that conducts assignment when called (see Principle 3.6). You can run the function on data:

simple_random_assignment(voter_file) 
Table 4.2: Data output following implementation of an assignment step.
ID age sex party precinct Z
001 66 M REP 9104 1
002 54 F DEM 8029 1
003 18 M GRN 8383 0
004 42 F DEM 2048 1
005 27 M REP 5210 0

We want to emphasize that most steps are “dataset-in, dataset-out” functions. The simple_random_assignment function took the voter_file dataset and returned a dataset with assignment information appended.

Every step in a research design can be declared using one of the declare_* functions. Table 4.3 collects these according to the four elements of a research design. Below, we walk through common uses of each of these declaration functions.

Table 4.3: Declaration functions in DeclareDesign
Design component Function Description
Model declare_model() background variables and potential outcomes
Inquiry declare_inquiry() research questions
Data strategy declare_sampling() sampling procedures
declare_assignment() assignment procedures
declare_measurement() measurement procedures
Answer strategy declare_estimator() estimation procedures
declare_test() testing procedures

4.2.1 Model

The model defines the number of units under study, their background characteristics, the latent outcomes we want to measure, and their potential outcomes. We can define the model in several ways. In some cases, you may start a design with data on the units you wish to study. When that happens, we may not need to simulate all parts of the model. We can start declaring the model with existing data.

declare_model(data = voter_file)
Table 4.4: Draw from a fixed population
ID age sex party precinct
001 66 M REP 9104
002 54 F DEM 8029
003 18 M GRN 8383
004 42 F DEM 2048
005 27 M REP 5210

We typically need to simulate part or all of the model. Even when we have background data, we don;t have access to the latent outcomes or potential outcomes that are needed to define many kinds of causal and descriptive inquiries (see Principle 3.5).

We can use the data simulation functions from the fabricatr package to simulate when we do not have complete data on the units under study. For instance, we can declare a model that generates a dataset with 100 units and a random variable U:

declare_model(N = 100, U = rnorm(N))

When we run this model function, we will get a different 100-unit dataset each time, as shown in Table 4.5.

Table 4.5: Five draws from the model
Draw 1
Draw 2
Draw 3
Draw 4
Draw 5
ID U ID U ID U ID U ID U
001 0.377 001 1.369 001 1.556 001 -2.530 001 1.459
002 -1.310 002 -0.058 002 -1.327 002 0.243 002 0.409
003 0.078 003 0.449 003 -0.430 003 -1.596 003 -0.692
004 -0.795 004 1.077 004 0.814 004 0.076 004 0.037
005 1.766 005 0.186 005 0.035 005 1.590 005 -0.619

Defining potential outcomes is as easy as a single expression per potential outcome. Potential outcomes may depend on background characteristics or other potential outcomes.

declare_model(
  N = 100,
  U = rnorm(N),
  Y_Z_0 = U, 
  Y_Z_1 = Y_Z_0 + 0.25
)
Table 4.6: Adding potential outcomes to the model
ID U Y_Z_0 Y_Z_1
001 0.521 0.521 0.771
002 1.990 1.990 2.240
003 -0.952 -0.952 -0.702
004 0.626 0.626 0.876
005 0.733 0.733 0.983

We also provide an alternative interface for defining potential outcomes that uses R’s formula syntax with the potential_outcomes function. The formula syntax lets you specify “regression-like” outcome equations. One downside is that it mildly obscures how the names of the eventual potential outcomes columns are named. We build the names using the outcome name (here Y on the left-hand side of the formula) and the name of the assignment variable from the variable name in the conditions argument (here Z). We also defined the two values Z takes on, 0 and 1 — so the two potential outcomes columns will be named Y_Z_0 and Y_Z_1.

declare_model(
  N = 100,
  U = rnorm(N),
  potential_outcomes(Y ~ 0.25 * Z + U, conditions = list(Z = c(0, 1)))
)

Either way of creating potential outcomes works; one may be easier or harder to code up in a given research design setting.

4.2.2 Inquiry

To define the inquiry, we declare a summary function of the events generated by the model. We can declare the “population average treatment effect” inquiry as the average difference in the two variables created by the model above.

declare_inquiry(PATE = mean(Y_Z_1 - Y_Z_0))

4.2.3 Data strategy

The data strategy constitutes one or more steps representing interventions the researcher makes in the world from sampling to treatment assignment to measurement.

4.2.3.1 Sampling

The sampling step typically involves drawing a random sample of units and then filtering to the sampled units, dropping the unsampled units. You can use the complete_rs function from the randomizr package to conduct the sampling. See Section 8.1.1 for an overview of the many kinds of sampling that are possible with randomizr. The second step is accomplished with the filter argument, which by default retains units for which S == 1. Here, we draw a complete random sample of 50 units from the population:

declare_sampling(S = complete_rs(N, n = 50), filter = S == 1)

When we draw data from our simple design at this point, it will have fewer rows. It will have shrunk from 100 units in the population to a data frame of 50 sampled units.

Table 4.7: Sampled data
ID U Y_Z_0 Y_Z_1 S
1 001 -1.664 -1.664 -1.414 1
2 002 -1.355 -1.355 -1.105 1
3 003 1.025 1.025 1.275 1
6 006 -0.885 -0.885 -0.635 1
9 009 -0.660 -0.660 -0.410 1

4.2.3.2 Treatment assignment

In experimental studies, units are assigned to one of two or more treatment conditions. The randomizr package provides functions for randomly assigning units. Here we use complete random assignment with probability 0.5:

declare_assignment(Z = complete_ra(N, prob = 0.5))

After treatments are assigned, some potential outcomes are revealed. Treated units reveal their treated potential outcomes and untreated units reveal their untreated potential outcomes. In most declarations, you need a measurement step to reveal measured outcomes. The reveal_outcomes function performs this “switching” operation, so called because the function “switches” which potential outcome is revealed depending on the value of the random assignment.

declare_measurement(Y = reveal_outcomes(Y ~ Z))
Table 4.8: Sampled data with assignment indicator
ID U Y_Z_0 Y_Z_1 S Z Y
001 0.669 0.669 0.919 1 0 0.669
003 -0.251 -0.251 -0.001 1 1 -0.001
004 -0.724 -0.724 -0.474 1 0 -0.724
006 0.399 0.399 0.649 1 1 0.649
010 0.619 0.619 0.869 1 0 0.619

4.2.3.3 Measurement

Measurement is how we translate latent events into observed data. For example, we might imagine that the normally distributed outcome variable Y is a latent outcome that will be translated into a binary outcome when measured by the researcher:

declare_measurement(Y_binary = rbinom(N, 1, prob = pnorm(Y)))
Table 4.9: Sampled data with an explicitly measured outcome
ID U Y_Z_0 Y_Z_1 S Z Y Y_binary
005 1.978 1.978 2.228 1 1 2.228 1
017 -0.603 -0.603 -0.353 1 1 -0.353 1
018 -1.296 -1.296 -1.046 1 0 -1.296 0
020 0.213 0.213 0.463 1 0 0.213 1
024 0.797 0.797 1.047 1 1 1.047 1

4.2.4 Answer strategy

We declare answer strategy steps using the declare_estimator function, which plays nicely with the many statistical modeling functions available in R, such as lm, glm, or the ictreg function from the list package, among hundreds of others. Throughout the book, we will be using many estimators from estimatr (like lm_robust and difference_in_means) because they are fast and calculate robust standard errors easily.

Estimators are associated with inquires. Here, we target the population average treatment effect with the difference-in-means estimator.

declare_estimator(
  Y ~ Z, model = difference_in_means, inquiry = "PATE"
)

The output from a modeling function is a complicated model fit object that contains large amounts of information. We typically only want a few summary pieces of information out of these objects, like the coefficient estimates, standard errors, and confidence intervals. We use model summary functions passed to the model_summary argument of declare_estimator to do so. Model summary functions take model fits as inputs and return answers as data frames.

The default model summary function is tidy:

declare_estimator(
  Y ~ Z, model = lm_robust, model_summary = tidy
)

4.3 Building a design from design elements

We now declare all the individual design elements in one go.

model <- 
  declare_model(
    N = 100,
    U = rnorm(N),
    potential_outcomes(Y ~ 0.25 * Z + U)
  )

inquiry <- 
  declare_inquiry(PATE = mean(Y_Z_1 - Y_Z_0)) 

sampling <- 
  declare_sampling(S = complete_rs(N, n = 50)) 

assignment <- 
  declare_assignment(Z = complete_ra(N, prob = 0.5))

measurement <- 
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) 

answer_strategy <- 
  declare_estimator(
    Y ~ Z, model = difference_in_means, inquiry = "PATE"
  )

To construct a research design object that we can operate on — diagnose it, redesign it, draw data from it, etc. — we add them together with the + operator.

design <- 
  model + 
  inquiry + 
  sampling + assignment + measurement + 
  answer_strategy

We usually declare designs more compactly, concatenating steps directly with +. Declaration 4.1 shows the format of most declarations throughout the book.

Declaration 4.1 Two-arm randomized experiment

design <- 
  declare_model(N = 100, U = rnorm(N),
                potential_outcomes(Y ~ 0.25 * Z + U)) +
  declare_inquiry(PATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_sampling(S = complete_rs(N, n = 50)) +
  declare_assignment(Z = complete_ra(N, prob = 0.5)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(
    Y ~ Z, model = difference_in_means, inquiry = "PATE"
  )

\(~\)

Order matters in declaring designs. We can think of the order of the declaration as the temporal order in which steps take place. Below, since the inquiry comes before sampling and assignment, the inquiry is a population inquiry, the population average treatment effect.

model +
  declare_inquiry(PATE = mean(Y_Z_1 - Y_Z_0)) +
  sampling +
  assignment +
  measurement +
  answer_strategy

We could instead define our inquiry as a sample average treatment effect by putting the inquiry after sampling:

model +
  sampling +
  declare_inquiry(SATE = mean(Y_Z_1 - Y_Z_0)) +
  assignment +
  measurement + 
  answer_strategy

4.4 Simulating a research design

Diagnosing a research design — learning about its properties — requires first simulating by running the design over and over. We need to simulate the event generating process, calculate the values of the inquiries, then draw simulated data and calculate the resulting estimates. To draw simulated data, we use draw_data:

draw_data(design)
Table 4.10: Simulated data draw
ID U Y_Z_0 Y_Z_1 S Z Y
001 0.214 0.214 0.464 1 1 0.464
004 1.140 1.140 1.390 1 0 1.140
011 -0.136 -0.136 0.114 1 1 0.114
012 -0.987 -0.987 -0.737 1 0 -0.987
014 -0.795 -0.795 -0.545 1 0 -0.795

draw_data runs all of the “data steps” in a design, which are both from the model and from the data strategy (sampling, assignment, and measurement).

To simulate the estimands from a single run of the design, we use draw_estimands. This function runs two operations at once: it draws the events, and calculates the estimands at the point defined by the design.

draw_estimands(design)
Table 4.11: Estimands calculated from simulated data.
inquiry estimand
PATE 0.25

Similarly, we can draw the estimates from a single run with draw_estimates, which simulates data and, at the appropriate moment, calculates estimates.

draw_estimates(design)
Table 4.12: Estimates calculated from simulated data.
term estimate std.error statistic p.value conf.low conf.high df outcome inquiry
Z 0.132 0.273 0.482 0.632 -0.418 0.681 47.837 Y PATE

To simulate whole designs, we use the simulate_design function to draw data and calculate estimands and estimates many times in a row (500 times by default).

simulate_design(design)
Table 4.13: Simulations data frame.
sim_ID estimand estimate std.error statistic p.value conf.low conf.high df
1 0.25 0.248 0.285 0.872 0.388 -0.324 0.821 47.918
2 0.25 -0.182 0.308 -0.592 0.557 -0.804 0.439 42.897
3 0.25 0.410 0.294 1.396 0.170 -0.181 1.002 46.011
4 0.25 0.172 0.234 0.732 0.468 -0.300 0.643 47.882
5 0.25 0.710 0.321 2.210 0.032 0.063 1.357 45.143

4.5 Diagnosing a research design

Using the simulations data frame, we can calculate diagnosands like bias, root mean-squared-error, and power for each estimator-inquiry pair. In DeclareDesign, we do this in two steps. First, we declare diagnosands, which are functions that summarize the building blocks of diagnosands, diagnostic statistics. The software includes many pre-coded diagnosands (see Section 10), though you can write your own like this:

study_diagnosands <-
  declare_diagnosands(
    bias = mean(estimate - estimand),
    rmse = sqrt(mean((estimate - estimand) ^ 2)),
    power = mean(p.value <= 0.05)
  )

Diagnosands are summaries of the simulations data frame. The bias diagnosand first calculates the difference between estimate and estimand, and then takes the average.

Next, we apply your diagnosand declaration to the simulations data frame with the diagnose_design function:

diagnose_design(simulation_df, diagnosands = study_diagnosands)
Table 4.14: Design diagnosis.
Bias RMSE Power
-0.00 0.28 0.14
(0.01) (0.01) (0.01)

We can also do this in a single step by sending a design object directly to diagnose_design. The function will first run the simulations, then calculate the diagnosands.

diagnose_design(design, diagnosands = study_diagnosands)

4.6 Redesign

We redesign to learn how the diagnosands change as design features change. We can do this using redesign over a range of sample sizes, resulting in a list of designs.

designs <- redesign(design, N = c(100, 200, 300, 400, 500))

Our simulation and diagnosis tools can operate directly on a list of designs:

diagnose_design(designs)

4.7 Library of designs

In our DesignLibrary package, we have created a set of common designs as designers (functions that create designs from just a few parameters), so you can get started quickly.

library(DesignLibrary)

block_cluster_design <- block_cluster_two_arm_designer(N = 1000, N_blocks = 10)

4.8 Complex declarations

We have illustrated the simple way to use DeclareDesign declarations thus far, and throughout the book the majority of declarations rely on this method. However, you can also escape the standard way of doing things at any step. Each design element has a “handler” that works behind the scenes on the bookkeeping. You can switch to your own function or to a function from another package at any time. In addition, you can skip our diagnosis tools by simply operating on the simulations data frame itself.

For example, we may want to declare an inquiry for many subgroups of the units in your population. A custom function relying on a dplyr pipeline to group data by city and calculate the by-city ATEs could be used like this:

declare_inquiry(
 handler = function(data) {
   # start with data
   data %>%
     # split the dataset by city
     group_by(city) %>%
     # estimate city-level ATEs and return as a data.frame
     summarize(city_ATE = mean(Y_Z_1 - Y_Z_0), .groups = "drop")
 }
)

Further reading

This primer is an introduction to DeclareDesign, but it only begins to scratch the surface of what we can do with the software. At the end of each section in Part II, we illustrate how to tackle interesting problems that come up during declaration, diagnosis, and redesign.

In the meantime, we recommend the following external resources for learning more.