# 4 Software primer

This chapter serves as a brief introduction to the DeclareDesign package for the R programming language. DeclareDesign is a software implementation of every step of the declare-diagnose-redesign process. While you can declare, diagnose, and redesign using nearly any programming language, DeclareDesign is structured to make it easy to mix-and-match design elements while handling the tedious simulation bookkeeping behind the scenes.

## 4.1 Installing R

You can download R for free from CRAN. We also recommend the free program RStudio, which provides a friendly interface to R. Both R and RStudio are available on Windows, Mac, and Linux.

Once you have R and RStudio installed, open up RStudio and install DeclareDesign and its related packages. These include three packages that enable specific steps in the research process: `fabricatr` for simulating social science data, `randomizr` for random sampling and random assignment, and `estimatr` for design-based estimators. You can also install `DesignLibrary`, which gets standard designs up and running in one line. To install them all, copy the following code into your R console:

```
install.packages(c(
"DeclareDesign",
"fabricatr",
"randomizr",
"estimatr",
"DesignLibrary"
))
```
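After installation, it can be worth confirming that the packages load. The check below is a minimal sketch that assumes the installation succeeded. To our understanding, attaching `DeclareDesign` also attaches `fabricatr`, `randomizr`, and `estimatr`, but treat that as an assumption to verify on your own machine.

```r
# Attach DeclareDesign; its companion packages are expected to be
# attached along with it (assumption: they are declared dependencies)
library(DeclareDesign)

# Check which companion packages are on the search path
companions <- c("fabricatr", "randomizr", "estimatr")
attached <- paste0("package:", companions) %in% search()
names(attached) <- companions
print(attached)
```

If any entry prints `FALSE`, attaching that package explicitly with `library()` should resolve it.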

We also recommend that you install and get to know the `tidyverse` set of packages for data analysis, which we will use throughout the book:

`install.packages("tidyverse")`

For introductions to R and the `tidyverse` we especially recommend the free resource R for Data Science.

## 4.2 Declaring research design elements

Designs are declared through a concatenation of design elements. Almost all elements take a dataset as an input and return a dataset as their output. We will imagine an input dataset of 100 voters in Los Angeles. The research project involves randomly assigning voters to receive (or not) a knock on their door from a canvasser. Our data look like this:

ID | age | sex | party | precinct |
---|---|---|---|---|
001 | 66 | M | REP | 9104 |
002 | 54 | F | DEM | 8029 |
003 | 18 | M | GRN | 8383 |
004 | 42 | F | DEM | 2048 |
005 | 27 | M | REP | 5210 |

The data strategy is a function that takes this dataset, implements a random assignment, adds it to the dataset, and then returns the resulting dataset.

You could write your own function to do that, but you can also use one of the `declare_*` functions in DeclareDesign. Each one of these functions is a kind of “function factory”: it takes a set of parameters about your research design as inputs, and returns a function as its output.
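To make the “function factory” idea concrete, here is a hand-written sketch in base R. The names `make_assignment` and `assign_60`, and the five-row `voter_file`, are hypothetical illustrations, not part of DeclareDesign. The package's own functions do considerably more bookkeeping.

```r
# Hypothetical factory: takes a design parameter, returns a function
make_assignment <- function(prob) {
  # The returned function is "dataset-in, dataset-out"
  function(data) {
    data$Z <- rbinom(nrow(data), size = 1, prob = prob)
    data
  }
}

# A small stand-in dataset (illustrative values only)
voter_file <- data.frame(
  ID = sprintf("%03d", 1:5),
  age = c(66, 54, 18, 42, 27)
)

# Calling the factory creates the assignment function...
assign_60 <- make_assignment(prob = 0.6)

# ...and calling that function conducts one assignment
assigned <- assign_60(voter_file)
print(assigned)
```

The factory itself never touches data. It only fixes the parameter `prob` and hands back a function that can be run repeatedly.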

Here is an example of a `declare_assignment` element:

```
simple_random_assignment <-
  declare_assignment(Z = simple_ra(N = N, prob = 0.6))
```

The big idea here is that the object we created, `simple_random_assignment`, is not a particular assignment. Instead, it is a function that conducts assignment when called (see Principle 3.6). You can run the function on data:

`simple_random_assignment(voter_file) `

ID | age | sex | party | precinct | Z |
---|---|---|---|---|---|
001 | 66 | M | REP | 9104 | 1 |
002 | 54 | F | DEM | 8029 | 1 |
003 | 18 | M | GRN | 8383 | 0 |
004 | 42 | F | DEM | 2048 | 1 |
005 | 27 | M | REP | 5210 | 0 |

We want to emphasize that most steps are “dataset-in, dataset-out” functions. The `simple_random_assignment` function took the `voter_file` dataset and returned a dataset with assignment information appended.
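The following sketch reproduces this “dataset-in, dataset-out” behavior end to end. It assumes a `voter_file` built from only the five rows displayed above (the full 100-voter file is not shown), so the result is illustrative.

```r
library(DeclareDesign)

# Stand-in for the voter file: only the five rows shown in the text
voter_file <- data.frame(
  ID = c("001", "002", "003", "004", "005"),
  age = c(66, 54, 18, 42, 27),
  sex = c("M", "F", "M", "F", "M"),
  party = c("REP", "DEM", "GRN", "DEM", "REP"),
  precinct = c(9104, 8029, 8383, 2048, 5210)
)

# Declare the assignment step: this creates a function, not an assignment
simple_random_assignment <-
  declare_assignment(Z = simple_ra(N = N, prob = 0.6))

# Run the step on data: same rows come back, with Z appended
assigned <- simple_random_assignment(voter_file)
print(assigned)
```

Because the assignment is random, the values of `Z` will generally differ from the table above and from run to run.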

Every step in a research design can be declared using one of the `declare_*` functions. Table 4.3 collects these according to the four elements of a research design. Below, we walk through common uses of each of these declaration functions.

Design component | Function | Description |
---|---|---|
Model | `declare_model()` | background variables and potential outcomes |
Inquiry | `declare_inquiry()` | research questions |
Data strategy | `declare_sampling()` | sampling procedures |
Data strategy | `declare_assignment()` | assignment procedures |
Data strategy | `declare_measurement()` | measurement procedures |
Answer strategy | `declare_estimator()` | estimation procedures |
Answer strategy | `declare_test()` | testing procedures |

### 4.2.1 Model

The model defines the number of units under study, their background characteristics, the latent outcomes we want to measure, and their potential outcomes. We can define the model in several ways. In some cases, we start a design with data on the units we wish to study. When that happens, we may not need to simulate all parts of the model, and we can start declaring the model with existing data.

`declare_model(data = voter_file)`

ID | age | sex | party | precinct |
---|---|---|---|---|
001 | 66 | M | REP | 9104 |
002 | 54 | F | DEM | 8029 |
003 | 18 | M | GRN | 8383 |
004 | 42 | F | DEM | 2048 |
005 | 27 | M | REP | 5210 |

We typically need to simulate part or all of the model. Even when we have background data, we don't have access to the latent outcomes or potential outcomes that are needed to define many kinds of causal and descriptive inquiries (see Principle 3.5).

We can use the data simulation functions from the `fabricatr` package to simulate when we do not have complete data on the units under study. For instance, we can declare a model that generates a dataset with 100 units and a random variable `U`:

`declare_model(N = 100, U = rnorm(N))`

When we run this model function, we will get a different 100-unit dataset each time, as shown in Table 4.5.

ID | U | ID | U | ID | U | ID | U | ID | U |
---|---|---|---|---|---|---|---|---|---|
001 | 0.377 | 001 | 1.369 | 001 | 1.556 | 001 | -2.530 | 001 | 1.459 |
002 | -1.310 | 002 | -0.058 | 002 | -1.327 | 002 | 0.243 | 002 | 0.409 |
003 | 0.078 | 003 | 0.449 | 003 | -0.430 | 003 | -1.596 | 003 | -0.692 |
004 | -0.795 | 004 | 1.077 | 004 | 0.814 | 004 | 0.076 | 004 | 0.037 |
005 | 1.766 | 005 | 0.186 | 005 | 0.035 | 005 | 1.590 | 005 | -0.619 |
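A short sketch of this behavior: we store the declared model under the illustrative name `model` and call it twice. The object names are our own choices. The point is only that each call produces a fresh draw of `U`.

```r
library(DeclareDesign)

# Declare the model once; this stores a function, not a dataset
model <- declare_model(N = 100, U = rnorm(N))

# Each call simulates a new 100-unit dataset
draw_1 <- model()
draw_2 <- model()

# Same shape, different random values
dim(draw_1)
head(cbind(draw_1["U"], U_second_draw = draw_2$U))
```

If you need reproducible draws, calling `set.seed()` before running the model function is the usual R approach.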

Defining potential outcomes is as easy as a single expression per potential outcome. Potential outcomes may depend on background characteristics or other potential outcomes.

```
declare_model(
  N = 100,
  U = rnorm(N),
  Y_Z_0 = U,
  Y_Z_1 = Y_Z_0 + 0.25
)
```

ID | U | Y_Z_0 | Y_Z_1 |
---|---|---|---|
001 | 0.521 | 0.521 | 0.771 |
002 | 1.990 | 1.990 | 2.240 |
003 | -0.952 | -0.952 | -0.702 |
004 | 0.626 | 0.626 | 0.876 |
005 | 0.733 | 0.733 | 0.983 |

We also provide an alternative interface for defining potential outcomes that uses R’s formula syntax with the `potential_outcomes` function. The formula syntax lets you specify “regression-like” outcome equations. One downside is that it mildly obscures how the potential outcomes columns are named. We build the names from the outcome name (here `Y`, on the left-hand side of the formula) and the name of the assignment variable given in the `conditions` argument (here `Z`). We also define the two values `Z` takes on, 0 and 1, so the two potential outcomes columns will be named `Y_Z_0` and `Y_Z_1`.

```
declare_model(
  N = 100,
  U = rnorm(N),
  potential_outcomes(Y ~ 0.25 * Z + U, conditions = list(Z = c(0, 1)))
)
```

Either way of creating potential outcomes works; one may be easier or harder to code up in a given research design setting.
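As a quick check that the two interfaces agree, the sketch below runs both declarations from the same random seed and compares the resulting columns. The object names `direct` and `via_formula` are illustrative.

```r
library(DeclareDesign)

# Interface 1: one expression per potential outcome
set.seed(1)
direct <- declare_model(
  N = 100,
  U = rnorm(N),
  Y_Z_0 = U,
  Y_Z_1 = Y_Z_0 + 0.25
)()

# Interface 2: formula syntax via potential_outcomes()
set.seed(1)
via_formula <- declare_model(
  N = 100,
  U = rnorm(N),
  potential_outcomes(Y ~ 0.25 * Z + U, conditions = list(Z = c(0, 1)))
)()

# Both should contain Y_Z_0 and Y_Z_1 with matching values
names(via_formula)
all.equal(direct$Y_Z_0, via_formula$Y_Z_0)
all.equal(direct$Y_Z_1, via_formula$Y_Z_1)
```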

### 4.2.2 Inquiry

To define the inquiry, we declare a summary function of the events generated by the model. We can declare the “population average treatment effect” inquiry as the average difference in the two variables created by the model above.

`declare_inquiry(PATE = mean(Y_Z_1 - Y_Z_0))`
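Like other steps, a declared inquiry is a function that can be run on data. The sketch below applies it to one draw from the model declared earlier. The object names are illustrative. Because this model adds a constant 0.25 to every unit's untreated outcome, the PATE should equal 0.25 in every draw.

```r
library(DeclareDesign)

model <- declare_model(
  N = 100,
  U = rnorm(N),
  Y_Z_0 = U,
  Y_Z_1 = Y_Z_0 + 0.25
)
inquiry <- declare_inquiry(PATE = mean(Y_Z_1 - Y_Z_0))

# Run the inquiry on one simulated population
estimand_df <- inquiry(model())
print(estimand_df)
```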

### 4.2.3 Data strategy

The data strategy constitutes one or more steps representing interventions the researcher makes in the world from sampling to treatment assignment to measurement.

#### 4.2.3.1 Sampling

The sampling step typically involves drawing a random sample of units and then filtering to the sampled units, dropping the unsampled units. You can use the `complete_rs` function from the `randomizr` package to conduct the sampling. See Section 8.1.1 for an overview of the many kinds of sampling that are possible with `randomizr`. The second step is accomplished with the `filter` argument, which by default retains units for which `S == 1`. Here, we draw a complete random sample of 50 units from the population:

`declare_sampling(S = complete_rs(N, n = 50), filter = S == 1)`

When we draw data from our simple design at this point, it will have fewer rows. It will have shrunk from 100 units in the population to a data frame of 50 sampled units.

ID | U | Y_Z_0 | Y_Z_1 | S |
---|---|---|---|---|
001 | -1.664 | -1.664 | -1.414 | 1 |
002 | -1.355 | -1.355 | -1.105 | 1 |
003 | 1.025 | 1.025 | 1.275 | 1 |
006 | -0.885 | -0.885 | -0.635 | 1 |
009 | -0.660 | -0.660 | -0.410 | 1 |
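The shrinkage from 100 to 50 rows can be verified directly. This sketch simulates a population with the model declared earlier and passes it through the sampling step. The object names are illustrative.

```r
library(DeclareDesign)

# Simulate a 100-unit population
population <- declare_model(
  N = 100,
  U = rnorm(N),
  Y_Z_0 = U,
  Y_Z_1 = Y_Z_0 + 0.25
)()

# Declare the sampling step and apply it to the population
sampling <- declare_sampling(S = complete_rs(N, n = 50), filter = S == 1)
sampled <- sampling(population)

# Rows before and after sampling
c(population = nrow(population), sample = nrow(sampled))
```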

#### 4.2.3.2 Treatment assignment

In experimental studies, units are assigned to one of two or more treatment conditions. The `randomizr` package provides functions for randomly assigning units. Here we use complete random assignment with probability 0.5:

`declare_assignment(Z = complete_ra(N, prob = 0.5))`

After treatments are assigned, some potential outcomes are revealed. Treated units reveal their treated potential outcomes and untreated units reveal their untreated potential outcomes. In most declarations, you need a measurement step to reveal measured outcomes. The `reveal_outcomes` function performs this “switching” operation, so called because the function “switches” which potential outcome is revealed depending on the value of the random assignment.

`declare_measurement(Y = reveal_outcomes(Y ~ Z))`

ID | U | Y_Z_0 | Y_Z_1 | S | Z | Y |
---|---|---|---|---|---|---|
001 | 0.669 | 0.669 | 0.919 | 1 | 0 | 0.669 |
003 | -0.251 | -0.251 | -0.001 | 1 | 1 | -0.001 |
004 | -0.724 | -0.724 | -0.474 | 1 | 0 | -0.724 |
006 | 0.399 | 0.399 | 0.649 | 1 | 1 | 0.649 |
010 | 0.619 | 0.619 | 0.869 | 1 | 0 | 0.619 |
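The “switching” logic can also be written out by hand, which may help clarify what `reveal_outcomes` is doing. The sketch below compares the revealed `Y` with a manual `ifelse()` version. The object names are illustrative, and the sampling step is omitted for brevity.

```r
library(DeclareDesign)

model <- declare_model(
  N = 100,
  U = rnorm(N),
  Y_Z_0 = U,
  Y_Z_1 = Y_Z_0 + 0.25
)
assignment <- declare_assignment(Z = complete_ra(N, prob = 0.5))
measurement <- declare_measurement(Y = reveal_outcomes(Y ~ Z))

# Chain the steps: each takes a dataset and returns a dataset
dat <- measurement(assignment(model()))

# Manual switching equation: treated units show Y_Z_1, others Y_Z_0
by_hand <- ifelse(dat$Z == 1, dat$Y_Z_1, dat$Y_Z_0)
all.equal(dat$Y, by_hand)
```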

#### 4.2.3.3 Measurement

Measurement is how we translate latent events into observed data. For example, we might imagine that the normally distributed outcome variable `Y` is a latent outcome that will be translated into a binary outcome when measured by the researcher:

ID | U | Y_Z_0 | Y_Z_1 | S | Z | Y | Y_binary |
---|---|---|---|---|---|---|---|
005 | 1.978 | 1.978 | 2.228 | 1 | 1 | 2.228 | 1 |
017 | -0.603 | -0.603 | -0.353 | 1 | 1 | -0.353 | 1 |
018 | -1.296 | -1.296 | -1.046 | 1 | 0 | -1.296 | 0 |
020 | 0.213 | 0.213 | 0.463 | 1 | 0 | 0.213 | 1 |
024 | 0.797 | 0.797 | 1.047 | 1 | 1 | 1.047 | 1 |
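The declaration that generated `Y_binary` is not shown in this section. One way such a step could be written is sketched below, under the assumption that the binary outcome is a random draw whose success probability rises with the latent `Y` (a probit-style link). The transformation actually used for the table may differ.

```r
library(DeclareDesign)

set.seed(3)

# Stand-in data with a latent, normally distributed outcome Y
dat <- fabricate(N = 50, Y = rnorm(N))

# Assumed measurement rule: Y_binary = 1 with probability pnorm(Y)
binary_measurement <-
  declare_measurement(Y_binary = rbinom(N, size = 1, prob = pnorm(Y)))

measured <- binary_measurement(dat)
head(measured)
```

Under this rule, units with negative latent `Y` can still be measured as 1, just less often, which is the pattern visible in the table above.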

### 4.2.4 Answer strategy

We declare answer strategy steps using the `declare_estimator` function, which plays nicely with the many statistical modeling functions available in R, such as `lm`, `glm`, or the `ictreg` function from the `list` package, among hundreds of others. Throughout the book, we will be using many estimators from `estimatr` (like `lm_robust` and `difference_in_means`) because they are fast and calculate robust standard errors easily.

Estimators are associated with inquiries. Here, we target the population average treatment effect with the difference-in-means estimator.

```
declare_estimator(
  Y ~ Z, model = difference_in_means, inquiry = "PATE"
)
```

The output from a modeling function is a complicated model fit object that contains large amounts of information. We typically only want a few summary pieces of information out of these objects, like the coefficient estimates, standard errors, and confidence intervals. We use model summary functions passed to the `model_summary` argument of `declare_estimator` to do so. Model summary functions take model fits as inputs and return answers as data frames.

The default model summary function is `tidy`:

```
declare_estimator(
  Y ~ Z, model = lm_robust, model_summary = tidy
)
```
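To see what a model summary function does on its own, the sketch below fits `lm_robust` outside of any declaration and passes the fit to `tidy`. The simulated data and object names are illustrative assumptions.

```r
library(DeclareDesign)

set.seed(4)

# Illustrative data: random assignment and an outcome with a 0.25 effect
dat <- fabricate(
  N = 100,
  Z = complete_ra(N, prob = 0.5),
  Y = 0.25 * Z + rnorm(N)
)

# Model fit object: large and nested
fit <- lm_robust(Y ~ Z, data = dat)

# Model summary: one row per coefficient, as a data frame
answers <- tidy(fit)
print(answers)
```

The result has one row for the intercept and one for `Z`. The columns correspond to those shown in the estimates tables later in this chapter.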

## 4.3 Building a design from design elements

We now declare all the individual design elements in one go.

```
model <-
  declare_model(
    N = 100,
    U = rnorm(N),
    potential_outcomes(Y ~ 0.25 * Z + U)
  )
inquiry <-
  declare_inquiry(PATE = mean(Y_Z_1 - Y_Z_0))
sampling <-
  declare_sampling(S = complete_rs(N, n = 50))
assignment <-
  declare_assignment(Z = complete_ra(N, prob = 0.5))
measurement <-
  declare_measurement(Y = reveal_outcomes(Y ~ Z))
answer_strategy <-
  declare_estimator(
    Y ~ Z, model = difference_in_means, inquiry = "PATE"
  )
```

To construct a research design object that we can operate on (diagnose it, redesign it, draw data from it, and so on), we add the elements together with the `+` operator.

```
design <-
  model +
  inquiry +
  sampling + assignment + measurement +
  answer_strategy
```

We usually declare designs more compactly, concatenating steps directly with `+`. Declaration 4.1 shows the format of most declarations throughout the book.

**Declaration 4.1** Two-arm randomized experiment

```
design <-
  declare_model(N = 100, U = rnorm(N),
                potential_outcomes(Y ~ 0.25 * Z + U)) +
  declare_inquiry(PATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_sampling(S = complete_rs(N, n = 50)) +
  declare_assignment(Z = complete_ra(N, prob = 0.5)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(
    Y ~ Z, model = difference_in_means, inquiry = "PATE"
  )
```


Order matters in declaring designs. We can think of the order of the declaration as the temporal order in which steps take place. Below, since the inquiry comes before sampling and assignment, the inquiry is a *population* inquiry, the population average treatment effect.

```
model +
  declare_inquiry(PATE = mean(Y_Z_1 - Y_Z_0)) +
  sampling +
  assignment +
  measurement +
  answer_strategy
```

We could instead define our inquiry as a *sample* average treatment effect by putting the inquiry after sampling:

```
model +
  sampling +
  declare_inquiry(SATE = mean(Y_Z_1 - Y_Z_0)) +
  assignment +
  measurement +
  answer_strategy
```
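With the constant 0.25 effect used so far, the PATE and SATE coincide, so the ordering has no visible consequence. The sketch below assumes instead a hypothetical model with heterogeneous effects (the variable `tau` is our own addition) and places one inquiry before and one after sampling in a single design.

```r
library(DeclareDesign)

set.seed(5)

# Hypothetical model: unit-level effects tau vary around 0.25
hetero_model <- declare_model(
  N = 100,
  U = rnorm(N),
  tau = rnorm(N, mean = 0.25),
  Y_Z_0 = U,
  Y_Z_1 = Y_Z_0 + tau
)

# PATE is computed on all 100 units, SATE on the 50 sampled units
ordering_design <-
  hetero_model +
  declare_inquiry(PATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_sampling(S = complete_rs(N, n = 50)) +
  declare_inquiry(SATE = mean(Y_Z_1 - Y_Z_0))

estimands <- draw_estimands(ordering_design)
print(estimands)
```

The two estimands use the same formula but differ in value, because each is evaluated on the data as it exists at that point in the design.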

## 4.4 Simulating a research design

Diagnosing a research design (learning about its properties) requires first simulating it by running the design over and over. We need to simulate the event generating process, calculate the values of the inquiries, then draw simulated data and calculate the resulting estimates. To draw simulated data, we use `draw_data`:

`draw_data(design)`

ID | U | Y_Z_0 | Y_Z_1 | S | Z | Y |
---|---|---|---|---|---|---|
001 | 0.214 | 0.214 | 0.464 | 1 | 1 | 0.464 |
004 | 1.140 | 1.140 | 1.390 | 1 | 0 | 1.140 |
011 | -0.136 | -0.136 | 0.114 | 1 | 1 | 0.114 |
012 | -0.987 | -0.987 | -0.737 | 1 | 0 | -0.987 |
014 | -0.795 | -0.795 | -0.545 | 1 | 0 | -0.795 |

`draw_data` runs all of the “data steps” in a design, which come from both the model and the data strategy (sampling, assignment, and measurement).

To simulate the estimands from a single run of the design, we use `draw_estimands`. This function runs two operations at once: it draws the events and calculates the estimands at the point defined by the design.

`draw_estimands(design)`

inquiry | estimand |
---|---|
PATE | 0.25 |

Similarly, we can draw the estimates from a single run with `draw_estimates`, which simulates data and, at the appropriate moment, calculates estimates.

`draw_estimates(design)`

term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome | inquiry |
---|---|---|---|---|---|---|---|---|---|
Z | 0.132 | 0.273 | 0.482 | 0.632 | -0.418 | 0.681 | 47.837 | Y | PATE |

To simulate whole designs, we use the `simulate_design` function to draw data and calculate estimands and estimates many times in a row (500 times by default).

`simulate_design(design)`

sim_ID | estimand | estimate | std.error | statistic | p.value | conf.low | conf.high | df |
---|---|---|---|---|---|---|---|---|
1 | 0.25 | 0.248 | 0.285 | 0.872 | 0.388 | -0.324 | 0.821 | 47.918 |
2 | 0.25 | -0.182 | 0.308 | -0.592 | 0.557 | -0.804 | 0.439 | 42.897 |
3 | 0.25 | 0.410 | 0.294 | 1.396 | 0.170 | -0.181 | 1.002 | 46.011 |
4 | 0.25 | 0.172 | 0.234 | 0.732 | 0.468 | -0.300 | 0.643 | 47.882 |
5 | 0.25 | 0.710 | 0.321 | 2.210 | 0.032 | 0.063 | 1.357 | 45.143 |
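The sketch below stores the simulations in an object (the name `simulation_df` is our choice) and uses fewer runs through the `sims` argument to keep it fast. It assumes a design like Declaration 4.1, with one simplification: the estimator step relies on the package's default estimator rather than naming one, since the default and the argument used to name an estimator may differ across package versions.

```r
library(DeclareDesign)

set.seed(6)

sim_design <-
  declare_model(N = 100, U = rnorm(N),
                potential_outcomes(Y ~ 0.25 * Z + U)) +
  declare_inquiry(PATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_sampling(S = complete_rs(N, n = 50)) +
  declare_assignment(Z = complete_ra(N, prob = 0.5)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  # Default estimator (assumption: varies by package version)
  declare_estimator(Y ~ Z, inquiry = "PATE")

# 100 runs instead of the default 500
simulation_df <- simulate_design(sim_design, sims = 100)

# One row per run; estimates vary around the fixed estimand
nrow(simulation_df)
summary(simulation_df$estimate)
```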

## 4.5 Diagnosing a research design

Using the simulations data frame, we can calculate diagnosands like bias, root mean-squared error, and power for each estimator-inquiry pair. In DeclareDesign, we do this in two steps. First, we declare diagnosands, which are functions that summarize diagnostic statistics, the building blocks of diagnosands. The software includes many pre-coded diagnosands (see Section 10), though you can write your own like this:

```
study_diagnosands <-
  declare_diagnosands(
    bias = mean(estimate - estimand),
    rmse = sqrt(mean((estimate - estimand) ^ 2)),
    power = mean(p.value <= 0.05)
  )
```

Diagnosands are summaries of the simulations data frame. The bias diagnosand first calculates the difference between estimate and estimand, and then takes the average.
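To make the arithmetic concrete, the sketch below computes the three diagnosands by hand in base R, using only the five simulation rows displayed in Section 4.4. Five runs are far too few for a real diagnosis, so the numbers illustrate the calculation and do not describe the design's properties.

```r
# The five simulation rows shown in Section 4.4 (illustration only)
sims <- data.frame(
  estimand = rep(0.25, 5),
  estimate = c(0.248, -0.182, 0.410, 0.172, 0.710),
  p.value  = c(0.388, 0.557, 0.170, 0.468, 0.032)
)

# Each diagnosand is a summary of the simulations data frame
bias  <- mean(sims$estimate - sims$estimand)
rmse  <- sqrt(mean((sims$estimate - sims$estimand)^2))
power <- mean(sims$p.value <= 0.05)

round(c(bias = bias, rmse = rmse, power = power), 4)
```

Here the bias works out to 0.0216 and the power to 0.2, since one of the five p-values falls at or below 0.05.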

Next, we apply the diagnosand declaration to the simulations data frame (here `simulation_df`, the output of `simulate_design`) with the `diagnose_design` function:

`diagnose_design(simulation_df, diagnosands = study_diagnosands)`

Bias | RMSE | Power |
---|---|---|
-0.00 | 0.28 | 0.14 |
(0.01) | (0.01) | (0.01) |

We can also do this in a single step by sending a design object directly to `diagnose_design`. The function will first run the simulations, then calculate the diagnosands.

`diagnose_design(design, diagnosands = study_diagnosands)`

## 4.6 Redesign

We redesign to learn how the diagnosands change as design features change. We can do this using `redesign` over a range of sample sizes, resulting in a list of designs.

`designs <- redesign(design, N = c(100, 200, 300, 400, 500))`

Our simulation and diagnosis tools can operate directly on a list of designs:

`diagnose_design(designs)`
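The sketch below shows one way to check that `redesign` changed what we intended. As we understand it, `redesign` alters parameters that were defined as variables when the design was declared, so this example assumes `N` is set in the workspace and passed to `declare_model` as `N = N`. The design is pared down to two steps so that the number of rows drawn equals `N`.

```r
library(DeclareDesign)

# Parameter defined as a variable so that redesign() can vary it
# (assumption about how redesign locates parameters)
N <- 100

small_design <-
  declare_model(N = N, U = rnorm(N),
                potential_outcomes(Y ~ 0.25 * Z + U)) +
  declare_inquiry(PATE = mean(Y_Z_1 - Y_Z_0))

# One design per value of N
designs <- redesign(small_design, N = c(100, 200, 300))

# Confirm each design now simulates the requested number of units
sizes <- sapply(designs, function(d) nrow(draw_data(d)))
unname(sizes)
```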

## 4.7 Library of designs

In our `DesignLibrary` package, we have created a set of common designs as designers (functions that create designs from just a few parameters), so you can get started quickly.

```
library(DesignLibrary)
block_cluster_design <- block_cluster_two_arm_designer(N = 1000, N_blocks = 10)
```

## 4.8 Complex declarations

We have illustrated the simple way to use DeclareDesign declarations thus far, and throughout the book the majority of declarations rely on this method. However, you can also escape the standard way of doing things at any step. Each design element has a “handler” that works behind the scenes on the bookkeeping. You can switch to your own function or to a function from another package at any time. In addition, you can skip our diagnosis tools by simply operating on the simulations data frame itself.

For example, we may want to declare an inquiry for many subgroups of the units in our population. A custom function relying on a `dplyr` pipeline to group data by city and calculate the by-city ATEs could be used like this:

```
declare_inquiry(
  handler = function(data) {
    # start with data
    data %>%
      # split the dataset by city
      group_by(city) %>%
      # estimate city-level ATEs and return as a data.frame
      summarize(city_ATE = mean(Y_Z_1 - Y_Z_0), .groups = "drop")
  }
)
```
```
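Because a handler is an ordinary function, it can be tested on its own before it goes into a declaration. The sketch below pulls the pipeline into a standalone function (the name `city_ates` and the toy data are hypothetical) and runs it on four made-up units in two cities.

```r
library(dplyr)

# The same pipeline as above, written as a named function
city_ates <- function(data) {
  data %>%
    group_by(city) %>%
    summarize(city_ATE = mean(Y_Z_1 - Y_Z_0), .groups = "drop")
}

# Toy potential outcomes: effects of 1 in city A and 0.5 in city B
toy <- data.frame(
  city  = c("A", "A", "B", "B"),
  Y_Z_0 = c(0, 1, 2, 3),
  Y_Z_1 = c(1, 2, 2.5, 3.5)
)

result <- city_ates(toy)
print(result)
```

Checking a handler this way confirms that it is “dataset-in, data-frame-out” before any simulation bookkeeping is layered on top.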

## Further reading

This primer is an introduction to DeclareDesign, but it only begins to scratch the surface of what we can do with the software. At the end of each section in Part II, we illustrate how to tackle interesting problems that come up during declaration, diagnosis, and redesign.

In the meantime, we recommend the following external resources for learning more.