# 16 Experimental : descriptive

Why would we ever need to do an experiment to do descriptive inference?

Suppose we want to understand the causal model M of a violin. In particular, we have a descriptive inquiry I about the pitch of the highest string, the E string: we want to know whether it is in tune. Call the latent pitch of the string $$Y^*$$. No matter how hard we listen to the unplucked string, we can’t hear $$Y^*$$ – it is latent. As part of a data strategy D, we could measure the pitch by plucking the string ($$P$$): $$Y^* \rightarrow Y \leftarrow P$$. This is descriptive research about the causal model M, because the DAG of the violin includes four string nodes which each cause pitch nodes; we’d like to know a descriptive fact about the pitch nodes (at what frequency do they vibrate?).

This question could be recast as a causal inquiry: the untreated potential outcome is the pitch of the unplucked string, as defined by the frequency of vibration. While strings are never perfectly still, we can call the untreated potential outcome $$Y_i(0) = 0 \text{ Hz}$$. The treated potential outcome is the frequency when the string is plucked: $$Y_i(1) = 650 \text{ Hz}$$. The causal effect of plucking the string is $$Y_i(1) - Y_i(0) = 650 - 0 = 650 \text{ Hz}$$.

Whether framed as a descriptive inquiry or a causal inquiry, we arrive at an answer of 650 hertz. Violinists reading this will know that this means the E string is flat and will need to be tuned up.

Likewise, we can use the assignment of units to conditions to learn about descriptive quantities. Audit experiments estimate the fraction of units that discriminate. List experiments estimate the prevalence rate of a sensitive item. Conjoint experiments measure (aggregations of) preferences over candidates. Experimental behavioral games measure trust. The fact of randomization does not render the inquiry causal any more than the lack of randomization renders the inquiries of observational causal research descriptive.

## 16.1 Audit experiments

Audit experiments are used to measure discrimination against one group in favor of another group. The design is used commonly in employment settings to measure whether job applications that are otherwise similar but come from candidates from different genders, races, or social backgrounds receive the same rate of job interview invitations. The same approach has been applied to a very wide range of settings, including education, housing, and requests to politicians.

The audit experiment design we’ll explore in this chapter has the identical data and answer strategies as the two-arm trial for causal inference described in @ref{sec:p3twoarm}. The difference between an audit study and the typical randomized experiment lies in the model and inquiry. In a two-arm trial, a common (causal) inquiry is the average difference between the treated and untreated potential outcomes, the ATE. In an audit experiment, by contrast, the inquiry is descriptive: the fraction of the sample that discriminates.

We can hear our colleagues objecting now – the inquiry in an audit study can of course be conceived of as causal! It’s the average effect of signaling membership in one social group on a resume versus signaling membership in another. We agree, of course, that this interpretation is possible and technically correct. But when we think of the inquiry as descriptive, we can better understand how the audit experiment relies on substantive assumptions about how people who do and do not discriminate behave.

Consider an audit experiment that seeks to measure discrimination against Latinos by election officials by assessing whether the officials respond to emailed requests for information from putatively Latino or White voters. We imagine three types of election officials: those who would always respond to the request (regardless of the emailer’s ethnicity), those who would never respond to the request (again regardless of the emailer’s ethnicity), and officials who discriminate against Latinos. Here, discriminators are defined by their behavior: they would respond to the White voter but not to the Latino voter. These three types are given in Table 16.1.

Table 16.1: Audit experiment response types

| Type | $$Y_i(Z_i = White)$$ | $$Y_i(Z_i = Latino)$$ |
|---|---|---|
| Always-responder | 1 | 1 |
| Anti-Latino Discriminator | 1 | 0 |
| Never-responder | 0 | 0 |

Our descriptive inquiry is the fraction of the sample that discriminates: $$\mathbb{E}[Type = Anti Latino Discriminator]$$. Under this behavioral assumption, $$\mathbb{E}[Type = Anti Latino Discriminator] = \mathbb{E}[Y_i(Z_i = White) - Y_i(Z_i = Latino)]$$, which is why we can use a randomized experiment to measure this descriptive quantity. In the data strategy, we randomly sample from the $$Y_i(Z_i = White)$$’s and from the $$Y_i(Z_i = Latino)$$’s. Then, in the answer strategy, we take a difference-in-means, generating an estimate of the fraction of the sample that discriminates.

Some finer points about this behavioral assumption. First, we assume that Always-responders and Never-responders do not engage in discrimination. It could be that some Never-responder doesn’t respond to the Latino voter out of racialized animus but doesn’t respond to the White voter out of laziness. In this model, such an official would not be classified as an Anti-Latino Discriminator. Second, we assume that there are no Anti-White discriminators. If there were, then the difference-in-means would not be unbiased for the fraction of Anti-Latino discriminators. Instead, it would be unbiased for “net” discrimination, i.e., how much more election officials discriminate against Latinos than they discriminate against Whites. Anti-Latino discrimination and net discrimination are theoretically distinct inquiries, but the distinction is often elided in experimental audit studies.
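The distinction between the share of Anti-Latino discriminators and net discrimination can be checked with a few lines of base R. This is a minimal sketch: the type shares, including the 10% share of Anti-White discriminators, are hypothetical values chosen for illustration and are not estimates from any study.

```r
# Hypothetical type shares (assumed for illustration only)
shares <- c(always = 0.30, anti_latino = 0.05, never = 0.55, anti_white = 0.10)

# Potential outcomes by type: 1 = responds, 0 = does not respond
y_white  <- c(always = 1, anti_latino = 1, never = 0, anti_white = 0)
y_latino <- c(always = 1, anti_latino = 0, never = 0, anti_white = 1)

# Expected difference-in-means: E[Y(White)] - E[Y(Latino)]
expected_dim <- sum(shares * (y_white - y_latino))

# The two inquiries distinguished in the text
anti_latino_share  <- unname(shares["anti_latino"])
net_discrimination <- unname(shares["anti_latino"] - shares["anti_white"])

print(c(expected_dim = expected_dim,
        anti_latino_share = anti_latino_share,
        net_discrimination = net_discrimination))
```

With these assumed shares, the expected difference-in-means matches net discrimination rather than the Anti-Latino share. Setting the Anti-White share to zero makes the two quantities coincide, which is the behavioral assumption in Table 16.1.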

Declaration 16.1 connects the behavioral assumption we make about subjects to the randomized experiment we use to infer the value of a descriptive quantity. Only Never-responders fail to respond to the White request while only Always-responders respond to the Latino request. The inquiry is the proportion of the sample that is an anti-Latino discriminator. The data strategy involves randomly assigning the putative ethnicity of the voter making the request and recording whether it was responded to. The answer strategy compares average response rates by randomly assigned group.

Declaration 16.1

library(DeclareDesign)
library(dplyr) # provides if_else()

types <- c("Always-Responder", "Anti-Latino Discriminator", "Never-Responder")

design <-
  declare_model(
    N = 1000,
    type = sample(
      size = N,
      replace = TRUE,
      x = types,
      prob = c(0.30, 0.05, 0.65)
    ),
    # Behavioral assumption represented here:
    Y_Z_white = if_else(type == "Never-Responder", 0, 1),
    Y_Z_latino = if_else(type == "Always-Responder", 1, 0)
  ) +
  declare_inquiry(anti_latino_discrimination = mean(type == "Anti-Latino Discriminator")) +
  # Randomly assign the putative ethnicity of the emailer
  declare_assignment(Z = complete_ra(N, conditions = c("latino", "white"))) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  # Compare response rates across the two randomly assigned groups
  declare_estimator(Y ~ Z, inquiry = "anti_latino_discrimination")

### 16.1.1 Intervening to decrease discrimination

Some scholars have prompted researchers to “move beyond measurement” in audit studies. Under the model assumptions in the design, audit experiments measure the level of discrimination, but of course they do nothing to reduce it. To move beyond measurement, we intervene in the world to reduce discrimination in a treatment group but not in a control group, then measure the level of discrimination in both arms using the audit experiment technology.

This two-stage design can be seen clearly in the declaration below. The first half of the design is about causal inference: we want to learn about the effect of the intervention on discrimination. The second half of the design is about descriptive inference – within each treatment arm. We incorporate both stages of the design in the answer strategy, in which the coefficient on the interaction of the intervention indicator with the audit indicator is our estimator of the effect on discrimination.

Even at 5,000 subjects, the power to detect the effect of the intervention is quite poor, at approximately 15%. This low power stems from the small treatment effect (reducing discrimination by 50% from 5.0% to 2.5%) and from the noisy measurement strategy.
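To see where a figure of roughly 15% comes from, the power of the interaction term can be approximated analytically. The sketch below uses a normal approximation and assumes four equally sized cells; the function name `power_interaction` is introduced here for illustration, and the response rates are the ones implied by the type shares (0.30, 0.05, 0.65) and by an intervention that converts half of the discriminators into always-responders.

```r
# Normal-approximation power for the difference-in-differences of response rates
power_interaction <- function(N, alpha = 0.05) {
  n_cell <- N / 4  # assumes D and Z each split the sample in half
  # Response rates implied by the type shares in the design
  p <- c(
    white_control  = 0.35,  # always-responders plus discriminators
    latino_control = 0.30,  # always-responders only
    white_treat    = 0.35,
    latino_treat   = 0.325  # half of the discriminators now respond
  )
  # Effect of the intervention on the White-minus-Latino response gap
  effect <- unname((p["white_treat"] - p["latino_treat"]) -
                   (p["white_control"] - p["latino_control"]))
  # Standard error of a contrast of four independent cell proportions
  se <- sqrt(sum(p * (1 - p) / n_cell))
  z <- abs(effect) / se
  crit <- qnorm(1 - alpha / 2)
  power <- pnorm(z - crit) + pnorm(-z - crit)
  c(effect = effect, se = se, power = power)
}

result <- power_interaction(5000)
print(round(result, 3))
```

The effect on the response gap is only 2.5 percentage points, while each cell proportion is estimated from about 1,250 binary outcomes, so the standard error is about as large as the effect itself. A simulation-based diagnosis of the declared design is the more faithful check; this calculation only illustrates why power is low.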

Declaration 16.1

N <- 5000

# `types` is the vector of type labels defined in the audit declaration above
design <-
  # This part of the design is about causal inference
  declare_model(
    N = N,
    type_D_0 = sample(
      size = N,
      replace = TRUE,
      x = types,
      prob = c(0.30, 0.05, 0.65)
    ),
    # The intervention converts half of the discriminators into Always-Responders
    type_tau_i = rbinom(N, 1, 0.5),
    type_D_1 = if_else(
      type_D_0 == "Anti-Latino Discriminator" & type_tau_i == 1,
      "Always-Responder",
      type_D_0
    )
  ) +
  declare_inquiry(ATE = mean((type_D_1 == "Anti-Latino Discriminator") -
                               (type_D_0 == "Anti-Latino Discriminator"))) +
  declare_assignment(D = complete_ra(N)) +
  declare_measurement(type = reveal_outcomes(type ~ D)) +
  # This part is about descriptive inference in each condition!
  declare_model(
    Y_Z_white = if_else(type == "Never-Responder", 0, 1),
    Y_Z_latino = if_else(type == "Always-Responder", 1, 0)
  ) +
  declare_assignment(
    Z = complete_ra(N, conditions = c("latino", "white"))) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  # The interaction coefficient estimates the effect of D on measured discrimination
  declare_estimator(Y ~ Z * D, term = "Zwhite:D", inquiry = "ATE")

• Post-treatment bias is a risk when studying how the audit treatment affects the “quality” of responses, such as the tone of an email or the enthusiasm of the hiring call-back. Interestingly, this causal effect is not defined among the discriminators, because they never send emails to Latinos, so those emails never have a tone. The causal effect on tone is defined only among the “nicest” subjects who always respond, but estimating this effect without bias is tricky.

• One application randomizes a New York City government intervention designed to stop housing discrimination, which was then measured by an audit study design.

• Statistical power is a major issue when using audit studies to measure the causal effect of interventions on discrimination. An analogous problem of low statistical power arises when using list experiments to measure the causal effect of interventions on sensitive traits.

### Exercises

1. Modify the descriptive design to allow for anti-White discrimination. Hint: please use this set of types: type = sample(size = N, replace = TRUE, x = c("Always-Responder", "Anti-Latino Discriminator", "Never-Responder", "Anti-White Discriminator"), prob = c(0.30, 0.05, 0.63, 0.02))
    a. What is the bias for the anti-Latino discrimination inquiry?
    b. Include a net_discrimination inquiry. What is the bias for that inquiry?
2. Modify the sample size of the “moving beyond measurement” design. How large does it have to be before the power for the interaction term reaches 80%?

## 16.2 List experiments

Sometimes, subjects might not tell the truth about certain attitudes or behaviors when asked directly. Responses may be affected by sensitivity bias, or the tendency of survey subjects to dissemble for fear of negative repercussions if some individual or group learns their true response. In such cases, standard survey estimates based on direct questions will be biased. One class of solutions to this problem is to obscure individual responses, providing protection from social or legal pressures. When we obscure responses systematically through an experiment, we can often still identify average quantities of interest. One such design is the list experiment, which asks respondents for the count of “yes” responses to a series of questions including the sensitive item, rather than for a yes or no answer on the sensitive item itself. List experiments give subjects cover by aggregating their answer to the sensitive item with responses to other questions.

For example, one study examines religious discrimination among Americans regarding immigration policy. The authors worried that if they asked people directly whether they were willing to grant citizenship to Muslims, prejudiced respondents would not admit their opposition to the policy. To mitigate this risk, the authors used a list experiment to obtain estimates of preferences over allowing legal immigration for Muslims that were free of social desirability bias. Subjects in the control and treatment groups were asked: “Below you will read [three/four] things that sometimes people oppose or are against. After you read all [three/four], just tell us HOW MANY of them you OPPOSE. We don’t want to know which ones, just HOW MANY.”

Table 16.2: List experiment conditions

| Control | Treatment |
|---|---|
| The federal government increasing assistance to the poor. | The federal government increasing assistance to the poor. |
| Professional athletes making millions of dollars per year. | Professional athletes making millions of dollars per year. |
| Large corporations polluting the environment. | Large corporations polluting the environment. |
| | Granting citizenship to a legal immigrant who is Muslim. |

The treatment group averaged 2.123 items while the control group averaged 1.904 items, for a difference-in-means estimate of 0.219. Under the usual assumptions of randomized experiments, the difference-in-means is an unbiased estimator for the average treatment effect of being asked to respond to the treated list versus the control list. But our (descriptive) inquiry is different. Because the question asks how many items respondents oppose, the inquiry is the proportion of people who oppose granting citizenship to a legal immigrant who is Muslim.

For the difference-in-means to be an unbiased estimator for that inquiry, we invoke two additional assumptions:

• No design effects. The count of “yes” responses to control items must be the same whether a respondent is assigned to the treatment or control group. The assumption highlights that we need a good estimate of the average control item count from the control group (in the example, 1.904). We use that estimate to net out the control item count from responses in the treatment group; what is left is the sensitive item proportion.

• No misreporting (also called “no liars”). The respondent must report the truthful answer to the sensitive item in the treatment group, when granted the anonymity protection of the list experiment. This assumption relies on the fact that the sensitive item is aggregated with the control items, so identifying individual responses is, in most cases, not possible, and on this cover being enough to make the respondent willing to report truthfully. However, there are two circumstances in which the respondent is not provided any cover: if the respondent reports “zero” in the treatment group, they are exactly identified as not holding the sensitive trait; if they report the highest possible count in the treatment group, they are exactly identified as holding the trait. We describe the resulting biases below as floor and ceiling effects, respectively.
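When both assumptions hold, the difference-in-means recovers the prevalence of the sensitive item. The base R sketch below illustrates this with simulated data; the prevalence of 0.3, the control item probability of 0.5, and the large sample size are assumptions chosen for illustration.

```r
set.seed(20)
n <- 100000  # large sample so that sampling noise is small

control_count <- rbinom(n, size = 3, prob = 0.5)  # answers to the three control items
holds_trait   <- rbinom(n, size = 1, prob = 0.3)  # sensitive item (assumed prevalence)
z             <- rbinom(n, size = 1, prob = 0.5)  # 1 = treatment list, 0 = control list

# No design effects: control_count is the same in both arms.
# No liars: treated respondents add the sensitive item truthfully.
y_list <- control_count + z * holds_trait

estimate <- mean(y_list[z == 1]) - mean(y_list[z == 0])
true_prevalence <- mean(holds_trait)

print(round(c(estimate = estimate, true_prevalence = true_prevalence), 3))
```

The control item counts cancel in expectation across the two arms, so what remains in the difference is the average of the sensitive item.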

The no liars assumption of list experiments is evident in the DAG (Figure 16.1): sensitivity bias $$S$$ is not a parent of the list experiment outcome $$Y^L$$ by assumption (no liars). The no design effects assumption is not directly visible.

We declare a design for the list experiment to study a descriptive estimand, the prevalence rate. Our model includes subjects’ true support for granting citizenship to Muslims (Y_star). Sensitivity bias would shape how subjects respond when asked directly about the policy, but because this declaration includes only the list question, it does not appear in the code. The potential outcomes model combines three types of information to determine how subjects will respond to the list experiment: their responses to the three nonsensitive control items (control_count), their true support for granting citizenship to Muslims (Y_star), and whether they are assigned to see the treatment or the control list (Z). Notice that our definition of the potential outcomes embeds the no liars and no design effects assumptions required for the list experiment design. Our estimand is the proportion of subjects who support granting citizenship to Muslims. In the data strategy, we sample 500 subjects from a population of 5,000 and randomly assign half of them to treatment and the remainder to control. In the survey, we ask subjects the list experiment question (Y_list). Our answer strategy involves estimating the proportion who would grant citizenship to Muslims by calculating the difference-in-means in the list outcome between treatment and control.

Declaration 16.2

design <-
  declare_model(
    N = 5000,
    control_count = rbinom(N, size = 3, prob = 0.5),
    Y_star = rbinom(N, size = 1, prob = 0.3),
    potential_outcomes(Y_list ~ Y_star * Z + control_count)
  ) +
  declare_inquiry(prevalence_rate = mean(Y_star)) +
  declare_sampling(S = complete_rs(N, n = 500)) +
  declare_assignment(Z = complete_ra(N)) +
  declare_measurement(Y_list = reveal_outcomes(Y_list ~ Z)) +
  declare_estimator(Y_list ~ Z, model = difference_in_means,
                    inquiry = "prevalence_rate")

diagnosands <- declare_diagnosands(
  bias = mean(estimate - estimand),
  mean_CI_width = mean(abs(conf.high - conf.low))
)
# `sims` (the number of simulations) must be defined before this call
diagnosis <- diagnose_design(design, sims = sims, diagnosands = diagnosands)

Table 16.3: Diagnosis of a list experiment.

| Bias | Mean CI width |
|---|---|
| -0.001 | 0.325 |

We see in the diagnosis that there is no bias, but the average width of the confidence interval is enormous: 32 percentage points.
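The width of the confidence interval can be roughly reproduced by hand from the parameters of the declaration. The sketch below uses a normal approximation and assumes 250 subjects per arm (500 sampled, half assigned to treatment) and independence between the control item count and the sensitive item, as in the declared model.

```r
n_per_arm <- 250  # 500 sampled subjects, half in each arm

# Variance of the control item count: Binomial(3, 0.5)
var_control <- 3 * 0.5 * 0.5
# Treated outcomes add an independent Bernoulli(0.3) sensitive item
var_treat <- var_control + 0.3 * (1 - 0.3)

# Standard error of the difference-in-means
se <- sqrt(var_treat / n_per_arm + var_control / n_per_arm)

# Approximate width of a 95% confidence interval
ci_width <- 2 * qnorm(0.975) * se
print(round(c(se = se, ci_width = ci_width), 3))
```

Most of the variance comes from the control items rather than from the sensitive item itself, which is why list experiments are so much noisier than direct questions at the same sample size.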

### 16.2.1 Assumption violations

Recent work on list experiments emphasizes the possibility of violations of both the no liars and the no design effects assumptions. We can diagnose the properties of our design under plausible violations of each.

First, we consider violations of the no design effects assumption, which means that the control item count differs depending on whether a subject is assigned to treatment or control. Typically, this means that the inclusion of the sensitive item changes responses to the control items, because they are judged in relative terms or because the respondent became suspicious of the researcher’s intentions due to the taboo of asking a sensitive question.

We declare a modified design below that defines two different potential control item counts depending on whether the respondent is in the treatment group (control_count_treat) or control group (control_count_control). The potential outcomes for the list outcome also change: control_count_treat is revealed in treatment and control_count_control in control.

Declaration 16.3 List experiment “design effects” design

design_design_effects <-
  declare_model(
    N = 5000,
    U = rnorm(N),
    # Control item counts differ by condition, violating no design effects
    control_count_control = rbinom(N, size = 3, prob = 0.5),
    control_count_treat = rbinom(N, size = 3, prob = 0.25),
    Y_star = rbinom(N, size = 1, prob = 0.3),
    potential_outcomes(Y_list ~ (Y_star + control_count_treat) * Z + control_count_control * (1 - Z))
  ) +
  declare_inquiry(prevalence_rate = mean(Y_star)) +
  declare_sampling(S = complete_rs(N, n = 500)) +
  declare_assignment(Z = complete_ra(N)) +
  declare_measurement(Y_list = reveal_outcomes(Y_list ~ Z)) +
  declare_estimator(Y_list ~ Z, inquiry = "prevalence_rate")

diagnose_design_effects <- diagnose_design(design_design_effects, sims = sims)

In the diagnosis, we see that there is substantial bias in estimates of the prevalence rate in the presence of design effects:

Table 16.4: Diagnosis of bias due to design effects

| inquiry | bias |
|---|---|
| prevalence_rate | -0.75 |
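The size of this bias follows directly from the parameters in the declaration. A short check in base R, using only the expected values of the declared distributions:

```r
prevalence <- 0.3

# Expected control item counts under each condition
mean_count_control <- 3 * 0.5   # control group: Binomial(3, 0.5)
mean_count_treat   <- 3 * 0.25  # treatment group: Binomial(3, 0.25)

# Expected list outcomes in each arm
expected_treat   <- prevalence + mean_count_treat
expected_control <- mean_count_control

# Expected difference-in-means and its bias for the prevalence rate
expected_estimate <- expected_treat - expected_control
bias <- expected_estimate - prevalence
print(c(expected_estimate = expected_estimate, bias = bias))
```

The bias equals the gap between the two expected control item counts: the design cannot distinguish a shift in control item responses from the sensitive item itself.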

Second, a violation of no liars implies that respondents do not respond truthfully even when provided with the privacy protection of the list experiment. Two common circumstances researchers worry about are ceiling effects and floor effects. In ceiling effects, respondents respond with the maximum number of control items rather than with their truthful response of that plus one, to avoid being identified as holding the sensitive trait. The floor effects problem is the reverse, when respondents hide not holding the sensitive trait by responding one more than their truthful count in the treatment group.

Declaration 16.4 List experiment ceiling effects design

design_liars <-
  declare_model(
    N = 5000,
    U = rnorm(N),
    control_count = rbinom(N, size = 3, prob = 0.5),
    Y_star = rbinom(N, size = 1, prob = 0.3),
    potential_outcomes(
      Y_list ~
        # Ceiling effect: treated subjects who would truthfully report 4 report 3
        if_else(control_count == 3 & Y_star == 1 & Z == 1,
                3,
                Y_star * Z + control_count))
  ) +
  declare_inquiry(prevalence_rate = mean(Y_star)) +
  declare_sampling(S = complete_rs(N, n = 500)) +
  declare_assignment(Z = complete_ra(N)) +
  declare_measurement(Y_list = reveal_outcomes(Y_list ~ Z)) +
  declare_estimator(Y_list ~ Z, inquiry = "prevalence_rate")

diagnose_liars <- diagnose_design(design_liars, sims = sims)

Again, we see that when the no liars assumption is violated, our estimates of the prevalence rate are biased.

Table 16.5: Diagnosis of bias due to ceiling effects

| inquiry | bias |
|---|---|
| prevalence_rate | -0.037 |
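The magnitude of the ceiling-effect bias can also be derived from the declared parameters. In the declaration, only treated subjects who hold the sensitive trait and have the maximum control item count under-report, and each of them under-reports by exactly one. A minimal check, assuming (as in the declaration) that the control item count and the sensitive trait are independent:

```r
prevalence <- 0.3

# Probability that all three control items apply: P(control_count == 3)
p_max_count <- dbinom(3, size = 3, prob = 0.5)

# Each affected treated subject lowers their report by one item
bias_ceiling <- -p_max_count * prevalence
print(c(p_max_count = p_max_count, bias_ceiling = bias_ceiling))
```

The bias shrinks as fewer respondents sit at the top of the control item distribution, which is one reason the choice of control items matters.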

Ceiling and floor effects are not the only ways in which the no liars assumption might be violated. Respondents, when noting the sensitive item among the list, might always respond zero (or the highest number) regardless of their control item count to hide their response. A declaration could be made for these other kinds of violations, also by changing the potential outcomes for Y_list.

### 16.2.2 Choosing design parameters

Researchers have control of three important design parameters that affect the inferential power of list experiments: the number of control items, the selection of control items, and sample size. Of course, researchers also must choose whether to adopt a list experiment, compared to a simpler direct question. We take up this question in the discussion of sample size.

How many control items. After sample size, an early choice list researchers must make is how many control items to select. Here we also face a tradeoff: the more control items, the more privacy protection for the respondent; but the more items the more variance and the less efficient our estimator of the proportion holding the sensitive item. We can quantify the amount of privacy protection provided as the average width of the confidence interval on the posterior prediction of the sensitive item given the observed count. The efficiency can be quantified as the RMSE.

Which control items. The choice of which set of control items to ask can be as important as, or more important than, the number. There are three aims in their selection: reduce bias from ceiling and floor effects, provide sufficient cover to respondents so the no liars assumption is met, and increase the efficiency of the estimates. The first aim can be met by increasing the number of people whose latent control count is between one and $$J-1$$, that is, at least one above the lowest and one below the highest count possible among the control items. Respondents in this band will not feel pressured to subtract from (or add to) their responses to hide that they do (or do not) hold the sensitive item. One way to achieve this is to add an item with high prevalence and an item with low prevalence. Though this would serve the first aim, it would undermine the second: items that are obviously high and low prevalence do nothing to add to privacy protection. The ideal control item count is one with low variance around the middle of the range of the count. To achieve this while providing sufficient cover, items that are negatively correlated can be added.

Sample size. The bias-variance tradeoff in the choice between list and direct questioning can be diagnosed by examining the root mean-squared error (a measure of the efficiency of the design) across two varying parameters: sample size and the amount of sensitivity. We declare a new design with varying n and varying proportion_shy, the proportion of subjects holding the sensitive trait who withhold their truthful response when asked directly:

Declaration 16.5 Combined list experiment and direct question design

# Placeholder defaults (assumed values); redesign() below overrides both
proportion_shy <- 0.1
n <- 500

design <-
  declare_model(
    N = 5000,
    U = rnorm(N),
    control_count = rbinom(N, size = 3, prob = 0.5),
    Y_star = rbinom(N, size = 1, prob = 0.3),
    # W = 1 for trait holders who withhold the truth when asked directly
    W = case_when(Y_star == 0 ~ 0L,
                  Y_star == 1 ~ rbinom(N, size = 1, prob = proportion_shy)),
    potential_outcomes(Y_list ~ Y_star * Z + control_count)
  ) +
  declare_inquiry(prevalence_rate = mean(Y_star)) +
  declare_sampling(S = complete_rs(N, n = n)) +
  declare_assignment(Z = complete_ra(N)) +
  declare_measurement(Y_list = reveal_outcomes(Y_list ~ Z),
                      Y_direct = Y_star - W) +
  declare_estimator(Y_list ~ Z, inquiry = "prevalence_rate", label = "list") +
  declare_estimator(Y_direct ~ 1, inquiry = "prevalence_rate", label = "direct")

designs <- redesign(design,
                    proportion_shy = seq(from = 0, to = 0.3, by = 0.1),
                    n = seq(from = 500, to = 2500, by = 500))

diagnosis_tradeoff <- diagnose_design(designs, sims = sims, bootstrap_sims = b_sims)

Diagnosing this design, we see that at low levels of sensitivity and low sample sizes, the direct question is preferred on RMSE grounds. This is because, though the direct question is biased for the prevalence rate in the presence of any sensitivity bias (positive proportion_shy), it is much more efficient than the list experiment. When we have a large sample size, we begin to prefer the list experiment for its low bias. At high levels of sensitivity, we prefer the list on RMSE grounds despite its inefficiency, because the bias of the direct question will be so large. Beyond the list experiment, this diagnosis illustrates that when comparing two possible designs we need to understand both the bias and the variance of the designs in order to select the best one in our setting. In other designs, it will not be the proportion who are shy but some other feature of the model or data and answer strategy that affects bias.
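The same tradeoff can be approximated analytically from the parameters of the declaration. The sketch below is not the simulation-based diagnosis; it assumes simple formulas for the variance of a sample proportion and of a difference-in-means, ignores the finite population correction, and uses function names (`rmse_direct`, `rmse_list`) introduced here for illustration.

```r
prevalence <- 0.3
var_control_count <- 3 * 0.5 * 0.5  # variance of Binomial(3, 0.5)

# Direct question: biased when some trait holders withhold, but low variance
rmse_direct <- function(n, proportion_shy) {
  p_observed <- prevalence * (1 - proportion_shy)  # share answering "yes" directly
  bias <- p_observed - prevalence
  variance <- p_observed * (1 - p_observed) / n
  sqrt(variance + bias^2)
}

# List experiment: unbiased under its assumptions, but high variance
rmse_list <- function(n) {
  n_arm <- n / 2
  var_treat <- var_control_count + prevalence * (1 - prevalence)
  sqrt(var_treat / n_arm + var_control_count / n_arm)
}

# Same grid of sample sizes and sensitivity levels as the redesign() call
grid <- expand.grid(n = seq(500, 2500, by = 500),
                    proportion_shy = seq(0, 0.3, by = 0.1))
grid$rmse_direct <- mapply(rmse_direct, grid$n, grid$proportion_shy)
grid$rmse_list <- rmse_list(grid$n)
grid$preferred <- ifelse(grid$rmse_list < grid$rmse_direct, "list", "direct")
print(grid)
```

The bias term of the direct question does not shrink with n, while both variance terms do, so for any positive proportion_shy there is some sample size beyond which the list experiment has the lower RMSE.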

In the upper left, we see that when there is no sensitivity bias we always prefer the direct question due to the inefficiency of the list experiment: the red line is always below the blue. When the proportion who are shy reaches 0.1, we still prefer the direct question to the list below 3000 subjects, but above 3000 subjects the RMSE of the list experiment is lower than that of the direct question. When 25 percent of those holding the sensitive trait misreport, we always prefer the list experiment in terms of RMSE. In other words, at such high levels of sensitivity bias we are always willing to tolerate the efficiency loss to get an unbiased estimate.

### Exercises

1. The variance of the list experiment is given by this expression, where $$\mathbb{V}(Y_i(0))$$ is the variance of the control item count and $$\mathrm{cov}(Y_i(0), D_i^*)$$ is the covariance of the control item count with the sensitive trait.

$\frac{1}{N-1} \bigg\{ \pi^*(1-\pi^*) + 4 \mathbb{V}(Y_i(0)) + 4 \mathrm{cov}(Y_i(0), D_i^*) \bigg\}$

Our goal is to compare the direct question and list experiment designs with respect to the RMSE diagnosand. Recall that RMSE equals the square root of variance plus bias squared: $$\mathrm{RMSE} = \sqrt{\mathrm{Variance} + \mathrm{Bias}^2}$$. Assume the following design parameters: $$\delta = 0.10$$, $$\pi^* = 0.50$$, $$\mathbb{V}(Y_i(0)) = 0.075$$, $$\mathrm{cov}(Y_i(0), D_i^*) = 0.025$$.

    a. What is the RMSE of the direct question when $$N$$ = 100?
    b. What is the RMSE of the list experiment when $$N$$ = 100?
    c. Make a figure with $$N$$ on the horizontal axis and RMSE on the vertical axis. Plot the RMSE for both designs over a range of sample sizes from 100 to 2000. Hint: you’ll need to write a function for each design that takes $$N$$ as an input and returns RMSE. You can get started by filling out this starter function: direct_rmse <- function(N){ # write_your_function_here}
    d. How large does the sample size need to be before the list experiment is preferred to the direct question on RMSE grounds?
    e. Comment on how your answer to (d) would change if $$\delta$$ were equal to 0.2. What are the implications for the choice between list experiments and direct questions?
2. The list experiment is one of several experimental designs for answering descriptive inquiries about sensitive topics. Most can target the same inquiry: the proportion of subjects who hold a sensitive trait. The randomized response technique is another such design, itself with many variants. In a “forced response” randomized response design, respondents are asked to roll a die, and depending on the result they either answer honestly or are “forced” to answer either “yes” or “no.” With a six-sided die, respondents might be asked to answer “yes” if they roll a 6, “no” if they roll a 1, and to answer the question truthfully if they roll any other number, two through five. Because the probabilities of rolling a 1 and a 6 are known, we can back out the proportion holding the sensitive trait from the observed data. Declaring this design necessitates changes in M (potential outcomes are a function of the die roll); D (random assignment is the die roll itself); and A (an estimator that is a function of the observed outcomes and the known probability of being forced into each response). We declare one below:
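model_rr <-
  declare_model(
    N = 100,
    U = rnorm(N),
    X = rbinom(N, size = 3, prob = 0.5),
    Y_star = rbinom(N, size = 1, prob = 0.3),
    potential_outcomes(
      Y_rr ~
        case_when(
          dice == 1 ~ 0L,          # forced "no"
          dice %in% 2:5 ~ Y_star,  # truthful answer
          dice == 6 ~ 1L           # forced "yes"
        ),
      conditions = list(dice = 1:6))
  ) +
  # The inquiry named by the estimator below: the share holding the sensitive trait
  declare_inquiry(proportion = mean(Y_star)) +
  declare_assignment(
    dice = complete_ra(N, prob_each = rep(1/6, 6),
                       conditions = 1:6)) +
  declare_measurement(Y_rr = reveal_outcomes(Y_rr ~ dice)) +
  # rr_forced_known is a user-written estimator function that is not shown here;
  # it must be defined before the design is run
  declare_estimator(Y_rr ~ 1, handler = label_estimator(rr_forced_known),
                    label = "forced_known", inquiry = "proportion")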
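The logic of backing out the prevalence from the known forcing probabilities can be sketched in base R. Under the die rule described above, the probability of an observed “yes” is 1/6 plus 4/6 times the prevalence, so a simple moment estimator inverts that relationship. This is an illustrative sketch with an assumed prevalence of 0.3 and a large simulated sample; it is not the rr_forced_known function used in the declaration, whose definition is not shown in this chapter.

```r
set.seed(42)
n <- 100000

y_star <- rbinom(n, size = 1, prob = 0.3)  # true sensitive trait (assumed prevalence)
dice   <- sample(1:6, size = n, replace = TRUE)

# Forced "no" on a 1, forced "yes" on a 6, truthful answer otherwise
y_rr <- ifelse(dice == 1, 0L, ifelse(dice == 6, 1L, y_star))

# P(yes) = 1/6 + (4/6) * prevalence, so invert to estimate the prevalence
p_forced_yes <- 1 / 6
p_truthful   <- 4 / 6
estimate <- (mean(y_rr) - p_forced_yes) / p_truthful

print(round(c(estimate = estimate, truth = mean(y_star)), 3))
```

Dividing by the probability of a truthful answer inflates the sampling variance, which is the price paid for the privacy protection.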

## 16.3 Conjoint experiments

Conjoint survey experiments have become hugely popular in political science and beyond for describing multidimensional preferences over profiles. Conjoint experiments come in two basic varieties: the single profile design and the forced-choice design. Throughout this chapter, we’ll discuss these studies in the context of hypothetical candidate experiments, in which candidates are described in terms of a number of attributes, each of which can take on multiple levels. In the single profile design, subjects are asked to rate one profile at a time using, for example, a 1-7 support scale. In a forced-choice conjoint experiment, subjects are shown two profiles at a time, then asked to make a binary choice between them. Forced-choice conjoint experiments are especially useful for studying electoral behavior because they closely mirror the real-world behavior of choosing between two candidates at the ballot box. A similar logic applies to purchasing behavior when consumers have to choose one product over another. Occasionally, forced-choice conjoint experiments are applied even when no real-world analogue for the binary choice exists. For example, we rarely face a binary choice between two immigrants or between two refugees, so in those cases, perhaps rating profiles one at a time would be more appropriate.

We take the slightly unorthodox position that conjoint experiments target descriptive, rather than causal, inquiries. The reason can be most easily seen in the single profile design case. For concreteness, imagine subjects are presented with one profile at a time that describes the age (young, middle-aged, old), gender (woman, man), and employment sector (public, private) of a candidate for office and are asked to rate their support for the candidate on a 1-7 Likert scale. This set of attributes and levels generates 2 * 3 * 2 = 12 possible profiles. We could ask subjects to rate all 12, but we typically ask them instead to rate only a random subset. This design can support many inquiries. First, our inquiry could be the average preference for each of the 12 profiles. It might also be the difference in the average preference across two profiles. The most common inquiry is the Average Marginal Component “Effect” or AMCE, which summarizes the average difference in preference between two levels of one attribute, averaging over all of the levels of the other attributes. The AMCE for gender, for example, considers the average difference in preference for women candidates versus men candidates among young candidates who work in the private sector, among middle-aged candidates who work in the public sector, and so on for all six combinations of age and sector. The overall AMCE is a weighted average of these six average preference differences, where the weights are given by the relative frequency of each type of candidate. We put the “Effect” in AMCE in scare quotes because we think of the AMCE as a descriptive quantity. We of course agree there is a sense in which the AMCE is a causal quantity, since it is the average effect on preferences of describing a hypothetical candidate as a man or a woman. Sure. But we can see this quantity as descriptive if we just imagine asking subjects about both candidates and describing the difference in their preferences. The only reason we don’t ask about all possible profiles is that there are usually far too many to get through in a typical survey, so we ask subjects about a random subset.
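To make the weighted-average definition concrete, here is a minimal sketch in base R. The average ratings, the weights, and all object names (`profiles`, `cells`, `amce_gender`) are hypothetical and chosen only for illustration; they are not taken from the design declared later in this section.

```r
# All 12 profiles from the running example
profiles <- expand.grid(
  gender = c("man", "woman"),
  age = c("young", "middleaged", "old"),
  sector = c("private", "public"),
  stringsAsFactors = FALSE
)

# Hypothetical average ratings (1-7 scale) for each profile
woman <- profiles$gender == "woman"
public <- profiles$sector == "public"
profiles$mean_rating <- 4 + 0.2 * woman + 0.3 * public -
  0.1 * woman * public +
  0.1 * (profiles$age == "middleaged") - 0.1 * (profiles$age == "old")

# Assumed relative frequencies of the other attributes' levels
age_prob <- c(young = 0.25, middleaged = 0.50, old = 0.25)
sector_prob <- c(private = 0.5, public = 0.5)

# Pair each woman profile with the man profile of the same age and sector
cells <- merge(
  profiles[profiles$gender == "woman", ],
  profiles[profiles$gender == "man", ],
  by = c("age", "sector"),
  suffixes = c("_woman", "_man")
)
cells$diff <- cells$mean_rating_woman - cells$mean_rating_man
cells$weight <- age_prob[cells$age] * sector_prob[cells$sector]

# The AMCE for gender: a weighted average of the six differences
amce_gender <- sum(cells$weight * cells$diff)
amce_gender
```

With these made-up numbers, the woman-minus-man difference is 0.2 among private-sector profiles and 0.1 among public-sector profiles, so equal sector weights give an AMCE of 0.15.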

Just like single-profile conjoints, forced-choice conjoints also target descriptive inquiries, but the inquiry is one step removed from raw preferences over profiles. Instead, we aim to describe the fraction of pairwise contests that a profile would win, averaging over all subjects in the experiment. That is, we aim to describe a profile’s “average win rate.” We can further describe the differences in the average win rate across profiles. For example, among young candidates who work in the private sector, what is the average difference in win rates for women versus men? Just as in the single profile case, the AMCE is a weighted average of these differences, weighted by the relative frequency of each type of candidate.

Here again, we could think of the AMCE as a causal effect, i.e., the average effect of describing a profile as a woman versus a man. But we can also imagine asking subjects to consider all 12 * 12 = 144 possible pairwise contests, then using those binary choices to fully describe subjects’ preferences over contests. A forced-choice conjoint asks subjects to evaluate just a random subset of those contests, since asking about all of them would be impractical.
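The idea of an average win rate can be sketched with simulated data. The evaluations below are hypothetical draws, not the preferences defined later in this section, and the sketch assumes a profile wins a contest whenever the subject evaluates it more highly than its opponent. For simplicity it skips contests that pit a profile against itself.

```r
set.seed(42)
n_subjects <- 200
n_profiles <- 12

# Hypothetical latent evaluations: rows are subjects, columns are profiles
evals <- matrix(rnorm(n_subjects * n_profiles), nrow = n_subjects)
evals[, 1] <- evals[, 1] + 1  # make profile 1 broadly more popular

# Average win rate: share of contests against every other profile,
# across all subjects, in which the focal profile is preferred
win_rate <- sapply(seq_len(n_profiles), function(a) {
  opponents <- setdiff(seq_len(n_profiles), a)
  mean(evals[, a] > evals[, opponents])
})

round(win_rate, 2)
```

Because every contest has exactly one winner, the win rates average to one half across profiles; describing how they differ across profiles is what the forced-choice inquiries do.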

One final wrinkle about the AMCE inquiries, in both the single-profile and forced-choice cases: they are “data-strategy-dependent” inquiries in the sense that AMCEs implicitly average over the distribution of the other profile attributes, and that distribution is controlled by the researcher. The AMCE of gender for profiles that describe age and employment sector is different from the AMCE of gender for profiles that also include partisanship. Further, and more subtly, the AMCE of gender for profiles that are 75% public sector and 25% private sector is different from the AMCE of gender for profiles that are 50% public sector and 50% private sector, because those relative frequencies are part of the very definition of the inquiry. For contrast, consider a vignette-style hypothetical candidate experiment in which all or most of the other candidate features are fixed, save gender. In that design, we estimate an ATE of gender under only one set of conditions, but in the conjoint design, the AMCE averages over ATEs under many sets of conditions.
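A small numerical sketch shows how the same underlying preferences yield different AMCEs under different attribute distributions. The sector-specific differences and the function name `amce_gender_given` are hypothetical.

```r
# Hypothetical average preference differences (woman minus man) by sector
diff_by_sector <- c(private = 0.2, public = 0.1)

# The AMCE weights each sector-specific difference by how often that
# sector appears in the profiles shown to subjects
amce_gender_given <- function(sector_prob) {
  sum(sector_prob[names(diff_by_sector)] * diff_by_sector)
}

amce_even <- amce_gender_given(c(private = 0.50, public = 0.50))
amce_skewed <- amce_gender_given(c(private = 0.25, public = 0.75))

c(even = amce_even, skewed = amce_skewed)
```

Preferences are identical in both cases; only the researcher’s choice of relative frequencies changes, and the inquiry changes with it.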

The data strategy for conjoints, then, requires making these four choices, in addition to the usual measurement, sampling, and assignment concerns:

1. Which attributes to include in the profiles
2. Which levels to include in each attribute (and in what proportion)
3. How many profiles subjects are asked to rate at a time
4. How many sets of profiles subjects are asked to rate in total

The right set of attributes is governed by the “masking/satisficing” tradeoff. If you don’t include an important attribute (like partisanship in a candidate choice experiment), you may worry that subjects will partially infer partisanship from other attributes (like race or gender). If so, partisanship is “masked,” and the estimates for the effects of race or gender will be biased by these “omitted variables.” But if you add too many attributes in order to avoid masking, you may induce “satisficing” among subjects, whereby they only take in a little bit of information, enough to make a “good enough” choice among the candidates.

The right set of levels to include is a tricky choice. We want to include all of the most important levels, but every additional level harms statistical precision. If an attribute has three levels, it’s like you’re conducting a three-arm trial, so you’ll want to have enough subjects for each arm. The more levels, the lower the precision.
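A back-of-the-envelope sketch illustrates the precision cost. It assumes independent observations split evenly across the levels of one attribute and a common outcome standard deviation, and it ignores clustering within respondents, so the numbers are only indicative. The function name `se_diff` and the default of 8,000 profile evaluations are illustrative choices.

```r
# Approximate standard error of a difference in means between two levels
# when n_obs observations are spread evenly over n_levels levels
se_diff <- function(n_obs, n_levels, sd_outcome = 1) {
  n_per_level <- n_obs / n_levels
  sqrt(2 * sd_outcome^2 / n_per_level)
}

# e.g., 8,000 profile evaluations in total
n_levels <- 2:6
data.frame(n_levels, se = round(se_diff(8000, n_levels), 4))
```

Under these assumptions, doubling the number of levels inflates the standard error of each pairwise comparison by a factor of about 1.41 (the square root of 2).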

How many profiles to rate at the same time is also tricky. Our point of view is that this choice should be guided by the real-world analogue of the survey task. If we’re learning about binary choices between options in the real world, then the forced-choice, paired design makes good sense. If we’re learning about preferences over many possibilities, the single profile design may be more appropriate. That said, the paired design can yield precision gains over the single profile design in the sense that subjects rate two profiles at the same time, so we effectively generate twice as many observations for perhaps less than twice as much cognitive effort.

Finally, the right number of choice tasks usually depends on your survey budget. You can always add more conjoint tasks, and the only cost is the opportunity cost of not asking a different survey question that may serve some higher scientific purpose. If you’re worried that respondents will get bored with the task, you can always throw out profile pairs that come later in the survey. Prior research suggests that you can ask many tasks without much loss of data quality.

We begin by establishing the attributes, levels, and their probability distributions, and creating a dataset of candidate types:

library(DeclareDesign) # also attaches fabricatr, randomizr, and estimatr
library(tidyverse)

# Attribute levels and the probability with which each level is shown
f1 = c("man", "woman")
f1_prob = c(0.5, 0.5)
f2 = c("young", "middleaged", "old")
f2_prob = c(0.25, 0.50, 0.25)
f3 = c("private", "public")
f3_prob = c(0.5, 0.5)

# One row per candidate type (2 * 3 * 2 = 12), with its relative frequency
candidates_df <-
  bind_cols(expand_grid(f1, f2, f3),
            expand_grid(f1_prob, f2_prob, f3_prob)) %>%
  mutate(
    candidate = paste(f1, f2, f3, sep = "_"),
    woman = as.numeric(f1 == "woman"),
    middleaged = as.numeric(f2 == "middleaged"),
    old = as.numeric(f2 == "old"),
    public = as.numeric(f3 == "public"),
    prob = f1_prob * f2_prob * f3_prob
  )

Here we describe the true preferences of the 1,000 individuals we will enroll in our conjoint experiment. We describe preferences using a regression-model-like approach. Subject evaluations of candidates are given by the following equation:

\begin{align*}
Y = &\beta_{0} + \\
& \beta_{1} \times \mathrm{woman} + \\
& \beta_{2} \times \mathrm{middleaged} + \\
& \beta_{3} \times \mathrm{old} + \\
& \beta_{4} \times \mathrm{public} + \\
& \beta_{5} \times \mathrm{woman} \times \mathrm{middleaged} + \\
& \beta_{6} \times \mathrm{woman} \times \mathrm{old} + \\
& \beta_{7} \times \mathrm{woman} \times \mathrm{public} + \\
& \beta_{8} \times \mathrm{public} \times \mathrm{middleaged} + \\
& \beta_{9} \times \mathrm{public} \times \mathrm{old} + \\
& \beta_{10} \times \mathrm{woman} \times \mathrm{public} \times \mathrm{middleaged} + \\
& \beta_{11} \times \mathrm{woman} \times \mathrm{public} \times \mathrm{old}
\end{align*}

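Every subject has a different value for each of $$\beta_{0}$$ through $$\beta_{11}$$. Here we imagine that the $$\beta$$’s are larger and more variable for base terms than for two-way interactions, which are in turn larger and more variable than the three-way interactions, as this structure of preferences appears to approximate how individuals evaluate candidates. In the code that follows, the subject-specific intercept $$\beta_{0}$$ is represented by the idiosyncratic term `U`, and the three-way interaction coefficients are set to zero.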

individuals_df <-
  fabricate(
    N = 1000,
    # base terms
    beta_woman = rnorm(N, mean = 0.1, sd = 1),
    beta_middleaged = rnorm(N, mean = 0.1, sd = 1),
    beta_old = rnorm(N, mean = -0.1, sd = 1),
    beta_public = rnorm(N, mean = 0.1, sd = 1),
    # two-way interactions
    beta_woman_middleaged = rnorm(N, mean = -0.05, sd = 0.25),
    beta_woman_old = rnorm(N, mean = -0.05, sd = 0.25),
    beta_woman_public = rnorm(N, mean = 0.05, sd = 0.25),
    beta_public_middleaged = rnorm(N, mean = 0.05, sd = 0.25),
    beta_public_old = rnorm(N, mean = 0.05, sd = 0.25),
    # three-way interactions
    beta_woman_public_middleaged = 0,
    beta_woman_public_old = 0,
    # idiosyncratic error
    U = rnorm(N, sd = 1)
  )

Next, we join candidates and individuals in order to calculate preferences over each of the 12 candidates for all 1,000 subjects. We use this dataset to calculate the true values of the AMCEs. The calculate_amces function is hidden, but can be found in the accompanying full chapter script.

candidate_individuals_df <-
  # Cross join: every candidate paired with every individual (12 * 1000 rows).
  # Recent dplyr releases also offer cross_join() for this purpose.
  left_join(candidates_df, individuals_df, by = character()) %>%
  mutate(
    evaluation =
      beta_woman * woman +
      beta_middleaged * middleaged +
      beta_old * old +
      beta_public * public +

      beta_woman_middleaged * woman * middleaged +
      beta_woman_old * woman * old +
      beta_woman_public * woman * public +
      beta_public_middleaged * public * middleaged +
      beta_public_old * public * old +

      beta_woman_public_middleaged * woman * public * middleaged +
      beta_woman_public_old * woman * public * old +
      U
  )

# calculate_amces() is defined in the accompanying chapter script, not here
inquiries_df <- calculate_amces(candidate_individuals_df)
inquiries_df
Table 16.6: AMCE Inquiries

| inquiry | estimand |
|---|---:|
| AMCE Woman | 0.017 |
| AMCE Old | -0.021 |
| AMCE Middle-aged | 0.020 |
| AMCE Public | 0.035 |

We’re almost ready to declare this full design. We have to reshape the data back to a wider format in which the rows are individuals and the columns are evaluations of candidates.

individuals_wide_df <-
  candidate_individuals_df %>%
  transmute(ID, candidate, evaluation) %>%
  pivot_wider(id_cols = ID,
              names_from = candidate,
              values_from = evaluation)

Declaration 16.5 shows the full design. We ask each of our 1,000 subjects to evaluate four pairs of candidates. Which pairs of candidates they see is determined by the declare_assignment function, which respects the relative frequencies of the candidates set above. The two measurement functions are mildly complicated to address the particularities of the data structure. The estimator is OLS with robust standard errors clustered at the respondent level.

Declaration 16.5

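design <-
  declare_model(
    # Wide data: one row per subject, one column per candidate evaluation.
    # Note: the `pair` variable used in the second measurement step is not
    # created anywhere in this excerpt.
    data = individuals_wide_df,
  ) +
  # The inquiries were computed above from the full set of preferences
  declare_inquiry(
    handler = function(data){inquiries_df}
  ) +
  # Draw each attribute independently with the probabilities set above
  declare_assignment(
    f1 = complete_ra(N, conditions = f1, prob_each = f1_prob),
    f2 = complete_ra(N, conditions = f2, prob_each = f2_prob),
    f3 = complete_ra(N, conditions = f3, prob_each = f3_prob),
    candidate = paste(f1, f2, f3, sep = "_")
  ) +
  # This function picks out the evaluation of the correct candidate
  declare_measurement(
    handler = function(data) {
      data %>%
        rowwise() %>%
        mutate(evaluation = get(candidate)) %>%
        ungroup()
    }
  ) +
  # This function returns a winner and a loser
  # (pair_fun is a helper that is not defined in this excerpt)
  declare_measurement(
    handler = function(data) {
      data %>%
        group_by(pair) %>%
        mutate(Y = pair_fun(evaluation)) %>%
        ungroup()
    }
  ) +
  # OLS with standard errors clustered by respondent.
  # More recent DeclareDesign releases may call the `model` argument `.method`.
  declare_estimator(
    Y ~ f1 + f2 + f3,
    model = lm_robust,
    clusters = ID,
    se_type = "stata",
    term = c("f1woman", "f2middleaged", "f2old", "f3public"),
    inquiry = c("AMCE Woman", "AMCE Middle-aged", "AMCE Old", "AMCE Public")
  )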

Figure 16.3 shows the sampling distribution of the four AMCE estimators. All four are unbiased, but with only 1,000 subjects evaluating four pairs of candidates, the power for the smaller AMCEs is less than ideal.
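Declaration 16.5 calls a helper, pair_fun, that is not defined in this section. The following is a hypothetical stand-in, not the authors’ implementation: it assumes that, within each pair, the profile with the higher latent evaluation is chosen and is coded 1 while the other is coded 0.

```r
# Hypothetical stand-in for the hidden helper used in Declaration 16.5.
# Given the evaluations of the profiles in one pair, return 1 for the
# preferred profile and 0 for the other. Exact ties would mark both
# profiles as winners, but ties have probability zero when evaluations
# are continuous.
pair_fun <- function(evaluation) {
  as.numeric(evaluation == max(evaluation))
}

pair_fun(c(0.3, -1.2))
```

Because the function is applied within groups defined by the pair, each pair contributes exactly one winner and one loser to the outcome Y.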

### Exercises

• Modify Declaration 16.5. How many pairs would the 1,000 subjects need to evaluate before power is above 0.80 for all four estimators? What concerns would you have about asking subjects to evaluate that many pairs?

• Modify Declaration 16.5. How many subjects would you need if every subject only evaluated 1 pair?

## 16.4 Behavioral games

Behavioral games are often used to study difficult-to-measure characteristics of subjects: risk attitudes, altruism, prejudice, trust. The approach uses the lab or other mechanisms to control context. A high level of control brings two distinct benefits. First, it can eliminate noise. One can get estimates under a particular, well-defined set of conditions rather than estimates generated from averaging over a range of conditions. Second, and more subtly, it can prevent various forms of confounding. For instance, outside the lab we might observe how people act when they work on tasks with an outgroup member. But we only observe the responses among those who do work with outgroup members, not among those who do not. By setting things up in a controlled way, you can see how people would react when put into particular situations.

The approach holds enormous value. But, as prior work has highlighted, it also introduces many subtle design choices. Many of these can be revealed through declaration and diagnosis. We illustrate using the “trust” game.