14 Observational : descriptive

An observational design for descriptive inference has an inquiry like a population mean, covariance, or distribution as the main research goal. In an observational research design, the data strategy includes sampling and measurement components, but no treatments are allocated. Put differently, in an observational design for descriptive inference, researchers seek to measure and summarize the world, but not to change it. This class of research design encompasses a huge portion of research activity – most surveys fall into this class, as do large-scale data collections of economic and sociopolitical indicators. Other examples of observational designs for descriptive inference include classic case studies focused on “thick description” and many text analysis projects.

14.1 Simple random sampling

We declare a design in which a researcher takes a simple random sample of a population and uses an instrument to measure a latent quantity. The inquiry is the population average of the measured quantity. We show how to declare this design and an approach to incorporating concerns about nonrandom nonresponse into the design.

Descriptive inquiries like the population mean of one variable or the population correlation between two variables are defined with respect to a well-defined group of units – that group of \(N\) (written with an upper-case \(N\)) units is the population about which we want to draw inferences.

One approach to studying a population is to conduct a census in which we record data on all \(N\) units. A census has the clear advantage of being comprehensive, but it usually comes at an overwhelming and prohibitive cost.

To avoid those costs, we collect data from only \(n\) units (written with a lower-case \(n\)), where the sample size \(n\) is smaller than the population size \(N\). When the \(n\) units we happen to sample are chosen at random, we can obtain unbiased estimates of many descriptive inquiries.

Imagine we seek to estimate the average political ideology of adult residents of the small town of Portola, California (population 2,100). Under our model M, the latent ideology \(Y^*\) is drawn from a standard normal distribution.

The data strategy \(D\) has two components: a survey question \(Q\) and a sampling procedure \(S\). The survey question asks subjects to place themselves on a left-right scale that varies from 1 (most liberal) to 7 (most conservative). We approximate this measurement procedure with a function that “cuts” the latent ideology into 7 separate groups. This survey question will introduce measurement error insofar as it does not distinguish among units who have different latent ideologies but place themselves at the same point on the seven-point scale. Our main hope for this measurement procedure is that all of the people who give themselves higher scores are indeed more conservative than those who give themselves lower scores. The sampling procedure is “complete” random sampling. We draw a sample of exactly \(n = 100\), where every member of the population has an equal probability of inclusion in the sample, \(\frac{n}{N}\).
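As a minimal sketch (separate from the declaration itself), the cut() function coarsens a continuous latent variable into the seven response categories while preserving their ordering:

# Illustrative only: coarsen latent ideology draws onto a seven-point scale
Y_star_draws <- rnorm(10)
as.numeric(cut(Y_star_draws, 7)) # integer placements from 1 (lowest) to 7 (highest)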

The model and data strategy are represented by the DAG in Figure 14.1. The DAG shows that the observed outcome \(Y\) is a function of the latent score \(Y^*\) and the survey question \(Q\). The observed outcome \(Y\) is only measured for sampled units, i.e., units that have \(S = 1\). This simple diagram represents important design assumptions. First, no arrow leads from \(Y^*\) to \(S\). If such an arrow were present, then units with higher or lower latent ideologies would be more likely to be sampled. Second, an arrow does lead from \(Y^*\) to \(Y\), indicating that we assume the measured variable does indeed respond to the latent variable. Finally, the diagram includes an explicit role for the survey question, which helps us to consider how alternative wordings of \(Q\) might change the observed variable \(Y\).

Figure 14.1: DAG for the simple random sampling design

Our inquiry \(I\) is the population mean of the measured variable \(Y\): \(\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i\), rather than the mean of the latent variable \(Y^*\). In this sense, our inquiry is “data-strategy dependent,” since we are interested in the average value of what we would measure for any member of the population were we to sample them.

Our answer strategy is the sample mean estimator: \(\widehat{\overline{Y}} = \frac{1}{n} \sum_{i=1}^{n} Y_i\), implemented here as an ordinary least squares regression to facilitate the easy calculation of auxiliary statistics like the standard error of the estimate and the 95% confidence interval.
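As a minimal sketch (assuming the estimatr package and hypothetical data), an intercept-only regression returns the sample mean along with its standard error and 95% confidence interval:

library(estimatr)

Y_sampled <- rnorm(100, mean = 4) # hypothetical responses from 100 sampled units
lm_robust(Y_sampled ~ 1) # intercept estimate equals mean(Y_sampled), with SE and 95% CI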

We incorporate these design features into Declaration 14.1. The portola object is a fixed population of 2,100 units with a latent ideology Y_star. The declaration of the measurement strategy comes before the declaration of the inquiry, showing how the inquiry is data-strategy dependent.

Declaration 14.1 \(~\)

set.seed(343)
portola <-
  fabricate(
    N = 2100,
    Y_star = rnorm(N)
    )

design <- 
  declare_model(data = portola) + 
  declare_measurement(Y = as.numeric(cut(Y_star, 7))) + 
  declare_inquiry(Y_bar = mean(Y)) + 
  declare_sampling(S = complete_rs(N, n = 100)) + 
  declare_estimator(Y ~ 1, inquiry = "Y_bar")

\(~\)

Two main diagnosands for the simple random sampling design are bias and rmse. We want to know whether we get the right answer on average, and we want to know, on average, how far off from the truth we are.

diagnosands <- declare_diagnosands(
  bias = mean(estimate - estimand),
  rmse = sqrt(mean((estimate - estimand) ^ 2))
)
diagnosis <- diagnose_design(design, diagnosands = diagnosands) 
Table 14.1: Complete random sampling design diagnosis

Bias      RMSE
-0.01     0.11
(0.01)    (0.00)

The diagnosis in table 14.1 indicates that under complete random sampling, the sample mean estimator of the population mean is unbiased and that the root mean squared error is manageable at 0.11.

14.1.1 What can go wrong

The most serious threat to descriptive inference in a randomized sampling design is nonresponse. Missingness due to nonresponse can lead to bias if missingness is correlated with outcomes. Sometimes this bias is referred to as “selection bias” in the sense that some units select out of responding when sampled.

Depending on what information is available about the missing units, various answer strategy fix-ups are available to analysts. For example, if we have demographic or other covariate information about the missing units, we can search for similar-seeming units in the observed data, then impute those units’ observed outcomes for the missing outcomes. This approach depends on the strong assumption that units with the same covariate profile have the same average outcome, regardless of whether they go missing. The imputation is often done on the basis of a regression model; multiple imputation methods attempt to incorporate the additional uncertainty that accompanies the modeling technique (see [decent introduction to MI]).
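As a rough sketch of the regression-imputation idea (the data, variable names, and covariate are hypothetical; this is one of many possible implementations): fit a model among respondents, predict outcomes for nonrespondents, then average over observed and imputed values.

# Hypothetical illustration of regression imputation
dat <- data.frame(
  age = rnorm(100, mean = 40, sd = 10),
  Y = rnorm(100),
  R = rbinom(100, 1, 0.7) # 1 = responded, 0 = missing
)
dat$Y[dat$R == 0] <- NA

fit <- lm(Y ~ age, data = dat) # fit among respondents (NA rows dropped by default)
Y_imputed <- ifelse(is.na(dat$Y), predict(fit, newdata = dat), dat$Y)
mean(Y_imputed) # estimate of the mean after imputation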

Avoiding – or dynamically responding to – missingness in the data strategy is usually preferable to the addition of modeling assumptions in the answer strategy. Avoiding missingness often means adding extra effort and expense: monetary incentives for participation, multiple rounds of attempted contact, and a variety of modes of contact (phone, mail, email, direct message, text, canvass). The best way to allocate extra effort will vary from context to context, as will the payoff from doing so. Our recommendation is to reason about the plausible response rates that would result from different levels of effort, then to consider how to optimize the bias-effort tradeoff. Sometimes achieving zero bias would be far too costly, so we may be willing to tolerate some bias rather than pay for the additional effort.

Declaration 14.2 builds in a dependence between the latent outcome \(Y^*\) and the probability of responding to the survey. That probability also responds to researcher effort. The diagnosis shows how effort translates into higher response rates and lower bias:

Declaration 14.2 Survey nonresponse design

effort <- 0 # assumed baseline effort level so the design runs on its own; varied below via redesign

design <- 
  declare_model(data = portola) + 
  declare_measurement(Y = as.numeric(cut(Y_star, 7))) + 
  declare_inquiry(Y_bar = mean(Y)) + 
  declare_sampling(S = complete_rs(N, n = 100)) + 
  declare_measurement(
    R = rbinom(n = N, size = 1, prob = pnorm(Y_star + effort)),
    Y = if_else(R == 1, Y, NA_real_)
  ) +
  declare_estimator(Y ~ 1, inquiry = "Y_bar") +
  declare_estimator(R ~ 1, label = "Response Rate")

\(~\)

designs <- redesign(design, effort = seq(0, 5, by = 0.5))
diagnosis <- diagnose_designs(designs, sims = 500)

Figure 14.2: Redesigning the random sampling design over researcher effort

14.2 Cluster random sampling

We declare a design in which a researcher takes a clustered random sample of a population and uses the design to ask whether they should invest in sampling more clusters or in more individuals per cluster. The design includes a budget constraint that determines how many individuals can be sampled within each cluster as a function of the number of clusters sampled.

Researchers often cannot randomly sample at the individual level because it may, among other reasons, be too costly or logistically impractical. Instead, they may first randomly sample clusters, then sample units within those clusters. Clusters might be schools, localities, or households.

How does clustering change the research design relative to the individual-level design? First, we need to elaborate the model \(M\) to make the clustering hierarchy explicit. In the declaration below, we imagine our research site is two states in Nigeria. Localities are nested within states and individuals are nested within localities. Second, we want to respect this hierarchy when thinking about the distribution of outcomes. Individuals living in the same locality are likely to share ideological viewpoints, either through the explicit transmission of political views or because of common exposure to political influence. The “intra-cluster correlation” or ICC is an extremely important statistic for the design of cluster-sampled studies. It describes what fraction of the total variation in the outcome can be attributed to across-cluster differences in average outcomes.

In the declaration below, the latent outcome Y_star describes subjects’ latent political ideology. This latent outcome is a function of a locality shock and an individual shock. The variances of these two shocks are determined by the ICC parameter. If ICC were equal to 1, the variance across localities would be equal to 1, and all individuals within a locality would have exactly the same ideology. If ICC were equal to zero, then the variation in ideology would take place entirely at the individual level.
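A minimal sketch (separate from the declaration) confirms this variance decomposition: with a locality-level variance of ICC and an individual-level variance of 1 - ICC, the share of total variation occurring across localities is approximately ICC.

ICC <- 0.4
locality_shock <- rnorm(10000, sd = sqrt(ICC)) # across-locality component
individual_shock <- rnorm(10000, sd = sqrt(1 - ICC)) # within-locality component
var(locality_shock) / (var(locality_shock) + var(individual_shock)) # approximately 0.4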

Declaration 14.3 \(~\)

set.seed(343)
ICC <- 0.4

two_nigerian_states <-
  fabricate(
    state = add_level(N = 2, 
                      state_name = c("taraba", "kwara"),
                      state_mean = c(-0.2, 0.2)),
    locality = add_level(
      N = 500,
      locality_shock = rnorm(N, state_mean, sqrt(ICC))
    ),
    individual = add_level(
      N = 100,
      individual_shock = rnorm(N, sd = sqrt(1 - ICC)),
      Y_star = locality_shock + individual_shock
    )
  )

\(~\)

Many different cluster sampling designs are possible, but a very standard choice is a two-stage design in which first, some but not all clusters are sampled, and second, some but not all units within a cluster are sampled. The sampling at either stage may be stratified by covariates at the appropriate level. The first stage can be stratified by cluster-level covariates and the second stage can be stratified by individual-level covariates in order to improve precision. In this declaration, we form cluster-level strata by state.

The two stage random-sampling design raises an important tradeoff: Should we invest in sampling more clusters or in more individuals per cluster? Typically, adding the marginal cluster is more expensive than adding the marginal individual. We formalize the tradeoff with a “budget function” that returns the largest individual level inclusion probability that is budget-compatible with a given cluster sampling probability:

budget_function <- function(cluster_prob) {
  budget <- 20000
  cluster_cost <- 20
  individual_cost <- 2
  n_clusters <- 1000
  n_individuals_per_cluster <- 100
  
  total_cluster_cost <- cluster_prob * n_clusters * cluster_cost
  
  remaining_funds <- budget - total_cluster_cost
  
  sampleable_individuals <- cluster_prob * n_clusters * n_individuals_per_cluster
  
  individual_prob <- (remaining_funds / individual_cost) / sampleable_individuals
  
  pmin(individual_prob, 1)
}
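To see the tradeoff the function encodes, we can evaluate it at a few cluster sampling probabilities (given the budget and cost figures assumed above): sampling fewer clusters leaves more of the budget for individuals within each sampled cluster.

budget_function(0.10) # 0.9: few clusters, most individuals within each sampled cluster
budget_function(0.25) # 0.3: a quarter of clusters, 30% of individuals within each
budget_function(0.50) # 0.1: half of clusters, only 10% of individuals within each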

We use the output of this function to determine the probability of an individual being sampled, conditional on their cluster being sampled.

Lastly, the answer strategy must also respect the data strategy by clustering standard errors at the highest level at which units are sampled, which in this case is the locality.

Declaration 14.4 \(~\)

cluster_prob <- 0.5 # assumed baseline cluster sampling probability; varied below via redesign

design <-
  declare_model(data = two_nigerian_states) +
  declare_measurement(Y = as.numeric(cut(Y_star, 7))) +
  declare_inquiry(Y_bar = mean(Y)) +
  declare_sampling(
    S_cluster = strata_and_cluster_rs(
      strata = state,
      clusters = locality,
      prob = cluster_prob
    ),
    filter = S_cluster == 1
  ) +
  declare_sampling(
    S_individual = 
      strata_rs(strata = locality, 
                prob = budget_function(cluster_prob)),
    filter = S_individual == 1
  ) +
  declare_estimator(Y ~ 1,
                    clusters = locality,
                    inquiry = "Y_bar")

\(~\)

We redesign Declaration 14.4 over various levels of the cluster-level probability of sampling, which in turn sets the probability of sampling at the individual level. Figure 14.3 shows that for a good while, adding additional clusters yields precision gains. At some point, however, the cost in terms of sample size within each cluster becomes too large, and we start seeing precision losses around a cluster sampling probability of 0.75. The precise combination of design parameters that minimizes the standard deviation of the sampling distribution will depend on nearly every aspect of the design declaration, but the most important are the total budget, the relative costs of clusters and individuals, and the ICC. When the ICC is zero, we should invest in few clusters and many individuals. When the ICC is one, we should invest in many clusters and few individuals.

designs <- redesign(design, cluster_prob = seq(0.1, 0.9, 0.1))
diagnosis <- diagnose_design(designs, sims = 500)

Figure 14.3: Trading off the number of clusters and the number of individuals per cluster

14.3 Multi-level regression and poststratification

We declare a design in which researchers reweight the responses of different units in a sample in order to better estimate a population-level quantity. Reweighting depends on how much units are thought to “represent” other unsampled units and requires making decisions about how much units of different types should be pooled together. The performance of a partially pooled model is compared against designs that involve no pooling and full pooling.

Multi-level regression and poststratification (MRP) is a technique used primarily for “small area estimation.” In the prototypical setting, we conduct a nationally representative survey of public opinion, but our goal is to generate estimates of the average level of public opinion for many subnational units. In the United States context, these “small area” units are often the 50 states. The main problem is that in a national poll of 2,000 Americans, we might only have 4 respondents from small states like Alaska, Wyoming, or Vermont, but more than 100 from large states like California, New York, or Texas. Accordingly, it is harder to estimate opinion in small states than in large states. The key insight of an MRP design is that we can “borrow strength” across states and kinds of people in order to improve state-level estimates.

In an MRP design, the answer strategy includes two steps: a multi-level regression step and a poststratification step. The regression model generates estimates of the average opinion for classes of people within each state. The precise flavor of regression model can vary from application to application. In the simple example below, we use a generalized linear mixed-effects model with an individual-level covariate and random effects by state, but regression models of astounding complexity are sometimes used to model important nuances in how opinions covary with individual and state-level characteristics.

The regression model generates estimates of the average opinion for classes of people within each state – the post-stratification step reweights these estimates to known proportions of each class of person within each state. The knowledge of these proportions has to come from outside the survey. The US census is the usual source of these population proportions, though any reliable source of this information is suitable as well.
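As a stylized sketch of the poststratification step (all numbers are made up): suppose the regression model estimates support of 0.62 among high school graduates and 0.48 among non-graduates in a state where the census says 70% of adults are graduates. The poststratified state estimate is the weighted average of the class-level estimates:

class_estimates <- c(HS = 0.62, no_HS = 0.48) # hypothetical modeled support by education class
census_shares <- c(HS = 0.70, no_HS = 0.30) # hypothetical population shares from the census
weighted.mean(class_estimates, census_shares) # poststratified state estimate: 0.578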

We begin with a dataset of the fifty states that describes what fraction of people in each state have graduated from high school. This code block also establishes the true state means (state_mean) that will be our inquiry.

states <- 
  as_tibble(state.x77) %>%
  transmute(
    state = rownames(state.x77),
    prop_of_US = Population / sum(Population),
    prob_HS = `HS Grad` / 100,
    state_shock = rnorm(n = n(), sd = 0.5),
    state_mean = prob_HS * pnorm(0.2 + state_shock) + (1 - prob_HS) * pnorm(state_shock)
  )

Here we declare the design. In the model, we draw a nationally representative sample of size 2,000, respecting the fraction of people within each state with a high school degree. The poststratification weights are built from that fraction as well. The binary public opinion variable policy_support is a function of the high school covariate, an individual-level shock, and a state-level shock. The inquiry is the mean policy support at the state level.

Declaration 14.5 \(~\)

design <-
  declare_model(
    data = states[sample(1:50,
                         size = 2000,
                         replace = TRUE,
                         prob = states$prop_of_US),],
    HS = rbinom(n = N, size = 1, prob = prob_HS),
    PS_weight =
      case_when(HS == 0 ~ (1 - prob_HS),
                HS == 1 ~ prob_HS),
    individual_shock = rnorm(n = N, sd = 0.5),
    policy_support = rbinom(N, 1, prob = pnorm(0.2 * HS + individual_shock + state_shock))
  ) +
  declare_inquiry(
    handler = function(data) {
      states %>% transmute(state, inquiry = "mean_policy_support",  estimand = state_mean)
    }
  )

The tricky part of this design is the two-step answer strategy. The first step is handled by the multilevel regression function glmer. The second step is handled by the ps_helper function that obtains predictions from the model, then reweights them according to the post-stratification weights.

# ps_helper obtains unit-level predictions from the fitted model (prediction()
# from the prediction package), then reweights them within each state
# according to the poststratification weights
ps_helper <- function(model_fit, data) {
  prediction(
    model_fit,
    data = data,
    allow.new.levels = TRUE,
    type = "response"
  ) %>%
    group_by(state) %>%
    summarize(estimate = weighted.mean(fitted, PS_weight))
}
design <-
  design +
  declare_estimator(handler = label_estimator(function(data) {
    # multilevel regression step: random intercepts by state (glmer from lme4)
    model_fit <- glmer(
      formula = policy_support ~ HS + (1 | state),
      data = data,
      family = binomial(link = "logit")
    )
    ps_helper(model_fit, data = data)
  }),
  label = "Partial pooling",
  inquiry = "mean_policy_support")

\(~\)

Figure 14.4 shows one draw from this design, plotting the MRP estimates against the true level of opinion.

Figure 14.4: Estimates of state-level opinion plotted against their true levels

14.3.1 Redesign over answer strategies

The strengths of the MRP design are best appreciated by contrasting MRP’s partial pooling approach to two alternatives: no pooling and full pooling. Under no pooling, we estimate each state mean separately with a national adjustment for the individual-level high school covariate. Under full pooling, we only adjust for high school and ignore state information altogether. Here we add both estimators to the design and diagnose.

design <- 
  design +
  
  declare_estimator(
    handler = label_estimator(function(data) {
      model_fit <- lm_robust(
        formula = policy_support ~ HS + state,
        data = data
      )
      ps_helper(model_fit, data = data)
    }),
    label = "No pooling",
    inquiry = "mean_policy_support") +
  
  declare_estimator(
    handler = label_estimator(function(data) {
      model_fit <- lm_robust(
        formula = policy_support ~ HS,
        data = data
      )
      ps_helper(model_fit, data = data)
    }),
    label = "Full pooling",
    inquiry = "mean_policy_support")

Figure 14.5 compares the three estimators. The first column of facets shows one draw of the estimates against the estimands. The main thing to notice here is that the full pooling estimate is more or less a flat line – regardless of the estimand, the estimates are just above 50%. Relative to partial pooling, the no pooling estimates are spread further around the 45-degree line, with small states bouncing around the most.

On the right side of the figure, we see the bias, RMSE, and standard deviation diagnosands for each inquiry under all three answer strategies. Under no pooling, bias is very low, but the RMSE and standard deviation for small states is very high. Under full pooling, the standard deviation is very low, but bias is very positive for states with low support and very negative for states with high support. The resulting RMSE has a funny “V” shape – we only do well for states that happen to have opinion that is very close to the national average.

Partial pooling represents a Goldilocks compromise between full and no pooling. Yes, we have some positive bias for low-opinion states and negative bias for high-opinion states, but variance has been brought under control. As a result, the RMSE for both small and large states is small.

Figure 14.5: Comparison of three answer strategies

14.4 Index creation

We declare a design in which researchers take multiple measures and combine them to learn about a latent, unobservable quantity. We use diagnosis to show that it is possible to generate interpretable conditional estimates of this quantity and assess bias even though the metric of the unobservable quantity is unknown. The diagnosis also highlights, however, how subtle differences in scale construction generate different types of bias.

Models often specify a latent variable (Y_star) that we can’t observe directly. The measurement procedures we use to produce observed variables (Y_1, Y_2, Y_3) are not perfect: observed values are related to the latent variable but may be on their own scales. A common strategy for addressing measurement error is to combine multiple measures of the same latent variable into an index. The basic intuition for this procedure is that when we combine multiple measures, the errors attending each measure will tend to cancel one another out. When this happens, the index itself has lower measurement error than any of its constituent parts.
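A small sketch (outside the declaration, with simulated data) illustrates the intuition: an average of several noisy measures tracks the latent variable more closely than any single measure does.

Y_star <- rnorm(1000) # latent variable
Y_1 <- Y_star + rnorm(1000) # three noisy measures of the same latent variable
Y_2 <- Y_star + rnorm(1000)
Y_3 <- Y_star + rnorm(1000)
index <- (Y_1 + Y_2 + Y_3) / 3
c(single = cor(Y_1, Y_star), index = cor(index, Y_star)) # the index correlates more strongly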

The first difficult feature of such problems is that we do not have access to the scale on which Y_star is measured and so it may seem like a hopeless exercise to try to assess whether we have got good or bad estimates of Y_star when we combine the measured data.

One way around this is to normalize both the latent variable and the measured variable so that they have a mean of 0 and unit standard deviation. But in that case, we are guaranteed that our estimate of the mean of the normalized variable will be unbiased: we will certainly estimate a mean of 0! That may be so, but as we show in the declaration below, if your model is correct this approach may still be useful for calculating other quantities, such as conditional means, that you do not get right merely by construction.
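A quick illustration of this point (using the same functional form as the declaration below): after normalizing, the overall mean is 0 by construction, but the mean among the X = 1 units is a substantive quantity we can still get wrong.

X <- rep(0:1, 250)
Y_star <- 1 + X + 2 * rnorm(500)
mean(scale(Y_star)) # 0 by construction
mean(scale(Y_star)[X == 1]) # the conditional mean we actually target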

A second challenge is deciding which measurements to combine into the index. We clearly only want to create indices using measurements of the same latent variable, but it can be hard to be sure which latent variable, exactly, is being measured. Relying on whether the measurements are positively correlated is not sufficient, because measurements can be correlated even if they are not measurements of the same underlying variable. Ultimately we have to rely on theory to make these decisions, as uncomfortable as that may make us.

Lastly, we have to choose a procedure to combine the multiple measures into a single index. Many such procedures exist. Here we’ll consider the most common approach, which is simply to take the average of the three raw measures.

In Declaration 14.6, our inquiry is the average level of the latent variable among units whose binary covariate X is equal to one.

In the declaration below, Y_star has a normal distribution but is not centered on zero. The measured variables Y_1, Y_2, Y_3 are also normally distributed but each has its own scale; they are all related to Y_star, though some more strongly than others. We construct several indices. The first family scales each measured variable before combining, akin to the approach used in Kling, Liebman, and Katz (2007): Y_av sums the scaled components, Y_av_scaled rescales that sum once more, and Y_av_adj rescales each component using the mean and standard deviation of the X = 0 group before averaging. The final index (Y_fa) instead weights the components using weights generated by “principal components analysis,” which, intuitively, seeks to find a weighting that minimizes the distance to the measured variables.

Declaration 14.6 \(~\)

design <-
  declare_model(
    N = 500,
    X = rep(0:1, N / 2),
    Y_star = 1 + X + 2 * rnorm(N)
  ) +
  declare_inquiry(Y_bar_X1 = mean(scale(Y_star)[X == 1])) +
  declare_measurement(
    # three noisy measures of Y_star, each on its own scale
    Y_1 = 3 + 0.1 * Y_star + rnorm(N, sd = 5),
    Y_2 = 2 + 1.0 * Y_star + rnorm(N, sd = 2),
    Y_3 = 1 + 0.5 * Y_star + rnorm(N, sd = 1),
    # sum of the scaled components
    Y_av = scale(Y_1) + scale(Y_2) + scale(Y_3),
    Y_av_adj = (
      # rescaling according to the X = 0 group
      (Y_1 - mean(Y_1[X == 0])) / sd(Y_1[X == 0]) +
        (Y_2 - mean(Y_2[X == 0])) / sd(Y_2[X == 0]) +
        (Y_3 - mean(Y_3[X == 0])) / sd(Y_3[X == 0])
    ) / 3,
    # rescaling the sum of the scaled components
    Y_av_scaled = scale(scale(Y_1) + scale(Y_2) + scale(Y_3)),
    # first principal component of the three measures
    Y_fa = princomp(~ Y_1 + Y_2 + Y_3, cor = TRUE)$scores[, 1]
  ) + 
  declare_estimator(
    Y_av ~ 1,
    model = lm_robust,
    inquiry = "Y_bar_X1",
    subset = X == 1,
    label = "Average"
  ) +
  declare_estimator(
    Y_av_scaled ~ 1,
    model = lm_robust,
    inquiry = "Y_bar_X1",
    subset = X == 1,
    label = "Average (rescaled)"
  ) +
  declare_estimator(
    Y_av_adj ~ 1,
    model = lm_robust,
    inquiry = "Y_bar_X1",
    subset = X == 1,
    label = "Average (adjusted)"
  ) +
  declare_estimator(
    Y_fa ~ 1,
    model = lm_robust,
    inquiry = "Y_bar_X1",
    subset = X == 1,
    label = "Principal Components"
  )

\(~\)

In Figure 14.6 we show that the correlation between each of the indices and the underlying quantity is quite good, even though the strength of the correlation for some of the components is weak. Trickier, however, is being sure we have an interpretable scale.

Figure 14.6: Correlations between the latent variable, measured components, and indices.

diagnosis <- diagnose_design(design) 

Figure 14.7 shows the distribution of estimates from different approaches to generating indices. A few features stand out from the distribution of estimates.

First, the principal components version appears to do poorly, but there is a simple reason for this: the method does not presuppose knowledge of the direction of the scale of the latent variable and can produce an index that reverses the actual scale of interest. Avoiding this requires an interpretative step after the principal components analysis is implemented (more subtly, the averaging approach also has an interpretative step before averaging, when components are introduced with a metric that presupposes a positive correlation with the underlying quantity). Even accounting for the different sign patterns, however, the estimates are too small.

The simple averaging of the normalized scales, by contrast, generates estimates that are too large on average. The two rescaling approaches fare somewhat differently. The approach that simply rescales again after rescaling the component parts produces bias in the opposite direction. The approach in which the component parts are rescaled using the means and standard deviations observed in the X = 0 group does well, however. The key insight here is that the total variation in the latent variable combines the variation within groups and between them: we want to measure outcomes on a scale benchmarked to the within-group variation and so have to take out the between-group variation when rescaling.

Figure 14.7: Distribution of estimates from different approaches to generating indices

Overall, we see from the diagnosis that we can do quite well here in recovering the conditional mean of the standardized latent variable. But we also see risks. We declared a design under optimistic conditions in which we knew the relevant components and these were all related to the latent variable in a simple way. Even in this case, we had difficulty getting the answer right.

14.4.1 Exercises

  1. Modify the design so that only two rather than three observed components are used. How is the diagnosis affected? Does it matter which two components you use?
  2. How is estimation affected if you have the wrong model linking measures to the latent variable? Try a design in which the measures are nonlinear in the latent variable.