8 min read

Survey tables in R

I have found surprisingly few CRAN packages specifically written with survey data in mind–questionr and xtable are exceptions. Fortunately, a few fairly recent packages in (or associated with) the tidyverse have made writing one’s own functions to handle survey data much easier. It might seem redundant for many individual R users to write their own survey data functions, but I think the benefits of tailoring these tools to the idiosyncracies of your specific dataset(s) is well worth the up-front effort. That’s why I wrote the MLSPTools package specifically for the Marquette Law School Pool. Anyone is welcome to view (and even download) this package on github, but they may have trouble using it on their own datasets because I’ve included specific variable names throughout the functions. Still, they may provide useful guides to writing custom functions for other datasets.

Below I give two simple examples of functions that build basic survey summary tables–toplines and crosstabs. You could easily set the default arguments to the weight variable, for instance, to your most common use-case.

My basic workflow is this:

  1. Use dplyr to summarize the data.
  2. Use the pivot functions from tidyr to control whether the output is long or wide.
  3. Pass the data.frame object to knitr::kable for pretty html or pdf outputs.
  4. Use the kableExtra package to apply custom table formatting as desired.

Background reading

I use non-standard evaluation in these functions, even though I don’t understand it entirely. Fortunately, you don’t need to understand exactly how it works to understand how to use it. The official “Programming with dplyr” vignette was invaluable in this process. I also found Sharla Gelfand’s “here’s what i know about tidyeval” very helpful.

If it wasn’t for the labelled package, I’d still be doing a lot of this in Stata. Read the vignette here.

When I want to alter the appearance of a table in some new way, Hao Zhu has almost always already provided the solution in his excellent package kableExtra. Here is his vignette for making html tables, and here is the tutorial for pdf latex tables.

Basic functions

Here is some example data:

library(tidyverse)
library(labelled)
library(knitr)
library(kableExtra)

# get some data to work with
gss <- forcats::gss_cat %>%
  # make up a weight variable
  mutate(fakeweight = rnorm(n = n(), mean = 1, sd = 0.2)) %>%
  # turn some factors into labelled data (as that is what I commonly work with)
  # I'm also reversing the factor levels to better match my preferred order
  mutate(marital = to_labelled(fct_rev(marital)),
         relig = to_labelled(fct_rev(relig))) %>%
  select(year, marital, relig, fakeweight)

As is often the case with survey data, the marital and relig columns consist of labelled integers. My summary table functions need to convert the labelled data to ordered factors.

head(gss)
## # A tibble: 6 x 4
##    year           marital                   relig fakeweight
##   <int>         <dbl+lbl>               <dbl+lbl>      <dbl>
## 1  2000 5 [Never married]  2 [Protestant]               1.23
## 2  2000 3 [Divorced]       2 [Protestant]               1.12
## 3  2000 2 [Widowed]        2 [Protestant]               1.09
## 4  2000 5 [Never married] 11 [Orthodox-christian]       1.25
## 5  2000 3 [Divorced]       5 [None]                     1.06
## 6  2000 1 [Married]        2 [Protestant]               1.04

The simplest function just makes basic toplines.

make.topline <- function(variable, data, weights){
  data %>%
    # remove any cases with missing data
    filter(!is.na({{variable}})) %>%
    # convert the labelled data to ordered factors
    mutate({{variable}} := to_factor({{variable}})) %>%
    # calculate the denominator
    mutate(total = sum({{weights}})) %>%
    group_by({{variable}}) %>%
    # calculate the frequencies and percentages
    summarise(frequency = sum({{weights}}),
              percent = (frequency/first(total))*100)
}

The output is a dataframe.

make.topline(variable = marital, data = gss, weights = fakeweight)
## # A tibble: 6 x 3
##   marital       frequency percent
##   <fct>             <dbl>   <dbl>
## 1 Married         10133.  47.2   
## 2 Widowed          1795.   8.35  
## 3 Divorced         3372.  15.7   
## 4 Separated         743.   3.46  
## 5 Never married    5424.  25.2   
## 6 No answer          17.8  0.0829

Since this is an html file, I’ll output the table as a nicely formatted table using kable and kableExtra.

make.topline(variable = marital, data = gss, weights = fakeweight) %>%
  kable(caption = "Marital status",
        digits = 0,
        format = "html") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 1: Marital status
marital frequency percent
Married 10133 47
Widowed 1795 8
Divorced 3372 16
Separated 743 3
Never married 5424 25
No answer 18 0

We can make a crosstab just by adding a few more steps.

make.crosstab <- function(xvar, yvar, data, weights, format = "wide"){
  output <- data %>%
    filter(!is.na({{xvar}}),
           !is.na({{yvar}})) %>%
    # convert labelled data to ordered factors
    mutate({{xvar}} := to_factor({{xvar}}),
           {{yvar}} := to_factor({{yvar}})) %>%
    group_by({{xvar}}) %>%
    mutate(total = sum({{weights}})) %>%
    group_by({{xvar}}, {{yvar}}) %>%
    summarise(percent = (sum({{weights}})/first(total))*100)
  
  if(format == "wide"){
    output <- output %>%
      pivot_wider(names_from = {{yvar}}, values_from = percent,
                  values_fill = list(percent = 0))
  }
  output
}

Wide output (the default) looks like this:

make.crosstab(xvar = relig, yvar = marital, data = gss, weights = fakeweight)
##                      relig  Married   Widowed  Divorced Separated
## 1               Protestant 49.66968 10.667005 16.425416  3.403529
## 2                 Catholic 49.71340  8.322072 13.272894  3.560061
## 3                   Jewish 50.91535  9.719786 14.484672  2.349942
## 4                     None 36.62723  3.336657 16.269888  3.499927
## 5                    Other 35.70249  3.972487 21.553323  2.328431
## 6                 Buddhism 44.29848  5.124769 14.607297  2.394883
## 7                 Hinduism 68.54787  0.000000  2.213122  0.000000
## 8            Other eastern 42.70310  6.391881 20.378180  3.786103
## 9             Moslem/islam 52.38179  0.000000 13.753795  3.056015
## 10      Orthodox-christian 66.45540  2.898814 13.678288  2.222485
## 11               Christian 40.50393  3.363736 19.333651  4.772236
## 12         Native american 31.71473  7.120186 16.562438  9.434870
## 13 Inter-nondenominational 43.43053  2.572419 23.006535  3.189294
## 14              Don't know 36.37545  0.000000  8.353109 17.078639
## 15               No answer 46.77067  6.882590 13.667494  3.635999
##    Never married  No answer
## 1       19.76679 0.06757869
## 2       25.10920 0.02237751
## 3       22.53025 0.00000000
## 4       40.21488 0.05141949
## 5       35.89176 0.55150788
## 6       32.94329 0.63128475
## 7       29.23901 0.00000000
## 8       26.74074 0.00000000
## 9       30.80840 0.00000000
## 10      14.74501 0.00000000
## 11      31.90159 0.12485347
## 12      35.16778 0.00000000
## 13      27.80122 0.00000000
## 14      38.19280 0.00000000
## 15      24.22811 4.81514191

Long output is typically better for making graphs. It looks like this:

make.crosstab(xvar = relig, yvar = marital, data = gss, weights = fakeweight, format = "long")
## # A tibble: 78 x 3
## # Groups:   relig [15]
##    relig      marital       percent
##    <fct>      <fct>           <dbl>
##  1 Protestant Married       49.7   
##  2 Protestant Widowed       10.7   
##  3 Protestant Divorced      16.4   
##  4 Protestant Separated      3.40  
##  5 Protestant Never married 19.8   
##  6 Protestant No answer      0.0676
##  7 Catholic   Married       49.7   
##  8 Catholic   Widowed        8.32  
##  9 Catholic   Divorced      13.3   
## 10 Catholic   Separated      3.56  
## # … with 68 more rows

Adding features

Once you understand the grouping and summarizing operations I’m using above, you can begin to add additional steps.

For example, you might want your topline function to reveal explicit NA values and calculate valid percents as well as total percents. Do this by using forcats::fct_explicit_na.

make.topline.with.na <- function(variable, data, weights){
  data %>%
    # Convert to ordered factors
    mutate({{variable}} := to_factor({{variable}}),
           {{variable}} := forcats::fct_explicit_na({{variable}})) %>%
    # Calculate denominator
    mutate(total = sum({{weights}}),
           valid.total = sum(({{weights}})[{{variable}} != "(Missing)"])) %>%
    # Calculate proportions
    group_by({{variable}}) %>%
    summarise(pct = (sum({{weights}})/first(total))*100,
              valid.pct = (sum({{weights}})/first(valid.total)*100),
              n = sum({{weights}})) %>%
    ungroup() %>%
    mutate(cum = cumsum(valid.pct),
           valid.pct = replace(valid.pct, {{variable}} == "(Missing)", NA),
           cum = replace(cum, {{variable}} == "(Missing)", NA)) %>%
    select(Response = {{variable}}, Frequency = n, Percent = pct,
           `Valid Percent` = valid.pct, `Cumulative Percent` = cum)
}

## invent 999 random NA values
gss.with.na <- gss %>%
  mutate(relig = replace(relig, row_number() %in% sample(1:nrow(gss), 999, replace = FALSE), NA))

If you don’t want to mess around with customizing the knitr::kable() output yourself, you can set format = "pandoc". These tables can’t be otherwise altered with kableExtra, but they do come with nice default settings.

make.topline.with.na(variable = relig, data = gss.with.na, weights = fakeweight) %>%
  kable(format = "pandoc", digits = 0)
Response Frequency Percent Valid Percent Cumulative Percent
Protestant 10317 48 50 50
Catholic 4900 23 24 74
Jewish 371 2 2 76
None 3374 16 16 93
Other 221 1 1 94
Buddhism 139 1 1 94
Hinduism 66 0 0 95
Other eastern 29 0 0 95
Moslem/islam 97 0 0 95
Orthodox-christian 92 0 0 96
Christian 653 3 3 99
Native american 22 0 0 99
Inter-nondenominational 102 0 0 100
Don’t know 11 0 0 100
No answer 85 0 0 100
(Missing) 1007 5 NA NA

A modified crosstab function that only makes time series can save a lot of work. Just make the default value for the xvar the name of your date variable.

make.time.series <- function(yvar, data, weights = fakeweight, date = year, format = "long"){
  output <- data %>%
    filter(!is.na({{date}}),
           !is.na({{yvar}})) %>%
    # convert labelled data to ordered factors
    mutate({{yvar}} := to_factor({{yvar}})) %>%
    group_by({{date}}) %>%
    mutate(total = sum({{weights}})) %>%
    group_by({{date}}, {{yvar}}) %>%
    summarise(percent = (sum({{weights}})/first(total))*100)
  
  if(format == "wide"){
    output <- output %>%
      pivot_wider(names_from = {{yvar}}, values_from = percent,
                  values_fill = list(percent = 0))
  }
  output
}

Notice that I also set the default value of the weight argument to “fakeweight”, further simplifying the function call.

make.time.series(relig, gss)
## # A tibble: 118 x 3
## # Groups:   year [8]
##     year relig              percent
##    <int> <fct>                <dbl>
##  1  2000 Protestant         54.3   
##  2  2000 Catholic           24.0   
##  3  2000 Jewish              2.19  
##  4  2000 None               14.0   
##  5  2000 Other               1.45  
##  6  2000 Buddhism            0.591 
##  7  2000 Hinduism            0.285 
##  8  2000 Other eastern       0.0331
##  9  2000 Moslem/islam        0.402 
## 10  2000 Orthodox-christian  0.432 
## # … with 108 more rows