I have found surprisingly few CRAN packages specifically written with survey data in mind–`questionr`

and `xtable`

are exceptions. Fortunately, a few fairly recent packages in (or associated with) the `tidyverse`

have made writing one’s own functions to handle survey data much easier. It might seem redundant for many individual R users to write their own survey data functions, but I think the benefits of tailoring these tools to the idiosyncracies of your specific dataset(s) is well worth the up-front effort. That’s why I wrote the `MLSPTools`

package specifically for the Marquette Law School Pool. Anyone is welcome to view (and even download) this package **on github**, but they may have trouble using it on their own datasets because I’ve included specific variable names throughout the functions. Still, they may provide useful guides to writing custom functions for other datasets.

Below I give two simple examples of functions that build basic survey summary tables–toplines and crosstabs. You could easily set the default arguments to the weight variable, for instance, to your most common use-case.

My basic workflow is this:

- Use
`dplyr`

to summarize the data. - Use the
`pivot`

functions from`tidyr`

to control whether the output is long or wide. - Pass the data.frame object to
`knitr::kable`

for pretty html or pdf outputs. - Use the
`kableExtra`

package to apply custom table formatting as desired.

**Background reading**

I use non-standard evaluation in these functions, even though I don’t understand it entirely. Fortunately, you don’t need to understand exactly how it works to understand how to use it. The official **“Programming with dplyr”** vignette was invaluable in this process. I also found Sharla Gelfand’s **“here’s what i know about tidyeval”** very helpful.

If it wasn’t for the `labelled`

package, I’d still be doing a lot of this in Stata. **Read the vignette** here.

When I want to alter the appearance of a table in some new way, Hao Zhu has almost always already provided the solution in his excellent package `kableExtra`

. Here is his vignette for **making html tables**, and here is the tutorial for **pdf latex tables**.

## Basic functions

Here is some example data:

```
library(tidyverse)
library(labelled)
library(knitr)
library(kableExtra)
# get some data to work with
gss <- forcats::gss_cat %>%
# make up a weight variable
mutate(fakeweight = rnorm(n = n(), mean = 1, sd = 0.2)) %>%
# turn some factors into labelled data (as that is what I commonly work with)
# I'm also reversing the factor levels to better match my preferred order
mutate(marital = to_labelled(fct_rev(marital)),
relig = to_labelled(fct_rev(relig))) %>%
select(year, marital, relig, fakeweight)
```

As is often the case with survey data, the `marital`

and `relig`

columns consist of labelled integers. My summary table functions need to convert the labelled data to ordered factors.

`head(gss)`

```
## # A tibble: 6 x 4
## year marital relig fakeweight
## <int> <dbl+lbl> <dbl+lbl> <dbl>
## 1 2000 5 [Never married] 2 [Protestant] 1.23
## 2 2000 3 [Divorced] 2 [Protestant] 1.12
## 3 2000 2 [Widowed] 2 [Protestant] 1.09
## 4 2000 5 [Never married] 11 [Orthodox-christian] 1.25
## 5 2000 3 [Divorced] 5 [None] 1.06
## 6 2000 1 [Married] 2 [Protestant] 1.04
```

The simplest function just makes basic toplines.

```
make.topline <- function(variable, data, weights){
data %>%
# remove any cases with missing data
filter(!is.na({{variable}})) %>%
# convert the labelled data to ordered factors
mutate({{variable}} := to_factor({{variable}})) %>%
# calculate the denominator
mutate(total = sum({{weights}})) %>%
group_by({{variable}}) %>%
# calculate the frequencies and percentages
summarise(frequency = sum({{weights}}),
percent = (frequency/first(total))*100)
}
```

The output is a dataframe.

`make.topline(variable = marital, data = gss, weights = fakeweight)`

```
## # A tibble: 6 x 3
## marital frequency percent
## <fct> <dbl> <dbl>
## 1 Married 10133. 47.2
## 2 Widowed 1795. 8.35
## 3 Divorced 3372. 15.7
## 4 Separated 743. 3.46
## 5 Never married 5424. 25.2
## 6 No answer 17.8 0.0829
```

Since this is an html file, I’ll output the table as a nicely formatted table using `kable`

and `kableExtra`

.

```
make.topline(variable = marital, data = gss, weights = fakeweight) %>%
kable(caption = "Marital status",
digits = 0,
format = "html") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
```

marital | frequency | percent |
---|---|---|

Married | 10133 | 47 |

Widowed | 1795 | 8 |

Divorced | 3372 | 16 |

Separated | 743 | 3 |

Never married | 5424 | 25 |

No answer | 18 | 0 |

We can make a crosstab just by adding a few more steps.

```
make.crosstab <- function(xvar, yvar, data, weights, format = "wide"){
output <- data %>%
filter(!is.na({{xvar}}),
!is.na({{yvar}})) %>%
# convert labelled data to ordered factors
mutate({{xvar}} := to_factor({{xvar}}),
{{yvar}} := to_factor({{yvar}})) %>%
group_by({{xvar}}) %>%
mutate(total = sum({{weights}})) %>%
group_by({{xvar}}, {{yvar}}) %>%
summarise(percent = (sum({{weights}})/first(total))*100)
if(format == "wide"){
output <- output %>%
pivot_wider(names_from = {{yvar}}, values_from = percent,
values_fill = list(percent = 0))
}
output
}
```

Wide output (the default) looks like this:

`make.crosstab(xvar = relig, yvar = marital, data = gss, weights = fakeweight)`

```
## relig Married Widowed Divorced Separated
## 1 Protestant 49.66968 10.667005 16.425416 3.403529
## 2 Catholic 49.71340 8.322072 13.272894 3.560061
## 3 Jewish 50.91535 9.719786 14.484672 2.349942
## 4 None 36.62723 3.336657 16.269888 3.499927
## 5 Other 35.70249 3.972487 21.553323 2.328431
## 6 Buddhism 44.29848 5.124769 14.607297 2.394883
## 7 Hinduism 68.54787 0.000000 2.213122 0.000000
## 8 Other eastern 42.70310 6.391881 20.378180 3.786103
## 9 Moslem/islam 52.38179 0.000000 13.753795 3.056015
## 10 Orthodox-christian 66.45540 2.898814 13.678288 2.222485
## 11 Christian 40.50393 3.363736 19.333651 4.772236
## 12 Native american 31.71473 7.120186 16.562438 9.434870
## 13 Inter-nondenominational 43.43053 2.572419 23.006535 3.189294
## 14 Don't know 36.37545 0.000000 8.353109 17.078639
## 15 No answer 46.77067 6.882590 13.667494 3.635999
## Never married No answer
## 1 19.76679 0.06757869
## 2 25.10920 0.02237751
## 3 22.53025 0.00000000
## 4 40.21488 0.05141949
## 5 35.89176 0.55150788
## 6 32.94329 0.63128475
## 7 29.23901 0.00000000
## 8 26.74074 0.00000000
## 9 30.80840 0.00000000
## 10 14.74501 0.00000000
## 11 31.90159 0.12485347
## 12 35.16778 0.00000000
## 13 27.80122 0.00000000
## 14 38.19280 0.00000000
## 15 24.22811 4.81514191
```

Long output is typically better for making graphs. It looks like this:

`make.crosstab(xvar = relig, yvar = marital, data = gss, weights = fakeweight, format = "long")`

```
## # A tibble: 78 x 3
## # Groups: relig [15]
## relig marital percent
## <fct> <fct> <dbl>
## 1 Protestant Married 49.7
## 2 Protestant Widowed 10.7
## 3 Protestant Divorced 16.4
## 4 Protestant Separated 3.40
## 5 Protestant Never married 19.8
## 6 Protestant No answer 0.0676
## 7 Catholic Married 49.7
## 8 Catholic Widowed 8.32
## 9 Catholic Divorced 13.3
## 10 Catholic Separated 3.56
## # … with 68 more rows
```

## Adding features

Once you understand the grouping and summarizing operations I’m using above, you can begin to add additional steps.

For example, you might want your topline function to reveal explicit NA values and calculate valid percents as well as total percents. Do this by using `forcats::fct_explicit_na`

.

```
make.topline.with.na <- function(variable, data, weights){
data %>%
# Convert to ordered factors
mutate({{variable}} := to_factor({{variable}}),
{{variable}} := forcats::fct_explicit_na({{variable}})) %>%
# Calculate denominator
mutate(total = sum({{weights}}),
valid.total = sum(({{weights}})[{{variable}} != "(Missing)"])) %>%
# Calculate proportions
group_by({{variable}}) %>%
summarise(pct = (sum({{weights}})/first(total))*100,
valid.pct = (sum({{weights}})/first(valid.total)*100),
n = sum({{weights}})) %>%
ungroup() %>%
mutate(cum = cumsum(valid.pct),
valid.pct = replace(valid.pct, {{variable}} == "(Missing)", NA),
cum = replace(cum, {{variable}} == "(Missing)", NA)) %>%
select(Response = {{variable}}, Frequency = n, Percent = pct,
`Valid Percent` = valid.pct, `Cumulative Percent` = cum)
}
## invent 999 random NA values
gss.with.na <- gss %>%
mutate(relig = replace(relig, row_number() %in% sample(1:nrow(gss), 999, replace = FALSE), NA))
```

If you don’t want to mess around with customizing the `knitr::kable()`

output yourself, you can set `format = "pandoc"`

. These tables can’t be otherwise altered with `kableExtra`

, but they do come with nice default settings.

```
make.topline.with.na(variable = relig, data = gss.with.na, weights = fakeweight) %>%
kable(format = "pandoc", digits = 0)
```

Response | Frequency | Percent | Valid Percent | Cumulative Percent |
---|---|---|---|---|

Protestant | 10317 | 48 | 50 | 50 |

Catholic | 4900 | 23 | 24 | 74 |

Jewish | 371 | 2 | 2 | 76 |

None | 3374 | 16 | 16 | 93 |

Other | 221 | 1 | 1 | 94 |

Buddhism | 139 | 1 | 1 | 94 |

Hinduism | 66 | 0 | 0 | 95 |

Other eastern | 29 | 0 | 0 | 95 |

Moslem/islam | 97 | 0 | 0 | 95 |

Orthodox-christian | 92 | 0 | 0 | 96 |

Christian | 653 | 3 | 3 | 99 |

Native american | 22 | 0 | 0 | 99 |

Inter-nondenominational | 102 | 0 | 0 | 100 |

Don’t know | 11 | 0 | 0 | 100 |

No answer | 85 | 0 | 0 | 100 |

(Missing) | 1007 | 5 | NA | NA |

A modified crosstab function that only makes time series can save a lot of work. Just make the default value for the `xvar`

the name of your date variable.

```
make.time.series <- function(yvar, data, weights = fakeweight, date = year, format = "long"){
output <- data %>%
filter(!is.na({{date}}),
!is.na({{yvar}})) %>%
# convert labelled data to ordered factors
mutate({{yvar}} := to_factor({{yvar}})) %>%
group_by({{date}}) %>%
mutate(total = sum({{weights}})) %>%
group_by({{date}}, {{yvar}}) %>%
summarise(percent = (sum({{weights}})/first(total))*100)
if(format == "wide"){
output <- output %>%
pivot_wider(names_from = {{yvar}}, values_from = percent,
values_fill = list(percent = 0))
}
output
}
```

Notice that I also set the default value of the weight argument to “fakeweight”, further simplifying the function call.

`make.time.series(relig, gss)`

```
## # A tibble: 118 x 3
## # Groups: year [8]
## year relig percent
## <int> <fct> <dbl>
## 1 2000 Protestant 54.3
## 2 2000 Catholic 24.0
## 3 2000 Jewish 2.19
## 4 2000 None 14.0
## 5 2000 Other 1.45
## 6 2000 Buddhism 0.591
## 7 2000 Hinduism 0.285
## 8 2000 Other eastern 0.0331
## 9 2000 Moslem/islam 0.402
## 10 2000 Orthodox-christian 0.432
## # … with 108 more rows
```