Stacked Likert Plots using ggplot

Steven Isley bio photo By Steven Isley

One issue I’ve run into when using R and ggplot is trying to create stacked horizontal bar plots in ggplot that don’t start at zero. In survey analysis these are often the simplest way to convey the results of a matrix-style question with a Likert response scale. There are some good packages out there - the one that gets the closest to what I want is Jason Bryer “likert” package.

However, I found it difficult to make modifications to this package to get just the plot I wanted. The underlying code was too complex for me to make simple changes. So, I finally got around to creating my own version. I’m not going to make it into a package or anything like that. I’m providing it as is so that anyone can hack it easily into whatever they need.

Overview

library(dplyr)
library(ggplot2)
library(knitr)
library(tidyr)
library(scales)

First, some example data. I’ll generate data that looks like common output from a survey. Each row will be a respondent, and the columns will be the answers to questions. Let’s assume that Q1 through Q5 were questions in a ‘matrix style’ group of questions. That means they are five questions that shared a common set of response options. For simplicitly, I’ll use the common “strongly agree to strongly disagree” scale.

# Simulation N responses
N <- 50
answers <- c("Strongly Disagree","Somewhat Disagree","Neither Agree nor Disagree",
              "Somewhat Agree", "Strongly Agree")
set.seed(12342)
d <- tibble(
  id = paste0("Respondent", 1:N),
  Q1 = sample(answers, N, replace=TRUE),
  Q2 = sample(answers, N, replace=TRUE),
  Q3 = sample(answers, N, replace=TRUE),
  Q4 = sample(answers, N, replace=TRUE),
  Q5 = sample(answers, N, replace=TRUE)
)

kable(d[1:10,])
id Q1 Q2 Q3 Q4 Q5
Respondent1 Somewhat Agree Strongly Disagree Neither Agree nor Disagree Somewhat Disagree Somewhat Agree
Respondent2 Somewhat Agree Neither Agree nor Disagree Strongly Disagree Strongly Agree Somewhat Agree
Respondent3 Strongly Disagree Somewhat Agree Strongly Disagree Neither Agree nor Disagree Somewhat Agree
Respondent4 Strongly Disagree Neither Agree nor Disagree Strongly Disagree Neither Agree nor Disagree Somewhat Disagree
Respondent5 Somewhat Disagree Strongly Disagree Somewhat Disagree Strongly Agree Somewhat Agree
Respondent6 Somewhat Disagree Somewhat Disagree Strongly Disagree Somewhat Disagree Somewhat Disagree
Respondent7 Strongly Disagree Somewhat Disagree Strongly Agree Strongly Agree Strongly Agree
Respondent8 Somewhat Disagree Somewhat Agree Strongly Disagree Neither Agree nor Disagree Strongly Disagree
Respondent9 Somewhat Disagree Strongly Disagree Somewhat Disagree Neither Agree nor Disagree Somewhat Agree
Respondent10 Strongly Agree Neither Agree nor Disagree Strongly Disagree Strongly Agree Strongly Agree

Now that we have some practice data, let’s reduce it to a form that will make plotting easier. What we want is for each row to be a statement, response, count summary. At this point I’ll also convert the responses into a factor with the levels ordered appropriately. This will be important later on.

d.reduced <- d %>%
  select(-id) %>%
  gather("Q", "ans") %>%
  group_by(Q, ans) %>%
  summarize(n=n()) %>%
  mutate(per = n/sum(n),
         ans = factor(ans, levels=answers)) %>%
  arrange(Q, ans)

Here are the first 10 rows of the reduced data set.

kable(d.reduced[1:10,])
Q ans n per
Q1 Strongly Disagree 12 0.24
Q1 Somewhat Disagree 11 0.22
Q1 Neither Agree nor Disagree 8 0.16
Q1 Somewhat Agree 7 0.14
Q1 Strongly Agree 12 0.24
Q2 Strongly Disagree 10 0.20
Q2 Somewhat Disagree 10 0.20
Q2 Neither Agree nor Disagree 13 0.26
Q2 Somewhat Agree 11 0.22
Q2 Strongly Agree 6 0.12

Now comes the actual work of creating a horizontal stacked bar plot. After some experimentation, I decided to forgo trying to use ggplot’s geom_bar function. It required too much hacking to make work appropriately. I found it was much (much) easier to just build the plot using geom_rect. It sounded harder at first, but the calculations are pretty straight forward. It also seamlessly handles odd vs. even numbers of factor levels. When you have an odd number of levels, convention is to take the middle level and center its box at zero. When you have an even number of levels then the two middle levels are placed on either side of zero. I’ll show examples of both

#tmp <-
stage1 <- d.reduced %>%
  mutate(text = paste0(formatC(100 * per, format="f", digits=0), "%"),
         cs = cumsum(per),
         offset = sum(per[1:(floor(n()/2))]) + (n() %% 2)*0.5*(per[ceiling(n()/2)]),
         xmax = -offset + cs,
         xmin = xmax-per) %>%
  ungroup()

The confusing line above is where offset is defined. Basically, each stacked bar is going to have a total length of 1. We need to figure out how much to offset each bar such that the ‘Likert center’ (either the midpoint of the middle level when there is an odd number of levels or between the middle two when there is an even number of levels) lines up at x=0. The floor(n()/2) component sums half the levels (rounded down), and the n() %%2 component accounts for the half level in case there is an odd number of levels (in an even-numbered level situation, this term goes to zero).

The next step (stage2 I’m calling it) is to arrange the questions by how ‘positive’ they are overall. It is common to arrange the items with the question that had the highest percent of positive responses at the top. This makes it very easy to see which questions got the most positive responses just by looking at the question order in the plot. To do that, we just need to figure out which question had the most positive xmax value, then set the questions y-values accordingly.

gap <- 0.2

stage2 <- stage1 %>%
  left_join(stage1 %>%
              group_by(Q) %>%
              summarize(max.xmax = max(xmax)) %>%
              mutate(r = row_number(max.xmax)),
            by = "Q") %>%
  arrange(desc(r)) %>%
  mutate(ymin = r - (1-gap)/2,
         ymax = r + (1-gap)/2)

Now we have everything in place to plot it.

ggplot(stage2) +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill=ans)) +
  geom_text(aes(x=(xmin+xmax)/2, y=(ymin+ymax)/2, label=text), size = 3) +
  scale_x_continuous("", labels=percent, breaks=seq(-1, 1, len=9), limits=c(-1, 1)) +
  scale_y_continuous("", breaks = 1:n_distinct(stage2$Q),
                     labels=rev(stage2 %>% distinct(Q) %>% .$Q)) +
  scale_fill_brewer("", palette = "BrBG")

Now let’s test if it still works with an even number of levels. I’ll do this by removing the middle option and then renormalzing the data

d.r1 <- d.reduced %>%
  filter(ans != "Neither Agree nor Disagree") %>%
  mutate(per = n/sum(n),
         ans = droplevels(ans))

kable(d.r1[1:10,])
Q ans n per
Q1 Strongly Disagree 12 0.2857143
Q1 Somewhat Disagree 11 0.2619048
Q1 Somewhat Agree 7 0.1666667
Q1 Strongly Agree 12 0.2857143
Q2 Strongly Disagree 10 0.2702703
Q2 Somewhat Disagree 10 0.2702703
Q2 Somewhat Agree 11 0.2972973
Q2 Strongly Agree 6 0.1621622
Q3 Strongly Disagree 14 0.3783784
Q3 Somewhat Disagree 6 0.1621622

Now let’s see if the above code still works (sorry for copy-paste breakage of the ‘do not repeat yourself’ rule, but I figured making a function earlier would detract from explaining how it works)

stage1 <- d.r1 %>%
  mutate(text = paste0(formatC(100 * per, format="f", digits=0), "%"),
         cs = cumsum(per),
         offset = sum(per[1:(floor(n()/2))]) + (n() %% 2)*0.5*(per[ceiling(n()/2)]),
         xmax = -offset + cs,
         xmin = xmax-per) %>%
  ungroup()

stage2 <- stage1 %>%
  left_join(stage1 %>%
              group_by(Q) %>%
              summarize(max.xmax = max(xmax)) %>%
              mutate(r = row_number(max.xmax)),
            by = "Q") %>%
  arrange(desc(r)) %>%
  mutate(ymin = r - (1-gap)/2,
         ymax = r + (1-gap)/2)

ggplot(stage2) +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill=ans)) +
  geom_text(aes(x=(xmin+xmax)/2, y=(ymin+ymax)/2, label=text), size = 3) +
  scale_x_continuous("", labels=percent, breaks=seq(-1, 1, len=9), limits=c(-1, 1)) +
  scale_y_continuous("", breaks = 1:n_distinct(stage2$Q),
                     labels=rev(stage2 %>% distinct(Q) %>% .$Q)) +
  scale_fill_brewer("", palette = "BrBG")

Ta-da! It works for an even number of levels as well without any special if-then cases.

But wait, there is an important edge case to consider. What if a response option was never chosen? In that case, the above code generates the wrong chart as it puts the wrong factor level in the middle of the plot. The trick to overcoming this is to start out by generating an empty dataset that has all combinations of questions and responses, then joining this to the real data. This ensures that any absent levels are present in the data, but with a value of zero. The math all works the same and the correct level will end up in the middle of the plot.

To test this, I’ll remove a couple rows from the original d.reduced data frame, then renormalize as before

d.r2 <- d.reduced[c(1:2, 4:5, 7:14, 16:25),]
d.r2 <- d.r2 %>%
  mutate(per = n/sum(n))

kable(d.r2[1:10,])
Q ans n per
Q1 Strongly Disagree 12 0.2857143
Q1 Somewhat Disagree 11 0.2619048
Q1 Somewhat Agree 7 0.1666667
Q1 Strongly Agree 12 0.2857143
Q2 Somewhat Disagree 10 0.2500000
Q2 Neither Agree nor Disagree 13 0.3250000
Q2 Somewhat Agree 11 0.2750000
Q2 Strongly Agree 6 0.1500000
Q3 Strongly Disagree 14 0.3414634
Q3 Somewhat Disagree 6 0.1463415

Here’s the final code to generate a horizontal stacked bar plot.

# Create the full combination of questions and response options
full.set <- expand.grid(unique(d.r2$Q), answers, stringsAsFactors=F) %>%
  mutate(Var2 = factor(Var2, answers))
names(full.set) <- names(d.r2)[1:2]

stage1 <- d.r2 %>%
  # Do the full join to keep levels that aren't in the original data
  full_join(full.set, by=names(full.set)) %>%
  # Need to reorder before calculating offsets
  arrange(Q, ans) %>%
  # NA's are missing levels, replace with zero
  mutate(n = replace_na(n, 0),
         per = replace_na(per, 0),
         # Hide the text label if level is missing
         text = if_else(n >0, paste0(formatC(100 * per, format="f", digits=0), "%"), ""),
         cs = cumsum(per),
         offset = sum(per[1:(floor(n()/2))]) + (n() %% 2)*0.5*(per[ceiling(n()/2)]),
         xmax = -offset + cs,
         xmin = xmax-per) %>%
  ungroup()

stage2 <- stage1 %>%
  left_join(stage1 %>%
              group_by(Q) %>%
              summarize(max.xmax = max(xmax)) %>%
              mutate(r = row_number(max.xmax)),
            by = "Q") %>%
  arrange(desc(r)) %>%
  mutate(ymin = r - (1-gap)/2,
         ymax = r + (1-gap)/2)

ggplot(stage2) +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill=ans)) +
  geom_text(aes(x=(xmin+xmax)/2, y=(ymin+ymax)/2, label=text), size = 3) +
  scale_x_continuous("", labels=percent, breaks=seq(-1, 1, len=9), limits=c(-1, 1)) +
  scale_y_continuous("", breaks = 1:n_distinct(stage2$Q),
                     labels=rev(stage2 %>% distinct(Q) %>% .$Q)) +
  scale_fill_brewer("", palette = "BrBG")

The chart is easy to customize, and the implementation can easily be made into a function. Happy plotting.