One issue I’ve run into when using R and ggplot is trying to create stacked horizontal bar plots in ggplot that don’t start at zero. In survey analysis these are often the simplest way to convey the results of a matrix-style question with a Likert response scale. There are some good packages out there - the one that gets the closest to what I want is Jason Bryer “likert” package.

However, I found it difficult to make modifications to this package to get just the plot I wanted. The underlying code was too complex for me to make simple changes. So, I finally got around to creating my own version. I’m not going to make it into a package or anything like that. I’m providing it as is so that anyone can hack it easily into whatever they need.

Overview

library(dplyr)
library(ggplot2)
library(knitr)
library(tidyr)
library(scales)

First, some example data. I’ll generate data that looks like common output from a survey. Each row will be a respondent, and the columns will be the answers to questions. Let’s assume that Q1 through Q5 were questions in a ‘matrix style’ group of questions. That means they are five questions that shared a common set of response options. For simplicitly, I’ll use the common “strongly agree to strongly disagree” scale.

# Simulation N responses
N <- 50
answers <- c("Strongly Disagree","Somewhat Disagree","Neither Agree nor Disagree",
              "Somewhat Agree", "Strongly Agree")
set.seed(12342)
d <- tibble(
  id = paste0("Respondent", 1:N),
  Q1 = sample(answers, N, replace=TRUE),
  Q2 = sample(answers, N, replace=TRUE),
  Q3 = sample(answers, N, replace=TRUE),
  Q4 = sample(answers, N, replace=TRUE),
  Q5 = sample(answers, N, replace=TRUE)
)

kable(d[1:10,])

id	Q1	Q2	Q3	Q4	Q5
Respondent1	Somewhat Agree	Strongly Disagree	Neither Agree nor Disagree	Somewhat Disagree	Somewhat Agree
Respondent2	Somewhat Agree	Neither Agree nor Disagree	Strongly Disagree	Strongly Agree	Somewhat Agree
Respondent3	Strongly Disagree	Somewhat Agree	Strongly Disagree	Neither Agree nor Disagree	Somewhat Agree
Respondent4	Strongly Disagree	Neither Agree nor Disagree	Strongly Disagree	Neither Agree nor Disagree	Somewhat Disagree
Respondent5	Somewhat Disagree	Strongly Disagree	Somewhat Disagree	Strongly Agree	Somewhat Agree
Respondent6	Somewhat Disagree	Somewhat Disagree	Strongly Disagree	Somewhat Disagree	Somewhat Disagree
Respondent7	Strongly Disagree	Somewhat Disagree	Strongly Agree	Strongly Agree	Strongly Agree
Respondent8	Somewhat Disagree	Somewhat Agree	Strongly Disagree	Neither Agree nor Disagree	Strongly Disagree
Respondent9	Somewhat Disagree	Strongly Disagree	Somewhat Disagree	Neither Agree nor Disagree	Somewhat Agree
Respondent10	Strongly Agree	Neither Agree nor Disagree	Strongly Disagree	Strongly Agree	Strongly Agree

Now that we have some practice data, let’s reduce it to a form that will make plotting easier. What we want is for each row to be a statement, response, count summary. At this point I’ll also convert the responses into a factor with the levels ordered appropriately. This will be important later on.

d.reduced <- d %>%
  select(-id) %>%
  gather("Q", "ans") %>%
  group_by(Q, ans) %>%
  summarize(n=n()) %>%
  mutate(per = n/sum(n),
         ans = factor(ans, levels=answers)) %>%
  arrange(Q, ans)

Here are the first 10 rows of the reduced data set.

kable(d.reduced[1:10,])

Q	ans	n	per
Q1	Strongly Disagree	12	0.24
Q1	Somewhat Disagree	11	0.22
Q1	Neither Agree nor Disagree	8	0.16
Q1	Somewhat Agree	7	0.14
Q1	Strongly Agree	12	0.24
Q2	Strongly Disagree	10	0.20
Q2	Somewhat Disagree	10	0.20
Q2	Neither Agree nor Disagree	13	0.26
Q2	Somewhat Agree	11	0.22
Q2	Strongly Agree	6	0.12

Now comes the actual work of creating a horizontal stacked bar plot. After some experimentation, I decided to forgo trying to use ggplot’s geom_bar function. It required too much hacking to make work appropriately. I found it was much (much) easier to just build the plot using geom_rect. It sounded harder at first, but the calculations are pretty straight forward. It also seamlessly handles odd vs. even numbers of factor levels. When you have an odd number of levels, convention is to take the middle level and center its box at zero. When you have an even number of levels then the two middle levels are placed on either side of zero. I’ll show examples of both

#tmp <-
stage1 <- d.reduced %>%
  mutate(text = paste0(formatC(100 * per, format="f", digits=0), "%"),
         cs = cumsum(per),
         offset = sum(per[1:(floor(n()/2))]) + (n() %% 2)*0.5*(per[ceiling(n()/2)]),
         xmax = -offset + cs,
         xmin = xmax-per) %>%
  ungroup()

The confusing line above is where offset is defined. Basically, each stacked bar is going to have a total length of 1. We need to figure out how much to offset each bar such that the ‘Likert center’ (either the midpoint of the middle level when there is an odd number of levels or between the middle two when there is an even number of levels) lines up at x=0. The floor(n()/2) component sums half the levels (rounded down), and the n() %%2 component accounts for the half level in case there is an odd number of levels (in an even-numbered level situation, this term goes to zero).

The next step (stage2 I’m calling it) is to arrange the questions by how ‘positive’ they are overall. It is common to arrange the items with the question that had the highest percent of positive responses at the top. This makes it very easy to see which questions got the most positive responses just by looking at the question order in the plot. To do that, we just need to figure out which question had the most positive xmax value, then set the questions y-values accordingly.

gap <- 0.2

stage2 <- stage1 %>%
  left_join(stage1 %>%
              group_by(Q) %>%
              summarize(max.xmax = max(xmax)) %>%
              mutate(r = row_number(max.xmax)),
            by = "Q") %>%
  arrange(desc(r)) %>%
  mutate(ymin = r - (1-gap)/2,
         ymax = r + (1-gap)/2)

Now we have everything in place to plot it.

ggplot(stage2) +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill=ans)) +
  geom_text(aes(x=(xmin+xmax)/2, y=(ymin+ymax)/2, label=text), size = 3) +
  scale_x_continuous("", labels=percent, breaks=seq(-1, 1, len=9), limits=c(-1, 1)) +
  scale_y_continuous("", breaks = 1:n_distinct(stage2$Q),
                     labels=rev(stage2 %>% distinct(Q) %>% .$Q)) +
  scale_fill_brewer("", palette = "BrBG")

Now let’s test if it still works with an even number of levels. I’ll do this by removing the middle option and then renormalzing the data

d.r1 <- d.reduced %>%
  filter(ans != "Neither Agree nor Disagree") %>%
  mutate(per = n/sum(n),
         ans = droplevels(ans))

kable(d.r1[1:10,])

Q	ans	n	per
Q1	Strongly Disagree	12	0.2857143
Q1	Somewhat Disagree	11	0.2619048
Q1	Somewhat Agree	7	0.1666667
Q1	Strongly Agree	12	0.2857143
Q2	Strongly Disagree	10	0.2702703
Q2	Somewhat Disagree	10	0.2702703
Q2	Somewhat Agree	11	0.2972973
Q2	Strongly Agree	6	0.1621622
Q3	Strongly Disagree	14	0.3783784
Q3	Somewhat Disagree	6	0.1621622

Now let’s see if the above code still works (sorry for copy-paste breakage of the ‘do not repeat yourself’ rule, but I figured making a function earlier would detract from explaining how it works)

stage1 <- d.r1 %>%
  mutate(text = paste0(formatC(100 * per, format="f", digits=0), "%"),
         cs = cumsum(per),
         offset = sum(per[1:(floor(n()/2))]) + (n() %% 2)*0.5*(per[ceiling(n()/2)]),
         xmax = -offset + cs,
         xmin = xmax-per) %>%
  ungroup()

stage2 <- stage1 %>%
  left_join(stage1 %>%
              group_by(Q) %>%
              summarize(max.xmax = max(xmax)) %>%
              mutate(r = row_number(max.xmax)),
            by = "Q") %>%
  arrange(desc(r)) %>%
  mutate(ymin = r - (1-gap)/2,
         ymax = r + (1-gap)/2)

ggplot(stage2) +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill=ans)) +
  geom_text(aes(x=(xmin+xmax)/2, y=(ymin+ymax)/2, label=text), size = 3) +
  scale_x_continuous("", labels=percent, breaks=seq(-1, 1, len=9), limits=c(-1, 1)) +
  scale_y_continuous("", breaks = 1:n_distinct(stage2$Q),
                     labels=rev(stage2 %>% distinct(Q) %>% .$Q)) +
  scale_fill_brewer("", palette = "BrBG")

Ta-da! It works for an even number of levels as well without any special if-then cases.

But wait, there is an important edge case to consider. What if a response option was never chosen? In that case, the above code generates the wrong chart as it puts the wrong factor level in the middle of the plot. The trick to overcoming this is to start out by generating an empty dataset that has all combinations of questions and responses, then joining this to the real data. This ensures that any absent levels are present in the data, but with a value of zero. The math all works the same and the correct level will end up in the middle of the plot.

To test this, I’ll remove a couple rows from the original d.reduced data frame, then renormalize as before

d.r2 <- d.reduced[c(1:2, 4:5, 7:14, 16:25),]
d.r2 <- d.r2 %>%
  mutate(per = n/sum(n))

kable(d.r2[1:10,])

Q	ans	n	per
Q1	Strongly Disagree	12	0.2857143
Q1	Somewhat Disagree	11	0.2619048
Q1	Somewhat Agree	7	0.1666667
Q1	Strongly Agree	12	0.2857143
Q2	Somewhat Disagree	10	0.2500000
Q2	Neither Agree nor Disagree	13	0.3250000
Q2	Somewhat Agree	11	0.2750000
Q2	Strongly Agree	6	0.1500000
Q3	Strongly Disagree	14	0.3414634
Q3	Somewhat Disagree	6	0.1463415

Here’s the final code to generate a horizontal stacked bar plot.

# Create the full combination of questions and response options
full.set <- expand.grid(unique(d.r2$Q), answers, stringsAsFactors=F) %>%
  mutate(Var2 = factor(Var2, answers))
names(full.set) <- names(d.r2)[1:2]

stage1 <- d.r2 %>%
  # Do the full join to keep levels that aren't in the original data
  full_join(full.set, by=names(full.set)) %>%
  # Need to reorder before calculating offsets
  arrange(Q, ans) %>%
  # NA's are missing levels, replace with zero
  mutate(n = replace_na(n, 0),
         per = replace_na(per, 0),
         # Hide the text label if level is missing
         text = if_else(n >0, paste0(formatC(100 * per, format="f", digits=0), "%"), ""),
         cs = cumsum(per),
         offset = sum(per[1:(floor(n()/2))]) + (n() %% 2)*0.5*(per[ceiling(n()/2)]),
         xmax = -offset + cs,
         xmin = xmax-per) %>%
  ungroup()

stage2 <- stage1 %>%
  left_join(stage1 %>%
              group_by(Q) %>%
              summarize(max.xmax = max(xmax)) %>%
              mutate(r = row_number(max.xmax)),
            by = "Q") %>%
  arrange(desc(r)) %>%
  mutate(ymin = r - (1-gap)/2,
         ymax = r + (1-gap)/2)

ggplot(stage2) +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill=ans)) +
  geom_text(aes(x=(xmin+xmax)/2, y=(ymin+ymax)/2, label=text), size = 3) +
  scale_x_continuous("", labels=percent, breaks=seq(-1, 1, len=9), limits=c(-1, 1)) +
  scale_y_continuous("", breaks = 1:n_distinct(stage2$Q),
                     labels=rev(stage2 %>% distinct(Q) %>% .$Q)) +
  scale_fill_brewer("", palette = "BrBG")

The chart is easy to customize, and the implementation can easily be made into a function. Happy plotting.

Steven Isley Portfolio/Blog/etc.

Code, writing, thoughts, and more.

Stacked Likert Plots using ggplot

Overview