A tutorial on Likert plots, a.k.a. diverging stacked bar charts, with ggplot only, with example data from the Arab Barometer III survey. Also discussed are some common questions regarding complex plots with ggplot, for example, ordering factors in a plot and handling negative y-values.
Diverging stacked bar charts are one of the best options for displaying ordinal or Likert-type data, which is often collected through survey research. However, few software options provide templates for such plots. In Microsoft Excel, the only way to produce such a plot without a custom Visual Basic script is hack-horrific; see, for instance, this incredibly tedious example.
Even in R, our go-to favorite for visualization (i.e., Hadley Wickham’s ggplot2 package) does not give us an easy solution for creating such plots. Two additional packages do warrant specific comment and praise, and readers would be well advised to check them out: Jason Bryer’s likert package and Heiberger’s HH package. At the very least, one should read Robbins and Heiberger’s “Plotting Likert and Other Rating Scales” for an excellent discussion of the advantages and disadvantages of using diverging stacked bar charts for data visualization.
While both of these packages provide excellent tools for visualizing Likert-type data, I do have one complaint about both: they work best with full data frames but not so well with pre-summarized data. The scenario I encounter most often is as follows. I conduct a survey, but as with most real-life sample surveys, I need to include weights or use some complex sample design. For such work, I think that Thomas Lumley’s survey package is one of the best tools that R has to offer a survey researcher like myself. As a result, I often need to pass pre-summarized (e.g., weighted) estimates to my plot. The likert package does include an option for pre-summarized data, but there is limited flexibility with this option.
Preparing the Data
I am going to use example data from the Arab Barometer Wave III Survey. I applied the default survey weight and calculated estimates for twelve countries on a Likert-type rating of confidence in the future economy. Respondents were asked, “What do you think will be the economic situation in your country during the next few years (3-5 years) compared to the current situation?” The pre-summarized data can be downloaded here in CSV format. I chose a variable with five categories, to make this example a little more interesting.
To start, I load a few packages. Some of these aren’t strictly necessary for this example but are included for some additional aesthetics or to create code that can easily be re-purposed for other data. For example, the stringr package is used to wrap our axis labels to a max of 40 characters. None of the country names here are over this limit, but by building this in, this script could easily be re-purposed to plot responses to multiple question items, where we would want to wrap these longer labels. Note that I manually assign line breaks in the title with the \n newline notation, and I escape the quotation marks, also with a backslash.
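As a minimal sketch of the label wrapping and title escaping described above: the two labels below are hypothetical stand-ins (one short country name, one long question item), and the line breaks in the title are my own placement rather than the ones used in the original script.

```r
library(stringr)

# Hypothetical labels: a short country name and a long question item
labels <- c(
  "Jordan",
  "Democracy may have its problems, but it is better than any other form of government"
)

# Wrap to at most 40 characters per line; short labels pass through unchanged
wrapped <- str_wrap(labels, width = 40)

# Title with manual line breaks (\n) and escaped quotation marks (\")
mytitle <- "\"What do you think will be the economic situation\nin your country during the next few years (3-5 years)\ncompared to the current situation?\"\n"

cat(wrapped[2], "\n\n")
cat(mytitle)
```

Wrapping every label, even when none needs it, keeps the script reusable for plots where the x-axis holds full question wordings rather than country names.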
To handle the center category of “Almost the same,” I am going to divide this estimate by two, and then include it twice. That way, I can plot half of this category below my center line, and I can plot the other half above this line.
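A small sketch of this halving step, using made-up proportions for two placeholder countries (these are not the Arab Barometer estimates, and the column names are assumptions chosen for illustration):

```r
# Hypothetical pre-summarized proportions; each row sums to 1
tab <- data.frame(
  country         = c("A", "B"),
  much_worse      = c(0.10, 0.30),
  somewhat_worse  = c(0.20, 0.30),
  same            = c(0.30, 0.20),
  somewhat_better = c(0.30, 0.15),
  much_better     = c(0.10, 0.05)
)

# Halve the center category and keep it twice: one half for each side
tab$same_low  <- tab$same / 2
tab$same_high <- tab$same / 2
tab$same <- NULL

# Put the two halves back in the middle of the column order
tab <- tab[, c("country", "much_worse", "somewhat_worse", "same_low",
               "same_high", "somewhat_better", "much_better")]
tab
```

The table now has six estimate columns for five response categories, which is why the colors and legend need separate handling later on.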
I don’t want my chart to have too much blank space, so I want to find the closest 25% break below my minimum value and above my maximum value, so I can pass these to ggplot’s limits option. However, for this specific example, because Lebanon’s outlook on the economic future is so poor and Kuwait’s outlook is so positive, I need the full range on my axis.
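One way to compute such limits is to round outward to the nearest multiple of 25. The sketch below assumes values already expressed in percent; `low_extent` and `high_extent` are hypothetical numbers standing in for the longest negative and positive stacks in the data.

```r
# Hypothetical extents of the longest stacks, in percent
low_extent  <- -62  # furthest reach below the center line
high_extent <-  41  # furthest reach above the center line

# Round outward to the closest 25% break
mymin <- floor(low_extent / 25) * 25
mymax <- ceiling(high_extent / 25) * 25

c(mymin, mymax)
```

These two values can then be passed to `scale_y_continuous()` as `limits = c(mymin, mymax)` and `breaks = seq(mymin, mymax, 25)`.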
For applying colors, I have a data set with six columns of estimates, but I still only have five levels on my factor. So, I need to create two palettes, so that I can use one to color my plot and one to make the legend. So first I take a diverging red-blue palette from the RColorBrewer package with five levels. Because I think the grey at the midpoint of this palette is too light, I replace it with a darker grey, with hex code #DFDFDF. I then save this palette as legend.pal before manipulating my original palette to include the middle grey color twice. So now I have one palette with length five and one with length six.
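The palette steps just described might look roughly like this. The object name `legend.pal` comes from the text; `pal` is an assumed name for the working palette.

```r
library(RColorBrewer)

# Five-level diverging red-blue palette
pal <- brewer.pal(5, "RdBu")

# Swap the light midpoint for a darker grey
pal[3] <- "#DFDFDF"

# Keep a five-color copy for the legend
legend.pal <- pal

# Repeat the middle grey so the plotting palette has six colors,
# one per estimate column
pal <- c(pal[1:3], pal[3:5])

length(legend.pal)
length(pal)
```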
There are a few final manipulations before we pass our data frame to ggplot. First, I melt() the data frame into long form with the reshape2 package. Then, I manually assign colors, using my palette with length six. I multiply everything by 100 to get percentages rather than decimal proportions. I wrap long labels (though again, this is unnecessary for this specific example).
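A sketch of these manipulations on a toy table. The data values, the column names, and the six hex colors are all stand-ins for illustration, not the survey estimates or necessarily the exact palette.

```r
library(reshape2)

# Hypothetical wide table with the center category already halved
tab <- data.frame(
  outcome         = c("A", "B"),
  much_worse      = c(0.10, 0.30),
  somewhat_worse  = c(0.20, 0.30),
  same_low        = c(0.15, 0.10),
  same_high       = c(0.15, 0.10),
  somewhat_better = c(0.30, 0.15),
  much_better     = c(0.10, 0.05)
)

# Stand-in six-color palette with the middle grey repeated
pal <- c("#CA0020", "#F4A582", "#DFDFDF", "#DFDFDF", "#92C5DE", "#0571B0")

# Long form: one row per outcome and response category
long <- melt(tab, id.vars = "outcome")

# melt() returns 'variable' as a factor in column order,
# so its integer codes index the palette directly
long$col <- pal[as.integer(long$variable)]

# Proportions to percentages
long$value <- long$value * 100

head(long)
```

Storing the color as a column, rather than mapping fill to the category, is what later allows `scale_fill_identity()` to use the hex codes as given.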
I want to order my plot from least optimistic to most optimistic. Ordering with ggplot is sometimes less than intuitive. I am going to include country as my x-variable in ggplot’s aes() mapping. If this variable is of character class, ggplot will alphabetize it by default. If this variable is a factor, ggplot will use the order of the factor levels. So, I factor country with levels ordered by the sum of optimistic ratings.
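For instance, with a hypothetical long-format data frame (three placeholder countries, values in percent), the factor levels could be set from the summed optimistic ratings like so:

```r
# Hypothetical long-format data; values are made up
long <- data.frame(
  outcome  = rep(c("A", "B", "C"), each = 3),
  variable = rep(c("much_worse", "somewhat_better", "much_better"), times = 3),
  value    = c(20, 30, 10,
               60, 15,  5,
               10, 40, 20)
)

# Sum the optimistic categories within each country
upbeat   <- long[long$variable %in% c("somewhat_better", "much_better"), ]
optimism <- tapply(upbeat$value, upbeat$outcome, sum)

# Levels run from least to most optimistic
long$outcome <- factor(long$outcome, levels = names(sort(optimism)))

levels(long$outcome)
```

Because the ordering lives in the factor levels, it carries over to any subset of the data frame that is plotted later.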
I then split my data frame into two equal halves. The lows data frame contains estimates for “Much worse,” “Somewhat worse,” and half of the “Almost the same” category. Likewise, the highs data frame contains half of the “Almost the same” category, “Somewhat better,” and “Much better.”
However, there is another ordering task to note here. By default, ggplot will stack bars in the order in which they appear in the data frame. I need to completely reverse the order of the lows data frame, because I am going to plot these as negative values.
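A sketch of the split and the reversal on a single hypothetical country. The names `lows` and `highs` follow the text; the category names and values are assumptions.

```r
vars <- c("much_worse", "somewhat_worse", "same_low",
          "same_high", "somewhat_better", "much_better")

# Hypothetical long-format data for one country, in percent
long <- data.frame(
  outcome  = "A",
  variable = factor(vars, levels = vars),
  value    = c(10, 20, 15, 15, 30, 10)
)

# Lower half: the two negative categories plus half of the middle
lows  <- long[long$variable %in% vars[1:3], ]
# Upper half: the other half of the middle plus the two positive categories
highs <- long[long$variable %in% vars[4:6], ]

# Reverse the row order of the lows so the half-middle category comes
# first and therefore sits next to the center line when plotted downward
lows <- lows[rev(seq_len(nrow(lows))), ]

as.character(lows$variable)
```

Whether row order or group order drives the stacking depends on the ggplot2 version in use, so it is worth checking the rendered plot after this step.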
Below is the call to ggplot(). I’m going to walk through this line-by-line. First, I plot the high values as I would any other stacked bar in ggplot. Next, I plot the low values, but in the aes() mapping, I specify -value so that these are plotted below the axis. This will throw a warning message, Stacking not well defined when ymin != 0, but we can safely ignore this. Sometimes with ggplot, negative bar values conflict with how colors are mapped or items are ordered, and this is why it is necessary to define color mappings and order manually, as done above.
Next I draw a line indicating the midpoint on the scale. This is the midpoint on the Likert-type scale, not necessarily the midpoint on any distribution. Because my two data frames are ordered in opposite directions, and because I have six bars for each country but only five categories, I need to use the legend.pal I defined above to make my legend. I do this with the scale_fill_identity option.
I apply the theme_fivethirtyeight() theme from the ggthemes package, and I flip the axes so that I have a horizontal bar chart. I add the title and labels, adjust some font sizes, move the legend to the bottom, and add grid lines at 25% intervals.
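Putting the pieces together, the plotting call walked through above could be sketched as follows. This is an illustration on made-up data for two placeholder countries, not the original script. Two assumptions are worth flagging: recent ggplot2 releases stack by group rather than by row order, so the sketch sets `group` explicitly and uses `position_stack(reverse = TRUE)` for the upper half; and `theme_fivethirtyeight()` is left out to keep the example free of the ggthemes dependency.

```r
library(ggplot2)

vars <- c("much_worse", "somewhat_worse", "same_low",
          "same_high", "somewhat_better", "much_better")
# Stand-in palettes: six colors for plotting, five for the legend
pal        <- c("#CA0020", "#F4A582", "#DFDFDF", "#DFDFDF", "#92C5DE", "#0571B0")
legend.pal <- pal[-4]
mylevels   <- c("Much worse", "Somewhat worse", "Almost the same",
                "Somewhat better", "Much better")

# Hypothetical percentages (not survey estimates)
long <- data.frame(
  outcome  = factor(rep(c("A", "B"), each = 6), levels = c("B", "A")),
  variable = factor(rep(vars, times = 2), levels = vars),
  value    = c(10, 20, 15, 15, 30, 10,
               30, 30, 10, 10, 15,  5),
  col      = rep(pal, times = 2)
)
lows  <- long[long$variable %in% vars[1:3], ]
highs <- long[long$variable %in% vars[4:6], ]

p <- ggplot() +
  # Upper half: half-middle segment nearest the center line
  geom_bar(data = highs,
           aes(x = outcome, y = value, fill = col, group = variable),
           stat = "identity", position = position_stack(reverse = TRUE)) +
  # Lower half: negated values stack downward from zero
  geom_bar(data = lows,
           aes(x = outcome, y = -value, fill = col, group = variable),
           stat = "identity", position = "stack") +
  # Midpoint of the rating scale
  geom_hline(yintercept = 0, color = "white") +
  # Use the hex codes as given; build the legend from the five-color palette
  scale_fill_identity("Percent", labels = mylevels,
                      breaks = legend.pal, guide = "legend") +
  # Grid lines at 25% intervals
  scale_y_continuous(breaks = seq(-75, 75, 25), limits = c(-75, 75)) +
  coord_flip() +
  labs(title = "Hypothetical outlook ratings", x = NULL, y = NULL) +
  theme(legend.position = "bottom")

print(p)
```

Adding `theme_fivethirtyeight()` before the final `theme()` call should restore the look described above, provided ggthemes is installed.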
Here is another example, with just a few modifications. This plot visualizes agreement/disagreement with a number of statements about democracy in Jordan. However, unlike the previous plot, this includes wrapped labels, and each outcome has only four categories. Because the sum of “Strongly disagree” and “Somewhat disagree” never exceeds 75%, the negative y-axis has also been trimmed automatically. The color palette is “PuOr,” also from the RColorBrewer package. This pre-summarized data may be found here, and a script for the plot may be found here.