October 26, 2016

Plotting individual observations and group means with ggplot2

@drsimonj here to share my approach for visualizing individual observations with group means in the same plot. Here are some examples of what we’ll be creating:

I find these sorts of plots to be incredibly useful for visualizing and gaining insight into our data. We often visualize group means only, sometimes with the likes of standard errors bars. Alternatively, we plot only the individual observations using histograms or scatter plots. Separately, these two methods have unique problems. For example, we can’t easily see sample sizes or variability with group means, and we can’t easily see underlying patterns or trends in individual observations. But when individual observations and group means are combined into a single plot, we can produce some powerful visualizations.

A quick note that, after publishing this post, the paper, “Modern graphical methods to compare two groups of observations” (Rousselet, Pernet, and Wilcox, 2016) was brought to my attention by Guillaume Rousselet, who kindly agreed to the reference being posted here. This paper is an excellent resource that goes into some very important details that motivate the work presented here, and it shows some really great plot examples (with R code!). Do take the time to read it if you get the chance.

General approach #

Below is generic pseudo-code capturing the approach that we’ll cover in this post. Following this will be some worked examples of diving deeper into each component.

# Packages we need
library(ggplot2)
library(dplyr)

# Have an individual-observation data set
id

# Create a group-means data set
gd <- id %>% 
        group_by(GROUPING-VARIABLES) %>% 
        summarise(
          VAR1 = mean(VAR1),
          VAR2 = mean(VAR2),
          ...
        )

# Plot both data sets
ggplot(id, aes(GEOM-AESTHETICS)) +
  geom_*() +
  geom_*(data = gd)

# Adjust plot to effectively differentiate data layers

Tidyverse packages #

Throughout, we’ll be using packages from the tidyverse: ggplot2 for plotting, and dplyr for working on the data. Let’s load these into our session:

library(ggplot2)
library(dplyr)

Group means on a single variable #

To get started, we’ll examine the logic behind the pseudo code with a simple example of presenting group means on a single variable. Let’s use mtcars as our individual-observation data set, id:

id <- mtcars %>% tibble::rownames_to_column() %>% as_data_frame()
id
#> # A tibble: 32 × 12
#>              rowname   mpg   cyl  disp    hp  drat    wt  qsec    vs    am
#>                <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1          Mazda RX4  21.0     6 160.0   110  3.90 2.620 16.46     0     1
#> 2      Mazda RX4 Wag  21.0     6 160.0   110  3.90 2.875 17.02     0     1
#> 3         Datsun 710  22.8     4 108.0    93  3.85 2.320 18.61     1     1
#> 4     Hornet 4 Drive  21.4     6 258.0   110  3.08 3.215 19.44     1     0
#> 5  Hornet Sportabout  18.7     8 360.0   175  3.15 3.440 17.02     0     0
#> 6            Valiant  18.1     6 225.0   105  2.76 3.460 20.22     1     0
#> 7         Duster 360  14.3     8 360.0   245  3.21 3.570 15.84     0     0
#> 8          Merc 240D  24.4     4 146.7    62  3.69 3.190 20.00     1     0
#> 9           Merc 230  22.8     4 140.8    95  3.92 3.150 22.90     1     0
#> 10          Merc 280  19.2     6 167.6   123  3.92 3.440 18.30     1     0
#> # ... with 22 more rows, and 2 more variables: gear <dbl>, carb <dbl>

Say we want to plot cars’ horsepower (hp), separately for automatic and manual cars (am). Let’s quickly convert am to a factor variable with proper labels:

id <- id %>% mutate(am = factor(am, levels = c(0, 1), labels = c("automatic", "manual")))

Using the individual observations, we can plot the data as points via:

ggplot(id, aes(x = am, y = hp)) +
  geom_point()

What if we want to visualize the means for these groups of points? We start by computing the mean horsepower for each transmission type into a new group-means data set (gd) as follows:

gd <- id %>% 
        group_by(am) %>% 
        summarise(hp = mean(hp))
gd
#> # A tibble: 2 × 2
#>          am       hp
#>      <fctr>    <dbl>
#> 1 automatic 160.2632
#> 2    manual 126.8462

There are a few important aspects to this:

We group our individual observations by the categorical variable using group_by().
We summarise() the variable as its mean().
We give the summarized variable the same name in the new data set. E.g., hp = mean(hp) results in hp being in both data sets.

We could plot these means as bars via:

ggplot(gd, aes(x = am, y = hp)) +
  geom_bar(stat = "identity")

The challenge now is to combine these plots.

As the base, we start with the individual-observation plot:

ggplot(id, aes(x = am, y = hp)) +
  geom_point()

Next, to display the group-means, we add a geom layer specifying data = gd. In this case, we’ll specify the geom_bar() layer as above:

ggplot(id, aes(x = am, y = hp)) +
  geom_point() +
  geom_bar(data = gd, stat = "identity")

Although there are some obvious problems, we’ve successfully covered most of our pseudo-code and have individual observations and group means in the one plot.

Before we address the issues, let’s discuss how this works. The main point is that our base layer (ggplot(id, aes(x = am, y = hp))) specifies the variables (am and hp) that are going to be plotted. By including id, it also means that any geom layers that follow without specifying data, will use the individual-observation data. Thus, geom_point() plots the individual points. geom_bar(), however, specifies data = gd, meaning it will try to use information from the group-means data. Because our group-means data has the same variables as the individual data, it can make use of the variables mapped out in our base ggplot() layer.

At this point, the elements we need are in the plot, and it’s a matter of adjusting the visual elements to differentiate the individual and group-means data and display the data effectively overall. Among other adjustments, this typically involves paying careful attention to the order in which the geom layers are added, and making heavy use of the alpha (transparency) values.

For example, we can make the bars transparent to see all of the points by reducing the alpha of the bars:

ggplot(id, aes(x = am, y = hp)) +
  geom_point() +
  geom_bar(data = gd, stat = "identity", alpha = .3)

Here’s a final polished version that includes:

Color to the bars and points for visual appeal.
ggrepel::geom_text_repel to add car labels to each point.
theme_bw() to clean the overall appearance.
Proper axis labels.

ggplot(id, aes(x = am, y = hp, color = am, fill = am)) +
  geom_bar(data = gd, stat = "identity", alpha = .3) +
  ggrepel::geom_text_repel(aes(label = rowname), color = "black", size = 2.5, segment.color = "grey") +
  geom_point() +
  guides(color = "none", fill = "none") +
  theme_bw() +
  labs(
    title = "Car horespower by transmission type",
    x = "Transmission",
    y = "Horsepower"
  )

Notice that, again, we can specify how variables are mapped to aesthetics in the base ggplot() layer (e.g., color = am), and this affects the individual and group-means geom layers because both data sets have the same variables.

Group means on two variables #

Next, we’ll move to overlaying individual observations and group means for two continuous variables. This time we’ll use the iris data set as our individual-observation data:

id <- as_data_frame(iris)
id
#> # A tibble: 150 × 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
#> 1           5.1         3.5          1.4         0.2  setosa
#> 2           4.9         3.0          1.4         0.2  setosa
#> 3           4.7         3.2          1.3         0.2  setosa
#> 4           4.6         3.1          1.5         0.2  setosa
#> 5           5.0         3.6          1.4         0.2  setosa
#> 6           5.4         3.9          1.7         0.4  setosa
#> 7           4.6         3.4          1.4         0.3  setosa
#> 8           5.0         3.4          1.5         0.2  setosa
#> 9           4.4         2.9          1.4         0.2  setosa
#> 10          4.9         3.1          1.5         0.1  setosa
#> # ... with 140 more rows

Let’s say we want to visualize the petal length and width for each iris Species.

Let’s create the group-means data set as follows:

gd <- id %>% 
        group_by(Species) %>% 
        summarise(Petal.Length = mean(Petal.Length),
                  Petal.Width  = mean(Petal.Width))
gd
#> # A tibble: 3 × 3
#>      Species Petal.Length Petal.Width
#>       <fctr>        <dbl>       <dbl>
#> 1     setosa        1.462       0.246
#> 2 versicolor        4.260       1.326
#> 3  virginica        5.552       2.026

We’ve now got the variable means for each Species in a new group-means data set, gd. The important point, as before, is that there are the same variables in id and gd.

Let’s prepare our base plot using the individual observations, id:

ggplot(id, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point()

Let’s use the color aesthetic to distinguish the groups:

ggplot(id, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point()

Now we can add a geom that uses our group means. We’ll use geom_point() again:

ggplot(id, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point() +
  geom_point(data = gd)

Did it work? Well, yes, it did. The problem is that we can’t distinguish the group means from the individual observations because the points look the same. Again, we’ve successfully integrated observations and means into a single plot. The challenge now is to make various adjustments to highlight the difference between the data layers.

To do this, we’ll fade out the observation-level geom layer (using alpha) and increase the size of the group means:

ggplot(id, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point(alpha = .4) +
  geom_point(data = gd, size = 4)

Here’s a final polished version for you to play around with:

ggplot(id, aes(x = Petal.Length, y = Petal.Width, color = Species, shape = Species)) +
  geom_point(alpha = .4) +
  geom_point(data = gd, size = 4) +
  theme_bw() +
  guides(color = guide_legend("Species"),  shape = guide_legend("Species")) +
  labs(
    title = "Petal size of iris species",
    x = "Length",
    y = "Width"
  )

Repeated observations #

One useful avenue I see for this approach is to visualize repeated observations. For example, colleagues in my department might want to plot depression levels measured at multiple time points for people who receive one of two types of treatment. Typically, they would present the means of the two groups over time with error bars. However, we can improve on this by also presenting the individual trajectories.

As an example, let’s examine changes in healthcare expenditure over five years (from 2001 to 2005) for countries in Oceania and the Europe.

Start by gathering our individual observations from my new ourworldindata package for R, which you can learn more about in a previous blogR post:

# Individual-observations data
library(ourworldindata)
id <- financing_healthcare %>% 
        filter(continent %in% c("Oceania", "Europe") & between(year, 2001, 2005)) %>% 
        select(continent, country, year, health_exp_total) %>% 
        na.omit()
id
#> # A tibble: 275 × 4
#>    continent country  year health_exp_total
#>        <chr>   <chr> <int>            <dbl>
#> 1     Europe Albania  2001         198.2242
#> 2     Europe Albania  2002         225.1861
#> 3     Europe Albania  2003         236.3563
#> 4     Europe Albania  2004         263.5986
#> 5     Europe Albania  2005         276.6520
#> 6     Europe Andorra  2001        1432.2798
#> 7     Europe Andorra  2002        1564.6976
#> 8     Europe Andorra  2003        1601.0641
#> 9     Europe Andorra  2004        1661.5608
#> 10    Europe Andorra  2005        1793.9938
#> # ... with 265 more rows

Let’s plot these individual country trajectories:

ggplot(id, aes(x = year, y = health_exp_total)) +
  geom_line()

Hmm, this doesn’t look like right. The problem is that we need to group our data by country:

ggplot(id, aes(x = year, y = health_exp_total, group = country)) +
  geom_line()

We now have a separate line for each country. Let’s color these depending on the world region (continent) in which they reside:

ggplot(id, aes(x = year, y = health_exp_total, group = country, color = continent)) +
  geom_line()

If we tried to follow our usual steps by creating group-level data for each world region and adding it to the plot, we would do something like this:

gd <- id %>% 
        group_by(continent) %>% 
        summarise(health_exp_total = mean(health_exp_total))

ggplot(id, aes(x = year, y = health_exp_total, group = country, color = continent)) +
  geom_line() +
  geom_line(data = gd)

This, however, will lead to a couple of errors, which are both caused by variables being called in the base ggplot() layer, but not appearing in our group-means data, gd.

First, we’re not taking year into account, but we want to! In this case, year must be treated as a second grouping variable, and included in the group_by command. Thus, to compute the relevant group-means, we need to do the following:

gd <- id %>% 
        group_by(continent, year) %>% 
        summarise(health_exp_total = mean(health_exp_total))
gd
#> Source: local data frame [10 x 3]
#> Groups: continent [?]
#> 
#>    continent  year health_exp_total
#>        <chr> <int>            <dbl>
#> 1     Europe  2001        1196.7948
#> 2     Europe  2002        1311.2303
#> 3     Europe  2003        1375.2729
#> 4     Europe  2004        1465.5530
#> 5     Europe  2005        1550.2395
#> 6    Oceania  2001         398.1582
#> 7    Oceania  2002         414.7088
#> 8    Oceania  2003         448.6919
#> 9    Oceania  2004         475.8466
#> 10   Oceania  2005         501.5413

The second error is because we’re grouping lines by country, but our group means data, gd, doesn’t contain this information. Thus, we need to move aes(group = country) into the geom layer that draws the individual-observation data.

Now, our plot will be:

ggplot(id, aes(x = year, y = health_exp_total, color = continent)) +
  geom_line(aes(group = country)) +
  geom_line(data = gd)

It worked again; we just need to make the necessary adjustments to see the data properly. Here’s a polished final version of the plot. See if you can work it out!

ggplot(id, aes(x = year, y = health_exp_total, color = continent)) +
  geom_line(aes(group = country), alpha = .3) +
  geom_line(data = gd, alpha = .8, size = 3) +
  theme_bw() +
  labs(
    title = "Changes in healthcare spending\nacross countries and world regions",
    x = NULL,
    y = "Total healthcare investment ($)",
    color = NULL
  )

A challenge #

For me, in a scientific paper, I like to draw time-series like the example above using the line plot described in another blogR post. As a challenge, I’ll leave it to you to draw this sort of neat time series with individual trajectories drawn underneath the mean trajectories with error bars. Don’t hesitate to get in touch if you’re struggling. Even better, succeed and tweet the results to let me know by including @drsimonj!

Boxplots #

After publishing this post, I received a wonderful email from Professor Bob Sekuler (Brandeis University), who tells me that plotting individual points over group means is a growing trend. He also suggested that boxplots, rather than bars, helps to provide even more information, and showed me some nice examples that were created by him and his student, Yile Sun. So, I thought I’d include a simple example here for other readers who might be interested.

When it comes to boxplots, our lives get a little easier, because we don’t need to create a group-means data frame. So, in the below example, we plot boxplots using geom_boxplot(). Note that we need the group aesthtic to split by transmission type (am). We then overlay it with points using geom_jitter(). We could use geom_point(), but jitter just spreads the points out a bit in case there are any that overlap.

mtcars %>% 
  mutate(am = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual"))) %>% 
  ggplot(aes(x = am, y = hp, group = am, fill = am)) +
    geom_boxplot(alpha = .7) +
    geom_jitter(width = .05, alpha = .4) +
    guides(fill = "none") +
    theme_bw() +
    labs(
      x = "Transmission",
      y = "Horsepower"
    )

This is a really nice alternative as we get information about quantiles, skew, and outliers.

Sign off #

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

403

Kudos

403