by Simon Jackson

R tips and tricks from a scientist. All R Markdown docs with full R code can be found at my GitHub:

Ordering categories within ggplot2 facets

@drsimonj here to share my method for ordering categories within facets to create plots that look like this…


instead of like this…


 Motivation: Tidy Text Mining in R

The motivation for this post comes from Tidy Text Mining in R by Julia Silge and David Robinson. It is a must read if text mining is something that interests you.

I noticed that Julia and David had left themselves a “TODO” in Chapter 5 that was “not easy to fix.” Not easy to fix? Could Julia Silge and David Robinson face challenges as the rest of us do?!


Shocking, I know.

Well, it was probably just a matter of time until they fixed it. Still, I thought it was an interesting challenge, so I gave it some thought and wanted to share my solution.

 The problem

They were using ggplot2 to create a bar plot with the following features:

  • Facetted into separate panels
  • One bar for each category (words in their case).
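
To make the problem concrete, here is a minimal sketch of one common workaround for ordering bars within each panel: give every row its own position after sorting within the facet, plot against that position, and relabel the axis with the category names. The data and the names `d`, `group`, `word`, `n` and `pos` are made up for illustration and are not taken from the book. The plotting step only runs if ggplot2 is installed.

```r
# Made-up data: the same words appear in two facets with different values
d <- data.frame(
  group = rep(c("A", "B"), each = 3),
  word  = rep(c("x", "y", "z"), times = 2),
  n     = c(3, 1, 2, 1, 3, 2)
)

# Sort by facet, then by value, and give every row its own position
d <- d[order(d$group, d$n), ]
d$pos <- seq_len(nrow(d))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(d, aes(x = pos, y = n)) +
    geom_col() +
    facet_wrap(~ group, scales = "free_x") +
    # Show the category names instead of the row positions
    scale_x_continuous(breaks = d$pos, labels = d$word) +
    labs(x = "word")
  print(p)
}
```

Because each facet-by-category combination gets a unique position, the same word can sit in a different place in each panel.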

Continue reading →

Plotting individual observations and group means with ggplot2

@drsimonj here to share my approach for visualizing individual observations with group means in the same plot. Here are some examples of what we’ll be creating:




I find these sorts of plots to be incredibly useful for visualizing and gaining insight into our data. We often visualize group means only, sometimes with the likes of standard error bars. Alternatively, we plot only the individual observations using histograms or scatter plots. Separately, these two methods have unique problems. For example, we can’t easily see sample sizes or variability with group means, and we can’t easily see underlying patterns or trends in individual observations. But when individual observations and group means are combined into a single plot, we can produce some powerful visualizations.
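
As a rough sketch of the idea (not necessarily how the full post builds its plots), one option is to keep two data frames, one with the individual observations and one with the group means, and draw each in its own layer. The built-in mtcars data and the names `obs` and `means` are my choices for illustration. The plotting step only runs if ggplot2 is installed.

```r
# Individual observations: one row per car in the built-in mtcars data
obs <- data.frame(
  am  = factor(mtcars$am, labels = c("automatic", "manual")),  # 0 = automatic, 1 = manual
  mpg = mtcars$mpg
)

# Group means: one row per transmission type
means <- aggregate(mpg ~ am, data = obs, FUN = mean)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(obs, aes(x = am, y = mpg)) +
    # Individual observations, jittered sideways so overlapping points stay visible
    geom_point(alpha = .4, position = position_jitter(width = .1, height = 0)) +
    # Group means drawn on top from the second data frame
    geom_point(data = means, colour = "red", size = 4) +
    labs(x = "Transmission", y = "Miles Per Gallon")
  print(p)
}
```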

A quick note that, after publishing this post, the paper, “Modern graphical methods to compare two groups of

Continue reading →

Exploring the effects of healthcare investment on child mortality in R

@drsimonj here to investigate the effects of healthcare investment on child mortality rates over time. I hope that you find the content as interesting as I do. However, please note that this post is intended to be an informative exercise in exploring and visualizing data with R and my new ourworldindata package. The conclusions drawn here require independent, peer-reviewed verification.

On this note, thank you to Amanda Glassman for bringing this research paper to my attention after this post was first published. The paper suggests that healthcare expenditure does not affect, or only weakly affects, child mortality rates. I think it’s an excellent paper and, if you’re interested in the content, a far more legitimate resource in terms of the scientific approach taken. After reading that paper, with the exception of this paragraph, I’ve left this post unchanged for interested readers.

Continue reading →

corrr 0.2.1 now on CRAN

@drsimonj here to discuss the latest CRAN release of corrr (0.2.1), a package for exploring correlations in a tidy R framework. This post will describe corrr features added since version 0.1.0.

You can install or update to this latest version directly from CRAN by running:

install.packages("corrr")

Let’s load corrr into our workspace and create a correlation data frame of the mtcars data set to work with:

library(corrr)
rdf <- correlate(mtcars)
rdf
#> # A tibble: 11 × 12
#>    rowname        mpg        cyl       disp         hp        drat
#>      <chr>      <dbl>      <dbl>      <dbl>      <dbl>       <dbl>
#> 1      mpg         NA -0.8521620 -0.8475514 -0.7761684  0.68117191
#> 2      cyl -0.8521620         NA  0.9020329  0.8324475 -0.69993811
#> 3     disp -0.8475514  0.9020329         NA  0.7909486 -0.71021393
#> 4       hp -0.7761684  0.8324475  0.7909486         NA -0.44875912

Continue reading →

ourworldindata: an R data package

@drsimonj here to introduce ourworldindata: a new data package for R.

The ourworldindata package contains data frames that are generated by combining datasets from “an online publication that shows how living conditions around the world are changing”. The data frames in this package have undergone tidying so that they are suited to quick analysis in R. The purpose of this package is to serve as a central R resource for these datasets so that they might be used for the likes of practice or exploratory data analysis in a replicable manner.

 Thanks to the OurWorldInData team

Before discussing the package, I’d like to express my thanks to Max Roser and the rest of the OurWorldInData team, who collate the data sets that form the foundation of this package. If you appreciate their work and make use of this package, please consider supporting OurWorldInData. Personal

Continue reading →

Running a model on separate groups

Ever wanted to run a model on separate groups of data? Read on!

Here’s an example of a regression model fitted to separate groups: predicting a car’s Miles per Gallon with various attributes, but separately for automatic and manual cars.

library(tidyverse)  # nest(), mutate(), map(), unnest(), ggplot()
library(broom)      # augment()

mtcars %>% 
  nest(-am) %>% 
  mutate(am = factor(am, levels = c(0, 1), labels = c("automatic", "manual")),
         fit = map(data, ~ lm(mpg ~ hp + wt + disp, data = .)),
         results = map(fit, augment)) %>% 
  unnest(results) %>% 
  ggplot(aes(x = mpg, y = .fitted)) +
    geom_abline(intercept = 0, slope = 1, alpha = .2) +  # Line of perfect fit
    geom_point() +
    facet_grid(am ~ .) +
    labs(x = "Miles Per Gallon", y = "Predicted Value")


 Getting Started

A few things to do/keep in mind before getting started…

 A lot of detail for novices

I started this post after working on a larger

Continue reading →

Five ways to calculate internal consistency

Let’s get psychometric and learn a range of ways to compute the internal consistency of a test or questionnaire in R. We’ll be covering:

  • Average inter-item correlation
  • Average item-total correlation
  • Cronbach’s alpha
  • Split-half reliability (adjusted using the Spearman–Brown prophecy formula)
  • Composite reliability
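
As a rough preview of three of the measures listed above, here is a minimal base R sketch using their textbook formulas. The simulated `items` data and the helper names (`avg_inter_item`, `cronbach_alpha`, `spearman_brown`, `split_half`) are my own for illustration and are not necessarily how the full post computes them. The odd/even split is just one of many possible ways to halve a test.

```r
# Made-up example data: 5 items driven by one shared trait plus noise
set.seed(42)
n <- 200
trait <- rnorm(n)
items <- as.data.frame(sapply(1:5, function(i) trait + rnorm(n)))
names(items) <- paste0("item", 1:5)

# Average inter-item correlation: mean of the unique off-diagonal correlations
avg_inter_item <- function(d) {
  r <- cor(d)
  mean(r[lower.tri(r)])
}

# Cronbach's alpha from the item variances and the variance of the total score
cronbach_alpha <- function(d) {
  k <- ncol(d)
  (k / (k - 1)) * (1 - sum(apply(d, 2, var)) / var(rowSums(d)))
}

# Spearman-Brown prophecy formula for a test split into two halves
spearman_brown <- function(r) 2 * r / (1 + r)

# Split-half reliability: correlate odd-item and even-item totals, then adjust
split_half <- function(d) {
  odd  <- rowSums(d[, seq(1, ncol(d), by = 2), drop = FALSE])
  even <- rowSums(d[, seq(2, ncol(d), by = 2), drop = FALSE])
  spearman_brown(cor(odd, even))
}

avg_inter_item(items)
cronbach_alpha(items)
split_half(items)
```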

If you’re unfamiliar with any of these, here are some resources to get you up to speed:


 The data

For this post, we’ll be using data on a Big 5 measure of personality that is freely available from Personality Tests. You can download the data yourself HERE, or running the following code will handle the downloading and save the data

Continue reading →

Visualising Residuals

Residuals. Now there’s something to get you out of bed in the morning!

OK, maybe residuals aren’t the sexiest topic in the world. Still, they’re an essential means of identifying potential problems with any statistical model. For example, the residuals from a linear regression model should be homoscedastic. If they are not, this can indicate an issue with the model, such as non-linearity in the data.
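
For a quick feel of what checking residuals looks like, here is a minimal base R sketch. It is an assumption-laden illustration rather than the approach taken in the full post: the model `mpg ~ hp` on the built-in mtcars data and the data frame name `d` are my choices.

```r
# Fit a simple linear regression on the built-in mtcars data
fit <- lm(mpg ~ hp, data = mtcars)

# Collect fitted values and residuals (observed minus fitted)
d <- data.frame(fitted = fitted(fit), residual = resid(fit))

# Residuals vs fitted values: a funnel shape hints at non-constant variance,
# a curved pattern hints at non-linearity
plot(d$fitted, d$residual, xlab = "Fitted value", ylab = "Residual")
abline(h = 0, lty = 2)  # Reference line at zero
```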

This post will cover various methods for visualising residuals from regression-based models. Here are some examples of the visualisations that we’ll be creating:




 What you need to know

To get the most out of this post, there are a few things you should be aware of. Firstly, if you’re unfamiliar with the meaning of residuals, or what seems to be going on here, I’d recommend that you first do some introductory reading on the topic. Some places to get started are Wikipedia and this

Continue reading →

Plotting background data for groups with ggplot2

This tweet by mikefc alerted me to a mind-blowingly simple but amazing trick using the ggplot2 package: to visualise data for different groups in a facetted plot with all of the data plotted in the background. Here’s an example that we’ll learn to make in this post so you know what I’m talking about:
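
In code, the trick can be sketched roughly like this, assuming the built-in iris data: the background layer gets a copy of the data with the facetting variable removed, so it is drawn in every panel. The name `background` and the choice of `Sepal.Width` are mine for illustration. The plotting step only runs if ggplot2 is installed.

```r
# Copy of the data without the facetting variable, so it appears in every panel
background <- iris[, names(iris) != "Species"]

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(iris, aes(x = Sepal.Width)) +
    # All observations, in grey, repeated behind each panel
    geom_histogram(data = background, fill = "grey", alpha = .5, bins = 20) +
    # Only the observations belonging to the panel's own group
    geom_histogram(colour = "black", bins = 20) +
    facet_wrap(~ Species)
  print(p)
}
```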


 Credit where credit’s due

Before continuing, I’d be remiss not to mention that the origin of this ingenious suggestion is Hadley Wickham. The tip comes from his latest ggplot2 book, hard copies of which are available online at places like Amazon. The code and text behind it are freely available on Hadley’s GitHub at this repository.

 Some motivating examples

Let’s start with some examples that explain just why I’m so excited about this trick. Consider wanting to plot the results shown in the example above. That is, for the iris data set (that comes with R), we want to plot a

Continue reading →

focus() on correlations of some variables with many others

Get the correlations of one or more variables with many others using focus() from the corrr package:

mtcars %>% correlate() %>% focus(mpg)
#> # A tibble: 10 x 2
#>    rowname        mpg
#>      <chr>      <dbl>
#> 1      cyl -0.8521620
#> 2     disp -0.8475514
#> 3       hp -0.7761684
#> 4     drat  0.6811719
#> 5       wt -0.8676594
#> 6     qsec  0.4186840
#> 7       vs  0.6640389
#> 8       am  0.5998324
#> 9     gear  0.4802848
#> 10    carb -0.5509251

Let’s break it down.


I’ve noticed a lot of people asking how to do this: see here, here, here.

So this post will explain how to use focus() from the corrr package to correlate one or more variables in a data frame with many others.

 Starting with corrr

We’ll be using the corrr package, which starts by using correlate() to create a correlation data frame. For example, we can correlate() all columns in

Continue reading →