by Simon Jackson

R tips and tricks from a scientist. All R Markdown docs with full R code can be found at my GitHub:


Tidy grid search with pipelearner

@drsimonj here to show you how to use pipelearner to easily grid-search hyperparameters for a model.

pipelearner is a package for making machine learning pipelines and is currently available to install from GitHub by running the following:

# install.packages("devtools")  # Run this if devtools isn't installed

In this post we’ll grid search hyperparameters of a decision tree (using the rpart package) predicting cars’ transmission type (automatic or manual) using the mtcars data set. Let’s load rpart along with tidyverse, which pipelearner is intended to work with:


 The data

Quickly convert our outcome variable to a factor with proper labels:

d <- mtcars %>% 
  mutate(am = factor(am, labels = c("automatic", "manual")))
#>    mpg cyl disp  hp drat    wt  qsec vs        am

Continue reading →

Data science opinions and tools to support them at rstudio::conf

@drsimonj here to share my big takeaways from rstudio::conf 2017. My aim here is to share the broad data science opinions and challenges that I feel bring together the R community right now, and perhaps offer some guidance to anyone wanting to get into the R community.

DISCLAIMER: this is based on my experience, my primary interests, the talks I attended, the people I met, etc. I’m also very jet lagged after flying back to Australia! If I’ve missed something important to you (which I’m sure I have), please comment in whichever medium (Twitter, Facebook, etc.) and get the discussion going!

 My overall experience

I’ll start by saying that I had a great time. RStudio went all out and nailed everything from getting high-quality speakers, to booking a great venue and organizing a social event at Harry Potter world I won’t forget. But if I do, Hilary Parker took some great shots!

Continue reading →

Easy machine learning pipelines with pipelearner: intro and call for contributors

@drsimonj here to introduce pipelearner – a package I’m developing to make it easy to create machine learning pipelines in R – and to spread the word in the hope that some readers may be interested in contributing or testing it.

This post will demonstrate some examples of what pipelearner can currently do. For example, the figure below plots the results of a model fitted to 10% to 100% (in 10% increments) of training data in 50 cross-validation pairs. Fitting all of these models takes about four lines of code in pipelearner.


Head to the pipelearner GitHub page to learn more and contact me if you have a chance to test it yourself or are interested in contributing (my contact details are at the end of this post).


 Some setup


# Help functions
r_square <- function(model, data) {
  actual    <-

Continue reading →

Grid search in the tidyverse

@drsimonj here to share a tidyverse method of grid search for optimizing a model’s hyperparameters.

 Grid Search

For anyone who’s unfamiliar with the term, grid search involves running a model many times, each time with a different combination of hyperparameter values. The point is to identify which combination is likely to work best. For a more technical definition, Wikipedia describes grid search as:

an exhaustive searching through a manually specified subset of the hyperparameter space of a learning algorithm
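
To make the definition concrete, here is a minimal sketch of a grid search in base R plus rpart. It is an illustration rather than the code from the post: the candidate values for minsplit and maxdepth are arbitrary assumptions, and accuracy is measured on the training data only (no cross-validation), so treat the numbers as a demonstration of the mechanics, not as model evaluation.

```r
library(rpart)  # decision trees; ships with standard R installations

# Outcome as a labelled factor (0 = automatic, 1 = manual in mtcars)
d <- mtcars
d$am <- factor(d$am, labels = c("automatic", "manual"))

# The "grid": every combination of the candidate hyperparameter values
# (these particular values are assumptions chosen for illustration)
grid <- expand.grid(minsplit = c(5, 10, 20), maxdepth = c(1, 3, 5))

# Fit one tree per grid row and record its accuracy on the training data
grid$accuracy <- mapply(function(minsplit, maxdepth) {
  fit <- rpart(am ~ ., data = d, method = "class",
               control = rpart.control(minsplit = minsplit,
                                       maxdepth = maxdepth))
  mean(predict(fit, d, type = "class") == d$am)
}, grid$minsplit, grid$maxdepth)

# Show the combinations from best to worst
grid[order(-grid$accuracy), ]
```

Because expand.grid builds every combination up front, adding another hyperparameter only means adding another argument to it and to the fitting function.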

 What this post isn’t about

To keep the focus on grid search, this post does NOT cover…

  • k-fold cross-validation. Although a practically essential addition to grid search, I’ll save the combination of these techniques for a future post. If you can’t wait, check out my last post for some inspiration.
  • Complex learning models. We’ll stick to a simple decision tree.
  • Getting a great model fit. I’ve

Continue reading →

k-fold cross validation with modelr and broom

@drsimonj here to discuss how to conduct k-fold cross validation, with an emphasis on evaluating models supported by David Robinson’s broom package. Full credit also goes to David, as this is a slightly more detailed version of his past post, which I read some time ago and felt like unpacking.

 Assumed knowledge: K-fold Cross validation

This post assumes you know what k-fold cross validation is. If you want to brush up, here’s a fantastic tutorial from Stanford University professors Trevor Hastie and Rob Tibshirani.
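
As a quick refresher, the idea can be sketched in base R without any extra packages. This is only an illustrative sketch: the model (mpg predicted from wt) and the use of RMSE as the error measure are assumptions made for the example, not choices taken from the post.

```r
set.seed(1)  # for a reproducible shuffle
k <- 5

# Randomly assign each row of mtcars to one of k folds of near-equal size
fold_id <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# For each fold: train on the other k - 1 folds, test on the held-out fold
rmse <- sapply(seq_len(k), function(i) {
  train <- mtcars[fold_id != i, ]
  test  <- mtcars[fold_id == i, ]
  fit   <- lm(mpg ~ wt, data = train)  # hypothetical model for illustration
  sqrt(mean((test$mpg - predict(fit, test))^2))
})

# Average the k held-out errors to get the cross-validated estimate
mean(rmse)
```

Every observation is used for testing exactly once and for training k - 1 times, which is what distinguishes k-fold cross validation from a single train/test split.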

 Creating folds

Before worrying about models, we can generate K folds using crossv_kfold from the modelr package. Let’s practice with the mtcars data to keep things simple.

set.seed(1)  # Run to replicate this post
folds <- crossv_kfold(mtcars, k = 5)
#> # A tibble: 5 × 3
#>            train           test   .id
#>           <list>         <list>

Continue reading →

Plotting my trips with ubeR

@drsimonj here to explain how I used ubeR, an R package for the Uber API, to create this map of my trips over the last couple of years:


 Getting ubeR

The ubeR package, which I first heard about here, is currently available on GitHub. In R, install and load it as follows:

# install.packages("devtools")  # Run to install the devtools package if needed
devtools::install_github("DataWookie/ubeR")  # Install ubeR

For this post I also use many of the tidyverse packages, so install and load this too to follow along:


 Setting up an app

To use ubeR and the Uber API, you’ll need an Uber account and to register a new app. In a web browser, log in to your Uber account and head to this page. Fill in the details. Here’s an example:


Once created, under the Authorization tab, set the Redirect URL to http://localhost:1410/


Further down, under General Scopes

Continue reading →

Ordering categories within ggplot2 facets

@drsimonj here to share my method for ordering categories within facets to create plots that look like this…


instead of like this…


 Motivation: Tidy Text Mining in R

The motivation for this post comes from Tidy Text Mining in R by Julia Silge and David Robinson. It is a must read if text mining is something that interests you.

I noticed that Julia and David had left themselves a “TODO” in Chapter 5 that was “not easy to fix.” Not easy to fix? Could Julia Silge and David Robinson face challenges as the rest of us do?!


Shocking, I know.

Well, it was probably just a matter of time until they fixed it. Still, I thought it was an interesting challenge, gave it some thought, and wanted to share my solution.

 The problem

They were using ggplot2 to create a bar plot with the following features:

  • Facetted into separate panels
  • One bar for each category (words in their case).

Continue reading →

Plotting individual observations and group means with ggplot2

@drsimonj here to share my approach for visualizing individual observations with group means in the same plot. Here are some examples of what we’ll be creating:




I find these sorts of plots to be incredibly useful for visualizing and gaining insight into our data. We often visualize group means only, sometimes with the likes of standard error bars. Alternatively, we plot only the individual observations using histograms or scatter plots. Separately, these two methods have unique problems. For example, we can’t easily see sample sizes or variability with group means, and we can’t easily see underlying patterns or trends in individual observations. But when individual observations and group means are combined into a single plot, we can produce some powerful visualizations.
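
One way this combination might look, as a rough sketch rather than the post's own code: compute the group means in a separate data frame, then draw them as a second layer over the raw points. The choice of mtcars, the mpg and cyl variables, and the styling are all assumptions made for illustration, and the sketch assumes ggplot2 is installed.

```r
library(ggplot2)

# Group means of mpg for each number of cylinders (base R)
group_means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Faint points for the individual cars, large red points for the group means
p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_point(alpha = 0.4) +
  geom_point(data = group_means, size = 4, colour = "red") +
  labs(x = "Cylinders", y = "Miles per gallon")

p
```

Keeping the means in their own data frame lets the second layer reuse the same aesthetics while drawing only one point per group.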

A quick note that, after publishing this post, the paper, “Modern graphical methods to compare two groups of

Continue reading →

Exploring the effects of healthcare investment on child mortality in R

@drsimonj here to investigate the effects of healthcare investment on child mortality rates over time. I hope that you find the content as interesting as I do. However, please note that this post is intended to be an informative exercise in exploring and visualizing data with R and my new ourworldindata package. The conclusions drawn here require independent, peer-reviewed verification.

On this note, thank you to Amanda Glassman for bringing this research paper to my attention after this post was first published. The paper suggests that healthcare expenditure does not affect, or only weakly affects, child mortality rates. I think it’s an excellent paper and, if you’re interested in the content, a far more legitimate resource in terms of the scientific approach taken. After reading that paper, with the exception of this paragraph, I’ve left this post unchanged for interested readers.

Continue reading →

corrr 0.2.1 now on CRAN

@drsimonj here to discuss the latest CRAN release of corrr (0.2.1), a package for exploring correlations in a tidy R framework. This post will describe corrr features added since version 0.1.0.

You can install or update to this latest version directly from CRAN by running:


Let’s load corrr into our workspace and create a correlation data frame of the mtcars data set to work with:

rdf <- correlate(mtcars)
#> # A tibble: 11 × 12
#>    rowname        mpg        cyl       disp         hp        drat
#>      <chr>      <dbl>      <dbl>      <dbl>      <dbl>       <dbl>
#> 1      mpg         NA -0.8521620 -0.8475514 -0.7761684  0.68117191
#> 2      cyl -0.8521620         NA  0.9020329  0.8324475 -0.69993811
#> 3     disp -0.8475514  0.9020329         NA  0.7909486 -0.71021393
#> 4       hp -0.7761684  0.8324475  0.7909486         NA -0.44875912

Continue reading →