blogR

by Simon Jackson

R tips and tricks from a scientist. All R Markdown docs with full R code can be found at my GitHub: https://github.com/drsimonj/blogR


Big Data Solutions: A/B t test

@drsimonj here to share my code for using Welch’s t-test to compare group means using summary statistics.

 Motivation

I’ve just started working with A/B tests that use big data. Where once I’d whimsically run t.test(), now my data won’t fit into memory!

I’m sharing my solution here in the hope that it might help others.

 In-memory data

As a baseline, let’s start with an in-memory case by comparing whether automatic and manual cars have different Miles Per Gallon ratings on average (using the mtcars data set).

t.test(mpg ~ am, data = mtcars)
#> 
#>  Welch Two Sample t-test
#> 
#> data:  mpg by am
#> t = -3.7671, df = 18.332, p-value = 0.001374
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -11.280194  -3.209684
#> sample estimates:
#> mean in group 0 mean in group 1 
#>        17.14737        24.39231

Well… that was easy!
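The catch is that t.test() needs the raw data in memory. Welch's statistic, though, only needs each group's mean, variance, and size, so it can be computed from summary statistics alone. A base-R sketch, using mtcars to check against the output above:

```r
# Welch's t-test from summary statistics only (no raw data needed)
# Per-group mean, variance, and n for mpg by transmission type (am):
g  <- aggregate(mpg ~ am, data = mtcars,
                FUN = function(v) c(m = mean(v), s2 = var(v), n = length(v)))
m  <- g$mpg[, "m"]
s2 <- g$mpg[, "s2"]
n  <- g$mpg[, "n"]

se2    <- s2 / n                             # squared standard errors
t_stat <- (m[1] - m[2]) / sqrt(sum(se2))     # Welch's t statistic
df     <- sum(se2)^2 / sum(se2^2 / (n - 1))  # Welch-Satterthwaite df
p      <- 2 * pt(-abs(t_stat), df)           # two-sided p-value

c(t = t_stat, df = df, p = p)
#> t = -3.7671, df = 18.332, p = 0.001374 — matching t.test() above
```

With big data, the group means, variances, and counts can be computed in the database or on the cluster, and only those six numbers need to come back to R.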

Continue reading →


A tidy model pipeline with twidlr and broom

@drsimonj here to show you how to go from data in a data.frame to a tidy data.frame of model output by combining twidlr and broom in a single, tidy model pipeline.

 The problem

Different model functions take different types of inputs (data.frames, matrices, etc.) and produce different types of output! Thus, we’re often confronted with the very untidy challenge presented in this Figure:

problem.png

As a result, different models may need very different code.

However, it’s possible to create a consistent, tidy pipeline by combining the twidlr and broom packages. Let’s see how this works.

 Two-step modelling

To understand the solution, think of the problem as a two-step process, depicted in this Figure:

two-step.png

 Step 1: from data to fitted model

Step 1 must take data in a data.frame as input and return a fitted model object. twidlr exposes model functions that do just this!
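For a concrete picture of the two steps, here is the same flow sketched with base lm() and broom (broom’s tidy() and glance() are the Step 2; twidlr generalises Step 1 to models that don’t natively take a data.frame):

```r
library(broom)

# Step 1: data.frame in, fitted model object out
fit <- lm(hp ~ ., data = mtcars)

# Step 2: fitted model in, tidy data.frame out
tidy(fit)    # one row per term: estimate, std.error, statistic, p.value
glance(fit)  # one-row model summary: r.squared, AIC, and friends
```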

To demonstrate:

Continue reading →


Pretty scatter plots with ggplot2

@drsimonj here to make pretty scatter plots of correlated variables with ggplot2!

We’ll learn how to create plots that look like this:

init-example-1.png

 Data

In a data.frame d, we’ll simulate two correlated variables a and b of length n:

set.seed(170513)
n <- 200
d <- data.frame(a = rnorm(n))
d$b <- .4 * (d$a + rnorm(n))

head(d)
#>            a           b
#> 1 -0.9279965 -0.03795339
#> 2  0.9133158  0.21116682
#> 3  1.4516084  0.69060249
#> 4  0.5264596  0.22471694
#> 5 -1.9412516 -1.70890512
#> 6  1.4198574  0.30805526

 Basic scatter plot

Using ggplot2, the basic scatter plot (with theme_minimal) is created via:

library(ggplot2)

ggplot(d, aes(a, b)) +
  geom_point() +
  theme_minimal()

unnamed-chunk-3-1.JPEG

 Shape and size

There are many ways to tweak the shape and size of the points. Here’s the combination I settled on for this post:

ggplot(d, aes(a, b)) +
  geom_point(shape = 16, size = 5) +

Continue reading →


Pretty histograms with ggplot2

@drsimonj here to make pretty histograms with ggplot2!

In this post you’ll learn how to create histograms like this:

init-example-1.jpg

 The data

Let’s simulate data for a continuous variable x in a data frame d:

set.seed(070510)
d <- data.frame(x = rnorm(2000))

head(d)
#>            x
#> 1  1.3681661
#> 2 -0.0452337
#> 3  0.0290572
#> 4 -0.8717429
#> 5  0.9565475
#> 6 -0.5521690

 Basic Histogram

Create the basic ggplot2 histogram via:

library(ggplot2)

ggplot(d, aes(x)) +
    geom_histogram()

basic-1.jpg

 Adding Colour

Time to jazz it up with colour! The method I’ll present was motivated by my answer to this StackOverflow question.

We can add colour by exploiting the way that ggplot2 stacks colour for different groups. Specifically, we fill the bars with the same variable (x) but cut into multiple categories:

ggplot(d, aes(x, fill = cut(x, 100))) +
    geom_histogram()

color1-1.jpg

What the…

Oh, ggplot2 has

Continue reading →


twidlr: data.frame-based API for model and predict functions

@drsimonj here to introduce my latest tidy-modelling package for R, “twidlr”. twidlr wraps model and predict functions you already know and love with a consistent data.frame-based API!

All models wrapped by twidlr can be fit to data and used to make predictions as follows:

library(twidlr)

fit <- model(data, formula, ...)
predict(fit, data, ...)

  • data is a data.frame (or an object that can be coerced to one) and is required
  • formula describes the model to be fit
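Following that template, a sketch of what a concrete call looks like (assuming twidlr is installed from GitHub and its wrappers keep the base model names):

```r
library(twidlr)  # devtools::install_github("drsimonj/twidlr")

# Same data.frame-first pattern regardless of the underlying model:
fit <- lm(mtcars, hp ~ .)  # twidlr's lm: data first, then formula
predict(fit, mtcars)       # predictions also take a data.frame
```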

 The motivation

The APIs of model and predict functions in R are inconsistent and messy.

Some models like linear regression want a formula and data.frame:

lm(hp ~ ., mtcars)

Models like gradient-boosted decision trees want vectors and matrices:

library(xgboost)

y <- mtcars$hp
x <- as.matrix(mtcars[names(mtcars) != "hp"])

xgboost(x, y, nrounds = 5)

Models like generalized linear models want you to work. For

Continue reading →


How and when: ridge regression with glmnet

@drsimonj here to show you how to conduct ridge regression (linear regression with L2 regularization) in R using the glmnet package, and use simulations to demonstrate its relative advantages over ordinary least squares regression.

 Ridge regression

Ridge regression uses L2 regularisation to weight/penalise residuals when the parameters of a regression model are being learned. In the context of linear regression, it can be compared to Ordinary Least Squares (OLS). OLS defines the function by which parameter estimates (intercepts and slopes) are calculated. It involves minimising the sum of squared residuals. L2 regularisation is a small addition to the OLS function that weights residuals in a particular way to make the parameters more stable. The outcome is typically a model that fits the training data less well than OLS but generalises better because it is less sensitive to extreme
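In glmnet terms, ridge is the alpha = 0 case of the elastic net. A minimal sketch (predicting mpg in mtcars purely for illustration):

```r
library(glmnet)

x <- as.matrix(mtcars[, -1])  # predictor matrix (glmnet wants a matrix)
y <- mtcars$mpg               # outcome vector

fit <- glmnet(x, y, alpha = 0)     # alpha = 0 -> pure L2 (ridge) penalty
cv  <- cv.glmnet(x, y, alpha = 0)  # cross-validate the penalty lambda
coef(fit, s = cv$lambda.min)       # shrunken coefficients at chosen lambda
```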

Continue reading →


Easy leave-one-out cross validation with pipelearner

@drsimonj here to show you how to do leave-one-out cross validation using pipelearner.

 Leave-one-out cross validation

Leave-one-out is a type of cross validation whereby the following is done for each observation in the data:

  • Run model on all other observations
  • Use model to predict value for observation

This means that a model is fitted and a prediction is made n times, where n is the number of observations in your data.
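Hand-rolled in base R, those two bullet points look like this (a sketch of the loop pipelearner automates):

```r
# Leave-one-out by hand: n fits, each predicting its held-out row
n <- nrow(mtcars)
loo_pred <- vapply(seq_len(n), function(i) {
  fit <- lm(hp ~ ., data = mtcars[-i, ])  # model on all other observations
  predict(fit, newdata = mtcars[i, ])     # predict the held-out observation
}, numeric(1))

sqrt(mean((loo_pred - mtcars$hp)^2))      # leave-one-out RMSE
```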

 Leave-one-out in pipelearner

pipelearner is a package for streamlining machine learning pipelines, including cross validation. If you’re new to it, check out blogR for other relevant posts.

To demonstrate, let’s use regression to predict horsepower (hp) with all other variables in the mtcars data set. Set this up in pipelearner as follows:

library(pipelearner)

pl <- pipelearner(mtcars, lm, hp ~ .)

How cross validation is done is handled by learn_cvpairs()

Continue reading →


How to create correlation network plots with corrr and ggraph (and which countries drink like Australia)

@drsimonj here to show you how to use ggraph and corrr to create correlation network plots like these:

init-example-a-1.jpeg

init-example-b-1.jpeg

 ggraph and corrr

The ggraph package by Thomas Lin Pedersen has just been published on CRAN and it’s so hot right now! What does it do?

“ggraph is an extension of ggplot2 aimed at supporting relational data structures such as networks, graphs, and trees.”

A relational metric I work with a lot is correlations. Because of this, I created the corrr package, which helps to explore correlations by leveraging data frames and tidyverse tools rather than matrices.

So…

  • corrr creates relational data frames of correlations intended to work with tidyverse tools like ggplot2.
  • ggraph extends ggplot2 to help plot relational structures.

Seems like a perfect match!
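To see what “relational data frames of correlations” means, a quick sketch with corrr alone (a handful of mtcars columns for brevity):

```r
library(corrr)

# A correlation data frame: first column names the variable, the rest are r's
rd <- correlate(mtcars[, c("mpg", "hp", "wt", "qsec")])

# stretch() into a long data.frame of pairs — ready for tidyverse tools
stretch(rd)  # columns: x, y, r
```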

 Libraries

We’ll be using the following libraries:

library(tidyverse)
library(corrr)
library(igraph)
library(ggraph)

Continue reading →


With our powers combined! xgboost and pipelearner

@drsimonj here to show you how to use xgboost (extreme gradient boosting) models in pipelearner.

 Why a post on xgboost and pipelearner?

xgboost is one of the most powerful machine-learning libraries, so there’s a good reason to use it. pipelearner helps to create machine-learning pipelines that make it easy to do cross-fold validation, hyperparameter grid searching, and more. So bringing them together will make for an awesome combination!

The only problem: out of the box, xgboost doesn’t play nice with pipelearner. Let’s work out how to deal with this.
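The mismatch is one of interfaces: pipelearner passes models a data.frame and formula, while xgboost wants matrices. One way to bridge the two is a small wrapper — a sketch, with the name pl_xgboost being my own hypothetical:

```r
library(xgboost)

# A formula/data.frame front-end for xgboost (hypothetical wrapper)
pl_xgboost <- function(data, formula, ...) {
  y <- model.response(model.frame(formula, data))
  x <- model.matrix(formula, data)[, -1, drop = FALSE]  # drop the intercept
  xgboost(data = x, label = y, ...)
}

fit <- pl_xgboost(mtcars, hp ~ ., nrounds = 5, verbose = 0)
```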

 Setup

To follow this post you’ll need the following packages:

# Install (if necessary)
install.packages(c("xgboost", "tidyverse", "devtools"))
devtools::install_github("drsimonj/pipelearner")

# Attach
library(tidyverse)
library(xgboost)
library(pipelearner)
library(lazyeval)

Our example will be to try to predict whether tumours

Continue reading →


Tidy grid search with pipelearner

@drsimonj here to show you how to use pipelearner to easily grid-search hyperparameters for a model.

pipelearner is a package for making machine learning pipelines and is currently available to install from GitHub by running the following:

# install.packages("devtools")  # Run this if devtools isn't installed
devtools::install_github("drsimonj/pipelearner")
library(pipelearner)

In this post we’ll grid search hyperparameters of a decision tree (using the rpart package) predicting cars’ transmission type (automatic or manual) using the mtcars data set. Let’s load rpart along with tidyverse, which pipelearner is intended to work with:

library(tidyverse)
library(rpart)

 The data

Quickly convert our outcome variable to a factor with proper labels:

d <- mtcars %>% 
  mutate(am = factor(am, labels = c("automatic", "manual")))
head(d)
#>    mpg cyl disp  hp drat    wt  qsec vs        am
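Under the hood, a grid search is just fitting one model per combination of hyperparameter values. A base-R sketch of what pipelearner tidies up, using rpart’s minsplit and maxdepth:

```r
library(rpart)

# d as created above: mtcars with am as a labelled factor
d <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# Every combination of the two hyperparameters
grid <- expand.grid(minsplit = c(2, 10, 20), maxdepth = c(2, 4, 6))

# One fitted decision tree per row of the grid
fits <- lapply(seq_len(nrow(grid)), function(i) {
  rpart(am ~ ., data = d,
        control = rpart.control(minsplit = grid$minsplit[i],
                                maxdepth = grid$maxdepth[i]))
})
length(fits)  # 3 x 3 = 9 models
```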

Continue reading →