June 1, 2017

A tidy model pipeline with twidlr and broom

@drsimonj here to show you how to go from data in a data.frame to a tidy data.frame of model output by combining twidlr and broom in a single, tidy model pipeline.

The problem #

Different model functions take different types of inputs (data.frames, matrices, etc) and produce different types of output! Thus, we’re often confronted with the very untidy challenge presented in this Figure:

Thus, different models may need very different code.

However, it’s possible to create a consistent, tidy pipeline by combining the twidlr and broom packages. Let’s see how this works.

Two-step modelling #

To understand the solution, think of the problem as a two-step process, depicted in this Figure:

Step 1: from data to fitted model #

Step 1 must take data in a data.frame as input and return a fitted model object. twidlr exposes model functions that do just this!

To demonstrate:

#devtools::install_github("drsimonj/twidlr")  # To install
library(twidlr)

lm(mtcars, hp ~ .)
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#> (Intercept)          mpg          cyl         disp         drat  
#>      79.048       -2.063        8.204        0.439       -4.619  
#>          wt         qsec           vs           am         gear  
#>     -27.660       -1.784       25.813        9.486        7.216  
#>        carb  
#>      18.749

This means we can pipe data.frames into any model function exposed by twidlr. For example:

library(dplyr)

mtcars %>% lm(hp ~ .)
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#> (Intercept)          mpg          cyl         disp         drat  
#>      79.048       -2.063        8.204        0.439       -4.619  
#>          wt         qsec           vs           am         gear  
#>     -27.660       -1.784       25.813        9.486        7.216  
#>        carb  
#>      18.749

Step2: fitted model to tidy results #

Step 2 must take a fitted model object as its input and return a tidy data frame of results. This is precisely what the broom package does via three functions: glance, tidy, and augment! To demonstrate:

#install.packages("broom")  # To install
library(broom)

fit <- mtcars %>% lm(hp ~ .)

glance(fit)
#>   r.squared adj.r.squared    sigma statistic     p.value df    logLik
#> 1 0.9027993     0.8565132 25.97138  19.50477 1.89833e-08 11 -142.8905
#>        AIC      BIC deviance df.residual
#> 1 309.7809 327.3697 14164.76          21

tidy(fit)
#>           term    estimate   std.error  statistic     p.value
#> 1  (Intercept)  79.0483879 184.5040756  0.4284371 0.672695339
#> 2          mpg  -2.0630545   2.0905650 -0.9868407 0.334955314
#> 3          cyl   8.2037204  10.0861425  0.8133655 0.425134929
#> 4         disp   0.4390024   0.1492007  2.9423609 0.007779725
#> 5         drat  -4.6185488  16.0829171 -0.2871711 0.776795845
#> 6           wt -27.6600472  19.2703681 -1.4353668 0.165910518
#> 7         qsec  -1.7843654   7.3639133 -0.2423121 0.810889101
#> 8           vs  25.8128774  19.8512410  1.3003156 0.207583411
#> 9           am   9.4862914  20.7599371  0.4569518 0.652397317
#> 10        gear   7.2164047  14.6160152  0.4937327 0.626619355
#> 11        carb  18.7486691   7.0287674  2.6674192 0.014412403

augment(fit) %>% head()
#>           .rownames  hp  mpg cyl disp drat    wt  qsec vs am gear carb
#> 1         Mazda RX4 110 21.0   6  160 3.90 2.620 16.46  0  1    4    4
#> 2     Mazda RX4 Wag 110 21.0   6  160 3.90 2.875 17.02  0  1    4    4
#> 3        Datsun 710  93 22.8   4  108 3.85 2.320 18.61  1  1    4    1
#> 4    Hornet 4 Drive 110 21.4   6  258 3.08 3.215 19.44  1  0    3    1
#> 5 Hornet Sportabout 175 18.7   8  360 3.15 3.440 17.02  0  0    3    2
#> 6           Valiant 105 18.1   6  225 2.76 3.460 20.22  1  0    3    1
#>     .fitted     .resid      .hat   .sigma     .cooksd .std.resid
#> 1 148.68122 -38.681220 0.2142214 24.75946 0.069964902 -1.6801773
#> 2 140.62866 -30.628664 0.2323739 25.43881 0.049861042 -1.3460408
#> 3  79.99158  13.008418 0.3075987 26.38216 0.014633059  0.6019364
#> 4 125.75448 -15.754483 0.2103960 26.31579 0.011288712 -0.6826601
#> 5 183.21756  -8.217565 0.2016137 26.53317 0.002878707 -0.3541128
#> 6 111.38490  -6.384902 0.3147448 26.55680 0.003682813 -0.2969840

A single, tidy pipeline #

So twidlr and broom functions can be combined into a single, tidy pipeline to go from data.frame to tidy output:

library(twidlr)
library(broom)

mtcars %>% 
  lm(hp ~ .)  %>% 
  glance()
#>   r.squared adj.r.squared    sigma statistic     p.value df    logLik
#> 1 0.9027993     0.8565132 25.97138  19.50477 1.89833e-08 11 -142.8905
#>        AIC      BIC deviance df.residual
#> 1 309.7809 327.3697 14164.76          21

Any model included in twidlr and broom can be used in this same way. Here’s a kmeans example:

iris %>%
  select(-Species) %>% 
  kmeans(centers = 3) %>% 
  tidy()
#>         x1       x2       x3       x4 size withinss cluster
#> 1 5.901613 2.748387 4.393548 1.433871   62 39.82097       1
#> 2 5.006000 3.428000 1.462000 0.246000   50 15.15100       2
#> 3 6.850000 3.073684 5.742105 2.071053   38 23.87947       3

And a ridge regression with cross-fold validation example:

mtcars %>% 
  cv.glmnet(am ~ ., alpha = 0) %>% 
  glance()
#>   lambda.min lambda.1se
#> 1  0.2284167  0.8402035

So next time you want to do some tidy modelling, keep this pipeline in mind:

Limitations #

Currently, the major limitation for this approach is that a model must be covered by twidlr and broom. For example, you can’t use randomForest in this pipeline because, although twidlr exposes a data.frame friendly version of it, broom doesn’t provide tidying methods for it. So if you want to write tidy code for a model that isn’t covered by these packages, have a go at helping out by contributing to these open source projects! To get started creating and contributing to R packages, take a look at Hadley Wickham’s free book, “R Packages”.

Sign off #

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

Kudos