A tidy model pipeline with twidlr and broom  

@drsimonj here to show you how to go from data in a data.frame to a tidy data.frame of model output by combining twidlr and broom in a single, tidy model pipeline.

The problem #

Different model functions take different types of inputs (data.frames, matrices, etc) and produce different types of output! Thus, we’re often confronted with the very untidy challenge presented in this Figure:

problem.png

Thus, different models may need very different code.

However, it’s possible to create a consistent, tidy pipeline by combining the twidlr and broom packages. Let’s see how this works.

Two-step modelling #

To understand the solution, think of the problem as a two-step process, depicted in this Figure:

two-step.png

Step 1: from data to fitted model #

Step 1 must take data in a data.frame as input and return a fitted model object. twidlr exposes model functions that do just this!

To demonstrate:

#devtools::install_github("drsimonj/twidlr")  # To install
library(twidlr)

lm(mtcars, hp ~ .)
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#> (Intercept)          mpg          cyl         disp         drat  
#>      79.048       -2.063        8.204        0.439       -4.619  
#>          wt         qsec           vs           am         gear  
#>     -27.660       -1.784       25.813        9.486        7.216  
#>        carb  
#>      18.749

This means we can pipe data.frames into any model function exposed by twidlr. For example:

library(dplyr)

mtcars %>% lm(hp ~ .)
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#> (Intercept)          mpg          cyl         disp         drat  
#>      79.048       -2.063        8.204        0.439       -4.619  
#>          wt         qsec           vs           am         gear  
#>     -27.660       -1.784       25.813        9.486        7.216  
#>        carb  
#>      18.749

Step2: fitted model to tidy results #

Step 2 must take a fitted model object as its input and return a tidy data frame of results. This is precisely what the broom package does via three functions: glance, tidy, and augment! To demonstrate:

#install.packages("broom")  # To install
library(broom)

fit <- mtcars %>% lm(hp ~ .)

glance(fit)
#>   r.squared adj.r.squared    sigma statistic     p.value df    logLik
#> 1 0.9027993     0.8565132 25.97138  19.50477 1.89833e-08 11 -142.8905
#>        AIC      BIC deviance df.residual
#> 1 309.7809 327.3697 14164.76          21

tidy(fit)
#>           term    estimate   std.error  statistic     p.value
#> 1  (Intercept)  79.0483879 184.5040756  0.4284371 0.672695339
#> 2          mpg  -2.0630545   2.0905650 -0.9868407 0.334955314
#> 3          cyl   8.2037204  10.0861425  0.8133655 0.425134929
#> 4         disp   0.4390024   0.1492007  2.9423609 0.007779725
#> 5         drat  -4.6185488  16.0829171 -0.2871711 0.776795845
#> 6           wt -27.6600472  19.2703681 -1.4353668 0.165910518
#> 7         qsec  -1.7843654   7.3639133 -0.2423121 0.810889101
#> 8           vs  25.8128774  19.8512410  1.3003156 0.207583411
#> 9           am   9.4862914  20.7599371  0.4569518 0.652397317
#> 10        gear   7.2164047  14.6160152  0.4937327 0.626619355
#> 11        carb  18.7486691   7.0287674  2.6674192 0.014412403

augment(fit) %>% head()
#>           .rownames  hp  mpg cyl disp drat    wt  qsec vs am gear carb
#> 1         Mazda RX4 110 21.0   6  160 3.90 2.620 16.46  0  1    4    4
#> 2     Mazda RX4 Wag 110 21.0   6  160 3.90 2.875 17.02  0  1    4    4
#> 3        Datsun 710  93 22.8   4  108 3.85 2.320 18.61  1  1    4    1
#> 4    Hornet 4 Drive 110 21.4   6  258 3.08 3.215 19.44  1  0    3    1
#> 5 Hornet Sportabout 175 18.7   8  360 3.15 3.440 17.02  0  0    3    2
#> 6           Valiant 105 18.1   6  225 2.76 3.460 20.22  1  0    3    1
#>     .fitted     .resid      .hat   .sigma     .cooksd .std.resid
#> 1 148.68122 -38.681220 0.2142214 24.75946 0.069964902 -1.6801773
#> 2 140.62866 -30.628664 0.2323739 25.43881 0.049861042 -1.3460408
#> 3  79.99158  13.008418 0.3075987 26.38216 0.014633059  0.6019364
#> 4 125.75448 -15.754483 0.2103960 26.31579 0.011288712 -0.6826601
#> 5 183.21756  -8.217565 0.2016137 26.53317 0.002878707 -0.3541128
#> 6 111.38490  -6.384902 0.3147448 26.55680 0.003682813 -0.2969840

A single, tidy pipeline #

So twidlr and broom functions can be combined into a single, tidy pipeline to go from data.frame to tidy output:

library(twidlr)
library(broom)

mtcars %>% 
  lm(hp ~ .)  %>% 
  glance()
#>   r.squared adj.r.squared    sigma statistic     p.value df    logLik
#> 1 0.9027993     0.8565132 25.97138  19.50477 1.89833e-08 11 -142.8905
#>        AIC      BIC deviance df.residual
#> 1 309.7809 327.3697 14164.76          21

Any model included in twidlr and broom can be used in this same way. Here’s a kmeans example:

iris %>%
  select(-Species) %>% 
  kmeans(centers = 3) %>% 
  tidy()
#>         x1       x2       x3       x4 size withinss cluster
#> 1 5.901613 2.748387 4.393548 1.433871   62 39.82097       1
#> 2 5.006000 3.428000 1.462000 0.246000   50 15.15100       2
#> 3 6.850000 3.073684 5.742105 2.071053   38 23.87947       3

And a ridge regression with cross-fold validation example:

mtcars %>% 
  cv.glmnet(am ~ ., alpha = 0) %>% 
  glance()
#>   lambda.min lambda.1se
#> 1  0.2284167  0.8402035

So next time you want to do some tidy modelling, keep this pipeline in mind:

pipeline.png

Limitations #

Currently, the major limitation for this approach is that a model must be covered by twidlr and broom. For example, you can’t use randomForest in this pipeline because, although twidlr exposes a data.frame friendly version of it, broom doesn’t provide tidying methods for it. So if you want to write tidy code for a model that isn’t covered by these packages, have a go at helping out by contributing to these open source projects! To get started creating and contributing to R packages, take a look at Hadley Wickham’s free book, “R Packages”.

Sign off #

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

 
76
Kudos
 
76
Kudos

Now read this

Label line ends in time series with ggplot2

@drsimonj here with a quick share on making great use of the secondary y axis with ggplot2 – super helpful if you’re plotting groups of time series! Here’s an example of what I want to show you how to create (pay attention to the numbers... Continue →