A tidy model pipeline with twidlr and broom
@drsimonj here to show you how to go from data in a data.frame to a tidy data.frame of model output by combining twidlr and broom in a single, tidy model pipeline.
The problem #
Different model functions take different types of inputs (data.frames, matrices, etc) and produce different types of output! Thus, we’re often confronted with the very untidy challenge presented in this Figure:
Thus, different models may need very different code.
However, it’s possible to create a consistent, tidy pipeline by combining the twidlr and broom packages. Let’s see how this works.
Two-step modelling #
To understand the solution, think of the problem as a two-step process, depicted in this Figure:
Step 1: from data to fitted model #
Step 1 must take data in a data.frame as input and return a fitted model object. twidlr exposes model functions that do just this!
To demonstrate:
#devtools::install_github("drsimonj/twidlr") # To install
library(twidlr)
lm(mtcars, hp ~ .)
#>
#> Call:
#> stats::lm(formula = formula, data = data)
#>
#> Coefficients:
#> (Intercept) mpg cyl disp drat
#> 79.048 -2.063 8.204 0.439 -4.619
#> wt qsec vs am gear
#> -27.660 -1.784 25.813 9.486 7.216
#> carb
#> 18.749
This means we can pipe data.frames into any model function exposed by twidlr. For example:
library(dplyr)
mtcars %>% lm(hp ~ .)
#>
#> Call:
#> stats::lm(formula = formula, data = data)
#>
#> Coefficients:
#> (Intercept) mpg cyl disp drat
#> 79.048 -2.063 8.204 0.439 -4.619
#> wt qsec vs am gear
#> -27.660 -1.784 25.813 9.486 7.216
#> carb
#> 18.749
Step2: fitted model to tidy results #
Step 2 must take a fitted model object as its input and return a tidy data frame of results. This is precisely what the broom package does via three functions: glance
, tidy
, and augment
! To demonstrate:
#install.packages("broom") # To install
library(broom)
fit <- mtcars %>% lm(hp ~ .)
glance(fit)
#> r.squared adj.r.squared sigma statistic p.value df logLik
#> 1 0.9027993 0.8565132 25.97138 19.50477 1.89833e-08 11 -142.8905
#> AIC BIC deviance df.residual
#> 1 309.7809 327.3697 14164.76 21
tidy(fit)
#> term estimate std.error statistic p.value
#> 1 (Intercept) 79.0483879 184.5040756 0.4284371 0.672695339
#> 2 mpg -2.0630545 2.0905650 -0.9868407 0.334955314
#> 3 cyl 8.2037204 10.0861425 0.8133655 0.425134929
#> 4 disp 0.4390024 0.1492007 2.9423609 0.007779725
#> 5 drat -4.6185488 16.0829171 -0.2871711 0.776795845
#> 6 wt -27.6600472 19.2703681 -1.4353668 0.165910518
#> 7 qsec -1.7843654 7.3639133 -0.2423121 0.810889101
#> 8 vs 25.8128774 19.8512410 1.3003156 0.207583411
#> 9 am 9.4862914 20.7599371 0.4569518 0.652397317
#> 10 gear 7.2164047 14.6160152 0.4937327 0.626619355
#> 11 carb 18.7486691 7.0287674 2.6674192 0.014412403
augment(fit) %>% head()
#> .rownames hp mpg cyl disp drat wt qsec vs am gear carb
#> 1 Mazda RX4 110 21.0 6 160 3.90 2.620 16.46 0 1 4 4
#> 2 Mazda RX4 Wag 110 21.0 6 160 3.90 2.875 17.02 0 1 4 4
#> 3 Datsun 710 93 22.8 4 108 3.85 2.320 18.61 1 1 4 1
#> 4 Hornet 4 Drive 110 21.4 6 258 3.08 3.215 19.44 1 0 3 1
#> 5 Hornet Sportabout 175 18.7 8 360 3.15 3.440 17.02 0 0 3 2
#> 6 Valiant 105 18.1 6 225 2.76 3.460 20.22 1 0 3 1
#> .fitted .resid .hat .sigma .cooksd .std.resid
#> 1 148.68122 -38.681220 0.2142214 24.75946 0.069964902 -1.6801773
#> 2 140.62866 -30.628664 0.2323739 25.43881 0.049861042 -1.3460408
#> 3 79.99158 13.008418 0.3075987 26.38216 0.014633059 0.6019364
#> 4 125.75448 -15.754483 0.2103960 26.31579 0.011288712 -0.6826601
#> 5 183.21756 -8.217565 0.2016137 26.53317 0.002878707 -0.3541128
#> 6 111.38490 -6.384902 0.3147448 26.55680 0.003682813 -0.2969840
A single, tidy pipeline #
So twidlr and broom functions can be combined into a single, tidy pipeline to go from data.frame to tidy output:
library(twidlr)
library(broom)
mtcars %>%
lm(hp ~ .) %>%
glance()
#> r.squared adj.r.squared sigma statistic p.value df logLik
#> 1 0.9027993 0.8565132 25.97138 19.50477 1.89833e-08 11 -142.8905
#> AIC BIC deviance df.residual
#> 1 309.7809 327.3697 14164.76 21
Any model included in twidlr and broom can be used in this same way. Here’s a kmeans
example:
iris %>%
select(-Species) %>%
kmeans(centers = 3) %>%
tidy()
#> x1 x2 x3 x4 size withinss cluster
#> 1 5.901613 2.748387 4.393548 1.433871 62 39.82097 1
#> 2 5.006000 3.428000 1.462000 0.246000 50 15.15100 2
#> 3 6.850000 3.073684 5.742105 2.071053 38 23.87947 3
And a ridge regression with cross-fold validation example:
mtcars %>%
cv.glmnet(am ~ ., alpha = 0) %>%
glance()
#> lambda.min lambda.1se
#> 1 0.2284167 0.8402035
So next time you want to do some tidy modelling, keep this pipeline in mind:
Limitations #
Currently, the major limitation for this approach is that a model must be covered by twidlr and broom. For example, you can’t use randomForest
in this pipeline because, although twidlr exposes a data.frame friendly version of it, broom doesn’t provide tidying methods for it. So if you want to write tidy code for a model that isn’t covered by these packages, have a go at helping out by contributing to these open source projects! To get started creating and contributing to R packages, take a look at Hadley Wickham’s free book, “R Packages”.
Sign off #
Thanks for reading and I hope this was useful for you.
For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.
If you’d like the code that produced this blog, check out the blogR GitHub repository.