February 1, 2017

Tidy grid search with pipelearner

@drsimonj here to show you how to use pipelearner to easily grid-search hyperparameters for a model.

pipelearner is a package for making machine learning pipelines and is currently available to install from GitHub by running the following:

# install.packages("devtools")  # Run this if devtools isn't installed
devtools::install_github("drsimonj/pipelearner")
library(pipelearner)

In this post we’ll grid search hyperparameters of a decision tree (using the rpart package) predicting cars’ transmission type (automatic or manual) using the mtcars data set. Let’s load rpart along with tidyverse, which pipelearner is intended to work with:

library(tidyverse)
library(rpart)

The data #

Quickly convert our outcome variable to a factor with proper labels:

d <- mtcars %>% 
  mutate(am = factor(am, labels = c("automatic", "manual")))
head(d)
#>    mpg cyl disp  hp drat    wt  qsec vs        am gear carb
#> 1 21.0   6  160 110 3.90 2.620 16.46  0    manual    4    4
#> 2 21.0   6  160 110 3.90 2.875 17.02  0    manual    4    4
#> 3 22.8   4  108  93 3.85 2.320 18.61  1    manual    4    1
#> 4 21.4   6  258 110 3.08 3.215 19.44  1 automatic    3    1
#> 5 18.7   8  360 175 3.15 3.440 17.02  0 automatic    3    2
#> 6 18.1   6  225 105 2.76 3.460 20.22  1 automatic    3    1
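
If you want to double-check the recoding, here is a small sketch in base R (so it runs without tidyverse). It cross-tabulates the original 0/1 codes against the new labels and computes the share of the most common class, which is the accuracy you'd get by always guessing that class. The name `baseline` is just an illustrative variable, not something pipelearner uses.

```r
# Recreate the recoded data with base R only
d <- mtcars
d$am <- factor(d$am, labels = c("automatic", "manual"))

# Cross-tabulate the original 0/1 codes against the new labels
table(original = mtcars$am, recoded = d$am)

# Share of the most common class: the accuracy of always guessing that class
baseline <- max(prop.table(table(d$am)))
baseline
#> [1] 0.59375
```

Keeping that number in mind makes the accuracies later in the post easier to judge.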

Default hyperparameters #

We’ll first create a pipelearner object that uses the default hyperparameters of the decision tree.

pl <- d %>% pipelearner(rpart, am ~ .)
pl
#> $data
#> # A tibble: 32 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs        am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <fctr> <dbl> <dbl>
#> 1   21.0     6 160.0   110  3.90 2.620 16.46     0    manual     4     4
#> 2   21.0     6 160.0   110  3.90 2.875 17.02     0    manual     4     4
#> 3   22.8     4 108.0    93  3.85 2.320 18.61     1    manual     4     1
#> 4   21.4     6 258.0   110  3.08 3.215 19.44     1 automatic     3     1
#> 5   18.7     8 360.0   175  3.15 3.440 17.02     0 automatic     3     2
#> 6   18.1     6 225.0   105  2.76 3.460 20.22     1 automatic     3     1
#> 7   14.3     8 360.0   245  3.21 3.570 15.84     0 automatic     3     4
#> 8   24.4     4 146.7    62  3.69 3.190 20.00     1 automatic     4     2
#> 9   22.8     4 140.8    95  3.92 3.150 22.90     1 automatic     4     2
#> 10  19.2     6 167.6   123  3.92 3.440 18.30     1 automatic     4     4
#> # ... with 22 more rows
#> 
#> $cv_pairs
#> # A tibble: 1 × 3
#>            train           test   .id
#>           <list>         <list> <chr>
#> 1 <S3: resample> <S3: resample>     1
#> 
#> $train_ps
#> [1] 1
#> 
#> $models
#> # A tibble: 1 × 5
#>   target model     params     .f   .id
#>    <chr> <chr>     <list> <list> <chr>
#> 1     am rpart <list [1]>  <fun>     1
#> 
#> attr(,"class")
#> [1] "pipelearner"

Fit the model with learn():

results <- pl %>% learn()
results
#> # A tibble: 1 × 9
#>   models.id cv_pairs.id train_p         fit target model     params
#>       <chr>       <chr>   <dbl>      <list>  <chr> <chr>     <list>
#> 1         1           1       1 <S3: rpart>     am rpart <list [1]>
#> # ... with 2 more variables: train <list>, test <list>

The fitted results include our single model. Let’s assess the model’s performance on the training and test sets:

# Function to compute accuracy
accuracy <- function(fit, data, target_var) {
  # Coerce `data` to data.frame (needed for resample objects)
  data <- as.data.frame(data)
  # Obtain predicted class
  predicted <- predict(fit, data, type = "class")
  # Return accuracy
  mean(predicted == data[[target_var]])
}

# Training accuracy
accuracy(results$fit[[1]], results$train[[1]], results$target[[1]])
#> [1] 0.92

# Test accuracy
accuracy(results$fit[[1]], results$test[[1]], results$target[[1]])
#> [1] 0.8571429

Looks like we’ve achieved 92% accuracy on the training data and 86% accuracy on the test data. Perhaps we can improve on this by tweaking the model’s hyperparameters.
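
As a quick sanity check on those figures, here is the arithmetic behind them. The counts are an assumption inferred from the output rather than anything pipelearner reports here: 0.92 and 0.8571429 are what you'd see if the 32 cars were split into 25 training rows and 7 test rows, with 23 and 6 of them classified correctly.

```r
# Training accuracy: 23 correct out of an assumed 25 training rows
train_acc <- 23 / 25
train_acc
#> [1] 0.92

# Test accuracy: 6 correct out of an assumed 7 test rows
test_acc <- 6 / 7
test_acc
#> [1] 0.8571429
```

With a test set this small, a single misclassified car moves test accuracy by about 14 percentage points, so small differences between models shouldn't be over-interpreted.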

Adding hyperparameters #

When using pipelearner, you can add any arguments that the learning function will accept after the formula. For example, run ?rpart and you’ll see that control options can be added. To see these options, run ?rpart.control.

An obvious choice for decision trees is minsplit, which determines “the minimum number of observations that must exist in a node in order for a split to be attempted.” By default it’s set to 20. Given that we have such a small data set, this seems like a poor choice. We can adjust it as follows:

pl <- d %>% pipelearner(rpart, am ~ ., minsplit = 5)
results <- pl %>% learn()

# Training accuracy
accuracy(results$fit[[1]], results$train[[1]], results$target[[1]])
#> [1] 0.92

# Test accuracy
accuracy(results$fit[[1]], results$test[[1]], results$target[[1]])
#> [1] 0.8571429

Here the accuracies are unchanged, but reducing minsplit will generally increase your training accuracy. Set it too small, however, and you’ll overfit the training data, resulting in poorer test accuracy.
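
To see where minsplit ends up, here is a minimal sketch that calls rpart() directly, outside of pipelearner. It fits on the full data set purely for illustration (no train/test split), and checks the value stored in the fitted tree's control list.

```r
library(rpart)

# Same outcome coding as in the post, done in base R
d <- mtcars
d$am <- factor(d$am, labels = c("automatic", "manual"))

# Default value documented in ?rpart.control
rpart.control()$minsplit
#> [1] 20

# Extra arguments to rpart() are forwarded to rpart.control()
fit <- rpart(am ~ ., data = d, minsplit = 5)
fit$control$minsplit
#> [1] 5
```

The same check works for other control options such as maxdepth or xval.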

Using vectors #

All the model arguments you provide to pipelearner() can be vectors. pipelearner will then automatically expand those vectors into a grid and test all combinations. For example, let’s try out many values for minsplit:

pl <- d %>% pipelearner(rpart, am ~ ., minsplit = c(2, 4, 6, 8, 10))
results <- pl %>% learn()
results
#> # A tibble: 5 × 9
#>   models.id cv_pairs.id train_p         fit target model     params
#>       <chr>       <chr>   <dbl>      <list>  <chr> <chr>     <list>
#> 1         1           1       1 <S3: rpart>     am rpart <list [2]>
#> 2         2           1       1 <S3: rpart>     am rpart <list [2]>
#> 3         3           1       1 <S3: rpart>     am rpart <list [2]>
#> 4         4           1       1 <S3: rpart>     am rpart <list [2]>
#> 5         5           1       1 <S3: rpart>     am rpart <list [2]>
#> # ... with 2 more variables: train <list>, test <list>

Combining mutate from dplyr and map functions from the purrr package (all loaded with tidyverse), we can extract the relevant information for each value of minsplit:

results <- results %>% 
  mutate(
    minsplit = map_dbl(params, "minsplit"),
    accuracy_train = pmap_dbl(list(fit, train, target), accuracy),
    accuracy_test  = pmap_dbl(list(fit, test,  target), accuracy)
  )

results %>% select(minsplit, contains("accuracy"))
#> # A tibble: 5 × 3
#>   minsplit accuracy_train accuracy_test
#>      <dbl>          <dbl>         <dbl>
#> 1        2              1     0.5714286
#> 2        4              1     0.5714286
#> 3        6              1     0.5714286
#> 4        8              1     0.5714286
#> 5       10              1     0.5714286
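
If the map_dbl() and pmap_dbl() calls above feel opaque, here is a rough base R sketch of what they do. The `params` list below is a hypothetical stand-in for the real list column (its exact contents are an assumption), and vapply() and mapply() are used as approximate analogues, not as what purrr runs internally.

```r
# Hypothetical stand-in for the `params` list column: one list per model
params <- list(
  list(formula = am ~ ., minsplit = 2),
  list(formula = am ~ ., minsplit = 4),
  list(formula = am ~ ., minsplit = 6)
)

# Analogue of map_dbl(params, "minsplit"): pull one named element
# out of each list and return a plain numeric vector
minsplit <- vapply(params, function(p) p[["minsplit"]], numeric(1))
minsplit
#> [1] 2 4 6

# Analogue of pmap_dbl(list(x, y), f): call f once per position,
# using the i-th element of every input
x <- c(1, 2, 3)
y <- c(10, 20, 30)
sums <- mapply(function(a, b) a + b, x, y)
sums
#> [1] 11 22 33
```

In the post, the function applied at each position is accuracy(), and the inputs are the fit, the data, and the target name for each row of results.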

Vector expansion applies to as many hyperparameters as you care to add. For example, let’s grid search combinations of values for minsplit, maxdepth, and xval:

pl <- d %>% pipelearner(rpart, am ~ .,
                        minsplit = c(2, 20),
                        maxdepth = c(2, 5),
                        xval     = c(5, 10))
pl %>%
  learn() %>%
  mutate(
    minsplit = map_dbl(params, "minsplit"),
    maxdepth = map_dbl(params, "maxdepth"),
    xval     = map_dbl(params, "xval"),
    accuracy_train = pmap_dbl(list(fit, train, target), accuracy),
    accuracy_test  = pmap_dbl(list(fit, test,  target), accuracy)
  ) %>%
  select(minsplit, maxdepth, xval, contains("accuracy"))
#> # A tibble: 8 × 5
#>   minsplit maxdepth  xval accuracy_train accuracy_test
#>      <dbl>    <dbl> <dbl>          <dbl>         <dbl>
#> 1        2        2     5           1.00     0.8571429
#> 2       20        2     5           0.92     0.8571429
#> 3        2        5     5           1.00     0.8571429
#> 4       20        5     5           0.92     0.8571429
#> 5        2        2    10           1.00     0.8571429
#> 6       20        2    10           0.92     0.8571429
#> 7        2        5    10           1.00     0.8571429
#> 8       20        5    10           0.92     0.8571429

Test accuracy doesn’t vary here, and training accuracy changes only with minsplit, but this demonstrates how you can use grid search in your own work.
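
The eight rows above are every combination of the three two-value vectors. As an illustration of that expansion (a sketch with base R's expand.grid(), not necessarily how pipelearner builds its grid internally):

```r
# Cartesian product of the three candidate vectors
grid <- expand.grid(
  minsplit = c(2, 20),
  maxdepth = c(2, 5),
  xval     = c(5, 10)
)

# 2 * 2 * 2 combinations
nrow(grid)
#> [1] 8

grid
```

The first argument varies fastest, which matches the row order in the results table. Keep in mind that the number of models multiplies with every vector you add.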

Using learn_models() #

A bonus tip for those of you who are comfortable so far: you can use learn_models() to isolate multiple grid searches. For example:

pl <- d %>%
  pipelearner() %>% 
  learn_models(rpart, am ~ ., minsplit = c(1, 2), maxdepth = c(4, 5)) %>% 
  learn_models(rpart, am ~ ., minsplit = c(6, 7), maxdepth = c(1, 2))

pl %>%
  learn() %>%
  mutate(
    minsplit = map_dbl(params, "minsplit"),
    maxdepth = map_dbl(params, "maxdepth"),
    accuracy_train = pmap_dbl(list(fit, train, target), accuracy),
    accuracy_test  = pmap_dbl(list(fit, test,  target), accuracy)
  ) %>%
  select(minsplit, maxdepth, contains("accuracy"))
#> # A tibble: 8 × 4
#>   minsplit maxdepth accuracy_train accuracy_test
#>      <dbl>    <dbl>          <dbl>         <dbl>
#> 1        1        4           1.00     1.0000000
#> 2        2        4           1.00     1.0000000
#> 3        1        5           1.00     1.0000000
#> 4        2        5           1.00     1.0000000
#> 5        6        1           0.88     0.8571429
#> 6        7        1           0.88     0.8571429
#> 7        6        2           0.96     0.8571429
#> 8        7        2           0.96     0.8571429

Notice the separate grid searches for minsplit = c(1, 2), maxdepth = c(4, 5) and minsplit = c(6, 7), maxdepth = c(1, 2).

This is because grid search is applied separately for each model defined by a learn_models() call. This means you can keep different hyperparameter combinations separate if you want to.
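
To make the difference concrete, here is a sketch in base R (again using expand.grid() only as an illustration, with made-up variable names) comparing two separate grids against a single grid over all the values:

```r
# Two separate grids, one per learn_models() call in the example above
grid_a <- expand.grid(minsplit = c(1, 2), maxdepth = c(4, 5))
grid_b <- expand.grid(minsplit = c(6, 7), maxdepth = c(1, 2))
separate <- rbind(grid_a, grid_b)
nrow(separate)
#> [1] 8

# One grid over all values pooled together, for comparison
combined <- expand.grid(minsplit = c(1, 2, 6, 7), maxdepth = c(4, 5, 1, 2))
nrow(combined)
#> [1] 16
```

Separate grids fit half as many models here, and never pair, say, minsplit = 1 with maxdepth = 1.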

Sign off #

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.
