Photo by @szmigieldesign on Unsplash

Predict Credit Default with Random Forest Model with Tuned Hyperparameters

2020-06-07 14-minute read

I just want to focus on implementing the workflow of this algorithm in R, so I'll keep the explanations very brief. That said, there are some excellent articles I referenced on hyperparameters, random forests, and combining the two.

I found this UCI Machine Learning Repository dataset containing Portuguese bank data from 2009. Its original purpose was to determine how successful the bank's marketing campaigns were at getting people to deposit money, but it has a lot of variables I can use to tell another story with this dataset!

Background terminology

Random forest

Random forest is a classification/regression algorithm that builds many decision trees, which act as a committee making decisions by majority vote. The most important detail is that the trees are uncorrelated with each other. By creating many, many decision trees that don’t look like each other, the random forest “hedges” the risk of getting the wrong prediction: if one tree messes up, but many other trees built from completely different features get it right, the first tree matters less :) Much like a diversified investment portfolio. On top of this, to ensure even more randomness, the split criterion at each node of each tree is chosen from a randomly selected subset of variables.
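
To get a feel for why uncorrelated trees help, here’s a tiny back-of-the-envelope simulation (my own illustration, not part of the model we build below): if a single tree is right only 60% of the time, a majority vote of 100 independent trees is right far more often.

set.seed(1)
# chance that one tree gets the answer right
p_single <- 0.6
# simulate 10,000 committees of 100 independent trees voting by majority
committee_right <- replicate(10000, mean(rbinom(100, 1, p_single)) > 0.5)
mean(committee_right)  # roughly 0.97, versus 0.6 for a single tree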

Tuning hyperparameters

Hyperparameters are the settings of a training process, and they must be set by a human before training starts. For the random forest we’re looking at today, they are:

  • Number of trees
  • Number of features allowed in nodes
  • Minimum number of data points in a node that qualifies a split.

On the other hand, a parameter is something the training process learns on its own, such as the slope/intercept in a linear model or the coefficients in a logistic model.
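
A quick way to see the difference, using R’s built-in mtcars data (just an illustration, separate from the bank model):

library(tidymodels)  # for rand_forest()

# Parameters are learned: lm() estimates the intercept and slope itself.
coef(lm(mpg ~ wt, data = mtcars))

# Hyperparameters are set before training: mtry, trees and min_n below are
# our choices, not something the fitting process estimates from the data.
rand_forest(mtry = 3, trees = 100, min_n = 10) %>% 
  set_mode("classification") %>% 
  set_engine("ranger")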

Grid search & cross validation

It sounds very brute force-y, but tuning hyperparameters means trying as many combinations as possible and picking the best-performing specs. We’re going to use a technique called “grid search” to do this.

Finally, we’ll use resampling methods like the bootstrap or cross-validation, so each hyperparameter combination gets evaluated on several different slices of the training data rather than just one. I don’t fully understand the benefits/downsides of choosing one over the other, but I’m going to use cross-validation for this post.
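
For a concrete picture of what a “grid” of candidates looks like, the dials package (part of tidymodels) can generate one directly. A small sketch, separate from the actual tuning we do later:

library(dials)

# draw 5 random hyperparameter combinations; mtry needs an explicit range
# because its upper bound depends on how many predictors the data has
set.seed(1)
grid_random(mtry(range = c(1, 15)), min_n(), size = 5)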

Let’s build the model!

First, I quickly cleaned the column names and renamed the row-index column to id.

library(tidyverse)
library(tidymodels)
theme_set(theme_light())

bank <- read_csv("bank_cleaned.csv") %>% 
  janitor::clean_names() %>% 
  rename(id = x1)

bank_rough <- bank %>%
  select(-response_binary) %>% 
  mutate_if(is.character, factor)

bank_rough
## # A tibble: 40,841 x 17
##       id   age job   marital education default balance housing loan    day month
##    <dbl> <dbl> <fct> <fct>   <fct>     <fct>     <dbl> <fct>   <fct> <dbl> <fct>
##  1     0    58 mana~ married tertiary  no         2143 yes     no        5 may  
##  2     1    44 tech~ single  secondary no           29 yes     no        5 may  
##  3     2    33 entr~ married secondary no            2 yes     yes       5 may  
##  4     5    35 mana~ married tertiary  no          231 yes     no        5 may  
##  5     6    28 mana~ single  tertiary  no          447 yes     yes       5 may  
##  6     7    42 entr~ divorc~ tertiary  yes           2 yes     no        5 may  
##  7     8    58 reti~ married primary   no          121 yes     no        5 may  
##  8     9    43 tech~ single  secondary no          593 yes     no        5 may  
##  9    10    41 admi~ divorc~ secondary no          270 yes     no        5 may  
## 10    11    29 admi~ single  secondary no          390 yes     no        5 may  
## # ... with 40,831 more rows, and 6 more variables: duration <dbl>,
## #   campaign <dbl>, pdays <dbl>, previous <dbl>, poutcome <fct>, response <fct>

Quick EDA

This is something I like to do with numeric data: take all the columns of class dbl, pivot the dataframe longer, and make a histogram of each variable.

bank_rough %>% 
  select_if(is.double) %>% 
  select(-id) %>% 
  pivot_longer(age:previous) %>% 
  ggplot(aes(value)) +
  geom_histogram() +
  facet_wrap(~name, scales = "free")

Interesting to see that there are negative balances; that’s probably going to be a major predictor.
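
Just out of curiosity, here’s a quick count of how many accounts are overdrawn (a small check; output not shown):

# how many customers have a negative balance?
bank_rough %>% 
  count(overdrawn = balance < 0)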

For the columns that weren’t selected, I’m going to do the exact same thing as above, but with column charts. This is a trick I learned recently: the powerful reorder_within & scale_y_reordered functions from the tidytext package make it really easy to get a grasp of our data.

library(tidytext)

bank_rough %>% 
  select_if(is.factor) %>% 
  mutate_all(as.character) %>% 
  pivot_longer(job:response) %>% 
  count(name, value) %>% 
  mutate(value = reorder_within(value, n, name)) %>% 
  ggplot(aes(n, value, fill = value)) +
  geom_col() +
  theme(legend.position = "none") +
  facet_wrap(~name, scales = "free") +
  scale_y_reordered()

Now that we’ve seen all the variables in this dataset, I’m going to keep the ones I think are important.

bank_proc <- bank_rough %>% 
  select(id, education, default, housing, job, loan, marital, age, balance)

Model Building

Let’s split our data into training/testing sets. The “strata” argument locks in the proportions of the stratification variable (here housing), so the training and testing sets get the same % mix of its levels.

library(tidymodels)
set.seed(123)

bank_split <- initial_split(bank_proc, strata = housing)
bank_train <- training(bank_split)
bank_test <- testing(bank_split)
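
If you want to verify what strata did, a quick sanity check (output not shown) is to compare the housing mix in both sets; the proportions should be nearly identical:

# proportion of each housing level in the training vs. testing set
bank_train %>% count(housing) %>% mutate(prop = n / sum(n))
bank_test %>% count(housing) %>% mutate(prop = n / sum(n))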

I am going to tune our model twice. The tuning comes after setting up a recipe, the algorithm specs, a workflow, and a CV set. In the first tuning, I’ll randomly pick 20 combinations of hyperparameters (mtry and min_n), look at the results, narrow down the range of combinations, and then tune once more. We’re going to do this the tidy way, with tidymodels. I’m excited to see what we can do with this!

1. Recipe

In tidymodels, a recipe works much like a cooking recipe. You first say what you’re going to cook (credit default explained by all the other variables in bank_train), and then add the “steps” you need to finish cooking. A recipe isn’t trained on any data yet; it’s just a set of specifications for how the data gets prepared before it’s fed into the model! The steps we take are:

  1. update_role: tell the recipe that the id column is literally just an id, so it shouldn’t be used as a predictor.
  2. step_other: lump infrequent levels of a column together. There were too many job categories, so the least frequent ones get collapsed into “other”.
  3. step_dummy: turn each level of every nominal column into a dummy variable.
  4. step_downsample: make sure the training data has equal numbers of defaults and non-defaults.

bank_rec <- recipe(default ~ ., data = bank_train) %>% 
  update_role(id, new_role = "id") %>% 
  step_other(job, threshold = 0.03) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_downsample(default)
## Warning: `step_downsample()` was deprecated in recipes 0.1.13.
## Please use `themis::step_downsample()` instead.
bank_prep <- prep(bank_rec)

juiced <- juice(bank_prep)

This is what a recipe looks like. It’s just a preprocessing specification:

bank_rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##         id          1
##    outcome          1
##  predictor          7
## 
## Operations:
## 
## Collapsing factor levels for job
## Dummy variables from all_nominal(), -all_outcomes()
## Down-sampling based on default

A recipe then needs to be prep’d, which trains the steps on the training data and gives us a summary:

bank_prep
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##         id          1
##    outcome          1
##  predictor          7
## 
## Training data contained 30631 data points and no missing data.
## 
## Operations:
## 
## Collapsing factor levels for job [trained]
## Dummy variables from education, housing, job, loan, marital [trained]
## Down-sampling based on default [trained]

Then we can juice the prepped recipe, which gives us the product of our instructions:

juiced
## # A tibble: 1,168 x 18
##       id   age balance default education_secon~ education_terti~ housing_yes
##    <dbl> <dbl>   <dbl> <fct>              <dbl>            <dbl>       <dbl>
##  1  2023    31     606 no                     0                1           1
##  2 42980    35    7050 no                     0                1           0
##  3 28143    43     373 no                     1                0           0
##  4 14981    37     764 no                     1                0           1
##  5 21943    45     638 no                     1                0           1
##  6 13692    42     199 no                     0                1           1
##  7 10221    37    -119 no                     0                1           1
##  8   441    34     242 no                     0                1           1
##  9  2909    42    2535 no                     1                0           1
## 10  8922    33     934 no                     0                0           1
## # ... with 1,158 more rows, and 11 more variables: job_blue.collar <dbl>,
## #   job_entrepreneur <dbl>, job_management <dbl>, job_retired <dbl>,
## #   job_self.employed <dbl>, job_services <dbl>, job_technician <dbl>,
## #   job_other <dbl>, loan_yes <dbl>, marital_married <dbl>,
## #   marital_single <dbl>
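
As an aside: juice() gives us the preprocessed training data, while bake() is the counterpart for applying the same trained steps to new data, such as the test set (a quick sketch; output not shown):

# apply the trained preprocessing steps to data the recipe hasn't seen yet
bake(bank_prep, new_data = bank_test)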

2. Algorithm Specs

Now we have to specify the settings of our random forest. tune_spec is literally just the specification of the random forest we’re going to apply. tune() means we’re going to tune that hyperparameter, so we don’t pin it down yet, the way we do with 100 for trees. (I would have used more than 100 trees, but it took forever, so 100 it is.) We’re tuning two hyperparameters here:

  • mtry: Number of features to be used to split a node
  • min_n: Minimum number of items in a node that qualifies a split

And then the two final lines say we’re solving a classification problem with a random forest, using the ranger package as the engine.

tune_spec <- rand_forest(
  mtry = tune(),
  trees = 100,
  min_n = tune()
) %>%
  set_mode("classification") %>%
  set_engine("ranger")

3. Workflow

A workflow puts everything together.

tune_wf <- workflow() %>% 
  add_recipe(bank_rec) %>% 
  add_model(tune_spec)

tune_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: rand_forest()
## 
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
## 
## * step_other()
## * step_dummy()
## * step_downsample()
## 
## -- Model -----------------------------------------------------------------------
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 100
##   min_n = tune()
## 
## Computational engine: ranger

4. Cross Validation and Grid Search Tuning

Now we make a 10-fold cross-validation set from the training data.

set.seed(234)
bank_folds <- vfold_cv(bank_train)

bank_folds
## #  10-fold cross-validation 
## # A tibble: 10 x 2
##    splits               id    
##    <list>               <chr> 
##  1 <split [27.6K/3.1K]> Fold01
##  2 <split [27.6K/3.1K]> Fold02
##  3 <split [27.6K/3.1K]> Fold03
##  4 <split [27.6K/3.1K]> Fold04
##  5 <split [27.6K/3.1K]> Fold05
##  6 <split [27.6K/3.1K]> Fold06
##  7 <split [27.6K/3.1K]> Fold07
##  8 <split [27.6K/3.1K]> Fold08
##  9 <split [27.6K/3.1K]> Fold09
## 10 <split [27.6K/3.1K]> Fold10

First Tuning

Now that we’ve established all the previous steps, let’s tune some hyperparameters.

We turn on parallel processing to make things a bit faster and use tune_grid(), which takes our workflow, the CV folds, and a grid of 20 random hyperparameter combinations.

doParallel::registerDoParallel()

set.seed(345)
tune_res <- tune_grid(
  tune_wf,
  resamples = bank_folds,
  grid = 20
)

The result is a dataframe containing each data split we’re using to train, and the metrics.

tune_res
## # Tuning results
## # 10-fold cross-validation 
## # A tibble: 10 x 4
##    splits               id     .metrics          .notes          
##    <list>               <chr>  <list>            <list>          
##  1 <split [27.6K/3.1K]> Fold01 <tibble [40 x 6]> <tibble [0 x 1]>
##  2 <split [27.6K/3.1K]> Fold02 <tibble [40 x 6]> <tibble [0 x 1]>
##  3 <split [27.6K/3.1K]> Fold03 <tibble [40 x 6]> <tibble [0 x 1]>
##  4 <split [27.6K/3.1K]> Fold04 <tibble [40 x 6]> <tibble [0 x 1]>
##  5 <split [27.6K/3.1K]> Fold05 <tibble [40 x 6]> <tibble [0 x 1]>
##  6 <split [27.6K/3.1K]> Fold06 <tibble [40 x 6]> <tibble [0 x 1]>
##  7 <split [27.6K/3.1K]> Fold07 <tibble [40 x 6]> <tibble [0 x 1]>
##  8 <split [27.6K/3.1K]> Fold08 <tibble [40 x 6]> <tibble [0 x 1]>
##  9 <split [27.6K/3.1K]> Fold09 <tibble [40 x 6]> <tibble [0 x 1]>
## 10 <split [27.6K/3.1K]> Fold10 <tibble [40 x 6]> <tibble [0 x 1]>

And we can pull the metrics like so:

tune_res %>% 
  collect_metrics()
## # A tibble: 40 x 8
##     mtry min_n .metric  .estimator  mean     n std_err .config              
##    <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
##  1     3    35 accuracy binary     0.758    10 0.00397 Preprocessor1_Model01
##  2     3    35 roc_auc  binary     0.866    10 0.00696 Preprocessor1_Model01
##  3    12    19 accuracy binary     0.766    10 0.00526 Preprocessor1_Model02
##  4    12    19 roc_auc  binary     0.869    10 0.00413 Preprocessor1_Model02
##  5    11    37 accuracy binary     0.765    10 0.00496 Preprocessor1_Model03
##  6    11    37 roc_auc  binary     0.871    10 0.00514 Preprocessor1_Model03
##  7    13    20 accuracy binary     0.767    10 0.00466 Preprocessor1_Model04
##  8    13    20 roc_auc  binary     0.869    10 0.00445 Preprocessor1_Model04
##  9     6    32 accuracy binary     0.763    10 0.00425 Preprocessor1_Model05
## 10     6    32 roc_auc  binary     0.870    10 0.00582 Preprocessor1_Model05
## # ... with 30 more rows

Let’s see how roc_auc varies across the hyperparameter values we tried.

tune_res %>% 
  collect_metrics() %>% 
  filter(.metric == "roc_auc") %>% 
  pivot_longer(mtry:min_n) %>% 
  ggplot(aes(value, mean, color = name)) +
  geom_point() + 
  geom_line() +
  facet_wrap(~name)

It seems like roc_auc is best when mtry is somewhere around 5 to 15 and min_n somewhere around 25 to 35. Now that we know a little better, let’s try a bunch of combinations in those ranges to train our model!

Second Tuning

We need a new “grid”. We’re going to give it the ranges we want to test, plus levels, which controls how many values to try for each hyperparameter (trying too many would take too long!).

rf_grid <- grid_regular(
  mtry(range = c(5, 15)),
  min_n(range = c(25, 35)),
  levels = 10
)

rf_grid
## # A tibble: 100 x 2
##     mtry min_n
##    <int> <int>
##  1     5    25
##  2     6    25
##  3     7    25
##  4     8    25
##  5     9    25
##  6    10    25
##  7    11    25
##  8    12    25
##  9    13    25
## 10    15    25
## # ... with 90 more rows

Same exact tuning as the first one, but with a different grid.

set.seed(456)
regular_res <- tune_grid(
  tune_wf,
  resamples = bank_folds,
  grid = rf_grid
)

regular_res
## # Tuning results
## # 10-fold cross-validation 
## # A tibble: 10 x 4
##    splits               id     .metrics           .notes          
##    <list>               <chr>  <list>             <list>          
##  1 <split [27.6K/3.1K]> Fold01 <tibble [200 x 6]> <tibble [0 x 1]>
##  2 <split [27.6K/3.1K]> Fold02 <tibble [200 x 6]> <tibble [0 x 1]>
##  3 <split [27.6K/3.1K]> Fold03 <tibble [200 x 6]> <tibble [0 x 1]>
##  4 <split [27.6K/3.1K]> Fold04 <tibble [200 x 6]> <tibble [0 x 1]>
##  5 <split [27.6K/3.1K]> Fold05 <tibble [200 x 6]> <tibble [0 x 1]>
##  6 <split [27.6K/3.1K]> Fold06 <tibble [200 x 6]> <tibble [0 x 1]>
##  7 <split [27.6K/3.1K]> Fold07 <tibble [200 x 6]> <tibble [0 x 1]>
##  8 <split [27.6K/3.1K]> Fold08 <tibble [200 x 6]> <tibble [0 x 1]>
##  9 <split [27.6K/3.1K]> Fold09 <tibble [200 x 6]> <tibble [0 x 1]>
## 10 <split [27.6K/3.1K]> Fold10 <tibble [200 x 6]> <tibble [0 x 1]>

Let’s take a look at our results. This graph shows the roc_auc of models with the specified hyperparameters, where each line represents a min_n value and the x-axis represents mtry.

regular_res %>% 
  collect_metrics() %>% 
  filter(.metric == "roc_auc") %>% 
  mutate(min_n = factor(min_n)) %>% 
  select(mtry, min_n, mean) %>% 
  ggplot(aes(mtry, mean, color = min_n)) +
  geom_point() +
  geom_line(alpha = 0.5, size = 1.5)

Seems like min_n of 35 at an mtry of 8 did the best. But let’s not guess; there is a function for that!

best_auc <- select_best(regular_res, "roc_auc")

best_auc
## # A tibble: 1 x 3
##    mtry min_n .config               
##   <int> <int> <chr>                 
## 1     8    33 Preprocessor1_Model084

8 & 33 it is! Let’s finalize that.

final_rf <- finalize_model(
  tune_spec,
  best_auc
)

final_rf
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 8
##   trees = 100
##   min_n = 33
## 
## Computational engine: ranger

Test model on test set

The vip package lets us see how much each variable mattered to the model. Since we set importance = "permutation", importance is measured by how much the model’s performance drops when a variable’s values are randomly shuffled.

library(vip)
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
final_rf %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(default ~ .,
    data = juice(bank_prep) %>% select(-id)
  ) %>%
  vip(geom = "point")

Obviously, your bank balance and whether you have loans or not will be good predictors of credit default. All the others seem meh, though. That’s probably a problem: the model might become useless if new data comes along where the relationship between bank balance and credit default looks different. But regardless, we’ll go forward.

We build the same workflow as before, but with our finalized model, and then use last_fit(), which fits it on the training set and evaluates it on the testing set in one go.

final_wf <- workflow() %>%
  add_recipe(bank_rec) %>%
  add_model(final_rf)

final_res <- final_wf %>%
  last_fit(bank_split)

final_res
## # Resampling results
## # Manual resampling 
## # A tibble: 1 x 6
##   splits         id          .metrics     .notes      .predictions     .workflow
##   <list>         <chr>       <list>       <list>      <list>           <list>   
## 1 <split [30.6K~ train/test~ <tibble [2 ~ <tibble [0~ <tibble [10,210~ <workflo~

76% accuracy. I think that’s pretty good.

final_res %>% 
  collect_metrics()
## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.756 Preprocessor1_Model1
## 2 roc_auc  binary         0.865 Preprocessor1_Model1
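
Since roc_auc is the metric we tuned on, it can also be nice to look at the full ROC curve from the test-set predictions. A quick sketch (plot not shown), using the probability column for "no", the first factor level, which yardstick treats as the event by default:

final_res %>% 
  collect_predictions() %>% 
  roc_curve(default, .pred_no) %>% 
  autoplot()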

One last thing to look at is the confusion matrix.

final_res %>%  
  collect_predictions() %>% 
  conf_mat(default, .pred_class)
##           Truth
## Prediction   no  yes
##        no  7568   27
##        yes 2463  152

Seems like it did a good job catching the people who actually defaulted on their loan (only 27 missed), but it also predicted “yes” for a lot of people who never defaulted (lots of false positives).
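
To put numbers on that trade-off, yardstick’s sens() and spec() can be pointed at the “yes” class (a sketch; event_level = "second" treats "yes", the second factor level, as the event of interest):

preds <- collect_predictions(final_res)

# sensitivity: share of actual defaulters the model catches
sens(preds, truth = default, estimate = .pred_class, event_level = "second")
# specificity: share of non-defaulters the model correctly clears
spec(preds, truth = default, estimate = .pred_class, event_level = "second")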