Tidymodeling

Part II

Tentative outline for part 2

  • Feature engineering with recipes
  • Model optimization by tuning
    • Grid search
    • Racing
    • Iterative methods
  • Extras (time permitting)
    • Effect encodings
    • Case studies

Let’s install some packages

# Packages needed
pkgs <- 
  c("bonsai", "Cubist", "doParallel", "earth", "embed", "finetune", 
    "forested", "lightgbm", "lme4", "parallelly", "plumber", "probably", 
    "ranger", "rpart", "rpart.plot", "rules", "splines2", "stacks", 
    "text2vec", "textrecipes", "tidymodels", "vetiver")

install.packages(pkgs)

Also, you should make sure you have a recent version of the dials package (version 1.3.0 or later). To check this, you can run:

rlang::check_installed("dials", version = "1.3.0")

My versions

R version 4.5.0 (2025-04-11), Quarto (1.7.29)

package version
bonsai 0.3.2
broom 1.0.8
Cubist 0.5.0
dials 1.4.0
doParallel 1.0.17
dplyr 1.1.4
earth 5.3.4
embed 1.1.5
finetune 1.2.0
Formula 1.2-5
ggplot2 3.5.2
lattice 0.22-6
lightgbm 4.6.0
lme4 1.1-37
modeldata 1.4.0
parallelly 1.43.0
parsnip 1.3.1
plotmo 3.6.4
plotrix 3.8-4
plumber 1.3.0
probably 1.0.3
purrr 1.0.4
recipes 1.3.0
rsample 1.3.0
rules 1.0.2
scales 1.4.0
splines2 0.5.4
stacks 1.1.0
text2vec 0.6.4
textrecipes 1.1.0
tibble 3.2.1
tidymodels 1.3.0
tidyr 1.3.1
tune 1.3.0
vetiver 0.2.5
workflows 1.2.0
workflowsets 1.1.0
yardstick 1.3.2

Open Worksheet 1 πŸ“„

Hotel Data

We’ll use data on hotels to predict the cost of a room.

The data are in the modeldata package. We’ll sample down the data and refactor some columns:

library(tidymodels)

# extra settings: 
tidymodels_prefer()
theme_set(theme_bw())
options(
  pillar.advice = FALSE, 
  pillar.min_title_chars = Inf
)
data(hotel_rates)
set.seed(295)
hotel_rates <- 
  hotel_rates %>% 
  sample_n(5000) %>% 
  arrange(arrival_date) %>% 
  select(-arrival_date) %>% 
  mutate(
    company = factor(as.character(company)),
    country = factor(as.character(country)),
    agent = factor(as.character(agent))
  )

Hotel data columns

names(hotel_rates)
#>  [1] "avg_price_per_room"             "lead_time"                     
#>  [3] "stays_in_weekend_nights"        "stays_in_week_nights"          
#>  [5] "adults"                         "children"                      
#>  [7] "babies"                         "meal"                          
#>  [9] "country"                        "market_segment"                
#> [11] "distribution_channel"           "is_repeated_guest"             
#> [13] "previous_cancellations"         "previous_bookings_not_canceled"
#> [15] "reserved_room_type"             "assigned_room_type"            
#> [17] "booking_changes"                "agent"                         
#> [19] "company"                        "days_in_waiting_list"          
#> [21] "customer_type"                  "required_car_parking_spaces"   
#> [23] "total_of_special_requests"      "arrival_date_num"              
#> [25] "near_christmas"                 "near_new_years"                
#> [27] "historical_adr"

Data splitting strategy

Data Spending

Let’s split the data into a training set (75%) and testing set (25%) using stratification:

set.seed(4028)
hotel_split <- initial_split(hotel_rates, strata = avg_price_per_room)

hotel_train <- training(hotel_split)
hotel_test <- testing(hotel_split)

Your turn

Let’s take some time and investigate the training data. The outcome is avg_price_per_room.

Are there any interesting characteristics of the data?

10:00

Feature Engineering

Working with our predictors

We might want to modify our predictor columns for a few reasons:

  • The model requires them in a different format (e.g. dummy variables for linear regression).
  • The model needs certain data qualities (e.g. same units for K-NN).
  • The outcome is better predicted when one or more columns are transformed in some way (a.k.a β€œfeature engineering”).

The first two reasons are fairly predictable (next page).

The last one depends on your modeling problem.

What is feature engineering?

Think of a feature as some representation of a predictor that will be used in a model.

Example representations:

  • Interactions
  • Polynomial expansions/splines
  • Principal component analysis (PCA) feature extraction

There are a lot of examples in Feature Engineering and Selection (FES) and Applied Machine Learning for Tabular Data (aml4td).

Example: Dates

How can we represent date columns for our model?

When we use a date column in its native format, most models in R convert it to an integer.

We can re-engineer it as any of the following (a recipe sketch follows the list):

  • Days since a reference date
  • Day of the week
  • Month
  • Year
  • Indicators for holidays
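
For illustration, here is a minimal recipe sketch of those ideas, assuming the raw arrival_date column had been kept in hotel_train (our setup code drops it). step_date() and step_holiday() are from recipes; the holiday names come from the timeDate package.

date_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  # days since a fixed reference date
  step_mutate(days_since = as.numeric(arrival_date - as.Date("2016-01-01"))) %>%
  # day of the week, month, and year as separate columns
  step_date(arrival_date, features = c("dow", "month", "year")) %>%
  # binary indicators for selected holidays
  step_holiday(arrival_date, holidays = c("ChristmasDay", "NewYearsDay")) %>%
  # drop the raw date once the features are made
  step_rm(arrival_date)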

General definitions

  • Data preprocessing steps allow your model to fit.

  • Feature engineering steps help the model do the least work to predict the outcome as well as possible.

The recipes package can handle both!

Previously - Setup

library(tidymodels)

# Add another package:
library(textrecipes)

# extra settings: 
tidymodels_prefer()
theme_set(theme_bw())
options(
  pillar.advice = FALSE, 
  pillar.min_title_chars = Inf
)
data(hotel_rates)
set.seed(295)
hotel_rates <- 
  hotel_rates %>% 
  sample_n(5000) %>% 
  arrange(arrival_date) %>% 
  select(-arrival_date) %>% 
  mutate(
    company = factor(as.character(company)),
    country = factor(as.character(country)),
    agent = factor(as.character(agent))
  )

Previously - Data Usage

set.seed(4028)
hotel_split <- initial_split(hotel_rates, strata = avg_price_per_room)

hotel_train <- training(hotel_split)
hotel_test <- testing(hotel_split)


We’ll go from here and create a set of resamples to use for model assessments.

Resampling Strategy

We’ll use simple 10-fold cross-validation (stratified sampling):

set.seed(472)
hotel_rs <- vfold_cv(hotel_train, strata = avg_price_per_room)
hotel_rs
#> #  10-fold cross-validation using stratification 
#> # A tibble: 10 Γ— 2
#>    splits             id    
#>    <list>             <chr> 
#>  1 <split [3372/377]> Fold01
#>  2 <split [3373/376]> Fold02
#>  3 <split [3373/376]> Fold03
#>  4 <split [3373/376]> Fold04
#>  5 <split [3373/376]> Fold05
#>  6 <split [3374/375]> Fold06
#>  7 <split [3375/374]> Fold07
#>  8 <split [3376/373]> Fold08
#>  9 <split [3376/373]> Fold09
#> 10 <split [3376/373]> Fold10

Prepare your data for modeling

  • The recipes package is an extensible framework for pipeable sequences of preprocessing and feature engineering steps.
  • Statistical parameters for the steps can be estimated from an initial data set and then applied to other data sets.
  • The resulting processed output can be used as inputs for statistical or machine learning models.

A first recipe

hotel_rec <- 
  recipe(avg_price_per_room ~ ., data = hotel_train)
  • The recipe() function assigns columns to roles of β€œoutcome” or β€œpredictor” using the formula

A first recipe

summary(hotel_rec)
#> # A tibble: 27 Γ— 4
#>    variable                type      role      source  
#>    <chr>                   <list>    <chr>     <chr>   
#>  1 lead_time               <chr [2]> predictor original
#>  2 stays_in_weekend_nights <chr [2]> predictor original
#>  3 stays_in_week_nights    <chr [2]> predictor original
#>  4 adults                  <chr [2]> predictor original
#>  5 children                <chr [2]> predictor original
#>  6 babies                  <chr [2]> predictor original
#>  7 meal                    <chr [3]> predictor original
#>  8 country                 <chr [3]> predictor original
#>  9 market_segment          <chr [3]> predictor original
#> 10 distribution_channel    <chr [3]> predictor original
#> # β„Ή 17 more rows

The type column contains information on the variables

Your turn

What do you think are in the type vectors for the lead_time and country columns?

03:00

Create indicator variables

hotel_rec <- 
  recipe(avg_price_per_room ~ ., data = hotel_train) %>% 
  step_dummy(all_nominal_predictors())
  • For any factor or character predictors, make binary indicators.

  • There are many recipe steps that can convert categorical predictors to numeric columns.

  • step_dummy() records the levels of the categorical predictors in the training set.

Filter out constant columns

hotel_rec <- 
  recipe(avg_price_per_room ~ ., data = hotel_train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors())

In case there is a factor level that was never observed in the training data (resulting in a column of all 0s), we can delete any zero-variance predictors that have a single unique value.

Normalization

hotel_rec <- 
  recipe(avg_price_per_room ~ ., data = hotel_train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors())
  • This centers and scales the numeric predictors.

  • The recipe will use the training set to estimate the means and standard deviations of the data.

  • All data the recipe is applied to will be normalized using those statistics (there is no re-estimation); the short sketch below illustrates this.
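
A quick sketch of that estimate-once, apply-everywhere behavior (prep() and bake() are covered later, in "Debugging a recipe"):

norm_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  step_normalize(lead_time) %>%
  prep()

# The stored mean and standard deviation come from the training set only
tidy(norm_rec, number = 1)

# New data is scaled with those same statistics (a tiny peek, just to illustrate)
bake(norm_rec, new_data = hotel_test %>% slice(1:3)) %>% select(lead_time)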

Reduce correlation

hotel_rec <- 
  recipe(avg_price_per_room ~ ., data = hotel_train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_corr(all_numeric_predictors(), threshold = 0.9)

To deal with highly correlated predictors, find the minimum set of predictor columns that make the pairwise correlations less than the threshold.

Other possible steps

hotel_rec <- 
  recipe(avg_price_per_room ~ ., data = hotel_train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_pca(all_numeric_predictors())

PCA feature extraction…

Other possible steps

hotel_rec <- 
  recipe(avg_price_per_room ~ ., data = hotel_train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  embed::step_umap(all_numeric_predictors(), outcome = vars(avg_price_per_room))

A fancy machine learning supervised dimension reduction technique called UMAP…

Other possible steps

hotel_rec <- 
  recipe(avg_price_per_room ~ ., data = hotel_train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_spline_natural(arrival_date_num, deg_free = 10)

Nonlinear transforms like natural splines, and so on!

Your turn

Create a recipe() for the hotel data to:

  • use a Yeo-Johnson (YJ) transformation on lead_time
  • convert factors to indicator variables
  • remove zero-variance variables
  • add the spline technique shown above
10:00

Minimal recipe

hotel_indicators <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>% 
  step_YeoJohnson(lead_time) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>% 
  step_spline_natural(arrival_date_num, deg_free = 10)

Measuring Performance

We’ll compute two measures: mean absolute error and the coefficient of determination (a.k.a \(R^2\)).

\[\begin{aligned} MAE &= \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i| \\ R^2 &= \operatorname{cor}(y_i, \hat{y}_i)^2 \end{aligned}\]

The focus will be on MAE for parameter optimization. We’ll use a metric set to compute these:

reg_metrics <- metric_set(mae, rsq)
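
A metric set is itself a function: give it a data frame of predictions and it returns one row per metric. A toy example with made-up numbers, just to show the calling convention:

toy_preds <- tibble(
  price = c(100, 150, 200, 250),   # made-up observed values
  .pred = c(110, 140, 210, 240)    # made-up predictions
)

reg_metrics(toy_preds, truth = price, estimate = .pred)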

Using a workflow

set.seed(9)

hotel_lm_wflow <-
  workflow() %>%
  add_recipe(hotel_indicators) %>%
  add_model(linear_reg())
 
ctrl <- control_resamples(save_pred = TRUE)
hotel_lm_res <-
  hotel_lm_wflow %>%
  fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)

collect_metrics(hotel_lm_res)
#> # A tibble: 2 Γ— 6
#>   .metric .estimator   mean     n std_err .config             
#>   <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
#> 1 mae     standard   16.6      10 0.214   Preprocessor1_Model1
#> 2 rsq     standard    0.884    10 0.00339 Preprocessor1_Model1

Your turn

Use fit_resamples() to fit your workflow with a recipe.

Collect the predictions from the results.

05:00

Holdout predictions

# Since we used `save_pred = TRUE`
lm_cv_pred <- collect_predictions(hotel_lm_res)
lm_cv_pred %>% print(n = 7)
#> # A tibble: 3,749 Γ— 5
#>   .pred id      .row avg_price_per_room .config             
#>   <dbl> <chr>  <int>              <dbl> <chr>               
#> 1  75.1 Fold01    20                 40 Preprocessor1_Model1
#> 2  49.3 Fold01    28                 54 Preprocessor1_Model1
#> 3  64.9 Fold01    45                 50 Preprocessor1_Model1
#> 4  52.8 Fold01    49                 42 Preprocessor1_Model1
#> 5  48.6 Fold01    61                 49 Preprocessor1_Model1
#> 6  29.8 Fold01    66                 40 Preprocessor1_Model1
#> 7  36.9 Fold01    88                 49 Preprocessor1_Model1
#> # β„Ή 3,742 more rows

Calibration Plot

library(probably)

cal_plot_regression(hotel_lm_res)

What do we do with the agent and company data?

There are 98 unique agent values and 100 unique companies in our training set. How can we include this information in our model?

We could:

  • make the full set of indicator variables 😳

  • lump agents and companies that rarely occur into an β€œother” group

  • use feature hashing to create a smaller set of indicator variables

  • use effect encoding to replace the agent and company columns with the estimated effect of that predictor (in the extra materials; a sketch follows below)
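
Although the details are left for the extra materials, a minimal sketch of the effect-encoding option could use the embed package (step_lencode_mixed() fits a partially pooled model via lme4):

library(embed)

hotel_effect_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  step_YeoJohnson(lead_time) %>%
  # replace each agent/company level with its estimated effect on the outcome
  step_lencode_mixed(agent, company, outcome = vars(avg_price_per_room)) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())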

Per-agent statistics

Collapsing factor levels

There is a recipe step that will redefine factor levels based on their frequency in the training set:

hotel_other_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>% 
  step_YeoJohnson(lead_time) %>%
  step_other(agent, threshold = 0.001) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>% 
  step_spline_natural(arrival_date_num, deg_free = 10)

Using this code, 34 agents (out of 98) were collapsed into β€œother” based on the training set.

We could try to optimize the threshold for collapsing (we will get to model tuning later).

Does othering help?

hotel_other_wflow <-
  hotel_lm_wflow %>%
  update_recipe(hotel_other_rec)

hotel_other_res <-
  hotel_other_wflow %>%
  fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)

collect_metrics(hotel_other_res)
#> # A tibble: 2 Γ— 6
#>   .metric .estimator   mean     n std_err .config             
#>   <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
#> 1 mae     standard   16.7      10 0.213   Preprocessor1_Model1
#> 2 rsq     standard    0.884    10 0.00341 Preprocessor1_Model1

About the same MAE and much faster to complete.

Now let’s look at a more sophisticated tool called feature hashing.

Feature Hashing

Between agent and company, simple dummy variables would create 198 new columns (that are mostly zeros).

Another option is to have a binary indicator that combines some levels of these variables.

Feature hashing (for more see FES, SMLTAR, TMwR, and aml4td):

  • uses the character values of the levels
  • converts them to integer hash values
  • uses the integers to assign them to a specific indicator column.

Feature Hashing

Suppose we want to use 32 indicator variables for agent.

For an agent with the value β€œMax_Kuhn”, a hashing function converts it to an integer (say 210397726).

To assign it to one of the 32 columns, we use modular arithmetic:

# For "Max_Kuhn" put a '1' in column: 
210397726 %% 32
#> [1] 30

Hash functions are meant to emulate randomness.

Feature Hashing Pros

  • The procedure will automatically work on new values of the predictors.
  • It is fast.
  • β€œSigned” hashes add a sign to help avoid aliasing.

Feature Hashing Cons

  • There is no real logic behind which factor levels are combined.
  • We don’t know how many columns to add (more in the next section).
  • Some columns may have all zeros.
  • If an indicator column is important to the model, we can’t easily determine why.

Feature Hashing in recipes

The textrecipes package has a step that can be added to the recipe:

library(textrecipes)

hash_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  step_YeoJohnson(lead_time) %>%
  # Defaults to 32 signed indicator columns
  step_dummy_hash(agent) %>%
  step_dummy_hash(company) %>%
  # Regular indicators for the others
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_spline_natural(arrival_date_num, deg_free = 10)

hotel_hash_wflow <-
  hotel_lm_wflow %>%
  update_recipe(hash_rec)

Feature Hashing in recipes

hotel_hash_res <-
  hotel_hash_wflow %>%
  fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)

collect_metrics(hotel_hash_res)
#> # A tibble: 2 Γ— 6
#>   .metric .estimator   mean     n std_err .config             
#>   <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
#> 1 mae     standard   16.7      10 0.239   Preprocessor1_Model1
#> 2 rsq     standard    0.884    10 0.00324 Preprocessor1_Model1

About the same performance but now we can handle new values.

Debugging a recipe

  • Typically, you will want to use a workflow to estimate and apply a recipe.
  • If you have an error and need to debug your recipe, the original recipe object (e.g. hash_rec) can be estimated manually with a function called prep(). It is analogous to fit(). See TMwR Section 16.4.
  • Another function (bake()) is analogous to predict(), and gives you the processed data back.
  • The tidy() function can be used to get specific results from the recipe.

Example

hash_rec_fit <- prep(hash_rec)

# Get the transformation coefficient
tidy(hash_rec_fit, number = 1)

# Get the processed data
bake(hash_rec_fit, hotel_train %>% slice(1:3), contains("_agent_"))

More on recipes

  • Once fit() is called on a workflow, changing the model does not re-fit the recipe.
  • Some steps can be skipped when using predict().
  • The order of the steps matters.

Open Worksheet 2 πŸ“„

Previously - Setup

library(tidymodels)
library(textrecipes)
library(bonsai)

# extra settings: 
tidymodels_prefer()
theme_set(theme_bw())
options(
  pillar.advice = FALSE, 
  pillar.min_title_chars = Inf
)

reg_metrics <- metric_set(mae, rsq)
data(hotel_rates)
set.seed(295)
hotel_rates <- 
  hotel_rates %>% 
  sample_n(5000) %>% 
  arrange(arrival_date) %>% 
  select(-arrival_date) %>% 
  mutate(
    company = factor(as.character(company)),
    country = factor(as.character(country)),
    agent = factor(as.character(agent))
  )

Previously - Data Usage

set.seed(4028)
hotel_split <- initial_split(hotel_rates, strata = avg_price_per_room)

hotel_train <- training(hotel_split)
hotel_test <- testing(hotel_split)

set.seed(472)
hotel_rs <- vfold_cv(hotel_train, strata = avg_price_per_room)

Previously - Feature engineering

library(textrecipes)

hash_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  step_YeoJohnson(lead_time) %>%
  # Defaults to 32 signed indicator columns
  step_dummy_hash(agent) %>%
  step_dummy_hash(company) %>%
  # Regular indicators for the others
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors())

Optimizing Models via Tuning Parameters

Tuning parameters

Some model or preprocessing parameters cannot be estimated directly from the data.

Some examples:

  • Tree depth in decision trees
  • Number of neighbors in a K-nearest neighbor model

Activation function in neural networks?

Sigmoidal functions, ReLU, etc.

Yes, it is a tuning parameter. βœ…

Number of feature hashing columns to generate?

Yes, it is a tuning parameter. βœ…

Bayesian priors for model parameters?

Hmmmm, probably not. These are based on prior belief. ❌

The random seed?

Nope. It is not. ❌

Optimize tuning parameters

  • Try different values and measure their performance.
  • Find good values for these parameters.
  • Once the value(s) of the parameter(s) are determined, a model can be finalized by fitting the model to the entire training set.

Tagging parameters for tuning

With tidymodels, you can mark the parameters that you want to optimize with a value of tune().


The function itself just returns… itself:

tune()
#> tune()
str(tune())
#>  language tune()

# optionally add a label
tune("I hope that this is going well")
#> tune("I hope that this is going well")

For example…

Optimizing the hash features

Our new recipe is:

hash_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  step_YeoJohnson(lead_time) %>%
  step_dummy_hash(agent,   num_terms = tune("agent hash")) %>%
  step_dummy_hash(company, num_terms = tune("company hash")) %>%
  step_zv(all_predictors())


We will be using a tree-based model in a minute.

  • The other categorical predictors are left as-is.
  • That’s why there is no step_dummy().

Boosted Trees

These are popular ensemble methods that build a sequence of tree models.


Each tree uses the results of the previous tree to better predict samples, especially those that have been poorly predicted.


Each tree in the ensemble is saved, and new samples are predicted using a weighted combination of each tree’s predictions.


We’ll focus on the popular lightgbm implementation.

Boosted Tree Tuning Parameters

Some possible parameters:

  • mtry: The number of predictors randomly sampled at each split (in \([1, ncol(x)]\) or \((0, 1]\)).
  • trees: The number of trees (\([1, \infty]\), but usually up to thousands)
  • min_n: The number of samples needed to further split (\([1, n]\)).
  • learn_rate: The rate that each tree adapts from previous iterations (\((0, \infty]\), usual maximum is 0.1).
  • stop_iter: The number of iterations of boosting where no improvement was shown before stopping (\([1, trees]\))

Boosted Tree Tuning Parameters

TBH it is usually not difficult to optimize these models.


Often, there are multiple candidate tuning parameter combinations that have very good results.


To demonstrate simple concepts, we’ll look at optimizing the number of trees in the ensemble (between 1 and 100) and the learning rate (\(10^{-5}\) to \(10^{-1}\)).

Boosted Tree Tuning Parameters

We’ll need to load the bonsai package. This has the information needed to use lightgbm as a model engine.

library(bonsai)
lgbm_spec <- 
  boost_tree(trees = tune(), learn_rate = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("lightgbm", num_threads = 1)

lgbm_wflow <- workflow(hash_rec, lgbm_spec)

Optimize tuning parameters

The main two strategies for optimization are:

  • Grid search πŸ’  which tests a pre-defined set of candidate values

  • Iterative search πŸŒ€ which suggests/estimates new values of candidate parameters to evaluate

Grid search

In reality we would probably sample the space more densely.

Grid Search

Parameters

  • The tidymodels framework provides pre-defined information on tuning parameters (such as their type, range, transformations, etc).

  • The extract_parameter_set_dials() function extracts these tuning parameters and the info.

Grids

  • Create your grid manually or automatically.

  • The grid_*() functions can make a grid.

Different types of grids

Space-filling designs (SFD) attempt to cover the parameter space without redundant candidates (recommended).

Create a grid

lgbm_wflow %>% 
  extract_parameter_set_dials()

# Individual functions: 
trees()
learn_rate()

A parameter set can be updated (e.g. to change the ranges).

Create a grid

set.seed(12)
grid <- 
  lgbm_wflow %>% 
  extract_parameter_set_dials() %>% 
  grid_space_filling(size = 25)

grid
#> # A tibble: 25 Γ— 4
#>    trees learn_rate `agent hash` `company hash`
#>    <int>      <dbl>        <int>          <int>
#>  1     1   7.50e- 6          574            574
#>  2    84   1.78e- 5         2048           2298
#>  3   167   5.62e-10         1824            912
#>  4   250   4.22e- 5         3250            512
#>  5   334   1.78e- 8          512           2896
#>  6   417   1.33e- 3          322           1625
#>  7   500   1   e- 1         1448           1149
#>  8   584   1   e- 7         1290            256
#>  9   667   2.37e-10          456            724
#> 10   750   1.78e- 2          645            322
#> # β„Ή 15 more rows

Your turn

Create a grid for our tunable workflow.

Try creating a regular grid.

10:00

Create a regular grid

set.seed(12)
grid <- 
  lgbm_wflow %>% 
  extract_parameter_set_dials() %>% 
  grid_regular(levels = 4)

grid
#> # A tibble: 256 Γ— 4
#>    trees   learn_rate `agent hash` `company hash`
#>    <int>        <dbl>        <int>          <int>
#>  1     1 0.0000000001          256            256
#>  2   667 0.0000000001          256            256
#>  3  1333 0.0000000001          256            256
#>  4  2000 0.0000000001          256            256
#>  5     1 0.0000001             256            256
#>  6   667 0.0000001             256            256
#>  7  1333 0.0000001             256            256
#>  8  2000 0.0000001             256            256
#>  9     1 0.0001                256            256
#> 10   667 0.0001                256            256
#> # β„Ή 246 more rows

Your turn


What advantage would a regular grid have?

Update parameter ranges

lgbm_param <- 
  lgbm_wflow %>% 
  extract_parameter_set_dials() %>% 
  update(trees = trees(c(1L, 100L)),
         learn_rate = learn_rate(c(-5, -1)))

set.seed(712)
grid <- 
  lgbm_param %>% 
  grid_space_filling(size = 25)

grid
#> # A tibble: 25 Γ— 4
#>    trees learn_rate `agent hash` `company hash`
#>    <int>      <dbl>        <int>          <int>
#>  1     1  0.00147            574            574
#>  2     5  0.00215           2048           2298
#>  3     9  0.0000215         1824            912
#>  4    13  0.00316           3250            512
#>  5    17  0.0001             512           2896
#>  6    21  0.0147             322           1625
#>  7    25  0.1               1448           1149
#>  8    29  0.000215          1290            256
#>  9    34  0.0000147          456            724
#> 10    38  0.0464             645            322
#> # β„Ή 15 more rows

The results

grid %>% 
  ggplot(aes(trees, learn_rate)) +
  geom_point(size = 4) +
  scale_y_log10()

Note that the learning rates are uniform on the log-10 scale and this shows 2 of 4 dimensions.

Use the tune_*() functions to tune models

Choosing tuning parameters

Let’s take our previous model and tune more parameters:

lgbm_spec <- 
  boost_tree(trees = tune(), learn_rate = tune(),  min_n = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("lightgbm", num_threads = 1)

lgbm_wflow <- workflow(hash_rec, lgbm_spec)

# Update the feature hash ranges (log-2 units)
lgbm_param <-
  lgbm_wflow %>%
  extract_parameter_set_dials() %>%
  update(`agent hash`   = num_hash(c(3, 8)),
         `company hash` = num_hash(c(3, 8)))

Grid Search

set.seed(9)
ctrl <- control_grid(save_pred = TRUE)

lgbm_res <-
  lgbm_wflow %>%
  tune_grid(
    resamples = hotel_rs,
    grid = 25,
    # The options below are not required by default
    param_info = lgbm_param, 
    control = ctrl,
    metrics = reg_metrics
  )

Grid Search

lgbm_res 
#> # Tuning results
#> # 10-fold cross-validation using stratification 
#> # A tibble: 10 Γ— 5
#>    splits             id     .metrics          .notes           .predictions        
#>    <list>             <chr>  <list>            <list>           <list>              
#>  1 <split [3372/377]> Fold01 <tibble [50 Γ— 9]> <tibble [0 Γ— 3]> <tibble [9,425 Γ— 9]>
#>  2 <split [3373/376]> Fold02 <tibble [50 Γ— 9]> <tibble [0 Γ— 3]> <tibble [9,400 Γ— 9]>
#>  3 <split [3373/376]> Fold03 <tibble [50 Γ— 9]> <tibble [0 Γ— 3]> <tibble [9,400 Γ— 9]>
#>  4 <split [3373/376]> Fold04 <tibble [50 Γ— 9]> <tibble [0 Γ— 3]> <tibble [9,400 Γ— 9]>
#>  5 <split [3373/376]> Fold05 <tibble [50 Γ— 9]> <tibble [0 Γ— 3]> <tibble [9,400 Γ— 9]>
#>  6 <split [3374/375]> Fold06 <tibble [50 Γ— 9]> <tibble [0 Γ— 3]> <tibble [9,375 Γ— 9]>
#>  7 <split [3375/374]> Fold07 <tibble [50 Γ— 9]> <tibble [0 Γ— 3]> <tibble [9,350 Γ— 9]>
#>  8 <split [3376/373]> Fold08 <tibble [50 Γ— 9]> <tibble [0 Γ— 3]> <tibble [9,325 Γ— 9]>
#>  9 <split [3376/373]> Fold09 <tibble [50 Γ— 9]> <tibble [0 Γ— 3]> <tibble [9,325 Γ— 9]>
#> 10 <split [3376/373]> Fold10 <tibble [50 Γ— 9]> <tibble [0 Γ— 3]> <tibble [9,325 Γ— 9]>

Grid results

autoplot(lgbm_res)

Tuning results

collect_metrics(lgbm_res)
#> # A tibble: 50 Γ— 11
#>    trees min_n    learn_rate `agent hash` `company hash` .metric .estimator   mean     n std_err .config              
#>    <int> <int>         <dbl>        <int>          <int> <chr>   <chr>       <dbl> <int>   <dbl> <chr>                
#>  1     1    17 0.00000000133           93             25 mae     standard   53.2      10 0.427   Preprocessor01_Model1
#>  2     1    17 0.00000000133           93             25 rsq     standard    0.810    10 0.00746 Preprocessor01_Model1
#>  3    84    27 0.0178                  80             69 mae     standard   20.0      10 0.293   Preprocessor02_Model1
#>  4    84    27 0.0178                  80             69 rsq     standard    0.897    10 0.00423 Preprocessor02_Model1
#>  5   167    19 0.00000133              12            143 mae     standard   53.2      10 0.426   Preprocessor03_Model1
#>  6   167    19 0.00000133              12            143 rsq     standard    0.811    10 0.00698 Preprocessor03_Model1
#>  7   250     2 0.000237                33             33 mae     standard   50.6      10 0.409   Preprocessor04_Model1
#>  8   250     2 0.000237                33             33 rsq     standard    0.815    10 0.00800 Preprocessor04_Model1
#>  9   334    36 0.0000000178            19             21 mae     standard   53.2      10 0.427   Preprocessor05_Model1
#> 10   334    36 0.0000000178            19             21 rsq     standard    0.807    10 0.00692 Preprocessor05_Model1
#> # β„Ή 40 more rows

Tuning results

collect_metrics(lgbm_res, summarize = FALSE)
#> # A tibble: 500 Γ— 10
#>    id     trees min_n    learn_rate `agent hash` `company hash` .metric .estimator .estimate .config              
#>    <chr>  <int> <int>         <dbl>        <int>          <int> <chr>   <chr>          <dbl> <chr>                
#>  1 Fold01     1    17 0.00000000133           93             25 mae     standard      51.8   Preprocessor01_Model1
#>  2 Fold01     1    17 0.00000000133           93             25 rsq     standard       0.821 Preprocessor01_Model1
#>  3 Fold02     1    17 0.00000000133           93             25 mae     standard      52.1   Preprocessor01_Model1
#>  4 Fold02     1    17 0.00000000133           93             25 rsq     standard       0.810 Preprocessor01_Model1
#>  5 Fold03     1    17 0.00000000133           93             25 mae     standard      52.2   Preprocessor01_Model1
#>  6 Fold03     1    17 0.00000000133           93             25 rsq     standard       0.783 Preprocessor01_Model1
#>  7 Fold04     1    17 0.00000000133           93             25 mae     standard      51.7   Preprocessor01_Model1
#>  8 Fold04     1    17 0.00000000133           93             25 rsq     standard       0.815 Preprocessor01_Model1
#>  9 Fold05     1    17 0.00000000133           93             25 mae     standard      55.2   Preprocessor01_Model1
#> 10 Fold05     1    17 0.00000000133           93             25 rsq     standard       0.847 Preprocessor01_Model1
#> # β„Ή 490 more rows

Choose a parameter combination

show_best(lgbm_res, metric = "rsq")
#> # A tibble: 5 Γ— 11
#>   trees min_n learn_rate `agent hash` `company hash` .metric .estimator  mean     n std_err .config              
#>   <int> <int>      <dbl>        <int>          <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
#> 1  1250     9    0.1              107             80 rsq     standard   0.947    10 0.00395 Preprocessor16_Model1
#> 2   917    16    0.0422             9             39 rsq     standard   0.946    10 0.00362 Preprocessor12_Model1
#> 3  1666    30    0.00750          124             19 rsq     standard   0.943    10 0.00309 Preprocessor21_Model1
#> 4  2000    24    0.00133           25            124 rsq     standard   0.919    10 0.00383 Preprocessor25_Model1
#> 5   500    25    0.00316           52              8 rsq     standard   0.899    10 0.00435 Preprocessor07_Model1

Choose a parameter combination

Create your own tibble for final parameters or use one of the tune::select_*() functions:

lgbm_best <- select_best(lgbm_res, metric = "mae")
lgbm_best
#> # A tibble: 1 Γ— 6
#>   trees min_n learn_rate `agent hash` `company hash` .config              
#>   <int> <int>      <dbl>        <int>          <int> <chr>                
#> 1  1250     9        0.1          107             80 Preprocessor16_Model1

Checking Calibration

library(probably)
lgbm_res %>%
  collect_predictions(
    parameters = lgbm_best
  ) %>%
  cal_plot_regression(
    truth = avg_price_per_room,
    estimate = .pred
  )

Running in parallel

  • Grid search, combined with resampling, requires fitting a lot of models!

  • These models don’t depend on one another and can be run in parallel.

We can use a parallel backend to do this:

cores <- parallelly::availableCores(logical = FALSE)
cl <- parallel::makePSOCKcluster(cores)
doParallel::registerDoParallel(cl)

# Now call `tune_grid()`!

# Shut it down with:
foreach::registerDoSEQ()
parallel::stopCluster(cl)

Running in parallel

Speed-ups are fairly linear up to the number of physical cores (10 here).

Early stopping for boosted trees

We have directly optimized the number of trees as a tuning parameter.

Instead we could

  • Set the number of trees to a single large number.
  • Stop adding trees when performance gets worse.

This is known as β€œearly stopping” and there is a parameter for that: stop_iter.

Early stopping has the potential to decrease the tuning time.

Your turn


Set trees = 2000 and tune the stop_iter parameter.

Note that you will need to regenerate lgbm_param with your new workflow!

10:00

Open Worksheet 3 πŸ“„

Previously - Setup

library(tidymodels)
library(textrecipes)
library(bonsai)

# extra settings: 
tidymodels_prefer()
theme_set(theme_bw())
options(
  pillar.advice = FALSE, 
  pillar.min_title_chars = Inf
)

reg_metrics <- metric_set(mae, rsq)
data(hotel_rates)
set.seed(295)
hotel_rates <- 
  hotel_rates %>% 
  sample_n(5000) %>% 
  arrange(arrival_date) %>% 
  select(-arrival_date) %>% 
  mutate(
    company = factor(as.character(company)),
    country = factor(as.character(country)),
    agent = factor(as.character(agent))
  )

Previously - Data Usage

set.seed(4028)
hotel_split <-
  initial_split(hotel_rates, strata = avg_price_per_room)

hotel_train <- training(hotel_split)
hotel_test <- testing(hotel_split)

set.seed(472)
hotel_rs <- vfold_cv(hotel_train, strata = avg_price_per_room)

Previously - Boosting Model

hotel_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  step_YeoJohnson(lead_time) %>%
  step_dummy_hash(agent,   num_terms = tune("agent hash")) %>%
  step_dummy_hash(company, num_terms = tune("company hash")) %>%
  step_zv(all_predictors())

lgbm_spec <- 
  boost_tree(trees = tune(), learn_rate = tune(), min_n = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("lightgbm", num_threads = 1)

lgbm_wflow <- workflow(hotel_rec, lgbm_spec)

lgbm_param <-
  lgbm_wflow %>%
  extract_parameter_set_dials() %>%
  update(`agent hash`   = num_hash(c(3, 8)),
         `company hash` = num_hash(c(3, 8)))

Making Grid Search More Efficient

In the last section, we evaluated 250 models (25 candidates times 10 resamples).

We can make this go faster using parallel processing.

Also, for some models, we can fit far fewer models than the number that are being evaluated.

  • For boosting, a model fit with X trees can often produce predictions for candidates with fewer than X trees.

Both of these methods can lead to enormous speed-ups.

Model Racing

Racing is an old tool that we can use to go even faster.

  1. Evaluate all of the candidate models but only for a few resamples.
  2. Determine which candidates have a low probability of being selected.
  3. Eliminate poor candidates.
  4. Repeat with the next resample (until no more resamples remain).

This can result in fitting a small number of models.

Discarding Candidates

How do we eliminate tuning parameter combinations?

There are a few methods to do so. We’ll use one based on analysis of variance (ANOVA).

However… there is typically a large difference between resamples in the results.

Resampling Results (Non-Racing)

Here are some realistic (but simulated) examples of two candidate models.

An error estimate is measured for each of 10 resamples.

  • The lines connect resamples.

There is usually a significant resample-to-resample effect (rank corr: 0.83).

Are Candidates Different?

One way to evaluate these models is to do a paired t-test

  • or a t-test on their differences matched by resamples

With \(n = 10\) resamples, the confidence interval for the difference in RMSE is (0.99, 2.8), indicating that candidate number 2 has smaller error.
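
A minimal sketch of that paired analysis, using made-up per-resample RMSE values (these are not the values behind the plot above):

resample_rmse <- tibble(
  candidate_1 = c(5.9, 6.3, 5.8, 6.6, 5.5, 6.7, 6.0, 6.2, 5.4, 6.1),
  candidate_2 = c(3.9, 4.2, 4.4, 4.0, 3.8, 4.6, 4.1, 4.3, 3.7, 4.5)
)

# Matching by resample turns this into a one-sample t-test on the differences
t.test(resample_rmse$candidate_1 - resample_rmse$candidate_2)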

Evaluating Differences in Candidates

What if we were to have compared the candidates while we sequentially evaluated each resample?

πŸ‘‰

One candidate shows superiority when 4 resamples have been evaluated.

Interim Analysis of Results

One version of racing uses a mixed model ANOVA to construct one-sided confidence intervals for each candidate versus the current best.

Any candidates whose bound does not include zero are discarded.

The resamples are analyzed in a random order (so set the seed).


Kuhn (2014) has examples and simulations to show that the method works.

The finetune package has functions tune_race_anova() and tune_race_win_loss().

Racing

# Let's use a larger grid
set.seed(8945)
lgbm_grid <- 
  lgbm_param %>% 
  grid_space_filling(size = 50)

library(finetune)

set.seed(9)
lgbm_race_res <-
  lgbm_wflow %>%
  tune_race_anova(
    resamples = hotel_rs,
    grid = lgbm_grid, 
    metrics = reg_metrics
  )

The syntax and helper functions are extremely similar to those shown for tune_grid().

Racing Results

show_best(lgbm_race_res, metric = "mae")
#> # A tibble: 2 Γ— 11
#>   trees min_n learn_rate `agent hash` `company hash` .metric .estimator  mean     n std_err .config              
#>   <int> <int>      <dbl>        <int>          <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
#> 1  1347     5     0.0655           66             26 mae     standard    9.64    10   0.173 Preprocessor34_Model1
#> 2   980     8     0.0429           17            135 mae     standard    9.76    10   0.164 Preprocessor25_Model1

Racing Results

Only 171 models were fit (out of 500).

select_best() never considers candidate models that did not get to the end of the race.

There is a helper function to see how candidate models were removed from consideration.

plot_race(lgbm_race_res) + 
  scale_x_continuous(breaks = pretty_breaks())

Your turn

  • Run tune_race_anova() with a different seed.
  • Did you get the same or similar results?
10:00

Your turn

  • Now repeat the process using tune_race_anova(), but this time focus on performance metric rsq (use set.seed(9) as before).

  • What can you conclude when you compare results from mae and rsq?

15:00

Open Worksheet 4 πŸ“„

Iterative Search

Previously - Setup

library(tidymodels)
library(textrecipes)
library(bonsai)

# extra settings: 
tidymodels_prefer()
theme_set(theme_bw())
options(
  pillar.advice = FALSE, 
  pillar.min_title_chars = Inf
)
data(hotel_rates)
set.seed(295)
hotel_rates <- 
  hotel_rates %>% 
  sample_n(5000) %>% 
  arrange(arrival_date) %>% 
  select(-arrival_date) %>%  
  mutate(
    company = factor(as.character(company)),
    country = factor(as.character(country)),
    agent = factor(as.character(agent))
  )

Previously - Data Usage

set.seed(4028)
hotel_split <- initial_split(hotel_rates, strata = avg_price_per_room)

hotel_train <- training(hotel_split)
hotel_test <- testing(hotel_split)

set.seed(472)
hotel_rs <- vfold_cv(hotel_train, strata = avg_price_per_room)

Our Boosting Model

We used feature hashing to generate a smaller set of indicator columns to deal with the large number of levels for the agent and company predictors.


Tree-based models (and a few others) don’t require indicators for categorical predictors. They can split on these variables as-is.


We’ll keep all categorical predictors as factors and focus on optimizing additional boosting parameters.

Our Boosting Model

lgbm_spec <- 
  boost_tree(trees = 1000, learn_rate = tune(), min_n = tune(), 
             tree_depth = tune(), loss_reduction = tune(), 
             stop_iter = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("lightgbm", num_threads = 1)

lgbm_wflow <- workflow(avg_price_per_room ~ ., lgbm_spec)

lgbm_param <- 
  lgbm_wflow %>%
    extract_parameter_set_dials() %>%
    update(learn_rate = learn_rate(c(-5, -1)))

Iterative Search

Instead of pre-defining a grid of candidate points, we can model our current results to predict what the next candidate point should be.


Suppose that we are only tuning the learning rate in our boosted tree.


We could do something like:

mae_pred <- lm(mae ~ learn_rate, data = resample_results)

and use this to predict and rank new learning rate candidates.

Iterative Search

A linear model probably isn’t the best choice though (more in a minute).

To illustrate the process, we resampled a large grid of learning rate values for our data to show what the relationship is between MAE and learning rate.

Now suppose that we used a grid of three points in the parameter range for learning rate…

A Large Grid

A Three Point Grid

Gaussian Processes and Optimization

We can make a β€œmeta-model” with a small set of historical performance results.

Gaussian Processes (GP) models are a good choice to model performance.

  • It is a Bayesian model so we are using Bayesian Optimization (BO).
  • For regression, we can assume that our data are multivariate normal.
  • We also define a covariance function for the variance relationship between data points. A common one is:

\[\operatorname{cov}(\boldsymbol{x}_i, \boldsymbol{x}_j) = \exp\left(-\frac{1}{2}|\boldsymbol{x}_i - \boldsymbol{x}_j|^2\right) + \sigma^2_{ij}\]

Predicting Candidates

The GP model can take candidate tuning parameter combinations as inputs and make predictions for performance (e.g. MAE)

  • The mean performance
  • The variance of performance

The variance is mostly driven by spatial variability (the previous equation).

The predicted variance is zero at locations of actual data points and becomes very high when far away from any observed data.
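
tune_bayes() fits the GP for us; purely to illustrate, here is a small sketch with the GPfit package (the GP implementation that tune uses internally), using made-up results for a single tuning parameter scaled to [0, 1]:

library(GPfit)

# Made-up resampling results: a [0, 1]-scaled learning rate and its MAE
gp_train <- tibble(
  learn_rate_01 = c(0.05, 0.35, 0.70, 0.95),
  mae           = c(50.2, 22.4, 10.9, 12.3)
)

gp_mod <- GP_fit(X = as.matrix(gp_train$learn_rate_01), Y = gp_train$mae)

# Predicted mean and variance at new candidate values
new_x   <- matrix(seq(0, 1, length.out = 50), ncol = 1)
gp_pred <- predict(gp_mod, xnew = new_x)
head(gp_pred$Y_hat)  # predicted mean MAE
head(gp_pred$MSE)    # predictive variance (shrinks to zero at observed points)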

Your turn

Your GP makes predictions on two new candidate tuning parameters.

We want to minimize MAE.

Which should we choose?

05:00

GP Fit (ribbon is mean +/- 1SD)

Choosing New Candidates

This isn’t a very good fit but we can still use it.

How can we use the outputs to choose the next point to measure?


Acquisition functions take the predicted mean and variance and use them to balance:

  • exploration: new candidates should explore new areas.
  • exploitation: new candidates should stay near the current best results.

Exploration focuses on the variance; exploitation is about the mean.

Acquisition Functions

We’ll use an acquisition function to select a new candidate.

The most popular method appears to be expected improvement (EI) above the current best results.

  • Zero at existing data points.
  • The expected improvement is integrated over all possible improvement (β€œexpected” in the probability sense).

We would probably pick the point with the largest EI as the next point.

(There are other functions beyond EI.)
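
Continuing the GPfit sketch from a few slides back, the closed-form EI for a minimization problem can be computed from the GP’s predicted mean and standard deviation:

# Expected improvement for minimization; uses gp_train, gp_pred, and new_x
# from the earlier GPfit sketch
expected_improvement <- function(mean, sd, best) {
  delta <- best - mean
  z     <- delta / sd
  ifelse(sd <= 0, 0, delta * pnorm(z) + sd * dnorm(z))
}

best_mae <- min(gp_train$mae)
ei_vals  <- expected_improvement(gp_pred$Y_hat, sqrt(pmax(gp_pred$MSE, 0)), best_mae)

# A sensible next candidate is the one with the largest EI
new_x[which.max(ei_vals), ]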

Expected Improvement

Iteration

Once we pick the candidate point, we measure performance for it (e.g. resampling).


Another GP is fit, EI is recomputed, and so on.


We stop when we have completed the allowed number of iterations or if we don’t see any improvement after a pre-set number of attempts.

GP Fit with four points

Expected Improvement

GP Evolution

Expected Improvement Evolution

BO in tidymodels

We’ll use a function called tune_bayes() that has very similar syntax to tune_grid().


It has an additional initial argument for the initial set of performance estimates and parameter combinations for the GP model.

Initial grid points

initial can be the results of another tune_*() function or an integer (in which case tune_grid() is used under the hood to make such an initial set of results).

  • We’ll run the optimization more than once, so let’s make an initial grid of results to serve as the substrate for the BO.

  • I suggest using at least the number of tuning parameters plus two as the size of the initial grid for BO.

An Initial Grid

reg_metrics <- metric_set(mae, rsq)

set.seed(12)
init_res <-
  lgbm_wflow %>%
  tune_grid(
    resamples = hotel_rs,
    grid = nrow(lgbm_param) + 2,
    param_info = lgbm_param,
    metrics = reg_metrics
  )

show_best(init_res, metric = "mae") %>% select(-.metric, -.estimator)
#> # A tibble: 5 Γ— 9
#>   min_n tree_depth learn_rate loss_reduction stop_iter  mean     n std_err .config             
#>   <int>      <int>      <dbl>          <dbl>     <int> <dbl> <int>   <dbl> <chr>               
#> 1    33         12   0.0215         8.25e- 9         5  10.1    10   0.182 Preprocessor1_Model6
#> 2    14         10   0.1            3.16e+ 1         8  10.2    10   0.192 Preprocessor1_Model3
#> 3     8          3   0.00464        1   e-10        11  14.4    10   0.305 Preprocessor1_Model2
#> 4    40          1   0.001          4.64e- 3        14  36.1    10   0.408 Preprocessor1_Model7
#> 5    27         15   0.000215       6.81e- 7        20  44.5    10   0.371 Preprocessor1_Model5

BO using tidymodels

ctrl_bo <- control_bayes(seed = 15, verbose_iter = TRUE) # <- for demonstration

set.seed(15)
lgbm_bayes_res <-
  lgbm_wflow %>%
  tune_bayes(
    resamples = hotel_rs,
    initial = init_res,     # <- initial results
    iter = 20,
    param_info = lgbm_param,
    control = ctrl_bo,
    metrics = reg_metrics
  )
#> Optimizing mae using the expected improvement
#> 
#> ── Iteration 1 ───────────────────────────────────────────────────────
#> 
#> i Current best:      mae=10.05 (@iter 0)
#> i Gaussian process model
#> βœ“ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=3, tree_depth=9, learn_rate=0.089, loss_reduction=2.73e-10, stop_iter=4
#> i Estimating performance
#> βœ“ Estimating performance
#> β™₯ Newest results:    mae=9.651 (+/-0.159)
#> 
#> ── Iteration 2 ───────────────────────────────────────────────────────
#> 
#> i Current best:      mae=9.651 (@iter 1)
#> i Gaussian process model
#> βœ“ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=2, tree_depth=13, learn_rate=0.0196, loss_reduction=0.000897, stop_iter=19
#> i Estimating performance
#> βœ“ Estimating performance
#> β“§ Newest results:    mae=9.971 (+/-0.172)
#> 
#> ── Iteration 3 ───────────────────────────────────────────────────────
#> 
#> i Current best:      mae=9.651 (@iter 1)
#> i Gaussian process model
#> βœ“ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=24, tree_depth=15, learn_rate=0.0554, loss_reduction=0.00201, stop_iter=16
#> i Estimating performance
#> βœ“ Estimating performance
#> β“§ Newest results:    mae=9.73 (+/-0.149)
#> 
#> ── Iteration 4 ───────────────────────────────────────────────────────
#> 
#> i Current best:      mae=9.651 (@iter 1)
#> i Gaussian process model
#> βœ“ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=37, tree_depth=1, learn_rate=0.0117, loss_reduction=1.16e-06, stop_iter=5
#> i Estimating performance
#> βœ“ Estimating performance
#> β“§ Newest results:    mae=17.91 (+/-0.308)
#> 
#> ── Iteration 5 ───────────────────────────────────────────────────────
#> 
#> i Current best:      mae=9.651 (@iter 1)
#> i Gaussian process model
#> βœ“ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=2, tree_depth=14, learn_rate=0.0253, loss_reduction=3.1e-05, stop_iter=4
#> i Estimating performance
#> βœ“ Estimating performance
#> β“§ Newest results:    mae=9.852 (+/-0.161)
#> 
#> ── Iteration 6 ───────────────────────────────────────────────────────
#> 
#> i Current best:      mae=9.651 (@iter 1)
#> i Gaussian process model
#> βœ“ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=6, tree_depth=7, learn_rate=0.0323, loss_reduction=2.04e-10, stop_iter=6
#> i Estimating performance
#> βœ“ Estimating performance
#> β“§ Newest results:    mae=9.816 (+/-0.154)
#> 
#> ── Iteration 7 ───────────────────────────────────────────────────────
#> 
#> i Current best:      mae=9.651 (@iter 1)
#> i Gaussian process model
#> βœ“ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=17, tree_depth=12, learn_rate=0.0359, loss_reduction=1.7e-10, stop_iter=12
#> i Estimating performance
#> βœ“ Estimating performance
#> β“§ Newest results:    mae=9.706 (+/-0.159)
#> 
#> ── Iteration 8 ───────────────────────────────────────────────────────
#> 
#> i Current best:      mae=9.651 (@iter 1)
#> i Gaussian process model
#> βœ“ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=39, tree_depth=15, learn_rate=0.029, loss_reduction=9.27e-08, stop_iter=6
#> i Estimating performance
#> βœ“ Estimating performance
#> β“§ Newest results:    mae=9.999 (+/-0.192)
#> 
#> ── Iteration 9 ───────────────────────────────────────────────────────
#> 
#> i Current best:      mae=9.651 (@iter 1)
#> i Gaussian process model
#> βœ“ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=26, tree_depth=15, learn_rate=0.0249, loss_reduction=6.75e-08, stop_iter=13
#> i Estimating performance
#> βœ“ Estimating performance
#> β“§ Newest results:    mae=9.873 (+/-0.18)
#> 
#> ── Iteration 10 ──────────────────────────────────────────────────────
#> 
#> i Current best:      mae=9.651 (@iter 1)
#> i Gaussian process model
#> βœ“ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=3, tree_depth=9, learn_rate=0.0245, loss_reduction=5.44e-09, stop_iter=4
#> i Estimating performance
#> βœ“ Estimating performance
#> β“§ Newest results:    mae=9.863 (+/-0.18)
#> 
#> ── Iteration 11 ──────────────────────────────────────────────────────
#> 
#> i Current best:      mae=9.651 (@iter 1)
#> i Gaussian process model
#> βœ“ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=5, tree_depth=12, learn_rate=0.0593, loss_reduction=1.16e-09, stop_iter=6
#> i Estimating performance
#> βœ“ Estimating performance
#> β™₯ Newest results:    mae=9.645 (+/-0.171)
#> 
#> ── Iteration 12 ──────────────────────────────────────────────────────
#> 
#> i Current best:      mae=9.645 (@iter 11)
#> i Gaussian process model
#> βœ“ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=23, tree_depth=4, learn_rate=0.0739, loss_reduction=3.99e-10, stop_iter=4
#> i Estimating performance
#> βœ“ Estimating performance
#> β“§ Newest results:    mae=10.09 (+/-0.193)
#> 
#> ── Iteration 13 ──────────────────────────────────────────────────────
#> 
#> i Current best:      mae=9.645 (@iter 11)
#> i Gaussian process model
#> βœ“ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=4, tree_depth=2, learn_rate=0.0439, loss_reduction=0.00914, stop_iter=5
#> i Estimating performance
#> βœ“ Estimating performance
#> β“§ Newest results:    mae=11.59 (+/-0.218)
#> 
#> ── Iteration 14 ──────────────────────────────────────────────────────
#> 
#> i Current best:      mae=9.645 (@iter 11)
#> i Gaussian process model
#> βœ“ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=22, tree_depth=15, learn_rate=0.00489, loss_reduction=3.28e-05, stop_iter=17
#> i Estimating performance
#> βœ“ Estimating performance
#> β“§ Newest results:    mae=11.22 (+/-0.231)
#> 
#> ── Iteration 15 ──────────────────────────────────────────────────────
#> 
#> i Current best:      mae=9.645 (@iter 11)
#> i Gaussian process model
#> βœ“ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=27, tree_depth=8, learn_rate=0.0524, loss_reduction=4.24e-08, stop_iter=11
#> i Estimating performance
#> βœ“ Estimating performance
#> β“§ Newest results:    mae=9.801 (+/-0.184)
#> 
#> ── Iteration 16 ──────────────────────────────────────────────────────
#> 
#> i Current best:      mae=9.645 (@iter 11)
#> i Gaussian process model
#> βœ“ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=3, tree_depth=13, learn_rate=0.00837, loss_reduction=4.21e-10, stop_iter=9
#> i Estimating performance
#> βœ“ Estimating performance
#> β“§ Newest results:    mae=10.45 (+/-0.195)
#> 
#> ── Iteration 17 ──────────────────────────────────────────────────────
#> 
#> i Current best:      mae=9.645 (@iter 11)
#> i Gaussian process model
#> βœ“ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=33, tree_depth=13, learn_rate=0.0767, loss_reduction=1.36e-10, stop_iter=4
#> i Estimating performance
#> βœ“ Estimating performance
#> β“§ Newest results:    mae=9.921 (+/-0.181)
#> 
#> ── Iteration 18 ──────────────────────────────────────────────────────
#> 
#> i Current best:      mae=9.645 (@iter 11)
#> i Gaussian process model
#> βœ“ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=3, tree_depth=7, learn_rate=0.0711, loss_reduction=3.14e-07, stop_iter=13
#> i Estimating performance
#> βœ“ Estimating performance
#> β“§ Newest results:    mae=9.664 (+/-0.161)
#> 
#> ── Iteration 19 ──────────────────────────────────────────────────────
#> 
#> i Current best:      mae=9.645 (@iter 11)
#> i Gaussian process model
#> βœ“ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=3, tree_depth=9, learn_rate=0.0613, loss_reduction=2.02, stop_iter=8
#> i Estimating performance
#> βœ“ Estimating performance
#> β“§ Newest results:    mae=9.772 (+/-0.165)
#> 
#> ── Iteration 20 ──────────────────────────────────────────────────────
#> 
#> i Current best:      mae=9.645 (@iter 11)
#> i Gaussian process model
#> βœ“ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=5, tree_depth=6, learn_rate=0.0896, loss_reduction=4.87e-09, stop_iter=9
#> i Estimating performance
#> βœ“ Estimating performance
#> β“§ Newest results:    mae=9.721 (+/-0.146)

Best results

show_best(lgbm_bayes_res, metric = "mae") %>% select(-.metric, -.estimator)
#> # A tibble: 5 Γ— 10
#>   min_n tree_depth learn_rate loss_reduction stop_iter  mean     n std_err .config .iter
#>   <int>      <int>      <dbl>          <dbl>     <int> <dbl> <int>   <dbl> <chr>   <int>
#> 1     5         12     0.0593       1.16e- 9         6  9.64    10   0.171 Iter11     11
#> 2     3          9     0.0890       2.73e-10         4  9.65    10   0.159 Iter1       1
#> 3     3          7     0.0711       3.14e- 7        13  9.66    10   0.161 Iter18     18
#> 4    17         12     0.0359       1.70e-10        12  9.71    10   0.159 Iter7       7
#> 5     5          6     0.0896       4.87e- 9         9  9.72    10   0.146 Iter20     20

Plotting BO Results

autoplot(lgbm_bayes_res, metric = "mae")

Plotting BO Results

autoplot(lgbm_bayes_res, metric = "mae", type = "parameters")

Plotting BO Results

autoplot(lgbm_bayes_res, metric = "mae", type = "performance")

ENHANCE

autoplot(lgbm_bayes_res, metric = "mae", type = "performance") +
  ylim(c(9, 14))

Your turn

Let’s try a different acquisition function: conf_bound(kappa).

We’ll use the objective argument to set it.

Choose your own kappa value:

  • Larger values will explore the space more.
  • β€œLarge” values are usually less than one.

Bonus points: Before the optimization is done, press <esc> and see what happens.

15:00

Notes

  • Stopping tune_bayes() will return the current results.

  • Parallel processing can still be used to more efficiently measure each candidate point.

  • There are a lot of other iterative methods that you can use.

  • The finetune package also has functions for simulated annealing search, sketched below.
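
For example, a sketch of simulated annealing search with the same arguments we passed to tune_bayes() (not evaluated here):

library(finetune)

set.seed(40)
lgbm_sa_res <-
  lgbm_wflow %>%
  tune_sim_anneal(
    resamples = hotel_rs,
    initial = init_res,
    iter = 20,
    param_info = lgbm_param,
    metrics = reg_metrics
  )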

Finalizing the Model

Let’s say that we’ve tried a lot of different models and we like our lightgbm model the most.

What do we do now?

  • Finalize the workflow by choosing the values for the tuning parameters.
  • Fit the model on the entire training set.
  • Verify performance using the test set.
  • Document and publish the model(?)

Locking Down the Tuning Parameters

We can take the results of the Bayesian optimization and accept the best results:

best_param <- select_best(lgbm_bayes_res, metric = "mae")
final_wflow <- 
  lgbm_wflow %>% 
  finalize_workflow(best_param)
final_wflow
#> ══ Workflow ══════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: boost_tree()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────
#> avg_price_per_room ~ .
#> 
#> ── Model ─────────────────────────────────────────────────────────────
#> Boosted Tree Model Specification (regression)
#> 
#> Main Arguments:
#>   trees = 1000
#>   min_n = 5
#>   tree_depth = 12
#>   learn_rate = 0.0593110845668806
#>   loss_reduction = 1.16244203146145e-09
#>   stop_iter = 6
#> 
#> Engine-Specific Arguments:
#>   num_threads = 1
#> 
#> Computational engine: lightgbm

The Final Fit

We can use individual functions:

final_fit <- final_wflow %>% fit(data = hotel_train)

# then predict() or augment() 
# then compute metrics
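
For instance, those two commented steps might look like:

test_pred <- augment(final_fit, new_data = hotel_test)  # adds a .pred column

test_pred %>% 
  reg_metrics(truth = avg_price_per_room, estimate = .pred)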


Remember that there is also a convenience function to do all of this:

set.seed(3893)
final_res <- final_wflow %>% last_fit(hotel_split, metrics = reg_metrics)
final_res
#> # Resampling results
#> # Manual resampling 
#> # A tibble: 1 Γ— 6
#>   splits              id               .metrics         .notes           .predictions         .workflow 
#>   <list>              <chr>            <list>           <list>           <list>               <list>    
#> 1 <split [3749/1251]> train/test split <tibble [2 Γ— 4]> <tibble [0 Γ— 3]> <tibble [1,251 Γ— 4]> <workflow>

Test Set Results

final_res %>% 
  collect_predictions() %>% 
  cal_plot_regression(
    truth = avg_price_per_room, 
    estimate = .pred)

Test set performance:

final_res %>% collect_metrics()
#> # A tibble: 2 Γ— 4
#>   .metric .estimator .estimate .config             
#>   <chr>   <chr>          <dbl> <chr>               
#> 1 mae     standard       9.66  Preprocessor1_Model1
#> 2 rsq     standard       0.948 Preprocessor1_Model1

Recall that resampling predicted the MAE to be 9.645.

We made it!

Resources to keep learning