Part II
Also, you should install a recent version of the dials package (the session below uses version 1.4.0). To check this, you can run:
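One minimal way to check, assuming the version in the table below is the target:
packageVersion("dials")   # should report 1.4.0 or later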
R version 4.5.0 (2025-04-11), Quarto (1.7.29)
package | version |
---|---|
bonsai | 0.3.2 |
broom | 1.0.8 |
Cubist | 0.5.0 |
dials | 1.4.0 |
doParallel | 1.0.17 |
dplyr | 1.1.4 |
earth | 5.3.4 |
embed | 1.1.5 |
finetune | 1.2.0 |
Formula | 1.2-5 |
ggplot2 | 3.5.2 |
lattice | 0.22-6 |
lightgbm | 4.6.0 |
lme4 | 1.1-37 |
modeldata | 1.4.0 |
parallelly | 1.43.0 |
parsnip | 1.3.1 |
plotmo | 3.6.4 |
plotrix | 3.8-4 |
plumber | 1.3.0 |
probably | 1.0.3 |
purrr | 1.0.4 |
recipes | 1.3.0 |
rsample | 1.3.0 |
rules | 1.0.2 |
scales | 1.4.0 |
splines2 | 0.5.4 |
stacks | 1.1.0 |
text2vec | 0.6.4 |
textrecipes | 1.1.0 |
tibble | 3.2.1 |
tidymodels | 1.3.0 |
tidyr | 1.3.1 |
tune | 1.3.0 |
vetiver | 0.2.5 |
workflows | 1.2.0 |
workflowsets | 1.1.0 |
yardstick | 1.3.2 |
We’ll use data on hotels to predict the cost of a room.
The data are in the modeldata package. We’ll sample down the data and refactor some columns:
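A hedged sketch of that preparation (the seed, the sample size of 5,000, and which columns are refactored are assumptions; the 3,749/1,251 split shown later implies 5,000 rows):
library(tidymodels)
library(modeldata)

set.seed(295)
hotel_rates <-
  hotel_rates %>%
  slice_sample(n = 5000) %>%
  mutate(
    company = factor(as.character(company)),
    country = factor(as.character(country)),
    agent   = factor(as.character(agent))
  )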
names(hotel_rates)
#> [1] "avg_price_per_room" "lead_time"
#> [3] "stays_in_weekend_nights" "stays_in_week_nights"
#> [5] "adults" "children"
#> [7] "babies" "meal"
#> [9] "country" "market_segment"
#> [11] "distribution_channel" "is_repeated_guest"
#> [13] "previous_cancellations" "previous_bookings_not_canceled"
#> [15] "reserved_room_type" "assigned_room_type"
#> [17] "booking_changes" "agent"
#> [19] "company" "days_in_waiting_list"
#> [21] "customer_type" "required_car_parking_spaces"
#> [23] "total_of_special_requests" "arrival_date_num"
#> [25] "near_christmas" "near_new_years"
#> [27] "historical_adr"
Let’s split the data into a training set (75%) and testing set (25%) using stratification:
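A hedged sketch (the seed is an assumption; the object names match those used later in this section):
set.seed(4028)
hotel_split <- initial_split(hotel_rates, prop = 0.75, strata = avg_price_per_room)
hotel_train <- training(hotel_split)
hotel_test  <- testing(hotel_split)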
Let’s take some time and investigate the training data. The outcome is avg_price_per_room.
Are there any interesting characteristics of the data?
10:00
We might want to modify our predictor columns for a few reasons:
The first two reasons are fairly predictable (next page).
The last one depends on your modeling problem.
Think of a feature as some representation of a predictor that will be used in a model.
Example representations:
There are a lot of examples in Feature Engineering and Selection (FES) and Applied Machine Learning for Tabular Data (aml4td).
How can we represent date columns for our model?
When we use a date column in its native format, most models in R convert it to an integer.
We can re-engineer it as:
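As a hedged, toy illustration (not part of the hotel recipe; the column name, data, and holiday choices are assumptions), recipes can pull out features such as the day of the week, month, year, and holiday indicators:
library(recipes)
# a small made-up data set with a date column
date_df <- tibble::tibble(arrival = as.Date("2016-12-20") + 0:13, y = rnorm(14))
recipe(y ~ arrival, data = date_df) %>%
  step_date(arrival, features = c("dow", "month", "year")) %>%
  step_holiday(arrival, holidays = c("ChristmasDay", "NewYearsDay")) %>%
  prep() %>%
  bake(new_data = NULL)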
Data preprocessing steps allow your model to fit.
Feature engineering steps help the model do the least work to predict the outcome as well as possible.
The recipes package can handle both!
We’ll go from here and create a set of resamples to use for model assessments.
We’ll use simple 10-fold cross-validation (stratified sampling):
set.seed(472)
hotel_rs <- vfold_cv(hotel_train, strata = avg_price_per_room)
hotel_rs
#> # 10-fold cross-validation using stratification
#> # A tibble: 10 × 2
#> splits id
#> <list> <chr>
#> 1 <split [3372/377]> Fold01
#> 2 <split [3373/376]> Fold02
#> 3 <split [3373/376]> Fold03
#> 4 <split [3373/376]> Fold04
#> 5 <split [3373/376]> Fold05
#> 6 <split [3374/375]> Fold06
#> 7 <split [3375/374]> Fold07
#> 8 <split [3376/373]> Fold08
#> 9 <split [3376/373]> Fold09
#> 10 <split [3376/373]> Fold10
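The recipe itself starts from a formula; a minimal sketch (the object name hotel_rec is taken from the summary() call below):
hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train)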
The recipe() function assigns columns to the roles of “outcome” or “predictor” using the formula.
summary(hotel_rec)
#> # A tibble: 27 × 4
#> variable type role source
#> <chr> <list> <chr> <chr>
#> 1 lead_time <chr [2]> predictor original
#> 2 stays_in_weekend_nights <chr [2]> predictor original
#> 3 stays_in_week_nights <chr [2]> predictor original
#> 4 adults <chr [2]> predictor original
#> 5 children <chr [2]> predictor original
#> 6 babies <chr [2]> predictor original
#> 7 meal <chr [3]> predictor original
#> 8 country <chr [3]> predictor original
#> 9 market_segment <chr [3]> predictor original
#> 10 distribution_channel <chr [3]> predictor original
#> # ℹ 17 more rows
The type column contains information on the variables.
What do you think are in the type vectors for the lead_time and country columns?
03:00
For any factor or character predictors, make binary indicators.
There are many recipe steps that can convert categorical predictors to numeric columns.
step_dummy() records the levels of the categorical predictors in the training set.
In case there is a factor level that was never observed in the training data (resulting in a column of all 0s), we can delete any zero-variance predictors that have a single unique value.
We can also center and scale the numeric predictors.
The recipe will use the training set to estimate the means and standard deviations of the data.
To deal with highly correlated predictors, find the minimum set of predictor columns that make the pairwise correlations less than the threshold.
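Putting these preprocessing ideas together, a hedged sketch of such a recipe (the object name and the correlation threshold are assumptions):
hotel_norm_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9)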
PCA feature extraction…
A fancy machine learning supervised dimension reduction technique called UMAP…
Nonlinear transforms like natural splines, and so on!
Create a recipe() for the hotel data to preprocess the predictors (for example, transforming lead_time).
10:00
We’ll compute two measures: mean absolute error and the coefficient of determination (a.k.a. \(R^2\)).
\[\begin{align} MAE &= \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i| \notag \\ R^2 &= cor(y_i, \hat{y}_i)^2 \end{align}\]
The focus will be on MAE for parameter optimization. We’ll use a metric set to compute these:
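The metric set is the reg_metrics object used with fit_resamples() below (its definition also appears later with the Bayesian optimization code):
reg_metrics <- metric_set(mae, rsq)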
set.seed(9)
hotel_lm_wflow <-
workflow() %>%
add_recipe(hotel_indicators) %>%
add_model(linear_reg())
ctrl <- control_resamples(save_pred = TRUE)
hotel_lm_res <-
hotel_lm_wflow %>%
fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)
collect_metrics(hotel_lm_res)
#> # A tibble: 2 × 6
#> .metric .estimator mean n std_err .config
#> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 mae standard 16.6 10 0.214 Preprocessor1_Model1
#> 2 rsq standard 0.884 10 0.00339 Preprocessor1_Model1
Use fit_resamples() to fit your workflow with a recipe.
Collect the predictions from the results.
05:00
# Since we used `save_pred = TRUE`
lm_cv_pred <- collect_predictions(hotel_lm_res)
lm_cv_pred %>% print(n = 7)
#> # A tibble: 3,749 × 5
#> .pred id .row avg_price_per_room .config
#> <dbl> <chr> <int> <dbl> <chr>
#> 1 75.1 Fold01 20 40 Preprocessor1_Model1
#> 2 49.3 Fold01 28 54 Preprocessor1_Model1
#> 3 64.9 Fold01 45 50 Preprocessor1_Model1
#> 4 52.8 Fold01 49 42 Preprocessor1_Model1
#> 5 48.6 Fold01 61 49 Preprocessor1_Model1
#> 6 29.8 Fold01 66 40 Preprocessor1_Model1
#> 7 36.9 Fold01 88 49 Preprocessor1_Model1
#> # ℹ 3,742 more rows
There are 98 unique agent values and 100 unique companies in our training set. How can we include this information in our model?
We could:
make the full set of indicator variables
lump agents and companies that rarely occur into an “other” group
use feature hashing to create a smaller set of indicator variables
use effect encoding to replace the agent and company columns with the estimated effect of that predictor (in the extra materials)
There is a recipe step that will redefine factor levels based on their frequency in the training set:
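A hedged sketch of that recipe (step_other() is the relevant step; the threshold value and the surrounding steps are assumptions that mirror the hashing recipe shown later; the object name hotel_other_rec is used below):
hotel_other_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  step_YeoJohnson(lead_time) %>%
  step_other(agent, threshold = 0.001) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_spline_natural(arrival_date_num, deg_free = 10)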
Using this code, 34 agents (out of 98) were collapsed into “other” based on the training set.
We could try to optimize the threshold for collapsing (we will get to model tuning later).
hotel_other_wflow <-
hotel_lm_wflow %>%
update_recipe(hotel_other_rec)
hotel_other_res <-
hotel_other_wflow %>%
fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)
collect_metrics(hotel_other_res)
#> # A tibble: 2 × 6
#> .metric .estimator mean n std_err .config
#> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 mae standard 16.7 10 0.213 Preprocessor1_Model1
#> 2 rsq standard 0.884 10 0.00341 Preprocessor1_Model1
About the same MAE and much faster to complete.
Now let’s look at a more sophisticated tool called feature hashing.
Between agent and company, simple dummy variables would create 198 new columns (that are mostly zeros).
Another option is to have a binary indicator that combines some levels of these variables.
Feature hashing (for more see FES, SMLTAR, TMwR, and aml4td):
Suppose we want to use 32 indicator variables for agent.
For an agent with the value “Max_Kuhn”, a hashing function converts it to an integer (say 210397726).
To assign it to one of the 32 columns, we would use modular arithmetic to assign it to a column:
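The modular arithmetic itself, as a quick check in R:
210397726 %% 32
#> [1] 30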
Hash functions are meant to emulate randomness.
The textrecipes package has a step that can be added to the recipe:
library(textrecipes)
hash_rec <-
recipe(avg_price_per_room ~ ., data = hotel_train) %>%
step_YeoJohnson(lead_time) %>%
# Defaults to 32 signed indicator columns
step_dummy_hash(agent) %>%
step_dummy_hash(company) %>%
# Regular indicators for the others
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_spline_natural(arrival_date_num, deg_free = 10)
hotel_hash_wflow <-
hotel_lm_wflow %>%
update_recipe(hash_rec)
hotel_hash_res <-
hotel_hash_wflow %>%
fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)
collect_metrics(hotel_hash_res)
#> # A tibble: 2 × 6
#> .metric .estimator mean n std_err .config
#> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 mae standard 16.7 10 0.239 Preprocessor1_Model1
#> 2 rsq standard 0.884 10 0.00324 Preprocessor1_Model1
About the same performance but now we can handle new values.
The recipe object (e.g. hash_rec) can be estimated manually with a function called prep(). It is analogous to fit(). See TMwR section 16.4.
Another function (bake()) is analogous to predict(), and gives you the processed data back.
The tidy() function can be used to get specific results from the recipe.
Once fit() is called on a workflow, changing the model does not re-fit the recipe.
Some recipe steps can be skipped when using predict().
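For example, a hedged sketch of working with the recipe manually (assuming hash_rec and hotel_train from above):
hash_rec_prepped <- prep(hash_rec)                        # estimate the steps from the training set
hash_rec_baked <- bake(hash_rec_prepped, new_data = NULL) # return the processed training set
tidy(hash_rec_prepped, number = 1)                        # details for the first step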
library(textrecipes)
hash_rec <-
recipe(avg_price_per_room ~ ., data = hotel_train) %>%
step_YeoJohnson(lead_time) %>%
# Defaults to 32 signed indicator columns
step_dummy_hash(agent) %>%
step_dummy_hash(company) %>%
# Regular indicators for the others
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors())
Some model or preprocessing parameters cannot be estimated directly from the data.
Some examples:
Sigmoidal functions, ReLu, etc.
Yes, it is a tuning parameter. ✅
Yes, it is a tuning parameter. ✅
Hmmmm, probably not. These are based on prior belief. ❌
Nope. It is not. ❌
With tidymodels, you can mark the parameters that you want to optimize with a value of tune().
The function itself just returns… itself:
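For instance, calling it at the console simply echoes the call:
tune()
#> tune()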
For example…
Our new recipe is:
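A hedged sketch of that recipe, mirroring the exercise code that appears later in this section:
hash_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  step_YeoJohnson(lead_time) %>%
  # tune the number of hashing columns for each predictor
  step_dummy_hash(agent, num_terms = tune("agent hash")) %>%
  step_dummy_hash(company, num_terms = tune("company hash")) %>%
  step_zv(all_predictors())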
We will be using a tree-based model in a minute; these models do not require step_dummy().
These are popular ensemble methods that build a sequence of tree models.
Each tree uses the results of the previous tree to better predict samples, especially those that have been poorly predicted.
Each tree in the ensemble is saved and new samples are predicted using a weighted average of the votes of each tree in the ensemble.
We’ll focus on the popular lightgbm implementation.
Some possible parameters:
mtry: The number of predictors randomly sampled at each split (in \([1, ncol(x)]\) or \((0, 1]\)).
trees: The number of trees (\([1, \infty]\), but usually up to thousands).
min_n: The number of samples needed to further split (\([1, n]\)).
learn_rate: The rate that each tree adapts from previous iterations (\((0, \infty]\), usual maximum is 0.1).
stop_iter: The number of iterations of boosting where no improvement was shown before stopping (\([1, trees]\)).
TBH it is usually not difficult to optimize these models.
Often, there are multiple candidate tuning parameter combinations that have very good results.
To demonstrate simple concepts, we’ll look at optimizing the number of trees in the ensemble (between 1 and 100) and the learning rate (\(10^{-5}\) to \(10^{-1}\)).
We’ll need to load the bonsai package. This has the information needed to use lightgbm.
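A hedged sketch of the model specification and workflow (the engine option and the workflow composition are assumptions consistent with the tuning code shown later; hash_rec is the hashing recipe with tuned num_terms):
library(bonsai)

lgbm_spec <-
  boost_tree(trees = tune(), learn_rate = tune()) %>%
  set_mode("regression") %>%
  set_engine("lightgbm", num_threads = 1)

lgbm_wflow <- workflow(hash_rec, lgbm_spec)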
The main two strategies for optimization are:
Grid search, which tests a pre-defined set of candidate values
Iterative search, which suggests/estimates new values of candidate parameters to evaluate
A small grid of points trying to minimize the error via learning rate:
In reality we would probably sample the space more densely:
We could start with a few points and search the space:
The tidymodels framework provides pre-defined information on tuning parameters (such as their type, range, transformations, etc).
The extract_parameter_set_dials() function extracts these tuning parameters and the info.
Create your grid manually or automatically.
The grid_*() functions can make a grid.
Space-filling designs (SFD) attempt to cover the parameter space without redundant candidates (recommended).
A parameter set can be updated (e.g. to change the ranges).
set.seed(12)
grid <-
lgbm_wflow %>%
extract_parameter_set_dials() %>%
grid_space_filling(size = 25)
grid
#> # A tibble: 25 × 4
#> trees learn_rate `agent hash` `company hash`
#> <int> <dbl> <int> <int>
#> 1 1 7.50e- 6 574 574
#> 2 84 1.78e- 5 2048 2298
#> 3 167 5.62e-10 1824 912
#> 4 250 4.22e- 5 3250 512
#> 5 334 1.78e- 8 512 2896
#> 6 417 1.33e- 3 322 1625
#> 7 500 1 e- 1 1448 1149
#> 8 584 1 e- 7 1290 256
#> 9 667 2.37e-10 456 724
#> 10 750 1.78e- 2 645 322
#> # ℹ 15 more rows
Create a grid for our tunable workflow.
Try creating a regular grid.
10:00
set.seed(12)
grid <-
lgbm_wflow %>%
extract_parameter_set_dials() %>%
grid_regular(levels = 4)
grid
#> # A tibble: 256 × 4
#> trees learn_rate `agent hash` `company hash`
#> <int> <dbl> <int> <int>
#> 1 1 0.0000000001 256 256
#> 2 667 0.0000000001 256 256
#> 3 1333 0.0000000001 256 256
#> 4 2000 0.0000000001 256 256
#> 5 1 0.0000001 256 256
#> 6 667 0.0000001 256 256
#> 7 1333 0.0000001 256 256
#> 8 2000 0.0000001 256 256
#> 9 1 0.0001 256 256
#> 10 667 0.0001 256 256
#> # ℹ 246 more rows
What advantage would a regular grid have?
lgbm_param <-
lgbm_wflow %>%
extract_parameter_set_dials() %>%
update(trees = trees(c(1L, 100L)),
learn_rate = learn_rate(c(-5, -1)))
set.seed(712)
grid <-
lgbm_param %>%
grid_space_filling(size = 25)
grid
#> # A tibble: 25 × 4
#> trees learn_rate `agent hash` `company hash`
#> <int> <dbl> <int> <int>
#> 1 1 0.00147 574 574
#> 2 5 0.00215 2048 2298
#> 3 9 0.0000215 1824 912
#> 4 13 0.00316 3250 512
#> 5 17 0.0001 512 2896
#> 6 21 0.0147 322 1625
#> 7 25 0.1 1448 1149
#> 8 29 0.000215 1290 256
#> 9 34 0.0000147 456 724
#> 10 38 0.0464 645 322
#> # ℹ 15 more rows
Note that the learning rates are uniform on the log-10 scale and this shows 2 of 4 dimensions.
The tune_*() functions are used to tune models.
Let’s take our previous model and tune more parameters:
lgbm_spec <-
boost_tree(trees = tune(), learn_rate = tune(), min_n = tune()) %>%
set_mode("regression") %>%
set_engine("lightgbm", num_threads = 1)
lgbm_wflow <- workflow(hash_rec, lgbm_spec)
# Update the feature hash ranges (log-2 units)
lgbm_param <-
lgbm_wflow %>%
extract_parameter_set_dials() %>%
update(`agent hash` = num_hash(c(3, 8)),
`company hash` = num_hash(c(3, 8)))
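The tuning call that produced lgbm_res is not shown above; a hedged sketch (the grid size and control options are assumptions inferred from the 25 candidates and the saved predictions in the results):
lgbm_res <-
  lgbm_wflow %>%
  tune_grid(
    resamples = hotel_rs,
    grid = 25,
    param_info = lgbm_param,
    control = control_grid(save_pred = TRUE),
    metrics = reg_metrics
  )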
lgbm_res
#> # Tuning results
#> # 10-fold cross-validation using stratification
#> # A tibble: 10 × 5
#> splits id .metrics .notes .predictions
#> <list> <chr> <list> <list> <list>
#> 1 <split [3372/377]> Fold01 <tibble [50 × 9]> <tibble [0 × 3]> <tibble [9,425 × 9]>
#> 2 <split [3373/376]> Fold02 <tibble [50 × 9]> <tibble [0 × 3]> <tibble [9,400 × 9]>
#> 3 <split [3373/376]> Fold03 <tibble [50 × 9]> <tibble [0 × 3]> <tibble [9,400 × 9]>
#> 4 <split [3373/376]> Fold04 <tibble [50 × 9]> <tibble [0 × 3]> <tibble [9,400 × 9]>
#> 5 <split [3373/376]> Fold05 <tibble [50 × 9]> <tibble [0 × 3]> <tibble [9,400 × 9]>
#> 6 <split [3374/375]> Fold06 <tibble [50 × 9]> <tibble [0 × 3]> <tibble [9,375 × 9]>
#> 7 <split [3375/374]> Fold07 <tibble [50 × 9]> <tibble [0 × 3]> <tibble [9,350 × 9]>
#> 8 <split [3376/373]> Fold08 <tibble [50 × 9]> <tibble [0 × 3]> <tibble [9,325 × 9]>
#> 9 <split [3376/373]> Fold09 <tibble [50 × 9]> <tibble [0 × 3]> <tibble [9,325 × 9]>
#> 10 <split [3376/373]> Fold10 <tibble [50 × 9]> <tibble [0 × 3]> <tibble [9,325 × 9]>
collect_metrics(lgbm_res)
#> # A tibble: 50 × 11
#> trees min_n learn_rate `agent hash` `company hash` .metric .estimator mean n std_err .config
#> <int> <int> <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 1 17 0.00000000133 93 25 mae standard 53.2 10 0.427 Preprocessor01_Model1
#> 2 1 17 0.00000000133 93 25 rsq standard 0.810 10 0.00746 Preprocessor01_Model1
#> 3 84 27 0.0178 80 69 mae standard 20.0 10 0.293 Preprocessor02_Model1
#> 4 84 27 0.0178 80 69 rsq standard 0.897 10 0.00423 Preprocessor02_Model1
#> 5 167 19 0.00000133 12 143 mae standard 53.2 10 0.426 Preprocessor03_Model1
#> 6 167 19 0.00000133 12 143 rsq standard 0.811 10 0.00698 Preprocessor03_Model1
#> 7 250 2 0.000237 33 33 mae standard 50.6 10 0.409 Preprocessor04_Model1
#> 8 250 2 0.000237 33 33 rsq standard 0.815 10 0.00800 Preprocessor04_Model1
#> 9 334 36 0.0000000178 19 21 mae standard 53.2 10 0.427 Preprocessor05_Model1
#> 10 334 36 0.0000000178 19 21 rsq standard 0.807 10 0.00692 Preprocessor05_Model1
#> # ℹ 40 more rows
collect_metrics(lgbm_res, summarize = FALSE)
#> # A tibble: 500 × 10
#> id trees min_n learn_rate `agent hash` `company hash` .metric .estimator .estimate .config
#> <chr> <int> <int> <dbl> <int> <int> <chr> <chr> <dbl> <chr>
#> 1 Fold01 1 17 0.00000000133 93 25 mae standard 51.8 Preprocessor01_Model1
#> 2 Fold01 1 17 0.00000000133 93 25 rsq standard 0.821 Preprocessor01_Model1
#> 3 Fold02 1 17 0.00000000133 93 25 mae standard 52.1 Preprocessor01_Model1
#> 4 Fold02 1 17 0.00000000133 93 25 rsq standard 0.810 Preprocessor01_Model1
#> 5 Fold03 1 17 0.00000000133 93 25 mae standard 52.2 Preprocessor01_Model1
#> 6 Fold03 1 17 0.00000000133 93 25 rsq standard 0.783 Preprocessor01_Model1
#> 7 Fold04 1 17 0.00000000133 93 25 mae standard 51.7 Preprocessor01_Model1
#> 8 Fold04 1 17 0.00000000133 93 25 rsq standard 0.815 Preprocessor01_Model1
#> 9 Fold05 1 17 0.00000000133 93 25 mae standard 55.2 Preprocessor01_Model1
#> 10 Fold05 1 17 0.00000000133 93 25 rsq standard 0.847 Preprocessor01_Model1
#> # ℹ 490 more rows
show_best(lgbm_res, metric = "rsq")
#> # A tibble: 5 × 11
#> trees min_n learn_rate `agent hash` `company hash` .metric .estimator mean n std_err .config
#> <int> <int> <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 1250 9 0.1 107 80 rsq standard 0.947 10 0.00395 Preprocessor16_Model1
#> 2 917 16 0.0422 9 39 rsq standard 0.946 10 0.00362 Preprocessor12_Model1
#> 3 1666 30 0.00750 124 19 rsq standard 0.943 10 0.00309 Preprocessor21_Model1
#> 4 2000 24 0.00133 25 124 rsq standard 0.919 10 0.00383 Preprocessor25_Model1
#> 5 500 25 0.00316 52 8 rsq standard 0.899 10 0.00435 Preprocessor07_Model1
Create your own tibble for final parameters or use one of the tune::select_*() functions:
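For example (select_best() is the same helper used with the Bayesian optimization results later):
lgbm_best <- select_best(lgbm_res, metric = "mae")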
Grid search, combined with resampling, requires fitting a lot of models!
These models donβt depend on one another and can be run in parallel.
We can use a parallel backend to do this:
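One possible backend, using the doParallel package from the session package list (the core count mirrors the 10 physical cores mentioned below):
library(doParallel)
cl <- makePSOCKcluster(10)
registerDoParallel(cl)
# fit_resamples() and tune_grid() now run the resamples in parallel
# ... tuning code here ...
stopCluster(cl)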
Speed-ups are fairly linear up to the number of physical cores (10 here).
We have directly optimized the number of trees as a tuning parameter.
Instead we could set the number of trees to a single large value and stop adding trees when performance no longer improves.
This is known as “early stopping” and there is a parameter for that: stop_iter.
Early stopping has the potential to decrease the tuning time.
Set trees = 2000 and tune the stop_iter parameter.
Note that you will need to regenerate lgbm_param with your new workflow!
10:00
hotel_rec <-
recipe(avg_price_per_room ~ ., data = hotel_train) %>%
step_YeoJohnson(lead_time) %>%
step_dummy_hash(agent, num_terms = tune("agent hash")) %>%
step_dummy_hash(company, num_terms = tune("company hash")) %>%
step_zv(all_predictors())
lgbm_spec <-
boost_tree(trees = tune(), learn_rate = tune(), min_n = tune()) %>%
set_mode("regression") %>%
set_engine("lightgbm", num_threads = 1)
lgbm_wflow <- workflow(hotel_rec, lgbm_spec)
lgbm_param <-
lgbm_wflow %>%
extract_parameter_set_dials() %>%
update(`agent hash` = num_hash(c(3, 8)),
`company hash` = num_hash(c(3, 8)))
In the last section, we evaluated 250 models (25 candidates times 10 resamples).
We can make this go faster using parallel processing.
Also, for some models, we can fit far fewer models than the number that are being evaluated.
For example, a model with X trees can often predict on candidates with fewer than X trees.
Both of these methods can lead to enormous speed-ups.
Racing is an old tool that we can use to go even faster.
This can result in fitting a small number of models.
How do we eliminate tuning parameter combinations?
There are a few methods to do so. We’ll use one based on analysis of variance (ANOVA).
However… there is typically a large difference between resamples in the results.
Here are some realistic (but simulated) examples of two candidate models.
An error estimate is measured for each of 10 resamples.
There is usually a significant resample-to-resample effect (rank corr: 0.83).
One way to evaluate these models is to do a paired t-test
With \(n = 10\) resamples, the confidence interval for the difference in RMSE is (0.99, 2.8), indicating that candidate number 2 has smaller error.
What if we were to have compared the candidates while we sequentially evaluated each resample?
One candidate shows superiority when 4 resamples have been evaluated.
One version of racing uses a mixed model ANOVA to construct one-sided confidence intervals for each candidate versus the current best.
Any candidates whose bound does not include zero are discarded. Here is an animation.
The resamples are analyzed in a random order (so set the seed).
Kuhn (2014) has examples and simulations to show that the method works.
The finetune package has functions tune_race_anova() and tune_race_win_loss().
The syntax and helper functions are extremely similar to those shown for tune_grid().
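A hedged sketch of the racing call that produced lgbm_race_res (the grid size of 50 is inferred from the “171 out of 500 models” note below, and the seed from the later exercise):
library(finetune)

set.seed(9)
lgbm_race_res <-
  lgbm_wflow %>%
  tune_race_anova(
    resamples = hotel_rs,
    grid = 50,
    param_info = lgbm_param,
    metrics = reg_metrics
  )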
show_best(lgbm_race_res, metric = "mae")
#> # A tibble: 2 × 11
#> trees min_n learn_rate `agent hash` `company hash` .metric .estimator mean n std_err .config
#> <int> <int> <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 1347 5 0.0655 66 26 mae standard 9.64 10 0.173 Preprocessor34_Model1
#> 2 980 8 0.0429 17 135 mae standard 9.76 10 0.164 Preprocessor25_Model1
Only 171 models were fit (out of 500).
select_best() never considers candidate models that did not get to the end of the race.
There is a helper function to see how candidate models were removed from consideration.
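One such helper is plot_race() from the finetune package:
plot_race(lgbm_race_res)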
Run tune_race_anova() with a different seed.
10:00
Now repeat the process using tune_race_anova(), but this time focus on the performance metric rsq (use set.seed(9) as before).
What can you conclude when you compare results from mae and rsq?
15:00
We used feature hashing to generate a smaller set of indicator columns to deal with the large number of levels for the agent and company predictors.
Tree-based models (and a few others) don’t require indicators for categorical predictors. They can split on these variables as-is.
We’ll keep all categorical predictors as factors and focus on optimizing additional boosting parameters.
lgbm_spec <-
boost_tree(trees = 1000, learn_rate = tune(), min_n = tune(),
tree_depth = tune(), loss_reduction = tune(),
stop_iter = tune()) %>%
set_mode("regression") %>%
set_engine("lightgbm", num_threads = 1)
lgbm_wflow <- workflow(avg_price_per_room ~ ., lgbm_spec)
lgbm_param <-
lgbm_wflow %>%
extract_parameter_set_dials() %>%
update(learn_rate = learn_rate(c(-5, -1)))
Instead of pre-defining a grid of candidate points, we can model our current results to predict what the next candidate point should be.
Suppose that we are only tuning the learning rate in our boosted tree.
We could do something like:
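A hypothetical sketch of that idea (the object names and the linear-model form are illustrations only):
# model resampled MAE as a function of the learning rate
mae_results <- collect_metrics(lgbm_res) %>% filter(.metric == "mae")
mae_meta_fit <- lm(mean ~ log10(learn_rate), data = mae_results)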
and use this to predict and rank new learning rate candidates.
A linear model probably isn’t the best choice though (more in a minute).
To illustrate the process, we resampled a large grid of learning rate values for our data to show what the relationship is between MAE and learning rate.
Now suppose that we used a grid of three points in the parameter range for learning rateβ¦
We can make a βmeta-modelβ with a small set of historical performance results.
Gaussian Processes (GP) models are a good choice to model performance.
\[\operatorname{cov}(\boldsymbol{x}_i, \boldsymbol{x}_j) = \exp\left(-\frac{1}{2}|\boldsymbol{x}_i - \boldsymbol{x}_j|^2\right) + \sigma^2_{ij}\]
The GP model can take candidate tuning parameter combinations as inputs and make predictions for performance (e.g. MAE)
The variance is mostly driven by spatial variability (the previous equation).
The predicted variance is zero at locations of actual data points and becomes very high when far away from any observed data.
Your GP makes predictions on two new candidate tuning parameters.
We want to minimize MAE.
Which should we choose?
05:00
This isn’t a very good fit but we can still use it.
How can we use the outputs to choose the next point to measure?
Acquisition functions take the predicted mean and variance and use them to balance:
Exploration focuses on the variance, exploitation is about the mean.
Weβll use an acquisition function to select a new candidate.
The most popular method appears to be expected improvement (EI) above the current best results.
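For a minimization problem, with current best value \(f^*\) and GP predictive mean \(\mu(\boldsymbol{x})\) and standard deviation \(\sigma(\boldsymbol{x})\), the usual closed form is:
\[\begin{align} z &= \frac{f^* - \mu(\boldsymbol{x})}{\sigma(\boldsymbol{x})} \notag \\ EI(\boldsymbol{x}) &= \left(f^* - \mu(\boldsymbol{x})\right)\,\Phi(z) + \sigma(\boldsymbol{x})\,\phi(z) \end{align}\]
where \(\Phi\) and \(\phi\) are the standard normal CDF and density.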
We would probably pick the point with the largest EI as the next point.
(There are other functions beyond EI.)
Once we pick the candidate point, we measure performance for it (e.g. resampling).
Another GP is fit, EI is recomputed, and so on.
We stop when we have completed the allowed number of iterations or if we don’t see any improvement after a pre-set number of attempts.
We’ll use a function called tune_bayes() that has very similar syntax to tune_grid().
It has an additional initial argument for the initial set of performance estimates and parameter combinations for the GP model.
initial can be the results of another tune_*() function or an integer (in which case tune_grid() is used under the hood to make such an initial set of results).
We’ll run the optimization more than once, so let’s make an initial grid of results to serve as the substrate for the BO.
I suggest at least the number of tuning parameters plus two as the size of the initial grid for BO.
reg_metrics <- metric_set(mae, rsq)
set.seed(12)
init_res <-
lgbm_wflow %>%
tune_grid(
resamples = hotel_rs,
grid = nrow(lgbm_param) + 2,
param_info = lgbm_param,
metrics = reg_metrics
)
show_best(init_res, metric = "mae") %>% select(-.metric, -.estimator)
#> # A tibble: 5 × 9
#> min_n tree_depth learn_rate loss_reduction stop_iter mean n std_err .config
#> <int> <int> <dbl> <dbl> <int> <dbl> <int> <dbl> <chr>
#> 1 33 12 0.0215 8.25e- 9 5 10.1 10 0.182 Preprocessor1_Model6
#> 2 14 10 0.1 3.16e+ 1 8 10.2 10 0.192 Preprocessor1_Model3
#> 3 8 3 0.00464 1 e-10 11 14.4 10 0.305 Preprocessor1_Model2
#> 4 40 1 0.001 4.64e- 3 14 36.1 10 0.408 Preprocessor1_Model7
#> 5 27 15 0.000215 6.81e- 7 20 44.5 10 0.371 Preprocessor1_Model5
ctrl_bo <- control_bayes(seed = 15, verbose_iter = TRUE) # <- for demonstration
set.seed(15)
lgbm_bayes_res <-
lgbm_wflow %>%
tune_bayes(
resamples = hotel_rs,
initial = init_res, # <- initial results
iter = 20,
param_info = lgbm_param,
control = ctrl_bo,
metrics = reg_metrics
)
#> Optimizing mae using the expected improvement
#>
#> ── Iteration 1 ─────────────────────────────────────────────────────
#>
#> i Current best: mae=10.05 (@iter 0)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=3, tree_depth=9, learn_rate=0.089, loss_reduction=2.73e-10, stop_iter=4
#> i Estimating performance
#> ✓ Estimating performance
#> ♥ Newest results: mae=9.651 (+/-0.159)
#>
#> ── Iteration 2 ─────────────────────────────────────────────────────
#>
#> i Current best: mae=9.651 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=2, tree_depth=13, learn_rate=0.0196, loss_reduction=0.000897, stop_iter=19
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.971 (+/-0.172)
#>
#> ── Iteration 3 ─────────────────────────────────────────────────────
#>
#> i Current best: mae=9.651 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=24, tree_depth=15, learn_rate=0.0554, loss_reduction=0.00201, stop_iter=16
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.73 (+/-0.149)
#>
#> ── Iteration 4 ─────────────────────────────────────────────────────
#>
#> i Current best: mae=9.651 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=37, tree_depth=1, learn_rate=0.0117, loss_reduction=1.16e-06, stop_iter=5
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=17.91 (+/-0.308)
#>
#> ── Iteration 5 ─────────────────────────────────────────────────────
#>
#> i Current best: mae=9.651 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=2, tree_depth=14, learn_rate=0.0253, loss_reduction=3.1e-05, stop_iter=4
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.852 (+/-0.161)
#>
#> ── Iteration 6 ─────────────────────────────────────────────────────
#>
#> i Current best: mae=9.651 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=6, tree_depth=7, learn_rate=0.0323, loss_reduction=2.04e-10, stop_iter=6
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.816 (+/-0.154)
#>
#> ── Iteration 7 ─────────────────────────────────────────────────────
#>
#> i Current best: mae=9.651 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=17, tree_depth=12, learn_rate=0.0359, loss_reduction=1.7e-10, stop_iter=12
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.706 (+/-0.159)
#>
#> ── Iteration 8 ─────────────────────────────────────────────────────
#>
#> i Current best: mae=9.651 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=39, tree_depth=15, learn_rate=0.029, loss_reduction=9.27e-08, stop_iter=6
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.999 (+/-0.192)
#>
#> ── Iteration 9 ─────────────────────────────────────────────────────
#>
#> i Current best: mae=9.651 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=26, tree_depth=15, learn_rate=0.0249, loss_reduction=6.75e-08, stop_iter=13
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.873 (+/-0.18)
#>
#> ── Iteration 10 ────────────────────────────────────────────────────
#>
#> i Current best: mae=9.651 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=3, tree_depth=9, learn_rate=0.0245, loss_reduction=5.44e-09, stop_iter=4
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.863 (+/-0.18)
#>
#> ── Iteration 11 ────────────────────────────────────────────────────
#>
#> i Current best: mae=9.651 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=5, tree_depth=12, learn_rate=0.0593, loss_reduction=1.16e-09, stop_iter=6
#> i Estimating performance
#> ✓ Estimating performance
#> ♥ Newest results: mae=9.645 (+/-0.171)
#>
#> ── Iteration 12 ────────────────────────────────────────────────────
#>
#> i Current best: mae=9.645 (@iter 11)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=23, tree_depth=4, learn_rate=0.0739, loss_reduction=3.99e-10, stop_iter=4
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=10.09 (+/-0.193)
#>
#> ── Iteration 13 ────────────────────────────────────────────────────
#>
#> i Current best: mae=9.645 (@iter 11)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=4, tree_depth=2, learn_rate=0.0439, loss_reduction=0.00914, stop_iter=5
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=11.59 (+/-0.218)
#>
#> ── Iteration 14 ────────────────────────────────────────────────────
#>
#> i Current best: mae=9.645 (@iter 11)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=22, tree_depth=15, learn_rate=0.00489, loss_reduction=3.28e-05, stop_iter=17
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=11.22 (+/-0.231)
#>
#> ── Iteration 15 ────────────────────────────────────────────────────
#>
#> i Current best: mae=9.645 (@iter 11)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=27, tree_depth=8, learn_rate=0.0524, loss_reduction=4.24e-08, stop_iter=11
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.801 (+/-0.184)
#>
#> ── Iteration 16 ────────────────────────────────────────────────────
#>
#> i Current best: mae=9.645 (@iter 11)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=3, tree_depth=13, learn_rate=0.00837, loss_reduction=4.21e-10, stop_iter=9
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=10.45 (+/-0.195)
#>
#> ── Iteration 17 ────────────────────────────────────────────────────
#>
#> i Current best: mae=9.645 (@iter 11)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=33, tree_depth=13, learn_rate=0.0767, loss_reduction=1.36e-10, stop_iter=4
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.921 (+/-0.181)
#>
#> ── Iteration 18 ────────────────────────────────────────────────────
#>
#> i Current best: mae=9.645 (@iter 11)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=3, tree_depth=7, learn_rate=0.0711, loss_reduction=3.14e-07, stop_iter=13
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.664 (+/-0.161)
#>
#> ── Iteration 19 ────────────────────────────────────────────────────
#>
#> i Current best: mae=9.645 (@iter 11)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=3, tree_depth=9, learn_rate=0.0613, loss_reduction=2.02, stop_iter=8
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.772 (+/-0.165)
#>
#> ── Iteration 20 ────────────────────────────────────────────────────
#>
#> i Current best: mae=9.645 (@iter 11)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=5, tree_depth=6, learn_rate=0.0896, loss_reduction=4.87e-09, stop_iter=9
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.721 (+/-0.146)
show_best(lgbm_bayes_res, metric = "mae") %>% select(-.metric, -.estimator)
#> # A tibble: 5 × 10
#> min_n tree_depth learn_rate loss_reduction stop_iter mean n std_err .config .iter
#> <int> <int> <dbl> <dbl> <int> <dbl> <int> <dbl> <chr> <int>
#> 1 5 12 0.0593 1.16e- 9 6 9.64 10 0.171 Iter11 11
#> 2 3 9 0.0890 2.73e-10 4 9.65 10 0.159 Iter1 1
#> 3 3 7 0.0711 3.14e- 7 13 9.66 10 0.161 Iter18 18
#> 4 17 12 0.0359 1.70e-10 12 9.71 10 0.159 Iter7 7
#> 5 5 6 0.0896 4.87e- 9 9 9.72 10 0.146 Iter20 20
Let’s try a different acquisition function: conf_bound(kappa).
We’ll use the objective argument to set it.
Choose your own kappa value:
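A hedged sketch (the result object name is an assumption; the other arguments mirror the earlier tune_bayes() call):
set.seed(15)
lgbm_conf_res <-
  lgbm_wflow %>%
  tune_bayes(
    resamples = hotel_rs,
    initial = init_res,
    iter = 20,
    param_info = lgbm_param,
    objective = conf_bound(kappa = 0.1),  # <- try your own kappa value
    control = ctrl_bo,
    metrics = reg_metrics
  )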
Bonus points: before the optimization is done, press <esc> and see what happens.
15:00
Stopping tune_bayes() will return the current results.
Parallel processing can still be used to more efficiently measure each candidate point.
There are a lot of other iterative methods that you can use.
The finetune package also has functions for simulated annealing search.
Let’s say that we’ve tried a lot of different models and we like our lightgbm model the most.
What do we do now?
We can take the results of the Bayesian optimization and accept the best results:
best_param <- select_best(lgbm_bayes_res, metric = "mae")
final_wflow <-
lgbm_wflow %>%
finalize_workflow(best_param)
final_wflow
#> ══ Workflow ══════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: boost_tree()
#>
#> ── Preprocessor ──────────────────────────────────────────────────────
#> avg_price_per_room ~ .
#>
#> ── Model ─────────────────────────────────────────────────────────────
#> Boosted Tree Model Specification (regression)
#>
#> Main Arguments:
#> trees = 1000
#> min_n = 5
#> tree_depth = 12
#> learn_rate = 0.0593110845668806
#> loss_reduction = 1.16244203146145e-09
#> stop_iter = 6
#>
#> Engine-Specific Arguments:
#> num_threads = 1
#>
#> Computational engine: lightgbm
We can use individual functions:
final_fit <- final_wflow %>% fit(data = hotel_train)
# then predict() or augment()
# then compute metrics
Remember that there is also a convenience function to do all of this:
set.seed(3893)
final_res <- final_wflow %>% last_fit(hotel_split, metrics = reg_metrics)
final_res
#> # Resampling results
#> # Manual resampling
#> # A tibble: 1 × 6
#> splits id .metrics .notes .predictions .workflow
#> <list> <chr> <list> <list> <list> <list>
#> 1 <split [3749/1251]> train/test split <tibble [2 × 4]> <tibble [0 × 3]> <tibble [1,251 × 4]> <workflow>
Test set performance:
Recall that resampling predicted the MAE to be 9.645.
We made it!
inspired and adapted from https://tidymodels.org and https://workshops.tidymodels.org