Part II
Also, you should install a recent version of the dials package (the session below uses version 1.4.0). To check this, you can run:
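One minimal way to check, assuming the version in the table below is the target:
packageVersion("dials")   # should report 1.4.0 or later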
R version 4.5.0 (2025-04-11), Quarto (1.7.29)
package | version |
---|---|
bonsai | 0.3.2 |
broom | 1.0.8 |
Cubist | 0.5.0 |
dials | 1.4.0 |
doParallel | 1.0.17 |
dplyr | 1.1.4 |
earth | 5.3.4 |
embed | 1.1.5 |
finetune | 1.2.0 |
Formula | 1.2-5 |
ggplot2 | 3.5.2 |
lattice | 0.22-6 |
lightgbm | 4.6.0 |
lme4 | 1.1-37 |
modeldata | 1.4.0 |
parallelly | 1.43.0 |
parsnip | 1.3.1 |
plotmo | 3.6.4 |
plotrix | 3.8-4 |
plumber | 1.3.0 |
probably | 1.0.3 |
purrr | 1.0.4 |
recipes | 1.3.0 |
rsample | 1.3.0 |
rules | 1.0.2 |
scales | 1.4.0 |
splines2 | 0.5.4 |
stacks | 1.1.0 |
text2vec | 0.6.4 |
textrecipes | 1.1.0 |
tibble | 3.2.1 |
tidymodels | 1.3.0 |
tidyr | 1.3.1 |
tune | 1.3.0 |
vetiver | 0.2.5 |
workflows | 1.2.0 |
workflowsets | 1.1.0 |
yardstick | 1.3.2 |
We’ll use data on hotels to predict the cost of a room.
The data are in the modeldata package. We’ll sample down the data and refactor some columns:
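A hedged sketch of that preparation (the seed, the sample size of 5,000, and which columns are refactored are assumptions; the 3,749/1,251 split shown later implies 5,000 rows):
library(tidymodels)
library(modeldata)

set.seed(295)
hotel_rates <-
  hotel_rates %>%
  slice_sample(n = 5000) %>%
  mutate(
    company = factor(as.character(company)),
    country = factor(as.character(country)),
    agent   = factor(as.character(agent))
  )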
names(hotel_rates)
#> [1] "avg_price_per_room" "lead_time"
#> [3] "stays_in_weekend_nights" "stays_in_week_nights"
#> [5] "adults" "children"
#> [7] "babies" "meal"
#> [9] "country" "market_segment"
#> [11] "distribution_channel" "is_repeated_guest"
#> [13] "previous_cancellations" "previous_bookings_not_canceled"
#> [15] "reserved_room_type" "assigned_room_type"
#> [17] "booking_changes" "agent"
#> [19] "company" "days_in_waiting_list"
#> [21] "customer_type" "required_car_parking_spaces"
#> [23] "total_of_special_requests" "arrival_date_num"
#> [25] "near_christmas" "near_new_years"
#> [27] "historical_adr"
Let’s split the data into a training set (75%) and testing set (25%) using stratification:
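A hedged sketch (the seed is an assumption; the object names match those used later in this section):
set.seed(4028)
hotel_split <- initial_split(hotel_rates, prop = 0.75, strata = avg_price_per_room)
hotel_train <- training(hotel_split)
hotel_test  <- testing(hotel_split)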
Let’s take some time and investigate the training data. The outcome is avg_price_per_room.
Are there any interesting characteristics of the data?
10:00
We might want to modify our predictor columns for a few reasons:
The first two reasons are fairly predictable (next page).
The last one depends on your modeling problem.
Think of a feature as some representation of a predictor that will be used in a model.
Example representations:
There are a lot of examples in Feature Engineering and Selection (FES) and Applied Machine Learning for Tabular Data (aml4td).
How can we represent date columns for our model?
When we use a date column in its native format, most models in R convert it to an integer.
We can re-engineer it as:
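As a hedged, toy illustration (not part of the hotel recipe; the column name, data, and holiday choices are assumptions), recipes can pull out features such as the day of the week, month, year, and holiday indicators:
library(recipes)
# a small made-up data set with a date column
date_df <- tibble::tibble(arrival = as.Date("2016-12-20") + 0:13, y = rnorm(14))
recipe(y ~ arrival, data = date_df) %>%
  step_date(arrival, features = c("dow", "month", "year")) %>%
  step_holiday(arrival, holidays = c("ChristmasDay", "NewYearsDay")) %>%
  prep() %>%
  bake(new_data = NULL)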
Data preprocessing steps allow your model to fit.
Feature engineering steps help the model do the least work to predict the outcome as well as possible.
The recipes package can handle both!
We’ll go from here and create a set of resamples to use for model assessments.
We’ll use simple 10-fold cross-validation (stratified sampling):
set.seed(472)
hotel_rs <- vfold_cv(hotel_train, strata = avg_price_per_room)
hotel_rs
#> # 10-fold cross-validation using stratification
#> # A tibble: 10 × 2
#> splits id
#> <list> <chr>
#> 1 <split [3372/377]> Fold01
#> 2 <split [3373/376]> Fold02
#> 3 <split [3373/376]> Fold03
#> 4 <split [3373/376]> Fold04
#> 5 <split [3373/376]> Fold05
#> 6 <split [3374/375]> Fold06
#> 7 <split [3375/374]> Fold07
#> 8 <split [3376/373]> Fold08
#> 9 <split [3376/373]> Fold09
#> 10 <split [3376/373]> Fold10
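The recipe itself starts from a formula; a minimal sketch (the object name hotel_rec is taken from the summary() call below):
hotel_rec <- recipe(avg_price_per_room ~ ., data = hotel_train)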
The recipe() function assigns columns to the roles of “outcome” or “predictor” using the formula.
summary(hotel_rec)
#> # A tibble: 27 × 4
#> variable type role source
#> <chr> <list> <chr> <chr>
#> 1 lead_time <chr [2]> predictor original
#> 2 stays_in_weekend_nights <chr [2]> predictor original
#> 3 stays_in_week_nights <chr [2]> predictor original
#> 4 adults <chr [2]> predictor original
#> 5 children <chr [2]> predictor original
#> 6 babies <chr [2]> predictor original
#> 7 meal <chr [3]> predictor original
#> 8 country <chr [3]> predictor original
#> 9 market_segment <chr [3]> predictor original
#> 10 distribution_channel <chr [3]> predictor original
#> # ℹ 17 more rows
The type column contains information on the variables.
What do you think are in the type vectors for the lead_time and country columns?
03:00
For any factor or character predictors, make binary indicators.
There are many recipe steps that can convert categorical predictors to numeric columns.
step_dummy() records the levels of the categorical predictors in the training set.
In case there is a factor level that was never observed in the training data (resulting in a column of all 0s), we can delete any zero-variance predictors that have a single unique value.
We can also center and scale the numeric predictors.
The recipe will use the training set to estimate the means and standard deviations of the data.
To deal with highly correlated predictors, find the minimum set of predictor columns that make the pairwise correlations less than the threshold.
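Putting these preprocessing ideas together, a hedged sketch of such a recipe (the object name and the correlation threshold are assumptions):
hotel_norm_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9)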
PCA feature extraction…
A fancy machine learning supervised dimension reduction technique called UMAP…
Nonlinear transforms like natural splines, and so on!
Create a recipe() for the hotel data to preprocess the predictors (for example, transforming lead_time).
10:00
We’ll compute two measures: mean absolute error and the coefficient of determination (a.k.a. \(R^2\)).
\[\begin{align} MAE &= \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i| \notag \\ R^2 &= cor(y_i, \hat{y}_i)^2 \end{align}\]
The focus will be on MAE for parameter optimization. We’ll use a metric set to compute these:
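The metric set is the reg_metrics object used with fit_resamples() below (its definition also appears later with the Bayesian optimization code):
reg_metrics <- metric_set(mae, rsq)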
set.seed(9)
hotel_lm_wflow <-
workflow() %>%
add_recipe(hotel_indicators) %>%
add_model(linear_reg())
ctrl <- control_resamples(save_pred = TRUE)
hotel_lm_res <-
hotel_lm_wflow %>%
fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)
collect_metrics(hotel_lm_res)
#> # A tibble: 2 × 6
#> .metric .estimator mean n std_err .config
#> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 mae standard 16.6 10 0.214 Preprocessor1_Model1
#> 2 rsq standard 0.884 10 0.00339 Preprocessor1_Model1
Use fit_resamples() to fit your workflow with a recipe.
Collect the predictions from the results.
05:00
# Since we used `save_pred = TRUE`
lm_cv_pred <- collect_predictions(hotel_lm_res)
lm_cv_pred %>% print(n = 7)
#> # A tibble: 3,749 × 5
#> .pred id .row avg_price_per_room .config
#> <dbl> <chr> <int> <dbl> <chr>
#> 1 75.1 Fold01 20 40 Preprocessor1_Model1
#> 2 49.3 Fold01 28 54 Preprocessor1_Model1
#> 3 64.9 Fold01 45 50 Preprocessor1_Model1
#> 4 52.8 Fold01 49 42 Preprocessor1_Model1
#> 5 48.6 Fold01 61 49 Preprocessor1_Model1
#> 6 29.8 Fold01 66 40 Preprocessor1_Model1
#> 7 36.9 Fold01 88 49 Preprocessor1_Model1
#> # ℹ 3,742 more rows
There are 98 unique agent values and 100 unique companies in our training set. How can we include this information in our model?
We could:
make the full set of indicator variables
lump agents and companies that rarely occur into an “other” group
use feature hashing to create a smaller set of indicator variables
use effect encoding to replace the agent and company columns with the estimated effect of that predictor (in the extra materials)
There is a recipe step that will redefine factor levels based on their frequency in the training set:
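A hedged sketch of that recipe (step_other() is the relevant step; the threshold value and the surrounding steps are assumptions that mirror the hashing recipe shown later; the object name hotel_other_rec is used below):
hotel_other_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  step_YeoJohnson(lead_time) %>%
  step_other(agent, threshold = 0.001) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_spline_natural(arrival_date_num, deg_free = 10)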
Using this code, 34 agents (out of 98) were collapsed into “other” based on the training set.
We could try to optimize the threshold for collapsing (we will get to model tuning later).
hotel_other_wflow <-
hotel_lm_wflow %>%
update_recipe(hotel_other_rec)
hotel_other_res <-
hotel_other_wflow %>%
fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)
collect_metrics(hotel_other_res)
#> # A tibble: 2 × 6
#> .metric .estimator mean n std_err .config
#> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 mae standard 16.7 10 0.213 Preprocessor1_Model1
#> 2 rsq standard 0.884 10 0.00341 Preprocessor1_Model1
About the same MAE and much faster to complete.
Now let’s look at a more sophisticated tool called feature hashing.
Between agent and company, simple dummy variables would create 198 new columns (that are mostly zeros).
Another option is to have a binary indicator that combines some levels of these variables.
Feature hashing (for more see FES, SMLTAR, TMwR, and aml4td):
Suppose we want to use 32 indicator variables for agent.
For an agent with the value “Max_Kuhn”, a hashing function converts it to an integer (say 210397726).
To assign it to one of the 32 columns, we would use modular arithmetic to assign it to a column:
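The modular arithmetic itself, as a quick check in R:
210397726 %% 32
#> [1] 30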
Hash functions are meant to emulate randomness.
The textrecipes package has a step that can be added to the recipe:
library(textrecipes)
hash_rec <-
recipe(avg_price_per_room ~ ., data = hotel_train) %>%
step_YeoJohnson(lead_time) %>%
# Defaults to 32 signed indicator columns
step_dummy_hash(agent) %>%
step_dummy_hash(company) %>%
# Regular indicators for the others
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_spline_natural(arrival_date_num, deg_free = 10)
hotel_hash_wflow <-
hotel_lm_wflow %>%
update_recipe(hash_rec)
hotel_hash_res <-
hotel_hash_wflow %>%
fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)
collect_metrics(hotel_hash_res)
#> # A tibble: 2 × 6
#> .metric .estimator mean n std_err .config
#> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 mae standard 16.7 10 0.239 Preprocessor1_Model1
#> 2 rsq standard 0.884 10 0.00324 Preprocessor1_Model1
About the same performance but now we can handle new values.
The recipe object (e.g. hash_rec) can be estimated manually with a function called prep(). It is analogous to fit(). See TMwR section 16.4.
Another function (bake()) is analogous to predict(), and gives you the processed data back.
The tidy() function can be used to get specific results from the recipe.
Once fit() is called on a workflow, changing the model does not re-fit the recipe.
Some recipe steps can be skipped when using predict().
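For example, a hedged sketch of working with the recipe manually (assuming hash_rec and hotel_train from above):
hash_rec_prepped <- prep(hash_rec)                        # estimate the steps from the training set
hash_rec_baked <- bake(hash_rec_prepped, new_data = NULL) # return the processed training set
tidy(hash_rec_prepped, number = 1)                        # details for the first step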
library(textrecipes)
hash_rec <-
recipe(avg_price_per_room ~ ., data = hotel_train) %>%
step_YeoJohnson(lead_time) %>%
# Defaults to 32 signed indicator columns
step_dummy_hash(agent) %>%
step_dummy_hash(company) %>%
# Regular indicators for the others
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors())
Some model or preprocessing parameters cannot be estimated directly from the data.
Some examples:
Sigmoidal functions, ReLu, etc.
Yes, it is a tuning parameter. ✅
Yes, it is a tuning parameter. ✅
Hmmmm, probably not. These are based on prior belief. ❌
Nope. It is not. ❌
With tidymodels, you can mark the parameters that you want to optimize with a value of tune().
The function itself just returns… itself:
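For instance, calling it at the console simply echoes the call:
tune()
#> tune()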
For example…
Our new recipe is:
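A hedged sketch of that recipe, mirroring the exercise code that appears later in this section:
hash_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  step_YeoJohnson(lead_time) %>%
  # tune the number of hashing columns for each predictor
  step_dummy_hash(agent, num_terms = tune("agent hash")) %>%
  step_dummy_hash(company, num_terms = tune("company hash")) %>%
  step_zv(all_predictors())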
We will be using a tree-based model in a minute; these models do not require step_dummy().
These are popular ensemble methods that build a sequence of tree models.
Each tree uses the results of the previous tree to better predict samples, especially those that have been poorly predicted.
Each tree in the ensemble is saved and new samples are predicted using a weighted average of the votes of each tree in the ensemble.
We’ll focus on the popular lightgbm implementation.
Some possible parameters:
mtry: The number of predictors randomly sampled at each split (in \([1, ncol(x)]\) or \((0, 1]\)).
trees: The number of trees (\([1, \infty]\), but usually up to thousands).
min_n: The number of samples needed to further split (\([1, n]\)).
learn_rate: The rate that each tree adapts from previous iterations (\((0, \infty]\), usual maximum is 0.1).
stop_iter: The number of iterations of boosting where no improvement was shown before stopping (\([1, trees]\)).
TBH it is usually not difficult to optimize these models.
Often, there are multiple candidate tuning parameter combinations that have very good results.
To demonstrate simple concepts, we’ll look at optimizing the number of trees in the ensemble (between 1 and 100) and the learning rate (\(10^{-5}\) to \(10^{-1}\)).
We’ll need to load the bonsai package. This has the information needed to use lightgbm.
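A hedged sketch of the model specification and workflow (the engine option and the workflow composition are assumptions consistent with the tuning code shown later; hash_rec is the hashing recipe with tuned num_terms):
library(bonsai)

lgbm_spec <-
  boost_tree(trees = tune(), learn_rate = tune()) %>%
  set_mode("regression") %>%
  set_engine("lightgbm", num_threads = 1)

lgbm_wflow <- workflow(hash_rec, lgbm_spec)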
The main two strategies for optimization are:
Grid search, which tests a pre-defined set of candidate values
Iterative search, which suggests/estimates new values of candidate parameters to evaluate
A small grid of points trying to minimize the error via learning rate:
In reality we would probably sample the space more densely:
We could start with a few points and search the space:
The tidymodels framework provides pre-defined information on tuning parameters (such as their type, range, transformations, etc).
The extract_parameter_set_dials() function extracts these tuning parameters and the info.
Create your grid manually or automatically.
The grid_*() functions can make a grid.
Space-filling designs (SFD) attempt to cover the parameter space without redundant candidates (recommended).
A parameter set can be updated (e.g. to change the ranges).
set.seed(12)
grid <-
lgbm_wflow %>%
extract_parameter_set_dials() %>%
grid_space_filling(size = 25)
grid
#> # A tibble: 25 × 4
#> trees learn_rate `agent hash` `company hash`
#> <int> <dbl> <int> <int>
#> 1 1 7.50e- 6 574 574
#> 2 84 1.78e- 5 2048 2298
#> 3 167 5.62e-10 1824 912
#> 4 250 4.22e- 5 3250 512
#> 5 334 1.78e- 8 512 2896
#> 6 417 1.33e- 3 322 1625
#> 7 500 1 e- 1 1448 1149
#> 8 584 1 e- 7 1290 256
#> 9 667 2.37e-10 456 724
#> 10 750 1.78e- 2 645 322
#> # ℹ 15 more rows
Create a grid for our tunable workflow.
Try creating a regular grid.
10:00
set.seed(12)
grid <-
lgbm_wflow %>%
extract_parameter_set_dials() %>%
grid_regular(levels = 4)
grid
#> # A tibble: 256 × 4
#> trees learn_rate `agent hash` `company hash`
#> <int> <dbl> <int> <int>
#> 1 1 0.0000000001 256 256
#> 2 667 0.0000000001 256 256
#> 3 1333 0.0000000001 256 256
#> 4 2000 0.0000000001 256 256
#> 5 1 0.0000001 256 256
#> 6 667 0.0000001 256 256
#> 7 1333 0.0000001 256 256
#> 8 2000 0.0000001 256 256
#> 9 1 0.0001 256 256
#> 10 667 0.0001 256 256
#> # ℹ 246 more rows
What advantage would a regular grid have?
lgbm_param <-
lgbm_wflow %>%
extract_parameter_set_dials() %>%
update(trees = trees(c(1L, 100L)),
learn_rate = learn_rate(c(-5, -1)))
set.seed(712)
grid <-
lgbm_param %>%
grid_space_filling(size = 25)
grid
#> # A tibble: 25 × 4
#> trees learn_rate `agent hash` `company hash`
#> <int> <dbl> <int> <int>
#> 1 1 0.00147 574 574
#> 2 5 0.00215 2048 2298
#> 3 9 0.0000215 1824 912
#> 4 13 0.00316 3250 512
#> 5 17 0.0001 512 2896
#> 6 21 0.0147 322 1625
#> 7 25 0.1 1448 1149
#> 8 29 0.000215 1290 256
#> 9 34 0.0000147 456 724
#> 10 38 0.0464 645 322
#> # ℹ 15 more rows
Note that the learning rates are uniform on the log-10 scale and this shows 2 of 4 dimensions.
The tune_*() functions are used to tune models.
Let’s take our previous model and tune more parameters:
lgbm_spec <-
boost_tree(trees = tune(), learn_rate = tune(), min_n = tune()) %>%
set_mode("regression") %>%
set_engine("lightgbm", num_threads = 1)
lgbm_wflow <- workflow(hash_rec, lgbm_spec)
# Update the feature hash ranges (log-2 units)
lgbm_param <-
lgbm_wflow %>%
extract_parameter_set_dials() %>%
update(`agent hash` = num_hash(c(3, 8)),
`company hash` = num_hash(c(3, 8)))
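The tuning call that produced lgbm_res is not shown above; a hedged sketch (the grid size and control options are assumptions inferred from the 25 candidates and the saved predictions in the results):
lgbm_res <-
  lgbm_wflow %>%
  tune_grid(
    resamples = hotel_rs,
    grid = 25,
    param_info = lgbm_param,
    control = control_grid(save_pred = TRUE),
    metrics = reg_metrics
  )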
lgbm_res
#> # Tuning results
#> # 10-fold cross-validation using stratification
#> # A tibble: 10 × 5
#> splits id .metrics .notes .predictions
#> <list> <chr> <list> <list> <list>
#> 1 <split [3372/377]> Fold01 <tibble [50 × 9]> <tibble [0 × 3]> <tibble [9,425 × 9]>
#> 2 <split [3373/376]> Fold02 <tibble [50 × 9]> <tibble [0 × 3]> <tibble [9,400 × 9]>
#> 3 <split [3373/376]> Fold03 <tibble [50 × 9]> <tibble [0 × 3]> <tibble [9,400 × 9]>
#> 4 <split [3373/376]> Fold04 <tibble [50 × 9]> <tibble [0 × 3]> <tibble [9,400 × 9]>
#> 5 <split [3373/376]> Fold05 <tibble [50 × 9]> <tibble [0 × 3]> <tibble [9,400 × 9]>
#> 6 <split [3374/375]> Fold06 <tibble [50 × 9]> <tibble [0 × 3]> <tibble [9,375 × 9]>
#> 7 <split [3375/374]> Fold07 <tibble [50 × 9]> <tibble [0 × 3]> <tibble [9,350 × 9]>
#> 8 <split [3376/373]> Fold08 <tibble [50 × 9]> <tibble [0 × 3]> <tibble [9,325 × 9]>
#> 9 <split [3376/373]> Fold09 <tibble [50 × 9]> <tibble [0 × 3]> <tibble [9,325 × 9]>
#> 10 <split [3376/373]> Fold10 <tibble [50 × 9]> <tibble [0 × 3]> <tibble [9,325 × 9]>
collect_metrics(lgbm_res)
#> # A tibble: 50 × 11
#> trees min_n learn_rate `agent hash` `company hash` .metric .estimator mean n std_err .config
#> <int> <int> <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 1 17 0.00000000133 93 25 mae standard 53.2 10 0.427 Preprocessor01_Model1
#> 2 1 17 0.00000000133 93 25 rsq standard 0.810 10 0.00746 Preprocessor01_Model1
#> 3 84 27 0.0178 80 69 mae standard 20.0 10 0.293 Preprocessor02_Model1
#> 4 84 27 0.0178 80 69 rsq standard 0.897 10 0.00423 Preprocessor02_Model1
#> 5 167 19 0.00000133 12 143 mae standard 53.2 10 0.426 Preprocessor03_Model1
#> 6 167 19 0.00000133 12 143 rsq standard 0.811 10 0.00698 Preprocessor03_Model1
#> 7 250 2 0.000237 33 33 mae standard 50.6 10 0.409 Preprocessor04_Model1
#> 8 250 2 0.000237 33 33 rsq standard 0.815 10 0.00800 Preprocessor04_Model1
#> 9 334 36 0.0000000178 19 21 mae standard 53.2 10 0.427 Preprocessor05_Model1
#> 10 334 36 0.0000000178 19 21 rsq standard 0.807 10 0.00692 Preprocessor05_Model1
#> # ℹ 40 more rows
collect_metrics(lgbm_res, summarize = FALSE)
#> # A tibble: 500 × 10
#> id trees min_n learn_rate `agent hash` `company hash` .metric .estimator .estimate .config
#> <chr> <int> <int> <dbl> <int> <int> <chr> <chr> <dbl> <chr>
#> 1 Fold01 1 17 0.00000000133 93 25 mae standard 51.8 Preprocessor01_Model1
#> 2 Fold01 1 17 0.00000000133 93 25 rsq standard 0.821 Preprocessor01_Model1
#> 3 Fold02 1 17 0.00000000133 93 25 mae standard 52.1 Preprocessor01_Model1
#> 4 Fold02 1 17 0.00000000133 93 25 rsq standard 0.810 Preprocessor01_Model1
#> 5 Fold03 1 17 0.00000000133 93 25 mae standard 52.2 Preprocessor01_Model1
#> 6 Fold03 1 17 0.00000000133 93 25 rsq standard 0.783 Preprocessor01_Model1
#> 7 Fold04 1 17 0.00000000133 93 25 mae standard 51.7 Preprocessor01_Model1
#> 8 Fold04 1 17 0.00000000133 93 25 rsq standard 0.815 Preprocessor01_Model1
#> 9 Fold05 1 17 0.00000000133 93 25 mae standard 55.2 Preprocessor01_Model1
#> 10 Fold05 1 17 0.00000000133 93 25 rsq standard 0.847 Preprocessor01_Model1
#> # ℹ 490 more rows
show_best(lgbm_res, metric = "rsq")
#> # A tibble: 5 × 11
#> trees min_n learn_rate `agent hash` `company hash` .metric .estimator mean n std_err .config
#> <int> <int> <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 1250 9 0.1 107 80 rsq standard 0.947 10 0.00395 Preprocessor16_Model1
#> 2 917 16 0.0422 9 39 rsq standard 0.946 10 0.00362 Preprocessor12_Model1
#> 3 1666 30 0.00750 124 19 rsq standard 0.943 10 0.00309 Preprocessor21_Model1
#> 4 2000 24 0.00133 25 124 rsq standard 0.919 10 0.00383 Preprocessor25_Model1
#> 5 500 25 0.00316 52 8 rsq standard 0.899 10 0.00435 Preprocessor07_Model1
Create your own tibble for final parameters or use one of the tune::select_*() functions:
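For example (select_best() is the same helper used with the Bayesian optimization results later):
lgbm_best <- select_best(lgbm_res, metric = "mae")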
Grid search, combined with resampling, requires fitting a lot of models!
These models donβt depend on one another and can be run in parallel.
We can use a parallel backend to do this:
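One possible backend, using the doParallel package from the session package list (the core count mirrors the 10 physical cores mentioned below):
library(doParallel)
cl <- makePSOCKcluster(10)
registerDoParallel(cl)
# fit_resamples() and tune_grid() now run the resamples in parallel
# ... tuning code here ...
stopCluster(cl)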
Speed-ups are fairly linear up to the number of physical cores (10 here).
We have directly optimized the number of trees as a tuning parameter.
Instead we could set the number of trees to a single large value and stop adding trees when performance no longer improves.
This is known as “early stopping” and there is a parameter for that: stop_iter.
Early stopping has the potential to decrease the tuning time.
Set trees = 2000 and tune the stop_iter parameter.
Note that you will need to regenerate lgbm_param with your new workflow!
10:00
hotel_rec <-
recipe(avg_price_per_room ~ ., data = hotel_train) %>%
step_YeoJohnson(lead_time) %>%
step_dummy_hash(agent, num_terms = tune("agent hash")) %>%
step_dummy_hash(company, num_terms = tune("company hash")) %>%
step_zv(all_predictors())
lgbm_spec <-
boost_tree(trees = tune(), learn_rate = tune(), min_n = tune()) %>%
set_mode("regression") %>%
set_engine("lightgbm", num_threads = 1)
lgbm_wflow <- workflow(hotel_rec, lgbm_spec)
lgbm_param <-
lgbm_wflow %>%
extract_parameter_set_dials() %>%
update(`agent hash` = num_hash(c(3, 8)),
`company hash` = num_hash(c(3, 8)))
In the last section, we evaluated 250 models (25 candidates times 10 resamples).
We can make this go faster using parallel processing.
Also, for some models, we can fit far fewer models than the number that are being evaluated.
For example, a model with X trees can often predict on candidates with fewer than X trees.
Both of these methods can lead to enormous speed-ups.
Racing is an old tool that we can use to go even faster.
This can result in fitting a small number of models.
How do we eliminate tuning parameter combinations?
There are a few methods to do so. We’ll use one based on analysis of variance (ANOVA).
However… there is typically a large difference between resamples in the results.
Here are some realistic (but simulated) examples of two candidate models.
An error estimate is measured for each of 10 resamples.
There is usually a significant resample-to-resample effect (rank corr: 0.83).
One way to evaluate these models is to do a paired t-test
With \(n = 10\) resamples, the confidence interval for the difference in RMSE is (0.99, 2.8), indicating that candidate number 2 has smaller error.
What if we were to have compared the candidates while we sequentially evaluated each resample?
One candidate shows superiority when 4 resamples have been evaluated.
One version of racing uses a mixed model ANOVA to construct one-sided confidence intervals for each candidate versus the current best.
Any candidates whose bound does not include zero are discarded. Here is an animation.
The resamples are analyzed in a random order (so set the seed).
Kuhn (2014) has examples and simulations to show that the method works.
The finetune package has functions tune_race_anova() and tune_race_win_loss().
The syntax and helper functions are extremely similar to those shown for tune_grid().
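A hedged sketch of the racing call that produced lgbm_race_res (the grid size of 50 is inferred from the “171 out of 500 models” note below, and the seed from the later exercise):
library(finetune)

set.seed(9)
lgbm_race_res <-
  lgbm_wflow %>%
  tune_race_anova(
    resamples = hotel_rs,
    grid = 50,
    param_info = lgbm_param,
    metrics = reg_metrics
  )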
show_best(lgbm_race_res, metric = "mae")
#> # A tibble: 2 × 11
#> trees min_n learn_rate `agent hash` `company hash` .metric .estimator mean n std_err .config
#> <int> <int> <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 1347 5 0.0655 66 26 mae standard 9.64 10 0.173 Preprocessor34_Model1
#> 2 980 8 0.0429 17 135 mae standard 9.76 10 0.164 Preprocessor25_Model1
Only 171 models were fit (out of 500).
select_best() never considers candidate models that did not get to the end of the race.
There is a helper function to see how candidate models were removed from consideration.
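One such helper is plot_race() from the finetune package:
plot_race(lgbm_race_res)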
Run tune_race_anova() with a different seed.
10:00
Now repeat the process using tune_race_anova(), but this time focus on the performance metric rsq (use set.seed(9) as before).
What can you conclude when you compare results from mae and rsq?
15:00
We used feature hashing to generate a smaller set of indicator columns to deal with the large number of levels for the agent and company predictors.
Tree-based models (and a few others) don’t require indicators for categorical predictors. They can split on these variables as-is.
We’ll keep all categorical predictors as factors and focus on optimizing additional boosting parameters.
lgbm_spec <-
boost_tree(trees = 1000, learn_rate = tune(), min_n = tune(),
tree_depth = tune(), loss_reduction = tune(),
stop_iter = tune()) %>%
set_mode("regression") %>%
set_engine("lightgbm", num_threads = 1)
lgbm_wflow <- workflow(avg_price_per_room ~ ., lgbm_spec)
lgbm_param <-
lgbm_wflow %>%
extract_parameter_set_dials() %>%
update(learn_rate = learn_rate(c(-5, -1)))
Instead of pre-defining a grid of candidate points, we can model our current results to predict what the next candidate point should be.
Suppose that we are only tuning the learning rate in our boosted tree.
We could do something like:
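A hypothetical sketch of that idea (the object names and the linear-model form are illustrations only):
# model resampled MAE as a function of the learning rate
mae_results <- collect_metrics(lgbm_res) %>% filter(.metric == "mae")
mae_meta_fit <- lm(mean ~ log10(learn_rate), data = mae_results)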
and use this to predict and rank new learning rate candidates.
A linear model probably isn’t the best choice though (more in a minute).
To illustrate the process, we resampled a large grid of learning rate values for our data to show what the relationship is between MAE and learning rate.
Now suppose that we used a grid of three points in the parameter range for learning rateβ¦
We can make a βmeta-modelβ with a small set of historical performance results.
Gaussian Processes (GP) models are a good choice to model performance.
\[\operatorname{cov}(\boldsymbol{x}_i, \boldsymbol{x}_j) = \exp\left(-\frac{1}{2}|\boldsymbol{x}_i - \boldsymbol{x}_j|^2\right) + \sigma^2_{ij}\]
The GP model can take candidate tuning parameter combinations as inputs and make predictions for performance (e.g. MAE)
The variance is mostly driven by spatial variability (the previous equation).
The predicted variance is zero at locations of actual data points and becomes very high when far away from any observed data.
Your GP makes predictions on two new candidate tuning parameters.
We want to minimize MAE.
Which should we choose?
05:00
This isn’t a very good fit but we can still use it.
How can we use the outputs to choose the next point to measure?
Acquisition functions take the predicted mean and variance and use them to balance:
Exploration focuses on the variance, exploitation is about the mean.
Weβll use an acquisition function to select a new candidate.
The most popular method appears to be expected improvement (EI) above the current best results.
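For a minimization problem, with current best value \(f^*\) and GP predictive mean \(\mu(\boldsymbol{x})\) and standard deviation \(\sigma(\boldsymbol{x})\), the usual closed form is:
\[\begin{align} z &= \frac{f^* - \mu(\boldsymbol{x})}{\sigma(\boldsymbol{x})} \notag \\ EI(\boldsymbol{x}) &= \left(f^* - \mu(\boldsymbol{x})\right)\,\Phi(z) + \sigma(\boldsymbol{x})\,\phi(z) \end{align}\]
where \(\Phi\) and \(\phi\) are the standard normal CDF and density.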
We would probably pick the point with the largest EI as the next point.
(There are other functions beyond EI.)
Once we pick the candidate point, we measure performance for it (e.g. resampling).
Another GP is fit, EI is recomputed, and so on.
We stop when we have completed the allowed number of iterations or if we don’t see any improvement after a pre-set number of attempts.
We’ll use a function called tune_bayes() that has very similar syntax to tune_grid().
It has an additional initial argument for the initial set of performance estimates and parameter combinations for the GP model.
initial can be the results of another tune_*() function or an integer (in which case tune_grid() is used under the hood to make such an initial set of results).
We’ll run the optimization more than once, so let’s make an initial grid of results to serve as the substrate for the BO.
I suggest at least the number of tuning parameters plus two as the size of the initial grid for BO.
reg_metrics <- metric_set(mae, rsq)
set.seed(12)
init_res <-
lgbm_wflow %>%
tune_grid(
resamples = hotel_rs,
grid = nrow(lgbm_param) + 2,
param_info = lgbm_param,
metrics = reg_metrics
)
show_best(init_res, metric = "mae") %>% select(-.metric, -.estimator)
#> # A tibble: 5 × 9
#> min_n tree_depth learn_rate loss_reduction stop_iter mean n std_err .config
#> <int> <int> <dbl> <dbl> <int> <dbl> <int> <dbl> <chr>
#> 1 33 12 0.0215 8.25e- 9 5 10.1 10 0.182 Preprocessor1_Model6
#> 2 14 10 0.1 3.16e+ 1 8 10.2 10 0.192 Preprocessor1_Model3
#> 3 8 3 0.00464 1 e-10 11 14.4 10 0.305 Preprocessor1_Model2
#> 4 40 1 0.001 4.64e- 3 14 36.1 10 0.408 Preprocessor1_Model7
#> 5 27 15 0.000215 6.81e- 7 20 44.5 10 0.371 Preprocessor1_Model5
ctrl_bo <- control_bayes(seed = 15, verbose_iter = TRUE) # <- for demonstration
set.seed(15)
lgbm_bayes_res <-
lgbm_wflow %>%
tune_bayes(
resamples = hotel_rs,
initial = init_res, # <- initial results
iter = 20,
param_info = lgbm_param,
control = ctrl_bo,
metrics = reg_metrics
)
#> Optimizing mae using the expected improvement
#>
#> ── Iteration 1 ─────────────────────────────────────────────────────
#>
#> i Current best: mae=10.05 (@iter 0)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=3, tree_depth=9, learn_rate=0.089, loss_reduction=2.73e-10, stop_iter=4
#> i Estimating performance
#> ✓ Estimating performance
#> ♥ Newest results: mae=9.651 (+/-0.159)
#>
#> ── Iteration 2 ─────────────────────────────────────────────────────
#>
#> i Current best: mae=9.651 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=2, tree_depth=13, learn_rate=0.0196, loss_reduction=0.000897, stop_iter=19
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.971 (+/-0.172)
#>
#> ── Iteration 3 ─────────────────────────────────────────────────────
#>
#> i Current best: mae=9.651 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=24, tree_depth=15, learn_rate=0.0554, loss_reduction=0.00201, stop_iter=16
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.73 (+/-0.149)
#>
#> ── Iteration 4 ─────────────────────────────────────────────────────
#>
#> i Current best: mae=9.651 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=37, tree_depth=1, learn_rate=0.0117, loss_reduction=1.16e-06, stop_iter=5
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=17.91 (+/-0.308)
#>
#> ── Iteration 5 ─────────────────────────────────────────────────────
#>
#> i Current best: mae=9.651 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=2, tree_depth=14, learn_rate=0.0253, loss_reduction=3.1e-05, stop_iter=4
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.852 (+/-0.161)
#>
#> ── Iteration 6 ─────────────────────────────────────────────────────
#>
#> i Current best: mae=9.651 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=6, tree_depth=7, learn_rate=0.0323, loss_reduction=2.04e-10, stop_iter=6
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.816 (+/-0.154)
#>
#> ── Iteration 7 ─────────────────────────────────────────────────────
#>
#> i Current best: mae=9.651 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=17, tree_depth=12, learn_rate=0.0359, loss_reduction=1.7e-10, stop_iter=12
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.706 (+/-0.159)
#>
#> ── Iteration 8 ─────────────────────────────────────────────────────
#>
#> i Current best: mae=9.651 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=39, tree_depth=15, learn_rate=0.029, loss_reduction=9.27e-08, stop_iter=6
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.999 (+/-0.192)
#>
#> ── Iteration 9 ─────────────────────────────────────────────────────
#>
#> i Current best: mae=9.651 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=26, tree_depth=15, learn_rate=0.0249, loss_reduction=6.75e-08, stop_iter=13
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.873 (+/-0.18)
#>
#> ── Iteration 10 ────────────────────────────────────────────────────
#>
#> i Current best: mae=9.651 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=3, tree_depth=9, learn_rate=0.0245, loss_reduction=5.44e-09, stop_iter=4
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.863 (+/-0.18)
#>
#> ── Iteration 11 ────────────────────────────────────────────────────
#>
#> i Current best: mae=9.651 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=5, tree_depth=12, learn_rate=0.0593, loss_reduction=1.16e-09, stop_iter=6
#> i Estimating performance
#> ✓ Estimating performance
#> ♥ Newest results: mae=9.645 (+/-0.171)
#>
#> ── Iteration 12 ────────────────────────────────────────────────────
#>
#> i Current best: mae=9.645 (@iter 11)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=23, tree_depth=4, learn_rate=0.0739, loss_reduction=3.99e-10, stop_iter=4
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=10.09 (+/-0.193)
#>
#> ── Iteration 13 ────────────────────────────────────────────────────
#>
#> i Current best: mae=9.645 (@iter 11)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=4, tree_depth=2, learn_rate=0.0439, loss_reduction=0.00914, stop_iter=5
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=11.59 (+/-0.218)
#>
#> ── Iteration 14 ────────────────────────────────────────────────────
#>
#> i Current best: mae=9.645 (@iter 11)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=22, tree_depth=15, learn_rate=0.00489, loss_reduction=3.28e-05, stop_iter=17
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=11.22 (+/-0.231)
#>
#> ── Iteration 15 ────────────────────────────────────────────────────
#>
#> i Current best: mae=9.645 (@iter 11)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=27, tree_depth=8, learn_rate=0.0524, loss_reduction=4.24e-08, stop_iter=11
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.801 (+/-0.184)
#>
#> ── Iteration 16 ────────────────────────────────────────────────────
#>
#> i Current best: mae=9.645 (@iter 11)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=3, tree_depth=13, learn_rate=0.00837, loss_reduction=4.21e-10, stop_iter=9
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=10.45 (+/-0.195)
#>
#> ── Iteration 17 ────────────────────────────────────────────────────
#>
#> i Current best: mae=9.645 (@iter 11)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=33, tree_depth=13, learn_rate=0.0767, loss_reduction=1.36e-10, stop_iter=4
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.921 (+/-0.181)
#>
#> ── Iteration 18 ────────────────────────────────────────────────────
#>
#> i Current best: mae=9.645 (@iter 11)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=3, tree_depth=7, learn_rate=0.0711, loss_reduction=3.14e-07, stop_iter=13
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.664 (+/-0.161)
#>
#> ── Iteration 19 ────────────────────────────────────────────────────
#>
#> i Current best: mae=9.645 (@iter 11)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=3, tree_depth=9, learn_rate=0.0613, loss_reduction=2.02, stop_iter=8
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.772 (+/-0.165)
#>
#> ── Iteration 20 ────────────────────────────────────────────────────
#>
#> i Current best: mae=9.645 (@iter 11)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i min_n=5, tree_depth=6, learn_rate=0.0896, loss_reduction=4.87e-09, stop_iter=9
#> i Estimating performance
#> ✓ Estimating performance
#> ♧ Newest results: mae=9.721 (+/-0.146)
show_best(lgbm_bayes_res, metric = "mae") %>% select(-.metric, -.estimator)
#> # A tibble: 5 × 10
#> min_n tree_depth learn_rate loss_reduction stop_iter mean n std_err .config .iter
#> <int> <int> <dbl> <dbl> <int> <dbl> <int> <dbl> <chr> <int>
#> 1 5 12 0.0593 1.16e- 9 6 9.64 10 0.171 Iter11 11
#> 2 3 9 0.0890 2.73e-10 4 9.65 10 0.159 Iter1 1
#> 3 3 7 0.0711 3.14e- 7 13 9.66 10 0.161 Iter18 18
#> 4 17 12 0.0359 1.70e-10 12 9.71 10 0.159 Iter7 7
#> 5 5 6 0.0896 4.87e- 9 9 9.72 10 0.146 Iter20 20
Let’s try a different acquisition function: conf_bound(kappa).
We’ll use the objective argument to set it.
Choose your own kappa value:
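A hedged sketch (the result object name is an assumption; the other arguments mirror the earlier tune_bayes() call):
set.seed(15)
lgbm_conf_res <-
  lgbm_wflow %>%
  tune_bayes(
    resamples = hotel_rs,
    initial = init_res,
    iter = 20,
    param_info = lgbm_param,
    objective = conf_bound(kappa = 0.1),  # <- try your own kappa value
    control = ctrl_bo,
    metrics = reg_metrics
  )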
Bonus points: before the optimization is done, press <esc> and see what happens.
15:00
Stopping tune_bayes() will return the current results.
Parallel processing can still be used to more efficiently measure each candidate point.
There are a lot of other iterative methods that you can use.
The finetune package also has functions for simulated annealing search.
Let’s say that we’ve tried a lot of different models and we like our lightgbm model the most.
What do we do now?
We can take the results of the Bayesian optimization and accept the best results:
best_param <- select_best(lgbm_bayes_res, metric = "mae")
final_wflow <-
lgbm_wflow %>%
finalize_workflow(best_param)
final_wflow
#> ══ Workflow ══════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: boost_tree()
#>
#> ── Preprocessor ──────────────────────────────────────────────────────
#> avg_price_per_room ~ .
#>
#> ── Model ─────────────────────────────────────────────────────────────
#> Boosted Tree Model Specification (regression)
#>
#> Main Arguments:
#> trees = 1000
#> min_n = 5
#> tree_depth = 12
#> learn_rate = 0.0593110845668806
#> loss_reduction = 1.16244203146145e-09
#> stop_iter = 6
#>
#> Engine-Specific Arguments:
#> num_threads = 1
#>
#> Computational engine: lightgbm
We can use individual functions:
final_fit <- final_wflow %>% fit(data = hotel_train)
# then predict() or augment()
# then compute metrics
Remember that there is also a convenience function to do all of this:
set.seed(3893)
final_res <- final_wflow %>% last_fit(hotel_split, metrics = reg_metrics)
final_res
#> # Resampling results
#> # Manual resampling
#> # A tibble: 1 × 6
#> splits id .metrics .notes .predictions .workflow
#> <list> <chr> <list> <list> <list> <list>
#> 1 <split [3749/1251]> train/test split <tibble [2 × 4]> <tibble [0 × 3]> <tibble [1,251 × 4]> <workflow>
Test set performance:
Recall that resampling predicted the MAE to be 9.645.
We made it!
inspired and adapted from https://tidymodels.org and https://workshops.tidymodels.org