Tidymodeling
Case Study on Transportation
Chicago L-Train data
Several years' worth of pre-pandemic data were assembled to try to predict the daily number of people entering the Clark and Lake elevated (“L”) train station in Chicago.
More information:
Predictors
the 14-day lagged ridership at this and other stations (units: thousands of rides/day)
weather data
home/away game schedules for Chicago teams
the date
The data are in the modeldata package. See ?Chicago.
Your turn: Explore the Data
Take a look at these data for a few minutes and see if you can find any interesting characteristics in the predictors or the outcome.
library(tidymodels)
library(rules)
data("Chicago")
dim(Chicago)
#> [1] 5698 50
stations
#> [1] "Austin" "Quincy_Wells" "Belmont" "Archer_35th"
#> [5] "Oak_Park" "Western" "Clark_Lake" "Clinton"
#> [9] "Merchandise_Mart" "Irving_Park" "Washington_Wells" "Harlem"
#> [13] "Monroe" "Polk" "Ashland" "Kedzie"
#> [17] "Addison" "Jefferson_Park" "Montrose" "California"
Splitting with Chicago data
Let’s put the last two weeks of data into the test set. initial_time_split() can be used for this purpose:
data(Chicago)
chi_split <- initial_time_split(Chicago, prop = 1 - (14 / nrow(Chicago)))
chi_split
#> <Training/Testing/Total>
#> <5684/14/5698>
chi_train <- training(chi_split)
chi_test <- testing(chi_split)
## training
nrow(chi_train)
#> [1] 5684
## testing
nrow(chi_test)
#> [1] 14
initial_time_split() is a function from the rsample package used for time-series data splitting.
prop = 1 - (14 / nrow(Chicago)) calculates the proportion of data to keep for training:
• nrow(Chicago) is 5698 (the total number of rows).
• 14 / 5698 is the proportion reserved for testing: only the last 14 data points.
• So prop becomes 1 - (14/5698), approximately 0.99754.
This ensures the last 14 rows are used for testing, and the rest (5684) are used for training.
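As a quick sanity check (a sketch, not shown on the slides), the test set should hold the 14 most recent consecutive dates:
range(chi_test$date)
diff(range(chi_test$date)) # 13 days apart if the 14 dates are consecutive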
Time series resampling
Our Chicago data are ordered in time. Regular cross-validation, which uses random sampling, may not be the best idea.
We can emulate our training/test split by making similar resamples.
Fold 1: Take the first X years of data as the analysis set, the next 2 weeks as the assessment set.
Fold 2: Take the first X years + 2 weeks of data as the analysis set, the next 2 weeks as the assessment set.
and so on
Rolling forecast origin resampling
The usual diagram of this scheme shows overlapping assessment sets. We will use non-overlapping assessment sets, but it could be done either way.
Time series resampling
chi_rs <-
  chi_train %>%
  sliding_period(
    index = "date"
  )
Use the date column to order the data in time.
Time series resampling
chi_rs <-
  chi_train %>%
  sliding_period(
    index = "date",
    period = "week"
  )
Our units will be weeks.
Step forward in weekly intervals. By default, each new resample starts 1 week later than the last one (we change this with step = 2 below).
Time series resampling
chi_rs <-
  chi_train %>%
  sliding_period(
    index = "date",
    period = "week",
    lookback = 52 * 15
  )
Every analysis set has 15 years of data
Each analysis set spans 15 years of weekly data (52 weeks per year × 15 years). That is the training window size for each resample.
Time series resampling
chi_rs <-
  chi_train %>%
  sliding_period(
    index = "date",
    period = "week",
    lookback = 52 * 15,
    assess_stop = 2
  )
Every assessment set has 2 weeks of data
Evaluate on the 2 weeks that follow each training window; this is the assessment window for each resample.
Time series resampling
chi_rs <-
  chi_train %>%
  sliding_period(
    index = "date",
    period = "week",
    lookback = 52 * 15,
    assess_stop = 2,
    step = 2
  )
Increment by 2 weeks so that there are no overlapping assessment sets.
chi_rs$splits[[1]] %>% assessment() %>% pluck("date") %>% range()
#> [1] "2016-01-07" "2016-01-20"
chi_rs$splits[[2]] %>% assessment() %>% pluck("date") %>% range()
#> [1] "2016-01-21" "2016-02-03"
Move the window forward by 2 weeks each time rather than every single week. This avoids overlapping assessment sets.
Our resampling object
chi_rs
#> # Sliding period resampling
#> # A tibble: 16 × 2
#> splits id
#> <list> <chr>
#> 1 <split [5463/14]> Slice01
#> 2 <split [5467/14]> Slice02
#> 3 <split [5467/14]> Slice03
#> 4 <split [5467/14]> Slice04
#> 5 <split [5467/14]> Slice05
#> 6 <split [5467/14]> Slice06
#> 7 <split [5467/14]> Slice07
#> 8 <split [5467/14]> Slice08
#> 9 <split [5467/14]> Slice09
#> 10 <split [5467/14]> Slice10
#> 11 <split [5467/14]> Slice11
#> 12 <split [5467/14]> Slice12
#> 13 <split [5467/14]> Slice13
#> 14 <split [5467/14]> Slice14
#> 15 <split [5467/14]> Slice15
#> 16 <split [5467/11]> Slice16
We will fit 16 models on 16 slightly different analysis sets.
Each will produce a separate set of performance metrics.
We will average the 16 metrics to get the resampling estimate of that statistic.
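As a minimal sketch of that process (not part of the slides; the two predictors are arbitrary), a single model specification can be resampled and its metrics averaged like this:
lm_rs_res <-
  linear_reg() %>%
  fit_resamples(ridership ~ Clark_Lake + Quincy_Wells, resamples = chi_rs)
# mean and std_err are computed across the 16 assessment sets
collect_metrics(lm_rs_res)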
Feature engineering with recipes
chi_rec <-
  recipe(ridership ~ ., data = chi_train)
Based on the formula, the function assigns columns to roles of “outcome” or “predictor”
A recipe
summary(chi_rec)
#> # A tibble: 50 × 4
#> variable type role source
#> <chr> <list> <chr> <chr>
#> 1 Austin <chr [2]> predictor original
#> 2 Quincy_Wells <chr [2]> predictor original
#> 3 Belmont <chr [2]> predictor original
#> 4 Archer_35th <chr [2]> predictor original
#> 5 Oak_Park <chr [2]> predictor original
#> 6 Western <chr [2]> predictor original
#> 7 Clark_Lake <chr [2]> predictor original
#> 8 Clinton <chr [2]> predictor original
#> 9 Merchandise_Mart <chr [2]> predictor original
#> 10 Irving_Park <chr [2]> predictor original
#> # ℹ 40 more rows
A recipe - work with dates
chi_rec <-
  recipe(ridership ~ ., data = chi_train) %>%
  step_date(date, features = c("dow", "month", "year"))
This creates three new columns in the data based on the date. Note that the day-of-the-week column is a factor.
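To peek at the new columns (a sketch, not on the original slide; step_date() names them date_dow, date_month, and date_year by default):
chi_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  select(date, date_dow, date_month, date_year) %>%
  slice(1:3)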
A recipe - work with dates
chi_rec <-
  recipe(ridership ~ ., data = chi_train) %>%
  step_date(date, features = c("dow", "month", "year")) %>%
  step_holiday(date)
Add indicators for major holidays. Specific holidays, including non-US ones, can also be generated.
At this point, we don't need date anymore. Instead of deleting it (there is a step for that), we will change its role to be an identification variable.
We might want to change the role (instead of removing the column) because it will stay in the data set (even when resampled) and might be useful for diagnosing issues.
A recipe - work with dates
chi_rec <-
  recipe(ridership ~ ., data = chi_train) %>%
  step_date(date, features = c("dow", "month", "year")) %>%
  step_holiday(date) %>%
  update_role(date, new_role = "id") %>%
  update_role_requirements(role = "id", bake = TRUE)
date is still in the data set, but tidymodels knows not to treat it as an analysis column.
update_role_requirements() is needed to make sure that this column is required when making new data points. The help page has a good discussion of the nuances.
This means date won't be used in modeling directly but will still be present in the data (e.g., for tracking or plotting).
In other words: “keep the date column when prepping/baking the data.”
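A quick check (a sketch, not on the slides) that date survives baking on new data even though it is no longer a predictor:
chi_rec %>%
  prep() %>%
  bake(new_data = chi_test) %>%
  select(date) %>%
  head(3)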
A recipe - remove constant columns
chi_rec <-
  recipe(ridership ~ ., data = chi_train) %>%
  step_date(date, features = c("dow", "month", "year")) %>%
  step_holiday(date) %>%
  update_role(date, new_role = "id") %>%
  update_role_requirements(role = "id", bake = TRUE) %>%
  step_zv(all_nominal_predictors())
A recipe - handle correlations
The station columns have a very high degree of correlation.
We might want to decorrelate them with principal component analysis (PCA) to help the model fits go more easily.
The vector stations contains all station names and can be used to identify all the relevant columns.
chi_pca_rec <-
  chi_rec %>%
  step_normalize(all_of(!!stations)) %>%
  step_pca(all_of(!!stations), num_comp = tune())
We’ll tune the number of PCA components for (default) values of one to four.
Start with the recipe you've already built (chi_rec); it already includes the date features, holiday indicators, and the id role for date.
step_normalize() standardizes the station columns (mean = 0, SD = 1). This is needed before PCA because PCA is sensitive to scale.
stations is a character vector of column names; all_of() selects exactly those columns.
!!stations uses tidy evaluation (from rlang) to inject the vector of names stored in the variable stations.
The !! is called the unquote operator. It tells R to evaluate the variable in the current environment and inject its value into the expression used by a tidyverse function (such as a dplyr, recipes, or ggplot2 function).
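A small illustration of the unquote operator (the vector name vars is hypothetical): the value is spliced into the expression up front, so the resulting recipe no longer depends on the vector existing later.
vars <- c("Austin", "Belmont")
rlang::expr(all_of(!!vars))
#> all_of(c("Austin", "Belmont"))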
Make some models
Let’s try three models. The first one requires the rules package (loaded earlier).
cb_spec <- cubist_rules(committees = 25, neighbors = tune())
mars_spec <- mars(prod_degree = tune()) %>% set_mode("regression")
lm_spec <- linear_reg()
chi_set <-
  workflow_set(
    list(pca = chi_pca_rec, basic = chi_rec),
    list(cubist = cb_spec, mars = mars_spec, lm = lm_spec)
  ) %>%
  # Evaluate models using mean absolute error
  option_add(metrics = metric_set(mae))
Briefly talk about Cubist being a (sort of) boosted rule-based model and MARS being a nonlinear regression model. Both incorporate feature selection nicely.
Cubist: a rule-based regression model that combines decision trees with linear regression. It's particularly good at handling complex, nonlinear relationships while keeping interpretability. It works like this:
Builds a decision tree (like in regression trees),
At each terminal leaf node, instead of a single prediction value, it fits a linear regression model on the data in that node,
It can use multiple “committees” (like boosting) to average predictions from several models,
Optionally, it uses nearest neighbors to refine predictions based on similar training points.
MARS (Multivariate Adaptive Regression Splines) is a regression modeling technique that automatically:
Captures non-linear relationships,
Detects and models interactions between variables,
Builds a model using piecewise linear splines (like fitting straight lines that bend where needed).
prod_degree controls the interaction degree: 1 = additive (no interactions), 2 = allows two-way interactions. The default tuning range covers degrees 1 and 2.
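To see which parameters are marked for tuning and the default ranges the grid will draw from, one option (a sketch, not on the slides) is:
extract_parameter_set_dials(cb_spec)
extract_parameter_set_dials(mars_spec)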
Process them on the resamples
# Set up some objects for stacking ensembles (in a few slides)
grid_ctrl <- control_grid(save_pred = TRUE, save_workflow = TRUE)
chi_res <-
  chi_set %>%
  workflow_map(
    resamples = chi_rs,
    grid = 10,
    control = grid_ctrl,
    verbose = TRUE,
    seed = 12
  )
save_pred = TRUE saves the predictions from each model/resample combination; these are needed for stacking, diagnostics, or error analysis.
save_workflow = TRUE saves the full workflow (recipe + model); this is important if you want to reuse the trained models later (e.g., for ensembles).
The second chunk runs hyperparameter tuning on each workflow in chi_set. That means we are:
Tuning 6 different workflows using the rolling time series resamples,
Saving predictions and workflows for later use (e.g., ensembling),
Using workflow_map() to automate tuning across the whole workflow set.
How do the results look?
rank_results(chi_res)
#> # A tibble: 34 × 9
#> wflow_id .config .metric mean std_err n preprocessor model rank
#> <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <int>
#> 1 pca_cubist Preprocessor1_Model1 mae 0.804 0.101 16 recipe cubis… 1
#> 2 basic_cubist Preprocessor1_Model1 mae 0.901 0.115 16 recipe cubis… 2
#> 3 pca_cubist Preprocessor2_Model2 mae 0.985 0.120 16 recipe cubis… 3
#> 4 pca_cubist Preprocessor4_Model2 mae 0.990 0.121 16 recipe cubis… 4
#> 5 pca_cubist Preprocessor1_Model3 mae 1.00 0.119 16 recipe cubis… 5
#> 6 pca_cubist Preprocessor4_Model1 mae 1.02 0.131 16 recipe cubis… 6
#> 7 pca_cubist Preprocessor3_Model3 mae 1.03 0.121 16 recipe cubis… 7
#> 8 pca_cubist Preprocessor3_Model2 mae 1.07 0.137 16 recipe cubis… 8
#> 9 basic_cubist Preprocessor1_Model9 mae 1.07 0.115 16 recipe cubis… 9
#> 10 basic_cubist Preprocessor1_Model8 mae 1.07 0.114 16 recipe cubis… 10
#> # ℹ 24 more rows
Plot the results
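The slide shows a figure of the ranked results; presumably something like the default workflow set plot (a sketch):
autoplot(chi_res)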
Pull out specific results
We can also pull out the specific tuning results and look at them:
chi_res %>%
  extract_workflow_set_result("pca_cubist") %>%
  autoplot()
Tuning parameter performance plot:
Components (PCA): 1 or 2 gives the best tradeoff between dimension reduction and loss of information.
Nearest neighbors: 0 gives the best MAE; adding neighbors doesn't help much.
Why choose just one final_fit?
Model stacks generate predictions that are informed by several models.
A model stack is a type of ensemble model built using the stacks package in the tidymodels ecosystem. It combines the predictions of multiple tuned models into a single, more powerful model.
Instead of choosing just one “best” model, a stack uses many models together — assigning each one a weight — to improve overall prediction accuracy.
Think of it as:
Collecting predictions from several strong models (e.g., Cubist, MARS, Linear Reg),
Training a meta-model (like a regularized regression) to find the best way to combine them.
add_candidates() (used below) essentially says: “take all the top models from my tuning results and add them as candidates for stacking.”
Building a model stack
1. Define candidate members
2. Initialize a data stack object
3. Add candidate ensemble members to the data stack
4. Evaluate how to combine their predictions
5. Fit candidate ensemble members with non-zero stacking coefficients
6. Predict on new data!
Start the stack and add members
Collect all of the resampling results for all model configurations.
chi_stack <-
  stacks() %>%
  add_candidates(chi_res)
add_candidates() pulls in:
the best-performing configurations from each workflow,
their predictions on the resamples,
and stores them in the stacking object for ensembling.
You’re preparing to build a meta-model that learns how to combine all your best models for better performance.
By default, the meta-learner is a penalized linear regression: a lasso (L1-regularized) model whose coefficients are constrained to be non-negative.
Estimate weights for each candidate
Which configurations should be retained? Uses a penalized linear model:
set.seed(122)
chi_stack_res <- blend_predictions(chi_stack)
chi_stack_res
#> # A tibble: 5 × 3
#> member type weight
#> <chr> <chr> <dbl>
#> 1 pca_cubist_1_1 cubist_rules 0.351
#> 2 basic_cubist_1_1 cubist_rules 0.260
#> 3 pca_cubist_1_3 cubist_rules 0.203
#> 4 basic_cubist_1_6 cubist_rules 0.157
#> 5 pca_lm_2_1 linear_reg 0.0520
This is the key ensembling step. It:
Takes the candidate models from chi_stack (you added those earlier with add_candidates()),
Uses the saved resample predictions to learn how to combine them,
Fits a meta-model (by default, a lasso with non-negative coefficients) to find the optimal weights for each candidate's predictions,
Returns a new object (chi_stack_res) that contains the weighted ensemble.
The output tibble shows which candidate models were retained in the stack and their respective weights in the final prediction.
The stack kept 5 models (it discards weak/irrelevant ones automatically).
All top models are Cubist models — no surprise given Cubist performed best earlier.
One Linear Regression model made it in with a small weight (5.2%).
The rest were effectively ignored (their weights shrunk to 0) because they were redundant (correlated with better models), noisy, or weak predictors.
The lasso penalty shrinks unhelpful weights to exactly 0.
How did it do?
The overall results of the penalized model:
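The figure on this slide can presumably be reproduced with the default stacks plot (a sketch):
autoplot(chi_stack_res)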
x axis: the penalty value used in the regularized regression (meta-learner).
The best balance between complexity and performance was found at penalty = 1e-3.
num_members panel: a higher penalty keeps fewer members in the stack.
rmse panel: there is a sweet spot (around 1e-3) where prediction error is lowest.
rsq panel: that same penalty also gives the best variance explained.
The chosen penalty (1e-3) achieves the best accuracy while using a modest number of members (about 4), a good tradeoff between performance and simplicity.
Top panel: what you see is the average number of non-zero-weight members across all resamples, not the count from a single fit.
What does it use?
autoplot(chi_stack_res, type = "weights")
This shows the weights assigned to each member of the stacking ensemble.
The header saying penalty = 0.001 in the bar plot is important: it tells you which regularization strength was used to train the meta-learner (stacking model) that produced these weights.
Lower penalty (e.g., 0.0001) → allows more models to contribute (less shrinkage).
Higher penalty (e.g., 0.1) → aggressively shrinks weights to 0 (fewer models retained).
Fit the required candidate models
For each member we retain in the stack, we need its model fit on the entire training set.
chi_stack_res <- fit_members(chi_stack_res)
The test set: best Cubist model
We can pull out the results and the workflow to fit the single best cubist model.
best_cubist <-
  chi_res %>%
  extract_workflow_set_result("pca_cubist") %>%
  select_best()
cubist_res <-
  chi_res %>%
  extract_workflow("pca_cubist") %>%
  finalize_workflow(best_cubist) %>%
  last_fit(split = chi_split, metrics = metric_set(mae))
The test set: stack ensemble
We don’t have last_fit() for stacks (yet), so we manually make predictions.
stack_pred <-
  predict(chi_stack_res, chi_test) %>%
  bind_cols(chi_test)
Compare the results
Single best versus the stack:
collect_metrics(cubist_res)
#> # A tibble: 1 × 4
#> .metric .estimator .estimate .config
#> <chr> <chr> <dbl> <chr>
#> 1 mae standard 0.669 Preprocessor1_Model1
stack_pred %>% mae(ridership, .pred)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 mae standard 0.670
Here we compare the final test set performance of two models:
A tuned Cubist + PCA model (cubist_res)
A stacked ensemble (stack_pred)
Both models perform very similarly.
The stack didn’t significantly improve over the best single model, which makes sense since Cubist was already dominant in the stack. You might prefer the single Cubist model for simplicity, or keep the stack to hedge across multiple models.
Plot the test set
library(probably)
cubist_res %>%
  collect_predictions() %>%
  ggplot(aes(ridership, .pred)) +
  geom_point(alpha = 1 / 2) +
  geom_abline(lty = 2, col = "green") +
  coord_obs_pred()
Points close to the green line → accurate predictions.
Points above the line → predictions are too high.
Points below the line → predictions are too low.
If you see a pattern or curve, your model might be missing a key relationship.