Tidymodeling
Case Study on Transportation
Chicago L-Train data
Several years' worth of pre-pandemic data were assembled to try to predict the daily number of people entering the Clark and Lake elevated (“L”) train station in Chicago.
More information:
Predictors
the 14-day lagged ridership at this and other stations (units: thousands of rides/day)
weather data
home/away game schedules for Chicago teams
the date
The data are in the modeldata package. See ?Chicago.
Your turn: Explore the Data
Take a look at these data for a few minutes and see if you can find any interesting characteristics in the predictors or the outcome.
library(tidymodels)
library(rules)
data("Chicago")
dim(Chicago)
#> [1] 5698 50
stations
#> [1] "Austin" "Quincy_Wells" "Belmont" "Archer_35th"
#> [5] "Oak_Park" "Western" "Clark_Lake" "Clinton"
#> [9] "Merchandise_Mart" "Irving_Park" "Washington_Wells" "Harlem"
#> [13] "Monroe" "Polk" "Ashland" "Kedzie"
#> [17] "Addison" "Jefferson_Park" "Montrose" "California"
Splitting with Chicago data
Let’s put the last two weeks of data into the test set. initial_time_split() can be used for this purpose:
data(Chicago)
chi_split <- initial_time_split(Chicago, prop = 1 - (14 / nrow(Chicago)))
chi_split
#> <Training/Testing/Total>
#> <5684/14/5698>
chi_train <- training(chi_split)
chi_test <- testing(chi_split)
## training
nrow(chi_train)
#> [1] 5684
## testing
nrow(chi_test)
#> [1] 14
initial_time_split() is a function from the rsample package used for time-series data splitting.
prop = 1 - (14 / nrow(Chicago)) calculates the proportion of data to keep for training:
• nrow(Chicago) is 5698 (the total number of rows).
• 14 / 5698 is the proportion reserved for testing: only the last 14 data points.
• So prop becomes 1 - (14/5698), approximately 0.99754.
This ensures the last 14 rows are used for testing, and the rest (5684) are used for training.
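As a quick sanity check (a sketch, not shown on the slides), the test set should hold the 14 most recent consecutive dates:
range(chi_test$date)
diff(range(chi_test$date)) # 13 days apart if the 14 dates are consecutive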
Time series resampling
Our Chicago data are ordered in time. Regular cross-validation, which uses random sampling, may not be the best idea.
We can emulate our training/test split by making similar resamples.
Fold 1: Take the first X years of data as the analysis set, the next 2 weeks as the assessment set.
Fold 2: Take the first X years + 2 weeks of data as the analysis set, the next 2 weeks as the assessment set.
and so on
Rolling forecast origin resampling
The usual diagram of this scheme shows overlapping assessment sets. We will use non-overlapping assessment sets, but it could be done either way.
Time series resampling
chi_rs <-
  chi_train %>%
  sliding_period(
    index = "date"
  )
Use the date column to order the data in time.
Time series resampling
chi_rs <-
  chi_train %>%
  sliding_period(
    index = "date",
    period = "week"
  )
Our units will be weeks.
Step forward in weekly intervals. By default, each new resample starts 1 week later than the last one (we change this with step = 2 below).
Time series resampling
chi_rs <-
  chi_train %>%
  sliding_period(
    index = "date",
    period = "week",
    lookback = 52 * 15
  )
Every analysis set has 15 years of data
Each analysis set spans 15 years of weekly data (52 weeks per year × 15 years). That is the training window size for each resample.
Time series resampling
chi_rs <-
  chi_train %>%
  sliding_period(
    index = "date",
    period = "week",
    lookback = 52 * 15,
    assess_stop = 2
  )
Every assessment set has 2 weeks of data
Evaluate on the 2 weeks that follow each training window; this is the assessment window for each resample.
Time series resampling
chi_rs <-
  chi_train %>%
  sliding_period(
    index = "date",
    period = "week",
    lookback = 52 * 15,
    assess_stop = 2,
    step = 2
  )
Increment by 2 weeks so that there are no overlapping assessment sets.
chi_rs$splits[[1]] %>% assessment() %>% pluck("date") %>% range()
#> [1] "2016-01-07" "2016-01-20"
chi_rs$splits[[2]] %>% assessment() %>% pluck("date") %>% range()
#> [1] "2016-01-21" "2016-02-03"
Move the window forward by 2 weeks each time rather than every single week. This avoids overlapping assessment sets.
Our resampling object
chi_rs
#> # Sliding period resampling
#> # A tibble: 16 × 2
#> splits id
#> <list> <chr>
#> 1 <split [5463/14]> Slice01
#> 2 <split [5467/14]> Slice02
#> 3 <split [5467/14]> Slice03
#> 4 <split [5467/14]> Slice04
#> 5 <split [5467/14]> Slice05
#> 6 <split [5467/14]> Slice06
#> 7 <split [5467/14]> Slice07
#> 8 <split [5467/14]> Slice08
#> 9 <split [5467/14]> Slice09
#> 10 <split [5467/14]> Slice10
#> 11 <split [5467/14]> Slice11
#> 12 <split [5467/14]> Slice12
#> 13 <split [5467/14]> Slice13
#> 14 <split [5467/14]> Slice14
#> 15 <split [5467/14]> Slice15
#> 16 <split [5467/11]> Slice16
We will fit 16 models on 16 slightly different analysis sets.
Each will produce a separate set of performance metrics.
We will average the 16 metrics to get the resampling estimate of that statistic.
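As a minimal sketch of that process (not part of the slides; the two predictors are arbitrary), a single model specification can be resampled and its metrics averaged like this:
lm_rs_res <-
  linear_reg() %>%
  fit_resamples(ridership ~ Clark_Lake + Quincy_Wells, resamples = chi_rs)
# mean and std_err are computed across the 16 assessment sets
collect_metrics(lm_rs_res)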
Feature engineering with recipes
chi_rec <-
  recipe(ridership ~ ., data = chi_train)
Based on the formula, the function assigns columns to roles of “outcome” or “predictor”
A recipe
summary(chi_rec)
#> # A tibble: 50 × 4
#> variable type role source
#> <chr> <list> <chr> <chr>
#> 1 Austin <chr [2]> predictor original
#> 2 Quincy_Wells <chr [2]> predictor original
#> 3 Belmont <chr [2]> predictor original
#> 4 Archer_35th <chr [2]> predictor original
#> 5 Oak_Park <chr [2]> predictor original
#> 6 Western <chr [2]> predictor original
#> 7 Clark_Lake <chr [2]> predictor original
#> 8 Clinton <chr [2]> predictor original
#> 9 Merchandise_Mart <chr [2]> predictor original
#> 10 Irving_Park <chr [2]> predictor original
#> # ℹ 40 more rows
A recipe - work with dates
chi_rec <-
  recipe(ridership ~ ., data = chi_train) %>%
  step_date(date, features = c("dow", "month", "year"))
This creates three new columns in the data based on the date. Note that the day-of-the-week column is a factor.
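To peek at the new columns (a sketch, not on the original slide; step_date() names them date_dow, date_month, and date_year by default):
chi_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  select(date, date_dow, date_month, date_year) %>%
  slice(1:3)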
A recipe - work with dates
chi_rec <-
  recipe(ridership ~ ., data = chi_train) %>%
  step_date(date, features = c("dow", "month", "year")) %>%
  step_holiday(date)
Add indicators for major holidays. Specific holidays, including non-US ones, can also be generated.
At this point, we don't need date anymore. Instead of deleting it (there is a step for that), we will change its role to be an identification variable.
We might want to change the role (instead of removing the column) because it will stay in the data set (even when resampled) and might be useful for diagnosing issues.
A recipe - work with dates
chi_rec <-
  recipe(ridership ~ ., data = chi_train) %>%
  step_date(date, features = c("dow", "month", "year")) %>%
  step_holiday(date) %>%
  update_role(date, new_role = "id") %>%
  update_role_requirements(role = "id", bake = TRUE)
date is still in the data set, but tidymodels knows not to treat it as an analysis column.
update_role_requirements() is needed to make sure that this column is required when making new data points. The help page has a good discussion of the nuances.
This means date won't be used in modeling directly but will still be present in the data (e.g., for tracking or plotting).
In other words: “keep the date column when prepping/baking the data.”
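A quick check (a sketch, not on the slides) that date survives baking on new data even though it is no longer a predictor:
chi_rec %>%
  prep() %>%
  bake(new_data = chi_test) %>%
  select(date) %>%
  head(3)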
A recipe - remove constant columns
chi_rec <-
  recipe(ridership ~ ., data = chi_train) %>%
  step_date(date, features = c("dow", "month", "year")) %>%
  step_holiday(date) %>%
  update_role(date, new_role = "id") %>%
  update_role_requirements(role = "id", bake = TRUE) %>%
  step_zv(all_nominal_predictors())
A recipe - handle correlations
The station columns have a very high degree of correlation.
We might want to decorrelate them with principal component analysis (PCA) to help the model fits go more easily.
The vector stations contains all station names and can be used to identify all the relevant columns.
chi_pca_rec <-
  chi_rec %>%
  step_normalize(all_of(!!stations)) %>%
  step_pca(all_of(!!stations), num_comp = tune())
We’ll tune the number of PCA components for (default) values of one to four.
Start with the recipe you've already built (chi_rec); it already includes the date features, holiday indicators, and the id role for date.
step_normalize() standardizes the station columns (mean = 0, SD = 1). This is needed before PCA because PCA is sensitive to scale.
stations is a character vector of column names; all_of() selects exactly those columns.
!!stations uses tidy evaluation (from rlang) to inject the vector of names stored in the variable stations.
The !! is called the unquote operator. It tells R to evaluate the variable in the current environment and inject its value into the expression used by a tidyverse function (such as a dplyr, recipes, or ggplot2 function).
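A small illustration of the unquote operator (the vector name vars is hypothetical): the value is spliced into the expression up front, so the resulting recipe no longer depends on the vector existing later.
vars <- c("Austin", "Belmont")
rlang::expr(all_of(!!vars))
#> all_of(c("Austin", "Belmont"))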
Make some models
Let’s try three models. The first one requires the rules package (loaded earlier).
cb_spec <- cubist_rules(committees = 25, neighbors = tune())
mars_spec <- mars(prod_degree = tune()) %>% set_mode("regression")
lm_spec <- linear_reg()
chi_set <-
  workflow_set(
    list(pca = chi_pca_rec, basic = chi_rec),
    list(cubist = cb_spec, mars = mars_spec, lm = lm_spec)
  ) %>%
  # Evaluate models using mean absolute error
  option_add(metrics = metric_set(mae))
Briefly talk about Cubist being a (sort of) boosted rule-based model and MARS being a nonlinear regression model. Both incorporate feature selection nicely.
Cubist: a rule-based regression model that combines decision trees with linear regression. It's particularly good at handling complex, nonlinear relationships while keeping interpretability. It works like this:
Builds a decision tree (like in regression trees),
At each terminal leaf node, instead of a single prediction value, it fits a linear regression model on the data in that node,
It can use multiple “committees” (like boosting) to average predictions from several models,
Optionally, it uses nearest neighbors to refine predictions based on similar training points.
MARS (Multivariate Adaptive Regression Splines) is a regression modeling technique that automatically:
Captures non-linear relationships,
Detects and models interactions between variables,
Builds a model using piecewise linear splines (like fitting straight lines that bend where needed).
prod_degree controls the interaction degree: 1 = additive (no interactions), 2 = allows two-way interactions. The default tuning range covers degrees 1 and 2.
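To see which parameters are marked for tuning and the default ranges the grid will draw from, one option (a sketch, not on the slides) is:
extract_parameter_set_dials(cb_spec)
extract_parameter_set_dials(mars_spec)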
Process them on the resamples
# Set up some objects for stacking ensembles (in a few slides)
grid_ctrl <- control_grid(save_pred = TRUE, save_workflow = TRUE)
chi_res <-
  chi_set %>%
  workflow_map(
    resamples = chi_rs,
    grid = 10,
    control = grid_ctrl,
    verbose = TRUE,
    seed = 12
  )
save_pred = TRUE saves the predictions from each model/resample combination; these are needed for stacking, diagnostics, or error analysis.
save_workflow = TRUE saves the full workflow (recipe + model); this is important if you want to reuse the trained models later (e.g., for ensembles).
The second chunk runs hyperparameter tuning on each workflow in chi_set. That means we are:
Tuning 6 different workflows using the rolling time series resamples,
Saving predictions and workflows for later use (e.g., ensembling),
Using workflow_map() to automate tuning across the whole workflow set.
How do the results look?
rank_results(chi_res)
#> # A tibble: 34 × 9
#> wflow_id .config .metric mean std_err n preprocessor model rank
#> <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <int>
#> 1 pca_cubist Preprocessor1_Model1 mae 0.804 0.101 16 recipe cubis… 1
#> 2 basic_cubist Preprocessor1_Model1 mae 0.901 0.115 16 recipe cubis… 2
#> 3 pca_cubist Preprocessor2_Model2 mae 0.985 0.120 16 recipe cubis… 3
#> 4 pca_cubist Preprocessor4_Model2 mae 0.990 0.121 16 recipe cubis… 4
#> 5 pca_cubist Preprocessor1_Model3 mae 1.00 0.119 16 recipe cubis… 5
#> 6 pca_cubist Preprocessor4_Model1 mae 1.02 0.131 16 recipe cubis… 6
#> 7 pca_cubist Preprocessor3_Model3 mae 1.03 0.121 16 recipe cubis… 7
#> 8 pca_cubist Preprocessor3_Model2 mae 1.07 0.137 16 recipe cubis… 8
#> 9 basic_cubist Preprocessor1_Model9 mae 1.07 0.115 16 recipe cubis… 9
#> 10 basic_cubist Preprocessor1_Model8 mae 1.07 0.114 16 recipe cubis… 10
#> # ℹ 24 more rows
Plot the results
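The slide shows a figure of the ranked results; presumably something like the default workflow set plot (a sketch):
autoplot(chi_res)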
Pull out specific results
We can also pull out the specific tuning results and look at them:
chi_res %>%
  extract_workflow_set_result("pca_cubist") %>%
  autoplot()
Tuning parameter performance plot:
Components (PCA): 1 or 2 gives the best tradeoff between dimension reduction and loss of information.
Nearest neighbors: 0 gives the best MAE; adding neighbors doesn't help much.
Why choose just one final_fit?
Model stacks generate predictions that are informed by several models.
A model stack is a type of ensemble model built using the stacks package in the tidymodels ecosystem. It combines the predictions of multiple tuned models into a single, more powerful model.
Instead of choosing just one “best” model, a stack uses many models together — assigning each one a weight — to improve overall prediction accuracy.
Think of it as:
Collecting predictions from several strong models (e.g., Cubist, MARS, Linear Reg),
Training a meta-model (like a regularized regression) to find the best way to combine them.
add_candidates() (used below) essentially says: “take all the top models from my tuning results and add them as candidates for stacking.”
Building a model stack
1. Define candidate members
2. Initialize a data stack object
3. Add candidate ensemble members to the data stack
4. Evaluate how to combine their predictions
5. Fit candidate ensemble members with non-zero stacking coefficients
6. Predict on new data!
Start the stack and add members
Collect all of the resampling results for all model configurations.
chi_stack <-
  stacks() %>%
  add_candidates(chi_res)
add_candidates() pulls in:
the best-performing configurations from each workflow,
their predictions on the resamples,
and stores them in the stacking object for ensembling.
You’re preparing to build a meta-model that learns how to combine all your best models for better performance.
By default, the meta-learner is a penalized linear regression: a lasso (L1-regularized) model whose coefficients are constrained to be non-negative.
Estimate weights for each candidate
Which configurations should be retained? Uses a penalized linear model:
set.seed(122)
chi_stack_res <- blend_predictions(chi_stack)
chi_stack_res
#> # A tibble: 5 × 3
#> member type weight
#> <chr> <chr> <dbl>
#> 1 pca_cubist_1_1 cubist_rules 0.351
#> 2 basic_cubist_1_1 cubist_rules 0.260
#> 3 pca_cubist_1_3 cubist_rules 0.203
#> 4 basic_cubist_1_6 cubist_rules 0.157
#> 5 pca_lm_2_1 linear_reg 0.0520
This is the key ensembling step. It:
Takes the candidate models from chi_stack (you added those earlier with add_candidates()),
Uses the saved resample predictions to learn how to combine them,
Fits a meta-model (by default, a lasso with non-negative coefficients) to find the optimal weights for each candidate's predictions,
Returns a new object (chi_stack_res) that contains the weighted ensemble.
The output tibble shows which candidate models were retained in the stack and their respective weights in the final prediction.
The stack kept 5 models (it discards weak/irrelevant ones automatically).
All top models are Cubist models — no surprise given Cubist performed best earlier.
One Linear Regression model made it in with a small weight (5.2%).
The rest were effectively ignored (their weights shrunk to 0) because they were redundant (correlated with better models), noisy, or weak predictors.
The lasso penalty shrinks unhelpful weights to exactly 0.
How did it do?
The overall results of the penalized model:
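The figure on this slide can presumably be reproduced with the default stacks plot (a sketch):
autoplot(chi_stack_res)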
x axis: the penalty value used in the regularized regression (meta-learner).
The best balance between complexity and performance was found at penalty = 1e-3.
num_members panel: a higher penalty keeps fewer members in the stack.
rmse panel: there is a sweet spot (around 1e-3) where prediction error is lowest.
rsq panel: that same penalty also gives the best variance explained.
The chosen penalty (1e-3) achieves the best accuracy while using a modest number of members (about 4), a good tradeoff between performance and simplicity.
Top panel: what you see is the average number of non-zero-weight members across all resamples, not the count from a single fit.
What does it use?
autoplot(chi_stack_res, type = "weights")
This shows the weights assigned to each member of the stacking ensemble.
The header saying penalty = 0.001 in the bar plot is important: it tells you which regularization strength was used to train the meta-learner (stacking model) that produced these weights.
Lower penalty (e.g., 0.0001) → allows more models to contribute (less shrinkage).
Higher penalty (e.g., 0.1) → aggressively shrinks weights to 0 (fewer models retained).
Fit the required candidate models
For each member we retain in the stack, we need its model fit on the entire training set.
chi_stack_res <- fit_members(chi_stack_res)
The test set: best Cubist model
We can pull out the results and the workflow to fit the single best cubist model.
best_cubist <-
  chi_res %>%
  extract_workflow_set_result("pca_cubist") %>%
  select_best()
cubist_res <-
  chi_res %>%
  extract_workflow("pca_cubist") %>%
  finalize_workflow(best_cubist) %>%
  last_fit(split = chi_split, metrics = metric_set(mae))
The test set: stack ensemble
We don’t have last_fit() for stacks (yet), so we manually make predictions.
stack_pred <-
  predict(chi_stack_res, chi_test) %>%
  bind_cols(chi_test)
Compare the results
Single best versus the stack:
collect_metrics(cubist_res)
#> # A tibble: 1 × 4
#> .metric .estimator .estimate .config
#> <chr> <chr> <dbl> <chr>
#> 1 mae standard 0.669 Preprocessor1_Model1
stack_pred %>% mae(ridership, .pred)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 mae standard 0.670
Here we compare the final test set performance of two models:
A tuned Cubist + PCA model (cubist_res)
A stacked ensemble (stack_pred)
Both models perform very similarly.
The stack didn’t significantly improve over the best single model, which makes sense since Cubist was already dominant in the stack. You might prefer the single Cubist model for simplicity, or keep the stack to hedge across multiple models.
Plot the test set
library(probably)
cubist_res %>%
  collect_predictions() %>%
  ggplot(aes(ridership, .pred)) +
  geom_point(alpha = 1 / 2) +
  geom_abline(lty = 2, col = "green") +
  coord_obs_pred()
Points close to the green line → accurate predictions.
Points above the line → predictions are too high.
Points below the line → predictions are too low.
If you see a pattern or curve, your model might be missing a key relationship.