Tidymodeling

Introduction

Welcome!

What you should know

You can use the magrittr %>% or base R |> pipe
You are familiar with functions from dplyr, tidyr, ggplot2
You have some exposure to basic statistical concepts like linear models and residuals
You do not need intermediate or expert familiarity with modeling or ML
You can work with R Markdown and Quarto
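
As a quick self-check on the first two points, the sketch below runs the same dplyr pipeline with each pipe. It assumes dplyr is installed and uses the built-in mtcars data purely for illustration (it is not part of the workshop data); the base pipe needs R 4.1.0 or later.

```r
library(dplyr)  # attaching dplyr also makes the magrittr pipe %>% available

# magrittr pipe
by_magrittr <- mtcars %>%
  filter(cyl == 4) %>%
  summarize(mean_mpg = mean(mpg))

# base R pipe (requires R >= 4.1.0)
by_base <- mtcars |>
  filter(cyl == 4) |>
  summarize(mean_mpg = mean(mpg))

# Both pipelines should return the same one-row data frame
identical(by_magrittr, by_base)
```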

What is tidymodels?

library(tidymodels)
#> ── Attaching packages ──────────────────────────── tidymodels 1.3.0 ──
#> ✔ broom        1.0.7     ✔ rsample      1.2.1
#> ✔ dials        1.4.0     ✔ tibble       3.2.1
#> ✔ dplyr        1.1.4     ✔ tidyr        1.3.1
#> ✔ infer        1.0.7     ✔ tune         1.3.0
#> ✔ modeldata    1.4.0     ✔ workflows    1.2.0
#> ✔ parsnip      1.3.1     ✔ workflowsets 1.1.0
#> ✔ purrr        1.0.4     ✔ yardstick    1.3.2
#> ✔ recipes      1.2.0
#> ── Conflicts ─────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()

Let’s install some packages

# Install the packages for the workshop
pkgs <- 
  c("bonsai", "Cubist", "doParallel", "earth", "embed", "finetune", 
    "forested", "lightgbm", "lme4", "parallelly", "plumber", "probably", 
    "ranger", "rpart", "rpart.plot", "rules", "splines2", "stacks", 
    "text2vec", "textrecipes", "tidymodels", "vetiver")

install.packages(pkgs)

For Part II, you should also have the newest version of the dials package installed. To check this, you can run:

rlang::check_installed("dials", version = "1.4.0")

The Course - Part I

Your data budget
What makes a model
Evaluating models
Tuning models

Running Example Part I: The Whole Game

Minimal version of predictive modeling process
Feature engineering and tuning as iterative extensions
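
One way such a minimal version might look is sketched below: split the data, fit a single model, and check it on held-out data. The decision tree, the 80/20 split, the seed, and the object names are all illustrative assumptions, not necessarily the choices made later in the course. The sketch assumes the tidymodels, forested, and rpart packages are installed.

```r
library(tidymodels)
library(forested)

set.seed(123)  # arbitrary seed, only for reproducibility

# 1. Spend the data: hold out a test set
forested_split <- initial_split(forested, prop = 0.8)
forested_train <- training(forested_split)
forested_test  <- testing(forested_split)

# 2. Specify a model (a decision tree is just one possible choice)
tree_spec <- decision_tree(mode = "classification") |>
  set_engine("rpart")

# 3. Fit it on the training set, using all other columns as predictors
tree_fit <- workflow(forested ~ ., tree_spec) |>
  fit(data = forested_train)

# 4. Predict on the test set and compute accuracy
test_accuracy <- augment(tree_fit, new_data = forested_test) |>
  accuracy(truth = forested, estimate = .pred_class)

test_accuracy
```

Feature engineering and tuning would then slot in as refinements of steps 2 and 3.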

Data Part I: Data on forests in Washington

The U.S. Forest Service maintains ML models to predict whether a plot of land is “forested.”
This classification is important for all sorts of research, legislation, and land management purposes.
Plots are typically remeasured every 10 years and this dataset contains the most recent measurement per plot.
Type ?forested to learn more about this dataset, including references (make sure to do this before next week’s lecture).
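
For a first look at the data from the console, something like the following should work. It assumes the forested package is installed and that the outcome column is the factor named forested, as described in the package help page.

```r
library(forested)

# Number of plots (rows) and variables (columns)
dim(forested)

# Variable names, including the outcome column `forested`
names(forested)

# How many plots fall into each outcome class
table(forested$forested)
```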

The Course - Part II

Feature engineering using recipes
Tuning hyperparameters (grid search)
Grid search via racing
Iterative search

Hotel Data

We’ll use data on hotels to predict the cost of a room.

The data are in the modeldata package.
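
A quick way to peek at the data is sketched below. It assumes the data set meant here is hotel_rates (available in recent modeldata versions) and that avg_price_per_room is the room-cost outcome; check ?hotel_rates to confirm both.

```r
library(modeldata)

# Size of the data set (assumed to be `hotel_rates`)
dim(hotel_rates)

# Distribution of the assumed outcome, the average price per room
summary(hotel_rates$avg_price_per_room)
```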

The Course - Case Studies

Case studies for even more practice and for trying out other packages in tidymodels.

Some thoughts before we get started 💭

Prediction

The mechanics of prediction are easy:
- Plug in values of predictors to the model equation
- Calculate the predicted value of the response variable
Getting it right is hard!
- There is no guarantee the model estimates you have are correct
- Or that your model will perform as well with new data as it did with your sample data
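
The mechanical part can be illustrated with a small base R sketch. The linear model of mpg on wt and the built-in mtcars data are illustrative assumptions only; the point is that predict() does nothing more than plug predictor values into the fitted equation.

```r
# Fit a simple linear model: mpg = b0 + b1 * wt
fit <- lm(mpg ~ wt, data = mtcars)

# A new observation: a hypothetical car with wt = 3 (3000 lbs)
new_car <- data.frame(wt = 3)

# Plug the predictor value into the model equation by hand
b <- coef(fit)
by_hand <- b[["(Intercept)"]] + b[["wt"]] * new_car$wt

# Let predict() do the same calculation
by_predict <- unname(predict(fit, newdata = new_car))

all.equal(by_hand, by_predict)
```

Nothing in this calculation says whether the estimated coefficients are right or how well the prediction will hold up on new data, which is the hard part.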

Spending our data

Creating a useful model takes several steps: parameter estimation, model selection, performance assessment, etc.
Doing all of these on the entire dataset can lead to overfitting
Allocate specific subsets of data for different tasks, as opposed to allocating the largest possible amount to the model parameter estimation only (which is what we’ve done so far)
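
To see why assessing a model on the data used to fit it is risky, consider the simulated sketch below. Everything in it (the sine-shaped signal, the noise level, the degree-12 polynomial, the sample sizes) is an illustrative assumption, written in base R only.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Simulate data from a known signal plus noise
simulate <- function(n) {
  x <- runif(n, 0, 1)
  data.frame(x = x, y = sin(2 * pi * x) + rnorm(n, sd = 0.3))
}

fit_data   <- simulate(30)    # data used to estimate the model
fresh_data <- simulate(1000)  # data the model has never seen

# Root mean squared error of a model on a given data set
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

# A deliberately flexible model that can chase noise
flexible <- lm(y ~ poly(x, 12), data = fit_data)

rmse_fit <- rmse(flexible, fit_data)
rmse_new <- rmse(flexible, fresh_data)

c(same_data = rmse_fit, new_data = rmse_new)
```

The error measured on the fitting data is typically the more optimistic of the two, which is the motivation for setting data aside.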

Splitting data

Training set:
- Sandbox for model building
- Spend most of your time using the training set to develop the model
- Majority of the data (usually 80%)
Testing set:
- Held in reserve to determine efficacy of one or two chosen models
- Critical to look at it only once; otherwise it becomes part of the modeling process
- Remainder of the data (usually 20%)
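
In tidymodels, this split is typically made with the rsample package. The sketch below is a minimal example that assumes rsample is installed and uses the built-in mtcars data as a stand-in; the seed and object names are arbitrary.

```r
library(rsample)

set.seed(123)  # the split is random, so set a seed for reproducibility

# Put 80% of the rows in the training set and the rest in the testing set
car_split <- initial_split(mtcars, prop = 0.8)

car_train <- training(car_split)
car_test  <- testing(car_split)

# Check how the rows were allocated
c(train = nrow(car_train), test = nrow(car_test))
```

All model development then happens on car_train; car_test is set aside until a final model has been chosen.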

Your turn

How are statistics and machine learning related?

How are they similar? Different?

Time: 3 minutes

The “two cultures”

model first vs. data first

inference vs. prediction

What is machine learning?