Tidymodeling

Introduction

Welcome!

What you should know

  • You can use the magrittr %>% or base R |> pipe

  • You are familiar with functions from dplyr, tidyr, ggplot2

  • You have some exposure to basic statistical concepts like linear models and residuals

  • You do not need intermediate or expert familiarity with modeling or ML

  • You can work with R Markdown and Quarto

What is tidymodels?

library(tidymodels)
#> ── Attaching packages ──────────────────────────── tidymodels 1.3.0 ──
#> ✔ broom        1.0.7     ✔ rsample      1.2.1
#> ✔ dials        1.4.0     ✔ tibble       3.2.1
#> ✔ dplyr        1.1.4     ✔ tidyr        1.3.1
#> ✔ infer        1.0.7     ✔ tune         1.3.0
#> ✔ modeldata    1.4.0     ✔ workflows    1.2.0
#> ✔ parsnip      1.3.1     ✔ workflowsets 1.1.0
#> ✔ purrr        1.0.4     ✔ yardstick    1.3.2
#> ✔ recipes      1.2.0
#> ── Conflicts ─────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()

Let’s install some packages

# Install the packages for the workshop
pkgs <- 
  c("bonsai", "Cubist", "doParallel", "earth", "embed", "finetune", 
    "forested", "lightgbm", "lme4", "parallelly", "plumber", "probably", 
    "ranger", "rpart", "rpart.plot", "rules", "splines2", "stacks", 
    "text2vec", "textrecipes", "tidymodels", "vetiver")

install.packages(pkgs)

Also, for Part II, you should install the newest version of the dials package. To check that you have it, you can run:

rlang::check_installed("dials", version = "1.4.0")

The Course - Part I

  • Your data budget
  • What makes a model
  • Evaluating models
  • Tuning models

Running Example Part I: The Whole Game

  • Minimal version of predictive modeling process
  • Feature engineering and tuning as iterative extensions

Data Part I: Data on forests in Washington

  • The U.S. Forest Service maintains ML models to predict whether a plot of land is “forested.”
  • This classification is important for all sorts of research, legislation, and land management purposes.
  • Plots are typically remeasured every 10 years and this dataset contains the most recent measurement per plot.
  • Type ?forested to learn more about this dataset, including references (make sure to do this before next week’s lecture).

The Course - Part II

  • Feature engineering using recipes
  • Tuning hyperparameters (grid search)
  • Grid search via racing
  • Iterative search

Hotel Data

We’ll use data on hotels to predict the cost of a room.

The data are in the modeldata package.

The Course - Case Studies

Case studies offer even more practice and a chance to try out other packages in the tidymodels ecosystem.

Some thoughts before we get started 💭

Prediction

  • The mechanics of prediction are easy:
    • Plug in values of predictors to the model equation
    • Calculate the predicted value of the response variable
  • Getting it right is hard!
    • There is no guarantee the model estimates you have are correct
    • Or that your model will perform as well with new data as it did with your sample data
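
The two mechanical steps above can be sketched in a few lines of base R. This is a minimal illustration, not part of the course material: the built-in mtcars data and the model mpg ~ wt are stand-ins chosen only because they need no extra packages.

```r
# Fit a simple linear model (mtcars and mpg ~ wt are illustrative stand-ins)
fit <- lm(mpg ~ wt, data = mtcars)

# Step 1: plug a predictor value into the fitted model equation by hand
new_wt <- 3
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["wt"]] * new_wt

# Step 2: let predict() calculate the same predicted response
by_predict <- predict(fit, newdata = data.frame(wt = new_wt))

# The two routes agree; the hard part is whether the estimates are any good
all.equal(by_hand, unname(by_predict))
```

Nothing in this calculation says whether the prediction is accurate for new data, which is the difficulty the second bullet describes.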

Spending our data

  • Several steps to create a useful model: parameter estimation, model selection, performance assessment, etc.

  • Doing all of this on all of the data we have available can lead to overfitting

  • Allocate specific subsets of data to different tasks, rather than spending the largest possible amount on model parameter estimation alone (which is what we’ve done so far)

Splitting data

  • Training set:
    • Sandbox for model building
    • Spend most of your time using the training set to develop the model
    • Majority of the data (usually 80%)
  • Testing set:
    • Held in reserve to determine efficacy of one or two chosen models
    • Critical to look at it only once; otherwise it becomes part of the modeling process
    • Remainder of the data (usually 20%)
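
As a rough sketch of what an 80/20 split might look like with rsample (attached by tidymodels), the snippet below uses the built-in mtcars data as a stand-in for the course data. The object names car_split, car_train, and car_test, and the seed value, are arbitrary choices for illustration.

```r
library(rsample)

# Set a seed so the random split is reproducible (the value is arbitrary)
set.seed(123)

# Reserve roughly 80% of the rows for training; mtcars is only a stand-in
car_split <- initial_split(mtcars, prop = 0.8)

# Extract the two pieces of the split
car_train <- training(car_split)  # sandbox for model building
car_test  <- testing(car_split)   # held in reserve, looked at once

nrow(car_train)
nrow(car_test)
```

All model development would then use car_train, with car_test touched only for the final assessment of one or two chosen models.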

Your turn

How are statistics and machine learning related?

How are they similar? Different?

The “two cultures”

model first vs. data first

inference vs. prediction

What is machine learning?