This function uses a good default, but the right choice depends on your specific goal and data
What is set.seed()?
To create that split of the data, R generates "pseudo-random" numbers: while they are made to behave like random numbers, their generation is deterministic given a "seed".
This allows us to reproduce results by setting that seed.
Which seed you pick doesn't matter, as long as you don't try a bunch of seeds and pick the one that gives you the best performance.
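Running the same code after setting the same seed reproduces the same "random" draws; a minimal sketch:

set.seed(1234)
rnorm(3)  # three pseudo-random numbers

set.seed(1234)
rnorm(3)  # same seed, same three numbers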
forested_train %>%
  ggplot(aes(x = forested, fill = tree_no_tree)) +
  geom_bar()

forested_train %>%
  ggplot(aes(x = precip_annual, fill = forested, group = forested)) +
  geom_histogram(position = "identity", alpha = .7)

forested_train %>%
  ggplot(aes(x = precip_annual, fill = forested, group = forested)) +
  geom_histogram(position = "fill")

forested_train %>%
  ggplot(aes(x = lon, y = lat, col = forested)) +
  geom_point()
The whole game - status update
Build a model
Open Worksheet 2
And now it's time for…
How do you fit a linear model in R?
How many different ways can you think of?
lm for linear model
glmnet for regularized regression
keras for regression using TensorFlow
stan for Bayesian regression
spark for large data sets
brulee for regression using torch
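In parsnip, all of these are engines for one and the same model specification. A minimal sketch (the penalty value is illustrative; glmnet requires one, and each engine's package must be installed):

library(parsnip)

linear_reg() %>% set_engine("lm")                    # ordinary least squares
linear_reg(penalty = 0.1) %>% set_engine("glmnet")   # regularized regression
linear_reg() %>% set_engine("stan")                  # Bayesian regression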
To specify a model
Choose a model
Specify an engine
Set the mode
logistic_reg()
#> Logistic Regression Model Specification (classification)
#> 
#> Computational engine: glm
The computational engine indicates how the model is fit, such as with a specific R package implementation or even methods outside of R like Keras or Stan
Extension/Challenge: Edit this code to use a different model. For example, try using a conditional inference tree as implemented in the partykit package by changing the engine - or try an entirely different model type!
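One possible solution sketch, assuming the bonsai extension package is installed (it registers the "partykit" engine for tree-based parsnip models):

library(bonsai)

decision_tree(mode = "classification") %>%
  set_engine("partykit")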
Models we'll be using today
Logistic regression
Decision trees
Logistic regression
Logit of the outcome probability modeled as a linear combination of the predictors:

log(p / (1 − p)) = β₀ + β₁x₁ + ⋯ + βₖxₖ
Find a sigmoid curve that separates the two classes
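A minimal sketch of specifying and fitting this model with parsnip, assuming the forested_train data from earlier:

lr_spec <- logistic_reg()  # default engine: glm
lr_fit <- fit(lr_spec, forested ~ ., data = forested_train)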
Decision trees
Series of splits or if/then statements based on predictors
First the tree grows until some stopping condition is met (e.g., maximum depth, no more data to split)
Then the tree is pruned to reduce its complexity
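A minimal sketch of a decision tree specification; the argument values here are illustrative, with tree_depth limiting growth and cost_complexity controlling pruning:

tree_spec <- decision_tree(
  tree_depth = 10,         # stop growing beyond this depth
  cost_complexity = 0.01,  # prune splits that add too little
  mode = "classification"
)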
All models are wrong, but some are useful!
Logistic regression
Decision trees
A model workflow
Workflows bind preprocessors and models
What is wrong with this?
Why a workflow()?
Workflows handle new data better than base R tools, for example when prediction data contain new factor levels
You can use other preprocessors besides formulas (more on feature engineering in part 2 of the course)
They can help organize your work when working with multiple models
Most importantly, a workflow captures the entire modeling process: fit() and predict() apply to the preprocessing steps in addition to the actual model fit
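A minimal sketch, assuming forested_train and forested_test splits exist:

lr_wflow <- workflow(forested ~ ., logistic_reg())

lr_wflow_fit <- fit(lr_wflow, data = forested_train)  # preprocessing + model fit
predict(lr_wflow_fit, new_data = forested_test)       # preprocessing reapplied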
set.seed(853)
forested_val_split <- initial_validation_split(forested)
validation_set(forested_val_split)
#> # A tibble: 1 × 2
#>   splits              id        
#>   <list>              <chr>     
#> 1 <split [4264/1421]> validation
A validation set is just another type of resample
Decision tree 🌳
Random forest 🌳🌲🌴🌵🌴🌳🌳🌴🌲🌵🌴🌲🌳🌴🌳🌵🌵🌴🌲🌲🌳🌴🌳🌴🌲🌴🌵🌴🌲🌴🌵🌲🌵🌴🌲🌳🌴🌵🌳🌴🌳
Random forest 🌳🌲🌴🌵🌳🌳🌴🌲🌵🌴🌳🌵
Ensemble many decision tree models
All the trees vote! 🗳️
Bootstrap aggregating + random predictor sampling
Often works well without tuning hyperparameters (more on this later!), as long as there are enough trees
Create a random forest model
rf_spec <- rand_forest(trees = 1000, mode = "classification")
rf_spec
#> Random Forest Model Specification (classification)
#> 
#> Main Arguments:
#>   trees = 1000
#> 
#> Computational engine: ranger
Create a random forest model
rf_wflow <- workflow(forested ~ ., rf_spec)
rf_wflow
#> ══ Workflow ══════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: rand_forest()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────
#> forested ~ .
#> 
#> ── Model ─────────────────────────────────────────────────────────────
#> Random Forest Model Specification (classification)
#> 
#> Main Arguments:
#>   trees = 1000
#> 
#> Computational engine: ranger
Your turn
Use fit_resamples() and rf_wflow to:
keep predictions
compute metrics
Evaluating model performance
ctrl_forested <- control_resamples(save_pred = TRUE)

# Random forest uses random numbers so set the seed first
set.seed(2)

rf_res <- fit_resamples(rf_wflow, forested_folds, control = ctrl_forested)
collect_metrics(rf_res)
#> # A tibble: 3 × 6
#>   .metric     .estimator   mean     n std_err .config             
#>   <chr>       <chr>       <dbl> <int>   <dbl> <chr>               
#> 1 accuracy    binary     0.918     10 0.00585 Preprocessor1_Model1
#> 2 brier_class binary     0.0618    10 0.00337 Preprocessor1_Model1
#> 3 roc_auc     binary     0.972     10 0.00309 Preprocessor1_Model1
The whole game - status update
The final fit
Suppose that we are happy with our random forest model.
Letβs fit the model on the training set and verify our performance using the test set.
We've seen fit() and predict() (+ augment()), but there is a shortcut:
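That shortcut is last_fit(), which fits on the training set and computes metrics on the test set in one step. A minimal sketch, assuming the original initial_split() object is named forested_split (an illustrative name):

rf_final <- last_fit(rf_wflow, forested_split)
collect_metrics(rf_final)  # metrics computed on the test set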