Model

Once our features have been preprocessed into a format ready for the modeling algorithms (see Data), they can be used in the model selection process.

Note

The type of data preprocessing is dependent on the type of model being fit. Kuhn and Silge [2021] provide recommendations for baseline levels of preprocessing that are needed for various model functions (see this table).

The following resources provide detailed information about different regression and classification models in tidymodels:

Next, we discuss some important model selection topics:

  • Model selection

  • Best fitting model

  • Mean squared error

  • Bias-variance trade-off



In the next sections, we’ll discuss the process of model building in detail.

Select algorithm

One of the hardest parts of the data science lifecycle can be finding the right algorithm for the job, since different algorithms are better suited to different types of data and different problems. For some datasets the best algorithm could be a linear model, while for other datasets it is a random forest or a neural network. No model is a priori guaranteed to work best. This fact is known as the “No Free Lunch (NFL) theorem” [Wolpert, 1996].



Resources

Some of the most common algorithms are:

  • Linear and Polynomial Regression

  • Logistic Regression

  • k-Nearest Neighbors

  • Support Vector Machines

  • Decision Trees

  • Random Forests

  • Neural Networks

  • Ensemble methods like Gradient Boosted Decision Trees (GBDT)

A model ensemble, in which the predictions of multiple individual learners are aggregated to make one prediction, can produce a high-performance final model. Many ensemble methods, such as bagging and boosting, combine the predictions from multiple versions of the same type of model (e.g., classification trees).

Note that the only way to know for sure which model is best would be to evaluate them all [Géron, 2019]. Since this is often not possible, in practice you make some assumptions about the data and evaluate only a few reasonable models. For example, for simple tasks you may evaluate linear models with various levels of regularization as well as some ensemble methods like Gradient Boosted Decision Trees (GBDT). For very complex problems, you may evaluate various deep neural networks.
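
For instance, candidate models like a regularized linear model and a gradient boosted tree can be specified with parsnip as follows. This is only a sketch; the penalty, mixture, and trees values are arbitrary placeholders:

library(tidymodels)

# regularized linear regression (Lasso when mixture = 1) via the glmnet engine
lasso_spec <-
  linear_reg(penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("regression")

# gradient boosted decision trees via the xgboost engine
gbdt_spec <-
  boost_tree(trees = 500) %>%
  set_engine("xgboost") %>%
  set_mode("regression")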

Tidymodels provides an overview of different algorithm types:

Train and evaluate

In the first phase of the model building process, a variety of initial models are generated and their performance is compared during model evaluation. As a part of this process, we also need to decide which features we want to include in our model (“feature selection”). Therefore, let’s first take a look at the topic of feature selection.

Feature selection

There are a number of different strategies for feature selection that can be applied and some of them are performed simultaneously with model building.

Note

Feature selection is the process of selecting a subset of relevant features (variables, predictors) for our model.
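
As a sketch, simple filter-based feature selection can be expressed directly as recipe steps (the threshold below is an arbitrary placeholder; the training recipe in the next section uses step_corr() in the same way):

# illustrative filter-based feature selection via recipe steps
fs_rec <-
  recipe(your_y_label ~ ., data = train_data) %>%
  step_zv(all_predictors()) %>%                         # drop zero-variance predictors
  step_corr(all_numeric_predictors(), threshold = 0.9)  # drop highly correlated predictors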

If you want to learn more about feature selection methods, review the following content:

Jupyter Book

Training

Now we can use the pipeline we created in Data (see last section) and combine it with tidymodels algorithms of our choice.

To combine the data preparation recipe with the model building, we use the package workflows. A workflow is an object that can bundle together your pre-processing recipe, modeling, and even post-processing requests (like calculating the RMSE).

Let's use the popular XGBoost algorithm as an example:

library(tidymodels)
library(xgboost)
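
# cv_folds (the cross-validation folds used below) is assumed to come from the
# Data chapter; a typical definition would look like this:
cv_folds <- vfold_cv(train_data, v = 5)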

# data preprocessing recipe
df_rec <-
  recipe(your_y_label ~ ., data = train_data) %>%
  step_impute_median(all_numeric(), -all_outcomes()) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>% 
  step_dummy(all_nominal_predictors()) %>%
  step_corr(all_predictors(), threshold = 0.7, method = "spearman") 

# model specification
xgb_spec <- 
  boost_tree() %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")   

# workflow pipeline
xgb_wflow <-
  workflow() %>%
  add_recipe(df_rec) %>%
  add_model(xgb_spec)

# fit the model with cross-validation
xgb_res <-
  xgb_wflow %>%
  fit_resamples(
    resamples = cv_folds,
    control = control_resamples(save_pred = TRUE)
  )

Evaluation

In model evaluation, we mainly assess the model’s performance metrics (using an evaluation set) and examine residual plots to understand how well the models work. Our first goal in this process is to shortlist a few (two to five) promising models.

# evaluate model
xgb_res %>% collect_metrics(summarize = TRUE)
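
Residuals can also be inspected graphically. The following sketch (assuming ggplot2 and the xgb_res object from above, with your_y_label standing in for the actual outcome column) plots residuals against predicted values:

library(ggplot2)

# residuals vs. predicted values from the resampling predictions
collect_predictions(xgb_res) %>%
  mutate(residual = your_y_label - .pred) %>%
  ggplot(aes(x = .pred, y = residual)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Predicted value", y = "Residual")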

The tidymodels package yardstick provides an extensive list of possible metrics to quantify the quality of model predictions:
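
For example, a custom metric set can be passed to fit_resamples(); the metrics chosen below are just one reasonable combination for a regression problem:

# define a custom set of regression metrics and use it during resampling
reg_metrics <- metric_set(rmse, rsq, mae)

xgb_res_metrics <-
  xgb_wflow %>%
  fit_resamples(
    resamples = cv_folds,
    metrics = reg_metrics,
    control = control_resamples(save_pred = TRUE)
  )

xgb_res_metrics %>% collect_metrics(summarize = TRUE)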

Tuning

After we have identified a shortlist of promising models, it usually makes sense to tune their hyper-parameters.

Note

Hyper-parameters are parameters that are not directly learned from the data during model training; they have to be specified before the model is fit.

In tidymodels, hyper-parameters are passed as arguments to the model specification, like penalty for the amount of regularization in a Lasso model or neighbors for the number of neighbors in a k-nearest neighbors model.

Instead of trying to find good hyper-parameters manually, it is recommended to search the hyper-parameter space for the best cross-validation score using one of the two generic approaches provided in the tidymodels package tune:

  • tune_grid exhaustively evaluates all combinations of a pre-defined grid of parameter values.

  • tune_bayes identifies promising hyper-parameter values using Bayesian optimization, an iterative search method.

The tune_grid approach is fine when you are exploring relatively few combinations, but when the hyper-parameter search space is large, it is often preferable to use tune_bayes instead.
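
As a minimal sketch (the choice of trees and tree_depth as tuning parameters and the grid size are arbitrary), grid search for the XGBoost workflow from above could look like this:

# mark hyper-parameters for tuning in the model specification
xgb_tune_spec <-
  boost_tree(trees = tune(), tree_depth = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# reuse the existing workflow with the tunable specification
xgb_tune_wflow <-
  xgb_wflow %>%
  update_model(xgb_tune_spec)

# evaluate candidate combinations with cross-validation
xgb_tune_res <-
  tune_grid(
    xgb_tune_wflow,
    resamples = cv_folds,
    grid = 10   # let tune pick 10 candidate combinations
  )

# inspect the best combinations and finalize the workflow
show_best(xgb_tune_res, metric = "rmse")
best_params <- select_best(xgb_tune_res, metric = "rmse")
xgb_final_wflow <- finalize_workflow(xgb_tune_wflow, best_params)

finalize_workflow() plugs the winning values back into the workflow so that it can be used with last_fit() later on.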

Voting and stacking

It often makes sense to combine different models since the group (“ensemble”) will usually perform better than the best individual model, especially if the individual models make very different types of errors.

Note

Voting can be useful for a set of equally well performing models in order to balance out their individual weaknesses.

Model voting combines the predictions of multiple models of any type, thereby creating an ensemble meta-estimator:

  • In classification problems, the idea behind voting is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels.

  • In regression problems, we combine different machine learning regressors and return the average predicted values.
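
A simple regression voting ensemble can be built by hand, for example by averaging the predicted values of two fitted workflows. The sketch below assumes the hypothetical lasso_spec from the algorithm section, the df_rec and xgb_wflow objects defined above, and a hypothetical data frame new_df with the same predictor columns as train_data:

# fit two different regression models on the training data
lasso_wflow <- workflow() %>% add_recipe(df_rec) %>% add_model(lasso_spec)
lasso_fit <- fit(lasso_wflow, data = train_data)
xgb_fit <- fit(xgb_wflow, data = train_data)

# "vote" by averaging the predicted values of both models
voted_pred <-
  tibble(
    .pred_lasso = predict(lasso_fit, new_data = new_df)$.pred,
    .pred_xgb = predict(xgb_fit, new_data = new_df)$.pred
  ) %>%
  mutate(.pred_vote = (.pred_lasso + .pred_xgb) / 2)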

Stacked generalization is a method for combining estimators to reduce their biases. To this end, the predictions of each individual estimator are stacked together and used as input to a final estimator that computes the overall prediction. This final estimator is trained via cross-validation.

Note

Model stacking is an ensembling method that takes the outputs of many models and combines them into a new model that generates predictions informed by each of its members.

The tidymodels package stacks provides model stacking for both classification and regression.

Stacking
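
A minimal stacking sketch with the stacks package might look as follows. It assumes that the candidate results were created with control settings that save the predictions and the workflow (e.g., control_stack_resamples() or control_stack_grid()); the xgb_res object from above would need to be refitted with such settings first:

library(stacks)

# build and fit the ensemble from one or more candidate result sets
model_stack <-
  stacks() %>%
  add_candidates(xgb_res) %>%   # candidates from fit_resamples() or tuning results
  blend_predictions() %>%       # determine member weights via a regularized meta-learner
  fit_members()                 # refit the retained members on the full training set

# predict with the ensemble (new_df is a hypothetical data frame of new observations)
predict(model_stack, new_data = new_df)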

Evaluate best model

After we have tuned hyper-parameters and/or performed voting/stacking, we evaluate the best model (system) and its errors in detail.

In particular, we take a look at the specific errors that our model (system) makes, and try to understand why it makes them and what could fix the problem - like adding extra features or getting rid of uninformative ones, cleaning up outliers, etc. [Géron, 2019]. If possible, we also display the importance scores of our predictors. With this information, we may want to try dropping some of the less useful features to make sure our model generalizes well.
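
Importance scores can be obtained, for example, with the vip package (a sketch assuming the xgb_wflow and train_data objects from above):

library(vip)

# fit the workflow once on the full training set to inspect importance scores
xgb_fit <- fit(xgb_wflow, data = train_data)

# variable importance of the underlying xgboost model
xgb_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)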

Assess possible reasons for the worst predictions (i.e., the largest absolute residuals):

# collect the out-of-sample predictions from the resampling results
assess_res <- collect_predictions(xgb_res)

# obtain the 10 worst predictions (largest absolute residuals)
worst_predictions <-
  assess_res %>%
  mutate(residual = your_y_label - .pred) %>%
  arrange(desc(abs(residual))) %>%
  slice_head(n = 10)

# take a look at the corresponding training observations
errors <-
  train_data %>%
  dplyr::slice(worst_predictions$.row)

After evaluating the model (system) for a while, we eventually have a system that performs sufficiently well.

Evaluate on test set

Now is the time to evaluate the final model on the test set. If you did a lot of hyper-parameter tuning, the performance will usually be slightly worse than what you measured using cross-validation, because your system ends up fine-tuned to perform well on the validation data and will likely not perform as well on unseen data [Géron, 2019].

The tidymodels function last_fit emulates the process where, after determining the best model, the final fit is performed on the entire training set and then evaluated on the test set.

# final evaluation with test data
last_fit_xgb <- last_fit(xgb_wflow, split = data_split)

# show RMSE and RSQ
last_fit_xgb %>% 
  collect_metrics()

It is important to note that we don’t change the model (system) anymore to make the numbers look good on the test set; the improvements would be unlikely to generalize to new data. Instead, we use the metrics for our final evaluation to make sure the model performs sufficiently well regarding our success metrics from the planning phase.

Make predictions on new data

The tidymodels function predict() can be used for all types of models and uses the “type” argument for more specificity.

predict(object, new_data, type = NULL)

with

  • object: An object of class model_fit.

  • new_data: A rectangular data object, such as a data frame.

  • type: A single character value or NULL. Possible values are “numeric”, “class”, “prob”, “conf_int”, “pred_int”, “quantile”, “time”, “hazard”, “survival”, or “raw”. When NULL, predict() will choose an appropriate value based on the model’s mode.
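
For example, the finalized workflow from last_fit() can be extracted and used to predict on new observations (new_df is a hypothetical data frame with the same predictor columns as the training data):

# extract the workflow fitted on the full training set
final_wflow_fit <- extract_workflow(last_fit_xgb)

# numeric predictions for new data (type is chosen automatically for regression)
predict(final_wflow_fit, new_data = new_df)
predict(final_wflow_fit, new_data = new_df, type = "numeric")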

Challenges

In the following presentation, we cover some typical modeling challenges:

  • Poor quality data

  • Irrelevant features and feature engineering

  • Overfitting and regularization

  • Underfitting