Chapter 11 Data preparation

Next, we use a recipe() to build a set of steps for data preprocessing and feature engineering.

  • First, we must tell the recipe() what our model is going to be (using a formula here) and what our training data is.
  • step_novel() converts the nominal predictors to factors and adds a catch-all level for values that were not seen in the training data.
  • step_dummy() then converts the factor columns into (one or more) numeric binary (0 and 1) variables for the levels of the training data.
  • step_zv() removes any numeric variables that have zero variance.
  • step_normalize() normalizes (centers and scales) the numeric variables.
housing_rec <-
  recipe(median_house_value ~ ., data = new_train) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())

# Show the content of our recipe
housing_rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          9
## 
## Operations:
## 
## Novel factor level assignment for all_nominal(), -all_outcomes()
## Dummy variables from all_nominal()
## Zero variance filter on all_predictors()
## Centering and scaling for all_predictors()
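
Printing a recipe only lists the planned operations; nothing is estimated until the recipe is prepared with prep() and applied with bake(). The following is a minimal sketch of that workflow. It assumes a small made-up data frame (toy_train, a hypothetical stand-in for new_train with only two predictors) and a recipes version that supports bake(new_data = NULL).

```r
library(recipes)

# Hypothetical toy data standing in for new_train (values are made up)
toy_train <- data.frame(
  median_house_value = c(100, 150, 200, 250, 300, 350),
  median_income      = c(2.5, 3.0, 4.5, 6.0, 3.5, 5.0),
  ocean_proximity    = c("INLAND", "NEAR BAY", "INLAND",
                         "NEAR BAY", "INLAND", "NEAR BAY"),
  stringsAsFactors   = FALSE
)

# Same sequence of steps as housing_rec, applied to the toy data
toy_rec <-
  recipe(median_house_value ~ ., data = toy_train) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())

# prep() estimates everything the steps need from the training data
toy_prep <- prep(toy_rec, training = toy_train)

# bake(new_data = NULL) returns the processed training data
toy_baked <- bake(toy_prep, new_data = NULL)
names(toy_baked)
```

In this sketch the dummy column for the catch-all level contains only zeros in the training data, so step_zv() drops it before normalization.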

Let’s have a closer look at the different components of the recipe.

11.1 recipe()

First, we create a simple recipe containing only an outcome (median_house_value) and predictors (all other variables in the dataset, written as .). To demonstrate the use of recipes step by step, we store it in a new object named rec:

rec <- recipe(median_house_value ~ ., data = new_train)

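In the formula median_house_value ~ ., the left-hand side names the outcome, and the . on the right-hand side stands for all remaining columns, which are used as predictors.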
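
To check which role the formula assigned to each column, summary() can be called on the recipe. This is a small sketch on a made-up data frame (toy_train is hypothetical and only mimics the structure of new_train):

```r
library(recipes)

# Hypothetical toy data (values are made up)
toy_train <- data.frame(
  median_house_value = c(100, 150, 200, 250),
  median_income      = c(2.5, 3.0, 4.5, 6.0),
  ocean_proximity    = c("INLAND", "NEAR BAY", "INLAND", "NEAR BAY"),
  stringsAsFactors   = FALSE
)

toy_rec <- recipe(median_house_value ~ ., data = toy_train)

# One row per variable, with its type and its role (outcome or predictor)
rec_info <- summary(toy_rec)
rec_info
```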

11.2 Helper functions

Here are some helper functions for selecting sets of variables:

  • all_predictors(): Each x variable (right side of ~)
  • all_outcomes(): Each y variable (left side of ~)
  • all_numeric(): Each numeric variable
  • all_nominal(): Each categorical variable (e.g. factor, string)
  • dplyr::select() helpers such as starts_with("Lot_")
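
The helpers can be combined, and a leading minus sign excludes variables from a selection. The sketch below uses step_center() only as a convenient way to see which columns a selector picks up; tidy() on the prepared recipe lists the affected columns. The toy_train data frame is a hypothetical example, not the housing data.

```r
library(recipes)

# Hypothetical toy data (values are made up)
toy_train <- data.frame(
  median_house_value = c(100, 150, 200, 250),
  median_income      = c(2.5, 3.0, 4.5, 6.0),
  ocean_proximity    = c("INLAND", "NEAR BAY", "INLAND", "NEAR BAY"),
  stringsAsFactors   = FALSE
)

toy_rec <- recipe(median_house_value ~ ., data = toy_train)

# all_numeric() matches every numeric column, including the outcome
sel_numeric <- toy_rec %>%
  step_center(all_numeric()) %>%
  prep()
numeric_terms <- tidy(sel_numeric, number = 1)$terms

# Adding -all_outcomes() restricts the selection to numeric predictors
sel_predictors <- toy_rec %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  prep()
predictor_terms <- tidy(sel_predictors, number = 1)$terms

numeric_terms
predictor_terms
```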

11.3 step_novel()

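step_novel() converts all selected nominal variables to factors and adds a catch-all level (named "new" by default) to each factor. Values that did not occur in the training data, such as a new category in the test set, are assigned to this level when the recipe is applied, so they can still be processed by later steps. Missing values remain missing.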

rec %>%
  step_novel(all_nominal(), -all_outcomes())
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          9
## 
## Operations:
## 
## Novel factor level assignment for all_nominal(), -all_outcomes()
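
A minimal sketch of what this step does to unseen values, assuming two made-up data frames (toy_train and toy_test are hypothetical; "ISLAND" occurs only in the test data):

```r
library(recipes)

# Hypothetical training data with two categories
toy_train <- data.frame(
  median_house_value = c(100, 150, 200, 250),
  ocean_proximity    = c("INLAND", "NEAR BAY", "INLAND", "NEAR BAY"),
  stringsAsFactors   = FALSE
)

# Hypothetical test data containing a category not seen in training
toy_test <- data.frame(
  median_house_value = c(300, 120),
  ocean_proximity    = c("ISLAND", "INLAND"),
  stringsAsFactors   = FALSE
)

novel_prep <-
  recipe(median_house_value ~ ., data = toy_train) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  prep(training = toy_train)

baked_test <- bake(novel_prep, new_data = toy_test)

# The factor keeps the training levels plus the catch-all level "new"
levels(baked_test$ocean_proximity)

# "ISLAND" was not seen in training, so it is mapped to "new"
as.character(baked_test$ocean_proximity)
```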

11.4 step_dummy()

step_dummy() converts nominal data into one or more numeric binary (0 and 1) dummy variables.

rec %>%
  step_dummy(all_nominal())
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          9
## 
## Operations:
## 
## Dummy variables from all_nominal()
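
The sketch below shows the effect on a made-up data frame (toy_train is hypothetical). With the default settings, a factor with three levels is replaced by two dummy columns, because the first level serves as the reference.

```r
library(recipes)

# Hypothetical toy data with a three-level categorical predictor
toy_train <- data.frame(
  median_house_value = c(100, 150, 200, 250, 300, 350),
  ocean_proximity    = c("INLAND", "ISLAND", "NEAR BAY",
                         "INLAND", "ISLAND", "NEAR BAY"),
  stringsAsFactors   = FALSE
)

dummy_prep <-
  recipe(median_house_value ~ ., data = toy_train) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  prep(training = toy_train)

dummy_baked <- bake(dummy_prep, new_data = NULL)

# The original column is gone; the new columns start with "ocean_proximity_"
dummy_cols <- grep("^ocean_proximity_", names(dummy_baked), value = TRUE)
dummy_cols
dummy_baked
```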

11.5 step_zv()

step_zv() removes zero variance variables (variables that contain only a single value).

rec %>%
  step_zv(all_predictors())
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          9
## 
## Operations:
## 
## Zero variance filter on all_predictors()

When the recipe is applied to the data set, a column could end up containing only a single value, for example only zeros. Such a “zero-variance predictor” carries no information. While some R functions will not produce an error for such predictors, they usually cause warnings and other issues. step_zv() removes columns from the data when the training set data have a single value. This step should be added to the recipe after step_dummy(), because dummy variables can themselves be constant.
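
As a small illustration, the sketch below adds a constant column to a made-up data frame (toy_train and constant_col are hypothetical names) and shows that step_zv() drops it:

```r
library(recipes)

# Hypothetical toy data; constant_col has the same value in every row
toy_train <- data.frame(
  median_house_value = c(100, 150, 200, 250),
  median_income      = c(2.5, 3.0, 4.5, 6.0),
  constant_col       = c(1, 1, 1, 1)
)

zv_prep <-
  recipe(median_house_value ~ ., data = toy_train) %>%
  step_zv(all_predictors()) %>%
  prep(training = toy_train)

zv_baked <- bake(zv_prep, new_data = NULL)

# constant_col is no longer part of the processed data
names(zv_baked)

# tidy() on the prepared step lists the removed columns
removed <- tidy(zv_prep, number = 1)$terms
removed
```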

11.6 step_normalize()

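step_normalize() centers and then scales numeric variables so that they have a mean of 0 and a standard deviation of 1. Note that all_numeric() also selects a numeric outcome such as median_house_value; the full recipe above uses all_predictors() instead, which leaves the outcome on its original scale.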

rec %>%
  step_normalize(all_numeric())
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          9
## 
## Operations:
## 
## Centering and scaling for all_numeric()
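
The following sketch checks the result on a made-up data frame (toy_train is hypothetical). It assumes that only the predictors should be normalized, so it uses all_numeric() together with -all_outcomes():

```r
library(recipes)

# Hypothetical toy data (values are made up)
toy_train <- data.frame(
  median_house_value = c(100, 150, 200, 250),
  median_income      = c(2.5, 3.0, 4.5, 6.0)
)

norm_prep <-
  recipe(median_house_value ~ ., data = toy_train) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  prep(training = toy_train)

norm_baked <- bake(norm_prep, new_data = NULL)

# After normalization the predictor has mean 0 and standard deviation 1
mean(norm_baked$median_income)
sd(norm_baked$median_income)

# The means and standard deviations estimated from the training data
tidy(norm_prep, number = 1)
```

The statistics are estimated once from the training data in prep(); bake() reuses them when the recipe is applied to new data.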

Now it’s time to specify and then fit our models.