Chapter 11 Data preparation

Next, we use a recipe() to build a set of steps for data preprocessing and feature engineering.

First, we must tell the recipe() what our model is going to be (using a formula here) and what our training data is.
step_novel() converts the nominal predictors to factors and adds a catch-all level for values that do not occur in the training data.
step_dummy() then converts the factor columns into (one or more) numeric binary (0 and 1) variables for the levels of the training data.
step_zv() removes any numeric variables that have zero variance.
step_normalize() normalizes (centers and scales) the numeric variables.

housing_rec <-
  recipe(median_house_value ~ ., data = new_train) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())

# Show the content of our recipe
housing_rec

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          9
## 
## Operations:
## 
## Novel factor level assignment for all_nominal(), -all_outcomes()
## Dummy variables from all_nominal()
## Zero variance filter on all_predictors()
## Centering and scaling for all_predictors()

Let’s have a closer look at the different components of the recipe.

11.1 recipe()

First, we create a simple recipe containing only an outcome (median_house_value) and predictors (all other variables in the dataset, denoted by .). To demonstrate the use of recipes step by step, we store it in a new object named rec:

rec <- recipe(median_house_value ~ ., data = new_train)

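In the formula median_house_value ~ ., the variable on the left of ~ is the outcome, and the . on the right stands for all other columns of new_train, which are used as predictors.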
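
To check how a formula assigns roles, the recipe can be inspected with summary(). The following is a minimal sketch, not the chapter's data: toy_train is a hypothetical stand-in for new_train that only shares a few column names with it.

```r
library(recipes)

# Hypothetical stand-in for new_train (assumption: not the real housing data)
toy_train <- data.frame(
  median_house_value = c(150000, 230000, 310000, 180000),
  median_income      = c(2.5, 4.1, 6.3, 3.0),
  ocean_proximity    = factor(c("INLAND", "NEAR BAY", "NEAR BAY", "INLAND"))
)

toy_rec <- recipe(median_house_value ~ ., data = toy_train)

# summary() returns one row per variable, including the role it was assigned
roles <- summary(toy_rec)
roles
```

The role column should list median_house_value as the outcome and the remaining columns as predictors.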

11.2 Helper functions

Here are some helper functions for selecting sets of variables:

all_predictors(): Each x variable (right side of ~)
all_outcomes(): Each y variable (left side of ~)
all_numeric(): Each numeric variable
all_nominal(): Each categorical variable (e.g. factor, string)
dplyr::select() helpers such as starts_with("Lot_"), etc.
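
One way to see which columns a selector picks up is to use it in a step, prepare the recipe, and read the selected terms back with tidy(). The sketch below assumes hypothetical toy data and a small illustrative helper, selected_by(), which is not part of recipes.

```r
library(recipes)

# Hypothetical toy data; column names mimic the housing data
toy_train <- data.frame(
  median_house_value = c(150000, 230000, 310000, 180000),
  median_income      = c(2.5, 4.1, 6.3, 3.0),
  housing_median_age = c(41, 21, 52, 30),
  ocean_proximity    = factor(c("INLAND", "NEAR BAY", "NEAR BAY", "INLAND"))
)

toy_rec <- recipe(median_house_value ~ ., data = toy_train)

# Illustrative helper (not a recipes function): returns the column names
# that the first step of a recipe selected once the recipe is prepared
selected_by <- function(rec_with_step) {
  unique(tidy(prep(rec_with_step), number = 1)$terms)
}

# All numeric columns except the outcome
sel_numeric <- selected_by(
  toy_rec %>% step_normalize(all_numeric(), -all_outcomes())
)

# A dplyr-style helper combined with a role selector
sel_prefix <- selected_by(
  toy_rec %>% step_normalize(starts_with("median_"), -all_outcomes())
)

sel_numeric
sel_prefix
```

Prefixing a selector with a minus sign, as in -all_outcomes(), removes the matching columns from the selection.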

11.3 step_novel()

step_novel() converts the selected nominal variables to factors and adds a catch-all level (named "new" by default) to each of them. Values that did not occur in the training data, for example in the test set, are assigned to this level when the recipe is applied. Missing values will remain missing.

rec %>%
  step_novel(all_nominal(), -all_outcomes())

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          9
## 
## Operations:
## 
## Novel factor level assignment for all_nominal(), -all_outcomes()
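
The effect of step_novel() only becomes visible once the recipe is prepared and applied to data that contains an unseen value. The following sketch uses two small hypothetical data frames (toy_train and toy_test are assumptions, not the chapter's data), where the level "ISLAND" appears only in the test set.

```r
library(recipes)

# Hypothetical training data: only two levels are observed
toy_train <- data.frame(
  y = c(1, 2, 3, 4),
  ocean_proximity = factor(c("INLAND", "NEAR BAY", "INLAND", "NEAR BAY"))
)

# Hypothetical test data: "ISLAND" never occurred during training
toy_test <- data.frame(
  y = c(5, 6),
  ocean_proximity = factor(c("INLAND", "ISLAND"))
)

# Estimate the step on the training data
novel_prep <- recipe(y ~ ., data = toy_train) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  prep()

# Apply the prepared recipe to the test data
baked_test <- bake(novel_prep, new_data = toy_test)

levels(baked_test$ocean_proximity)
as.character(baked_test$ocean_proximity)
```

The unseen value should be mapped to the catch-all level instead of being kept as its own level.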

11.4 step_dummy()

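step_dummy() converts nominal variables into one or more numeric binary (0 and 1) dummy variables, based on the factor levels found in the training data.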

rec %>%
  step_dummy(all_nominal())

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          9
## 
## Operations:
## 
## Dummy variables from all_nominal()
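
To see the resulting columns, the recipe can be prepared and the processed training data retrieved with bake(new_data = NULL). This is a minimal sketch on hypothetical toy data with a three-level factor; the outcome is excluded from the selection as a precaution.

```r
library(recipes)

# Hypothetical toy data with one three-level factor
toy_train <- data.frame(
  y = c(1, 2, 3, 4, 5, 6),
  ocean_proximity = factor(c("INLAND", "NEAR BAY", "NEAR OCEAN",
                             "INLAND", "NEAR BAY", "NEAR OCEAN"))
)

dummy_prep <- recipe(y ~ ., data = toy_train) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  prep()

# new_data = NULL returns the processed training data
baked <- bake(dummy_prep, new_data = NULL)

names(baked)
baked
```

By default the first factor level serves as the reference level and gets no column of its own, so a factor with three levels yields two dummy variables.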

11.5 step_zv()

step_zv() removes zero variance variables (variables that contain only a single value).

rec %>%
  step_zv(all_predictors())

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          9
## 
## Operations:
## 
## Zero variance filter on all_predictors()

When the recipe is applied to the data set, a column could contain only zeros, for example a dummy variable for a factor level that never occurs in the training data. This is a “zero-variance predictor” that carries no information. While some R functions will not produce an error for such predictors, they usually cause warnings and other issues. step_zv() removes columns for which the training set has only a single value. This step should be added to the recipe after step_dummy().
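
The interplay of the steps can be sketched on hypothetical toy data: the catch-all level added by step_novel() has no observations in the training set, so step_dummy() turns it into an all-zero column, which step_zv() then removes. The data frame and object names below are assumptions for illustration only.

```r
library(recipes)

# Hypothetical toy data
toy_train <- data.frame(
  y = c(1, 2, 3, 4),
  ocean_proximity = factor(c("INLAND", "NEAR BAY", "INLAND", "NEAR BAY"))
)

base_rec <- recipe(y ~ ., data = toy_train) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes())

# Without step_zv(): the dummy column for the unused "new" level is kept
without_zv <- bake(prep(base_rec), new_data = NULL)

# With step_zv(): the all-zero column is dropped
with_zv <- base_rec %>%
  step_zv(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)

names(without_zv)
names(with_zv)
```

Placing step_zv() before step_dummy() would not help here, because the all-zero column does not exist until the dummy variables have been created.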

11.6 step_normalize()

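step_normalize() centers and then scales numeric variables so that each has mean 0 and standard deviation 1 in the training data. Note that all_numeric() also selects the numeric outcome median_house_value; the full recipe above uses all_predictors() instead, so only the predictors are normalized.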

rec %>%
  step_normalize(all_numeric())

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          9
## 
## Operations:
## 
## Centering and scaling for all_numeric()
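
A quick check on hypothetical toy data shows what the step does once it is prepared. The sketch below selects all_predictors(), as in the full recipe, so the outcome y is left on its original scale.

```r
library(recipes)

# Hypothetical toy data with two numeric predictors
toy_train <- data.frame(
  y = c(10, 20, 30, 40),
  median_income = c(2, 4, 6, 8),
  housing_median_age = c(10, 20, 40, 50)
)

norm_prep <- recipe(y ~ ., data = toy_train) %>%
  step_normalize(all_predictors()) %>%
  prep()

baked <- bake(norm_prep, new_data = NULL)

# Means and standard deviations of the processed predictors
sapply(baked[c("median_income", "housing_median_age")], mean)
sapply(baked[c("median_income", "housing_median_age")], sd)

# The training means and standard deviations stored by the step
tidy(norm_prep, number = 1)
```

The stored training statistics are reused when the prepared recipe is applied to new data, so the test set is scaled with the training means and standard deviations rather than its own.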

Now it’s time to specify and then fit our models.