Chapter 11 Data preparation
Next, we use a recipe() to build a set of steps for data preprocessing and feature engineering.
- First, we must tell the recipe() what our model is going to be (using a formula here) and what our training data is.
- step_novel() will convert all nominal variables to factors.
- We then convert the factor columns into (one or more) numeric binary (0 and 1) variables for the levels of the training data.
- We remove any numeric variables that have zero variance.
- We normalize (center and scale) the numeric variables.
housing_rec <- recipe(median_house_value ~ ., data = new_train) %>%
step_novel(all_nominal(), -all_outcomes()) %>%
step_dummy(all_nominal()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors())
# Show the content of our recipe
housing_rec
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 9
##
## Operations:
##
## Novel factor level assignment for all_nominal(), -all_outcomes()
## Dummy variables from all_nominal()
## Zero variance filter on all_predictors()
## Centering and scaling for all_predictors()
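A recipe only records the steps; nothing is estimated until it is trained. As a minimal sketch, assuming a small made-up data frame (toy_train and its columns are hypothetical and merely stand in for new_train), the same four steps can be trained with prep() and applied with bake():

```r
library(recipes)

# Hypothetical toy data standing in for new_train
toy_train <- data.frame(
  median_house_value = c(100, 150, 200, 250, 300, 350),
  median_income      = c(2.0, 3.5, 4.0, 5.5, 6.0, 8.5),
  constant_col       = rep(1, 6),  # single value, so zero variance
  ocean_proximity    = factor(c("inland", "coast", "inland",
                                "coast", "inland", "coast"))
)

# Same sequence of steps as in housing_rec
toy_rec <- recipe(median_house_value ~ ., data = toy_train) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())

# prep() estimates what each step needs from the training data
toy_prepped <- prep(toy_rec, training = toy_train)

# bake() with new_data = NULL returns the processed training data
toy_baked <- bake(toy_prepped, new_data = NULL)
names(toy_baked)
```

In this sketch the constant column no longer appears in the result, and the factor column is replaced by a numeric dummy column.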
Let’s have a closer look at the different components of the recipe.
11.1 recipe()
First of all, we create a simple recipe containing only an outcome (median_house_value) and predictors (all other variables in the dataset: .). To demonstrate the use of recipes step by step, we store it in a new object with the name rec:
rec <- recipe(median_house_value ~ ., data = new_train)
The formula median_house_value ~ . declares the outcome (left side of ~) and the predictors (right side of ~, where . stands for all remaining columns).
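To check which role the formula assigned to each column, summary() can be called on a recipe. A minimal sketch, assuming a made-up data frame (toy_train and its columns are hypothetical, not the chapter's data):

```r
library(recipes)

# Hypothetical toy data; only the column roles matter here
toy_train <- data.frame(
  median_house_value = c(100, 150, 200),
  median_income      = c(2.0, 3.5, 4.0),
  housing_median_age = c(10, 25, 40)
)

toy_rec <- recipe(median_house_value ~ ., data = toy_train)

# summary() returns one row per variable, including its role
roles <- summary(toy_rec)
roles[, c("variable", "role")]
```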
11.2 Helper functions
Here are some helper functions for selecting sets of variables:
- all_predictors(): Each x variable (right side of ~)
- all_outcomes(): Each y variable (left side of ~)
- all_numeric(): Each numeric variable
- all_nominal(): Each categorical variable (e.g. factor, string)
- dplyr::select() helpers: starts_with('Lot_'), etc.
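The selectors above are used inside step functions to choose the columns a step acts on. A minimal sketch, assuming a made-up data frame (toy and its column names are hypothetical), that normalizes only the columns whose names start with lot_:

```r
library(recipes)

# Hypothetical toy data with two "lot_" columns and one other predictor
toy <- data.frame(
  price        = c(10, 20, 30, 40),
  lot_area     = c(100, 200, 300, 400),
  lot_frontage = c(5, 7, 9, 11),
  rooms        = c(2, 3, 4, 5)
)

# Only the columns matched by starts_with("lot_") are normalized
sel_rec <- recipe(price ~ ., data = toy) %>%
  step_normalize(starts_with("lot_"))

sel_baked <- bake(prep(sel_rec), new_data = NULL)
sel_baked
```

Selectors can also be combined, as in all_nominal(), -all_outcomes(), where the minus sign excludes the outcome.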
11.3 step_novel()
step_novel() will convert all nominal variables to factors. It adds a catch-all level to each factor, so that values which did not occur in the training data (for example, in the test set) are assigned to this level instead of causing problems at prediction time. Missing values will remain missing.
rec %>%
step_novel(all_nominal(), -all_outcomes())
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 9
##
## Operations:
##
## Novel factor level assignment for all_nominal(), -all_outcomes()
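The effect of the catch-all level is easiest to see on data that contain an unseen value. A minimal sketch, assuming made-up training and test data frames (toy_train and toy_test are hypothetical); with the default settings of step_novel(), the catch-all level is named "new":

```r
library(recipes)

# Hypothetical training data with two known levels
toy_train <- data.frame(
  median_house_value = c(100, 150, 200, 250),
  ocean_proximity    = factor(c("inland", "coast", "inland", "coast"))
)

# Hypothetical test data containing a level not seen in training
toy_test <- data.frame(
  median_house_value = c(120, 180),
  ocean_proximity    = factor(c("inland", "island"))
)

novel_prepped <- recipe(median_house_value ~ ., data = toy_train) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  prep()

# The unseen value "island" is mapped to the catch-all level
novel_baked <- bake(novel_prepped, new_data = toy_test)
as.character(novel_baked$ocean_proximity)
levels(novel_baked$ocean_proximity)
```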
11.4 step_dummy()
step_dummy() converts nominal data into dummy variables.
rec %>%
step_dummy(all_nominal())
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 9
##
## Operations:
##
## Dummy variables from all_nominal()
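To illustrate what the dummy variables look like, here is a minimal sketch assuming a made-up data frame (toy_train is hypothetical). With the default settings, a factor with three levels becomes two 0/1 columns, and the first level serves as the reference:

```r
library(recipes)

# Hypothetical toy data with a three-level factor
toy_train <- data.frame(
  median_house_value = c(100, 150, 200, 250, 300, 350),
  ocean_proximity    = factor(c("coast", "inland", "island",
                                "coast", "inland", "island"))
)

dummy_baked <- recipe(median_house_value ~ ., data = toy_train) %>%
  step_dummy(all_nominal()) %>%
  prep() %>%
  bake(new_data = NULL)

# The original factor column is replaced by one column per non-reference level
names(dummy_baked)
dummy_baked
```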
11.5 step_zv()
step_zv() removes zero variance variables (variables that contain only a single value).
rec %>%
step_zv(all_predictors())
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 9
##
## Operations:
##
## Zero variance filter on all_predictors()
When the recipe is applied to the data set, a column could contain only zeros. This is a “zero-variance predictor” that has no information within the column. While some R functions will not produce an error for such predictors, they usually cause warnings and other issues. step_zv() will remove columns from the data when the training set data have a single value. This step should be added to the recipe after step_dummy().
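A minimal sketch of the zero variance filter, assuming a made-up data frame (toy_train and the column all_zero are hypothetical) in which one predictor contains only zeros:

```r
library(recipes)

# Hypothetical toy data; all_zero carries no information
toy_train <- data.frame(
  median_house_value = c(100, 150, 200, 250),
  median_income      = c(2.0, 3.5, 4.0, 5.5),
  all_zero           = c(0, 0, 0, 0)
)

zv_baked <- recipe(median_house_value ~ ., data = toy_train) %>%
  step_zv(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)

# all_zero is dropped, the informative predictor is kept
names(zv_baked)
```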
11.6 step_normalize()
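step_normalize() centers and then scales numeric variables (mean = 0, sd = 1). Note that all_numeric() also selects a numeric outcome such as median_house_value, whereas the full recipe above uses all_predictors().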
rec %>%
step_normalize(all_numeric())
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 9
##
## Operations:
##
## Centering and scaling for all_numeric()
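As a quick check of what normalization does, here is a minimal sketch assuming a made-up data frame (toy_train is hypothetical). It deliberately uses all_predictors(), as in the full recipe, so that the outcome keeps its original scale:

```r
library(recipes)

# Hypothetical toy data with two numeric predictors
toy_train <- data.frame(
  median_house_value = c(100, 150, 200, 250),
  median_income      = c(2.0, 3.5, 4.0, 5.5),
  housing_median_age = c(10, 25, 40, 15)
)

norm_baked <- recipe(median_house_value ~ ., data = toy_train) %>%
  step_normalize(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)

# Means are (numerically) zero and standard deviations are one
predictors <- c("median_income", "housing_median_age")
sapply(norm_baked[predictors], mean)
sapply(norm_baked[predictors], sd)
```

The means and standard deviations are estimated from the training data during prep() and then reused whenever the recipe is applied to new data.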
Now it’s time to specify and then fit our models.