Chapter 17 Create recipe and roles
To get started, let’s create a recipe for a classification model. Before training the models, we can use a recipe to create a few new predictors and conduct some preprocessing required by the model.
The recipe() function has two arguments:
- A formula. Any variable on the left-hand side of the tilde (~) is considered the model outcome (here, arr_delay). On the right-hand side of the tilde are the predictors. Variables may be listed by name, or you can use the dot (.) to indicate all other variables as predictors.
- The data. A recipe is associated with the data set used to create the model. This will typically be the training set, so data = train_data here. Naming a data set doesn’t actually change the data itself; it is only used to catalog the names of the variables and their types, like factors, integers, dates, etc.
We can also add roles to this recipe. We can use the update_role() function to let recipes know that flight and time_hour are variables with a custom role that we call “ID” (a role can have any character value). Whereas our formula included all variables in the training set other than arr_delay as predictors, this tells the recipe to keep these two variables but not use them as either outcomes or predictors.
flights_rec <-
  recipe(arr_delay ~ ., data = train_data) %>%
  update_role(flight, time_hour, new_role = "ID")
This step of adding roles to a recipe is optional; the purpose of using it here is that those two variables can be retained in the data but not included in the model. This can be convenient when, after the model is fit, we want to investigate some poorly predicted value. These ID columns will be available and can be used to try to understand what went wrong.
To get the current set of variables and roles, use the summary() function:
summary(flights_rec)
## # A tibble: 10 x 4
##    variable  type    role      source
##    <chr>     <chr>   <chr>     <chr>
##  1 dep_time  numeric predictor original
##  2 flight    numeric ID        original
##  3 origin    nominal predictor original
##  4 dest      nominal predictor original
##  5 air_time  numeric predictor original
##  6 distance  numeric predictor original
##  7 carrier   nominal predictor original
##  8 date      date    predictor original
##  9 time_hour date    ID        original
## 10 arr_delay nominal outcome   original
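The role bookkeeping can be tried in isolation. The following is a minimal sketch on a small made-up data frame; the names toy_data, toy_rec, and id are illustrative assumptions, not part of the flights data, and the recipes package is assumed to be installed.

```r
library(recipes)

# Hypothetical toy data: `id` stands in for an identifier column like `flight`.
toy_data <- data.frame(
  y  = factor(c("late", "on_time", "late", "on_time")),
  id = c(101, 102, 103, 104),
  x  = c(1.5, 2.0, 0.5, 3.0)
)

# The formula first marks every non-outcome column as a predictor;
# update_role() then reassigns `id` to the custom role "ID".
toy_rec <- recipe(y ~ ., data = toy_data) %>%
  update_role(id, new_role = "ID")

toy_roles <- summary(toy_rec)
toy_roles[, c("variable", "role")]

# Only columns that still have the "predictor" role remain predictors.
toy_predictors <- toy_roles$variable[toy_roles$role == "predictor"]
toy_predictors
```

In this sketch, id stays in the recipe but no longer counts as a predictor, which mirrors how flight and time_hour are handled above.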
17.1 Create features
Now we can start adding steps onto our recipe using the pipe operator.
17.1.1 Date
Perhaps it is reasonable for the date of the flight to have an effect on the likelihood of a late arrival. A little bit of feature engineering might go a long way to improving our model. How should the date be encoded into the model? The date column holds an R date object, so including that column “as is” means that the model will convert it to a numeric format equal to the number of days after a reference date:
flight_data %>%
  distinct(date) %>%
  mutate(numeric_date = as.numeric(date))
## # A tibble: 364 x 2
##    date       numeric_date
##    <date>            <dbl>
##  1 2013-05-03        15828
##  2 2013-03-04        15768
##  3 2013-02-20        15756
##  4 2013-04-02        15797
##  5 2013-06-13        15869
##  6 2013-02-21        15757
##  7 2013-05-08        15833
##  8 2013-10-21        15999
##  9 2013-11-12        16021
## 10 2013-09-12        15960
## # … with 354 more rows
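As a quick check on that encoding, the sketch below uses only base R to show that the reference date is 1970-01-01 and that the conversion can be reversed. The variable names are illustrative.

```r
# Base R only: a Date is stored as a count of days since 1970-01-01.
d <- as.Date("2013-05-03")
days_since_origin <- as.numeric(d)
days_since_origin

# Converting the count back with the same origin recovers the original date.
round_trip <- as.Date(days_since_origin, origin = "1970-01-01")
round_trip

# The same date also carries calendar information that the raw count hides.
# (The returned labels depend on the session's locale.)
day_of_week <- weekdays(d)
month_name  <- months(d)
```

The day count is monotone in time, so on its own it can only express a trend across the year; the weekday and month have to be derived explicitly.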
It’s possible that the numeric date variable is a good option for modeling. However, it might be better to add model terms derived from the date that have a better potential to be important to the model. For example, we could derive the following meaningful features from the single date variable:
- the day of the week,
- the month, and
- whether or not the date corresponds to a holiday.
Let’s do all three of these by adding steps to our recipe:
flights_rec <-
  recipe(arr_delay ~ ., data = train_data) %>%
  update_role(flight, time_hour, new_role = "ID") %>%
  step_date(date, features = c("dow", "month")) %>%
  step_holiday(date, holidays = timeDate::listHolidays("US")) %>%
  step_rm(date)
What does each of these steps do?
- With step_date(), we created two new factor columns with the appropriate day of the week (dow) and the month.
- With step_holiday(), we created binary indicator variables marking whether the current date is a holiday or not. The argument value of timeDate::listHolidays("US") uses the timeDate package to list the 17 standard US holidays.
- With step_rm(), we remove the original date variable since we no longer want it in the model.
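To inspect what these three steps produce, one option is to estimate the recipe with prep() and extract the processed data with bake(). The sketch below does this on a made-up four-row data frame (toy_dates is a hypothetical name) and assumes the recipes and timeDate packages are installed.

```r
library(recipes)

# Hypothetical toy data with a numeric outcome and a single date column.
toy_dates <- data.frame(
  y    = c(1, 2, 3, 4),
  date = as.Date(c("2013-01-01", "2013-07-04", "2013-07-05", "2013-12-25"))
)

toy_rec <- recipe(y ~ ., data = toy_dates) %>%
  step_date(date, features = c("dow", "month")) %>%
  step_holiday(date, holidays = timeDate::listHolidays("US")) %>%
  step_rm(date)

# prep() estimates the steps on the data supplied to recipe();
# bake(new_data = NULL) returns that same data after processing.
toy_baked <- bake(prep(toy_rec), new_data = NULL)
names(toy_baked)
```

In this sketch, names(toy_baked) should list date_dow and date_month plus the holiday indicator columns, while the original date column is gone.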
Next, we’ll turn our attention to the variable types of our predictors. Because we plan to train a classification model, we know that predictors will ultimately need to be numeric, as opposed to factor variables. In other words, there may be a difference in how we store our data (in factors inside a data frame), and how the underlying equations require them (a purely numeric matrix).
17.1.2 Dummy variables
For factors like dest and origin, standard practice is to convert them into dummy or indicator variables to make them numeric. These are binary values for each level of the factor. For example, our origin variable has values of “EWR,” “JFK,” and “LGA.” The standard dummy variable encoding, shown below, will create two numeric columns of the data that are 1 when the originating airport is “JFK” or “LGA” and zero otherwise, respectively.
ORIGIN | ORIGIN_JFK | ORIGIN_LGA |
---|---|---|
EWR | 0 | 0 |
JFK | 1 | 0 |
LGA | 0 | 1 |
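For comparison, base R's formula machinery produces this encoding on its own. The sketch below uses model.matrix() on a made-up three-row data frame (toy_origin is an illustrative name) and relies on R's default treatment contrasts.

```r
# Base R only. Hypothetical data frame with one row per origin airport.
toy_origin <- data.frame(origin = factor(c("EWR", "JFK", "LGA")))

# With the default treatment contrasts, the first level (EWR) is the
# reference level and gets no indicator column of its own.
dummies <- model.matrix(~ origin, data = toy_origin)
dummies[, c("originJFK", "originLGA")]
```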
But, unlike the standard model formula methods in R, a recipe does not automatically create these dummy variables for you; you’ll need to tell your recipe to add this step. This is for two reasons. First, many models do not require numeric predictors, so dummy variables may not always be preferred. Second, recipes can also be used for purposes outside of modeling, where non-dummy versions of the variables may work better. For example, you may want to make a table or a plot with a variable as a single factor. For those reasons, you need to explicitly tell recipes to create dummy variables using step_dummy():
flights_rec <-
  recipe(arr_delay ~ ., data = train_data) %>%
  update_role(flight, time_hour, new_role = "ID") %>%
  step_date(date, features = c("dow", "month")) %>%
  step_holiday(date, holidays = timeDate::listHolidays("US")) %>%
  step_rm(date) %>%
  step_dummy(all_nominal(), -all_outcomes())
Here, we did something different than before: instead of applying a step to an individual variable, we used selectors to apply this recipe step to several variables at once.
- The first selector, all_nominal(), selects all variables that are either factors or characters.
- The second selector, -all_outcomes(), removes any outcome variables from this recipe step.
With these two selectors together, our recipe step above translates to:
Create dummy variables for all of the factor or character columns unless they are outcomes.
At this stage in the recipe, this step selects the origin, dest, and carrier variables. It also includes two new variables, date_dow and date_month, that were created by the earlier step_date().
More generally, the recipe selectors mean that you don’t always have to apply steps to individual variables one at a time. Since a recipe knows the variable type and role of each column, they can also be selected (or dropped) using this information.
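The effect of combining the two selectors can be seen on a small made-up data set. This is a minimal sketch, assuming the recipes package is installed; toy_flights and its values are invented for illustration.

```r
library(recipes)

# Hypothetical toy data: a factor outcome, two factor predictors, one numeric.
toy_flights <- data.frame(
  arr_delay = factor(c("late", "on_time", "late", "on_time", "late", "on_time")),
  origin    = factor(c("EWR", "JFK", "LGA", "EWR", "JFK", "LGA")),
  carrier   = factor(c("AA", "UA", "AA", "UA", "AA", "UA")),
  distance  = c(200, 950, 1400, 310, 2475, 760)
)

# all_nominal() would also match the factor outcome, so it is excluded
# with -all_outcomes().
toy_rec <- recipe(arr_delay ~ ., data = toy_flights) %>%
  step_dummy(all_nominal(), -all_outcomes())

toy_baked <- bake(prep(toy_rec), new_data = NULL)
names(toy_baked)
```

Here origin and carrier are replaced by indicator columns, distance passes through unchanged, and arr_delay remains a factor because the outcome was excluded from the step.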
17.1.3 Zero variance
Note that since carrier and dest have some infrequently occurring values, it is possible that dummy variables might be created for values that don’t exist in the training set. For example, there could be destinations that are only in the test set. The function anti_join() returns all rows from x (test_data) where there are no matching values in y (train_data), keeping just the columns from x:
test_data %>%
  distinct(dest) %>%
  anti_join(train_data)
## # A tibble: 1 x 1
## dest
## <fct>
## 1 HDN
When the recipe is applied to the training set, a column could contain only zeros. This is a “zero-variance predictor” that has no information within the column. While some R functions will not produce an error for such predictors, they usually cause warnings and other issues. step_zv() will remove columns from the data when the training set data have a single value, so it is added to the recipe after step_dummy():
flights_rec <-
  recipe(arr_delay ~ ., data = train_data) %>%
  update_role(flight, time_hour, new_role = "ID") %>%
  step_date(date, features = c("dow", "month")) %>%
  step_holiday(date, holidays = timeDate::listHolidays("US")) %>%
  step_rm(date) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors())
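The interaction between step_dummy() and step_zv() can be reproduced on a toy example. In the sketch below (made-up data, hypothetical names, recipes package assumed installed), the factor dest declares a level "HDN" that never occurs in the rows, which imitates a destination missing from the training set.

```r
library(recipes)

# Hypothetical training data: "HDN" is a declared level with no rows.
toy_train <- data.frame(
  y    = c(1.2, 3.4, 2.2, 4.1),
  dest = factor(c("ATL", "BOS", "ATL", "BOS"),
                levels = c("ATL", "BOS", "HDN"))
)

without_zv <- recipe(y ~ ., data = toy_train) %>%
  step_dummy(all_nominal(), -all_outcomes())

with_zv <- without_zv %>%
  step_zv(all_predictors())

# Compare the columns that survive with and without the zero-variance filter.
cols_without_zv <- names(bake(prep(without_zv), new_data = NULL))
cols_with_zv    <- names(bake(prep(with_zv), new_data = NULL))
setdiff(cols_without_zv, cols_with_zv)
```

The all-zero indicator for the unused level is created by step_dummy() and then dropped by step_zv(), which is why the filter is placed after the dummy step.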
17.1.4 Correlations
As a final step, we remove predictor variables that have large absolute correlations with other variables:
flights_rec <-
  recipe(arr_delay ~ ., data = train_data) %>%
  update_role(flight, time_hour, new_role = "ID") %>%
  step_date(date, features = c("dow", "month")) %>%
  step_holiday(date, holidays = timeDate::listHolidays("US")) %>%
  step_rm(date) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors()) %>%
  step_corr(all_predictors())
Now we’ve created a specification of what should be done with the data.
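The correlation filter can also be tried in isolation. The following sketch uses made-up numeric data in which x2 is nearly a rescaled copy of x1; the data, the variable names, and the explicit threshold of 0.9 are choices made for this illustration, and the recipes package is assumed to be installed.

```r
library(recipes)

# Hypothetical toy data: x1 and x2 are almost perfectly correlated,
# while x3 is only weakly related to either of them.
toy_num <- data.frame(
  y  = c(2, 4, 5, 4, 5, 7, 8, 9, 10, 12),
  x1 = 1:10,
  x2 = 2 * (1:10) + rep(c(0.1, -0.1), 5),
  x3 = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)

toy_rec <- recipe(y ~ ., data = toy_num) %>%
  step_corr(all_predictors(), threshold = 0.9)

# The filter is estimated by prep(); bake() then returns the reduced data.
kept <- names(bake(prep(toy_rec), new_data = NULL))
kept
```

One of x1 and x2 is dropped so that no remaining pair of predictors exceeds the threshold; which of the two is removed is decided by the step's own selection rule.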