Chapter 17 Create recipe and roles

To get started, let’s create a recipe for a classification model. Before training the models, we can use a recipe to create a few new predictors and conduct some preprocessing required by the model.

The recipe() function has two arguments:

A formula. Any variable on the left-hand side of the tilde (~) is considered the model outcome (here, arr_delay). On the right-hand side of the tilde are the predictors. Variables may be listed by name, or you can use the dot (.) to indicate all other variables as predictors.
The data. A recipe is associated with the data set used to create the model. This will typically be the training set, so data = train_data here. Naming a data set doesn’t actually change the data itself; it is only used to catalog the names of the variables and their types, like factors, integers, dates, etc.
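
As a minimal sketch of these two arguments, the block below builds a recipe on a small made-up data frame. The `toy` data and its column names are hypothetical stand-ins for train_data; only recipe() and summary() come from the text.

```r
library(recipes)

# Hypothetical toy data standing in for train_data
toy <- data.frame(
  y  = factor(c("late", "on_time", "late", "on_time")),
  x1 = c(1.5, 2.0, 3.5, 4.0),
  x2 = factor(c("a", "b", "a", "b"))
)

# y is the outcome; the dot makes every other column a predictor
rec <- recipe(y ~ ., data = toy)

# The recipe only catalogs variable names, types, and roles;
# the data frame itself is left unchanged
rec_info <- summary(rec)
rec_info
```

The summary should list y with the role "outcome" and x1 and x2 with the role "predictor".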

We can also add roles to this recipe. We can use the update_role() function to let recipes know that flight and time_hour are variables with a custom role that we call “ID” (a role can have any character value). Whereas our formula included all variables in the training set other than arr_delay as predictors, this tells the recipe to keep these two variables but not use them as either outcomes or predictors.

flights_rec <- 
  recipe(arr_delay ~ ., data = train_data) %>%
  update_role(flight, 
              time_hour, 
              new_role = "ID")

This step of adding roles to a recipe is optional; the purpose of using it here is that those two variables can be retained in the data but not included in the model. This can be convenient when, after the model is fit, we want to investigate some poorly predicted value. These ID columns will be available and can be used to try to understand what went wrong.

To get the current set of variables and roles, use the summary() function:

summary(flights_rec)

## # A tibble: 10 x 4
##    variable  type    role      source  
##    <chr>     <chr>   <chr>     <chr>   
##  1 dep_time  numeric predictor original
##  2 flight    numeric ID        original
##  3 origin    nominal predictor original
##  4 dest      nominal predictor original
##  5 air_time  numeric predictor original
##  6 distance  numeric predictor original
##  7 carrier   nominal predictor original
##  8 date      date    predictor original
##  9 time_hour date    ID        original
## 10 arr_delay nominal outcome   original
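
To see what a custom role does in practice, here is a small sketch on made-up data (the `toy` data frame is an assumption, not part of the flights data). It assigns the "ID" role to one column and then asks for predictors only.

```r
library(recipes)

# Hypothetical toy data with one identifier column
toy <- data.frame(
  arr_delay = factor(c("late", "on_time", "late", "on_time")),
  flight    = c(101L, 202L, 303L, 404L),
  distance  = c(500, 1200, 300, 2500)
)

rec <- recipe(arr_delay ~ ., data = toy) %>%
  update_role(flight, new_role = "ID")

# flight now carries the "ID" role instead of "predictor"
roles <- summary(rec)
roles[, c("variable", "role")]

# Selecting only predictors leaves the ID column out
predictors_only <- bake(prep(rec), new_data = NULL, all_predictors())
names(predictors_only)
```

The ID column stays in the recipe's data, so it remains available when baking without a selector; it is simply ignored by selectors such as all_predictors().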

17.1 Create features

Now we can start adding steps onto our recipe using the pipe operator.

17.1.1 Date

Perhaps it is reasonable for the date of the flight to have an effect on the likelihood of a late arrival. A little bit of feature engineering might go a long way to improving our model. How should the date be encoded into the model? The date column holds R date objects, so including that column “as is” means that the model will convert it to a numeric format equal to the number of days after a reference date (January 1, 1970, in R):

flight_data %>% 
  distinct(date) %>% 
  mutate(numeric_date = as.numeric(date))

## # A tibble: 364 x 2
##    date       numeric_date
##    <date>            <dbl>
##  1 2013-05-03        15828
##  2 2013-03-04        15768
##  3 2013-02-20        15756
##  4 2013-04-02        15797
##  5 2013-06-13        15869
##  6 2013-02-21        15757
##  7 2013-05-08        15833
##  8 2013-10-21        15999
##  9 2013-11-12        16021
## 10 2013-09-12        15960
## # … with 354 more rows

It’s possible that the numeric date variable is a good option for modeling. However, it might be better to add model terms derived from the date that have a better potential to be important to the model. For example, we could derive the following meaningful features from the single date variable:

the day of the week,
the month, and
whether or not the date corresponds to a holiday.

Let’s do all three of these by adding steps to our recipe:

flights_rec <- 
  recipe(arr_delay ~ ., 
         data = train_data) %>%
  update_role(flight, 
              time_hour, 
              new_role = "ID") %>% 
  step_date(date, 
            features = c("dow", "month")) %>%               
  step_holiday(date, 
               holidays = timeDate::listHolidays("US")) %>% 
  step_rm(date)

What does each of these steps do?

With step_date(), we created two new factor columns with the appropriate day of the week (dow) and the month.
With step_holiday(), we created a binary variable indicating whether the current date is a holiday or not. The argument value of timeDate::listHolidays("US") uses the timeDate package to list the 17 standard US holidays.
With step_rm(), we remove the original date variable since we no longer want it in the model.
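
The effect of these three steps can be checked on a few hand-picked dates. This is a sketch under assumptions: the `toy` data frame is made up, only two holidays are requested to keep the output narrow, and the timeDate package must be installed for step_holiday() to work.

```r
library(recipes)

# Hypothetical toy data: two of the three dates are US holidays
toy <- data.frame(
  arr_delay = factor(c("late", "on_time", "on_time")),
  date      = as.Date(c("2013-07-04", "2013-07-05", "2013-12-25"))
)

rec <- recipe(arr_delay ~ ., data = toy) %>%
  step_date(date, features = c("dow", "month")) %>%
  step_holiday(date, holidays = c("USIndependenceDay", "USChristmasDay")) %>%
  step_rm(date)

# prep() estimates the steps; bake(new_data = NULL) returns the processed training data
baked <- bake(prep(rec), new_data = NULL)
names(baked)
```

The new columns are named after the source variable (date_dow, date_month, date_USIndependenceDay, date_USChristmasDay), and the original date column is gone.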

Next, we’ll turn our attention to the variable types of our predictors. Because we plan to train a classification model, we know that predictors will ultimately need to be numeric, as opposed to factor variables. In other words, there may be a difference in how we store our data (in factors inside a data frame), and how the underlying equations require them (a purely numeric matrix).

17.1.2 Dummy variables

For factors like dest and origin, standard practice is to convert them into dummy or indicator variables to make them numeric. These are binary values for each level of the factor. For example, our origin variable has values of “EWR,” “JFK,” and “LGA.” The standard dummy variable encoding, shown below, will create two numeric columns of the data that are 1 when the originating airport is “JFK” or “LGA” and zero otherwise, respectively.

ORIGIN	ORIGIN_JFK	ORIGIN_LGA
EWR	0	0
JFK	1	0
LGA	0	1

But, unlike the standard model formula methods in R, a recipe does not automatically create these dummy variables for you; you’ll need to tell your recipe to add this step. This is for two reasons. First, many models do not require numeric predictors, so dummy variables may not always be preferred. Second, recipes can also be used for purposes outside of modeling, where non-dummy versions of the variables may work better. For example, you may want to make a table or a plot with a variable as a single factor. For those reasons, you need to explicitly tell recipes to create dummy variables using step_dummy():

flights_rec <- 
  recipe(arr_delay ~ ., 
         data = train_data) %>%
  update_role(flight, 
              time_hour, 
              new_role = "ID") %>% 
  step_date(date, 
            features = c("dow", "month")) %>% 
  step_holiday(date, 
               holidays = timeDate::listHolidays("US")) %>% 
  step_rm(date) %>% 
  step_dummy(all_nominal(), -all_outcomes())

Here, we did something different than before: instead of applying a step to an individual variable, we used selectors to apply this recipe step to several variables at once.

The first selector, all_nominal(), selects all variables that are either factors or characters.
The second selector, -all_outcomes(), removes any outcome variables from this recipe step.

With these two selectors together, our recipe step above translates to:

Create dummy variables for all of the factor or character columns unless they are outcomes.

At this stage in the recipe, this step selects the origin, dest, and carrier variables. It also includes two new variables, date_dow and date_month, that were created by the earlier step_date().

More generally, the recipe selectors mean that you don’t always have to apply steps to individual variables one at a time. Since a recipe knows the variable type and role of each column, they can also be selected (or dropped) using this information.
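
The dummy encoding and the two selectors can be tried on a three-row example. The `toy` data frame below is hypothetical; it reuses the origin levels from the table above so the result can be compared directly.

```r
library(recipes)

# Hypothetical toy data with one factor predictor and a factor outcome
toy <- data.frame(
  arr_delay = factor(c("late", "on_time", "on_time")),
  origin    = factor(c("EWR", "JFK", "LGA")),
  distance  = c(500, 1200, 300)
)

rec <- recipe(arr_delay ~ ., data = toy) %>%
  step_dummy(all_nominal(), -all_outcomes())

baked <- bake(prep(rec), new_data = NULL)

# origin is replaced by origin_JFK and origin_LGA (EWR is the reference level);
# arr_delay is an outcome, so it stays a factor
baked
```

Recent versions of recipes also provide all_nominal_predictors(), which should select the same columns here; whether it is available depends on the installed version.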

17.1.3 Zero variance

Note that since carrier and dest have some infrequently occurring values, it is possible that dummy variables might be created for values that don’t exist in the training set. For example, there could be destinations that are only in the test set. The function anti_join() returns all rows from x (here, the distinct destinations in test_data) that have no matching values in y (train_data), keeping just the columns from x:

test_data %>% 
  distinct(dest) %>% 
  anti_join(train_data, by = "dest")

## # A tibble: 1 x 1
##   dest 
##   <fct>
## 1 HDN

When the recipe is applied to the training set, a column could contain only zeros. This is a “zero-variance predictor” that has no information within the column. While some R functions will not produce an error for such predictors, they usually cause warnings and other issues. step_zv() will remove columns from the data when the training set data have a single value, so it is added to the recipe after step_dummy():

flights_rec <- 
  recipe(arr_delay ~ ., 
         data = train_data) %>% 
  update_role(flight, 
              time_hour, 
              new_role = "ID") %>% 
  step_date(date, 
            features = c("dow", "month")) %>% 
  step_holiday(date, 
               holidays = timeDate::listHolidays("US")) %>%
  step_rm(date) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_zv(all_predictors())
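
A stripped-down sketch shows step_zv() on its own. The `toy` data frame is made up and contains one deliberately constant column.

```r
library(recipes)

# Hypothetical toy data: `constant` has a single value in the training data
toy <- data.frame(
  y        = c(1.2, 3.4, 2.2, 5.1),
  x        = c(10, 20, 30, 40),
  constant = c(0, 0, 0, 0)
)

rec <- recipe(y ~ ., data = toy) %>%
  step_zv(all_predictors())

# The zero-variance column is identified during prep() and dropped by bake()
baked <- bake(prep(rec), new_data = NULL)
names(baked)
```

Because the decision is made from the training data during prep(), the same column is dropped when the recipe is later applied to new data.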

17.1.4 Correlations

As a final step, we remove predictor variables that have large absolute correlations with other variables:

flights_rec <- 
  recipe(arr_delay ~ ., 
         data = train_data) %>% 
  update_role(flight, 
              time_hour, 
              new_role = "ID") %>% 
  step_date(date, 
            features = c("dow", "month")) %>% 
  step_holiday(date, 
               holidays = timeDate::listHolidays("US")) %>%
  step_rm(date) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_zv(all_predictors()) %>% 
  step_corr(all_predictors())
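
To illustrate the correlation filter in isolation, the sketch below simulates two nearly identical predictors and one unrelated predictor. The data are made up, and the threshold of 0.9 is written out explicitly as an assumption about a sensible cutoff.

```r
library(recipes)

set.seed(123)
x1 <- rnorm(50)

# Hypothetical toy data: x2 is almost a copy of x1, x3 is unrelated noise
toy <- data.frame(
  y  = rnorm(50),
  x1 = x1,
  x2 = x1 + rnorm(50, sd = 0.01),
  x3 = rnorm(50)
)

rec <- recipe(y ~ ., data = toy) %>%
  step_corr(all_predictors(), threshold = 0.9)

# One member of the highly correlated pair is removed; x3 is kept
baked <- bake(prep(rec), new_data = NULL)
names(baked)
```

Only one of x1 and x2 should remain after baking. Which of the two is dropped is decided by the step itself, so downstream code should not rely on a particular one surviving.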

Now we’ve created a specification of what should be done with the data.