Chapter 15 Data preparation

Let’s use the nycflights13 data to predict whether a plane arrives 30 or more minutes late. This data set contains information on 325,819 flights departing near New York City in 2013. It also contains weather data (hourly meteorological data for LGA, JFK and EWR).

Let’s start by loading the data and making a few changes to the variables:

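library(dplyr)        # data manipulation verbs and the %>% pipe
library(nycflights13) # provides the flights and weather tables

flight_data_all <- 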
  flights %>% 
  mutate(
    # Convert the arrival delay to a factor
    arr_delay = ifelse(arr_delay >= 30, 
                       "late", 
                       "on_time"),
    arr_delay = factor(arr_delay),
    # We will use the date (not date-time) 
    # in the recipe below
    date = as.Date(time_hour)
  ) %>% 
  # Include weather data
  inner_join(weather, by = c("origin", "time_hour")) %>% 
  # Only retain the specific columns we will use
  select(dep_time, flight, origin, 
         dest, air_time, distance, 
         carrier, date, arr_delay, time_hour) %>% 
  # Exclude missing data
  na.omit() %>% 
  # For creating models, it is 
  # better to have qualitative columns
  # encoded as factors (instead of character strings)
  mutate(across(where(is.character), as.factor))
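The outcome recoding in the first mutate() call can be checked on a few made-up values. This is a minimal base R sketch; the vector delays is hypothetical and not taken from the flights data.

```r
# Hypothetical arrival delays in minutes (NA marks a missing delay)
delays <- c(-5, 29, 30, 45, NA)

# Same rule as in the pipeline: a delay of 30 minutes or more is "late"
status <- factor(ifelse(delays >= 30, "late", "on_time"))

print(status)

# Factor levels are sorted alphabetically, so "late" comes first
print(levels(status))
```

A missing delay stays missing after recoding, which is why the na.omit() step later removes those rows.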

To speed up later calculations, we use only a random sample of 10,000 flights:

set.seed(123)

flight_data <- sample_n(flight_data_all, 
                        10000)
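Setting the seed before sampling makes the draw reproducible. The sketch below illustrates this with base R's sample() on a small hypothetical range of row numbers rather than on the flights data.

```r
# Draw 5 of 100 hypothetical row numbers after fixing the seed
set.seed(123)
first_draw <- sample(1:100, 5)

# Resetting the same seed reproduces exactly the same draw
set.seed(123)
second_draw <- sample(1:100, 5)

print(identical(first_draw, second_draw))
```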

We can see that around 16% of the flights in this sample arrived 30 or more minutes late:

flight_data %>% 
  count(arr_delay) %>% 
  mutate(prop = n/sum(n))
## # A tibble: 2 x 3
##   arr_delay     n  prop
##   <fct>     <int> <dbl>
## 1 late       1589 0.159
## 2 on_time    8411 0.841

15.1 Data overview

Before we start building up our recipe, let’s take a quick look at a few specific variables that will be important for both preprocessing and modeling.

First, notice that the variable we created called arr_delay is a factor variable; it is important that our outcome variable for training a classification model (at least a logistic regression model) is a factor.

glimpse(flight_data)
## Rows: 10,000
## Columns: 10
## $ dep_time  <int> 825, 657, 1835, 1827, 1600, 1039, 1142, 1723, 1446, 2140, 2…
## $ flight    <int> 120, 4122, 4517, 373, 4502, 4589, 4646, 4195, 3588, 3660, 1…
## $ origin    <fct> JFK, EWR, LGA, JFK, EWR, LGA, LGA, EWR, LGA, LGA, JFK, EWR,…
## $ dest      <fct> LAX, SDF, CRW, CLT, BNA, DTW, MSP, BNA, MSP, BNA, LAX, DCA,…
## $ air_time  <dbl> 316, 104, 75, 90, 119, 93, 139, 130, 138, 104, 323, 36, 126…
## $ distance  <dbl> 2475, 642, 444, 541, 748, 502, 1020, 748, 1020, 764, 2475, …
## $ carrier   <fct> DL, EV, MQ, US, EV, MQ, MQ, EV, MQ, MQ, AA, EV, EV, EV, EV,…
## $ date      <date> 2013-05-03, 2013-03-04, 2013-02-20, 2013-04-02, 2013-06-13…
## $ arr_delay <fct> on_time, on_time, on_time, on_time, late, on_time, on_time,…
## $ time_hour <dttm> 2013-05-03 08:00:00, 2013-03-04 07:00:00, 2013-02-20 18:00…

Second, there are two variables that we don’t want to use as predictors in our model, but that we would like to retain as identification variables that can be used to troubleshoot poorly predicted data points. These are flight, a numeric value, and time_hour, a date-time value.

Third, there are 97 flight destinations contained in dest and 15 distinct carriers in our sample.

flight_data %>% 
  skimr::skim(dest, carrier)

Table 15.1: Data summary

Name                      Piped data
Number of rows            10000
Number of columns         10
Column type frequency:
  factor                  2
Group variables           None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
dest           0          1              FALSE    97        ATL: 523, LAX: 506, BOS: 496, ORD: 482
carrier        0          1              FALSE    15        UA: 1757, EV: 1598, B6: 1589, DL: 1435

Because we’ll be using a logistic regression model in this tutorial, the variables dest and carrier will be converted to dummy variables.
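To make the idea of dummy variables concrete, here is a minimal base R sketch using model.matrix() on a hypothetical five-row data frame called toy. The recipe built later may create the indicator columns differently; this only illustrates what the encoding looks like.

```r
# Hypothetical carrier codes for five flights
toy <- data.frame(carrier = factor(c("UA", "DL", "UA", "B6", "DL")))

# model.matrix() expands the factor into 0/1 indicator columns;
# the first level (B6) acts as the reference and gets no column
dummies <- model.matrix(~ carrier, data = toy)

print(dummies)
```

A factor with k levels yields k - 1 indicator columns here, so a variable such as dest with many levels adds many columns to the model.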

However, some of these values do not occur very frequently and this could complicate our analysis. We’ll discuss specific steps later in this tutorial that we can add to our recipe to address this issue before modeling.
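One way to see the problem is to count how often each level occurs. The base R sketch below uses a hypothetical dest vector and a hypothetical cutoff of two observations. Pooling rare levels into "other" is shown as one possible remedy and is an assumption, not necessarily the step used later.

```r
# Hypothetical destinations; "ANC" and "LEX" each appear only once
dest <- factor(c("ATL", "ATL", "ATL", "LAX", "LAX", "ANC", "LEX"))

# Count observations per level and flag levels seen fewer than 2 times
counts <- table(dest)
rare <- names(counts)[counts < 2]
print(rare)

# One possible remedy (an assumption): pool rare levels into "other"
pooled <- as.character(dest)
pooled[pooled %in% rare] <- "other"
pooled <- factor(pooled)
print(table(pooled))
```

A level that occurs only a handful of times can be entirely absent from a training or test split, in which case its dummy column would contain only zeros.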