Chapter 15 Data preparation
Let’s use the nycflights13 data to predict whether a plane arrives at least 30 minutes late. This data set contains information on 325,819 flights departing near New York City in 2013. It also contains weather data (hourly meteorological data for LGA, JFK and EWR).
Let’s start by loading the data and making a few changes to the variables:
# flights and weather come from nycflights13; the verbs come from dplyr
library(dplyr)
library(nycflights13)

flight_data_all <-
  flights %>%
  mutate(
    # Convert the arrival delay to a factor
    arr_delay = ifelse(arr_delay >= 30, "late", "on_time"),
    arr_delay = factor(arr_delay),
    # We will use the date (not date-time)
    # in the recipe below
    date = as.Date(time_hour)
  ) %>%
  # Include weather data
  inner_join(weather, by = c("origin", "time_hour")) %>%
  # Only retain the specific columns we will use
  select(dep_time, flight, origin,
         dest, air_time, distance,
         carrier, date, arr_delay, time_hour) %>%
  # Exclude missing data
  na.omit() %>%
  # For creating models, it is
  # better to have qualitative columns
  # encoded as factors (instead of character strings)
  mutate(across(where(is.character), as.factor))
To speed up later calculations we only use a sample of the data:
set.seed(123)
flight_data <- sample_n(flight_data_all, 10000)
We can see that around 16% of the flights in this sample arrived at least 30 minutes late:
flight_data %>%
  count(arr_delay) %>%
  mutate(prop = n/sum(n))
## # A tibble: 2 x 3
## arr_delay n prop
## <fct> <int> <dbl>
## 1 late 1589 0.159
## 2 on_time 8411 0.841
15.1 Data overview
Before we start building up our recipe, let’s take a quick look at a few specific variables that will be important for both preprocessing and modeling.
First, notice that the variable we created called arr_delay is a factor variable; it is important that our outcome variable for training a classification model (at least a logistic regression model) is a factor.
glimpse(flight_data)
## Rows: 10,000
## Columns: 10
## $ dep_time <int> 825, 657, 1835, 1827, 1600, 1039, 1142, 1723, 1446, 2140, 2…
## $ flight <int> 120, 4122, 4517, 373, 4502, 4589, 4646, 4195, 3588, 3660, 1…
## $ origin <fct> JFK, EWR, LGA, JFK, EWR, LGA, LGA, EWR, LGA, LGA, JFK, EWR,…
## $ dest <fct> LAX, SDF, CRW, CLT, BNA, DTW, MSP, BNA, MSP, BNA, LAX, DCA,…
## $ air_time <dbl> 316, 104, 75, 90, 119, 93, 139, 130, 138, 104, 323, 36, 126…
## $ distance <dbl> 2475, 642, 444, 541, 748, 502, 1020, 748, 1020, 764, 2475, …
## $ carrier <fct> DL, EV, MQ, US, EV, MQ, MQ, EV, MQ, MQ, AA, EV, EV, EV, EV,…
## $ date <date> 2013-05-03, 2013-03-04, 2013-02-20, 2013-04-02, 2013-06-13…
## $ arr_delay <fct> on_time, on_time, on_time, on_time, late, on_time, on_time,…
## $ time_hour <dttm> 2013-05-03 08:00:00, 2013-03-04 07:00:00, 2013-02-20 18:00…
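A quick way to confirm how an outcome is encoded is to inspect its class and levels. The following is a minimal sketch in base R on a small made-up data frame; `toy` and `arr_delay_min` are hypothetical names and values, not part of nycflights13. It applies the same 30-minute cutoff used above and shows that factor levels are sorted alphabetically by default, so "late" becomes the first level.

```r
# Toy data (hypothetical values, not taken from nycflights13)
toy <- data.frame(
  arr_delay_min = c(45, 5, -3, 60, 12, 31, 0, 90)
)

# Same rule as above: 30 minutes or more counts as "late"
toy$arr_delay <- factor(ifelse(toy$arr_delay_min >= 30, "late", "on_time"))

# The outcome is a factor, not a number or a character string
is.factor(toy$arr_delay)

# Levels are sorted alphabetically unless specified otherwise
levels(toy$arr_delay)

# Class counts
table(toy$arr_delay)
```

If a different level order is needed, `factor()` accepts a `levels` argument that sets it explicitly.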
Second, there are two variables that we don’t want to use as predictors in our model, but that we would like to retain as identification variables that can be used to troubleshoot poorly predicted data points. These are flight, a numeric value, and time_hour, a date-time value.
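One way to keep such columns in the data without using them as predictors is to give them a non-predictor role in a recipe. The sketch below assumes the recipes package is installed and uses a small made-up data frame (`toy` and `toy_rec` are hypothetical names); the role label "ID" is an arbitrary string chosen for illustration.

```r
library(recipes)

# Toy stand-in for flight_data (hypothetical values)
toy <- data.frame(
  arr_delay = factor(c("late", "on_time", "on_time", "late")),
  flight    = c(120L, 4122L, 4517L, 373L),
  distance  = c(2475, 642, 444, 541),
  time_hour = as.POSIXct(
    c("2013-05-03 08:00:00", "2013-03-04 07:00:00",
      "2013-02-20 18:00:00", "2013-04-02 18:00:00"),
    tz = "America/New_York"
  )
)

# Start with every other column as a predictor, then
# reassign the two identification columns to the role "ID"
toy_rec <- recipe(arr_delay ~ ., data = toy) %>%
  update_role(flight, time_hour, new_role = "ID")

# One row per variable, showing its current role
summary(toy_rec)
```

Columns with a custom role stay in the data but are not treated as predictors or outcomes, so they remain available for inspecting individual rows after prediction.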
Third, there are 97 flight destinations contained in dest and 15 distinct carriers.
flight_data %>%
  skimr::skim(dest, carrier)
Name | Piped data |
Number of rows | 10000 |
Number of columns | 10 |
_______________________ | |
Column type frequency: | |
factor | 2 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
dest | 0 | 1 | FALSE | 97 | ATL: 523, LAX: 506, BOS: 496, ORD: 482 |
carrier | 0 | 1 | FALSE | 15 | UA: 1757, EV: 1598, B6: 1589, DL: 1435 |
Because we’ll be using a logistic regression model in this tutorial, the variables dest and carrier will be converted to dummy variables.
However, some of these values do not occur very frequently and this could complicate our analysis. We’ll discuss specific steps later in this tutorial that we can add to our recipe to address this issue before modeling.
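To make the concern concrete, here is a minimal sketch in base R on made-up data (`toy`, `dummies`, `train_rows`, and `zero_var` are hypothetical names). It uses `model.matrix()` for the dummy encoding, which is not necessarily the approach the recipe will take, and shows how a rarely observed level can end up as a column of all zeros in a subset of the data.

```r
# Toy data (hypothetical values): "OO" is a rare carrier level
toy <- data.frame(
  carrier = factor(c("UA", "UA", "UA", "EV", "EV",
                     "EV", "EV", "UA", "UA", "OO"))
)

# Level frequencies, most common first
sort(table(toy$carrier), decreasing = TRUE)

# Dummy (indicator) encoding: one column per non-reference level.
# The first level ("EV") is the reference; drop the intercept column.
dummies <- model.matrix(~ carrier, data = toy)[, -1, drop = FALSE]
colnames(dummies)
colSums(dummies)

# Suppose the training subset happens to exclude the only "OO" row
train_rows <- 1:9
train_dummies <- dummies[train_rows, , drop = FALSE]

# Columns with a single unique value carry no information for the model
is_constant <- apply(train_dummies, 2, function(x) length(unique(x)) == 1)
zero_var <- colnames(train_dummies)[is_constant]
zero_var
```

A constant column like this cannot help a logistic regression distinguish late flights from on-time ones, which is why infrequent levels need attention before modeling.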