Chapter 16 Data splitting

To get started, let’s split this single dataset into two: a training set and a testing set. We’ll keep most of the rows in the original dataset (subset chosen randomly) in the training set. The training data will be used to fit the model, and the testing set will be used to measure model performance.

To do this, we can use the rsample package (included in tidymodels) to create an object that contains the information on how to split the data, and then two more rsample functions to create data frames for the training and testing sets:

# Fix the random numbers by setting the seed 
# This enables the analysis to be reproducible 
# when random numbers are used 
set.seed(555)

# Put 3/4 of the data into the training set 
data_split <- initial_split(flight_data, 
                            prop = 3/4)

# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)