Chapter 5 Data understanding
5.1 Import data
library(tidyverse)
# URL of the raw CSV file
LINK <- "https://raw.githubusercontent.com/kirenz/datasets/master/housing.csv"
# Read the CSV into a tibble
housing_df <- read_csv(LINK)
5.2 Data splitting
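library(tidymodels)
# Fix the random seed so the split is reproducible
set.seed(100)
# Put 3/4 of the rows into training, stratified by binned median_income
new_split <- initial_split(housing_df,
                           prop = 3/4,
                           strata = median_income,
                           breaks = 5)
# Extract the training and test sets from the split object
new_train <- training(new_split)
new_test <- testing(new_split)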
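To see what the stratified split does, it can help to run it on a small simulated data set. The following is only a minimal sketch: `toy_df` and its two columns are hypothetical stand-ins for `housing_df`, not the real data, and the numbers used to simulate them are arbitrary.

```r
library(tidymodels)

set.seed(100)

# Hypothetical stand-in for housing_df (simulated, not the real housing data)
toy_df <- tibble(
  median_income = rlnorm(1000, meanlog = 1.2, sdlog = 0.5),
  median_house_value = 50000 + 40000 * median_income + rnorm(1000, sd = 30000)
)

# Same call pattern as above: 3/4 training, stratified on binned median_income
toy_split <- initial_split(toy_df,
                           prop = 3/4,
                           strata = median_income,
                           breaks = 5)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Every row ends up in exactly one of the two sets
nrow(toy_train)
nrow(toy_test)

# With stratification, the income distributions should look similar
summary(toy_train$median_income)
summary(toy_test$median_income)
```

Because the strata are built from bins of `median_income`, the two summaries should be close to each other, which is the point of stratifying on a skewed numeric variable.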
5.3 Validation set
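Let’s build a validation set to evaluate two simple linear regression models with different predictors.
First, we split the training data into 5 cross-validation folds with the function vfold_cv(). Each fold holds out roughly one fifth of the training data for assessment and uses the rest for fitting.
As in the initial split, we use stratified sampling on median_income: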
set.seed(100)
cv_folds <-
  vfold_cv(new_train,
           v = 5,
           strata = median_income,
           breaks = 5)
cv_folds
## # 5-fold cross-validation using stratification
## # A tibble: 5 x 2
## splits id
## <list> <chr>
## 1 <split [12.4K/3.1K]> Fold1
## 2 <split [12.4K/3.1K]> Fold2
## 3 <split [12.4K/3.1K]> Fold3
## 4 <split [12.4K/3.1K]> Fold4
## 5 <split [12.4K/3.1K]> Fold5
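The folds can then be used to compare the two candidate models. The sketch below is one possible way to do this with `fit_resamples()` and `collect_metrics()`. It runs on simulated data so that it is self-contained. The data frame `toy_train`, the column `housing_median_age`, and the choice of predictors are assumptions made for illustration, not the models used in this chapter.

```r
library(tidymodels)

set.seed(100)

# Hypothetical stand-in for new_train (simulated; column names are assumptions)
toy_train <- tibble(
  median_income = rlnorm(800, meanlog = 1.2, sdlog = 0.5),
  housing_median_age = runif(800, min = 1, max = 52),
  median_house_value = 50000 + 40000 * median_income + rnorm(800, sd = 30000)
)

# Five stratified folds, built the same way as cv_folds
toy_folds <- vfold_cv(toy_train,
                      v = 5,
                      strata = median_income,
                      breaks = 5)

# Each split holds an analysis part (for fitting) and an assessment part
first_split <- toy_folds$splits[[1]]
nrow(analysis(first_split))
nrow(assessment(first_split))

# One model specification, reused with two different formulas
lm_spec <- linear_reg() %>%
  set_engine("lm")

# Candidate 1: income as the only predictor
res_income <- fit_resamples(lm_spec,
                            median_house_value ~ median_income,
                            resamples = toy_folds)

# Candidate 2: housing age as the only predictor
res_age <- fit_resamples(lm_spec,
                         median_house_value ~ housing_median_age,
                         resamples = toy_folds)

# Average RMSE and R-squared across the five assessment sets
metrics_income <- collect_metrics(res_income)
metrics_age <- collect_metrics(res_age)
metrics_income
metrics_age
```

In the simulated data only `median_income` drives the outcome, so the first model should show the lower RMSE. With the real `new_train` and `cv_folds` the same pattern applies, but the results depend on the predictors chosen.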