Hitters data preparation#

We illustrate the following regression methods on a data set called “Hitters”, which includes 20 variables and 322 observations of Major League Baseball players. The goal is to predict a baseball player’s salary on the basis of various features associated with his performance in the previous year. We don’t cover exploratory data analysis in this notebook.

Note that scikit-learn provides pipeline utilities for data preprocessing and feature engineering, which are considered best practice for data preparation (a sketch is shown below). However, since we use statsmodels as well as scikit-learn in some of our examples, we won’t build a preprocessing pipeline in this notebook.
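
For reference, here is a minimal sketch of what such a preprocessing pipeline could look like (hypothetical and not used in the rest of this notebook): it scales numerical columns and one-hot encodes the categorical columns with a ColumnTransformer; the column names and the Lasso estimator are only illustrative.

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# hypothetical sketch: scale numerical columns, one-hot encode categorical columns
numeric_features = ['AtBat', 'Hits', 'Walks']  # illustrative subset of the numerical columns
categorical_features = ['League', 'Division', 'NewLeague']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(drop='first'), categorical_features),
    ]
)

# the preprocessor could then be chained with any estimator, e.g. a Lasso regression
pipe = Pipeline(steps=[('preprocess', preprocessor), ('regressor', Lasso())])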

Import#

import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/kirenz/datasets/master/Hitters.csv")
df
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
0 293 66 1 30 29 14 1 293 66 1 30 29 14 A E 446 33 20 NaN A
1 315 81 7 24 38 39 14 3449 835 69 321 414 375 N W 632 43 10 475.0 N
2 479 130 18 66 72 76 3 1624 457 63 224 266 263 A W 880 82 14 480.0 A
3 496 141 20 65 78 37 11 5628 1575 225 828 838 354 N E 200 11 3 500.0 N
4 321 87 10 39 42 30 2 396 101 12 48 46 33 N E 805 40 4 91.5 N
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
317 497 127 7 65 48 37 5 2703 806 32 379 311 138 N E 325 9 3 700.0 N
318 492 136 5 76 50 94 12 5511 1511 39 897 451 875 A E 313 381 20 875.0 A
319 475 126 3 61 43 52 6 1700 433 7 217 93 146 A W 37 113 7 385.0 A
320 573 144 9 85 60 78 8 3198 857 97 470 420 332 A E 1314 131 12 960.0 A
321 631 170 9 77 44 31 11 4908 1457 30 775 357 249 A W 408 4 3 1000.0 A

322 rows × 20 columns

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 322 entries, 0 to 321
Data columns (total 20 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   AtBat      322 non-null    int64  
 1   Hits       322 non-null    int64  
 2   HmRun      322 non-null    int64  
 3   Runs       322 non-null    int64  
 4   RBI        322 non-null    int64  
 5   Walks      322 non-null    int64  
 6   Years      322 non-null    int64  
 7   CAtBat     322 non-null    int64  
 8   CHits      322 non-null    int64  
 9   CHmRun     322 non-null    int64  
 10  CRuns      322 non-null    int64  
 11  CRBI       322 non-null    int64  
 12  CWalks     322 non-null    int64  
 13  League     322 non-null    object 
 14  Division   322 non-null    object 
 15  PutOuts    322 non-null    int64  
 16  Assists    322 non-null    int64  
 17  Errors     322 non-null    int64  
 18  Salary     263 non-null    float64
 19  NewLeague  322 non-null    object 
dtypes: float64(1), int64(16), object(3)
memory usage: 50.4+ KB

Missing values#

Note that the salary is missing for some of the players:

print(df.isnull().sum())
AtBat         0
Hits          0
HmRun         0
Runs          0
RBI           0
Walks         0
Years         0
CAtBat        0
CHits         0
CHmRun        0
CRuns         0
CRBI          0
CWalks        0
League        0
Division      0
PutOuts       0
Assists       0
Errors        0
Salary       59
NewLeague     0
dtype: int64

We simply drop the missing cases:

# drop missing cases
df = df.dropna()

Create label and features#

Since we will use algorithms from scikit-learn, we need to encode our categorical features as one-hot numeric features (dummy variables):

dummies = pd.get_dummies(df[['League', 'Division','NewLeague']])
dummies.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 263 entries, 1 to 321
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   League_A     263 non-null    uint8
 1   League_N     263 non-null    uint8
 2   Division_E   263 non-null    uint8
 3   Division_W   263 non-null    uint8
 4   NewLeague_A  263 non-null    uint8
 5   NewLeague_N  263 non-null    uint8
dtypes: uint8(6)
memory usage: 3.6 KB
print(dummies.head())
   League_A  League_N  Division_E  Division_W  NewLeague_A  NewLeague_N
1         0         1           0           1            0            1
2         1         0           0           1            1            0
3         0         1           1           0            0            1
4         0         1           1           0            0            1
5         1         0           0           1            1            0
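
Note: depending on your pandas version, get_dummies may return boolean columns instead of uint8. If integer-coded dummies are preferred, the dtype argument can be set explicitly (an optional tweak, not required for the steps below):

# optional: request integer-coded dummy variables explicitly
dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']], dtype='uint8')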

Next, we create our label y:

y = df[['Salary']]

We drop the column with the outcome variable (Salary) and the categorical columns for which we already created dummy variables:

X_numerical = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')

Make a list of all numerical features (we need them later):

list_numerical = X_numerical.columns
list_numerical
Index(['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks', 'Years', 'CAtBat',
       'CHits', 'CHmRun', 'CRuns', 'CRBI', 'CWalks', 'PutOuts', 'Assists',
       'Errors'],
      dtype='object')
# Create the full feature matrix: numerical features plus one dummy variable
# per binary categorical variable (the second dummy of each pair is redundant)
X = pd.concat([X_numerical, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)
X.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 263 entries, 1 to 321
Data columns (total 19 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   AtBat        263 non-null    float64
 1   Hits         263 non-null    float64
 2   HmRun        263 non-null    float64
 3   Runs         263 non-null    float64
 4   RBI          263 non-null    float64
 5   Walks        263 non-null    float64
 6   Years        263 non-null    float64
 7   CAtBat       263 non-null    float64
 8   CHits        263 non-null    float64
 9   CHmRun       263 non-null    float64
 10  CRuns        263 non-null    float64
 11  CRBI         263 non-null    float64
 12  CWalks       263 non-null    float64
 13  PutOuts      263 non-null    float64
 14  Assists      263 non-null    float64
 15  Errors       263 non-null    float64
 16  League_N     263 non-null    uint8  
 17  Division_W   263 non-null    uint8  
 18  NewLeague_N  263 non-null    uint8  
dtypes: float64(16), uint8(3)
memory usage: 35.7 KB

Split data#

Split the data set into a training set (a random 70% of the observations) and a test set (the remaining 30%).

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)
X_train.head()
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks PutOuts Assists Errors League_N Division_W NewLeague_N
260 496.0 119.0 8.0 57.0 33.0 21.0 7.0 3358.0 882.0 36.0 365.0 280.0 165.0 155.0 371.0 29.0 1 1 1
92 317.0 78.0 7.0 35.0 35.0 32.0 1.0 317.0 78.0 7.0 35.0 35.0 32.0 45.0 122.0 26.0 0 0 0
137 343.0 103.0 6.0 48.0 36.0 40.0 15.0 4338.0 1193.0 70.0 581.0 421.0 325.0 211.0 56.0 13.0 0 0 0
90 314.0 83.0 13.0 39.0 46.0 16.0 5.0 1457.0 405.0 28.0 156.0 159.0 76.0 533.0 40.0 4.0 0 1 0
100 495.0 151.0 17.0 61.0 84.0 78.0 10.0 5624.0 1679.0 275.0 884.0 1015.0 709.0 1045.0 88.0 13.0 0 0 0
y_train
Salary
260 875.0
92 70.0
137 430.0
90 431.5
100 2460.0
... ...
274 200.0
196 587.5
159 200.0
17 175.0
162 75.0

184 rows × 1 columns
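
As a quick check of the split, we can inspect the shapes of the resulting sets; with 263 complete cases, a 70/30 split yields 184 training and 79 test observations:

# verify the sizes of the training and test sets
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)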

Data standardization#

  • Some of our models (like Lasso, Ridge or GAMs) perform best when all numerical features are centered around 0 and have variance of the same order.

  • To avoid data leakage, the standardization of numerical features should always be performed after splitting the data, and the scaler should be fitted on the training data only.

  • Accordingly, we obtain the necessary statistics (mean and standard deviation) from the training data and apply them to the test data as well. Note that we don’t standardize the dummy variables, which only take the values 0 and 1.

from sklearn.preprocessing import StandardScaler

# fit the scaler on the numerical training features only
scaler = StandardScaler().fit(X_train[list_numerical])

# apply the training-set mean and standard deviation to both sets
X_train[list_numerical] = scaler.transform(X_train[list_numerical])
X_test[list_numerical] = scaler.transform(X_test[list_numerical])
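
As an optional sanity check, the standardized training features should now have a mean of roughly 0 and a standard deviation of roughly 1 (the test set statistics will deviate slightly, since the scaler was fitted on the training data only):

# check the effect of the standardization on the training data
print(X_train[list_numerical].mean().round(2).head())
print(X_train[list_numerical].std().round(2).head())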

Create dataframes#

  • Some of our models can work directly with pandas dataframes (especially if we use statsmodels), so we combine label and features into one dataframe per set:

df_train = y_train.join(X_train)
df_test = y_test.join(X_test)
df_train
Salary AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks PutOuts Assists Errors League_N Division_W NewLeague_N
260 875.0 0.644577 0.257439 -0.456963 0.101010 -0.763917 -0.975959 -0.070553 0.298535 0.239063 -0.407836 0.011298 -0.163736 -0.361084 -0.482387 1.746229 3.022233 1 1 1
92 70.0 -0.592807 -0.671359 -0.572936 -0.778318 -0.685806 -0.458312 -1.306911 -1.001403 -0.969702 -0.746705 -0.957639 -0.898919 -0.844319 -0.851547 0.022276 2.574735 0 0 0
137 430.0 -0.413075 -0.105019 -0.688910 -0.258715 -0.646751 -0.081841 1.577925 0.717456 0.706633 -0.010542 0.645511 0.259369 0.220252 -0.294452 -0.434676 0.635577 0 0 0
90 431.5 -0.613545 -0.558091 0.122907 -0.618440 -0.256196 -1.211253 -0.482672 -0.514087 -0.478077 -0.501317 -0.602362 -0.526826 -0.684451 0.786178 -0.545452 -0.706917 0 1 0
100 2460.0 0.637665 0.982354 0.586803 0.260888 1.227914 1.706394 0.547626 1.267183 1.437305 2.384908 1.535171 2.041811 1.615457 2.504446 -0.213124 0.635577 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
274 200.0 0.824309 0.733164 0.470829 0.740521 0.954525 0.859335 -0.688732 -0.824858 -0.808834 -0.571428 -0.787341 -0.685866 -0.648118 3.427344 0.326910 1.232241 1 0 1
196 587.5 0.423369 0.461321 1.862516 0.500704 1.618469 0.482865 1.165805 1.354814 1.246368 1.625375 1.112362 1.516681 0.681687 -1.002566 -0.822392 -1.303581 0 1 0
159 200.0 1.474109 1.254197 1.746542 1.140215 2.126191 -0.458312 -0.894792 -0.522636 -0.520174 -0.068968 -0.528958 -0.322776 -0.662651 -0.633407 1.310048 0.933909 0 1 0
17 175.0 -1.470728 -1.396275 -1.152806 -1.217982 -1.740306 -1.258312 -0.482672 -0.932153 -0.933620 -0.770075 -0.869554 -0.934928 -0.818885 -0.660255 0.403069 1.083075 0 1 0
162 75.0 -1.643547 -1.554850 -1.152806 -1.657646 -1.701250 -1.211253 -0.894792 -1.053127 -1.020819 -0.805130 -1.007554 -0.973938 -0.895185 0.111623 -0.690846 -1.005249 0 1 1

184 rows × 20 columns
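
To illustrate why these combined dataframes are convenient, here is a hypothetical example of fitting a statsmodels formula model directly on df_train; the formula below is purely illustrative and not one of the models covered in this notebook:

import statsmodels.formula.api as smf

# illustrative only: ordinary least squares with a hand-picked formula
example_fit = smf.ols("Salary ~ AtBat + Hits + Walks + League_N", data=df_train).fit()
print(example_fit.summary())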