Hitters data preparation#
We illustrate several regression methods on a data set called "Hitters", which contains 20 variables and 322 observations on Major League Baseball players. The goal is to predict a player's salary on the basis of various features describing his performance in the previous year. We don't cover exploratory data analysis in this notebook.
Visit this documentation if you want to learn more about the data
Note that scikit-learn provides pipelines for data preprocessing and feature engineering, which are considered best practice for data preparation. However, since we use both scikit-learn and statsmodels in some of our examples, we won't build a preprocessing pipeline here.
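For reference, a minimal sketch of what such a pipeline could look like (shown on a small toy frame with made-up values, not the Hitters data): numerical columns are scaled and categorical columns one-hot encoded in a single `ColumnTransformer`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# toy data standing in for two Hitters columns (hypothetical values)
toy = pd.DataFrame({
    "Hits": [66, 81, 130],
    "League": ["A", "N", "A"],
})

# scale the numeric column, one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Hits"]),
    ("cat", OneHotEncoder(), ["League"]),
])

Z = preprocess.fit_transform(toy)
print(Z.shape)  # 3 rows; 1 scaled column + 2 dummy columns
```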
Import#
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/kirenz/datasets/master/Hitters.csv")
df
AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | League | Division | PutOuts | Assists | Errors | Salary | NewLeague | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 293 | 66 | 1 | 30 | 29 | 14 | 1 | 293 | 66 | 1 | 30 | 29 | 14 | A | E | 446 | 33 | 20 | NaN | A |
1 | 315 | 81 | 7 | 24 | 38 | 39 | 14 | 3449 | 835 | 69 | 321 | 414 | 375 | N | W | 632 | 43 | 10 | 475.0 | N |
2 | 479 | 130 | 18 | 66 | 72 | 76 | 3 | 1624 | 457 | 63 | 224 | 266 | 263 | A | W | 880 | 82 | 14 | 480.0 | A |
3 | 496 | 141 | 20 | 65 | 78 | 37 | 11 | 5628 | 1575 | 225 | 828 | 838 | 354 | N | E | 200 | 11 | 3 | 500.0 | N |
4 | 321 | 87 | 10 | 39 | 42 | 30 | 2 | 396 | 101 | 12 | 48 | 46 | 33 | N | E | 805 | 40 | 4 | 91.5 | N |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
317 | 497 | 127 | 7 | 65 | 48 | 37 | 5 | 2703 | 806 | 32 | 379 | 311 | 138 | N | E | 325 | 9 | 3 | 700.0 | N |
318 | 492 | 136 | 5 | 76 | 50 | 94 | 12 | 5511 | 1511 | 39 | 897 | 451 | 875 | A | E | 313 | 381 | 20 | 875.0 | A |
319 | 475 | 126 | 3 | 61 | 43 | 52 | 6 | 1700 | 433 | 7 | 217 | 93 | 146 | A | W | 37 | 113 | 7 | 385.0 | A |
320 | 573 | 144 | 9 | 85 | 60 | 78 | 8 | 3198 | 857 | 97 | 470 | 420 | 332 | A | E | 1314 | 131 | 12 | 960.0 | A |
321 | 631 | 170 | 9 | 77 | 44 | 31 | 11 | 4908 | 1457 | 30 | 775 | 357 | 249 | A | W | 408 | 4 | 3 | 1000.0 | A |
322 rows × 20 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 322 entries, 0 to 321
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 AtBat 322 non-null int64
1 Hits 322 non-null int64
2 HmRun 322 non-null int64
3 Runs 322 non-null int64
4 RBI 322 non-null int64
5 Walks 322 non-null int64
6 Years 322 non-null int64
7 CAtBat 322 non-null int64
8 CHits 322 non-null int64
9 CHmRun 322 non-null int64
10 CRuns 322 non-null int64
11 CRBI 322 non-null int64
12 CWalks 322 non-null int64
13 League 322 non-null object
14 Division 322 non-null object
15 PutOuts 322 non-null int64
16 Assists 322 non-null int64
17 Errors 322 non-null int64
18 Salary 263 non-null float64
19 NewLeague 322 non-null object
dtypes: float64(1), int64(16), object(3)
memory usage: 50.4+ KB
Missing values#
Note that the salary is missing for some of the players:
print(df.isnull().sum())
AtBat 0
Hits 0
HmRun 0
Runs 0
RBI 0
Walks 0
Years 0
CAtBat 0
CHits 0
CHmRun 0
CRuns 0
CRBI 0
CWalks 0
League 0
Division 0
PutOuts 0
Assists 0
Errors 0
Salary 59
NewLeague 0
dtype: int64
We simply drop the missing cases:
# drop missing cases
df = df.dropna()
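Dropping rows is the simplest option; an alternative would be to impute the missing salaries, e.g. with the median. A small sketch on a toy series (hypothetical values, not the Hitters data):

```python
import pandas as pd

# toy series with one missing value
s = pd.Series([475.0, None, 480.0, 500.0])

# fill missing entries with the median of the observed values
s_filled = s.fillna(s.median())
print(s_filled.isna().sum())  # 0 missing values remain
```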
Create label and features#
Since we will use algorithms from scikit-learn, we need to encode our categorical features as one-hot numeric features (dummy variables):
dummies = pd.get_dummies(df[['League', 'Division','NewLeague']])
dummies.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 263 entries, 1 to 321
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 League_A 263 non-null uint8
1 League_N 263 non-null uint8
2 Division_E 263 non-null uint8
3 Division_W 263 non-null uint8
4 NewLeague_A 263 non-null uint8
5 NewLeague_N 263 non-null uint8
dtypes: uint8(6)
memory usage: 3.6 KB
print(dummies.head())
League_A League_N Division_E Division_W NewLeague_A NewLeague_N
1 0 1 0 1 0 1
2 1 0 0 1 1 0
3 0 1 1 0 0 1
4 0 1 1 0 0 1
5 1 0 0 1 1 0
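Each pair of dummy columns is redundant (e.g. `League_A` is always `1 - League_N`), which is why we later keep only one column per variable. `pd.get_dummies` can drop one level directly via `drop_first=True`; a small sketch on a toy frame with the same categorical levels:

```python
import pandas as pd

# toy frame with the same levels as the Hitters categoricals
toy = pd.DataFrame({"League": ["A", "N", "A"], "Division": ["E", "W", "W"]})

# drop_first=True keeps one dummy per variable, avoiding redundant columns
d = pd.get_dummies(toy, drop_first=True)
print(list(d.columns))  # ['League_N', 'Division_W']
```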
Next, we create our label y:
y = df[['Salary']]
We drop the column with the outcome variable (Salary), and categorical columns for which we already created dummy variables:
X_numerical = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')
Make a list of all numerical features (we need them later):
list_numerical = X_numerical.columns
list_numerical
Index(['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks', 'Years', 'CAtBat',
'CHits', 'CHmRun', 'CRuns', 'CRBI', 'CWalks', 'PutOuts', 'Assists',
'Errors'],
dtype='object')
# Create all features
X = pd.concat([X_numerical, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)
X.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 263 entries, 1 to 321
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 AtBat 263 non-null float64
1 Hits 263 non-null float64
2 HmRun 263 non-null float64
3 Runs 263 non-null float64
4 RBI 263 non-null float64
5 Walks 263 non-null float64
6 Years 263 non-null float64
7 CAtBat 263 non-null float64
8 CHits 263 non-null float64
9 CHmRun 263 non-null float64
10 CRuns 263 non-null float64
11 CRBI 263 non-null float64
12 CWalks 263 non-null float64
13 PutOuts 263 non-null float64
14 Assists 263 non-null float64
15 Errors 263 non-null float64
16 League_N 263 non-null uint8
17 Division_W 263 non-null uint8
18 NewLeague_N 263 non-null uint8
dtypes: float64(16), uint8(3)
memory usage: 35.7 KB
Split data#
Split the data set into a train and a test set, using 70% of the data for training and the remaining 30% for testing. Note that train_test_split shuffles the rows before splitting, so the training set is a random 70% of the data, not the first 70% of rows.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)
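Fixing `random_state` makes the shuffled split reproducible: the same seed always yields the same partition. A small sketch on a toy list:

```python
from sklearn.model_selection import train_test_split

# splitting the same data twice with the same seed gives identical splits
X_toy = list(range(10))
train1, test1 = train_test_split(X_toy, test_size=0.3, random_state=10)
train2, test2 = train_test_split(X_toy, test_size=0.3, random_state=10)

print(train1 == train2 and test1 == test2)  # True
print(len(test1))  # 3 of 10 observations in the test set
```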
X_train.head()
AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | PutOuts | Assists | Errors | League_N | Division_W | NewLeague_N | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
260 | 496.0 | 119.0 | 8.0 | 57.0 | 33.0 | 21.0 | 7.0 | 3358.0 | 882.0 | 36.0 | 365.0 | 280.0 | 165.0 | 155.0 | 371.0 | 29.0 | 1 | 1 | 1 |
92 | 317.0 | 78.0 | 7.0 | 35.0 | 35.0 | 32.0 | 1.0 | 317.0 | 78.0 | 7.0 | 35.0 | 35.0 | 32.0 | 45.0 | 122.0 | 26.0 | 0 | 0 | 0 |
137 | 343.0 | 103.0 | 6.0 | 48.0 | 36.0 | 40.0 | 15.0 | 4338.0 | 1193.0 | 70.0 | 581.0 | 421.0 | 325.0 | 211.0 | 56.0 | 13.0 | 0 | 0 | 0 |
90 | 314.0 | 83.0 | 13.0 | 39.0 | 46.0 | 16.0 | 5.0 | 1457.0 | 405.0 | 28.0 | 156.0 | 159.0 | 76.0 | 533.0 | 40.0 | 4.0 | 0 | 1 | 0 |
100 | 495.0 | 151.0 | 17.0 | 61.0 | 84.0 | 78.0 | 10.0 | 5624.0 | 1679.0 | 275.0 | 884.0 | 1015.0 | 709.0 | 1045.0 | 88.0 | 13.0 | 0 | 0 | 0 |
y_train
Salary | |
---|---|
260 | 875.0 |
92 | 70.0 |
137 | 430.0 |
90 | 431.5 |
100 | 2460.0 |
... | ... |
274 | 200.0 |
196 | 587.5 |
159 | 200.0 |
17 | 175.0 |
162 | 75.0 |
184 rows × 1 columns
Data standardization#
Some of our models (such as Lasso, Ridge or GAMs) perform best when all numerical features are centered around 0 and have variance of the same order.
To avoid data leakage, the standardization of numerical features should always be performed after splitting the data, using statistics computed from the training data only.
That is, we obtain the necessary statistics for our features (mean and standard deviation) from the training data and apply them to the test data as well. Note that we don't standardize our dummy variables (which only take the values 0 and 1).
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train[list_numerical])
X_train[list_numerical] = scaler.transform(X_train[list_numerical])
X_test[list_numerical] = scaler.transform(X_test[list_numerical])
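As a quick sanity check (sketched on toy numbers, not the Hitters data): after fitting the scaler on the training split, the transformed training features have mean 0 and unit variance, while the test data is transformed with the same training statistics and thus generally does not.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy "train" and "test" splits (hypothetical values)
X_tr = np.array([[1.0], [2.0], [3.0], [4.0]])
X_te = np.array([[2.0], [10.0]])

scaler = StandardScaler().fit(X_tr)   # statistics from training data only
Z_tr = scaler.transform(X_tr)
Z_te = scaler.transform(X_te)         # same mean/std applied to test data

print(np.isclose(Z_tr.mean(), 0.0))  # True: training data centered at 0
print(np.isclose(Z_tr.std(), 1.0))   # True: unit variance on training data
```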
Create dataframes#
Some of our models work with pandas DataFrames (especially if we use statsmodels), so we join the label and features into train and test DataFrames:
df_train = y_train.join(X_train)
df_test = y_test.join(X_test)
df_train
Salary | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | PutOuts | Assists | Errors | League_N | Division_W | NewLeague_N | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
260 | 875.0 | 0.644577 | 0.257439 | -0.456963 | 0.101010 | -0.763917 | -0.975959 | -0.070553 | 0.298535 | 0.239063 | -0.407836 | 0.011298 | -0.163736 | -0.361084 | -0.482387 | 1.746229 | 3.022233 | 1 | 1 | 1 |
92 | 70.0 | -0.592807 | -0.671359 | -0.572936 | -0.778318 | -0.685806 | -0.458312 | -1.306911 | -1.001403 | -0.969702 | -0.746705 | -0.957639 | -0.898919 | -0.844319 | -0.851547 | 0.022276 | 2.574735 | 0 | 0 | 0 |
137 | 430.0 | -0.413075 | -0.105019 | -0.688910 | -0.258715 | -0.646751 | -0.081841 | 1.577925 | 0.717456 | 0.706633 | -0.010542 | 0.645511 | 0.259369 | 0.220252 | -0.294452 | -0.434676 | 0.635577 | 0 | 0 | 0 |
90 | 431.5 | -0.613545 | -0.558091 | 0.122907 | -0.618440 | -0.256196 | -1.211253 | -0.482672 | -0.514087 | -0.478077 | -0.501317 | -0.602362 | -0.526826 | -0.684451 | 0.786178 | -0.545452 | -0.706917 | 0 | 1 | 0 |
100 | 2460.0 | 0.637665 | 0.982354 | 0.586803 | 0.260888 | 1.227914 | 1.706394 | 0.547626 | 1.267183 | 1.437305 | 2.384908 | 1.535171 | 2.041811 | 1.615457 | 2.504446 | -0.213124 | 0.635577 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
274 | 200.0 | 0.824309 | 0.733164 | 0.470829 | 0.740521 | 0.954525 | 0.859335 | -0.688732 | -0.824858 | -0.808834 | -0.571428 | -0.787341 | -0.685866 | -0.648118 | 3.427344 | 0.326910 | 1.232241 | 1 | 0 | 1 |
196 | 587.5 | 0.423369 | 0.461321 | 1.862516 | 0.500704 | 1.618469 | 0.482865 | 1.165805 | 1.354814 | 1.246368 | 1.625375 | 1.112362 | 1.516681 | 0.681687 | -1.002566 | -0.822392 | -1.303581 | 0 | 1 | 0 |
159 | 200.0 | 1.474109 | 1.254197 | 1.746542 | 1.140215 | 2.126191 | -0.458312 | -0.894792 | -0.522636 | -0.520174 | -0.068968 | -0.528958 | -0.322776 | -0.662651 | -0.633407 | 1.310048 | 0.933909 | 0 | 1 | 0 |
17 | 175.0 | -1.470728 | -1.396275 | -1.152806 | -1.217982 | -1.740306 | -1.258312 | -0.482672 | -0.932153 | -0.933620 | -0.770075 | -0.869554 | -0.934928 | -0.818885 | -0.660255 | 0.403069 | 1.083075 | 0 | 1 | 0 |
162 | 75.0 | -1.643547 | -1.554850 | -1.152806 | -1.657646 | -1.701250 | -1.211253 | -0.894792 | -1.053127 | -1.020819 | -0.805130 | -1.007554 | -0.973938 | -0.895185 | 0.111623 | -0.690846 | -1.005249 | 0 | 1 | 1 |
184 rows × 20 columns