{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Hitters data preparation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We illustrate the following regression methods on a data set called \"Hitters\", which includes 20 variables and 322 observations of major league baseball players. The goal is to predict a baseball player’s salary on the basis of various features associated with performance in the previous year. We don't cover exploratory data analysis in this notebook.\n", "\n", "- Visit [this documentation](https://cran.r-project.org/web/packages/ISLR/ISLR.pdf) if you want to learn more about the data.\n", "\n", "Note that scikit-learn provides [**pipelines**](https://kirenz.github.io/ds-python/docs/data.html#pipelines-in-scikit-learn) for data preprocessing and feature engineering, which are considered best practice for data preparation. However, since we use scikit-learn as well as statsmodels in some of our examples, we won't create a data preprocessing pipeline in this example.\n", "\n", "## Import" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv(\"https://raw.githubusercontent.com/kirenz/datasets/master/Hitters.csv\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun \\\n", "0 293 66 1 30 29 14 1 293 66 1 \n", "1 315 81 7 24 38 39 14 3449 835 69 \n", "2 479 130 18 66 72 76 3 1624 457 63 \n", "3 496 141 20 65 78 37 11 5628 1575 225 \n", "4 321 87 10 39 42 30 2 396 101 12 \n", ".. ... ... ... ... ... ... ... ... ... ... \n", "317 497 127 7 65 48 37 5 2703 806 32 \n", "318 492 136 5 76 50 94 12 5511 1511 39 \n", "319 475 126 3 61 43 52 6 1700 433 7 \n", "320 573 144 9 85 60 78 8 3198 857 97 \n", "321 631 170 9 77 44 31 11 4908 1457 30 \n", "\n", " CRuns CRBI CWalks League Division PutOuts Assists Errors Salary \\\n", "0 30 29 14 A E 446 33 20 NaN \n", "1 321 414 375 N W 632 43 10 475.0 \n", "2 224 266 263 A W 880 82 14 480.0 \n", "3 828 838 354 N E 200 11 3 500.0 \n", "4 48 46 33 N E 805 40 4 91.5 \n", ".. ... ... ... ... ... ... ... ... ... \n", "317 379 311 138 N E 325 9 3 700.0 \n", "318 897 451 875 A E 313 381 20 875.0 \n", "319 217 93 146 A W 37 113 7 385.0 \n", "320 470 420 332 A E 1314 131 12 960.0 \n", "321 775 357 249 A W 408 4 3 1000.0 \n", "\n", " NewLeague \n", "0 A \n", "1 N \n", "2 A \n", "3 N \n", "4 N \n", ".. ... 
\n", "317 N \n", "318 A \n", "319 A \n", "320 A \n", "321 A \n", "\n", "[322 rows x 20 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 322 entries, 0 to 321\n", "Data columns (total 20 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 AtBat 322 non-null int64 \n", " 1 Hits 322 non-null int64 \n", " 2 HmRun 322 non-null int64 \n", " 3 Runs 322 non-null int64 \n", " 4 RBI 322 non-null int64 \n", " 5 Walks 322 non-null int64 \n", " 6 Years 322 non-null int64 \n", " 7 CAtBat 322 non-null int64 \n", " 8 CHits 322 non-null int64 \n", " 9 CHmRun 322 non-null int64 \n", " 10 CRuns 322 non-null int64 \n", " 11 CRBI 322 non-null int64 \n", " 12 CWalks 322 non-null int64 \n", " 13 League 322 non-null object \n", " 14 Division 322 non-null object \n", " 15 PutOuts 322 non-null int64 \n", " 16 Assists 322 non-null int64 \n", " 17 Errors 322 non-null int64 \n", " 18 Salary 263 non-null float64\n", " 19 NewLeague 322 non-null object \n", "dtypes: float64(1), int64(16), object(3)\n", "memory usage: 50.4+ KB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Missing values\n", "\n", "Note that the salary is missing for some of the players:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AtBat 0\n", "Hits 0\n", "HmRun 0\n", "Runs 0\n", "RBI 0\n", "Walks 0\n", "Years 0\n", "CAtBat 0\n", "CHits 0\n", "CHmRun 0\n", "CRuns 0\n", "CRBI 0\n", "CWalks 0\n", "League 0\n", "Division 0\n", "PutOuts 0\n", "Assists 0\n", "Errors 0\n", "Salary 59\n", "NewLeague 0\n", "dtype: int64\n" ] } ], "source": [ "print(df.isnull().sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We simply drop the 
missing cases: " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# drop missing cases\n", "df = df.dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create label and features\n", "\n", "Since we will use algorithms from scikit-learn, we need to encode our categorical features as one-hot numeric features (dummy variables):" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 263 entries, 1 to 321\n", "Data columns (total 6 columns):\n", " # Column Non-Null Count Dtype\n", "--- ------ -------------- -----\n", " 0 League_A 263 non-null uint8\n", " 1 League_N 263 non-null uint8\n", " 2 Division_E 263 non-null uint8\n", " 3 Division_W 263 non-null uint8\n", " 4 NewLeague_A 263 non-null uint8\n", " 5 NewLeague_N 263 non-null uint8\n", "dtypes: uint8(6)\n", "memory usage: 3.6 KB\n" ] } ], "source": [ "dummies.info()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " League_A League_N Division_E Division_W NewLeague_A NewLeague_N\n", "1 0 1 0 1 0 1\n", "2 1 0 0 1 1 0\n", "3 0 1 1 0 0 1\n", "4 0 1 1 0 0 1\n", "5 1 0 0 1 1 0\n" ] } ], "source": [ "print(dummies.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we create our label y:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "y = df[['Salary']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We drop the column with the outcome variable (Salary) and the categorical columns for which we already created dummy variables:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "X_numerical = 
df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make a list of all numerical features (we need them later):" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks', 'Years', 'CAtBat',\n", " 'CHits', 'CHmRun', 'CRuns', 'CRBI', 'CWalks', 'PutOuts', 'Assists',\n", " 'Errors'],\n", " dtype='object')" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list_numerical = X_numerical.columns\n", "list_numerical" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 263 entries, 1 to 321\n", "Data columns (total 19 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 AtBat 263 non-null float64\n", " 1 Hits 263 non-null float64\n", " 2 HmRun 263 non-null float64\n", " 3 Runs 263 non-null float64\n", " 4 RBI 263 non-null float64\n", " 5 Walks 263 non-null float64\n", " 6 Years 263 non-null float64\n", " 7 CAtBat 263 non-null float64\n", " 8 CHits 263 non-null float64\n", " 9 CHmRun 263 non-null float64\n", " 10 CRuns 263 non-null float64\n", " 11 CRBI 263 non-null float64\n", " 12 CWalks 263 non-null float64\n", " 13 PutOuts 263 non-null float64\n", " 14 Assists 263 non-null float64\n", " 15 Errors 263 non-null float64\n", " 16 League_N 263 non-null uint8 \n", " 17 Division_W 263 non-null uint8 \n", " 18 NewLeague_N 263 non-null uint8 \n", "dtypes: float64(16), uint8(3)\n", "memory usage: 35.7 KB\n" ] } ], "source": [ "# Create all features\n", "X = pd.concat([X_numerical, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)\n", "X.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Split the 
data set into a training set and a test set, using a random 70% of the observations for training and the remaining 30% for testing. Note that train_test_split shuffles the rows by default, and random_state makes the split reproducible." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun \\\n", "260 496.0 119.0 8.0 57.0 33.0 21.0 7.0 3358.0 882.0 36.0 \n", "92 317.0 78.0 7.0 35.0 35.0 32.0 1.0 317.0 78.0 7.0 \n", "137 343.0 103.0 6.0 48.0 36.0 40.0 15.0 4338.0 1193.0 70.0 \n", "90 314.0 83.0 13.0 39.0 46.0 16.0 5.0 1457.0 405.0 28.0 \n", "100 495.0 151.0 17.0 61.0 84.0 78.0 10.0 5624.0 1679.0 275.0 \n", "\n", " CRuns CRBI CWalks PutOuts Assists Errors League_N Division_W \\\n", "260 365.0 280.0 165.0 155.0 371.0 29.0 1 1 \n", "92 35.0 35.0 32.0 45.0 122.0 26.0 0 0 \n", "137 581.0 421.0 325.0 211.0 56.0 13.0 0 0 \n", "90 156.0 159.0 76.0 533.0 40.0 4.0 0 1 \n", "100 884.0 1015.0 709.0 1045.0 88.0 13.0 0 0 \n", "\n", " NewLeague_N \n", "260 1 \n", "92 0 \n", "137 0 \n", "90 0 \n", "100 0 " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.head()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " Salary\n", "260 875.0\n", "92 70.0\n", "137 430.0\n", "90 431.5\n", "100 2460.0\n", ".. ...\n", "274 200.0\n", "196 587.5\n", "159 200.0\n", "17 175.0\n", "162 75.0\n", "\n", "[184 rows x 1 columns]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data standardization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Some of our models (like Lasso, Ridge or GAMs) perform best when all numerical features are centered around 0 and have variances of the same order of magnitude.\n", "- To avoid [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)), numerical features should always be standardized after the data is split, using statistics computed from the training data only.\n", "- We therefore obtain the necessary statistics for our features (mean and standard deviation) from the training data and apply them to the test data as well. Note that we don't standardize our dummy variables (which only have values of 0 or 1)." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import StandardScaler\n", "\n", "# fit the scaler on the training data only\n", "scaler = StandardScaler().fit(X_train[list_numerical])\n", "\n", "# apply the training mean and standard deviation to both sets\n", "X_train[list_numerical] = scaler.transform(X_train[list_numerical])\n", "X_test[list_numerical] = scaler.transform(X_test[list_numerical])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create dataframes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Some of our models work directly with pandas dataframes (especially when we use statsmodels), so we join the label and the features into one dataframe each for training and testing." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "df_train = y_train.join(X_train)\n", "df_test = y_test.join(X_test)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Salary AtBat Hits HmRun Runs RBI Walks \\\n", "260 875.0 0.644577 0.257439 -0.456963 0.101010 -0.763917 -0.975959 \n", "92 70.0 -0.592807 -0.671359 -0.572936 -0.778318 -0.685806 -0.458312 \n", "137 430.0 -0.413075 -0.105019 -0.688910 -0.258715 -0.646751 -0.081841 \n", "90 431.5 -0.613545 -0.558091 0.122907 -0.618440 -0.256196 -1.211253 \n", "100 2460.0 0.637665 0.982354 0.586803 0.260888 1.227914 1.706394 \n", ".. ... ... ... ... ... ... ... \n", "274 200.0 0.824309 0.733164 0.470829 0.740521 0.954525 0.859335 \n", "196 587.5 0.423369 0.461321 1.862516 0.500704 1.618469 0.482865 \n", "159 200.0 1.474109 1.254197 1.746542 1.140215 2.126191 -0.458312 \n", "17 175.0 -1.470728 -1.396275 -1.152806 -1.217982 -1.740306 -1.258312 \n", "162 75.0 -1.643547 -1.554850 -1.152806 -1.657646 -1.701250 -1.211253 \n", "\n", " Years CAtBat CHits CHmRun CRuns CRBI CWalks \\\n", "260 -0.070553 0.298535 0.239063 -0.407836 0.011298 -0.163736 -0.361084 \n", "92 -1.306911 -1.001403 -0.969702 -0.746705 -0.957639 -0.898919 -0.844319 \n", "137 1.577925 0.717456 0.706633 -0.010542 0.645511 0.259369 0.220252 \n", "90 -0.482672 -0.514087 -0.478077 -0.501317 -0.602362 -0.526826 -0.684451 \n", "100 0.547626 1.267183 1.437305 2.384908 1.535171 2.041811 1.615457 \n", ".. ... ... ... ... ... ... ... \n", "274 -0.688732 -0.824858 -0.808834 -0.571428 -0.787341 -0.685866 -0.648118 \n", "196 1.165805 1.354814 1.246368 1.625375 1.112362 1.516681 0.681687 \n", "159 -0.894792 -0.522636 -0.520174 -0.068968 -0.528958 -0.322776 -0.662651 \n", "17 -0.482672 -0.932153 -0.933620 -0.770075 -0.869554 -0.934928 -0.818885 \n", "162 -0.894792 -1.053127 -1.020819 -0.805130 -1.007554 -0.973938 -0.895185 \n", "\n", " PutOuts Assists Errors League_N Division_W NewLeague_N \n", "260 -0.482387 1.746229 3.022233 1 1 1 \n", "92 -0.851547 0.022276 2.574735 0 0 0 \n", "137 -0.294452 -0.434676 0.635577 0 0 0 \n", "90 0.786178 -0.545452 -0.706917 0 1 0 \n", "100 2.504446 -0.213124 0.635577 0 0 0 \n", ".. 
... ... ... ... ... ... \n", "274 3.427344 0.326910 1.232241 1 0 1 \n", "196 -1.002566 -0.822392 -1.303581 0 1 0 \n", "159 -0.633407 1.310048 0.933909 0 1 0 \n", "17 -0.660255 0.403069 1.083075 0 1 0 \n", "162 0.111623 -0.690846 -1.005249 0 1 1 \n", "\n", "[184 rows x 20 columns]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_train" ] } ], "metadata": { "interpreter": { "hash": "463226f144cc21b006ce6927bfc93dd00694e52c8bc6857abb6e555b983749e9" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": true, "toc_window_display": true }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }