{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Hitters data preparation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We illustrate the regression methods that follow on the \"Hitters\" data set, which contains 20 variables and 322 observations of major league baseball players. The goal is to predict a player’s salary from features that describe performance in the previous year. Exploratory data analysis is not covered in this notebook.\n", "\n", "- See [this documentation](https://cran.r-project.org/web/packages/ISLR/ISLR.pdf) to learn more about the data.\n", "\n", "Note that scikit-learn provides [**pipelines**](https://kirenz.github.io/ds-python/docs/data.html#pipelines-in-scikit-learn) for data preprocessing and feature engineering, which are considered best practice for data preparation. However, since some of our examples use statsmodels as well as scikit-learn, we won't create a data preprocessing pipeline in this example.\n", "\n", "## Import" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv(\"https://raw.githubusercontent.com/kirenz/datasets/master/Hitters.csv\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | League | Division | PutOuts | Assists | Errors | Salary | NewLeague |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 293 | 66 | 1 | 30 | 29 | 14 | 1 | 293 | 66 | 1 | 30 | 29 | 14 | A | E | 446 | 33 | 20 | NaN | A |
| 1 | 315 | 81 | 7 | 24 | 38 | 39 | 14 | 3449 | 835 | 69 | 321 | 414 | 375 | N | W | 632 | 43 | 10 | 475.0 | N |
| 2 | 479 | 130 | 18 | 66 | 72 | 76 | 3 | 1624 | 457 | 63 | 224 | 266 | 263 | A | W | 880 | 82 | 14 | 480.0 | A |
| 3 | 496 | 141 | 20 | 65 | 78 | 37 | 11 | 5628 | 1575 | 225 | 828 | 838 | 354 | N | E | 200 | 11 | 3 | 500.0 | N |
| 4 | 321 | 87 | 10 | 39 | 42 | 30 | 2 | 396 | 101 | 12 | 48 | 46 | 33 | N | E | 805 | 40 | 4 | 91.5 | N |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 317 | 497 | 127 | 7 | 65 | 48 | 37 | 5 | 2703 | 806 | 32 | 379 | 311 | 138 | N | E | 325 | 9 | 3 | 700.0 | N |
| 318 | 492 | 136 | 5 | 76 | 50 | 94 | 12 | 5511 | 1511 | 39 | 897 | 451 | 875 | A | E | 313 | 381 | 20 | 875.0 | A |
| 319 | 475 | 126 | 3 | 61 | 43 | 52 | 6 | 1700 | 433 | 7 | 217 | 93 | 146 | A | W | 37 | 113 | 7 | 385.0 | A |
| 320 | 573 | 144 | 9 | 85 | 60 | 78 | 8 | 3198 | 857 | 97 | 470 | 420 | 332 | A | E | 1314 | 131 | 12 | 960.0 | A |
| 321 | 631 | 170 | 9 | 77 | 44 | 31 | 11 | 4908 | 1457 | 30 | 775 | 357 | 249 | A | W | 408 | 4 | 3 | 1000.0 | A |

322 rows × 20 columns
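
The preview above shows a missing `Salary` value in row 0 and three text-valued columns (`League`, `Division`, `NewLeague`), while the tables that follow contain `League_N`, `Division_W`, and `NewLeague_N` indicator columns and 184 training rows. A minimal sketch of preparation steps that could produce that shape is given here on a tiny stand-in frame. The `test_size` and `random_state` values are assumptions, not taken from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in rows copied from the preview above (subset of columns only)
df = pd.DataFrame(
    {
        "AtBat": [293, 315, 479, 496, 321, 497],
        "Hits": [66, 81, 130, 141, 87, 127],
        "League": ["A", "N", "A", "N", "N", "N"],
        "Division": ["E", "W", "W", "E", "E", "E"],
        "NewLeague": ["A", "N", "A", "N", "N", "N"],
        "Salary": [None, 475.0, 480.0, 500.0, 91.5, 700.0],
    }
)

# Drop players whose salary (the response) is missing
df = df.dropna(subset=["Salary"])

# Encode the two-level categorical columns as 0/1 indicators,
# dropping the first level of each to avoid redundant columns
cat_cols = ["League", "Division", "NewLeague"]
dummies = pd.get_dummies(df[cat_cols], drop_first=True, dtype=int)

# Response (kept as a one-column DataFrame) and feature matrix
y = df[["Salary"]]
X = pd.concat([df.drop(columns=cat_cols + ["Salary"]), dummies], axis=1)

# Hold out part of the data; test_size and random_state are assumed values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10
)

print(X_train.columns.tolist())
```

Keeping `y` as a one-column DataFrame rather than a Series matches the "184 rows × 1 columns" shape reported for the `Salary` table.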

|  | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | PutOuts | Assists | Errors | League_N | Division_W | NewLeague_N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 260 | 496.0 | 119.0 | 8.0 | 57.0 | 33.0 | 21.0 | 7.0 | 3358.0 | 882.0 | 36.0 | 365.0 | 280.0 | 165.0 | 155.0 | 371.0 | 29.0 | 1 | 1 | 1 |
| 92 | 317.0 | 78.0 | 7.0 | 35.0 | 35.0 | 32.0 | 1.0 | 317.0 | 78.0 | 7.0 | 35.0 | 35.0 | 32.0 | 45.0 | 122.0 | 26.0 | 0 | 0 | 0 |
| 137 | 343.0 | 103.0 | 6.0 | 48.0 | 36.0 | 40.0 | 15.0 | 4338.0 | 1193.0 | 70.0 | 581.0 | 421.0 | 325.0 | 211.0 | 56.0 | 13.0 | 0 | 0 | 0 |
| 90 | 314.0 | 83.0 | 13.0 | 39.0 | 46.0 | 16.0 | 5.0 | 1457.0 | 405.0 | 28.0 | 156.0 | 159.0 | 76.0 | 533.0 | 40.0 | 4.0 | 0 | 1 | 0 |
| 100 | 495.0 | 151.0 | 17.0 | 61.0 | 84.0 | 78.0 | 10.0 | 5624.0 | 1679.0 | 275.0 | 884.0 | 1015.0 | 709.0 | 1045.0 | 88.0 | 13.0 | 0 | 0 | 0 |

|  | Salary |
|---|---|
| 260 | 875.0 |
| 92 | 70.0 |
| 137 | 430.0 |
| 90 | 431.5 |
| 100 | 2460.0 |
| ... | ... |
| 274 | 200.0 |
| 196 | 587.5 |
| 159 | 200.0 |
| 17 | 175.0 |
| 162 | 75.0 |

184 rows × 1 columns
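
The next table shows the numeric features on a standardized scale (values centered near zero), the 0/1 indicator columns left unchanged, and `Salary` placed first. One way to get there is sketched here, assuming scikit-learn's `StandardScaler` is fitted on the training rows only. The training values are taken from the tables above; the `X_test` rows are made-up stand-ins, and with only four rows the scaled numbers will not match the table.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# A few training rows and columns from the tables above
X_train = pd.DataFrame(
    {
        "AtBat": [496.0, 317.0, 343.0, 314.0],
        "Hits": [119.0, 78.0, 103.0, 83.0],
        "League_N": [1, 0, 0, 0],
    },
    index=[260, 92, 137, 90],
)
y_train = pd.DataFrame(
    {"Salary": [875.0, 70.0, 430.0, 431.5]}, index=[260, 92, 137, 90]
)

# Hypothetical test rows, for illustration only
X_test = pd.DataFrame(
    {"AtBat": [400.0, 350.0], "Hits": [100.0, 90.0], "League_N": [0, 1]},
    index=[1, 2],
)

# Scale only the numeric columns; indicator columns stay as 0/1
num_cols = ["AtBat", "Hits"]

# Learn mean and standard deviation from the training data only,
# then apply the same transformation to both sets
scaler = StandardScaler().fit(X_train[num_cols])
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

# Put the response first, followed by the prepared features
df_train = pd.concat([y_train, X_train], axis=1)
print(df_train.round(3))
```

Fitting the scaler on the training rows alone keeps information from the test set out of the preprocessing step.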

|  | Salary | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | PutOuts | Assists | Errors | League_N | Division_W | NewLeague_N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 260 | 875.0 | 0.644577 | 0.257439 | -0.456963 | 0.101010 | -0.763917 | -0.975959 | -0.070553 | 0.298535 | 0.239063 | -0.407836 | 0.011298 | -0.163736 | -0.361084 | -0.482387 | 1.746229 | 3.022233 | 1 | 1 | 1 |
| 92 | 70.0 | -0.592807 | -0.671359 | -0.572936 | -0.778318 | -0.685806 | -0.458312 | -1.306911 | -1.001403 | -0.969702 | -0.746705 | -0.957639 | -0.898919 | -0.844319 | -0.851547 | 0.022276 | 2.574735 | 0 | 0 | 0 |
| 137 | 430.0 | -0.413075 | -0.105019 | -0.688910 | -0.258715 | -0.646751 | -0.081841 | 1.577925 | 0.717456 | 0.706633 | -0.010542 | 0.645511 | 0.259369 | 0.220252 | -0.294452 | -0.434676 | 0.635577 | 0 | 0 | 0 |
| 90 | 431.5 | -0.613545 | -0.558091 | 0.122907 | -0.618440 | -0.256196 | -1.211253 | -0.482672 | -0.514087 | -0.478077 | -0.501317 | -0.602362 | -0.526826 | -0.684451 | 0.786178 | -0.545452 | -0.706917 | 0 | 1 | 0 |
| 100 | 2460.0 | 0.637665 | 0.982354 | 0.586803 | 0.260888 | 1.227914 | 1.706394 | 0.547626 | 1.267183 | 1.437305 | 2.384908 | 1.535171 | 2.041811 | 1.615457 | 2.504446 | -0.213124 | 0.635577 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 274 | 200.0 | 0.824309 | 0.733164 | 0.470829 | 0.740521 | 0.954525 | 0.859335 | -0.688732 | -0.824858 | -0.808834 | -0.571428 | -0.787341 | -0.685866 | -0.648118 | 3.427344 | 0.326910 | 1.232241 | 1 | 0 | 1 |
| 196 | 587.5 | 0.423369 | 0.461321 | 1.862516 | 0.500704 | 1.618469 | 0.482865 | 1.165805 | 1.354814 | 1.246368 | 1.625375 | 1.112362 | 1.516681 | 0.681687 | -1.002566 | -0.822392 | -1.303581 | 0 | 1 | 0 |
| 159 | 200.0 | 1.474109 | 1.254197 | 1.746542 | 1.140215 | 2.126191 | -0.458312 | -0.894792 | -0.522636 | -0.520174 | -0.068968 | -0.528958 | -0.322776 | -0.662651 | -0.633407 | 1.310048 | 0.933909 | 0 | 1 | 0 |
| 17 | 175.0 | -1.470728 | -1.396275 | -1.152806 | -1.217982 | -1.740306 | -1.258312 | -0.482672 | -0.932153 | -0.933620 | -0.770075 | -0.869554 | -0.934928 | -0.818885 | -0.660255 | 0.403069 | 1.083075 | 0 | 1 | 0 |
| 162 | 75.0 | -1.643547 | -1.554850 | -1.152806 | -1.657646 | -1.701250 | -1.211253 | -0.894792 | -1.053127 | -1.020819 | -0.805130 | -1.007554 | -0.973938 | -0.895185 | 0.111623 | -0.690846 | -1.005249 | 0 | 1 | 1 |

184 rows × 20 columns
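
The introduction notes that scikit-learn pipelines are considered best practice for data preparation, even though this notebook does not use one. For comparison, a rough sketch of how the same scaling and indicator encoding could be bundled into a pipeline follows. The column subset, the step names, and the `LinearRegression` final step are illustrative assumptions rather than part of the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Stand-in rows with complete salaries, taken from the raw preview
X = pd.DataFrame(
    {
        "AtBat": [315, 479, 496, 321, 497],
        "Hits": [81, 130, 141, 87, 127],
        "League": ["N", "A", "N", "N", "N"],
        "Division": ["W", "W", "E", "E", "E"],
    }
)
y = pd.Series([475.0, 480.0, 500.0, 91.5, 700.0], name="Salary")

# Standardize numeric columns and one-hot encode categorical ones,
# dropping the first level of each category
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["AtBat", "Hits"]),
        ("cat", OneHotEncoder(drop="first"), ["League", "Division"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Chain preprocessing and an (assumed) regression model into one estimator
model = Pipeline(steps=[("prep", preprocessor), ("reg", LinearRegression())])
model.fit(X, y)
predictions = model.predict(X)

# Inspect the prepared feature matrix on its own
features = preprocessor.fit_transform(X)
print(features.shape)
```

A pipeline like this works with scikit-learn estimators only, which is consistent with the notebook's choice to prepare the data manually so that statsmodels can use it too.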