{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Simple regression model\n", "\n", "Does money make people happier? Simple version with a train/test split and cross-validation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import data" ] }, { "cell_type": "code", "execution_count": 172, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Load the data from GitHub\n", "LINK = \"https://raw.githubusercontent.com/kirenz/datasets/master/oecd_gdp.csv\"\n", "df = pd.read_csv(LINK)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data structure" ] }, { "cell_type": "code", "execution_count": 173, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
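<table><thead><tr><th></th><th>Country</th><th>GDP per capita</th><th>Life satisfaction</th></tr></thead><tbody><tr><th>0</th><td>Russia</td><td>9054.914</td><td>6.0</td></tr><tr><th>1</th><td>Turkey</td><td>9437.372</td><td>5.6</td></tr><tr><th>2</th><td>Hungary</td><td>12239.894</td><td>4.9</td></tr><tr><th>3</th><td>Poland</td><td>12495.334</td><td>5.8</td></tr><tr><th>4</th><td>Slovak Republic</td><td>15991.736</td><td>6.1</td></tr></tbody></table>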
\n", "
" ], "text/plain": [ " Country GDP per capita Life satisfaction\n", "0 Russia 9054.914 6.0\n", "1 Turkey 9437.372 5.6\n", "2 Hungary 12239.894 4.9\n", "3 Poland 12495.334 5.8\n", "4 Slovak Republic 15991.736 6.1" ] }, "execution_count": 173, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 174, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 29 entries, 0 to 28\n", "Data columns (total 3 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Country 29 non-null object \n", " 1 GDP per capita 29 non-null float64\n", " 2 Life satisfaction 29 non-null float64\n", "dtypes: float64(2), object(1)\n", "memory usage: 824.0+ bytes\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data corrections" ] }, { "cell_type": "code", "execution_count": 175, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
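<table><thead><tr><th></th><th>country</th><th>gdp_per_capita</th><th>life_satisfaction</th></tr></thead><tbody><tr><th>0</th><td>Russia</td><td>9054.914</td><td>6.0</td></tr><tr><th>1</th><td>Turkey</td><td>9437.372</td><td>5.6</td></tr><tr><th>2</th><td>Hungary</td><td>12239.894</td><td>4.9</td></tr><tr><th>3</th><td>Poland</td><td>12495.334</td><td>5.8</td></tr><tr><th>4</th><td>Slovak Republic</td><td>15991.736</td><td>6.1</td></tr></tbody></table>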
\n", "
" ], "text/plain": [ " country gdp_per_capita life_satisfaction\n", "0 Russia 9054.914 6.0\n", "1 Turkey 9437.372 5.6\n", "2 Hungary 12239.894 4.9\n", "3 Poland 12495.334 5.8\n", "4 Slovak Republic 15991.736 6.1" ] }, "execution_count": 175, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Change column names (lower case and spaces to underscore)\n", "df.columns = df.columns.str.lower().str.replace(' ', '_')\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Variable lists\n", "\n", "Prepare the data for later use" ] }, { "cell_type": "code", "execution_count": 176, "metadata": {}, "outputs": [], "source": [ "# define outcome variable as y_label\n", "y_label = 'life_satisfaction'\n", "\n", "# select features\n", "X = df[[\"gdp_per_capita\"]]\n", "\n", "# create response\n", "y = df[y_label]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data splitting" ] }, { "cell_type": "code", "execution_count": 177, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, \n", " test_size=0.2, \n", " shuffle=True,\n", " random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Investigate the data:" ] }, { "cell_type": "code", "execution_count": 178, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((23, 1), (23,))" ] }, "execution_count": 178, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.shape, y_train.shape" ] }, { "cell_type": "code", "execution_count": 179, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
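<table><thead><tr><th></th><th>gdp_per_capita</th></tr></thead><tbody><tr><th>21</th><td>43724.031</td></tr><tr><th>0</th><td>9054.914</td></tr></tbody></table>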
\n", "
" ], "text/plain": [ " gdp_per_capita\n", "21 43724.031\n", "0 9054.914" ] }, "execution_count": 179, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.head(2)" ] }, { "cell_type": "code", "execution_count": 180, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((6, 1), (6,))" ] }, "execution_count": 180, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_test.shape, y_test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We make a copy of the training data since we don’t want to alter our data during data exploration. We will use this data for our exploratory data analysis." ] }, { "cell_type": "code", "execution_count": 181, "metadata": {}, "outputs": [], "source": [ "df_train = pd.DataFrame(X_train.copy())\n", "df_train = df_train.join(pd.DataFrame(y_train))" ] }, { "cell_type": "code", "execution_count": 182, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
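<table><thead><tr><th></th><th>gdp_per_capita</th><th>life_satisfaction</th></tr></thead><tbody><tr><th>21</th><td>43724.031</td><td>6.9</td></tr><tr><th>0</th><td>9054.914</td><td>6.0</td></tr></tbody></table>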
\n", "
" ], "text/plain": [ " gdp_per_capita life_satisfaction\n", "21 43724.031 6.9\n", "0 9054.914 6.0" ] }, "execution_count": 182, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_train.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data exploration" ] }, { "cell_type": "code", "execution_count": 183, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 183, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%matplotlib inline\n", "import altair as alt\n", "\n", "# Visualize the data\n", "alt.Chart(df_train).mark_circle(size=100).encode(\n", " x='gdp_per_capita:Q',\n", " y='life_satisfaction:Q',\n", ").interactive()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Models" ] }, { "cell_type": "code", "execution_count": 184, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "reg = LinearRegression()" ] }, { "cell_type": "code", "execution_count": 185, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsRegressor\n", "\n", "reg2 = KNeighborsRegressor(n_neighbors=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training & validation" ] }, { "cell_type": "code", "execution_count": 186, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score" ] }, { "cell_type": "code", "execution_count": 187, "metadata": {}, "outputs": [], "source": [ "# cross-validation with 5 folds\n", "scores = cross_val_score(reg, X_train, y_train, cv=5, scoring='neg_mean_squared_error') *-1\n", "\n", "scores2 = cross_val_score(reg2, X_train, y_train, cv=5, scoring='neg_mean_squared_error') *-1" ] }, { "cell_type": "code", "execution_count": 188, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
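<table><thead><tr><th></th><th>lr</th><th>knn</th></tr></thead><tbody><tr><th>1</th><td>0.335187</td><td>0.311778</td></tr><tr><th>2</th><td>0.101782</td><td>0.338444</td></tr><tr><th>3</th><td>0.097167</td><td>0.265333</td></tr><tr><th>4</th><td>0.252955</td><td>0.082500</td></tr><tr><th>5</th><td>0.443893</td><td>0.223889</td></tr></tbody></table>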
\n" ], "text/plain": [ "" ] }, "execution_count": 188, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# store cross-validation scores\n", "df_scores = pd.DataFrame({\"lr\": scores, \n", " \"knn\": scores2})\n", "\n", "# reset index to match the number of folds\n", "df_scores.index += 1\n", "\n", "# print dataframe\n", "df_scores.style.background_gradient(cmap='Blues')" ] }, { "cell_type": "code", "execution_count": 189, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.RepeatChart(...)" ] }, "execution_count": 189, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df_scores.reset_index()).mark_line(\n", " point=alt.OverlayMarkDef()\n", ").encode(\n", " x=alt.X(\"index\", bin=False, title=\"Fold\", axis=alt.Axis(tickCount=5)),\n", " y=alt.Y(\n", " alt.repeat(\"layer\"), aggregate=\"mean\", title=\"Mean squared error (MSE)\"\n", " ),\n", " color=alt.datum(alt.repeat(\"layer\")),\n", ").repeat(layer=[\"lr\", \"knn\"])" ] }, { "cell_type": "code", "execution_count": 190, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
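<table><thead><tr><th></th><th>count</th><th>mean</th><th>std</th><th>min</th><th>25%</th><th>50%</th><th>75%</th><th>max</th></tr></thead><tbody><tr><th>lr</th><td>5.0</td><td>0.246197</td><td>0.150095</td><td>0.097167</td><td>0.101782</td><td>0.252955</td><td>0.335187</td><td>0.443893</td></tr><tr><th>knn</th><td>5.0</td><td>0.244389</td><td>0.100567</td><td>0.082500</td><td>0.223889</td><td>0.265333</td><td>0.311778</td><td>0.338444</td></tr></tbody></table>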
\n", "
" ], "text/plain": [ " count mean std min 25% 50% 75% \\\n", "lr 5.0 0.246197 0.150095 0.097167 0.101782 0.252955 0.335187 \n", "knn 5.0 0.244389 0.100567 0.082500 0.223889 0.265333 0.311778 \n", "\n", " max \n", "lr 0.443893 \n", "knn 0.338444 " ] }, "execution_count": 190, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_scores.describe().T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two models perform similarly in cross-validation: the mean MSE across the five folds is 0.246 for the linear regression and 0.244 for KNN. Let's assume we are interested in the model parameters and therefore choose the linear regression." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will cover model tuning (hyperparameter tuning) in another notebook and skip this part." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Final training" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Train your best model on the complete training data (without cross-validation)." ] }, { "cell_type": "code", "execution_count": 191, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
LinearRegression()
" ], "text/plain": [ "LinearRegression()" ] }, "execution_count": 191, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fit the model\n", "reg.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 192, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Intercept: 4.87 \n", " Slope: 0.00005\n" ] } ], "source": [ "print(f' Intercept: {reg.intercept_:.3} \\n Slope: {reg.coef_[0]:.5f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Test error" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Evaluate the final model on the test set." ] }, { "cell_type": "code", "execution_count": 193, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.09021411430745645" ] }, "execution_count": 193, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics import mean_squared_error\n", "\n", "# Prediction for our test data\n", "y_pred = reg.predict(X_test)\n", "\n", "# Mean squared error on the test set\n", "mean_squared_error(y_test, y_pred)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.12 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "nav_menu": {}, "toc": { "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 6, "toc_cell": false, "toc_section_display": "block", "toc_window_display": true }, "toc_position": { "height": "616px", "left": "0px", "right": "20px", "top": "106px", "width": "213px" }, "vscode": { "interpreter": { "hash": "463226f144cc21b006ce6927bfc93dd00694e52c8bc6857abb6e555b983749e9" } } }, "nbformat": 4, "nbformat_minor": 1 }