{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Simple regression model\n",
"\n",
"Does money make people happier? Simple version with data splitting."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import data"
]
},
{
"cell_type": "code",
"execution_count": 172,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Load the data from GitHub\n",
"LINK = \"https://raw.githubusercontent.com/kirenz/datasets/master/oecd_gdp.csv\"\n",
"df = pd.read_csv(LINK)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data structure"
]
},
{
"cell_type": "code",
"execution_count": 173,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>Country</th><th>GDP per capita</th><th>Life satisfaction</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Russia</td><td>9054.914</td><td>6.0</td></tr>\n",
"    <tr><th>1</th><td>Turkey</td><td>9437.372</td><td>5.6</td></tr>\n",
"    <tr><th>2</th><td>Hungary</td><td>12239.894</td><td>4.9</td></tr>\n",
"    <tr><th>3</th><td>Poland</td><td>12495.334</td><td>5.8</td></tr>\n",
"    <tr><th>4</th><td>Slovak Republic</td><td>15991.736</td><td>6.1</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Country GDP per capita Life satisfaction\n",
"0 Russia 9054.914 6.0\n",
"1 Turkey 9437.372 5.6\n",
"2 Hungary 12239.894 4.9\n",
"3 Poland 12495.334 5.8\n",
"4 Slovak Republic 15991.736 6.1"
]
},
"execution_count": 173,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 174,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 29 entries, 0 to 28\n",
"Data columns (total 3 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Country 29 non-null object \n",
" 1 GDP per capita 29 non-null float64\n",
" 2 Life satisfaction 29 non-null float64\n",
"dtypes: float64(2), object(1)\n",
"memory usage: 824.0+ bytes\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data corrections"
]
},
{
"cell_type": "code",
"execution_count": 175,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>country</th><th>gdp_per_capita</th><th>life_satisfaction</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Russia</td><td>9054.914</td><td>6.0</td></tr>\n",
"    <tr><th>1</th><td>Turkey</td><td>9437.372</td><td>5.6</td></tr>\n",
"    <tr><th>2</th><td>Hungary</td><td>12239.894</td><td>4.9</td></tr>\n",
"    <tr><th>3</th><td>Poland</td><td>12495.334</td><td>5.8</td></tr>\n",
"    <tr><th>4</th><td>Slovak Republic</td><td>15991.736</td><td>6.1</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" country gdp_per_capita life_satisfaction\n",
"0 Russia 9054.914 6.0\n",
"1 Turkey 9437.372 5.6\n",
"2 Hungary 12239.894 4.9\n",
"3 Poland 12495.334 5.8\n",
"4 Slovak Republic 15991.736 6.1"
]
},
"execution_count": 175,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Change column names (lower case and spaces to underscore)\n",
"df.columns = df.columns.str.lower().str.replace(' ', '_')\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Variable lists\n",
"\n",
"Define the outcome variable and the feature matrix for later use."
]
},
{
"cell_type": "code",
"execution_count": 176,
"metadata": {},
"outputs": [],
"source": [
"# define outcome variable as y_label\n",
"y_label = 'life_satisfaction'\n",
"\n",
"# select features\n",
"X = df[[\"gdp_per_capita\"]]\n",
"\n",
"# create response\n",
"y = df[y_label]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data splitting"
]
},
{
"cell_type": "code",
"execution_count": 177,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, \n",
" test_size=0.2, \n",
" shuffle=True,\n",
" random_state=42)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Investigate the data:"
]
},
{
"cell_type": "code",
"execution_count": 178,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((23, 1), (23,))"
]
},
"execution_count": 178,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.shape, y_train.shape"
]
},
{
"cell_type": "code",
"execution_count": 179,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>gdp_per_capita</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>21</th><td>43724.031</td></tr>\n",
"    <tr><th>0</th><td>9054.914</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" gdp_per_capita\n",
"21 43724.031\n",
"0 9054.914"
]
},
"execution_count": 179,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.head(2)"
]
},
{
"cell_type": "code",
"execution_count": 180,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((6, 1), (6,))"
]
},
"execution_count": 180,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_test.shape, y_test.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We make a copy of the training data since we don’t want to alter our data during data exploration. We will use this data for our exploratory data analysis."
]
},
{
"cell_type": "code",
"execution_count": 181,
"metadata": {},
"outputs": [],
"source": [
"# Copy the training features and join the response for exploration\n",
"df_train = X_train.copy().join(y_train)"
]
},
{
"cell_type": "code",
"execution_count": 182,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>gdp_per_capita</th><th>life_satisfaction</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>21</th><td>43724.031</td><td>6.9</td></tr>\n",
"    <tr><th>0</th><td>9054.914</td><td>6.0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" gdp_per_capita life_satisfaction\n",
"21 43724.031 6.9\n",
"0 9054.914 6.0"
]
},
"execution_count": 182,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data exploration"
]
},
{
"cell_type": "code",
"execution_count": 183,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"alt.Chart(...)"
]
},
"execution_count": 183,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%matplotlib inline\n",
"import altair as alt\n",
"\n",
"# Visualize the data\n",
"alt.Chart(df_train).mark_circle(size=100).encode(\n",
" x='gdp_per_capita:Q',\n",
" y='life_satisfaction:Q',\n",
").interactive()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Models"
]
},
{
"cell_type": "code",
"execution_count": 184,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"\n",
"reg = LinearRegression()"
]
},
{
"cell_type": "code",
"execution_count": 185,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.neighbors import KNeighborsRegressor\n",
"\n",
"reg2 = KNeighborsRegressor(n_neighbors=3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training & validation"
]
},
{
"cell_type": "code",
"execution_count": 186,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import cross_val_score"
]
},
{
"cell_type": "code",
"execution_count": 187,
"metadata": {},
"outputs": [],
"source": [
"# 5-fold cross-validation; negate the scores to obtain the MSE per fold\n",
"scores = -cross_val_score(reg, X_train, y_train, cv=5, scoring='neg_mean_squared_error')\n",
"\n",
"scores2 = -cross_val_score(reg2, X_train, y_train, cv=5, scoring='neg_mean_squared_error')"
]
},
{
"cell_type": "code",
"execution_count": 188,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>lr</th><th>knn</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>1</th><td>0.335187</td><td>0.311778</td></tr>\n",
"    <tr><th>2</th><td>0.101782</td><td>0.338444</td></tr>\n",
"    <tr><th>3</th><td>0.097167</td><td>0.265333</td></tr>\n",
"    <tr><th>4</th><td>0.252955</td><td>0.082500</td></tr>\n",
"    <tr><th>5</th><td>0.443893</td><td>0.223889</td></tr>\n",
"  </tbody>\n",
"</table>\n"
],
"text/plain": [
""
]
},
"execution_count": 188,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# store cross-validation scores\n",
"df_scores = pd.DataFrame({\"lr\": scores, \n",
" \"knn\": scores2})\n",
"\n",
"# shift the index so that the folds are numbered 1 to 5\n",
"df_scores.index += 1\n",
"\n",
"# display the scores with a color gradient\n",
"df_scores.style.background_gradient(cmap='Blues')"
]
},
{
"cell_type": "code",
"execution_count": 189,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"alt.RepeatChart(...)"
]
},
"execution_count": 189,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"alt.Chart(df_scores.reset_index()).mark_line(\n",
" point=alt.OverlayMarkDef()\n",
").encode(\n",
" x=alt.X(\"index\", bin=False, title=\"Fold\", axis=alt.Axis(tickCount=5)),\n",
" y=alt.Y(\n",
" alt.repeat(\"layer\"), aggregate=\"mean\", title=\"Mean squared error (MSE)\"\n",
" ),\n",
" color=alt.datum(alt.repeat(\"layer\")),\n",
").repeat(layer=[\"lr\", \"knn\"])"
]
},
{
"cell_type": "code",
"execution_count": 190,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>count</th><th>mean</th><th>std</th><th>min</th><th>25%</th><th>50%</th><th>75%</th><th>max</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>lr</th><td>5.0</td><td>0.246197</td><td>0.150095</td><td>0.097167</td><td>0.101782</td><td>0.252955</td><td>0.335187</td><td>0.443893</td></tr>\n",
"    <tr><th>knn</th><td>5.0</td><td>0.244389</td><td>0.100567</td><td>0.082500</td><td>0.223889</td><td>0.265333</td><td>0.311778</td><td>0.338444</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" count mean std min 25% 50% 75% \\\n",
"lr 5.0 0.246197 0.150095 0.097167 0.101782 0.252955 0.335187 \n",
"knn 5.0 0.244389 0.100567 0.082500 0.223889 0.265333 0.311778 \n",
"\n",
" max \n",
"lr 0.443893 \n",
"knn 0.338444 "
]
},
"execution_count": 190,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_scores.describe().T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two models perform almost identically: the mean cross-validated MSE is about 0.246 for the linear regression and 0.244 for KNN. Because we are interested in interpretable model parameters, we choose the linear regression."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tuning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We cover model tuning (hyperparameter tuning) in a separate notebook and skip it here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Final training"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Train the selected model on the complete training data (without cross-validation)."
]
},
{
"cell_type": "code",
"execution_count": 191,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LinearRegression()"
]
},
"execution_count": 191,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Fit the model\n",
"reg.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 192,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Intercept: 4.87 \n",
" Slope: 0.00005\n"
]
}
],
"source": [
"print(f' Intercept: {reg.intercept_:.3} \\n Slope: {reg.coef_[0]:.5f}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test error"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Evaluate the final model on the test set. "
]
},
{
"cell_type": "code",
"execution_count": 193,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.09021411430745645"
]
},
"execution_count": 193,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.metrics import mean_squared_error\n",
"\n",
"# Prediction for our test data\n",
"y_pred = reg.predict(X_test)\n",
"\n",
"# Mean squared error on the test set\n",
"mean_squared_error(y_test, y_pred)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.9.12 ('base')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
},
"nav_menu": {},
"toc": {
"navigate_menu": true,
"number_sections": true,
"sideBar": true,
"threshold": 6,
"toc_cell": false,
"toc_section_display": "block",
"toc_window_display": true
},
"toc_position": {
"height": "616px",
"left": "0px",
"right": "20px",
"top": "106px",
"width": "213px"
},
"vscode": {
"interpreter": {
"hash": "463226f144cc21b006ce6927bfc93dd00694e52c8bc6857abb6e555b983749e9"
}
}
},
"nbformat": 4,
"nbformat_minor": 1
}