Simple regression model#

Does money make people happier? A simple regression example with data splitting and cross-validation.

Data#

Import data#

import pandas as pd

# Load the data from GitHub
LINK = "https://raw.githubusercontent.com/kirenz/datasets/master/oecd_gdp.csv"
df = pd.read_csv(LINK)

Data structure#

df.head()
           Country  GDP per capita  Life satisfaction
0           Russia        9054.914                6.0
1           Turkey        9437.372                5.6
2          Hungary       12239.894                4.9
3           Poland       12495.334                5.8
4  Slovak Republic       15991.736                6.1
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Country            29 non-null     object 
 1   GDP per capita     29 non-null     float64
 2   Life satisfaction  29 non-null     float64
dtypes: float64(2), object(1)
memory usage: 824.0+ bytes

Data corrections#

# Change column names (lower case and spaces to underscore)
df.columns = df.columns.str.lower().str.replace(' ', '_')

df.head()
           country  gdp_per_capita  life_satisfaction
0           Russia        9054.914                6.0
1           Turkey        9437.372                5.6
2          Hungary       12239.894                4.9
3           Poland       12495.334                5.8
4  Slovak Republic       15991.736                6.1

Variable lists#

Define the outcome variable and the feature matrix for later use:

# define outcome variable as y_label
y_label = 'life_satisfaction'

# select features
X = df[["gdp_per_capita"]]

# create response
y = df[y_label]

Data splitting#

We hold out 20% of the observations as a test set and use the remaining 80% for training:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    shuffle=True,
                                                    random_state=42)

Investigate the data:

X_train.shape, y_train.shape
((23, 1), (23,))
X_train.head(2)
    gdp_per_capita
21       43724.031
0         9054.914
X_test.shape, y_test.shape
((6, 1), (6,))

We make a copy of the training data so that we don’t alter the original data during exploratory data analysis.

df_train = X_train.copy()
df_train = df_train.join(y_train)
df_train.head(2)
    gdp_per_capita  life_satisfaction
21       43724.031                6.9
0         9054.914                6.0

Data exploration#

import altair as alt

# Visualize the data
alt.Chart(df_train).mark_circle(size=100).encode(
    x='gdp_per_capita:Q',
    y='life_satisfaction:Q',
).interactive()
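
The scatter plot suggests a positive association between GDP per capita and life satisfaction. To put a number on it, we can also compute the correlation coefficient on the training data (a quick descriptive check, not part of the original analysis):

# Pearson correlation between the feature and the outcome (training data only)
df_train['gdp_per_capita'].corr(df_train['life_satisfaction'])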

Models#

from sklearn.linear_model import LinearRegression

reg = LinearRegression()
from sklearn.neighbors import KNeighborsRegressor

reg2 = KNeighborsRegressor(n_neighbors=3)
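
Both models solve the same prediction task in different ways: linear regression fits a line of the form $\hat{y} = \theta_0 + \theta_1 x$ (with $x$ = GDP per capita) by minimizing the squared errors, while the k-nearest neighbors regressor predicts the average life satisfaction of the 3 countries with the most similar GDP per capita.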

Training & validation#

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation; scikit-learn returns the *negative* MSE
# (its convention is "higher is better"), so we flip the sign
scores = -cross_val_score(reg, X_train, y_train, cv=5,
                          scoring='neg_mean_squared_error')

scores2 = -cross_val_score(reg2, X_train, y_train, cv=5,
                           scoring='neg_mean_squared_error')

# store cross-validation scores
df_scores = pd.DataFrame({"lr": scores,
                          "knn": scores2})

# reset index to match the number of folds
df_scores.index += 1

# print dataframe
df_scores.style.background_gradient(cmap='Blues')
         lr       knn
1  0.335187  0.311778
2  0.101782  0.338444
3  0.097167  0.265333
4  0.252955  0.082500
5  0.443893  0.223889
alt.Chart(df_scores.reset_index()).mark_line(
     point=alt.OverlayMarkDef()
).encode(
    x=alt.X("index", bin=False, title="Fold", axis=alt.Axis(tickCount=5)),
    y=alt.Y(
        alt.repeat("layer"), aggregate="mean", title="Mean squared error (MSE)"
    ),
    color=alt.datum(alt.repeat("layer")),
).repeat(layer=["lr", "knn"])
df_scores.describe().T
     count      mean       std       min       25%       50%       75%       max
lr     5.0  0.246197  0.150095  0.097167  0.101782  0.252955  0.335187  0.443893
knn    5.0  0.244389  0.100567  0.082500  0.223889  0.265333  0.311778  0.338444

The performance difference between the two models is relatively small. Since we are also interested in interpretable model parameters (intercept and slope), we choose the linear regression.

Tuning#

We will cover model tuning (hyperparameter tuning) in another notebook and skip this part.
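
As a preview, here is a minimal sketch of what tuning the number of neighbors could look like with scikit-learn’s GridSearchCV (the parameter grid below is an illustrative assumption):

from sklearn.model_selection import GridSearchCV

# illustrative grid for the number of neighbors (values chosen for demonstration)
param_grid = {'n_neighbors': [1, 3, 5, 7, 9]}

grid = GridSearchCV(KNeighborsRegressor(), param_grid,
                    cv=5, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)

grid.best_params_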

Final training#

Train your best model with the complete training data (without cross-validation).

# Fit the model
reg.fit(X_train, y_train)
LinearRegression()
print(f' Intercept: {reg.intercept_:.3} \n Slope: {reg.coef_[0]:.5f}')
 Intercept: 4.87 
 Slope: 0.00005
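
According to the fitted line, each additional $10,000 of GDP per capita is associated with an increase of roughly 0.5 points in life satisfaction. We can use the model to predict life satisfaction for a hypothetical country (the GDP value below is made up for illustration):

# predict life satisfaction for a hypothetical GDP per capita of 30,000
reg.predict(pd.DataFrame({'gdp_per_capita': [30_000]}))

Using the rounded coefficients above, this comes out to roughly 4.87 + 0.00005 · 30,000 ≈ 6.4.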

Test error#

Evaluate the final model on the test set.

from sklearn.metrics import mean_squared_error

# Prediction for our test data
y_pred = reg.predict(X_test)

# Mean squared error
mean_squared_error(y_test, y_pred)
0.09021411430745645
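
Since the MSE is in squared units, the root mean squared error (RMSE) is often easier to interpret because it lives on the same scale as the outcome:

import numpy as np

# RMSE: square root of the MSE, on the original scale of the outcome
np.sqrt(mean_squared_error(y_test, y_pred))

With the test MSE above, this is about 0.3 points on the life-satisfaction scale.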