Simple regression model#
Does money make people happier? A simple version with data splitting and cross-validation.
Data#
Import data#
import pandas as pd
# Load the data from GitHub
LINK = "https://raw.githubusercontent.com/kirenz/datasets/master/oecd_gdp.csv"
df = pd.read_csv(LINK)
Data structure#
df.head()
|  | Country | GDP per capita | Life satisfaction |
|---|---|---|---|
| 0 | Russia | 9054.914 | 6.0 |
| 1 | Turkey | 9437.372 | 5.6 |
| 2 | Hungary | 12239.894 | 4.9 |
| 3 | Poland | 12495.334 | 5.8 |
| 4 | Slovak Republic | 15991.736 | 6.1 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 29 non-null object
1 GDP per capita 29 non-null float64
2 Life satisfaction 29 non-null float64
dtypes: float64(2), object(1)
memory usage: 824.0+ bytes
Data corrections#
# Change column names (lower case and spaces to underscore)
df.columns = df.columns.str.lower().str.replace(' ', '_')
df.head()
country | gdp_per_capita | life_satisfaction | |
---|---|---|---|
0 | Russia | 9054.914 | 6.0 |
1 | Turkey | 9437.372 | 5.6 |
2 | Hungary | 12239.894 | 4.9 |
3 | Poland | 12495.334 | 5.8 |
4 | Slovak Republic | 15991.736 | 6.1 |
Variable lists#
Prepare the data for later use:
# define the outcome variable's column name
y_label = 'life_satisfaction'

# select features (double brackets keep a DataFrame, since scikit-learn expects a 2D X)
X = df[["gdp_per_capita"]]

# create the response vector (a Series)
y = df[y_label]
Data splitting#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    shuffle=True,
                                                    random_state=42)
Investigate the data:
X_train.shape, y_train.shape
((23, 1), (23,))
X_train.head(2)
|  | gdp_per_capita |
|---|---|
| 21 | 43724.031 |
| 0 | 9054.914 |
X_test.shape, y_test.shape
((6, 1), (6,))
We make a copy of the training data for our exploratory data analysis, since we don't want to alter the original data during exploration.
df_train = pd.DataFrame(X_train.copy())
df_train = df_train.join(pd.DataFrame(y_train))
df_train.head(2)
|  | gdp_per_capita | life_satisfaction |
|---|---|---|
| 21 | 43724.031 | 6.9 |
| 0 | 9054.914 | 6.0 |
Data exploration#
import altair as alt
# Visualize the data
alt.Chart(df_train).mark_circle(size=100).encode(
x='gdp_per_capita:Q',
y='life_satisfaction:Q',
).interactive()
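To judge how linear the relationship looks, we could overlay a fitted trend line on the scatter plot. A minimal sketch, assuming Altair's `transform_regression` (this chart is our addition, not part of the original analysis):

points = alt.Chart(df_train).mark_circle(size=100).encode(
    x='gdp_per_capita:Q',
    y='life_satisfaction:Q',
)

# Linear trend fitted by Altair's regression transform
trend = points.transform_regression(
    'gdp_per_capita', 'life_satisfaction'
).mark_line()

(points + trend).interactive()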
Models#
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
from sklearn.neighbors import KNeighborsRegressor
reg2 = KNeighborsRegressor(n_neighbors=3)
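For context, `KNeighborsRegressor(n_neighbors=3)` predicts by averaging the outcomes of the three training points closest in feature space, while the linear model fits a single global line. A toy sketch of the KNN idea, using made-up numbers rather than the GDP data:

import numpy as np

# Toy one-feature data; the values are illustrative only
x_toy = np.array([1.0, 2.0, 3.0, 10.0])
y_toy = np.array([5.0, 6.0, 7.0, 9.0])

def knn_predict(x_new, k=3):
    """Average the outcomes of the k training points nearest to x_new."""
    nearest = np.argsort(np.abs(x_toy - x_new))[:k]
    return y_toy[nearest].mean()

knn_predict(2.5)  # averages y at x = 1, 2 and 3 -> 6.0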
Training & validation#
from sklearn.model_selection import cross_val_score
# cross-validation with 5 folds; scikit-learn returns the negative MSE,
# so we flip the sign to obtain the MSE
scores = -cross_val_score(reg, X_train, y_train, cv=5,
                          scoring='neg_mean_squared_error')
scores2 = -cross_val_score(reg2, X_train, y_train, cv=5,
                           scoring='neg_mean_squared_error')
# store cross-validation scores
df_scores = pd.DataFrame({"lr": scores,
"knn": scores2})
# shift the index to start at 1 so it matches the fold numbers
df_scores.index += 1
# print dataframe
df_scores.style.background_gradient(cmap='Blues')
|  | lr | knn |
|---|---|---|
| 1 | 0.335187 | 0.311778 |
| 2 | 0.101782 | 0.338444 |
| 3 | 0.097167 | 0.265333 |
| 4 | 0.252955 | 0.082500 |
| 5 | 0.443893 | 0.223889 |
alt.Chart(df_scores.reset_index()).mark_line(
    point=alt.OverlayMarkDef()
).encode(
    x=alt.X("index", bin=False, title="Fold", axis=alt.Axis(tickCount=5)),
    y=alt.Y(
        alt.repeat("layer"), aggregate="mean", title="Mean squared error (MSE)"
    ),
    color=alt.datum(alt.repeat("layer")),
).repeat(layer=["lr", "knn"])
df_scores.describe().T
|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| lr | 5.0 | 0.246197 | 0.150095 | 0.097167 | 0.101782 | 0.252955 | 0.335187 | 0.443893 |
| knn | 5.0 | 0.244389 | 0.100567 | 0.082500 | 0.223889 | 0.265333 | 0.311778 | 0.338444 |
The performance difference between the two models is small (mean MSE of 0.246 for the linear regression vs. 0.244 for KNN). Let's assume we are interested in the parameters of the model and therefore choose the linear regression.
Tuning#
We will cover model tuning (hyperparameter tuning) in another notebook and skip this part.
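For reference, a minimal sketch of what tuning the KNN model could look like, assuming a grid search over `n_neighbors` with scikit-learn's `GridSearchCV` (the grid values are an illustrative choice, not from the original notebook):

from sklearn.model_selection import GridSearchCV

# Illustrative grid of neighborhood sizes
param_grid = {'n_neighbors': [1, 3, 5, 7]}

grid = GridSearchCV(KNeighborsRegressor(), param_grid,
                    cv=5, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)

grid.best_params_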
Final training#
Train your best model with the complete training data (without cross-validation).
# Fit the model
reg.fit(X_train, y_train)
LinearRegression()
print(f' Intercept: {reg.intercept_:.3} \n Slope: {reg.coef_[0]:.5f}')
Intercept: 4.87
Slope: 0.00005
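With these parameters the fitted line reads life_satisfaction ≈ 4.87 + 0.00005 × gdp_per_capita. As a quick sanity check, a manual prediction for a hypothetical GDP per capita of 30,000 (an illustrative value):

# Manual prediction from the fitted parameters; 30_000 is an illustrative value
gdp = 30_000
reg.intercept_ + reg.coef_[0] * gdp

With the rounded parameters above, this comes out at roughly 4.87 + 1.5 ≈ 6.4 points of life satisfaction.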
Test error#
Evaluate the final model on the test set.
from sklearn.metrics import mean_squared_error

# Prediction for our test data
y_pred = reg.predict(X_test)

# Mean squared error
mean_squared_error(y_test, y_pred)
0.09021411430745645
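Since the MSE is in squared units, the root mean squared error is easier to interpret because it is on the same scale as life satisfaction:

import numpy as np

# RMSE: square root of the test MSE, on the outcome's original scale
np.sqrt(mean_squared_error(y_test, y_pred))

Here that is roughly √0.090 ≈ 0.30, i.e. the test predictions deviate from the observed life satisfaction by about 0.3 points on average.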