Simple regression model#
Does money make people happier? A simple version with data splitting and cross-validation.
Data#
Import data#
import pandas as pd
# Load the data from GitHub
LINK = "https://raw.githubusercontent.com/kirenz/datasets/master/oecd_gdp.csv"
df = pd.read_csv(LINK)
Data structure#
df.head()
|  | Country | GDP per capita | Life satisfaction |
|---|---|---|---|
| 0 | Russia | 9054.914 | 6.0 |
| 1 | Turkey | 9437.372 | 5.6 |
| 2 | Hungary | 12239.894 | 4.9 |
| 3 | Poland | 12495.334 | 5.8 |
| 4 | Slovak Republic | 15991.736 | 6.1 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 29 non-null object
1 GDP per capita 29 non-null float64
2 Life satisfaction 29 non-null float64
dtypes: float64(2), object(1)
memory usage: 824.0+ bytes
Data corrections#
# Change column names (lower case and spaces to underscore)
df.columns = df.columns.str.lower().str.replace(' ', '_')
df.head()
country | gdp_per_capita | life_satisfaction | |
---|---|---|---|
0 | Russia | 9054.914 | 6.0 |
1 | Turkey | 9437.372 | 5.6 |
2 | Hungary | 12239.894 | 4.9 |
3 | Poland | 12495.334 | 5.8 |
4 | Slovak Republic | 15991.736 | 6.1 |
Variable lists#
Prepare the data for later use:
# define the outcome variable's column name
y_label = 'life_satisfaction'

# select features (double brackets keep a DataFrame, since scikit-learn expects a 2D X)
X = df[["gdp_per_capita"]]

# create the response vector (a Series)
y = df[y_label]
Data splitting#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    shuffle=True,
                                                    random_state=42)
Investigate the data:
X_train.shape, y_train.shape
((23, 1), (23,))
X_train.head(2)
|  | gdp_per_capita |
|---|---|
| 21 | 43724.031 |
| 0 | 9054.914 |
X_test.shape, y_test.shape
((6, 1), (6,))
We make a copy of the training data for our exploratory data analysis, since we don't want to alter the original data during exploration.
df_train = pd.DataFrame(X_train.copy())
df_train = df_train.join(pd.DataFrame(y_train))
df_train.head(2)
|  | gdp_per_capita | life_satisfaction |
|---|---|---|
| 21 | 43724.031 | 6.9 |
| 0 | 9054.914 | 6.0 |
Data exploration#
import altair as alt
# Visualize the data
alt.Chart(df_train).mark_circle(size=100).encode(
x='gdp_per_capita:Q',
y='life_satisfaction:Q',
).interactive()
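To judge how linear the relationship looks, we could overlay a fitted trend line on the scatter plot. A minimal sketch, assuming Altair's `transform_regression` (this chart is our addition, not part of the original analysis):

points = alt.Chart(df_train).mark_circle(size=100).encode(
    x='gdp_per_capita:Q',
    y='life_satisfaction:Q',
)

# Linear trend fitted by Altair's regression transform
trend = points.transform_regression(
    'gdp_per_capita', 'life_satisfaction'
).mark_line()

(points + trend).interactive()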
Models#
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
from sklearn.neighbors import KNeighborsRegressor
reg2 = KNeighborsRegressor(n_neighbors=3)
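For context, `KNeighborsRegressor(n_neighbors=3)` predicts by averaging the outcomes of the three training points closest in feature space, while the linear model fits a single global line. A toy sketch of the KNN idea, using made-up numbers rather than the GDP data:

import numpy as np

# Toy one-feature data; the values are illustrative only
x_toy = np.array([1.0, 2.0, 3.0, 10.0])
y_toy = np.array([5.0, 6.0, 7.0, 9.0])

def knn_predict(x_new, k=3):
    """Average the outcomes of the k training points nearest to x_new."""
    nearest = np.argsort(np.abs(x_toy - x_new))[:k]
    return y_toy[nearest].mean()

knn_predict(2.5)  # averages y at x = 1, 2 and 3 -> 6.0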
Training & validation#
from sklearn.model_selection import cross_val_score
# cross-validation with 5 folds; scikit-learn returns the negative MSE,
# so we flip the sign to obtain the MSE
scores = -cross_val_score(reg, X_train, y_train, cv=5,
                          scoring='neg_mean_squared_error')
scores2 = -cross_val_score(reg2, X_train, y_train, cv=5,
                           scoring='neg_mean_squared_error')
# store cross-validation scores
df_scores = pd.DataFrame({"lr": scores,
"knn": scores2})
# shift the index to start at 1 so it matches the fold numbers
df_scores.index += 1
# print dataframe
df_scores.style.background_gradient(cmap='Blues')
|  | lr | knn |
|---|---|---|
| 1 | 0.335187 | 0.311778 |
| 2 | 0.101782 | 0.338444 |
| 3 | 0.097167 | 0.265333 |
| 4 | 0.252955 | 0.082500 |
| 5 | 0.443893 | 0.223889 |
alt.Chart(df_scores.reset_index()).mark_line(
    point=alt.OverlayMarkDef()
).encode(
    x=alt.X("index", bin=False, title="Fold", axis=alt.Axis(tickCount=5)),
    y=alt.Y(
        alt.repeat("layer"), aggregate="mean", title="Mean squared error (MSE)"
    ),
    color=alt.datum(alt.repeat("layer")),
).repeat(layer=["lr", "knn"])
df_scores.describe().T
|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| lr | 5.0 | 0.246197 | 0.150095 | 0.097167 | 0.101782 | 0.252955 | 0.335187 | 0.443893 |
| knn | 5.0 | 0.244389 | 0.100567 | 0.082500 | 0.223889 | 0.265333 | 0.311778 | 0.338444 |
The performance difference between the two models is small (mean MSE of 0.246 for the linear regression vs. 0.244 for KNN). Let's assume we are interested in the parameters of the model and therefore choose the linear regression.
Tuning#
We will cover model tuning (hyperparameter tuning) in another notebook and skip this part.
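For reference, a minimal sketch of what tuning the KNN model could look like, assuming a grid search over `n_neighbors` with scikit-learn's `GridSearchCV` (the grid values are an illustrative choice, not from the original notebook):

from sklearn.model_selection import GridSearchCV

# Illustrative grid of neighborhood sizes
param_grid = {'n_neighbors': [1, 3, 5, 7]}

grid = GridSearchCV(KNeighborsRegressor(), param_grid,
                    cv=5, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)

grid.best_params_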
Final training#
Train your best model with the complete training data (without cross-validation).
# Fit the model
reg.fit(X_train, y_train)
LinearRegression()
print(f' Intercept: {reg.intercept_:.3} \n Slope: {reg.coef_[0]:.5f}')
Intercept: 4.87
Slope: 0.00005
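With these parameters the fitted line reads life_satisfaction ≈ 4.87 + 0.00005 × gdp_per_capita. As a quick sanity check, a manual prediction for a hypothetical GDP per capita of 30,000 (an illustrative value):

# Manual prediction from the fitted parameters; 30_000 is an illustrative value
gdp = 30_000
reg.intercept_ + reg.coef_[0] * gdp

With the rounded parameters above, this comes out at roughly 4.87 + 1.5 ≈ 6.4 points of life satisfaction.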
Test error#
Evaluate the final model on the test set.
from sklearn.metrics import mean_squared_error

# Prediction for our test data
y_pred = reg.predict(X_test)

# Mean squared error
mean_squared_error(y_test, y_pred)
0.09021411430745645
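Since the MSE is in squared units, the root mean squared error is easier to interpret because it is on the same scale as life satisfaction:

import numpy as np

# RMSE: square root of the test MSE, on the outcome's original scale
np.sqrt(mean_squared_error(y_test, y_pred))

Here that is roughly √0.090 ≈ 0.30, i.e. the test predictions deviate from the observed life satisfaction by about 0.3 points on average.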