# Simple statsmodels model
Does money make people happier? This is the simple version of the analysis, fit on the full dataset without a train/test split.
## Data preparation
import pandas as pd
# Load the data from GitHub
LINK = "https://raw.githubusercontent.com/kirenz/datasets/master/oecd_gdp.csv"
df = pd.read_csv(LINK)
df.head()
|   | Country | GDP per capita | Life satisfaction |
|---|---|---|---|
| 0 | Russia | 9054.914 | 6.0 |
| 1 | Turkey | 9437.372 | 5.6 |
| 2 | Hungary | 12239.894 | 4.9 |
| 3 | Poland | 12495.334 | 5.8 |
| 4 | Slovak Republic | 15991.736 | 6.1 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 29 non-null object
1 GDP per capita 29 non-null float64
2 Life satisfaction 29 non-null float64
dtypes: float64(2), object(1)
memory usage: 824.0+ bytes
# Standardize the column names: lowercase, spaces replaced with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')
df.head()
|   | country | gdp_per_capita | life_satisfaction |
|---|---|---|---|
| 0 | Russia | 9054.914 | 6.0 |
| 1 | Turkey | 9437.372 | 5.6 |
| 2 | Hungary | 12239.894 | 4.9 |
| 3 | Poland | 12495.334 | 5.8 |
| 4 | Slovak Republic | 15991.736 | 6.1 |
%matplotlib inline
import seaborn as sns
# Visualize the relationship between GDP per capita and life satisfaction
sns.relplot(x="gdp_per_capita", y="life_satisfaction", hue="country", data=df);
## Simple linear regression model
import statsmodels.formula.api as smf
# Estimate the model with ordinary least squares (OLS)
lm = smf.ols(formula='life_satisfaction ~ gdp_per_capita', data=df).fit()
# Show the regression results
lm.summary()
| Dep. Variable: | life_satisfaction | R-squared: | 0.734 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.725 |
| Method: | Least Squares | F-statistic: | 74.67 |
| Date: | Wed, 16 Mar 2022 | Prob (F-statistic): | 2.95e-09 |
| Time: | 16:53:19 | Log-Likelihood: | -16.345 |
| No. Observations: | 29 | AIC: | 36.69 |
| Df Residuals: | 27 | BIC: | 39.42 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |
|   | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 4.8531 | 0.207 | 23.481 | 0.000 | 4.429 | 5.277 |
| gdp_per_capita | 4.912e-05 | 5.68e-06 | 8.641 | 0.000 | 3.75e-05 | 6.08e-05 |
| Omnibus: | 0.308 | Durbin-Watson: | 1.454 |
|---|---|---|---|
| Prob(Omnibus): | 0.857 | Jarque-Bera (JB): | 0.486 |
| Skew: | -0.094 | Prob(JB): | 0.784 |
| Kurtosis: | 2.394 | Cond. No. | 9.19e+04 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.19e+04. This might indicate that there are strong multicollinearity or other numerical problems.
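The coefficient table can also be read programmatically from the fitted results object. A minimal sketch (the variable names b0 and b1 are mine) that pulls out the intercept and slope and writes down the fitted equation:

# Access the estimated coefficients directly from the results object
b0 = lm.params["Intercept"]
b1 = lm.params["gdp_per_capita"]

# Fitted equation: life_satisfaction is roughly 4.8531 + 4.912e-05 * gdp_per_capita
print(f"life_satisfaction = {b0:.4f} + {b1:.3e} * gdp_per_capita")

# 95% confidence intervals for both coefficients
lm.conf_int()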
# Plot regression line
sns.lmplot(data=df, x="gdp_per_capita", y="life_satisfaction", ci=False);
# Add the regression predictions (as "y_pred") to our DataFrame
df['y_pred'] = lm.predict(df.gdp_per_capita)
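Because the model was fit with the formula interface, it can also predict for new observations passed as a DataFrame that contains a gdp_per_capita column. A minimal sketch with a hypothetical GDP value of 30,000 (the value and the name new_obs are illustrative, not from the data):

# Predict life satisfaction for a hypothetical country (illustrative value)
new_obs = pd.DataFrame({"gdp_per_capita": [30000]})
lm.predict(new_obs)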
from statsmodels.tools.eval_measures import mse, rmse
# Performance measures
# MSE
mse(df['life_satisfaction'], df['y_pred'])
0.18075033705835147
# RMSE
rmse(df['life_satisfaction'], df['y_pred'])
0.42514742979153886
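As a sanity check, the same numbers follow directly from the residuals: the MSE is the mean of the squared residuals and the RMSE is its square root. A minimal sketch with NumPy (the variable names are mine):

import numpy as np

# Residuals: observed minus predicted life satisfaction
residuals = df["life_satisfaction"] - df["y_pred"]

# MSE = mean of squared residuals, RMSE = square root of the MSE
mse_manual = (residuals ** 2).mean()
rmse_manual = np.sqrt(mse_manual)

print(mse_manual, rmse_manual)  # should match the values above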