Regression splines#
The following code tutorial is mainly based on the scikit-learn documentation about splines provided by Mathieu Blondel, Jake Vanderplas, Christian Lorentzen, and Malte Londschien, and on code from Jordi Warmenhoven. To learn more about the spline regression method, review “An Introduction to Statistical Learning” [James et al., 2021].
Regression splines involve dividing the range of a feature X into K distinct regions, separated by so-called knots. Within each region, a polynomial function is fit to the data; these piecewise polynomials are typically represented with basis splines (B-splines).
In the following example, various piecewise polynomials are fit to the data, with one knot at age=50 [James et al., 2021]:
Figures:

- Top left: the cubic polynomials are unconstrained.
- Top right: the cubic polynomials are constrained to be continuous at age=50.
- Bottom left: the cubic polynomials are constrained to be continuous, and to have continuous first and second derivatives.
- Bottom right: a linear spline is shown, which is constrained to be continuous.
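To make the piecewise idea concrete, here is a minimal numpy sketch (using synthetic data, not the wage data from the figure) that fits two unconstrained cubic polynomials on either side of a single knot, as in the top-left panel:

import numpy as np
rng = np.random.default_rng(42)
# synthetic "age" and "wage"-like data
x = np.sort(rng.uniform(18, 80, 200))
y = 50 + 2.5 * x - 0.025 * x**2 + rng.normal(0, 10, 200)
knot = 50
left, right = x < knot, x >= knot
# fit a separate, unconstrained cubic polynomial in each region
coef_left = np.polyfit(x[left], y[left], deg=3)
coef_right = np.polyfit(x[right], y[right], deg=3)
pred_left = np.polyval(coef_left, x[left])
pred_right = np.polyval(coef_right, x[right])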
The polynomials are usually constrained so that they join smoothly at the region boundaries, or knots.
Provided that the interval is divided into enough regions, this can produce an extremely flexible fit [James et al., 2021]:
Figure: A cubic spline and a natural cubic spline, with three knots. The dashed lines denote the knot locations.
To understand the advantages of regression splines, we first start with a linear ridge regression model, then build a simple polynomial regression, and finally proceed to splines.
Setup#
import numpy as np
np.set_printoptions(suppress=True)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
%matplotlib inline
import seaborn as sns
Write function to obtain RMSE for training and test data:
results = []

# create function to obtain the RMSE of the current model (stored in the
# global variable `reg`) on the training and test data
def model_results(model_name):
    # Training data
    pred_train = reg.predict(X_train)
    rmse_train = round(mean_squared_error(y_train, pred_train, squared=False), 4)
    # Test data
    pred_test = reg.predict(X_test)
    rmse_test = round(mean_squared_error(y_test, pred_test, squared=False), 4)
    # Save model results
    new_results = {"model": model_name, "rmse_train": rmse_train, "rmse_test": rmse_test}
    results.append(new_results)
    return results
Data#
Import#
df = pd.read_csv('https://raw.githubusercontent.com/kirenz/datasets/master/wage.csv')
df.head(3)
| | Unnamed: 0 | year | age | maritl | race | education | region | jobclass | health | health_ins | logwage | wage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 231655 | 2006 | 18 | 1. Never Married | 1. White | 1. < HS Grad | 2. Middle Atlantic | 1. Industrial | 1. <=Good | 2. No | 4.318063 | 75.043154 |
| 1 | 86582 | 2004 | 24 | 1. Never Married | 1. White | 4. College Grad | 2. Middle Atlantic | 2. Information | 2. >=Very Good | 2. No | 4.255273 | 70.476020 |
| 2 | 161300 | 2003 | 45 | 2. Married | 1. White | 3. Some College | 2. Middle Atlantic | 1. Industrial | 1. <=Good | 1. Yes | 4.875061 | 130.982177 |
Create label and feature#
We only use the feature age to predict wage. Because we only use one predictor, we don’t perform data standardization:
X = df[['age']]
y = df[['wage']]
Data split#
Divide the data into train and test datasets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 1)
Data exploration#
Visualize the relationship between age and wage:
# seaborn settings
custom_params = {"axes.spines.right": False, "axes.spines.top": False}
sns.set_theme(style="ticks", rc=custom_params)
# plot
sns.scatterplot(x=X_train['age'], y=y_train['wage'], alpha=0.4);
Ridge regression#
First, we obtain the optimal alpha parameter with cross-validation. We try different values for alpha:
alphas=np.logspace(-6, 6, 13)
alphas
array([ 0.000001, 0.00001 , 0.0001 , 0.001 ,
0.01 , 0.1 , 1. , 10. ,
100. , 1000. , 10000. , 100000. ,
1000000. ])
Fit model
reg = linear_model.RidgeCV(alphas=alphas)
reg.fit(X_train,y_train)
RidgeCV(alphas=array([ 0.000001, 0.00001 , 0.0001 , 0.001 ,
0.01 , 0.1 , 1. , 10. ,
100. , 1000. , 10000. , 100000. ,
1000000. ]))
Show best alpha
reg.alpha_
1000.0
Show coefficients
print(reg.coef_)
print(reg.intercept_)
[[0.71854335]]
[80.69608773]
model_results(model_name="Ridge")
[{'model': 'Ridge', 'rmse_train': 40.7053, 'rmse_test': 41.4135}]
sns.regplot(x=X_train['age'],
y=y_train['wage'],
ci=None,
line_kws={"color": "orange"});
Polynomial regression#
Next, we use a pipeline to add non-linear features to a ridge regression model.
We use make_pipeline, which is a shorthand for the Pipeline constructor. It does not require, and does not permit, naming the estimators; instead, their names are set automatically to the lowercase of their types:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# use polynomial features with degree 3
reg = make_pipeline(
PolynomialFeatures(degree=3),
linear_model.RidgeCV(alphas=alphas)
)
reg.fit(X_train, y_train)
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=3)),
('ridgecv',
RidgeCV(alphas=array([ 0.000001, 0.00001 , 0.0001 , 0.001 ,
0.01 , 0.1 , 1. , 10. ,
100. , 1000. , 10000. , 100000. ,
1000000. ])))])
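Because we created the pipeline with make_pipeline, the step names are simply the lowercased class names (as the pipeline repr above shows). A quick check:

list(reg.named_steps)
# ['polynomialfeatures', 'ridgecv']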
model_results(model_name="Polynomial Reg")
[{'model': 'Ridge', 'rmse_train': 40.7053, 'rmse_test': 41.4135},
{'model': 'Polynomial Reg', 'rmse_train': 39.7717, 'rmse_test': 40.2513}]
# plot
sns.regplot(x=X_train['age'],
y=y_train['wage'],
ci=None,
order=3,
line_kws={"color": "orange"});
Splines (scikit-learn)#
Note that spline transformers are a new feature in scikit-learn 1.0, so make sure to use a recent version of scikit-learn. Use conda list scikit-learn to see which scikit-learn version is installed. If you use Anaconda, you can update all packages with conda update --all.
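You can also check the installed version from within Python:

import sklearn
print(sklearn.__version__)  # should be >= 1.0 for SplineTransformer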
Spline transformer#
SplineTransformer places the knots either uniformly along the feature range (this is the default) or at quantiles of the data. We only have to specify the desired number of knots; SplineTransformer then places them automatically.
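For skewed features, quantile placement is often preferable. A minimal sketch of this option (not used in the remainder of this tutorial):

from sklearn.preprocessing import SplineTransformer
# place the knots at quantiles of the feature instead of uniformly
splt_q = SplineTransformer(n_knots=4, degree=3, knots='quantile')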
from sklearn.preprocessing import SplineTransformer
# use a spline with 4 knots and degree 3
# we combine the spline with a ridge regression
reg = make_pipeline(
SplineTransformer(n_knots=4, degree=3),
linear_model.RidgeCV(alphas=alphas)
)
reg.fit(X_train, y_train)
Pipeline(steps=[('splinetransformer', SplineTransformer(n_knots=4)),
('ridgecv',
RidgeCV(alphas=array([ 0.000001, 0.00001 , 0.0001 , 0.001 ,
0.01 , 0.1 , 1. , 10. ,
100. , 1000. , 10000. , 100000. ,
1000000. ])))])
model_results(model_name = "Cubic Spline")
[{'model': 'Ridge', 'rmse_train': 40.7053, 'rmse_test': 41.4135},
{'model': 'Polynomial Reg', 'rmse_train': 39.7717, 'rmse_test': 40.2513},
{'model': 'Cubic Spline', 'rmse_train': 39.7456, 'rmse_test': 40.2444}]
Obtain the knots to show them in the following plot. Note that there are degree additional knots on each side of the fitted interval; these exist for technical reasons, so we refrain from showing them:
splt = SplineTransformer(n_knots=4, degree=3)
splt.fit(X_train)
knots = splt.bsplines_[0].t
knots[3:-3]
array([18. , 38.66666667, 59.33333333, 80. ])
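As a quick check, the full knot vector of the fitted transformer contains the n_knots knots plus degree padding knots on each side:

len(knots)
# 10 = 4 knots + 2 * 3 padding knots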
import matplotlib.pyplot as plt
# Create new observations as a DataFrame, so the fitted
# SplineTransformer receives the feature name it was trained with
x_new = pd.DataFrame({'age': np.linspace(X_test['age'].min(), X_test['age'].max(), 100)})
# Make some predictions
pred = reg.predict(x_new)
# plot
sns.scatterplot(x=X_train['age'], y=y_train['wage'])
plt.plot(x_new['age'], pred, label='Cubic spline with 4 knots and degree=3', color='orange')
plt.vlines(knots[3:-3], ymin=0, ymax=300, linestyles="dashed")
plt.legend();
Periodic splines#
In some settings, e.g. in time series data with seasonal effects, we expect a periodic continuation of the underlying signal.
Such effects can be modelled using periodic splines, which have equal function value and equal derivatives at the first and last knot.
Review this notebook to learn more about periodic splines in scikit-learn: periodic splines
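As a minimal, self-contained sketch (with made-up hourly data, not part of this tutorial's dataset), a periodic spline can be requested via the extrapolation='periodic' option of SplineTransformer:

import numpy as np
from sklearn.preprocessing import SplineTransformer
from sklearn.pipeline import make_pipeline
from sklearn import linear_model
# two days of hourly observations with a 24-hour cycle
hours = np.arange(48).reshape(-1, 1)
signal = np.sin(2 * np.pi * hours[:, 0] / 24)
reg_periodic = make_pipeline(
    SplineTransformer(n_knots=10, degree=3, extrapolation='periodic'),
    linear_model.Ridge(alpha=1.0)
)
reg_periodic.fit(hours, signal)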
Splines (statsmodels)#
In statsmodels, we can use pandas DataFrames, so let’s join our label and feature:
df_train = y_train.join(X_train)
df_test = y_test.join(X_test)
df_train
| | wage | age |
|---|---|---|
| 1045 | 127.115744 | 45 |
| 2717 | 118.884359 | 51 |
| 2835 | 104.921507 | 32 |
| 2913 | 87.981033 | 58 |
| 959 | 61.052421 | 50 |
| ... | ... | ... |
| 2763 | 73.775743 | 44 |
| 905 | 104.921507 | 49 |
| 1096 | 148.413159 | 61 |
| 235 | 81.283253 | 34 |
| 1061 | 109.833986 | 29 |

2100 rows × 2 columns
Instead of providing the number of knots, in statsmodels we have to specify the degrees of freedom (df). df defines how many parameters we have to estimate. The degrees of freedom have a specific relationship with the number of knots and the degree, which depends on the type of spline (see Stackoverflow):

In the case of B-splines:

- \(df = k + degree\) if you specify the knots, or
- \(k = df - degree\) if you specify the degrees of freedom and the degree.
As an example:
- A cubic spline (degree=3) with 4 knots (k=4) will have \(df = 4 + 3 = 7\) degrees of freedom. If we use an intercept, we need to add an additional degree of freedom.
- A cubic spline (degree=3) with 5 degrees of freedom (df=5) will have \(k = 5 - 3 = 2\) knots (assuming the spline has no intercept).

In our case, we want to fit a cubic spline (degree=3) with an intercept and three knots (k=3). This equals \(df = 3 + 3 + 1 = 7\) for our feature, which means these degrees of freedom are used up by an intercept plus six basis functions.
import statsmodels.api as sm
from statsmodels.gam.api import GLMGam, BSplines
bs = BSplines(X_train[['age']], df=7, degree=3)
We fit a Generalized Additive Model (GAM). To learn more about GAMs, visit Generalized Additive Models (GAM).
reg = GLMGam.from_formula('wage ~ age', data=df_train, smoother=bs).fit()
Take a look at the results
print(reg.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: wage No. Observations: 2100
Model: GLMGam Df Residuals: 2093
Model Family: Gaussian Df Model: 6
Link Function: identity Scale: 1582.9
Method: PIRLS Log-Likelihood: -10712.
Date: Sat, 23 Apr 2022 Deviance: 3.3129e+06
Time: 13:19:11 Pearson chi2: 3.31e+06
No. Iterations: 3 Pseudo R-squ. (CS): 0.09038
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 38.3686 11.185 3.431 0.001 16.447 60.290
age 1.1034 0.182 6.074 0.000 0.747 1.459
age_s0 17.0585 13.393 1.274 0.203 -9.192 43.309
age_s1 38.4351 6.901 5.569 0.000 24.909 51.961
age_s2 34.9717 6.524 5.361 0.000 22.186 47.758
age_s3 13.6359 7.383 1.847 0.065 -0.835 28.107
age_s4 12.8306 14.500 0.885 0.376 -15.588 41.250
age_s5 -53.3174 12.324 -4.326 0.000 -77.472 -29.163
==============================================================================
Obtain RMSE for training and test data:
model_name = "Spline statsmodels"
# Training data
df_train['pred_train'] = reg.predict()
rmse_train = round(mean_squared_error(df_train['wage'], df_train['pred_train'], squared=False), 4)
# Test data
df_test['pred_test'] = reg.predict(df_test, exog_smooth=df_test['age'])
rmse_test = round(mean_squared_error(df_test['wage'], df_test['pred_test'], squared=False), 4)
# Save model results
new_results = {"model": model_name, "rmse_train": rmse_train, "rmse_test": rmse_test}
results.append(new_results)
results
[{'model': 'Ridge', 'rmse_train': 40.7053, 'rmse_test': 41.4135},
{'model': 'Polynomial Reg', 'rmse_train': 39.7717, 'rmse_test': 40.2513},
{'model': 'Cubic Spline', 'rmse_train': 39.7456, 'rmse_test': 40.2444},
{'model': 'Spline statsmodels', 'rmse_train': 39.7188, 'rmse_test': 40.2194}]
We plot the spline with statsmodels' .plot_partial method (setting cpr=True adds the partial residuals to the plot):
reg.plot_partial(0, cpr=True, plot_se=False)
Natural spline#
Finally, we fit a natural spline with patsy and statsmodels.
In patsy, one can specify the number of degrees of freedom directly (the actual number of columns of the resulting design matrix).

In the case of natural splines: \(df = k - 1\) if you specify the knots, or \(k = df + 1\) if you specify the degrees of freedom.
from patsy import dmatrix
bs = dmatrix("cr(train, df = 3)", {"train": X_train}, return_type='dataframe')
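To see how df corresponds to the number of natural-spline basis columns (patsy adds an intercept column by default), we can inspect the design matrix:

bs.head(3)
# expect an Intercept column plus the df=3 natural-spline basis columns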
We use statsmodels’ generalized linear models (review this tutorial to learn more about GAMs):
reg = sm.GLM(y_train, bs).fit()
model_name = 'Natural Spline'
# Training data
pred_train = reg.predict(dmatrix("cr(train, df=3)", {"train": X_train}, return_type='dataframe'))
rmse_train = round(mean_squared_error(y_train, pred_train, squared=False), 4)
# Test data
pred_test = reg.predict(dmatrix("cr(test, df=3)", {"test": X_test}, return_type='dataframe'))
rmse_test = round(mean_squared_error(y_test, pred_test, squared=False), 4)
# Save model results
new_results = {"model": model_name, "rmse_train": rmse_train, "rmse_test": rmse_test}
results.append(new_results)
results
[{'model': 'Ridge', 'rmse_train': 40.7053, 'rmse_test': 41.4135},
{'model': 'Polynomial Reg', 'rmse_train': 39.7717, 'rmse_test': 40.2513},
{'model': 'Cubic Spline', 'rmse_train': 39.7456, 'rmse_test': 40.2444},
{'model': 'Spline statsmodels', 'rmse_train': 39.7188, 'rmse_test': 40.2194},
{'model': 'Natural Spline', 'rmse_train': 39.8826, 'rmse_test': 40.3252}]
Plot the model:
xp = np.linspace(X_test['age'].min(), X_test['age'].max(), 100)
# Make predictions
pred = reg.predict(dmatrix("cr(xp, df=3)", {"xp": xp}, return_type='dataframe'))
# plot
sns.scatterplot(x=X_train['age'], y=y_train['wage'])
plt.plot(xp, pred, color='orange', label='Natural spline with df=3')
plt.legend();
Models summary#
Create a DataFrame from the model results and sort it by test RMSE:
df_results = pd.DataFrame(results)
df_results.sort_values(by=['rmse_test'])
| | model | rmse_train | rmse_test |
|---|---|---|---|
| 3 | Spline statsmodels | 39.7188 | 40.2194 |
| 2 | Cubic Spline | 39.7456 | 40.2444 |
| 1 | Polynomial Reg | 39.7717 | 40.2513 |
| 4 | Natural Spline | 39.8826 | 40.3252 |
| 0 | Ridge | 40.7053 | 41.4135 |