TensorFlow#
Case study: Houses for sale
Setup#
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
print(tf.__version__)
sns.set_theme(style="ticks", color_codes=True)
2.7.1
Data preparation#
See the notebook “Data” for details about data preprocessing.
from case_duke_data_prep import *
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 97 entries, 0 to 97
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price 97 non-null int64
1 bed 97 non-null int64
2 bath 97 non-null float64
3 area 97 non-null int64
4 year_built 97 non-null int64
5 cooling 97 non-null category
6 lot 97 non-null float64
dtypes: category(1), float64(2), int64(4)
memory usage: 5.5 KB
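If the companion script case_duke_data_prep is not at hand, any stand-in that produces a DataFrame named df with the columns shown above will do; the file name and cleaning steps below are assumptions, not the actual preprocessing:

# Hypothetical stand-in for case_duke_data_prep (file name and steps are assumed)
df = pd.read_csv("duke_forest.csv")
df = df[["price", "bed", "bath", "area", "year_built", "cooling", "lot"]]
df["cooling"] = df["cooling"].astype("category")
df = df.dropna()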
Simple regression#
We start with a single-variable linear regression to predict price from area.
# Select features for simple regression
features = ['area']
X = df[features]
X.info()
print("Missing values:",X.isnull().any(axis = 1).sum())
# Create response
y = df["price"]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 97 entries, 0 to 97
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 area 97 non-null int64
dtypes: int64(1)
memory usage: 1.5 KB
Missing values: 0
Data splitting#
# Train Test Split
# Use random_state to make this notebook's output identical at every run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Linear regression#
Training a model with tf.keras typically starts by defining the model architecture.
Here we use a keras.Sequential model, which represents a sequence of steps. In this case there is only one step:
Apply a linear transformation to produce 1 output using layers.Dense.
The number of inputs can either be set explicitly with the input_shape argument or inferred automatically when the model is run for the first time.
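For example, a model built without input_shape is only given its weights when it first sees data (lm_lazy below is just an illustrative name, not part of the notebook):

# Without input_shape, the layer is built on the first call with data
lm_lazy = tf.keras.Sequential([layers.Dense(units=1)])
lm_lazy.predict(np.array([[1500.0]]))  # one sample with one feature builds the layer
lm_lazy.summary()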
Build the sequential model:
lm = tf.keras.Sequential([
    layers.Dense(units=1, input_shape=(1,))
])
lm.summary()
Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_2 (Dense) (None, 1) 2
=================================================================
Total params: 2
Trainable params: 2
Non-trainable params: 0
_________________________________________________________________
Once the model is built, configure the training procedure using the Model.compile() method.
The most important arguments to compile are the loss and the optimizer, since these define what will be optimized (here, mean_absolute_error) and how (here, with optimizers.Adam).
lm.compile(
optimizer=tf.optimizers.Adam(),
loss='mean_absolute_error'
)
Once the training is configured, use Model.fit() to execute the training:
%%time
history = lm.fit(
X_train, y_train,
epochs=200,
# suppress logging
verbose=0,
# Calculate validation results on 20% of the training data
validation_split = 0.2)
CPU times: user 4.63 s, sys: 386 ms, total: 5.01 s
Wall time: 5.39 s
y_train
49 525000
71 540000
69 105000
15 610000
39 535000
...
61 580000
72 650000
14 631500
93 541000
51 725000
Name: price, Length: 77, dtype: int64
# Calculate R squared
y_pred = lm.predict(X_train).astype(np.int64)
r2_score(y_train, y_pred)
-7.042298800162028
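The strongly negative value simply means this untuned model fits the training data worse than always predicting the mean price. For reference, the same R² computed by hand (a sketch, not part of the original notebook):

# R squared by hand: 1 - SS_res / SS_tot; negative when the model is worse
# than always predicting the mean of y_train
ss_res = np.sum((y_train.to_numpy() - y_pred.ravel()) ** 2)
ss_tot = np.sum((y_train.to_numpy() - y_train.mean()) ** 2)
print(1 - ss_res / ss_tot)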
# slope coefficient
lm.layers[0].kernel
<tf.Variable 'dense_2/kernel:0' shape=(1, 1) dtype=float32, numpy=array([[-1.0806017]], dtype=float32)>
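The layer also stores an intercept in its bias attribute; reading both out gives the fitted line (a small convenience step, not part of the original notebook):

# Slope and intercept of the fitted line: price ~ w * area + b
w = lm.layers[0].kernel.numpy().item()
b = lm.layers[0].bias.numpy().item()
print(f"price ~ {w:.2f} * area + {b:.2f}")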
Visualize the model’s training progress using the stats stored in the history object.
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()
|     | loss       | val_loss    | epoch |
|-----|------------|-------------|-------|
| 195 | 552219.375 | 630846.2500 | 195   |
| 196 | 552213.875 | 630840.0000 | 196   |
| 197 | 552208.375 | 630833.6875 | 197   |
| 198 | 552202.875 | 630827.3750 | 198   |
| 199 | 552197.375 | 630821.0000 | 199   |
def plot_loss(history):
    plt.plot(history.history['loss'], label='loss')
    plt.xlabel('Epoch')
    plt.ylabel('Error [price]')
    plt.legend()
    plt.grid(True)
plot_loss(history)
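The history object also records val_loss (see the table above). A variant of plot_loss that overlays the validation curve, if that is useful, could look like this (an addition, not part of the original notebook):

def plot_loss_with_val(history):
    # Overlay training and validation loss from the History object
    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['val_loss'], label='val_loss')
    plt.xlabel('Epoch')
    plt.ylabel('Error [price]')
    plt.legend()
    plt.grid(True)

plot_loss_with_val(history)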
Collect the results (mean absolute error) on the test set, for later:
test_results = {}
test_results['lm'] = lm.evaluate(
X_test,
y_test, verbose=0)
test_results
{'lm': 543700.625}
Since this is a single-variable regression, it's easy to look at the model's predictions as a function of the input:
x = tf.linspace(0.0, 6200, 6201)
y = lm.predict(x)
y
array([[ 3.9999834e-01],
[-6.8060338e-01],
[-1.7612051e+00],
...,
[-6.6971694e+03],
[-6.6982500e+03],
[-6.6993306e+03]], dtype=float32)
def plot_area(x, y):
    plt.scatter(X_train['area'], y_train, label='Data')
    plt.plot(x, y, color='k', label='Predictions')
    plt.xlabel('area')
    plt.ylabel('price')
    plt.legend()
plot_area(x,y)
Multiple regression#
# Select all relevant features
features= [
'bed',
'bath',
'area',
'year_built',
'cooling',
'lot'
]
X = df[features]
# Convert categorical to numeric
X = pd.get_dummies(X, columns=['cooling'], prefix='cooling', prefix_sep='_')
X.info()
print("Missing values:",X.isnull().any(axis = 1).sum())
# Create response
y = df["price"]
from sklearn.model_selection import train_test_split
# Train Test Split
# Use random_state to make this notebook's output identical at every run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
lm_2 = tf.keras.Sequential([
    layers.Dense(units=1, input_shape=(7,))
])
lm_2.summary()
lm_2.compile(
optimizer=tf.optimizers.Adam(learning_rate=0.1),
loss='mean_absolute_error'
)
%%time
history = lm_2.fit(
X_train, y_train,
epochs=500,
# suppress logging
verbose=0,
# Calculate validation results on 20% of the training data
validation_split = 0.2
)
CPU times: user 6.7 s, sys: 430 ms, total: 7.13 s
Wall time: 6.52 s
# Calculate R squared
y_pred = lm_2.predict(X_train).astype(np.int64)
y_true = y_train.astype(np.int64)
r2_score(y_train, y_pred)
0.008066631014296388
# slope coefficients
lm_2.layers[0].kernel
<tf.Variable 'dense_14/kernel:0' shape=(7, 1) dtype=float32, numpy=
array([[93.865845],
[95.36067 ],
[93.04233 ],
[92.90618 ],
[95.016266],
[96.31462 ],
[87.34802 ]], dtype=float32)>
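These seven weights line up with the columns of X_train in order, so pairing them up makes the output easier to read (a convenience step, not part of the original notebook):

# Label each weight with the feature column it multiplies
pd.Series(lm_2.layers[0].kernel.numpy().ravel(), index=X_train.columns)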
plot_loss(history)
test_results['lm_2'] = lm_2.evaluate(
X_test, y_test, verbose=0)
DNN regression#
This model contains a few more layers than the previous one:
A first Dense layer that maps the 7 input features to a single unit.
Two hidden, nonlinear Dense layers with relu activations.
A linear single-output layer.
dnn_model = keras.Sequential([
    layers.Dense(units=1, input_shape=(7,)),  # linear layer on the 7 input features
    layers.Dense(64, activation='relu'),      # hidden layer
    layers.Dense(64, activation='relu'),      # hidden layer
    layers.Dense(1)                           # linear output layer
])
dnn_model.compile(loss='mean_absolute_error',
optimizer=tf.keras.optimizers.Adam(0.001))
%%time
history = dnn_model.fit(
X_train, y_train,
epochs=500,
verbose=0,
validation_split = 0.2)
CPU times: user 7.19 s, sys: 494 ms, total: 7.69 s
Wall time: 6.91 s
# Calculate R squared
y_pred = dnn_model.predict(X_train).astype(np.int64)
y_true = y_train.astype(np.int64)
r2_score(y_train, y_pred)
0.3319246298888
plot_loss(history)
test_results['dnn_model'] = dnn_model.evaluate(
X_test, y_test, verbose=0)
Performance comparison#
pd.DataFrame(test_results, index=['Mean absolute error [price]']).T
|           | Mean absolute error [price] |
|-----------|-----------------------------|
| lm        | 537925.625000               |
| lm_2      | 145734.390625               |
| dnn_model | 103795.859375               |
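As a follow-up (not computed in the original notebook), the comparison can also be made on the R² scale used earlier. Note that lm was fit on the single-feature split, so only the two multi-feature models are evaluated against the current X_test:

# Test-set R squared for the models fit on the full feature set
for name, model in [('lm_2', lm_2), ('dnn_model', dnn_model)]:
    r2 = r2_score(y_test, model.predict(X_test))
    print(f"{name}: test R squared = {r2:.3f}")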