Online ads#
In this problem set, we analyse the relationship between online ads and purchase behavior. In particular, we want to classify which users are likely to purchase a certain product after being exposed to an online ad.
Data preparation#
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/kirenz/datasets/master/purchase.csv")
df
| | Unnamed: 0 | User ID | Gender | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|---|---|
| 0 | 1 | 15624510 | Male | 19 | 19000 | 0 |
| 1 | 2 | 15810944 | Male | 35 | 20000 | 0 |
| 2 | 3 | 15668575 | Female | 26 | 43000 | 0 |
| 3 | 4 | 15603246 | Female | 27 | 57000 | 0 |
| 4 | 5 | 15804002 | Male | 19 | 76000 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 395 | 396 | 15691863 | Female | 46 | 41000 | 1 |
| 396 | 397 | 15706071 | Male | 51 | 23000 | 1 |
| 397 | 398 | 15654296 | Female | 50 | 20000 | 1 |
| 398 | 399 | 15755018 | Male | 36 | 33000 | 0 |
| 399 | 400 | 15594041 | Female | 49 | 36000 | 1 |

400 rows × 6 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 400 non-null int64
1 User ID 400 non-null int64
2 Gender 400 non-null object
3 Age 400 non-null int64
4 EstimatedSalary 400 non-null int64
5 Purchased 400 non-null int64
dtypes: int64(5), object(1)
memory usage: 18.9+ KB
# create dummy variable for gender (male = 1, female = 0)
df['male'] = pd.get_dummies(df['Gender'], drop_first=True)
# drop irrelevant columns
df.drop(columns=['Unnamed: 0', 'User ID', 'Gender'], inplace=True)
# inspect outcome variable
df['Purchased'].value_counts()
0 257
1 143
Name: Purchased, dtype: int64
# prepare data for scikit-learn
X = df.drop(columns=['Purchased'])
y = df.Purchased
# create a 70/30 train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
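Since the split is not stratified, it is worth checking that the class balance in the training data resembles the full dataset; a quick sketch using the variables defined above:

# compare the share of purchasers in the full data and in the training split
print(y.value_counts(normalize=True))
print(y_train.value_counts(normalize=True))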
# create new training dataset for data exploration
train_dataset = X_train.copy()
train_dataset['Purchased'] = y_train
train_dataset
| | Age | EstimatedSalary | male | Purchased |
|---|---|---|---|---|
| 177 | 25 | 22000 | 1 | 0 |
| 105 | 21 | 72000 | 1 | 0 |
| 5 | 27 | 58000 | 1 | 0 |
| 288 | 41 | 79000 | 1 | 0 |
| 279 | 50 | 36000 | 0 | 1 |
| ... | ... | ... | ... | ... |
| 230 | 35 | 147000 | 0 | 1 |
| 98 | 35 | 73000 | 1 | 0 |
| 322 | 41 | 52000 | 1 | 0 |
| 382 | 44 | 139000 | 0 | 1 |
| 365 | 59 | 29000 | 0 | 1 |

280 rows × 4 columns
Exploratory data analysis (EDA)#
train_dataset.groupby(by=['Purchased']).describe().T
| | | Purchased = 0 | Purchased = 1 |
|---|---|---|---|
| Age | count | 180.000000 | 100.000000 |
| | mean | 32.672222 | 46.090000 |
| | std | 8.176018 | 8.589511 |
| | min | 18.000000 | 27.000000 |
| | 25% | 26.000000 | 39.000000 |
| | 50% | 33.500000 | 47.000000 |
| | 75% | 39.000000 | 53.000000 |
| | max | 59.000000 | 60.000000 |
| EstimatedSalary | count | 180.000000 | 100.000000 |
| | mean | 59788.888889 | 85460.000000 |
| | std | 22884.697356 | 41858.207020 |
| | min | 15000.000000 | 20000.000000 |
| | 25% | 46500.000000 | 38750.000000 |
| | 50% | 60000.000000 | 92000.000000 |
| | 75% | 76250.000000 | 122000.000000 |
| | max | 134000.000000 | 150000.000000 |
| male | count | 180.000000 | 100.000000 |
| | mean | 0.527778 | 0.420000 |
| | std | 0.500620 | 0.496045 |
| | min | 0.000000 | 0.000000 |
| | 25% | 0.000000 | 0.000000 |
| | 50% | 1.000000 | 0.000000 |
| | 75% | 1.000000 | 1.000000 |
| | max | 1.000000 | 1.000000 |
Purchasers are (on average) _______ and earn a __________ estimated salary than non-purchasers.
Visualization of differences:
import seaborn as sns
sns.pairplot(hue='Purchased', kind="reg", diag_kind="kde", data=train_dataset);
Inspect the (linear) relationships between the variables with Pearson's correlation coefficient:
df.corr().round(2)
| | Age | EstimatedSalary | Purchased | male |
|---|---|---|---|---|
| Age | 1.00 | 0.16 | 0.62 | -0.07 |
| EstimatedSalary | 0.16 | 1.00 | 0.36 | -0.06 |
| Purchased | 0.62 | 0.36 | 1.00 | -0.04 |
| male | -0.07 | -0.06 | -0.04 | 1.00 |
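Note that df.corr() is computed on the full dataset. To keep the analysis consistent with the training split used in the rest of the EDA, the same call can be applied to train_dataset (the values will differ slightly from the table above):

# Pearson correlations on the training split only
train_dataset.corr().round(2)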
sns.kdeplot(hue="Purchased", x='Age', data=train_dataset);
Purchasers seem to be _________ than non-purchasers.
sns.boxplot(x="male", y="Age", hue="Purchased", data=train_dataset);
There are __________ differences regarding gender.
sns.kdeplot(hue="Purchased", x='EstimatedSalary', data=train_dataset);
Purchasers earn a ______________ estimated salary than non-purchasers.
sns.boxplot(x="male", y="EstimatedSalary", hue="Purchased", data=train_dataset);
Insight: there are ___________ differences between males and females (regarding purchase behavior, age, and estimated salary).
Model#
Next, we will fit a logistic regression model. In particular, we use a model with built-in cross-validation that automatically selects the best value of the regularization hyperparameter C. We only use our most promising predictor variables, Age and EstimatedSalary, for our model.
# only use meaningful predictors
features_model = ['Age', 'EstimatedSalary']
X_train = X_train[features_model]
X_test = X_test[features_model]
from sklearn.linear_model import LogisticRegressionCV
# model with built-in cross-validation over the regularization strength
clf = LogisticRegressionCV()
# fit model to the training data
clf.fit(X_train, y_train)
# predict labels for the test data
y_pred = clf.predict(X_test)
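Under the hood, LogisticRegressionCV evaluates a grid of candidate regularization strengths via cross-validation. A minimal sketch to inspect what was selected, using the fitted estimator's Cs_ and C_ attributes:

# grid of candidate regularization strengths that was searched
print(clf.Cs_)
# value selected by cross-validation (one entry per class)
print(clf.C_)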
Classification metrics#
# Return the mean accuracy on the given test data and labels:
clf.score(X_test, y_test)
0.825
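The same number can be recomputed from the stored predictions; a quick cross-check with accuracy_score:

from sklearn.metrics import accuracy_score
# mean accuracy on the test set, computed from the saved predictions
accuracy_score(y_test, y_pred)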
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test);
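If you need the raw counts behind the plot for further calculations, confusion_matrix returns them as an array; a minimal sketch:

from sklearn.metrics import confusion_matrix
# rows are true classes (0, 1), columns are predicted classes (0, 1)
confusion_matrix(y_test, y_pred)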
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=['Not purchased', 'Purchased']))
precision recall f1-score support
Not purchased 0.82 0.94 0.87 77
Purchased 0.84 0.63 0.72 43
accuracy 0.82 120
macro avg 0.83 0.78 0.80 120
weighted avg 0.83 0.82 0.82 120
macro
: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
weighted
: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance.
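Both averages can also be computed directly; a minimal sketch with recall_score (precision_score works analogously):

from sklearn.metrics import recall_score
# unweighted mean of the per-class recalls (macro; ignores label imbalance)
recall_score(y_test, y_pred, average='macro')
# mean of the per-class recalls, weighted by support
recall_score(y_test, y_pred, average='weighted')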
Note that recall is also sometimes called sensitivity or true positive rate.
High scores for both precision and recall show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall).
The importance of precision vs. recall depends on the use case at hand (and the costs associated with misclassification).
A system with high recall but low precision returns many results, but most of its predicted labels are incorrect when compared to the training labels.
A system with high precision but low recall is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the training labels.
An ideal system with high precision and high recall will return many results, with most results labeled correctly.
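To make these definitions concrete, precision and recall for the positive class can be derived from the confusion-matrix counts; a minimal sketch (scikit-learn's ravel() ordering for binary problems is tn, fp, fn, tp):

from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision = tp / (tp + fp)  # share of predicted purchasers who actually purchased
recall = tp / (tp + fn)     # share of actual purchasers that the model found
print(precision, recall)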
The unweighted recall of our model is _____
The unweighted precision of our model is _____
ROC Curve#
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(clf, X_test, y_test);
AUC Score#
from sklearn.metrics import roc_auc_score
y_score = clf.predict_proba(X_test)[:, 1]  # predicted probability of the positive class (Purchased = 1)
roc_auc_score(y_test, y_score)
0.9311386288130474
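The curve shown above can also be obtained as raw arrays, for example to choose a custom decision threshold; a minimal sketch using roc_curve:

from sklearn.metrics import roc_curve
# false positive rates, true positive rates, and the thresholds that produce them
fpr, tpr, thresholds = roc_curve(y_test, y_score)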