Statistics and Plots

Pandas Introduction

Jan Kirenz

Setup

import pandas as pd

df = pd.DataFrame({
    'name': ["Tom", "Lisa", "Peter"],
    'height': [1.68, 1.93, 1.72],
    'weight': [48.4, 89.8, 84.2],
    'id': [1, 2, 3],
    'city': ['Stuttgart', 'Stuttgart', 'Berlin']
})

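# Body mass index: weight (kg) divided by squared height (m), rounded to 2 decimals
df['bmi'] = (df['weight'] / df['height'] ** 2).round(2)
# Store name as a categorical variable and id as a string (it is a label, not a quantity)
df["name"] = df["name"].astype("category")
df['id'] = df['id'].astype(str)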

Numeric Data

Mean

We can calculate simple statistics like the mean:

df['height'].mean()

1.7766666666666666

df['height'].mean().round(2)

1.78

Formatted string literals

Print the value in a readable format using formatted string literals (f"..."):

print(f"The mean of height is {df['height'].mean():.2f}")

The mean of height is 1.78

Median and Standard Deviation

df['height'].median()

1.72

df['height'].std()

0.13428824718989124
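
Note that std() in pandas returns the sample standard deviation by default (ddof=1, i.e. division by n - 1). The following is a minimal sketch of the difference to the population standard deviation (ddof=0); the Series heights is a stand-in for the height column used above.

```python
import pandas as pd

# Same values as the height column in the setup
heights = pd.Series([1.68, 1.93, 1.72], name="height")

# Default: sample standard deviation (divides by n - 1)
sample_std = heights.std()

# Population standard deviation (divides by n)
population_std = heights.std(ddof=0)

print(f"Sample std:     {sample_std:.4f}")
print(f"Population std: {population_std:.4f}")
```

With only three observations the two values differ noticeably, so it is worth knowing which one you report.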

Describe

describe() shows a quick statistical summary of your numerical data.

df.describe()

	height	weight	bmi
count	3.000000	3.000000	3.000000
mean	1.776667	74.133333	23.240000
std	0.134288	22.460929	5.704972
min	1.680000	48.400000	17.150000
25%	1.700000	66.300000	20.630000
50%	1.720000	84.200000	24.110000
75%	1.825000	87.000000	26.285000
max	1.930000	89.800000	28.460000

Describe with transpose

df.describe().T.round(2)

	count	mean	std	min	25%	50%	75%	max
height	3.0	1.78	0.13	1.68	1.70	1.72	1.82	1.93
weight	3.0	74.13	22.46	48.40	66.30	84.20	87.00	89.80
bmi	3.0	23.24	5.70	17.15	20.63	24.11	26.28	28.46

Describe for specific columns with groupby

Summary statistics for the numeric variable height for the different levels of the categorical variable city:

df[['height', 'city']].groupby(['city']).describe().round(2).T

	city	Berlin	Stuttgart
height	count	1.00	2.00
	mean	1.72	1.80
	std	NaN	0.18
	min	1.72	1.68
	25%	1.72	1.74
	50%	1.72	1.80
	75%	1.72	1.87
	max	1.72	1.93
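
If you need more than one numeric column per group, one possible approach is to select the columns after groupby() and pass a list of statistics to agg(). This is only a sketch: the choice of height and bmi and of mean and median is an illustration, and the data frame is rebuilt here so the example runs on its own.

```python
import pandas as pd

# Rebuild the relevant columns of the example data
df = pd.DataFrame({
    'height': [1.68, 1.93, 1.72],
    'weight': [48.4, 89.8, 84.2],
    'city': ['Stuttgart', 'Stuttgart', 'Berlin'],
})
df['bmi'] = (df['weight'] / df['height'] ** 2).round(2)

# Mean and median of height and bmi for each city
summary = (
    df.groupby('city')[['height', 'bmi']]
      .agg(['mean', 'median'])
      .round(2)
)

print(summary)
```

The result has one row per city and a two-level column index (variable, statistic).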

Categorical Data

Example

We can also use describe() for categorical data:

df.describe(include="category").T

	count	unique	top	freq
name	3	3	Lisa	1

Show unique levels

Show the unique levels of a categorical variable and their counts with value_counts():

df['city'].value_counts()

Stuttgart    2
Berlin       1
Name: city, dtype: int64
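
value_counts() can also return relative frequencies instead of absolute counts by setting normalize=True. A short sketch, using a standalone Series named cities as a stand-in for the city column:

```python
import pandas as pd

# Same values as the city column in the setup
cities = pd.Series(['Stuttgart', 'Stuttgart', 'Berlin'], name='city')

# Share of each level instead of the absolute count
shares = cities.value_counts(normalize=True)

print(shares.round(2))
```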

Extract specific values

We can also extract specific values:

df['city'].value_counts().Stuttgart
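
Attribute access like .Stuttgart only works when the label is a valid Python identifier and does not clash with a Series method. As a hedged alternative, the sketch below uses label-based access, which also works for labels with spaces, and get() with a default for labels that may be missing. The label 'Hamburg' is a made-up example that does not occur in the data.

```python
import pandas as pd

# Same values as the city column in the setup
cities = pd.Series(['Stuttgart', 'Stuttgart', 'Berlin'], name='city')
counts = cities.value_counts()

# Attribute access (as above) and label-based access give the same value
count_by_attribute = counts.Stuttgart
count_by_label = counts['Stuttgart']

# get() returns a default instead of raising a KeyError for unknown labels
count_missing = counts.get('Hamburg', 0)

print(count_by_attribute, count_by_label, count_missing)
```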

Formatted string literals

Print the value in a readable format using formatted string literals (f"..."):

count_stuttgart = df['city'].value_counts().Stuttgart

print(f"There are {count_stuttgart} people from Stuttgart in the data")

There are 2 people from Stuttgart in the data

Loop over List

Statistics for specific columns

Example of a for loop to obtain statistics for specific numerical columns:

# make a list of numerical columns
list_num = ['height', 'weight']

# calculate the median for each column in our list, show at most 4 significant digits, then add a new line (\n)
for i in list_num:
    print(f'Median of {i} equals {df[i].median():.4} \n')

Median of height equals 1.72 

Median of weight equals 84.2
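
The format specifier .4 limits the output to 4 significant digits, which is why 84.2 is printed without trailing zeros. If you want a fixed number of decimal places instead, add the f type. A small sketch of the difference, with value chosen as the weight median from above:

```python
value = 84.2

# .4  -> at most 4 significant digits, trailing zeros are dropped
general_fmt = f"{value:.4}"

# .4f -> exactly 4 digits after the decimal point
fixed_fmt = f"{value:.4f}"

print(general_fmt)
print(fixed_fmt)
```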

Summary statistics

Calculate summary statistics for our list.

for i in list_num:
    print(f'Column: {i}  \n  {df[i].describe().round(2)}   \n')

Column: height  
  count    3.00
mean     1.78
std      0.13
min      1.68
25%      1.70
50%      1.72
75%      1.82
max      1.93
Name: height, dtype: float64   

Column: weight  
  count     3.00
mean     74.13
std      22.46
min      48.40
25%      66.30
50%      84.20
75%      87.00
max      89.80
Name: weight, dtype: float64
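
For comparison, the same statistics can usually be obtained without a loop by selecting the columns with the list. This is one possible alternative, not a replacement for the loop when you want custom text per column; the data frame is rebuilt so the example runs on its own.

```python
import pandas as pd

# Rebuild the numeric columns of the example data
df = pd.DataFrame({
    'height': [1.68, 1.93, 1.72],
    'weight': [48.4, 89.8, 84.2],
})
list_num = ['height', 'weight']

# Median of every column in the list in one call
medians = df[list_num].median()

# Summary statistics for the same columns, one column per variable
summary = df[list_num].describe().round(2)

print(medians)
print(summary)
```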

Create Plots

Setup

# Pandas needs the module matplotlib to create plots
import matplotlib.pyplot as plt

# show plot output in Jupyter Notebook
%matplotlib inline

One boxplot

df.boxplot(column=['weight']);

Multiple boxplots with loop

# obtain plots for our list
for i in list_num:
    df.boxplot(column=[i])
    plt.title("Boxplot for " + i)
    plt.show()
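
The loop above creates one figure per column. If you prefer the boxplots side by side in a single figure, a possible sketch is to create the axes with plt.subplots() and pass each one to boxplot() via the ax argument. The figure size is an arbitrary choice, and the data frame is rebuilt so the example runs on its own.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Rebuild the numeric columns of the example data
df = pd.DataFrame({
    'height': [1.68, 1.93, 1.72],
    'weight': [48.4, 89.8, 84.2],
})
list_num = ['height', 'weight']

# One row of axes, one axis per column in the list
fig, axes = plt.subplots(nrows=1, ncols=len(list_num), figsize=(8, 4))

# Draw each boxplot on its own axis so every variable keeps its own scale
for ax, col in zip(axes, list_num):
    df.boxplot(column=[col], ax=ax)
    ax.set_title("Boxplot for " + col)

fig.tight_layout()
plt.show()
```

Separate axes are useful here because height and weight are measured on very different scales.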

What’s next?

Congratulations! You have completed this tutorial 👍

Next, you may want to go back to the lab’s website.