Statistics and Plots

Pandas Introduction

Jan Kirenz

Setup

import pandas as pd

df = pd.DataFrame({
    'name': ["Tom", "Lisa", "Peter"],
    'height': [1.68, 1.93, 1.72],
    'weight': [48.4, 89.8, 84.2],
    'id': [1, 2, 3],
    'city': ['Stuttgart', 'Stuttgart', 'Berlin']
})

# BMI = weight / height squared, rounded to 2 decimals
df['bmi'] = (df['weight'] / df['height'] ** 2).round(2)
df["name"] = df["name"].astype("category")
df['id'] = df['id'].astype(str)

Numeric Data

Mean

  • We can calculate simple statistics like the mean
df['height'].mean()
1.7766666666666666
df['height'].mean().round(2)
1.78

Formatted string literals

print(f"The mean of height is {df['height'].mean():.2f}")
The mean of height is 1.78

Median and Standard Deviation

df['height'].median()
1.72
df['height'].std()
0.13428824718989124
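
One detail worth knowing: pandas' std() returns the sample standard deviation by default (ddof=1, dividing by n - 1). The following minimal sketch rebuilds the height column as a standalone Series (the variable names are illustrative, not part of the original slides) and compares it with the population version:

```python
import pandas as pd

# Same height values as in the setup DataFrame
heights = pd.Series([1.68, 1.93, 1.72], name="height")

# Default: sample standard deviation (ddof=1, divides by n - 1)
sample_std = heights.std()

# Population standard deviation (ddof=0, divides by n)
population_std = heights.std(ddof=0)

print(f"sample: {sample_std:.4f}, population: {population_std:.4f}")
```

With only three observations the two values differ noticeably, so it helps to state which one is reported.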

Describe

  • describe() shows a quick statistical summary of your numerical data.
df.describe()
         height     weight        bmi
count  3.000000   3.000000   3.000000
mean   1.776667  74.133333  23.240000
std    0.134288  22.460929   5.704972
min    1.680000  48.400000  17.150000
25%    1.700000  66.300000  20.630000
50%    1.720000  84.200000  24.110000
75%    1.825000  87.000000  26.285000
max    1.930000  89.800000  28.460000
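
If the default quartiles are not the ones you need, describe() accepts a percentiles argument. A small sketch, assuming you want the 10th and 90th percentiles (the choice of percentiles is only an example):

```python
import pandas as pd

df = pd.DataFrame({
    "height": [1.68, 1.93, 1.72],
    "weight": [48.4, 89.8, 84.2],
})

# Request the 10th and 90th percentiles instead of the default quartiles
summary = df.describe(percentiles=[0.1, 0.9]).round(2)
print(summary)
```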

Describe with transpose

df.describe().T.round(2)
        count   mean    std    min    25%    50%    75%    max
height    3.0   1.78   0.13   1.68   1.70   1.72   1.82   1.93
weight    3.0  74.13  22.46  48.40  66.30  84.20  87.00  89.80
bmi       3.0  23.24   5.70  17.15  20.63  24.11  26.28  28.46

Describe for specific columns with groupby

  • Summary statistics for the numeric variable height for different levels of the categorical variable city:
df[['height', 'city']].groupby(['city']).describe().round(2).T
city          Berlin  Stuttgart
height count    1.00       2.00
       mean     1.72       1.80
       std       NaN       0.18
       min      1.72       1.68
       25%      1.72       1.74
       50%      1.72       1.80
       75%      1.72       1.87
       max      1.72       1.93
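
When you only need a few statistics for several columns, agg() is a compact alternative to describe(). The sketch below groups by city and reports the mean and median of height and bmi; the choice of statistics is an assumption made for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "height": [1.68, 1.93, 1.72],
    "weight": [48.4, 89.8, 84.2],
    "city": ["Stuttgart", "Stuttgart", "Berlin"],
})
df["bmi"] = (df["weight"] / df["height"] ** 2).round(2)

# Mean and median of height and bmi for each city
summary = df.groupby("city")[["height", "bmi"]].agg(["mean", "median"]).round(2)
print(summary)
```

The result has one row per city and one column per (variable, statistic) pair.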

Categorical Data

Example

  • We can also use describe() for categorical data
df.describe(include="category").T
     count unique   top freq
name     3      3  Lisa    1

Show unique levels

  • Show the unique levels of a categorical variable and their counts with value_counts()
df['city'].value_counts()
Stuttgart    2
Berlin       1
Name: city, dtype: int64
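
value_counts() can also return relative frequencies via normalize=True. A minimal sketch using the same city values as a standalone Series:

```python
import pandas as pd

city = pd.Series(["Stuttgart", "Stuttgart", "Berlin"], name="city")

# Share of each level instead of absolute counts
shares = city.value_counts(normalize=True).round(2)
print(shares)
```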

Extract specific values

  • We can also extract specific values
df['city'].value_counts().Stuttgart
2

Formatted string literals

count_stuttgart = df['city'].value_counts().Stuttgart

print(f"There are {count_stuttgart} people from Stuttgart in the data")
There are 2 people from Stuttgart in the data
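
Attribute access such as .Stuttgart only works when the label is a valid Python identifier and is present in the result. As a hedged alternative, bracket indexing and Series.get() also handle labels with spaces or missing labels; "Munich" below is a hypothetical label that does not occur in the data:

```python
import pandas as pd

city = pd.Series(["Stuttgart", "Stuttgart", "Berlin"], name="city")
counts = city.value_counts()

# Bracket indexing works for any label, including ones with spaces
n_stuttgart = counts["Stuttgart"]

# get() returns a default instead of raising KeyError for a missing label
n_munich = counts.get("Munich", 0)

print(f"Stuttgart: {n_stuttgart}, Munich: {n_munich}")
```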

Loop over List

Statistics for specific columns

  • Example of a for loop to obtain statistics for specific numerical columns
# make a list of numerical columns
list_num = ['height', 'weight']
# calculate the median for each column in our list, show at most 4 significant digits, then add a new line (\n)
for i in list_num:
    print(f'Median of {i} equals {df[i].median():.4} \n')
Median of height equals 1.72 

Median of weight equals 84.2 

Summary statistics

  • Calculate summary statistics for our list.
for i in list_num:
    print(f'Column: {i}  \n  {df[i].describe().round(2)}   \n')   
Column: height  
  count    3.00
mean     1.78
std      0.13
min      1.68
25%      1.70
50%      1.72
75%      1.82
max      1.93
Name: height, dtype: float64   

Column: weight  
  count     3.00
mean     74.13
std      22.46
min      48.40
25%      66.30
50%      84.20
75%      87.00
max      89.80
Name: weight, dtype: float64   
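
The loops above are useful for practicing f-strings, but pandas can compute the same statistics for several columns in one call. A sketch of that alternative, selecting the columns with the same list and passing the statistic names to agg():

```python
import pandas as pd

df = pd.DataFrame({
    "height": [1.68, 1.93, 1.72],
    "weight": [48.4, 89.8, 84.2],
})
list_num = ["height", "weight"]

# One row per statistic, one column per variable
stats = df[list_num].agg(["median", "mean", "std"]).round(2)
print(stats)
```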

Create Plots

Setup

# Pandas needs the module matplotlib to create plots
import matplotlib.pyplot as plt

# show plot output in Jupyter Notebook
%matplotlib inline

One boxplot

df.boxplot(column=['weight']);

Multiple boxplots with loop

# obtain plots for our list
for i in list_num:
    df.boxplot(column=[i])
    plt.title("Boxplot for " + i)
    plt.show()
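
Instead of one figure per column, the boxplots can be placed side by side in a single figure. This is a sketch rather than the tutorial's own approach: it assumes matplotlib is installed and selects the non-interactive "Agg" backend so that it also runs outside Jupyter (drop that line inside a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "height": [1.68, 1.93, 1.72],
    "weight": [48.4, 89.8, 84.2],
})
list_num = ["height", "weight"]

# One subplot per column, arranged in a single row
fig, axes = plt.subplots(1, len(list_num), figsize=(8, 4))
for ax, col in zip(axes, list_num):
    df.boxplot(column=[col], ax=ax)
    ax.set_title("Boxplot for " + col)

fig.tight_layout()
```

Separate axes are used because height and weight are on very different scales.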

What’s next?

Congratulations! You have completed this tutorial 👍

Next, you may want to go back to the lab’s website