Filter, Group and Bin Data

Pandas Introduction

Jan Kirenz

Setup

import pandas as pd

df = pd.DataFrame({
    'name': ["Tom", "Lisa", "Peter"],
    'height': [1.68, 1.93, 1.72],
    'weight': [48.4, 89.8, 84.2],
    'id': [1, 2, 3],
    'city': ['Stuttgart', 'Stuttgart', 'Berlin']
})

df['bmi'] = round(df['weight'] / (df['height'] * df['height']), 2)
df["name"] = df["name"].astype("category")
df['id'] = df['id'].astype(str)

Filter

Filter example

  • Using a single column’s values to filter data (boolean indexing).
df[df["height"] >= 1.67]
name height weight id city bmi
0 Tom 1.68 48.4 1 Stuttgart 17.15
1 Lisa 1.93 89.8 2 Stuttgart 24.11
2 Peter 1.72 84.2 3 Berlin 28.46

Filter with AND

  • Combine filtering for two columns or more with & (AND)
df[(df["height"] >= 1.67) & (df["weight"] < 74)]
name height weight id city bmi
0 Tom 1.68 48.4 1 Stuttgart 17.15

Filter with OR

  • You can also combine filtering for two columns or more with | (OR)
df[(df["height"] >= 167) | (df["weight"] < 74)]
name height weight id city bmi
0 Tom 1.68 48.4 1 Stuttgart 17.15

isin()

  • Always use the isin() method if you have multiple conditions within one column:
df[df["name"].isin(["Tom", "Steven"])]
name height weight id city bmi
0 Tom 1.68 48.4 1 Stuttgart 17.15

Grouping

Grouping steps

  • By “group by” we are referring to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria

  • Applying a function to each group independently

  • Combining the results into a data structure

Grouping example with mean()

  • Grouping and then applying the mean() function to the resulting groups. Furthermore, we round the results and transpose the data.
df.groupby("city").mean(numeric_only=True).round(2).T
city Berlin Stuttgart
height 1.72 1.80
weight 84.20 69.10
bmi 28.46 20.63

Segment Data: Bins

Segment data with cut()

  • Use the function cut when you need to segment and sort data values into bins.

  • This function is also useful for going from a continuous variable to a categorical variable.

BMI table

The standard weight status categories associated with BMI ranges for adults are shown in the following table:

BMI Weight Status
Below 18.5 Underweight
18.5 - 24.9 Normal or Healthy Weight
25.0 - 29.9 Overweight
30.0 and Above Obese

Source: U.S. Department of Health & Human Services

BMI example 1

  • In our function, we discretize the variable bmi into four bins according to the table above:

  • The bins [0, 18.5, 25, 30, float(‘inf’)] indicate (0,18.5], (18.5,25], (25,30], (30, float(’inf))

  • float('inf') is used for setting variable with an infinitely large value

BMI example code

df['bmi_category'] = pd.cut(df['bmi'],
                            bins=[0, 18.5, 25, 30, float('inf')],
                            labels=['underweight', 'normal', 'overweight', "obese"])
df

BMI example code

name height weight id city bmi bmi_category
0 Tom 1.68 48.4 1 Stuttgart 17.15 underweight
1 Lisa 1.93 89.8 2 Stuttgart 24.11 normal
2 Peter 1.72 84.2 3 Berlin 28.46 overweight

BMI example with categories

df['bmi_category']
0    underweight
1         normal
2     overweight
Name: bmi_category, dtype: category
Categories (4, object): ['underweight' < 'normal' < 'overweight' < 'obese']

Four equal-sized bins

  • Example of how to discretize into four equal-sized bins:
df['bmi_category_2'] = pd.cut(df['bmi'],
                              bins=4,
                              labels=['group1', 'group2', 'group3', "group4"])
df['bmi_category_2']
0    group1
1    group3
2    group4
Name: bmi_category_2, dtype: category
Categories (4, object): ['group1' < 'group2' < 'group3' < 'group4']

Four equal-sized bins without labels

  • Example of how to discretize into four equal-sized bins if you don’t need labels
df['bmi_category_3'] = pd.cut(df['bmi'],
                              bins=4,
                              labels=False)
df['bmi_category_3']
0    0
1    2
2    3
Name: bmi_category_3, dtype: int64

What’s next?

Congratulations! You have completed this tutorial 👍

Next, you may want to go back to the lab’s website