Data Types and Adding Columns

Pandas Introduction

Jan Kirenz

Setup

from datetime import datetime
import pandas as pd

df = pd.DataFrame({
    'name': ["Tom", "Lisa", "Peter"],
    'height': [1.68, 1.93, 1.72],
    'weight': [48.4, 89.8, 84.2],
    'id': [1, 2, 3],
    'city': ['Stuttgart', 'Stuttgart', 'Berlin']
})

Basics

Data Types with .dtypes

df.dtypes
name       object
height    float64
weight    float64
id          int64
city       object
dtype: object

Data Types with .info()

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    3 non-null      object 
 1   height  3 non-null      float64
 2   weight  3 non-null      float64
 3   id      3 non-null      int64  
 4   city    3 non-null      object 
dtypes: float64(2), int64(1), object(2)
memory usage: 252.0+ bytes

Change Data Types

Standard methods

  • pandas offers several ways to change data types.

  • The most common one is .astype(), which converts to a specific type (such as “int32”, “float” or “category”).

  • .astype(str) converts to string.
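
As a minimal sketch of .astype(), the following converts a small, hypothetical Series (s is not part of the tutorial's DataFrame) to three different types. Note that .astype() returns a new object and leaves the original unchanged, which is why the result is usually assigned back to the column.

```python
import pandas as pd

# Hypothetical example data, not part of the tutorial's DataFrame
s = pd.Series([1, 2, 3])

as_float = s.astype("float64")    # integers -> floats
as_small_int = s.astype("int32")  # 64-bit -> 32-bit integers
as_text = s.astype(str)           # integers -> strings

print(as_float.dtype)       # float64
print(as_small_int.dtype)   # int32
print(as_text.tolist())     # ['1', '2', '3']
```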

More options

  • pd.to_datetime(): Convert argument to datetime.
  • pd.to_timedelta(): Convert argument to timedelta.
  • pd.to_numeric(): Convert argument to a numeric type.
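
The three functions above could be used roughly as follows. This is only a sketch: the input values and the variable names (raw, numbers, dates, durations) are made up for illustration, and errors="coerce" is one possible way to handle unparseable entries.

```python
import pandas as pd

# Hypothetical raw text values; "n/a" cannot be parsed as a number
raw = pd.Series(["1.5", "2", "n/a"])
numbers = pd.to_numeric(raw, errors="coerce")  # unparseable entries become NaN

dates = pd.to_datetime(pd.Series(["2023-10-31", "2023-11-01"]))
durations = pd.to_timedelta(pd.Series(["1 days", "12 hours"]))

print(numbers.dtype)  # float64
print(dates.dtype)
print(durations.dtype)
```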

Categorical Data and Strings

What is categorical data?

  • Categoricals are a pandas data type corresponding to categorical variables in statistics.

  • A categorical variable takes on a limited, and usually fixed, number of possible values (categories).

  • Examples are gender, social class, blood type, country affiliation, observation time or rating via Likert scales.

Convert to categorical data

  • Convert variable “name” to a category dtype:
df["name"] = df["name"].astype("category")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   name    3 non-null      category
 1   height  3 non-null      float64 
 2   weight  3 non-null      float64 
 3   id      3 non-null      int64   
 4   city    3 non-null      object  
dtypes: category(1), float64(2), int64(1), object(1)
memory usage: 363.0+ bytes
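
To see what a category dtype stores, the sketch below builds a separate Series from the tutorial's city values (chosen here only because they contain a repeated value) and inspects the categories and their integer codes. The variable name city is an assumption for this example.

```python
import pandas as pd

# Separate Series with the tutorial's city values, converted to category
city = pd.Series(["Stuttgart", "Stuttgart", "Berlin"]).astype("category")

# The distinct categories (sorted) and the integer code stored for each row
print(list(city.cat.categories))  # ['Berlin', 'Stuttgart']
print(city.cat.codes.tolist())    # [1, 1, 0]
```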

String data

  • In our example, id is stored as an integer, but it is not a quantity (calculations with it are not meaningful)

  • It is just a unique identifier, so we should convert it to a simple string (object)

df['id'] = df['id'].astype(str)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   name    3 non-null      category
 1   height  3 non-null      float64 
 2   weight  3 non-null      float64 
 3   id      3 non-null      object  
 4   city    3 non-null      object  
dtypes: category(1), float64(2), object(2)
memory usage: 363.0+ bytes
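
One practical effect of storing an identifier as a string is that the .str methods become available. The following sketch uses a hypothetical Series named ids and pads each identifier with leading zeros.

```python
import pandas as pd

# Hypothetical identifiers, converted from integers to strings
ids = pd.Series([1, 2, 3]).astype(str)

# String methods such as zero-padding now work on the values
padded = ids.str.zfill(3)
print(padded.tolist())  # ['001', '002', '003']
```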

Add new columns

Add a constant number

  • Add a new variable called “number” to df

  • The new variable should have the number 42 in all rows

df["number"] = 42
df.head()
    name  height  weight  id       city  number
0    Tom    1.68    48.4   1  Stuttgart      42
1   Lisa    1.93    89.8   2  Stuttgart      42
2  Peter    1.72    84.2   3     Berlin      42

Add from existing columns

  • Create new columns from existing columns
# calculate the body mass index (BMI): weight in kg divided by squared height in m
df['bmi'] = (df['weight'] / df['height'] ** 2).round(2)
df
    name  height  weight  id       city  number    bmi
0    Tom    1.68    48.4   1  Stuttgart      42  17.15
1   Lisa    1.93    89.8   2  Stuttgart      42  24.11
2  Peter    1.72    84.2   3     Berlin      42  28.46
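
An alternative to assigning a column in place is DataFrame.assign(), which returns a new DataFrame and leaves the original untouched. The sketch below recomputes the BMI this way on a reduced, hypothetical copy of the data (people and with_bmi are names chosen for this example).

```python
import pandas as pd

# Reduced copy of the tutorial data: heights and weights only
people = pd.DataFrame({
    "height": [1.68, 1.93, 1.72],
    "weight": [48.4, 89.8, 84.2],
})

# assign() returns a new DataFrame; `people` itself is not modified
with_bmi = people.assign(
    bmi=lambda d: (d["weight"] / d["height"] ** 2).round(2)
)

print(with_bmi["bmi"].tolist())  # [17.15, 24.11, 28.46]
print("bmi" in people.columns)   # False
```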

Add Dates

Add a date with strftime

  • To add today's date as a formatted string, we can combine datetime.today() with strftime (the available format codes are listed in the tables that follow):
df["date"] = datetime.today().strftime('%Y-%m-%d')
df.head(3)
    name  height  weight  id       city  number    bmi        date
0    Tom    1.68    48.4   1  Stuttgart      42  17.15  2023-10-31
1   Lisa    1.93    89.8   2  Stuttgart      42  24.11  2023-10-31
2  Peter    1.72    84.2   3     Berlin      42  28.46  2023-10-31
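
Note that strftime produces text, so the date column above holds strings rather than real dates. If date arithmetic or the .dt accessor is needed, the column can be parsed with pd.to_datetime(). The sketch below assumes a fixed date (2023-10-31, matching the output above) instead of datetime.today() so that the result is reproducible; frame and date_dt are hypothetical names.

```python
from datetime import datetime

import pandas as pd

# Small stand-in for the tutorial's DataFrame
frame = pd.DataFrame({"name": ["Tom", "Lisa", "Peter"]})

# Fixed date used instead of datetime.today() for a reproducible example
fixed_day = datetime(2023, 10, 31)
frame["date"] = fixed_day.strftime("%Y-%m-%d")  # stored as text

# Parse the text column into a real datetime column
frame["date_dt"] = pd.to_datetime(frame["date"], format="%Y-%m-%d")

print(frame["date"].iloc[0])                   # 2023-10-31
print(frame["date_dt"].dt.year.tolist())       # [2023, 2023, 2023]
print(frame["date_dt"].dt.day_name().iloc[0])  # Tuesday
```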

Table: Weekday and day

Code  Example  Description
%a    Sun      Weekday as locale’s abbreviated name.
%A    Sunday   Weekday as locale’s full name.
%w    0        Weekday as a decimal number, where 0 is Sunday and 6 is Saturday.
%d    08       Day of the month as a zero-padded decimal number.
%-d   8        Day of the month as a decimal number. (Platform specific)

Table: Month

Code  Example    Description
%b    Sep        Month as locale’s abbreviated name.
%B    September  Month as locale’s full name.
%m    09         Month as a zero-padded decimal number.
%-m   9          Month as a decimal number. (Platform specific)

Table: Year and hour

Code  Example  Description
%y    13       Year without century as a zero-padded decimal number.
%Y    2013     Year with century as a decimal number.
%H    07       Hour (24-hour clock) as a zero-padded decimal number.
%-H   7        Hour (24-hour clock) as a decimal number. (Platform specific)
%I    07       Hour (12-hour clock) as a zero-padded decimal number.
%-I   7        Hour (12-hour clock) as a decimal number. (Platform specific)

Table: Minutes etc.

Code  Example                  Description
%p    AM                       Locale’s equivalent of either AM or PM.
%M    06                       Minute as a zero-padded decimal number.
%-M   6                        Minute as a decimal number. (Platform specific)
%S    05                       Second as a zero-padded decimal number.
%-S   5                        Second as a decimal number. (Platform specific)
%f    000000                   Microsecond as a decimal number, zero-padded to 6 digits.
%z    +0000                    UTC offset in the form ±HHMM[SS[.ffffff]] (empty string if the object is naive).
%Z    UTC                      Time zone name (empty string if the object is naive).
%j    251                      Day of the year as a zero-padded decimal number.
%-j   251                      Day of the year as a decimal number. (Platform specific)
%U    36                       Week number of the year (Sunday as the first day of the week) as a zero-padded decimal number. All days in a new year preceding the first Sunday are considered to be in week 0.
%W    35                       Week number of the year (Monday as the first day of the week) as a zero-padded decimal number. All days in a new year preceding the first Monday are considered to be in week 0.
%c    Sun Sep 8 07:06:05 2013  Locale’s appropriate date and time representation.
%x    09.08.13                 Locale’s appropriate date representation.
%X    07:06:05                 Locale’s appropriate time representation.
%%    %                        A literal ‘%’ character.
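
The format codes in the tables can be tried out on a fixed timestamp. The sketch below uses Sunday, 8 September 2013, 07:06:05, the moment the table examples are based on, and sticks to codes that do not depend on the locale or platform. The variable name moment is chosen for this example.

```python
from datetime import datetime

# The timestamp behind the table examples: Sunday, 8 September 2013, 07:06:05
moment = datetime(2013, 9, 8, 7, 6, 5)

print(moment.strftime("%Y-%m-%d"))  # 2013-09-08
print(moment.strftime("%H:%M:%S"))  # 07:06:05
print(moment.strftime("%y"))        # 13
print(moment.strftime("%w"))        # 0 (Sunday)
print(moment.strftime("%j"))        # 251
print(moment.strftime("%U %W"))     # 36 35
print(moment.strftime("%f"))        # 000000
print(moment.strftime("100%%"))     # 100%
```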

What’s next?

Congratulations! You have completed this tutorial 👍

Next, you may want to go back to the lab’s website