Import and Store data

Pandas Introduction

Jan Kirenz

Import pandas

  • To load the pandas package and start working with it, import the package.

  • The community-agreed alias for pandas is pd.

import pandas as pd

Create Data

Create a DataFrame

  • To manually store data in a table, create a DataFrame:
df = pd.DataFrame({
    'name': ["Tom", "Lisa", "Peter"],
    'height': [1.68, 1.93, 1.72],
    'weight': [48.4, 89.8, 84.2],
    'id': [1, 2, 3],
    'city': ['Stuttgart', 'Stuttgart', 'Berlin']
})
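
A DataFrame can also be built row by row. The following is a minimal sketch, assuming the data arrives as a list of records; the names records and df_records are illustrative and not part of the original slides.

```python
import pandas as pd

# Each dict is one row; the dict keys become the column names.
records = [
    {"name": "Tom", "height": 1.68, "city": "Stuttgart"},
    {"name": "Lisa", "height": 1.93, "city": "Stuttgart"},
    {"name": "Peter", "height": 1.72, "city": "Berlin"},
]
df_records = pd.DataFrame(records)

print(df_records.shape)  # (3, 3)
```

The dict-of-lists form used above is column-oriented; the list-of-dicts form is often more convenient when rows are collected one at a time.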

Show data with head()

df.head()
    name  height  weight  id       city
0    Tom    1.68    48.4   1  Stuttgart
1   Lisa    1.93    89.8   2  Stuttgart
2  Peter    1.72    84.2   3     Berlin

Import data with read_*()

  • pandas import functions share the prefix read_ (for example pd.read_csv or pd.read_excel)

Import data from GitHub

  • Import a CSV file from a GitHub repository by passing its raw URL to pd.read_csv:
URL = "https://raw.githubusercontent.com/kirenz/datasets/master/campaign.csv"

df_github = pd.read_csv(URL, sep=",", decimal='.')

df_github.head()
   age       city  income  membership_days  campaign_engagement  target
0   56     Berlin  136748              837                    3       1
1   46  Stuttgart   25287              615                    8       0
2   32     Berlin  146593             2100                    3       0
3   60     Berlin   54387             2544                    0       0
4   25     Berlin   28512              138                    6       0
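
The sep and decimal arguments matter when a file does not follow the comma/point convention. The sketch below assumes a small, made-up CSV text (csv_text is hypothetical) with semicolon separators and decimal commas, read from memory so that no network access is needed.

```python
import io
import pandas as pd

# Hypothetical CSV text in a European-style layout:
# semicolon-separated fields, comma as the decimal mark.
csv_text = "name;height\nTom;1,68\nLisa;1,93\n"

# StringIO makes the text look like a file to read_csv.
df_text = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",")

print(df_text["height"].dtype)  # float64
```

Without decimal=",", the height column would be read as text rather than as numbers.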

Store data with to_*()

  • DataFrame export methods share the prefix to_ (for example df.to_csv or df.to_excel)
df_github.to_csv("data.csv", index=False)
  • Setting index=False keeps the row index labels out of the CSV file.
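
The effect of index=False is easiest to see side by side. This is a small sketch using a made-up two-row table (df_small is illustrative); it relies on to_csv() returning the CSV text when no path is given.

```python
import pandas as pd

df_small = pd.DataFrame({"name": ["Tom", "Lisa"], "id": [1, 2]})

# Without a path, to_csv() returns the CSV text instead of writing a file.
with_index = df_small.to_csv()
without_index = df_small.to_csv(index=False)

print(with_index.splitlines()[0])     # ,name,id
print(without_index.splitlines()[0])  # name,id
```

With the index included, the header starts with an empty field, which shows up as an unnamed first column when the file is read back.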

Viewing data

Data overview

df
    name  height  weight  id       city
0    Tom    1.68    48.4   1  Stuttgart
1   Lisa    1.93    89.8   2  Stuttgart
2  Peter    1.72    84.2   3     Berlin

Head and tail

# show first 2 rows
df.head(2)
   name  height  weight  id       city
0   Tom    1.68    48.4   1  Stuttgart
1  Lisa    1.93    89.8   2  Stuttgart
# show last 2 rows
df.tail(2)
    name  height  weight  id       city
1   Lisa    1.93    89.8   2  Stuttgart
2  Peter    1.72    84.2   3     Berlin
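
Two details of head() and tail() are worth knowing: the default is 5 rows, and a negative argument excludes rows instead of selecting them. The sketch below uses a made-up ten-row table (df_num is illustrative) to show both.

```python
import pandas as pd

df_num = pd.DataFrame({"value": range(10)})

print(len(df_num.head()))    # 5, the default number of rows
print(len(df_num.tail(3)))   # 3
print(len(df_num.head(-2)))  # 8, everything except the last 2 rows
```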

Info

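  • The info() method prints a concise summary of a DataFrame: the index range, the column names, the non-null count and data type of each column, and the memory usage.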
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    3 non-null      object 
 1   height  3 non-null      float64
 2   weight  3 non-null      float64
 3   id      3 non-null      int64  
 4   city    3 non-null      object 
dtypes: float64(2), int64(1), object(2)
memory usage: 252.0+ bytes
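
info() prints its summary and returns None, so the text cannot be assigned directly. One way to capture it, sketched here with a made-up table (df_info is illustrative), is to pass a text buffer through the buf parameter.

```python
import io
import pandas as pd

df_info = pd.DataFrame({"name": ["Tom", "Lisa"], "height": [1.68, 1.93]})

# Redirect the summary into an in-memory buffer instead of the console.
buffer = io.StringIO()
df_info.info(buf=buffer)
summary = buffer.getvalue()

print("RangeIndex: 2 entries, 0 to 1" in summary)  # True
```

The captured string can then be written to a log file or inspected in a test.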

Show column names

df.columns
Index(['name', 'height', 'weight', 'id', 'city'], dtype='object')
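
Column names can be changed with rename(). The following is a minimal sketch; the new name height_m is hypothetical (the slides do not state the unit), and df_cols is an illustrative one-row table.

```python
import pandas as pd

df_cols = pd.DataFrame({"name": ["Tom"], "height": [1.68]})

# rename() returns a new DataFrame; the original keeps its column names.
# "height_m" is a hypothetical new name (metres are an assumption).
df_renamed = df_cols.rename(columns={"height": "height_m"})

print(list(df_cols.columns))     # ['name', 'height']
print(list(df_renamed.columns))  # ['name', 'height_m']
```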

Show data types

df.dtypes
name       object
height    float64
weight    float64
id          int64
city       object
dtype: object
  • The data types in this DataFrame are integers (int64), floats (float64) and strings (shown here as the generic object dtype).
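
Data types can be used to select columns and can be converted. The sketch below assumes a reduced version of the table (df_types is illustrative): it keeps only the numeric columns with select_dtypes() and turns the id column into text with astype(), since an identifier is a label rather than a quantity.

```python
import pandas as pd

df_types = pd.DataFrame({
    "name": ["Tom", "Lisa"],
    "height": [1.68, 1.93],
    "id": [1, 2],
})

# Keep only the integer and float columns.
numeric = df_types.select_dtypes(include="number")
print(list(numeric.columns))  # ['height', 'id']

# astype() returns a converted copy; df_types itself is unchanged.
ids_as_text = df_types["id"].astype(str)
print(ids_as_text.tolist())   # ['1', '2']
```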

Show index

df.index
RangeIndex(start=0, stop=3, step=1)
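
The default RangeIndex simply numbers the rows. As a sketch of an alternative, an existing column can become the index with set_index(), which then allows label-based lookup with loc. The table df_people and the choice of id as the index are assumptions for illustration.

```python
import pandas as pd

df_people = pd.DataFrame({"name": ["Tom", "Lisa", "Peter"], "id": [1, 2, 3]})

# set_index() returns a new DataFrame whose row labels are the id values.
df_by_id = df_people.set_index("id")

print(df_by_id.index.tolist())  # [1, 2, 3]
print(df_by_id.loc[2, "name"])  # Lisa
```

Here loc[2, "name"] looks up the row labelled 2, not the row at position 2.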
