Essential methods#

This introduction to pandas is based on the official pandas documentation. It contains information from the first steps in pandas tutorial but also covers more advanced topics.

Customarily, we import as follows:

import numpy as np
import pandas as pd

Object creation#

Creating a Series by passing a list of values, letting pandas create a default integer index:

s = pd.Series([1, 3, 5, np.nan, 6, 8])

s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

dates = pd.date_range("20210101", periods=6)

dates
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06'],
              dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

df
A B C D
2021-01-01 1.162312 -0.344268 0.550256 -1.485372
2021-01-02 -1.539542 0.529183 -0.137668 2.056552
2021-01-03 0.137383 -0.872425 0.563527 1.377882
2021-01-04 0.827011 0.525581 -2.979430 0.439790
2021-01-05 0.692313 -0.846918 -0.850518 0.014180
2021-01-06 0.069261 0.293929 1.020163 -0.871126

Creating a simple DataFrame, where the scalar 'male' is broadcast to every row:

df_height = pd.DataFrame(
    {
        "name": pd.Categorical(["Tom", "Marc", "Peter"]),
        "height": np.array([168, 193, 172], dtype="int32"),
        "gender": "male",
    }
)

Creating a DataFrame by passing a dict of objects that can be converted to series-like:

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

df2
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo

The columns of the resulting DataFrame have different dtypes.

df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here’s a subset of the attributes that will be completed:

# df2.<TAB>

Viewing data#

Here is how to view the top and bottom rows of the frame:

df
A B C D
2021-01-01 1.162312 -0.344268 0.550256 -1.485372
2021-01-02 -1.539542 0.529183 -0.137668 2.056552
2021-01-03 0.137383 -0.872425 0.563527 1.377882
2021-01-04 0.827011 0.525581 -2.979430 0.439790
2021-01-05 0.692313 -0.846918 -0.850518 0.014180
2021-01-06 0.069261 0.293929 1.020163 -0.871126
df.head()
A B C D
2021-01-01 1.162312 -0.344268 0.550256 -1.485372
2021-01-02 -1.539542 0.529183 -0.137668 2.056552
2021-01-03 0.137383 -0.872425 0.563527 1.377882
2021-01-04 0.827011 0.525581 -2.979430 0.439790
2021-01-05 0.692313 -0.846918 -0.850518 0.014180
df.tail(3)
A B C D
2021-01-04 0.827011 0.525581 -2.979430 0.439790
2021-01-05 0.692313 -0.846918 -0.850518 0.014180
2021-01-06 0.069261 0.293929 1.020163 -0.871126
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2021-01-01 to 2021-01-06
Freq: D
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       6 non-null      float64
 1   B       6 non-null      float64
 2   C       6 non-null      float64
 3   D       6 non-null      float64
dtypes: float64(4)
memory usage: 240.0 bytes
df.index
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06'],
              dtype='datetime64[ns]', freq='D')
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')

describe() shows a quick statistic summary of your data:

df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.224790 -0.119153 -0.305611 0.255318
std 0.959676 0.656414 1.464242 1.333840
min -1.539542 -0.872425 -2.979430 -1.485372
25% 0.086291 -0.721256 -0.672305 -0.649799
50% 0.414848 -0.025169 0.206294 0.226985
75% 0.793337 0.467668 0.560210 1.143359
max 1.162312 0.529183 1.020163 2.056552

Transposing your data:

df.T
2021-01-01 2021-01-02 2021-01-03 2021-01-04 2021-01-05 2021-01-06
A 1.162312 -1.539542 0.137383 0.827011 0.692313 0.069261
B -0.344268 0.529183 -0.872425 0.525581 -0.846918 0.293929
C 0.550256 -0.137668 0.563527 -2.979430 -0.850518 1.020163
D -1.485372 2.056552 1.377882 0.439790 0.014180 -0.871126

Sorting by an axis:

df.sort_index(axis=1, ascending=False)
D C B A
2021-01-01 -1.485372 0.550256 -0.344268 1.162312
2021-01-02 2.056552 -0.137668 0.529183 -1.539542
2021-01-03 1.377882 0.563527 -0.872425 0.137383
2021-01-04 0.439790 -2.979430 0.525581 0.827011
2021-01-05 0.014180 -0.850518 -0.846918 0.692313
2021-01-06 -0.871126 1.020163 0.293929 0.069261

Sorting by values:

df.sort_values(by="B")
A B C D
2021-01-03 0.137383 -0.872425 0.563527 1.377882
2021-01-05 0.692313 -0.846918 -0.850518 0.014180
2021-01-01 1.162312 -0.344268 0.550256 -1.485372
2021-01-06 0.069261 0.293929 1.020163 -0.871126
2021-01-04 0.827011 0.525581 -2.979430 0.439790
2021-01-02 -1.539542 0.529183 -0.137668 2.056552

Selection#

While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, .at, .iat, .loc and .iloc.
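
As a quick sketch of how these four accessors compare (using a small hypothetical frame, not the df above): .loc and .at are label-based, .iloc and .iat are position-based, and the scalar accessors .at/.iat are the fast path for single values.

```python
import pandas as pd

# Hypothetical example frame, independent of the df defined above
demo = pd.DataFrame(
    {"A": [10, 20, 30], "B": [1.5, 2.5, 3.5]},
    index=["x", "y", "z"],
)

# Label-based access
assert demo.loc["y", "A"] == 20
assert demo.at["y", "A"] == 20   # scalar-only, faster than .loc

# Position-based access
assert demo.iloc[1, 0] == 20
assert demo.iat[1, 0] == 20      # scalar-only, faster than .iloc
```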

Getting#

Selecting a single column, which yields a Series, equivalent to df.A:

df["A"]
2021-01-01    1.162312
2021-01-02   -1.539542
2021-01-03    0.137383
2021-01-04    0.827011
2021-01-05    0.692313
2021-01-06    0.069261
Freq: D, Name: A, dtype: float64

Selecting via [], which slices the rows:

df[0:3]
A B C D
2021-01-01 1.162312 -0.344268 0.550256 -1.485372
2021-01-02 -1.539542 0.529183 -0.137668 2.056552
2021-01-03 0.137383 -0.872425 0.563527 1.377882
df["20210101":"20210103"]
A B C D
2021-01-01 1.162312 -0.344268 0.550256 -1.485372
2021-01-02 -1.539542 0.529183 -0.137668 2.056552
2021-01-03 0.137383 -0.872425 0.563527 1.377882

Selection by label#

The .loc attribute is the primary access method. The following are valid inputs:

  • A single label, e.g. 5 or ‘a’ (note that 5 is interpreted as a label of the index, not as an integer position along the index).

  • A list or array of labels [‘a’, ‘b’, ‘c’].

  • A slice object with labels ‘a’:’f’ (note that contrary to usual Python slices, both the start and the stop are included, when present in the index! See Slicing with labels).

  • A boolean array.

  • A callable, see Selection By Callable.
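
Each of the input kinds listed above can be exercised on a small hypothetical frame (independent of df):

```python
import pandas as pd

# Hypothetical frame to demonstrate the valid .loc inputs
frame = pd.DataFrame({"v": [1, 2, 3, 4]}, index=list("abcd"))

frame.loc["b"]                         # a single label
frame.loc[["a", "c"]]                  # a list of labels
frame.loc["a":"c"]                     # a label slice: both endpoints included
frame.loc[[True, False, True, False]]  # a boolean array
frame.loc[lambda df: df["v"] > 2]      # a callable

# Label slices are inclusive on both ends, unlike Python slices
assert len(frame.loc["a":"c"]) == 3
```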

For getting a cross section using a label:

df.loc[dates[0]]
A    1.162312
B   -0.344268
C    0.550256
D   -1.485372
Name: 2021-01-01 00:00:00, dtype: float64

Selecting on a multi-axis by label:

df.loc[:, ["A", "B"]]
A B
2021-01-01 1.162312 -0.344268
2021-01-02 -1.539542 0.529183
2021-01-03 0.137383 -0.872425
2021-01-04 0.827011 0.525581
2021-01-05 0.692313 -0.846918
2021-01-06 0.069261 0.293929

Showing label slicing, both endpoints are included:

df.loc["20210102":"20210104", ["A", "B"]]
A B
2021-01-02 -1.539542 0.529183
2021-01-03 0.137383 -0.872425
2021-01-04 0.827011 0.525581

Reduction in the dimensions of the returned object:

df.loc["20210102", ["A", "B"]]
A   -1.539542
B    0.529183
Name: 2021-01-02 00:00:00, dtype: float64

For getting a scalar value:

df.loc[dates[0], "A"]
1.1623119547723357

For getting fast access to a scalar (equivalent to the prior method), use .at. Use .at if you only need to get or set a single value in a DataFrame or Series.

df.at[dates[0], "A"]
1.1623119547723357

Set a value at a specified row/column pair:

df.at[dates[0], "A"] = 1

df.A
2021-01-01    1.000000
2021-01-02   -1.539542
2021-01-03    0.137383
2021-01-04    0.827011
2021-01-05    0.692313
2021-01-06    0.069261
Freq: D, Name: A, dtype: float64

Selection by position#

pandas provides a suite of methods in order to get purely integer-based indexing.

The .iloc attribute is the primary access method. The following are valid inputs:

  • An integer e.g. 5.

  • A list or array of integers [4, 3, 0].

  • A slice object with ints 1:7.

  • A boolean array.

  • A callable, see Selection By Callable.

See more in Selection by Position.

Select via the position of the passed integers:

df.iloc[0]
A    1.000000
B   -0.344268
C    0.550256
D   -1.485372
Name: 2021-01-01 00:00:00, dtype: float64

By integer slices, acting similar to NumPy/Python:

df.iloc[0:3, 0:2]
A B
2021-01-01 1.000000 -0.344268
2021-01-02 -1.539542 0.529183
2021-01-03 0.137383 -0.872425
df.iloc[1:3, 0:2]
A B
2021-01-02 -1.539542 0.529183
2021-01-03 0.137383 -0.872425

By lists of integer position locations, similar to the NumPy/Python style:

df.iloc[[1, 2, 4], [0, 2]]
A C
2021-01-02 -1.539542 -0.137668
2021-01-03 0.137383 0.563527
2021-01-05 0.692313 -0.850518

For slicing rows explicitly:

df.iloc[1:3, :]
A B C D
2021-01-02 -1.539542 0.529183 -0.137668 2.056552
2021-01-03 0.137383 -0.872425 0.563527 1.377882

For slicing columns explicitly:

df.iloc[:, 1:3]
B C
2021-01-01 -0.344268 0.550256
2021-01-02 0.529183 -0.137668
2021-01-03 -0.872425 0.563527
2021-01-04 0.525581 -2.979430
2021-01-05 -0.846918 -0.850518
2021-01-06 0.293929 1.020163

For getting a value explicitly:

df.iloc[0, 0]
1.0

For getting fast access to a scalar (equivalent to the prior method):

df.iat[0, 0]
1.0

Boolean indexing#

Boolean indexing is used to filter subsets of data based on the actual values inside the DataFrame. It is called Boolean because each filter condition evaluates to one of two possible values, usually denoted True and False.

Using a single column’s values to select data.

df[df["A"] > 0]
A B C D
2021-01-01 1.000000 -0.344268 0.550256 -1.485372
2021-01-03 0.137383 -0.872425 0.563527 1.377882
2021-01-04 0.827011 0.525581 -2.979430 0.439790
2021-01-05 0.692313 -0.846918 -0.850518 0.014180
2021-01-06 0.069261 0.293929 1.020163 -0.871126
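
Conditions can also be combined; a minimal sketch with a hypothetical frame, using the elementwise operators & (and), | (or) and ~ (not), with each comparison parenthesized:

```python
import pandas as pd

# Hypothetical frame for combining boolean conditions
prices = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-1.0, 2.0, 3.0]})

# Parentheses are required: & and | bind tighter than comparisons
both_pos = prices[(prices["A"] > 0) & (prices["B"] > 0)]
either_pos = prices[(prices["A"] > 0) | (prices["B"] > 0)]

assert len(both_pos) == 1    # only the last row satisfies both
assert len(either_pos) == 3  # every row satisfies at least one
```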

Selecting values from a DataFrame where a boolean condition is met.

df[df > 0]
A B C D
2021-01-01 1.000000 NaN 0.550256 NaN
2021-01-02 NaN 0.529183 NaN 2.056552
2021-01-03 0.137383 NaN 0.563527 1.377882
2021-01-04 0.827011 0.525581 NaN 0.439790
2021-01-05 0.692313 NaN NaN 0.014180
2021-01-06 0.069261 0.293929 1.020163 NaN

Let’s make a copy of the df object’s indices and data. We use the default deep=True to create a new object with a copy of the calling object’s data and indices. Modifications to the data or indices of the copy will not be reflected in the original object. Note that a shallow copy (deep=False) shares the data and index with the original.

df_copy = df.copy()

Create a new column in the DataFrame:

df_copy["E"] = ["one", "one", "two", "three", "four", "three"]
df_copy
A B C D E
2021-01-01 1.000000 -0.344268 0.550256 -1.485372 one
2021-01-02 -1.539542 0.529183 -0.137668 2.056552 one
2021-01-03 0.137383 -0.872425 0.563527 1.377882 two
2021-01-04 0.827011 0.525581 -2.979430 0.439790 three
2021-01-05 0.692313 -0.846918 -0.850518 0.014180 four
2021-01-06 0.069261 0.293929 1.020163 -0.871126 three

Using the isin() method for filtering:

df_copy[df_copy["E"].isin(["two", "four"])]
A B C D E
2021-01-03 0.137383 -0.872425 0.563527 1.377882 two
2021-01-05 0.692313 -0.846918 -0.850518 0.014180 four

Setting#

Setting a new column automatically aligns the data by the indexes.

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20210102", periods=6))
s1
2021-01-02    1
2021-01-03    2
2021-01-04    3
2021-01-05    4
2021-01-06    5
2021-01-07    6
Freq: D, dtype: int64
df["F"] = s1
df
A B C D F
2021-01-01 1.000000 -0.344268 0.550256 -1.485372 NaN
2021-01-02 -1.539542 0.529183 -0.137668 2.056552 1.0
2021-01-03 0.137383 -0.872425 0.563527 1.377882 2.0
2021-01-04 0.827011 0.525581 -2.979430 0.439790 3.0
2021-01-05 0.692313 -0.846918 -0.850518 0.014180 4.0
2021-01-06 0.069261 0.293929 1.020163 -0.871126 5.0

Setting values by label:

df.at[dates[0], "A"] = 0
df
A B C D F
2021-01-01 0.000000 -0.344268 0.550256 -1.485372 NaN
2021-01-02 -1.539542 0.529183 -0.137668 2.056552 1.0
2021-01-03 0.137383 -0.872425 0.563527 1.377882 2.0
2021-01-04 0.827011 0.525581 -2.979430 0.439790 3.0
2021-01-05 0.692313 -0.846918 -0.850518 0.014180 4.0
2021-01-06 0.069261 0.293929 1.020163 -0.871126 5.0

Setting values by position:

df.iat[0, 1] = 0
df
A B C D F
2021-01-01 0.000000 0.000000 0.550256 -1.485372 NaN
2021-01-02 -1.539542 0.529183 -0.137668 2.056552 1.0
2021-01-03 0.137383 -0.872425 0.563527 1.377882 2.0
2021-01-04 0.827011 0.525581 -2.979430 0.439790 3.0
2021-01-05 0.692313 -0.846918 -0.850518 0.014180 4.0
2021-01-06 0.069261 0.293929 1.020163 -0.871126 5.0

Setting by assigning with a NumPy array:

df.loc[:, "D"] = np.array([5] * len(df))
df
A B C D F
2021-01-01 0.000000 0.000000 0.550256 5 NaN
2021-01-02 -1.539542 0.529183 -0.137668 5 1.0
2021-01-03 0.137383 -0.872425 0.563527 5 2.0
2021-01-04 0.827011 0.525581 -2.979430 5 3.0
2021-01-05 0.692313 -0.846918 -0.850518 5 4.0
2021-01-06 0.069261 0.293929 1.020163 5 5.0

A where operation with setting.

df3 = df.copy()
df3[df3 > 0] = -df3
df3
A B C D F
2021-01-01 0.000000 0.000000 -0.550256 -5 NaN
2021-01-02 -1.539542 -0.529183 -0.137668 -5 -1.0
2021-01-03 -0.137383 -0.872425 -0.563527 -5 -2.0
2021-01-04 -0.827011 -0.525581 -2.979430 -5 -3.0
2021-01-05 -0.692313 -0.846918 -0.850518 -5 -4.0
2021-01-06 -0.069261 -0.293929 -1.020163 -5 -5.0
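
The masked assignment above can also be written with the DataFrame.where method, which keeps values where the condition holds and substitutes the other argument elsewhere; a minimal sketch with a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame: negate only the positive entries via where()
nums = pd.DataFrame({"x": [1, -2, 3]})

# Keep values where the condition is True, replace the rest with -nums
flipped = nums.where(nums <= 0, -nums)
assert flipped["x"].tolist() == [-1, -2, -3]
```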

Missing data#

pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See the Missing Data section for details.

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.

df4 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df4
A B C D F E
2021-01-01 0.000000 0.000000 0.550256 5 NaN NaN
2021-01-02 -1.539542 0.529183 -0.137668 5 1.0 NaN
2021-01-03 0.137383 -0.872425 0.563527 5 2.0 NaN
2021-01-04 0.827011 0.525581 -2.979430 5 3.0 NaN
df4.loc[dates[0] : dates[1], "E"] = 1
df4
A B C D F E
2021-01-01 0.000000 0.000000 0.550256 5 NaN 1.0
2021-01-02 -1.539542 0.529183 -0.137668 5 1.0 1.0
2021-01-03 0.137383 -0.872425 0.563527 5 2.0 NaN
2021-01-04 0.827011 0.525581 -2.979430 5 3.0 NaN

To drop any rows that have missing data:

df4.dropna(how="any")
A B C D F E
2021-01-02 -1.539542 0.529183 -0.137668 5 1.0 1.0

Filling missing data:

df4.fillna(value=5)
A B C D F E
2021-01-01 0.000000 0.000000 0.550256 5 5.0 1.0
2021-01-02 -1.539542 0.529183 -0.137668 5 1.0 1.0
2021-01-03 0.137383 -0.872425 0.563527 5 2.0 5.0
2021-01-04 0.827011 0.525581 -2.979430 5 3.0 5.0

To get the boolean mask where values are NaN:

pd.isna(df4)
A B C D F E
2021-01-01 False False False False True False
2021-01-02 False False False False False False
2021-01-03 False False False False False True
2021-01-04 False False False False False True

Operations#

Note that operations in general exclude missing data.
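
To illustrate (a small standalone example): reductions such as sum(), mean() and count() simply skip NaN entries.

```python
import numpy as np
import pandas as pd

# Missing values are excluded from reductions by default
s_na = pd.Series([1.0, np.nan, 3.0])

assert s_na.sum() == 4.0    # NaN excluded from the sum
assert s_na.mean() == 2.0   # mean over the two non-missing values
assert s_na.count() == 2    # count ignores NaN as well
```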

Statistics#

Performing a descriptive statistic:

df.dtypes
A    float64
B    float64
C    float64
D      int64
F    float64
dtype: object
df.mean()
A    0.031071
B   -0.061775
C   -0.305611
D    5.000000
F    3.000000
dtype: float64

Same operation on the other axis:

df.mean(axis=1)
2021-01-01    1.387564
2021-01-02    0.970395
2021-01-03    1.365697
2021-01-04    1.274633
2021-01-05    1.598975
2021-01-06    2.276671
Freq: D, dtype: float64

Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension.

Let’s use the shift() function to shift the values along the index by 2 periods; the first two entries become NaN:

s2 = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
s2
2021-01-01    NaN
2021-01-02    NaN
2021-01-03    1.0
2021-01-04    3.0
2021-01-05    5.0
2021-01-06    NaN
Freq: D, dtype: float64

We can use the sub() function to subtract the corresponding element of a Series from each element of a DataFrame, aligning on the index:

df
A B C D F
2021-01-01 0.000000 0.000000 0.550256 5 NaN
2021-01-02 -1.539542 0.529183 -0.137668 5 1.0
2021-01-03 0.137383 -0.872425 0.563527 5 2.0
2021-01-04 0.827011 0.525581 -2.979430 5 3.0
2021-01-05 0.692313 -0.846918 -0.850518 5 4.0
2021-01-06 0.069261 0.293929 1.020163 5 5.0
df.sub(s2, axis="index")
A B C D F
2021-01-01 NaN NaN NaN NaN NaN
2021-01-02 NaN NaN NaN NaN NaN
2021-01-03 -0.862617 -1.872425 -0.436473 4.0 1.0
2021-01-04 -2.172989 -2.474419 -5.979430 2.0 0.0
2021-01-05 -4.307687 -5.846918 -5.850518 0.0 -1.0
2021-01-06 NaN NaN NaN NaN NaN

Apply#

Applying functions to data:

df.apply(np.cumsum)
A B C D F
2021-01-01 0.000000 0.000000 0.550256 5 NaN
2021-01-02 -1.539542 0.529183 0.412589 10 1.0
2021-01-03 -1.402159 -0.343242 0.976116 15 3.0
2021-01-04 -0.575148 0.182339 -2.003314 20 6.0
2021-01-05 0.117166 -0.664579 -2.853831 25 10.0
2021-01-06 0.186426 -0.370650 -1.833668 30 15.0

A lambda function is a small anonymous function. It can take any number of arguments, but can only have one expression:

df.apply(lambda x: x.max() - x.min())
A    2.366554
B    1.401608
C    3.999593
D    0.000000
F    4.000000
dtype: float64
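
apply() can also reduce row-wise by passing axis=1; a minimal sketch with a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame: compute the spread of each row instead of each column
grid = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

row_span = grid.apply(lambda row: row.max() - row.min(), axis=1)
assert row_span.tolist() == [9, 18]
```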

Value counts (histogramming)#

s3 = pd.Series(np.random.randint(0, 4, size=5))

s3
0    2
1    0
2    1
3    3
4    3
dtype: int64
s3.value_counts()
3    2
2    1
0    1
1    1
dtype: int64

String methods#

Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them).

s4 = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
s4
0       A
1       B
2       C
3    Aaba
4    Baca
5     NaN
6    CABA
7     dog
8     cat
dtype: object
s4.str.lower()
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
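
Since pattern-matching defaults to regular expressions, methods such as str.contains() and str.replace() accept regex patterns directly; a short sketch on a similar hypothetical Series:

```python
import numpy as np
import pandas as pd

s = pd.Series(["Aaba", "Baca", np.nan, "CABA"])

# Regex matching: strings starting with A or B; NaN propagates through
mask = s.str.contains("^[AB]", regex=True)
assert mask.tolist()[:2] == [True, True]

# Regex replacement: collapse runs of lowercase 'a' (case-sensitive)
out = s.str.replace("a+", "_", regex=True)
assert out[0] == "A_b_"
```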

Merge#

Concat#

pandas provides various facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

Concatenating pandas objects together with concat():

df5 = pd.DataFrame(np.random.randn(10, 4))
df5
0 1 2 3
0 -0.765385 -0.645532 1.077812 0.062465
1 0.411234 -0.846388 -1.143676 0.049230
2 0.277718 0.934318 -0.432680 -0.334319
3 -0.739204 -0.275935 1.653829 -0.171295
4 -1.692065 -0.941707 0.806281 0.334500
5 -0.933935 -0.753211 0.570971 -1.406925
6 -0.926019 0.511278 0.207824 -0.209196
7 -0.002946 -1.885014 1.971658 0.551915
8 -1.750871 0.080519 -1.743010 -0.398808
9 -0.595343 0.089527 0.720991 0.164519
# break it into pieces
pieces = [df5[:3], df5[3:7], df5[7:]]

pieces
[          0         1         2         3
 0 -0.765385 -0.645532  1.077812  0.062465
 1  0.411234 -0.846388 -1.143676  0.049230
 2  0.277718  0.934318 -0.432680 -0.334319,
           0         1         2         3
 3 -0.739204 -0.275935  1.653829 -0.171295
 4 -1.692065 -0.941707  0.806281  0.334500
 5 -0.933935 -0.753211  0.570971 -1.406925
 6 -0.926019  0.511278  0.207824 -0.209196,
           0         1         2         3
 7 -0.002946 -1.885014  1.971658  0.551915
 8 -1.750871  0.080519 -1.743010 -0.398808
 9 -0.595343  0.089527  0.720991  0.164519]
pd.concat(pieces)
0 1 2 3
0 -0.765385 -0.645532 1.077812 0.062465
1 0.411234 -0.846388 -1.143676 0.049230
2 0.277718 0.934318 -0.432680 -0.334319
3 -0.739204 -0.275935 1.653829 -0.171295
4 -1.692065 -0.941707 0.806281 0.334500
5 -0.933935 -0.753211 0.570971 -1.406925
6 -0.926019 0.511278 0.207824 -0.209196
7 -0.002946 -1.885014 1.971658 0.551915
8 -1.750871 0.080519 -1.743010 -0.398808
9 -0.595343 0.089527 0.720991 0.164519

Join#

left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})

right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})

print(left)
print("-" * 15)
print(right)
   key  lval
0  foo     1
1  foo     2
---------------
   key  rval
0  foo     4
1  foo     5
pd.merge(left, right, on="key")
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5

Another example, this time with unique keys:

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})

right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})

print(left)
print("-" * 15)
print(right)
   key  lval
0  foo     1
1  bar     2
---------------
   key  rval
0  foo     4
1  bar     5
pd.merge(left, right, on="key")
key lval rval
0 foo 1 4
1 bar 2 5
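
The how parameter of merge() controls the join type; by default it is "inner". A sketch with hypothetical frames showing an outer join, which keeps keys from both sides:

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})

# "outer" keeps all keys; unmatched sides are filled with NaN
merged = pd.merge(left, right, on="key", how="outer")

assert sorted(merged["key"]) == ["bar", "baz", "foo"]
assert merged.shape == (3, 3)
```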

Grouping#

By “group by” we are referring to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria

  • Applying a function to each group independently

  • Combining the results into a data structure

df6 = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)

df6
A B C D
0 foo one 0.002769 0.045595
1 bar one 0.707007 0.688803
2 foo two -0.808793 0.810861
3 bar three -1.141795 0.224594
4 foo two 0.036662 -0.332820
5 bar two 0.263353 -0.209618
6 foo one -0.328079 -0.509825
7 foo three -1.071213 -1.171025

Grouping and then applying the sum() function to the resulting groups.

df6.groupby("A")[["C", "D"]].sum()
C D
A
bar -0.171435 0.703779
foo -2.168655 -1.157213

Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function.

df6.groupby(["A", "B"]).sum()
C D
A B
bar one 0.707007 0.688803
three -1.141795 0.224594
two 0.263353 -0.209618
foo one -0.325311 -0.464230
three -1.071213 -1.171025
two -0.772132 0.478042
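
Beyond sum(), the agg() method applies several reductions at once; a minimal sketch with a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame: several aggregations per group via agg()
sales = pd.DataFrame({"A": ["foo", "bar", "foo"], "C": [1.0, 2.0, 3.0]})

stats = sales.groupby("A")["C"].agg(["sum", "mean", "count"])
assert stats.loc["foo", "sum"] == 4.0
assert stats.loc["foo", "mean"] == 2.0
assert stats.loc["bar", "count"] == 1
```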

Reshaping#

Stack#

tuples = list(
    zip(
        *[
            ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
            ["one", "two", "one", "two", "one", "two", "one", "two"],
        ]
    )
)

tuples
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

index
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])
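
A MultiIndex like this is typically used with stack() and unstack(); a minimal sketch (df7 is a hypothetical frame built on a shortened version of the index above):

```python
import numpy as np
import pandas as pd

# Hypothetical frame on a two-level index
index = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two"), ("baz", "one"), ("baz", "two")],
    names=["first", "second"],
)
df7 = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=["A", "B"])

# stack() moves the columns into the innermost index level
stacked = df7.stack()
assert stacked.index.nlevels == 3
assert stacked[("bar", "one", "A")] == 0

# unstack() reverses the operation
assert df7.stack().unstack().equals(df7)
```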