Essential methods#
This introduction to pandas is based on the official pandas documentation. It contains information from the first steps in pandas tutorial but also covers more advanced topics.
Customarily, we import as follows:
import numpy as np
import pandas as pd
Object creation#
Creating a Series by passing a list of values, letting pandas create a default integer index:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
dates = pd.date_range("20210101", periods=6)
dates
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
'2021-01-05', '2021-01-06'],
dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df
A | B | C | D | |
---|---|---|---|---|
2021-01-01 | 1.162312 | -0.344268 | 0.550256 | -1.485372 |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 2.056552 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 0.439790 |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 0.014180 |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | -0.871126 |
Create a simple DataFrame:
df_height = pd.DataFrame(
    {
        "name": pd.Categorical(["Tom", "Marc", "Peter"]),
        "height": np.array([168, 193, 172], dtype="int32"),
        "gender": "male",
    }
)
Creating a DataFrame by passing a dict of objects that can be converted to something Series-like:
df2 = pd.DataFrame(
{
"A": 1.0,
"B": pd.Timestamp("20130102"),
"C": pd.Series(1, index=list(range(4)), dtype="float32"),
"D": np.array([3] * 4, dtype="int32"),
"E": pd.Categorical(["test", "train", "test", "train"]),
"F": "foo",
}
)
df2
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
The columns of the resulting DataFrame have different dtypes.
df2.dtypes
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here’s a subset of the attributes that will be completed:
# df2.<TAB>
Viewing data#
Here is how to view the top and bottom rows of the frame:
df
A | B | C | D | |
---|---|---|---|---|
2021-01-01 | 1.162312 | -0.344268 | 0.550256 | -1.485372 |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 2.056552 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 0.439790 |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 0.014180 |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | -0.871126 |
df.head()
A | B | C | D | |
---|---|---|---|---|
2021-01-01 | 1.162312 | -0.344268 | 0.550256 | -1.485372 |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 2.056552 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 0.439790 |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 0.014180 |
df.tail(3)
A | B | C | D | |
---|---|---|---|---|
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 0.439790 |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 0.014180 |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | -0.871126 |
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2021-01-01 to 2021-01-06
Freq: D
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 6 non-null float64
1 B 6 non-null float64
2 C 6 non-null float64
3 D 6 non-null float64
dtypes: float64(4)
memory usage: 240.0 bytes
df.index
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
'2021-01-05', '2021-01-06'],
dtype='datetime64[ns]', freq='D')
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
describe() shows a quick statistic summary of your data:
df.describe()
A | B | C | D | |
---|---|---|---|---|
count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
mean | 0.224790 | -0.119153 | -0.305611 | 0.255318 |
std | 0.959676 | 0.656414 | 1.464242 | 1.333840 |
min | -1.539542 | -0.872425 | -2.979430 | -1.485372 |
25% | 0.086291 | -0.721256 | -0.672305 | -0.649799 |
50% | 0.414848 | -0.025169 | 0.206294 | 0.226985 |
75% | 0.793337 | 0.467668 | 0.560210 | 1.143359 |
max | 1.162312 | 0.529183 | 1.020163 | 2.056552 |
Transposing your data:
df.T
2021-01-01 | 2021-01-02 | 2021-01-03 | 2021-01-04 | 2021-01-05 | 2021-01-06 | |
---|---|---|---|---|---|---|
A | 1.162312 | -1.539542 | 0.137383 | 0.827011 | 0.692313 | 0.069261 |
B | -0.344268 | 0.529183 | -0.872425 | 0.525581 | -0.846918 | 0.293929 |
C | 0.550256 | -0.137668 | 0.563527 | -2.979430 | -0.850518 | 1.020163 |
D | -1.485372 | 2.056552 | 1.377882 | 0.439790 | 0.014180 | -0.871126 |
Sorting by an axis:
df.sort_index(axis=1, ascending=False)
D | C | B | A | |
---|---|---|---|---|
2021-01-01 | -1.485372 | 0.550256 | -0.344268 | 1.162312 |
2021-01-02 | 2.056552 | -0.137668 | 0.529183 | -1.539542 |
2021-01-03 | 1.377882 | 0.563527 | -0.872425 | 0.137383 |
2021-01-04 | 0.439790 | -2.979430 | 0.525581 | 0.827011 |
2021-01-05 | 0.014180 | -0.850518 | -0.846918 | 0.692313 |
2021-01-06 | -0.871126 | 1.020163 | 0.293929 | 0.069261 |
Sorting by values:
df.sort_values(by="B")
A | B | C | D | |
---|---|---|---|---|
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 0.014180 |
2021-01-01 | 1.162312 | -0.344268 | 0.550256 | -1.485372 |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | -0.871126 |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 0.439790 |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 2.056552 |
Selection#
While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, .at, .iat, .loc and .iloc.
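As a minimal sketch of how the four accessors differ (using a small hypothetical DataFrame, not the `df` from above):

```python
import pandas as pd

# A small example frame with a string index
mini = pd.DataFrame({"x": [10, 20, 30], "y": [1, 2, 3]}, index=["a", "b", "c"])

row_by_label = mini.loc["b"]         # label-based row selection -> Series
row_by_pos = mini.iloc[1]            # position-based row selection -> Series
scalar_by_label = mini.at["b", "x"]  # fast scalar access by label
scalar_by_pos = mini.iat[1, 0]       # fast scalar access by position

print(scalar_by_label, scalar_by_pos)  # 20 20
```

`.loc`/`.iloc` handle rows, columns, and slices; `.at`/`.iat` are optimized for single scalars.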
Getting#
Selecting a single column, which yields a Series, equivalent to df.A:
df["A"]
2021-01-01 1.162312
2021-01-02 -1.539542
2021-01-03 0.137383
2021-01-04 0.827011
2021-01-05 0.692313
2021-01-06 0.069261
Freq: D, Name: A, dtype: float64
Selecting via [], which slices the rows.
df[0:3]
A | B | C | D | |
---|---|---|---|---|
2021-01-01 | 1.162312 | -0.344268 | 0.550256 | -1.485372 |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 2.056552 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 |
df["20210101":"20210103"]
A | B | C | D | |
---|---|---|---|---|
2021-01-01 | 1.162312 | -0.344268 | 0.550256 | -1.485372 |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 2.056552 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 |
Selection by label#
The .loc attribute is the primary access method. The following are valid inputs:
A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, not as an integer position along the index).
A list or array of labels ['a', 'b', 'c'].
A slice object with labels 'a':'f' (note that contrary to usual Python slices, both the start and the stop are included when present in the index; see Slicing with labels).
A boolean array.
A callable, see Selection By Callable.
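The last two input kinds (a boolean array and a callable) can be sketched as follows, on a small illustrative frame built separately from the `df` above:

```python
import numpy as np
import pandas as pd

demo_dates = pd.date_range("20210101", periods=6)
demo = pd.DataFrame(
    np.arange(24).reshape(6, 4), index=demo_dates, columns=list("ABCD")
)

# Boolean array: keep only rows where column A is divisible by 8
mask = (demo["A"] % 8 == 0).to_numpy()
div_rows = demo.loc[mask]

# Callable: receives the DataFrame, returns a valid indexer
big_b = demo.loc[lambda d: d["B"] > 10]

print(len(div_rows), len(big_b))  # 3 3
```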
For getting a cross section using a label:
df.loc[dates[0]]
A 1.162312
B -0.344268
C 0.550256
D -1.485372
Name: 2021-01-01 00:00:00, dtype: float64
Selecting on a multi-axis by label:
df.loc[:, ["A", "B"]]
A | B | |
---|---|---|
2021-01-01 | 1.162312 | -0.344268 |
2021-01-02 | -1.539542 | 0.529183 |
2021-01-03 | 0.137383 | -0.872425 |
2021-01-04 | 0.827011 | 0.525581 |
2021-01-05 | 0.692313 | -0.846918 |
2021-01-06 | 0.069261 | 0.293929 |
Showing label slicing, both endpoints are included:
df.loc["20210102":"20210104", ["A", "B"]]
A | B | |
---|---|---|
2021-01-02 | -1.539542 | 0.529183 |
2021-01-03 | 0.137383 | -0.872425 |
2021-01-04 | 0.827011 | 0.525581 |
Reduction in the dimensions of the returned object:
df.loc["20210102", ["A", "B"]]
A -1.539542
B 0.529183
Name: 2021-01-02 00:00:00, dtype: float64
For getting a scalar value:
df.loc[dates[0], "A"]
1.1623119547723357
For getting fast access to a scalar (equivalent to the prior method), use .at.
Use .at if you only need to get or set a single value in a DataFrame or Series.
df.at[dates[0], "A"]
1.1623119547723357
Setting a value at a specified row/column pair:
df.at[dates[0], "A"] = 1
df.A
2021-01-01 1.000000
2021-01-02 -1.539542
2021-01-03 0.137383
2021-01-04 0.827011
2021-01-05 0.692313
2021-01-06 0.069261
Freq: D, Name: A, dtype: float64
Selection by position#
pandas provides a suite of methods in order to get purely integer-based indexing.
The .iloc attribute is the primary access method. The following are valid inputs:
An integer e.g. 5.
A list or array of integers [4, 3, 0].
A slice object with ints 1:7.
A boolean array.
A callable, see Selection By Callable.
See more in Selection by Position.
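A short sketch of the less common .iloc inputs (a list of positions and a callable), on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("abc"))

# A list of integer positions for rows and columns
subset = frame.iloc[[0, 2], [0, 2]]  # rows 0 and 2, columns 'a' and 'c'

# A callable: receives the DataFrame, returns integer positions
last_two = frame.iloc[lambda d: [len(d) - 2, len(d) - 1]]

print(subset.shape, last_two.index.tolist())  # (2, 2) [2, 3]
```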
Select via the position of the passed integers:
df.iloc[0]
A 1.000000
B -0.344268
C 0.550256
D -1.485372
Name: 2021-01-01 00:00:00, dtype: float64
By integer slices, acting similarly to NumPy/Python:
df.iloc[0:3, 0:2]
A | B | |
---|---|---|
2021-01-01 | 1.000000 | -0.344268 |
2021-01-02 | -1.539542 | 0.529183 |
2021-01-03 | 0.137383 | -0.872425 |
df.iloc[1:3, 0:2]
A | B | |
---|---|---|
2021-01-02 | -1.539542 | 0.529183 |
2021-01-03 | 0.137383 | -0.872425 |
By lists of integer position locations, similar to the NumPy/Python style:
df.iloc[[1, 2, 4], [0, 2]]
A | C | |
---|---|---|
2021-01-02 | -1.539542 | -0.137668 |
2021-01-03 | 0.137383 | 0.563527 |
2021-01-05 | 0.692313 | -0.850518 |
For slicing rows explicitly:
df.iloc[1:3, :]
A | B | C | D | |
---|---|---|---|---|
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 2.056552 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 |
For slicing columns explicitly:
df.iloc[:, 1:3]
B | C | |
---|---|---|
2021-01-01 | -0.344268 | 0.550256 |
2021-01-02 | 0.529183 | -0.137668 |
2021-01-03 | -0.872425 | 0.563527 |
2021-01-04 | 0.525581 | -2.979430 |
2021-01-05 | -0.846918 | -0.850518 |
2021-01-06 | 0.293929 | 1.020163 |
For getting a value explicitly:
df.iloc[0, 0]
1.0
For getting fast access to a scalar (equivalent to the prior method):
df.iat[0, 0]
1.0
Boolean indexing#
Boolean indexing is used to filter subsets of data based on the actual values in the DataFrame. It is called Boolean because each filter condition evaluates to one of two possible values, True or False.
Using a single column’s values to select data.
df[df["A"] > 0]
A | B | C | D | |
---|---|---|---|---|
2021-01-01 | 1.000000 | -0.344268 | 0.550256 | -1.485372 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 0.439790 |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 0.014180 |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | -0.871126 |
Selecting values from a DataFrame where a boolean condition is met.
df[df > 0]
A | B | C | D | |
---|---|---|---|---|
2021-01-01 | 1.000000 | NaN | 0.550256 | NaN |
2021-01-02 | NaN | 0.529183 | NaN | 2.056552 |
2021-01-03 | 0.137383 | NaN | 0.563527 | 1.377882 |
2021-01-04 | 0.827011 | 0.525581 | NaN | 0.439790 |
2021-01-05 | 0.692313 | NaN | NaN | 0.014180 |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | NaN |
Let’s make a copy of the df object’s indices and data. We use the default deep=True to create a new object with a copy of the calling object’s data and indices. Modifications to the data or indices of the copy will not be reflected in the original object. Note that a shallow copy (deep=False) shares data and index with the original.
df_copy = df.copy()
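A minimal sketch of why the deep copy matters (using a small hypothetical frame, separate from `df`): changes to the copy leave the original untouched.

```python
import pandas as pd

original = pd.DataFrame({"A": [1, 2, 3]})

deep = original.copy()   # deep=True is the default
deep.loc[0, "A"] = 99    # modify the copy only

# The original is untouched: the deep copy owns its own data
print(original.loc[0, "A"])  # 1
```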
Create a new column in the DataFrame:
df_copy["E"] = ["one", "one", "two", "three", "four", "three"]
df_copy
A | B | C | D | E | |
---|---|---|---|---|---|
2021-01-01 | 1.000000 | -0.344268 | 0.550256 | -1.485372 | one |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 2.056552 | one |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 | two |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 0.439790 | three |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 0.014180 | four |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | -0.871126 | three |
Using the isin() method for filtering:
df_copy[df_copy["E"].isin(["two", "four"])]
A | B | C | D | E | |
---|---|---|---|---|---|
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 | two |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 0.014180 | four |
Setting#
Setting a new column automatically aligns the data by the indexes.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20210102", periods=6))
s1
2021-01-02 1
2021-01-03 2
2021-01-04 3
2021-01-05 4
2021-01-06 5
2021-01-07 6
Freq: D, dtype: int64
df["F"] = s1
df
A | B | C | D | F | |
---|---|---|---|---|---|
2021-01-01 | 1.000000 | -0.344268 | 0.550256 | -1.485372 | NaN |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 2.056552 | 1.0 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 | 2.0 |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 0.439790 | 3.0 |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 0.014180 | 4.0 |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | -0.871126 | 5.0 |
Setting values by label:
df.at[dates[0], "A"] = 0
df
A | B | C | D | F | |
---|---|---|---|---|---|
2021-01-01 | 0.000000 | -0.344268 | 0.550256 | -1.485372 | NaN |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 2.056552 | 1.0 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 | 2.0 |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 0.439790 | 3.0 |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 0.014180 | 4.0 |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | -0.871126 | 5.0 |
Setting values by position:
df.iat[0, 1] = 0
df
A | B | C | D | F | |
---|---|---|---|---|---|
2021-01-01 | 0.000000 | 0.000000 | 0.550256 | -1.485372 | NaN |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 2.056552 | 1.0 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 | 2.0 |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 0.439790 | 3.0 |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 0.014180 | 4.0 |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | -0.871126 | 5.0 |
Setting by assigning with a NumPy array:
df.loc[:, "D"] = np.array([5] * len(df))
df
A | B | C | D | F | |
---|---|---|---|---|---|
2021-01-01 | 0.000000 | 0.000000 | 0.550256 | 5 | NaN |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 5 | 1.0 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 5 | 2.0 |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 5 | 3.0 |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 5 | 4.0 |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | 5 | 5.0 |
A where operation with setting.
df3 = df.copy()
df3[df3 > 0] = -df3
df3
A | B | C | D | F | |
---|---|---|---|---|---|
2021-01-01 | 0.000000 | 0.000000 | -0.550256 | -5 | NaN |
2021-01-02 | -1.539542 | -0.529183 | -0.137668 | -5 | -1.0 |
2021-01-03 | -0.137383 | -0.872425 | -0.563527 | -5 | -2.0 |
2021-01-04 | -0.827011 | -0.525581 | -2.979430 | -5 | -3.0 |
2021-01-05 | -0.692313 | -0.846918 | -0.850518 | -5 | -4.0 |
2021-01-06 | -0.069261 | -0.293929 | -1.020163 | -5 | -5.0 |
Missing data#
pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See the Missing Data section for details.
Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.
df4 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df4
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2021-01-01 | 0.000000 | 0.000000 | 0.550256 | 5 | NaN | NaN |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 5 | 1.0 | NaN |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 5 | 2.0 | NaN |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 5 | 3.0 | NaN |
df4.loc[dates[0] : dates[1], "E"] = 1
df4
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2021-01-01 | 0.000000 | 0.000000 | 0.550256 | 5 | NaN | 1.0 |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 5 | 1.0 | 1.0 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 5 | 2.0 | NaN |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 5 | 3.0 | NaN |
To drop any rows that have missing data.
df4.dropna(how="any")
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 5 | 1.0 | 1.0 |
Filling missing data.
df4.fillna(value=5)
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2021-01-01 | 0.000000 | 0.000000 | 0.550256 | 5 | 5.0 | 1.0 |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 5 | 1.0 | 1.0 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 5 | 2.0 | 5.0 |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 5 | 3.0 | 5.0 |
To get the boolean mask where values are NaN:
pd.isna(df4)
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2021-01-01 | False | False | False | False | True | False |
2021-01-02 | False | False | False | False | False | False |
2021-01-03 | False | False | False | False | False | True |
2021-01-04 | False | False | False | False | False | True |
Operations#
Note that operations in general exclude missing data.
Statistics#
Performing a descriptive statistic:
df.dtypes
A float64
B float64
C float64
D int64
F float64
dtype: object
df.mean()
A 0.031071
B -0.061775
C -0.305611
D 5.000000
F 3.000000
dtype: float64
Same operation on the other axis:
df.mean(axis=1)
2021-01-01 1.387564
2021-01-02 0.970395
2021-01-03 1.365697
2021-01-04 1.274633
2021-01-05 1.598975
2021-01-06 2.276671
Freq: D, dtype: float64
Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension.
Let’s use the shift() function to shift the values forward by 2 periods along the index (the first two entries become NaN):
s2 = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
s2
2021-01-01 NaN
2021-01-02 NaN
2021-01-03 1.0
2021-01-04 3.0
2021-01-05 5.0
2021-01-06 NaN
Freq: D, dtype: float64
We can use the sub() function to subtract the elements of a Series from the matching rows of a DataFrame, aligning on the index.
df
A | B | C | D | F | |
---|---|---|---|---|---|
2021-01-01 | 0.000000 | 0.000000 | 0.550256 | 5 | NaN |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 5 | 1.0 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 5 | 2.0 |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 5 | 3.0 |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 5 | 4.0 |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | 5 | 5.0 |
df.sub(s2, axis="index")
A | B | C | D | F | |
---|---|---|---|---|---|
2021-01-01 | NaN | NaN | NaN | NaN | NaN |
2021-01-02 | NaN | NaN | NaN | NaN | NaN |
2021-01-03 | -0.862617 | -1.872425 | -0.436473 | 4.0 | 1.0 |
2021-01-04 | -2.172989 | -2.474419 | -5.979430 | 2.0 | 0.0 |
2021-01-05 | -4.307687 | -5.846918 | -5.850518 | 0.0 | -1.0 |
2021-01-06 | NaN | NaN | NaN | NaN | NaN |
Apply#
Applying functions to data:
df.apply(np.cumsum)
A | B | C | D | F | |
---|---|---|---|---|---|
2021-01-01 | 0.000000 | 0.000000 | 0.550256 | 5 | NaN |
2021-01-02 | -1.539542 | 0.529183 | 0.412589 | 10 | 1.0 |
2021-01-03 | -1.402159 | -0.343242 | 0.976116 | 15 | 3.0 |
2021-01-04 | -0.575148 | 0.182339 | -2.003314 | 20 | 6.0 |
2021-01-05 | 0.117166 | -0.664579 | -2.853831 | 25 | 10.0 |
2021-01-06 | 0.186426 | -0.370650 | -1.833668 | 30 | 15.0 |
A lambda function is a small anonymous function. It can take any number of arguments, but can only have one expression:
df.apply(lambda x: x.max() - x.min())
A 2.366554
B 1.401608
C 3.999593
D 0.000000
F 4.000000
dtype: float64
Value counts (histogramming)#
s3 = pd.Series(np.random.randint(0, 4, size=5))
s3
0 2
1 0
2 1
3 3
4 3
dtype: int64
s3.value_counts()
3 2
2 1
0 1
1 1
dtype: int64
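Beyond raw counts, value_counts() also accepts a normalize flag to return relative frequencies; a small sketch on a fixed Series (rather than the random s3 above):

```python
import pandas as pd

s_counts = pd.Series([0, 1, 1, 3, 3])

counts = s_counts.value_counts()                # absolute counts
freqs = s_counts.value_counts(normalize=True)   # relative frequencies, sum to 1

print(counts.to_dict())  # {1: 2, 3: 2, 0: 1}
```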
String methods#
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them).
s4 = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
s4
0 A
1 B
2 C
3 Aaba
4 Baca
5 NaN
6 CABA
7 dog
8 cat
dtype: object
s4.str.lower()
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
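To illustrate the regex default mentioned above, here is a small sketch with str.contains(), whose pattern is interpreted as a regular expression by default:

```python
import pandas as pd

s_pat = pd.Series(["Aaba", "Baca", "CABA", "dog"])

# '.' is a regex wildcard; case=False makes the match case-insensitive
has_axa = s_pat.str.contains("a.a", case=False)

print(has_axa.tolist())  # [True, True, True, False]
```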
Merge#
Concat#
pandas provides various facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational-algebra functionality in the case of join / merge-type operations.
Concatenating pandas objects together with concat():
df5 = pd.DataFrame(np.random.randn(10, 4))
df5
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | -0.765385 | -0.645532 | 1.077812 | 0.062465 |
1 | 0.411234 | -0.846388 | -1.143676 | 0.049230 |
2 | 0.277718 | 0.934318 | -0.432680 | -0.334319 |
3 | -0.739204 | -0.275935 | 1.653829 | -0.171295 |
4 | -1.692065 | -0.941707 | 0.806281 | 0.334500 |
5 | -0.933935 | -0.753211 | 0.570971 | -1.406925 |
6 | -0.926019 | 0.511278 | 0.207824 | -0.209196 |
7 | -0.002946 | -1.885014 | 1.971658 | 0.551915 |
8 | -1.750871 | 0.080519 | -1.743010 | -0.398808 |
9 | -0.595343 | 0.089527 | 0.720991 | 0.164519 |
# break it into pieces
pieces = [df5[:3], df5[3:7], df5[7:]]
pieces
[ 0 1 2 3
0 -0.765385 -0.645532 1.077812 0.062465
1 0.411234 -0.846388 -1.143676 0.049230
2 0.277718 0.934318 -0.432680 -0.334319,
0 1 2 3
3 -0.739204 -0.275935 1.653829 -0.171295
4 -1.692065 -0.941707 0.806281 0.334500
5 -0.933935 -0.753211 0.570971 -1.406925
6 -0.926019 0.511278 0.207824 -0.209196,
0 1 2 3
7 -0.002946 -1.885014 1.971658 0.551915
8 -1.750871 0.080519 -1.743010 -0.398808
9 -0.595343 0.089527 0.720991 0.164519]
pd.concat(pieces)
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | -0.765385 | -0.645532 | 1.077812 | 0.062465 |
1 | 0.411234 | -0.846388 | -1.143676 | 0.049230 |
2 | 0.277718 | 0.934318 | -0.432680 | -0.334319 |
3 | -0.739204 | -0.275935 | 1.653829 | -0.171295 |
4 | -1.692065 | -0.941707 | 0.806281 | 0.334500 |
5 | -0.933935 | -0.753211 | 0.570971 | -1.406925 |
6 | -0.926019 | 0.511278 | 0.207824 | -0.209196 |
7 | -0.002946 | -1.885014 | 1.971658 | 0.551915 |
8 | -1.750871 | 0.080519 | -1.743010 | -0.398808 |
9 | -0.595343 | 0.089527 | 0.720991 | 0.164519 |
Join#
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
print(left)
print("-" * 15)
print(right)
key lval
0 foo 1
1 foo 2
---------------
key rval
0 foo 4
1 foo 5
pd.merge(left, right, on="key")
key | lval | rval | |
---|---|---|---|
0 | foo | 1 | 4 |
1 | foo | 1 | 5 |
2 | foo | 2 | 4 |
3 | foo | 2 | 5 |
Another example, this time with unique keys:
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
print(left)
print("-" * 15)
print(right)
key lval
0 foo 1
1 bar 2
---------------
key rval
0 foo 4
1 bar 5
pd.merge(left, right, on="key")
key | lval | rval | |
---|---|---|---|
0 | foo | 1 | 4 |
1 | bar | 2 | 5 |
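When the key columns do not fully overlap, merge()'s how parameter controls which keys survive; a small sketch with hypothetical frames:

```python
import pandas as pd

left2 = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right2 = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})

inner = pd.merge(left2, right2, on="key")               # intersection of keys
outer = pd.merge(left2, right2, on="key", how="outer")  # union of keys

print(len(inner), len(outer))  # 1 3
```

Non-matching rows in an outer merge get NaN in the columns coming from the other frame.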
Grouping#
By “group by” we are referring to a process involving one or more of the following steps:
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
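The three steps can be sketched on a tiny hypothetical frame before the fuller example below:

```python
import pandas as pd

demo_df = pd.DataFrame({"A": ["foo", "bar", "foo"], "C": [1.0, 2.0, 3.0]})

# Split by column A, apply sum() to each group, combine into a Series
sums = demo_df.groupby("A")["C"].sum()

print(sums.to_dict())  # {'bar': 2.0, 'foo': 4.0}
```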
df6 = pd.DataFrame(
{
"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": ["one", "one", "two", "three", "two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8),
}
)
df6
A | B | C | D | |
---|---|---|---|---|
0 | foo | one | 0.002769 | 0.045595 |
1 | bar | one | 0.707007 | 0.688803 |
2 | foo | two | -0.808793 | 0.810861 |
3 | bar | three | -1.141795 | 0.224594 |
4 | foo | two | 0.036662 | -0.332820 |
5 | bar | two | 0.263353 | -0.209618 |
6 | foo | one | -0.328079 | -0.509825 |
7 | foo | three | -1.071213 | -1.171025 |
Grouping and then applying the sum() function to the resulting groups.
df6.groupby("A").sum(numeric_only=True)
C | D | |
---|---|---|
A | ||
bar | -0.171435 | 0.703779 |
foo | -2.168655 | -1.157213 |
Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function.
df6.groupby(["A", "B"]).sum()
C | D | ||
---|---|---|---|
A | B | ||
bar | one | 0.707007 | 0.688803 |
three | -1.141795 | 0.224594 | |
two | 0.263353 | -0.209618 | |
foo | one | -0.325311 | -0.464230 |
three | -1.071213 | -1.171025 | |
two | -0.772132 | 0.478042 |
Reshaping#
Stack#
tuples = list(
zip(
*[
["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
["one", "two", "one", "two", "one", "two", "one", "two"],
]
)
)
tuples
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
index
MultiIndex([('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')],
names=['first', 'second'])