Essential methods#
This introduction to pandas is based on the official pandas documentation. It contains information from the first steps in pandas tutorial but also covers more advanced topics.
Customarily, we import as follows:
import numpy as np
import pandas as pd
Object creation#
Creating a Series by passing a list of values, letting pandas create a default integer index:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
dates = pd.date_range("20210101", periods=6)
dates
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
'2021-01-05', '2021-01-06'],
dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df
A | B | C | D | |
---|---|---|---|---|
2021-01-01 | 1.162312 | -0.344268 | 0.550256 | -1.485372 |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 2.056552 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 0.439790 |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 0.014180 |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | -0.871126 |
Create a simple DataFrame:
df_height = pd.DataFrame(
    {
        "name": pd.Categorical(["Tom", "Marc", "Peter"]),
        "height": np.array([168, 193, 172], dtype="int32"),
        "gender": "male",
    }
)
Creating a DataFrame by passing a dict of objects that can be converted to something Series-like:
df2 = pd.DataFrame(
{
"A": 1.0,
"B": pd.Timestamp("20130102"),
"C": pd.Series(1, index=list(range(4)), dtype="float32"),
"D": np.array([3] * 4, dtype="int32"),
"E": pd.Categorical(["test", "train", "test", "train"]),
"F": "foo",
}
)
df2
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
The columns of the resulting DataFrame have different dtypes.
df2.dtypes
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here’s a subset of the attributes that will be completed:
# df2.<TAB>
Viewing data#
Here is how to view the top and bottom rows of the frame:
df
A | B | C | D | |
---|---|---|---|---|
2021-01-01 | 1.162312 | -0.344268 | 0.550256 | -1.485372 |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 2.056552 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 0.439790 |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 0.014180 |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | -0.871126 |
df.head()
A | B | C | D | |
---|---|---|---|---|
2021-01-01 | 1.162312 | -0.344268 | 0.550256 | -1.485372 |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 2.056552 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 0.439790 |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 0.014180 |
df.tail(3)
A | B | C | D | |
---|---|---|---|---|
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 0.439790 |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 0.014180 |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | -0.871126 |
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2021-01-01 to 2021-01-06
Freq: D
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 6 non-null float64
1 B 6 non-null float64
2 C 6 non-null float64
3 D 6 non-null float64
dtypes: float64(4)
memory usage: 240.0 bytes
df.index
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
'2021-01-05', '2021-01-06'],
dtype='datetime64[ns]', freq='D')
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
describe() shows a quick statistic summary of your data:
df.describe()
A | B | C | D | |
---|---|---|---|---|
count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
mean | 0.224790 | -0.119153 | -0.305611 | 0.255318 |
std | 0.959676 | 0.656414 | 1.464242 | 1.333840 |
min | -1.539542 | -0.872425 | -2.979430 | -1.485372 |
25% | 0.086291 | -0.721256 | -0.672305 | -0.649799 |
50% | 0.414848 | -0.025169 | 0.206294 | 0.226985 |
75% | 0.793337 | 0.467668 | 0.560210 | 1.143359 |
max | 1.162312 | 0.529183 | 1.020163 | 2.056552 |
Transposing your data:
df.T
2021-01-01 | 2021-01-02 | 2021-01-03 | 2021-01-04 | 2021-01-05 | 2021-01-06 | |
---|---|---|---|---|---|---|
A | 1.162312 | -1.539542 | 0.137383 | 0.827011 | 0.692313 | 0.069261 |
B | -0.344268 | 0.529183 | -0.872425 | 0.525581 | -0.846918 | 0.293929 |
C | 0.550256 | -0.137668 | 0.563527 | -2.979430 | -0.850518 | 1.020163 |
D | -1.485372 | 2.056552 | 1.377882 | 0.439790 | 0.014180 | -0.871126 |
Sorting by an axis:
df.sort_index(axis=1, ascending=False)
D | C | B | A | |
---|---|---|---|---|
2021-01-01 | -1.485372 | 0.550256 | -0.344268 | 1.162312 |
2021-01-02 | 2.056552 | -0.137668 | 0.529183 | -1.539542 |
2021-01-03 | 1.377882 | 0.563527 | -0.872425 | 0.137383 |
2021-01-04 | 0.439790 | -2.979430 | 0.525581 | 0.827011 |
2021-01-05 | 0.014180 | -0.850518 | -0.846918 | 0.692313 |
2021-01-06 | -0.871126 | 1.020163 | 0.293929 | 0.069261 |
Sorting by values:
df.sort_values(by="B")
A | B | C | D | |
---|---|---|---|---|
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 0.014180 |
2021-01-01 | 1.162312 | -0.344268 | 0.550256 | -1.485372 |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | -0.871126 |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 0.439790 |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 2.056552 |
Selection#
While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, .at, .iat, .loc and .iloc.
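As a minimal sketch of how the four accessors differ (using a small hypothetical DataFrame, not the `df` from above):

```python
import pandas as pd

# A small example frame with a string index
mini = pd.DataFrame({"x": [10, 20, 30], "y": [1, 2, 3]}, index=["a", "b", "c"])

row_by_label = mini.loc["b"]         # label-based row selection -> Series
row_by_pos = mini.iloc[1]            # position-based row selection -> Series
scalar_by_label = mini.at["b", "x"]  # fast scalar access by label
scalar_by_pos = mini.iat[1, 0]       # fast scalar access by position

print(scalar_by_label, scalar_by_pos)  # 20 20
```

`.loc`/`.iloc` handle rows, columns, and slices; `.at`/`.iat` are optimized for single scalars.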
Getting#
Selecting a single column, which yields a Series, equivalent to df.A:
df["A"]
2021-01-01 1.162312
2021-01-02 -1.539542
2021-01-03 0.137383
2021-01-04 0.827011
2021-01-05 0.692313
2021-01-06 0.069261
Freq: D, Name: A, dtype: float64
Selecting via [], which slices the rows.
df[0:3]
A | B | C | D | |
---|---|---|---|---|
2021-01-01 | 1.162312 | -0.344268 | 0.550256 | -1.485372 |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 2.056552 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 |
df["20210101":"20210103"]
A | B | C | D | |
---|---|---|---|---|
2021-01-01 | 1.162312 | -0.344268 | 0.550256 | -1.485372 |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 2.056552 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 |
Selection by label#
The .loc attribute is the primary access method. The following are valid inputs:
A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, not as an integer position along the index).
A list or array of labels ['a', 'b', 'c'].
A slice object with labels 'a':'f' (note that contrary to usual Python slices, both the start and the stop are included when present in the index; see Slicing with labels).
A boolean array.
A callable, see Selection By Callable.
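The last two input kinds (a boolean array and a callable) can be sketched as follows, on a small illustrative frame built separately from the `df` above:

```python
import numpy as np
import pandas as pd

demo_dates = pd.date_range("20210101", periods=6)
demo = pd.DataFrame(
    np.arange(24).reshape(6, 4), index=demo_dates, columns=list("ABCD")
)

# Boolean array: keep only rows where column A is divisible by 8
mask = (demo["A"] % 8 == 0).to_numpy()
div_rows = demo.loc[mask]

# Callable: receives the DataFrame, returns a valid indexer
big_b = demo.loc[lambda d: d["B"] > 10]

print(len(div_rows), len(big_b))  # 3 3
```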
For getting a cross section using a label:
df.loc[dates[0]]
A 1.162312
B -0.344268
C 0.550256
D -1.485372
Name: 2021-01-01 00:00:00, dtype: float64
Selecting on a multi-axis by label:
df.loc[:, ["A", "B"]]
A | B | |
---|---|---|
2021-01-01 | 1.162312 | -0.344268 |
2021-01-02 | -1.539542 | 0.529183 |
2021-01-03 | 0.137383 | -0.872425 |
2021-01-04 | 0.827011 | 0.525581 |
2021-01-05 | 0.692313 | -0.846918 |
2021-01-06 | 0.069261 | 0.293929 |
Showing label slicing, both endpoints are included:
df.loc["20210102":"20210104", ["A", "B"]]
A | B | |
---|---|---|
2021-01-02 | -1.539542 | 0.529183 |
2021-01-03 | 0.137383 | -0.872425 |
2021-01-04 | 0.827011 | 0.525581 |
Reduction in the dimensions of the returned object:
df.loc["20210102", ["A", "B"]]
A -1.539542
B 0.529183
Name: 2021-01-02 00:00:00, dtype: float64
For getting a scalar value:
df.loc[dates[0], "A"]
1.1623119547723357
For getting fast access to a scalar (equivalent to the prior method), use .at.
Use .at if you only need to get or set a single value in a DataFrame or Series.
df.at[dates[0], "A"]
1.1623119547723357
Setting a value at a specified row/column pair:
df.at[dates[0], "A"] = 1
df.A
2021-01-01 1.000000
2021-01-02 -1.539542
2021-01-03 0.137383
2021-01-04 0.827011
2021-01-05 0.692313
2021-01-06 0.069261
Freq: D, Name: A, dtype: float64
Selection by position#
pandas provides a suite of methods in order to get purely integer-based indexing.
The .iloc attribute is the primary access method. The following are valid inputs:
An integer e.g. 5.
A list or array of integers [4, 3, 0].
A slice object with ints 1:7.
A boolean array.
A callable, see Selection By Callable.
See more in Selection by Position.
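A short sketch of the less common .iloc inputs (a list of positions and a callable), on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("abc"))

# A list of integer positions for rows and columns
subset = frame.iloc[[0, 2], [0, 2]]  # rows 0 and 2, columns 'a' and 'c'

# A callable: receives the DataFrame, returns integer positions
last_two = frame.iloc[lambda d: [len(d) - 2, len(d) - 1]]

print(subset.shape, last_two.index.tolist())  # (2, 2) [2, 3]
```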
Select via the position of the passed integers:
df.iloc[0]
A 1.000000
B -0.344268
C 0.550256
D -1.485372
Name: 2021-01-01 00:00:00, dtype: float64
By integer slices, acting similarly to NumPy/Python:
df.iloc[0:3, 0:2]
A | B | |
---|---|---|
2021-01-01 | 1.000000 | -0.344268 |
2021-01-02 | -1.539542 | 0.529183 |
2021-01-03 | 0.137383 | -0.872425 |
df.iloc[1:3, 0:2]
A | B | |
---|---|---|
2021-01-02 | -1.539542 | 0.529183 |
2021-01-03 | 0.137383 | -0.872425 |
By lists of integer position locations, similar to the NumPy/Python style:
df.iloc[[1, 2, 4], [0, 2]]
A | C | |
---|---|---|
2021-01-02 | -1.539542 | -0.137668 |
2021-01-03 | 0.137383 | 0.563527 |
2021-01-05 | 0.692313 | -0.850518 |
For slicing rows explicitly:
df.iloc[1:3, :]
A | B | C | D | |
---|---|---|---|---|
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 2.056552 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 |
For slicing columns explicitly:
df.iloc[:, 1:3]
B | C | |
---|---|---|
2021-01-01 | -0.344268 | 0.550256 |
2021-01-02 | 0.529183 | -0.137668 |
2021-01-03 | -0.872425 | 0.563527 |
2021-01-04 | 0.525581 | -2.979430 |
2021-01-05 | -0.846918 | -0.850518 |
2021-01-06 | 0.293929 | 1.020163 |
For getting a value explicitly:
df.iloc[0, 0]
1.0
For getting fast access to a scalar (equivalent to the prior method):
df.iat[0, 0]
1.0
Boolean indexing#
Boolean indexing is used to filter subsets of data based on the actual values in the DataFrame. It is called Boolean because each filter condition evaluates to one of two possible values, True or False.
Using a single column’s values to select data.
df[df["A"] > 0]
A | B | C | D | |
---|---|---|---|---|
2021-01-01 | 1.000000 | -0.344268 | 0.550256 | -1.485372 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 0.439790 |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 0.014180 |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | -0.871126 |
Selecting values from a DataFrame where a boolean condition is met.
df[df > 0]
A | B | C | D | |
---|---|---|---|---|
2021-01-01 | 1.000000 | NaN | 0.550256 | NaN |
2021-01-02 | NaN | 0.529183 | NaN | 2.056552 |
2021-01-03 | 0.137383 | NaN | 0.563527 | 1.377882 |
2021-01-04 | 0.827011 | 0.525581 | NaN | 0.439790 |
2021-01-05 | 0.692313 | NaN | NaN | 0.014180 |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | NaN |
Let’s make a copy of the df object’s indices and data. We use the default deep=True to create a new object with a copy of the calling object’s data and indices. Modifications to the data or indices of the copy will not be reflected in the original object. Note that a shallow copy (deep=False) shares data and index with the original.
df_copy = df.copy()
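A minimal sketch of why the deep copy matters (using a small hypothetical frame, separate from `df`): changes to the copy leave the original untouched.

```python
import pandas as pd

original = pd.DataFrame({"A": [1, 2, 3]})

deep = original.copy()   # deep=True is the default
deep.loc[0, "A"] = 99    # modify the copy only

# The original is untouched: the deep copy owns its own data
print(original.loc[0, "A"])  # 1
```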
Create a new column in the DataFrame:
df_copy["E"] = ["one", "one", "two", "three", "four", "three"]
df_copy
A | B | C | D | E | |
---|---|---|---|---|---|
2021-01-01 | 1.000000 | -0.344268 | 0.550256 | -1.485372 | one |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 2.056552 | one |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 | two |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 0.439790 | three |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 0.014180 | four |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | -0.871126 | three |
Using the isin() method for filtering:
df_copy[df_copy["E"].isin(["two", "four"])]
A | B | C | D | E | |
---|---|---|---|---|---|
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 | two |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 0.014180 | four |
Setting#
Setting a new column automatically aligns the data by the indexes.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20210102", periods=6))
s1
2021-01-02 1
2021-01-03 2
2021-01-04 3
2021-01-05 4
2021-01-06 5
2021-01-07 6
Freq: D, dtype: int64
df["F"] = s1
df
A | B | C | D | F | |
---|---|---|---|---|---|
2021-01-01 | 1.000000 | -0.344268 | 0.550256 | -1.485372 | NaN |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 2.056552 | 1.0 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 | 2.0 |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 0.439790 | 3.0 |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 0.014180 | 4.0 |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | -0.871126 | 5.0 |
Setting values by label:
df.at[dates[0], "A"] = 0
df
A | B | C | D | F | |
---|---|---|---|---|---|
2021-01-01 | 0.000000 | -0.344268 | 0.550256 | -1.485372 | NaN |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 2.056552 | 1.0 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 | 2.0 |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 0.439790 | 3.0 |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 0.014180 | 4.0 |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | -0.871126 | 5.0 |
Setting values by position:
df.iat[0, 1] = 0
df
A | B | C | D | F | |
---|---|---|---|---|---|
2021-01-01 | 0.000000 | 0.000000 | 0.550256 | -1.485372 | NaN |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 2.056552 | 1.0 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 1.377882 | 2.0 |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 0.439790 | 3.0 |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 0.014180 | 4.0 |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | -0.871126 | 5.0 |
Setting by assigning with a NumPy array:
df.loc[:, "D"] = np.array([5] * len(df))
df
A | B | C | D | F | |
---|---|---|---|---|---|
2021-01-01 | 0.000000 | 0.000000 | 0.550256 | 5 | NaN |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 5 | 1.0 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 5 | 2.0 |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 5 | 3.0 |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 5 | 4.0 |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | 5 | 5.0 |
A where operation with setting.
df3 = df.copy()
df3[df3 > 0] = -df3
df3
A | B | C | D | F | |
---|---|---|---|---|---|
2021-01-01 | 0.000000 | 0.000000 | -0.550256 | -5 | NaN |
2021-01-02 | -1.539542 | -0.529183 | -0.137668 | -5 | -1.0 |
2021-01-03 | -0.137383 | -0.872425 | -0.563527 | -5 | -2.0 |
2021-01-04 | -0.827011 | -0.525581 | -2.979430 | -5 | -3.0 |
2021-01-05 | -0.692313 | -0.846918 | -0.850518 | -5 | -4.0 |
2021-01-06 | -0.069261 | -0.293929 | -1.020163 | -5 | -5.0 |
Missing data#
pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See the Missing Data section for details.
Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.
df4 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df4
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2021-01-01 | 0.000000 | 0.000000 | 0.550256 | 5 | NaN | NaN |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 5 | 1.0 | NaN |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 5 | 2.0 | NaN |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 5 | 3.0 | NaN |
df4.loc[dates[0] : dates[1], "E"] = 1
df4
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2021-01-01 | 0.000000 | 0.000000 | 0.550256 | 5 | NaN | 1.0 |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 5 | 1.0 | 1.0 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 5 | 2.0 | NaN |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 5 | 3.0 | NaN |
To drop any rows that have missing data.
df4.dropna(how="any")
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 5 | 1.0 | 1.0 |
Filling missing data.
df4.fillna(value=5)
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2021-01-01 | 0.000000 | 0.000000 | 0.550256 | 5 | 5.0 | 1.0 |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 5 | 1.0 | 1.0 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 5 | 2.0 | 5.0 |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 5 | 3.0 | 5.0 |
To get the boolean mask where values are NaN:
pd.isna(df4)
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2021-01-01 | False | False | False | False | True | False |
2021-01-02 | False | False | False | False | False | False |
2021-01-03 | False | False | False | False | False | True |
2021-01-04 | False | False | False | False | False | True |
Operations#
Note that operations in general exclude missing data.
Statistics#
Performing a descriptive statistic:
df.dtypes
A float64
B float64
C float64
D int64
F float64
dtype: object
df.mean()
A 0.031071
B -0.061775
C -0.305611
D 5.000000
F 3.000000
dtype: float64
Same operation on the other axis:
df.mean(axis=1)
2021-01-01 1.387564
2021-01-02 0.970395
2021-01-03 1.365697
2021-01-04 1.274633
2021-01-05 1.598975
2021-01-06 2.276671
Freq: D, dtype: float64
Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension.
Let’s use the shift() function to shift the values forward by 2 periods along the index (the first two entries become NaN):
s2 = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
s2
2021-01-01 NaN
2021-01-02 NaN
2021-01-03 1.0
2021-01-04 3.0
2021-01-05 5.0
2021-01-06 NaN
Freq: D, dtype: float64
We can use the sub() function to subtract the elements of a Series from the matching rows of a DataFrame, aligning on the index.
df
A | B | C | D | F | |
---|---|---|---|---|---|
2021-01-01 | 0.000000 | 0.000000 | 0.550256 | 5 | NaN |
2021-01-02 | -1.539542 | 0.529183 | -0.137668 | 5 | 1.0 |
2021-01-03 | 0.137383 | -0.872425 | 0.563527 | 5 | 2.0 |
2021-01-04 | 0.827011 | 0.525581 | -2.979430 | 5 | 3.0 |
2021-01-05 | 0.692313 | -0.846918 | -0.850518 | 5 | 4.0 |
2021-01-06 | 0.069261 | 0.293929 | 1.020163 | 5 | 5.0 |
df.sub(s2, axis="index")
A | B | C | D | F | |
---|---|---|---|---|---|
2021-01-01 | NaN | NaN | NaN | NaN | NaN |
2021-01-02 | NaN | NaN | NaN | NaN | NaN |
2021-01-03 | -0.862617 | -1.872425 | -0.436473 | 4.0 | 1.0 |
2021-01-04 | -2.172989 | -2.474419 | -5.979430 | 2.0 | 0.0 |
2021-01-05 | -4.307687 | -5.846918 | -5.850518 | 0.0 | -1.0 |
2021-01-06 | NaN | NaN | NaN | NaN | NaN |
Apply#
Applying functions to data:
df.apply(np.cumsum)
A | B | C | D | F | |
---|---|---|---|---|---|
2021-01-01 | 0.000000 | 0.000000 | 0.550256 | 5 | NaN |
2021-01-02 | -1.539542 | 0.529183 | 0.412589 | 10 | 1.0 |
2021-01-03 | -1.402159 | -0.343242 | 0.976116 | 15 | 3.0 |
2021-01-04 | -0.575148 | 0.182339 | -2.003314 | 20 | 6.0 |
2021-01-05 | 0.117166 | -0.664579 | -2.853831 | 25 | 10.0 |
2021-01-06 | 0.186426 | -0.370650 | -1.833668 | 30 | 15.0 |
A lambda function is a small anonymous function. It can take any number of arguments, but can only have one expression:
df.apply(lambda x: x.max() - x.min())
A 2.366554
B 1.401608
C 3.999593
D 0.000000
F 4.000000
dtype: float64
Value counts (histogramming)#
s3 = pd.Series(np.random.randint(0, 4, size=5))
s3
0 2
1 0
2 1
3 3
4 3
dtype: int64
s3.value_counts()
3 2
2 1
0 1
1 1
dtype: int64
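Beyond raw counts, value_counts() also accepts a normalize flag to return relative frequencies; a small sketch on a fixed Series (rather than the random s3 above):

```python
import pandas as pd

s_counts = pd.Series([0, 1, 1, 3, 3])

counts = s_counts.value_counts()                # absolute counts
freqs = s_counts.value_counts(normalize=True)   # relative frequencies, sum to 1

print(counts.to_dict())  # {1: 2, 3: 2, 0: 1}
```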
String methods#
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them).
s4 = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
s4
0 A
1 B
2 C
3 Aaba
4 Baca
5 NaN
6 CABA
7 dog
8 cat
dtype: object
s4.str.lower()
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
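To illustrate the regex default mentioned above, here is a small sketch with str.contains(), whose pattern is interpreted as a regular expression by default:

```python
import pandas as pd

s_pat = pd.Series(["Aaba", "Baca", "CABA", "dog"])

# '.' is a regex wildcard; case=False makes the match case-insensitive
has_axa = s_pat.str.contains("a.a", case=False)

print(has_axa.tolist())  # [True, True, True, False]
```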
Merge#
Concat#
pandas provides various facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational-algebra functionality in the case of join / merge-type operations.
Concatenating pandas objects together with concat():
df5 = pd.DataFrame(np.random.randn(10, 4))
df5
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | -0.765385 | -0.645532 | 1.077812 | 0.062465 |
1 | 0.411234 | -0.846388 | -1.143676 | 0.049230 |
2 | 0.277718 | 0.934318 | -0.432680 | -0.334319 |
3 | -0.739204 | -0.275935 | 1.653829 | -0.171295 |
4 | -1.692065 | -0.941707 | 0.806281 | 0.334500 |
5 | -0.933935 | -0.753211 | 0.570971 | -1.406925 |
6 | -0.926019 | 0.511278 | 0.207824 | -0.209196 |
7 | -0.002946 | -1.885014 | 1.971658 | 0.551915 |
8 | -1.750871 | 0.080519 | -1.743010 | -0.398808 |
9 | -0.595343 | 0.089527 | 0.720991 | 0.164519 |
# break it into pieces
pieces = [df5[:3], df5[3:7], df5[7:]]
pieces
[ 0 1 2 3
0 -0.765385 -0.645532 1.077812 0.062465
1 0.411234 -0.846388 -1.143676 0.049230
2 0.277718 0.934318 -0.432680 -0.334319,
0 1 2 3
3 -0.739204 -0.275935 1.653829 -0.171295
4 -1.692065 -0.941707 0.806281 0.334500
5 -0.933935 -0.753211 0.570971 -1.406925
6 -0.926019 0.511278 0.207824 -0.209196,
0 1 2 3
7 -0.002946 -1.885014 1.971658 0.551915
8 -1.750871 0.080519 -1.743010 -0.398808
9 -0.595343 0.089527 0.720991 0.164519]
pd.concat(pieces)
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | -0.765385 | -0.645532 | 1.077812 | 0.062465 |
1 | 0.411234 | -0.846388 | -1.143676 | 0.049230 |
2 | 0.277718 | 0.934318 | -0.432680 | -0.334319 |
3 | -0.739204 | -0.275935 | 1.653829 | -0.171295 |
4 | -1.692065 | -0.941707 | 0.806281 | 0.334500 |
5 | -0.933935 | -0.753211 | 0.570971 | -1.406925 |
6 | -0.926019 | 0.511278 | 0.207824 | -0.209196 |
7 | -0.002946 | -1.885014 | 1.971658 | 0.551915 |
8 | -1.750871 | 0.080519 | -1.743010 | -0.398808 |
9 | -0.595343 | 0.089527 | 0.720991 | 0.164519 |
Join#
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
print(left)
print("-" * 15)
print(right)
key lval
0 foo 1
1 foo 2
---------------
key rval
0 foo 4
1 foo 5
pd.merge(left, right, on="key")
key | lval | rval | |
---|---|---|---|
0 | foo | 1 | 4 |
1 | foo | 1 | 5 |
2 | foo | 2 | 4 |
3 | foo | 2 | 5 |
Another example, this time with unique keys:
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
print(left)
print("-" * 15)
print(right)
key lval
0 foo 1
1 bar 2
---------------
key rval
0 foo 4
1 bar 5
pd.merge(left, right, on="key")
key | lval | rval | |
---|---|---|---|
0 | foo | 1 | 4 |
1 | bar | 2 | 5 |
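When the key columns do not fully overlap, merge()'s how parameter controls which keys survive; a small sketch with hypothetical frames:

```python
import pandas as pd

left2 = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right2 = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})

inner = pd.merge(left2, right2, on="key")               # intersection of keys
outer = pd.merge(left2, right2, on="key", how="outer")  # union of keys

print(len(inner), len(outer))  # 1 3
```

Non-matching rows in an outer merge get NaN in the columns coming from the other frame.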
Grouping#
By “group by” we are referring to a process involving one or more of the following steps:
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
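The three steps can be sketched on a tiny hypothetical frame before the fuller example below:

```python
import pandas as pd

demo_df = pd.DataFrame({"A": ["foo", "bar", "foo"], "C": [1.0, 2.0, 3.0]})

# Split by column A, apply sum() to each group, combine into a Series
sums = demo_df.groupby("A")["C"].sum()

print(sums.to_dict())  # {'bar': 2.0, 'foo': 4.0}
```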
df6 = pd.DataFrame(
{
"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": ["one", "one", "two", "three", "two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8),
}
)
df6
A | B | C | D | |
---|---|---|---|---|
0 | foo | one | 0.002769 | 0.045595 |
1 | bar | one | 0.707007 | 0.688803 |
2 | foo | two | -0.808793 | 0.810861 |
3 | bar | three | -1.141795 | 0.224594 |
4 | foo | two | 0.036662 | -0.332820 |
5 | bar | two | 0.263353 | -0.209618 |
6 | foo | one | -0.328079 | -0.509825 |
7 | foo | three | -1.071213 | -1.171025 |
Grouping and then applying the sum() function to the resulting groups.
df6.groupby("A").sum(numeric_only=True)
C | D | |
---|---|---|
A | ||
bar | -0.171435 | 0.703779 |
foo | -2.168655 | -1.157213 |
Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function.
df6.groupby(["A", "B"]).sum()
C | D | ||
---|---|---|---|
A | B | ||
bar | one | 0.707007 | 0.688803 |
three | -1.141795 | 0.224594 | |
two | 0.263353 | -0.209618 | |
foo | one | -0.325311 | -0.464230 |
three | -1.071213 | -1.171025 | |
two | -0.772132 | 0.478042 |
Reshaping#
Stack#
tuples = list(
zip(
*[
["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
["one", "two", "one", "two", "one", "two", "one", "two"],
]
)
)
tuples
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
index
MultiIndex([('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')],
names=['first', 'second'])