A few intermediate pandas tricks
I want to use this post to share some pandas snippets that I find useful. I use them from time to time, in particular when I’m doing time series competitions on platforms such as Kaggle. Like any data scientist, I perform similar data processing steps on different datasets. Usually, I put repetitive patterns in xam, which is my personal data science toolbox. However, I think that the following snippets are too small and too specific to be added to a library.
Some of the following code assumes that you’re using version 1.0 of pandas, which in particular added support for typed missing values. Note that the following code is not guaranteed to be the optimal way of proceeding. If you have a quicker solution, then feel welcome to post a comment at the bottom of this post.
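If in doubt, you can check which version you have installed:

```python
import pandas as pd

# The snippets below assume pandas 1.0 or above
print(pd.__version__)
```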
Time since/until event
Take the following series of weather indications:
```python
import pandas as pd

weather = ['rain', 'sun', 'snow', 'snow', 'snow', 'sun', 'rain', 'rain']
weather = pd.Series(weather, dtype='category')
weather
```
weather |
---|
rain |
sun |
snow |
snow |
snow |
sun |
rain |
rain |
Depending on the problem, it might be worthwhile to extract the time since an event occurred. For instance, we might want to produce a feature that indicates how long ago the 'rain' value last occurred. For the sake of simplicity, I’m going to assume that the period between each row is constant. I will thus simply count the number of steps that have occurred since each value last appeared. Ideally, we want to have one column for each distinct value. Here is the code I use for doing so:
```python
def steps_since(series):
    features = pd.DataFrame(index=series.index)
    for cat in series.cat.categories:
        # The cumulative count only increases when the category occurs,
        # so rows sharing a count all follow the same occurrence
        counts = series.eq(cat).cumsum()
        since = counts.groupby(counts).cumcount().astype('Int64')
        # Rows before the first occurrence get a missing value
        since[counts.eq(0)] = pd.NA
        features[f'steps_since_{cat}'] = since
    return features

steps_since(weather)
```
steps_since_rain | steps_since_snow | steps_since_sun |
---|---|---|
0 | NA | NA |
1 | NA | 0 |
2 | 0 | 1 |
3 | 0 | 2 |
4 | 0 | 3 |
5 | 1 | 0 |
0 | 2 | 1 |
0 | 3 | 2 |
I’ve assumed that the provided series contains categorical data. This allows us to access the list of categories via series.cat.categories. You could use series.unique() instead. I’ve also taken care of inserting NAs for values that haven’t yet occurred. Before version 1.0, missing values were commonly indicated with np.nan, but this required the series to be of type float64. The modern approach is to use the Int64 type in combination with pd.NA to store nullable integers.
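To make the difference concrete, here is a small comparison of my own, which isn’t part of the original snippets:

```python
import numpy as np

# np.nan silently casts the integers to float64
pd.Series([1, 2, np.nan]).dtype                # dtype('float64')

# pd.NA with the Int64 extension dtype keeps them as integers
pd.Series([1, 2, pd.NA], dtype='Int64').dtype  # Int64Dtype()
```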
How about the following dataset?
```python
weather_per_country = pd.DataFrame({
    'country': pd.Categorical(['France'] * 4 + ['Germany'] * 4),
    'weather': weather
})

weather_per_country
```
country | weather |
---|---|
France | rain |
France | sun |
France | snow |
France | snow |
Germany | snow |
Germany | sun |
Germany | rain |
Germany | sun |
In this case, we might want to extract the time since each event, within each country. We can do so via a groupby:
```python
pd.concat(
    steps_since(g['weather']).assign(country=country)
    for country, g in weather_per_country.groupby('country')
)
```
steps_since_rain | steps_since_snow | steps_since_sun | country |
---|---|---|---|
0 | NA | NA | France |
1 | NA | 0 | France |
2 | 0 | 1 | France |
3 | 0 | 2 | France |
NA | 0 | NA | Germany |
NA | 1 | 0 | Germany |
0 | 2 | 1 | Germany |
0 | 3 | 2 | Germany |
Finally, what about obtaining the number of steps until an event, rather than since an event? This can be done by adapting the steps_since function so that it inverts the series before counting:
```python
def steps_until(series):
    features = pd.DataFrame(index=series.index)
    for cat in series.cat.categories:
        # Reversing the series makes the cumulative count run from the
        # end, so the counting happens towards the next occurrence
        counts = series.eq(cat).iloc[::-1].cumsum()
        until = counts.groupby(counts).cumcount().astype('Int64')
        until[counts.eq(0)] = pd.NA
        features[f'steps_until_{cat}'] = until
    return features

steps_until(weather)
```
steps_until_rain | steps_until_snow | steps_until_sun |
---|---|---|
0 | 2 | 1 |
5 | 1 | 0 |
4 | 0 | 3 |
3 | 0 | 2 |
2 | 0 | 1 |
1 | NA | 0 |
0 | NA | NA |
0 | NA | NA |
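Note that steps_since and steps_until return frames that share the input series’ index, so you can concatenate both feature sets if you need them together; this is a usage example of my own:

```python
# Both frames are aligned on the series' index
both = pd.concat([steps_since(weather), steps_until(weather)], axis=1)
both.columns.tolist()
# ['steps_since_rain', 'steps_since_snow', 'steps_since_sun',
#  'steps_until_rain', 'steps_until_snow', 'steps_until_sun']
```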
Longest event streak
Next, take the following series of temperature measurements:
```python
temperature = [18, pd.NA, pd.NA, pd.NA, 21, pd.NA, pd.NA, 19]
temperature = pd.Series(temperature, name='temperature')
temperature
```
temperature |
---|
18 |
NA |
NA |
NA |
21 |
NA |
NA |
19 |
In this case, you might want to interpolate the missing values before feeding the data to a machine learning model. However, in order for your interpolation to be robust, you might want to verify that the maximum number of consecutive missing values isn’t too large. You can calculate the latter like so:
```python
missing = temperature.isnull()
missing.groupby((~missing).cumsum()).sum().max()
```
3
Naturally, you can adapt this code to count consecutive occurrences of arbitrary events. Indeed, the missing series definition could be replaced by a different condition altogether. For instance, we can determine the longest streak of rainy days like this:
```python
rainy = weather.eq('rain')
rainy.groupby((~rainy).cumsum()).sum().max()
```
2
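If you find yourself doing this a lot, you can wrap the trick into a small helper; the following generalisation is mine, not part of the original snippets:

```python
def longest_streak(condition):
    """Length of the longest run of consecutive True values."""
    # Each False increments the group id, so every run of consecutive
    # Trues lands in its own group; summing counts the run lengths
    return condition.groupby((~condition).cumsum()).sum().max()

longest_streak(weather.eq('rain'))    # 2
longest_streak(temperature.isnull())  # 3
```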
Target encoding for time series
The one thing I do in every time series competition is target encoding. In short, the goal of target encoding is to replace a category with the average of the target values of the rows that belong to said category. Naturally, you’re not limited to using an average. You can also use Bayesian target encoding if some of your categories are rare.
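To sketch the latter idea, a common smoothing scheme blends each category’s mean with the global mean in proportion to the category’s size. The following function is my own illustration rather than a recipe from any particular competition:

```python
def smoothed_target_encoding(y, by, m=10.0):
    """Blend each category's mean with the global mean."""
    global_mean = y.mean()
    counts = by.map(by.value_counts())    # size of each category
    means = by.map(y.groupby(by).mean())  # mean target per category
    # Small categories get pulled towards the global mean; m controls
    # the strength of the pull
    return (counts * means + m * global_mean) / (counts + m)
```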
Target encoding is very important for time series data. It has been used in almost every top solution to Kaggle time series competitions, such as ASHRAE, Recruit Restaurants, and Wikipedia Web Traffic. When you do target encoding on temporal data, you need to be careful not to leak current and future information into the aggregate of each row. Basically, you need to shift the target values backwards within each group. Indeed, for any given moment, we want our aggregate to pertain to past data only. As an example, consider the following toy dataset:
```python
readings = pd.DataFrame(
    [
        ('A', 'Saturday', 101),
        ('A', 'Sunday', 88),
        ('A', 'Saturday', 103),
        ('A', 'Sunday', 82),
        ('A', 'Saturday', 100),
        ('B', 'Saturday', 27),
        ('B', 'Sunday', 13),
        ('B', 'Saturday', 21),
        ('B', 'Sunday', 17),
        ('B', 'Saturday', 25)
    ],
    columns=['building', 'day', 'reading']
)

readings
```
building | day | reading |
---|---|---|
A | Saturday | 101 |
A | Sunday | 88 |
A | Saturday | 103 |
A | Sunday | 82 |
A | Saturday | 100 |
B | Saturday | 27 |
B | Sunday | 13 |
B | Saturday | 21 |
B | Sunday | 17 |
B | Saturday | 25 |
Let’s say we want to calculate some target encodings for different column combinations. I like to define each target encoding as a tuple (on, lag, by, how) where:

- on is the column on which to perform the aggregate.
- lag is the size of the window.
- by determines on which column(s) to do the grouping.
- how indicates the statistics to compute.
I then store all the tuples I want to compute in a list:
```python
AGGS = [
    ('reading', 2, ['building'], ['mean']),
    ('reading', 3, ['building'], ['mean']),
    ('reading', 3, ['day'], ['mean', 'max']),
    ('reading', 3, ['building', 'day'], ['mean'])
]
```
I then proceed as follows:
```python
features = readings.copy()

for on, lag, by, hows in AGGS:
    agg = readings.groupby(by)[on].apply(lambda x: (
        x
        .shift(1)  # exclude the current value so only the past is used
        .rolling(window=lag, min_periods=1)
        .agg(hows)
    ))
    agg = agg.rename(columns={
        how: f'{how}_of_last_{lag}_{on}_by_' + '_and_'.join(by)
        for how in hows
    })
    features = features.join(agg)

features
```
building | day | reading | mean_of_last_2_reading_by_building | mean_of_last_3_reading_by_building | mean_of_last_3_reading_by_day | max_of_last_3_reading_by_day | mean_of_last_3_reading_by_building_and_day |
---|---|---|---|---|---|---|---|
A | Saturday | 101 | nan | nan | nan | nan | nan |
A | Sunday | 88 | 101 | 101 | nan | nan | nan |
A | Saturday | 103 | 94.5 | 94.5 | 101 | 101 | 101 |
A | Sunday | 82 | 95.5 | 97.3333 | 88 | 88 | 88 |
A | Saturday | 100 | 92.5 | 91 | 102 | 103 | 102 |
B | Saturday | 27 | nan | nan | 101.333 | 103 | nan |
B | Sunday | 13 | 27 | 27 | 85 | 88 | nan |
B | Saturday | 21 | 20 | 20 | 76.6667 | 103 | 27 |
B | Sunday | 17 | 17 | 20.3333 | 61 | 88 | 13 |
B | Saturday | 25 | 19 | 17 | 49.3333 | 100 | 24 |
Note that I have set min_periods to 1 so that statistics are still computed when there are fewer than lag values in a given window. As you can see, the nans are present when there are no past values to aggregate on.
When I do a Kaggle competition, I run the above code on a small subset of the data and manually check that it’s doing what I want it to do. This is akin to pointing and calling, which I highly recommend doing in any data science endeavour whatsoever.
You might have noticed that I used apply to shift the values backwards before aggregating. Since version 1.0, it’s also possible to define a custom window that does the shifting for us. This may be done by subclassing pd.api.indexers.BaseIndexer:
```python
import numpy as np

class ShiftedWindow(pd.api.indexers.BaseIndexer):

    def __init__(self, window_size):
        self.window_size = window_size

    def get_window_bounds(self, num_values, min_periods, center, closed):
        # Each window spans the window_size values strictly before the
        # current row; the first few windows are clipped at the start
        starts = np.arange(-self.window_size, num_values - self.window_size)
        ends = starts + self.window_size
        starts[:self.window_size] = 0
        return starts, ends
```
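As a quick sanity check of my own, the indexer already works on a plain series in version 1.0:

```python
s = pd.Series([1, 2, 3, 4, 5])
s.rolling(ShiftedWindow(2), min_periods=1).mean()
# 0    NaN    <- no past values yet
# 1    1.0    <- mean of [1]
# 2    1.5    <- mean of [1, 2]
# 3    2.5    <- mean of [2, 3]
# 4    3.5    <- mean of [3, 4]
```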
This allows us to rewrite the above snippet as follows:
```python
features = readings.copy()

for on, lag, by, hows in AGGS:
    agg = (
        readings
        .groupby(by)[on]
        .rolling(ShiftedWindow(lag), min_periods=1)
        .agg(hows)
    )
    agg = agg.rename(columns={
        how: f'{how}_of_last_{lag}_{on}_by_' + '_and_'.join(by)
        for how in hows
    })
    features = features.join(agg)
```
However, it turns out that custom windows within a groupby will only work once version 1.1.1 is released. As of writing this blog post, the above code will not produce the expected output, even though it won’t raise any exception. But if you’re reading from the near future, then it will work!
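In the meantime, a possible workaround, sketched here by me with the same shift-and-roll semantics as before, is to keep the apply pattern and let the custom window replace the explicit shift:

```python
features = readings.copy()

for on, lag, by, hows in AGGS:
    # Series.rolling accepts custom windows from 1.0 onwards, so applying
    # it within each group sidesteps the groupby.rolling limitation
    agg = readings.groupby(by)[on].apply(
        lambda x: x.rolling(ShiftedWindow(lag), min_periods=1).agg(hows)
    )
    agg = agg.rename(columns={
        how: f'{how}_of_last_{lag}_{on}_by_' + '_and_'.join(by)
        for how in hows
    })
    features = features.join(agg)
```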
That’s all for now! I purposefully didn’t go into the detail of each line of code. Rather, I expect people to copy/paste the above snippets and (hopefully) play around with them by themselves. Feel free to drop a comment if you have any questions whatsoever.