
I want to use this post to share some pandas snippets that I find useful. I use them from time to time, in particular when I’m doing time series competitions on platforms such as Kaggle. Like any data scientist, I perform similar data processing steps on different datasets. Usually, I put repetitive patterns in xam, which is my personal data science toolbox. However, I think that the following snippets are too small and too specific to be added to a library.

Some of the following code assumes that you’re using version 1.0 of pandas, which notably added support for typed missing values via pd.NA. Note that the following code is not guaranteed to be the optimal way of proceeding. If you have a quicker solution, then feel free to post a comment at the bottom of this post.

Time since/until event

Take the following series of weather indications:

import pandas as pd

weather = ['rain', 'sun', 'snow', 'snow', 'snow', 'sun', 'rain', 'rain']
weather = pd.Series(weather, dtype='category')
weather
0    rain
1     sun
2    snow
3    snow
4    snow
5     sun
6    rain
7    rain
dtype: category
Categories (3, object): [rain, snow, sun]

Depending on the problem, it might be worthwhile to extract the time since an event occurred. For instance, we might want to produce a feature that indicates how long ago the 'rain' value last occurred. For the sake of simplicity, I’m going to assume that the period between each row is constant. I will thus simply count the number of steps that have occurred since each value appeared. Ideally, we want to have one column for each distinct value. Here is the code I use for doing so:

def steps_since(series):

    features = pd.DataFrame(index=series.index)

    for cat in series.cat.categories:
        # increments by 1 every time the category occurs
        counts = series.eq(cat).cumsum()
        # number the rows within each run of identical counts: 0, 1, 2, ...
        since = counts.groupby(counts).cumcount().astype('Int64')
        # the category hasn't occurred yet while the count is still 0
        since[counts.eq(0)] = pd.NA
        features[f'steps_since_{cat}'] = since

    return features

steps_since(weather)
   steps_since_rain  steps_since_snow  steps_since_sun
0                 0              <NA>             <NA>
1                 1              <NA>                0
2                 2                 0                1
3                 3                 0                2
4                 4                 0                3
5                 5                 1                0
6                 0                 2                1
7                 0                 3                2

I’ve assumed that the provided series contains categorical data. This allows us to access the list of categories via series.cat.categories. You could use series.unique() instead. I’ve also taken care of inserting NAs for values that haven’t occurred yet. Before version 1.0, missing values were commonly indicated with np.nan, but this required the series to be of type float64. The modern approach is to use the Int64 dtype in combination with pd.NA to store nullable integers.
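
For instance, here is a small comparison of the two representations on a throwaway series:

pd.Series([1, None, 3])                 # float64, the missing value becomes NaN
pd.Series([1, None, 3], dtype='Int64')  # integers are preserved, the missing value is <NA>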

How about the following dataset?

weather_per_country = pd.DataFrame({
    'country': pd.Categorical(['France'] * 4 + ['Germany'] * 4),
    'weather': weather
})
weather_per_country
   country weather
0   France    rain
1   France     sun
2   France    snow
3   France    snow
4  Germany    snow
5  Germany     sun
6  Germany    rain
7  Germany    rain

In this case, we might want to extract the time since each event, within each country. We can do so via a groupby:

pd.concat(
    steps_since(g['weather']).assign(country=country)
    for country, g in weather_per_country.groupby('country')
)
   steps_since_rain  steps_since_snow  steps_since_sun  country
0                 0              <NA>             <NA>   France
1                 1              <NA>                0   France
2                 2                 0                1   France
3                 3                 0                2   France
4              <NA>                 0             <NA>  Germany
5              <NA>                 1                0  Germany
6                 0                 2                1  Germany
7                 0                 3                2  Germany

Finally, what about obtaining the number of steps until an event, rather than since an event? This can be done by rewriting the steps_since function so that it reverses the series before counting:

def steps_until(series):

    features = pd.DataFrame(index=series.index)

    for cat in series.cat.categories:
        # same trick as steps_since, but the counting runs over the reversed series
        counts = series.eq(cat).iloc[::-1].cumsum()
        until = counts.groupby(counts).cumcount().astype('Int64')
        until[counts.eq(0)] = pd.NA
        features[f'steps_until_{cat}'] = until

    return features

steps_until(weather)
   steps_until_rain  steps_until_snow  steps_until_sun
0                 0                 2                1
1                 5                 1                0
2                 4                 0                3
3                 3                 0                2
4                 2                 0                1
5                 1              <NA>                0
6                 0              <NA>             <NA>
7                 0              <NA>             <NA>
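
As an aside, steps_until is really just steps_since applied to the reversed series. If you dislike the duplication, a hypothetical refactor could thus look like this:

def steps_until(series):
    # count the steps since each event on the reversed series, then flip back
    features = steps_since(series.iloc[::-1]).iloc[::-1]
    return features.rename(columns=lambda c: c.replace('since', 'until'))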

Longest event streak

Next, take the following series of temperature measurements:

temperature = [18, pd.NA, pd.NA, pd.NA, 21, pd.NA, pd.NA, 19]
temperature = pd.Series(temperature, name='temperature')
temperature
0      18
1    <NA>
2    <NA>
3    <NA>
4      21
5    <NA>
6    <NA>
7      19
Name: temperature, dtype: object

In this case, you might want to interpolate the missing values before feeding the data to a machine learning model. However, in order for your interpolation to be robust, you might want to verify that the maximum number of consecutive missing values isn’t too large. You can calculate the latter like so:

missing = temperature.isnull()
missing.groupby((~missing).cumsum()).sum().max()
3

Naturally, you can adapt this code to count consecutive occurrences of arbitrary events. Indeed, the definition of the missing series could be replaced by a different condition altogether. For instance, we can determine the longest streak of rainy days like this:

rainy = weather.eq('rain')
rainy.groupby((~rainy).cumsum()).sum().max()
2
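
If the one-liner feels opaque, here is the same computation broken into steps, with intermediate names of my own choosing:

rainy = weather.eq('rain')             # True on rainy days
blocks = (~rainy).cumsum()             # increases by 1 on every non-rainy day
streaks = rainy.groupby(blocks).sum()  # length of the rainy streak within each block
streaks.max()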

Target encoding for time series

The one thing I do in every time series competition is target encoding. In short, the goal of target encoding is to replace each category with the average of the target values of the rows that belong to said category. Naturally, you’re not limited to using an average. You can also use Bayesian target encoding if some of your categories are rare.
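
For reference, here is what the vanilla, non-temporal version might look like; df, 'category', and 'target' are hypothetical names standing in for your own data:

# map each category to the mean of its target values
means = df.groupby('category')['target'].mean()
df['category_mean'] = df['category'].map(means)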

Target encoding is very important for time series data. It has been used in almost every top model on Kaggle time series competitions, such as ASHRAE, Recruit Restaurants, and Wikipedia Web Traffic. When you do target encoding on temporal data, you need to be careful not to leak current and future information into the aggregate of each row. Basically, you need to shift the target values backwards within each group. Indeed, at any given moment, we want our aggregate to pertain to past data only. As an example, consider the following toy dataset:

readings = pd.DataFrame(
    [
        ('A', 'Saturday', 101),
        ('A', 'Sunday', 88),
        ('A', 'Saturday', 103),
        ('A', 'Sunday', 82),
        ('A', 'Saturday', 100),
        ('B', 'Saturday', 27),
        ('B', 'Sunday', 13),
        ('B', 'Saturday', 21),
        ('B', 'Sunday', 17),
        ('B', 'Saturday', 25)
    ],
    columns=['building', 'day', 'reading']
)
readings
  building       day  reading
0        A  Saturday      101
1        A    Sunday       88
2        A  Saturday      103
3        A    Sunday       82
4        A  Saturday      100
5        B  Saturday       27
6        B    Sunday       13
7        B  Saturday       21
8        B    Sunday       17
9        B  Saturday       25
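
Before diving into the full machinery, note that the anti-leakage trick boils down to a shift within each group. For instance:

# each row only sees the reading that immediately precedes it
# within the same building
readings.groupby('building')['reading'].shift(1)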

Let’s say we want to calculate some target encodings for different column combinations. I like to define each target encoding as a tuple (on, lag, by, how) where:

  • on is the column on which to perform the aggregate.
  • lag is the size of the window.
  • by determines which column(s) to group by.
  • how lists the statistics to compute.

I then store all the tuples I want to compute in a list:

AGGS = [
    ('reading', 2, ['building'], ['mean']),
    ('reading', 3, ['building'], ['mean']),
    ('reading', 3, ['day'], ['mean', 'max']),
    ('reading', 3, ['building', 'day'], ['mean'])
]

I then proceed as follows:

features = readings.copy()

for on, lag, by, hows in AGGS:

    # shift within each group so that a row only sees strictly past values
    agg = readings.groupby(by)[on].apply(lambda x: (
        x
        .shift(1)
        .rolling(window=lag, min_periods=1)
        .agg(hows)
    ))
    # give each statistic a descriptive column name
    agg = agg.rename(columns={
        how: f'{how}_of_last_{lag}_{on}_by_' + '_and_'.join(by)
        for how in hows
    })

    features = features.join(agg)

features
  building       day  reading  mean_of_last_3_reading_by_building  mean_of_last_3_reading_by_day  max_of_last_3_reading_by_day  mean_of_last_3_reading_by_building_and_day
0        A  Saturday      101                                 nan                            nan                           nan                                         nan
1        A    Sunday       88                                 101                            nan                           nan                                         nan
2        A  Saturday      103                                94.5                            101                           101                                         101
3        A    Sunday       82                             97.3333                             88                            88                                          88
4        A  Saturday      100                                  91                            102                           103                                         102
5        B  Saturday       27                                 nan                        101.333                           103                                         nan
6        B    Sunday       13                                  27                             85                            88                                         nan
7        B  Saturday       21                                  20                        76.6667                           103                                          27
8        B    Sunday       17                             20.3333                             61                            88                                          13
9        B  Saturday       25                                  17                        49.3333                           100                                          24

Note that I have set min_periods to 1 so that statistics are still computed when there are fewer than lag values in a given window. As you can see, NaNs appear when there are no past values to aggregate over.
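
To see what min_periods changes, compare the two following calls on a dummy series of my own making:

s = pd.Series([1, 2, 3])

s.rolling(window=2).mean()                 # NaN, 1.5, 2.5
s.rolling(window=2, min_periods=1).mean()  # 1.0, 1.5, 2.5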

When I do a Kaggle competition, I run the above code on a small subset of the data and manually check that it’s doing what I want it to do. This is akin to pointing and calling, which I highly recommend doing in any data science endeavour whatsoever.

You might have noticed that I used apply to shift the values backwards before aggregating. Since version 1.0, it’s also possible to define a custom window that does the shifting for us. This may be done by subclassing pd.api.indexers.BaseIndexer:

import numpy as np

class ShiftedWindow(pd.api.indexers.BaseIndexer):

    def __init__(self, window_size):
        self.window_size = window_size

    def get_window_bounds(self, num_values, min_periods, center, closed):

        # row i sees the half-open range [i - window_size, i),
        # which excludes the current value
        starts = np.arange(-self.window_size, num_values - self.window_size)
        ends = starts + self.window_size
        starts[:self.window_size] = 0  # clip the first windows at 0

        return starts, ends

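Custom windows already work on plain series, so ShiftedWindow can be sanity-checked outside of a groupby. For instance, the following should average each pair of preceding values:

# each value is aggregated over the window of values that precede it
pd.Series([1, 2, 3, 4]).rolling(ShiftedWindow(2), min_periods=1).mean()
# NaN, 1.0, 1.5, 2.5
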
This allows us to rewrite the above snippet as follows:

features = readings.copy()

for on, lag, by, hows in AGGS:

    agg = (
        readings
        .groupby(by)[on]
        .rolling(ShiftedWindow(lag), min_periods=1)
        .agg(hows)
    )
    agg = agg.rename(columns={
        how: f'{how}_of_last_{lag}_{on}_by_' + '_and_'.join(by)
        for how in hows
    })

    features = features.join(agg)

However, it turns out that custom windows within a groupby will only work once version 1.1.1 is released. As of writing this blog post, the above code will not produce the expected output, even though it won’t raise any exception. But if you’re reading from the near future, then it will work!

That’s all for now! I purposefully didn’t go into the detail of each line of code. Instead, I expect people to copy/paste the above snippets and (hopefully) play around with them by themselves. Feel free to drop a comment if you have any questions.