Table of contents

I want to use this post to share some pandas snippets that I find useful. I use them from time to time, in particular when I’m doing time series competitions on platforms such as Kaggle. Like any data scientist, I perform similar data processing steps on different datasets. Usually, I put repetitive patterns in xam, which is my personal data science toolbox. However, I think that the following snippets are too small and too specific for being added into a library.

Some of the following code assumes that you’re using version 1.0 of pandas, which in particular added support for typed missing values. Note that the following code is not guaranteed to be the optimal way of proceeding. If you have a quicker solution, then feel welcome to post a comment at the bottom of this post.

Time since/until event

Take the following series of weather indications:

import pandas as pd

weather = ['rain', 'sun', 'snow', 'snow', 'snow', 'sun', 'rain', 'rain']
weather = pd.Series(weather, dtype='category')
weather
weather
rain
sun
snow
snow
snow
sun
rain
rain

Depending on the problem, it might be worthwhile to extract the time since an event occurred. For instance, we might want to produce a feature that indicates how long ago the 'rain' value occurred. For the sake of simplicity, I’m going to assume that the period between each row is constant. I will thus simply count the number of steps that occured since each value appeared. Ideally, we want to have one column for each distinct value. Here is the code I use for doing so:

def steps_since(series):

    features = pd.DataFrame(index=series.index)

    for cat in series.cat.categories:
        counts = series.eq(cat).cumsum()
        since = counts.groupby(counts).cumcount().astype('Int64')
        since[counts.eq(0)] = pd.NA
        features[f'steps_since_{cat}'] = since

    return features

steps_since(weather)
steps_since_rain steps_since_snow steps_since_sun
0 NA NA
1 NA 0
2 0 1
3 0 2
4 0 3
5 1 0
0 2 1
0 3 2

I’ve assumed that the provided series contains categorical data. This allows us to access the list of categories via series.cat.categories. You could use series.unique() instead. I’ve also taken care of inserting NAs for values that haven’t yet occurred. Before version 1.0, missing values were commonly indicated with np.nan, but this required the series to be of type float64. The modern approach is to use the Int64 type in combination of pd.NA to store nullable integers.

How about the following dataset?

weather_per_country = pd.DataFrame({
    'country': pd.Categorical(['France'] * 4 + ['Germany'] * 4),
    'weather': weather
})
weather_per_country
country weather
France rain
France sun
France snow
France snow
Germany snow
Germany sun
Germany rain
Germany sun

In this case, we might want to extract the time since each event, within each country. We can so via a groupby:

pd.concat(
    steps_since(g['weather']).assign(country=country)
    for country, g in weather_per_country.groupby('country')
)
steps_since_rain steps_since_snow steps_since_sun country
0 NA NA France
1 NA 0 France
2 0 1 France
3 0 2 France
NA 0 NA Germany
NA 1 0 Germany
0 2 1 Germany
0 3 2 Germany

Finally, what about obtaining the number of steps until an event, rather than since an event? This can be done by rewriting the code steps_since function and inverting the series before counting:

def steps_until(series):

    features = pd.DataFrame(index=series.index)

    for cat in series.cat.categories:
        counts = series.eq(cat).iloc[::-1].cumsum()
        until = counts.groupby(counts).cumcount().astype('Int64')
        until[counts.eq(0)] = pd.NA
        features[f'steps_until_{cat}'] = until

    return features

steps_until(weather)
steps_until_rain steps_until_snow steps_until_sun
0 2 1
5 1 0
4 0 3
3 0 2
2 0 1
1 NA 0
0 NA NA
0 NA NA

Longest event streak

Next, take the following series of temperature measurements:

temperature = [18, pd.NA, pd.NA, pd.NA, 21, pd.NA, pd.NA, 19]
temperature = pd.Series(temperature, name='temperature')
temperature
temperature
18
NA
NA
NA
21
NA
NA
19

In this case, you might want to interpolate the missing values before feeding the data to a machine learning model. However, in order for your interpolation to be robust, you might want to verify that the maximum number of consecutive values isn’t too large. You can calculate the latter as so:

missing = temperature.isnull()
missing.groupby((~missing).cumsum()).sum().max()
3

Naturally, you can adapt this code to count consecutive values of arbitrary events. Indeed, the missing series definition could be replaced by a different condition altogether. For instance, we can determine the longest streak of rainy days like this:

rainy = weather.eq('rain')
rainy.groupby((~rainy).cumsum()).sum().max()
2

Target encoding for time series

The one thing I do in every time series competition is target encoding. In short, the goal of target encoding is to replace a category by the average of the target values of the rows that belong to said category. Naturally, you’re not limited to using an average. You can also use Bayesian target encoding if some of your categories are rare.

Target encoding is very important for time series data. It has been used in almost every top model on Kaggle time series competitions, such as ASHRAE, Recruit Restaurants, and Wikipedia Web Traffic. When you do target encoding on temporal data, you need to be wary of not leaking current and future information into the aggregate of each row. Basically, you need to shift the target values within each group backwards. Indeed, for any given moment, we want our aggregate to pertain to past data only. As an example, consider the following toy dataset:

readings = pd.DataFrame(
    [
        ('A', 'Saturday', 101),
        ('A', 'Sunday', 88),
        ('A', 'Saturday', 103),
        ('A', 'Sunday', 82),
        ('A', 'Saturday', 100),
        ('B', 'Saturday', 27),
        ('B', 'Sunday', 13),
        ('B', 'Saturday', 21),
        ('B', 'Sunday', 17),
        ('B', 'Saturday', 25)
    ],
    columns=['building', 'day', 'reading']
)
readings
building day reading
A Saturday 101
A Sunday 88
A Saturday 103
A Sunday 82
A Saturday 100
B Saturday 27
B Sunday 13
B Saturday 21
B Sunday 17
B Saturday 25

Let’s say we want to calculate some target encodings for different column combinations. I like to define each target encoding as a tuple (on, lag, by, how) where:

  • on is the column on which to perform the aggregate.
  • lag is the size of the window.
  • by determines on which column to do the grouping.
  • how indicates the statistics to compute.

I then store all the tuples I want to compute in a list:

AGGS = [
    ('reading', 2, ['building'], ['mean']),
    ('reading', 3, ['building'], ['mean']),
    ('reading', 3, ['day'], ['mean', 'max']),
    ('reading', 3, ['building', 'day'], ['mean'])
]

I then proceed as follows:

features = readings.copy()

for on, lag, by, hows in AGGS:

    agg = readings.groupby(by)[on].apply(lambda x: (
        x
        .shift(1)
        .rolling(window=lag, min_periods=1)
        .agg(hows)
    ))
    agg = agg.rename(columns={
        how: f'{how}_of_last_{lag}_{on}_by_' + '_and_'.join(by)
        for how in hows
    })

    features = features.join(agg)

features
building day reading mean_of_last_3_reading_by_building mean_of_last_3_reading_by_day max_of_last_3_reading_by_day mean_of_last_3_reading_by_building_and_day
A Saturday 101 nan nan nan nan
A Sunday 88 101 nan nan nan
A Saturday 103 94.5 101 101 101
A Sunday 82 97.3333 88 88 88
A Saturday 100 91 102 103 102
B Saturday 27 nan 101.333 103 nan
B Sunday 13 27 85 88 nan
B Saturday 21 20 76.6667 103 27
B Sunday 17 20.3333 61 88 13
B Saturday 25 17 49.3333 100 24

Note that I have set min_periods to 1 so that statistics are still computed when there are less than lag values in a given window. As you can see, the nans are present when there are no past values to aggregate on.

When I do a Kaggle competition, I run the above code on a small subset of the data and manually check that it’s doing what I want it to do. This is akin to pointing and calling, which I highly recommend doing in any data science endeavour whatsoever.

You might have noticed that I used apply to shift the values backwards before aggregating. Since version 1.0, it’s also possible to define a custom window that does the shifting for us. This may be done by subclassing pd.api.indexers.BaseIndexer:

import numpy as np

class ShiftedWindow(pd.api.indexers.BaseIndexer):

    def __init__(self, window_size):
        self.window_size = window_size

    def get_window_bounds(self, num_values, min_periods, center, closed):

        starts = np.arange(-self.window_size, num_values - self.window_size)
        ends = starts + self.window_size
        starts[:self.window_size] = 0

        return starts, ends

This allows us to rewrite the above snippet as follows:

features = readings.copy()

for on, lag, by, hows in AGGS:

    agg = (
        readings
        .groupby(by)[on]
        .rolling(ShiftedWindow(lag), min_periods=1)
        .agg(hows)
    )
    agg = agg.rename(columns={
        how: f'{how}_of_last_{lag}_{on}_by_' + '_and_'.join(by)
        for how in hows
    })

    features = features.join(agg)

However, it turns out that custom windows within a groupby will only work once version 1.1.1 is released. As of writing this blog post, the above code will not produce the expected output, even though it won’t raise any exception. But if you’re reading from the near future, then it will work!

That’s all for now! I purposefully didn’t go into the detail of each line of code. I rather expect people to copy/paste the above snippets to (hopefully) play around with them by themselves. Feel free to drop a comment if you have any question whatsoever.