Computing cross-correlations in SQL • Max Halford

Introduction

I’m currently working on a problem at work where I have to measure the impact of a growth initiative on a performance metric. Hypothetically, this might answer the following kind of question:

I’ve spent X amount of money, what is the impact on the number of visitors on my website?

Of course, there are many measures that can be taken to answer such a question. I decided to measure the correlation between the initiative and the metric, with the latter being shifted forward in time. This measure is called the cross-correlation. It’s different from serial correlation, which is the correlation of a series with a shifted version of itself.

Computing the cross-correlation between two series of numbers $X$ and $Y$ is simple. Assuming that we want to measure the impact of $X$ on $Y$, we just have to shift $Y$ forward and compute the correlation between $X$ and the shifted version of $Y$. Of course, one has to decide by how much to shift $Y$. Ideally, we would like to measure the impact of $X$ on multiple shifted versions of $Y$. Indeed, we might want to know how increasing $X$ affects $Y$ in one week, two weeks, three weeks, etc.

Cross-correlation for two variables

I’m going to be using PostgreSQL 12.3. I’m be using a fair amount of common table expressions in order to clarify the steps to take. First, let’s pick some dummy data:

WITH dummy AS (
    SELECT * FROM (
    VALUES
        ('2020-01-01'::DATE, 1, 10),
        ('2020-01-02'::DATE, 2, 20),
        ('2020-01-03'::DATE, 3, 50),
        ('2020-01-04'::DATE, 0, -1000),
        ('2020-01-05'::DATE, -5, 0)
    ) AS dummy (date, x, y)
)

As mentioned, we would like our code to work for an arbitrary number of steps forward. Therefore, let’s define a range of delta values by which will be used to shift Y forward:

, delta AS (
    SELECT *
    FROM generate_series(1, 3) delta
)

Now let’s add the deltas to the date field in order to define the future dates:

, dates AS (
    SELECT
        date AS present,
        delta,
        date + delta * '1 day'::interval AS future
    FROM dummy
    CROSS JOIN delta
)

>>> SELECT * FROM dates LIMIT 8;
present     delta   future
2020-01-01  1       2020-01-02
2020-01-02  1       2020-01-03
2020-01-03  1       2020-01-04
2020-01-04  1       2020-01-05
2020-01-05  1       2020-01-06
2020-01-01  2       2020-01-03
2020-01-02  2       2020-01-04
2020-01-03  2       2020-01-05

We can now join the dates with the values from the initial table:

, pairwise AS (
    SELECT
        dates.present,
        dates.future,
        dates.delta,
        present.x AS present_x,
        future.y AS future_y
    FROM
        dates,
        dummy AS present,
        dummy AS future
    WHERE dates.present = present.date
        AND dates.future = future.date
)

>>> SELECT * FROM pairwise ORDER BY present, delta;
present     future      delta   present_x   future_y
2020-01-01  2020-01-02  1       1           20
2020-01-01  2020-01-03  2       1           50
2020-01-01  2020-01-04  3       1           -1000
2020-01-02  2020-01-03  1       2           50
2020-01-02  2020-01-04  2       2           -1000
2020-01-02  2020-01-05  3       2           0
2020-01-03  2020-01-04  1       3           -1000
2020-01-03  2020-01-05  2       3           0
2020-01-04  2020-01-05  1       0           0

Finally, we just have to compute the correlation within each group. The groups are defined by the delta field. In other words, we will compute the correlation for all values that match $(X_t, Y_{t+1})$, $(X_t, Y_{t+2})$, and $(X_t, Y_{t+3})$.

, cross_corrs AS (
    SELECT
        delta,
        CORR(present_x, future_y) AS pearson
    FROM pairwise
    GROUP BY delta
    ORDER BY delta
)

>>> SELECT * FROM cross_corrs;
delta   pearson
1       -0,7487619679833206
2       -0,042207495591251455
3       1

That’s it, we’re done. You can click on the toggle below if you want to copy/paste the full example.

Click to see the full example

WITH dummy AS (
    SELECT * FROM (
    VALUES
        ('2020-01-01'::DATE, 1, 10),
        ('2020-01-02'::DATE, 2, 20),
        ('2020-01-03'::DATE, 3, 50),
        ('2020-01-04'::DATE, 0, -1000),
        ('2020-01-05'::DATE, -5, 0)
    ) AS dummy (date, x, y)
)

, delta AS (
    SELECT *
    FROM generate_series(1, 3) delta
)

, dates AS (
    SELECT
        date AS present,
        delta,
        date + delta * '1 day'::interval AS future
    FROM dummy
    CROSS JOIN delta
)

, pairwise AS (
    SELECT
        dates.present,
        dates.future,
        dates.delta,
        present.x AS present_x,
        future.y AS future_y
    FROM
        dates,
        dummy AS present,
        dummy AS future
    WHERE dates.present = present.date
        AND dates.future = future.date
)

, cross_corrs AS (
    SELECT
        delta,
        CORR(present_x, future_y) AS pearson
    FROM pairwise
    GROUP BY delta
    ORDER BY delta
)

SELECT * FROM cross_corrs;

Different kinds of correlations

Up until now, I’ve been making the common assumption that “correlation” refers to the Pearson correlation. There are other kinds of correlations, such as Spearman’s rank correlation. The latter is not available in PostgreSQL 12.3, but it is easy to calculate, as it is just the Pearson correlation of the ranks of the values of $X$ and $Y$. In SQL, this translates to:

SELECT
    delta
    RANK () OVER (
        PARTITION BY delta
        ORDER BY x
    ) AS x_rank,
    RANK () OVER (
        PARTITION BY delta
        ORDER BY y
    ) AS y_rank
FROM dummy

0.9

Ideally, we would like to calculate both kinds of correlations (Pearson and Spearman) for each delta value. This is quite straightforward to do, albeit slightly verbose.

WITH dummy AS (
    SELECT * FROM (
    VALUES
        ('2020-01-01'::DATE, 1, 10),
        ('2020-01-02'::DATE, 2, 20),
        ('2020-01-03'::DATE, 3, 50),
        ('2020-01-04'::DATE, 0, -1000),
        ('2020-01-05'::DATE, -5, 0)
    ) AS dummy (date, x, y)
)

, delta AS (
    SELECT *
    FROM generate_series(1, 3) delta
)

, dates AS (
    SELECT
        date AS present,
        delta,
        date + delta * '1 day'::interval AS future
    FROM dummy
    CROSS JOIN delta
)

, pairwise AS (
    SELECT
        dates.present,
        dates.future,
        dates.delta,
        present.x AS present_x,
        future.y AS future_y
    FROM
        dates,
        dummy AS present,
        dummy AS future
    WHERE dates.present = present.date
        AND dates.future = future.date
)

, cross_corrs AS (
    SELECT
        delta,
        CORR(present_x, future_y) AS pearson
    FROM pairwise
    GROUP BY delta
    ORDER BY delta
)

, pairwise_ranks AS (
    SELECT
    	delta,
        RANK () OVER (
            PARTITION BY delta
            ORDER BY present_x
        ) AS present_x_rank,
        RANK () OVER (
            PARTITION BY delta
            ORDER BY future_y
        ) AS future_y_rank
    FROM pairwise
)

, cross_rank_corrs AS (
    SELECT
        delta,
        CORR(present_x_rank, future_y_rank) AS spearman
    FROM pairwise_ranks
    GROUP BY delta
    ORDER BY delta
)

, corrs AS (
    SELECT
        cc.delta,
        cc.pearson,
        crc.spearman
    FROM cross_corrs cc
    LEFT JOIN cross_rank_corrs crc
        ON cc.delta = crc.delta
)

>>> SELECT * FROM corrs;
delta   pearson                 spearman
1       -0,7487619679833206     -0,2
2       -0,042207495591251455   -0,5
3       1                       1

Multiple $X$s and $Y$s

We now have a framework to compute different kinds of cross-correlations between two variables over multiple time steps. Let’s push the envelope and make it so that we can compute the cross-correlations for multiple variables. To be precise, we would like to compute cross-correlations between a group of initiatives ($X$s) and a group of metrics ($Y$s).

First of all, let’s create some more dummy data. We’ll also split the $X$s from the $Y$s. We’re also going to melt (or “unpivot” as some like to say) both groups. The reason why we do this is that it will simplify the subsequent code.

WITH dummy AS (
    SELECT * FROM (
    VALUES
        ('2020-01-01'::DATE, 1, 4, 10, 15),
        ('2020-01-02'::DATE, 2, 8, 20, 12),
        ('2020-01-03'::DATE, 3, -2, 50, -44),
        ('2020-01-04'::DATE, 0, 3, -1000, 12),
        ('2020-01-05'::DATE, -5, 10, 0, 14)
    ) AS dummy (date, x1, x2, y1, y2)
)

, xs AS (
    SELECT
        date,
        UNNEST(ARRAY['x1', 'x2']) AS var,
        UNNEST(ARRAY[x1, x2]) AS val
    FROM dummy
)

, ys AS (
    SELECT
        date,
        UNNEST(ARRAY['y1', 'y2']) AS var,
        UNNEST(ARRAY[y1, y2]) AS val
    FROM dummy
)

>>> SELECT * FROM xs ORDER BY date, var;
date        var   val
2020-01-01  x1    1
2020-01-01  x2    4
2020-01-02  x1    2
2020-01-02  x2    8
2020-01-03  x1    3
2020-01-03  x2    -2
2020-01-04  x1    0
2020-01-04  x2    3
2020-01-05  x1    -5
2020-01-05  x2    10

Note that this way of representing data agrees with the Tidy Data principles advocated by Hadley Wickam. Essentially, flattening/melting a table eases the manipulation of said table.

Next, we’re going to be generating future dates in the exact same way as we did previously.

, delta AS (
    SELECT *
    FROM generate_series(1, 3) delta
)

, dates AS (
    SELECT
        date AS present,
        delta,
        date + delta * '1 day'::interval AS future
    FROM dummy
    CROSS JOIN delta
)

We do however need to change the definition of the pairwise CTE. It’s quite straightforward as long as you’re concentrating a minimum. Here goes:

, pairwise AS (
    SELECT
        dates.present,
        dates.future,
        dates.delta,
        xs.var AS x_var,
        xs.val AS x_val,
        ys.var AS y_var,
        ys.val AS y_val
    FROM
        dates,
        xs,
        ys
    WHERE dates.present = xs.date
        AND dates.future = ys.date
    ORDER BY present, delta, x_var, y_var
)

>>> SELECT * FROM pairwise LIMIT 5;
present     future       delta  x_var x_val y_var   y_val
2020-01-01  2020-01-02   1      x1    1     y1      20
2020-01-01  2020-01-02   1      x1    1     y2      12
2020-01-01  2020-01-02   1      x2    4     y1      20
2020-01-01  2020-01-02   1      x2    4     y2      12
2020-01-01  2020-01-03   2      x1    1     y1      50

We can now compute all the correlations we want by grouping over (delta, x_var, y_var). Again, this is still relatively simple as long as you proceed step by step:

, cross_corrs AS (
    SELECT
        delta,
        x_var AS x,
        y_var AS y,
        'pearson' AS kind,
        CORR(x_val, y_val) AS correlation
    FROM pairwise
    GROUP BY delta, x_var, y_var
)

, pairwise_ranks AS (
    SELECT
    	delta,
    	x_var AS x,
    	y_var AS y,
        RANK () OVER (
            PARTITION BY delta, x_var, y_var
            ORDER BY x_val
        ) AS x_rank,
        RANK () OVER (
            PARTITION BY delta, x_var, y_var
            ORDER BY y_val
        ) AS y_rank
    FROM pairwise
)

, cross_rank_corrs AS (
    SELECT
        delta,
        x,
        y,
        'spearman' AS kind,
        CORR(x_rank, y_rank) AS correlation
    FROM pairwise_ranks
    GROUP BY delta, x, y
)

, corrs AS (
    SELECT * FROM cross_corrs
    UNION
    SELECT * FROM cross_rank_corrs
    ORDER BY delta, x, y, kind
)

>>> SELECT * FROM corrs;
delta   x   y        kind       correlation
1       x1  y1       pearson     -0,7487619679833206
1       x1  y1       spearman    -0,2
1       x1  y2       pearson     -0,2823436901242271
1       x1  y2       spearman    -0,7181848464596079
1       x2  y1       pearson     0,8708519999050144
1       x2  y1       spearman    1
1       x2  y2       pearson     -0,7618694949404378
1       x2  y2       spearman    -0,5129891760425771
2       x1  y1       pearson     -0,042207495591251455
2       x1  y1       spearman    -0,5
2       x1  y2       pearson     0,8808122718846411
2       x1  y2       spearman    1
2       x2  y1       pearson     -0,777082191335969
2       x2  y1       spearman    -0,5
2       x2  y2       pearson     -0,144827299193112
2       x2  y2       spearman    -0,5
3       x1  y1       pearson     1
3       x1  y1       spearman    1
3       x1  y2       pearson     1
3       x1  y2       spearman    1
3       x2  y1       pearson     1
3       x2  y1       spearman    1
3       x2  y2       pearson     1
3       x2  y2       spearman    1

That’s all folks!

Introduction

Cross-correlation for two variables

Different kinds of correlations

Multiple $X$s and $Y$s

Related posts