At Alan, we do almost all our data analysis in SQL. Our data warehouse used to be PostgreSQL, but we have since switched to Snowflake for performance reasons. We load data into our warehouse with Airflow. This includes dumps of our production database, third-party data, and health data from other actors in the health ecosystem. This is raw data. We transform it into prepared data via an in-house tool that resembles dbt. You can read more about it here.
Our data analysis is done on top of our prepared data. We use Metabase to create dashboards. Recently, we’ve been having a lot of discussion around our setup. Metabase allows querying the data warehouse with SQL. This gives us the liberty to do whatever we want between the data warehouse and the visualisation. This sounds great… but it’s not. Indeed, it’s too permissive.
One of the issues we’re facing is that we have a lot of business logic that is stored in Metabase, rather than being versioned in our analytics codebase. There is also business logic that is duplicated in many places across Metabase. Moreover, when we make a change to our prepared data, it’s burdensome to propagate the changes into our dashboards.
Generally speaking, we wish to change our relationship with Metabase. We've agreed that the less SQL code there is in Metabase, the better. We're addressing this via various initiatives. One of them is to prepare data in such a way that it can be consumed in a dashboard with minimal effort. We recently discovered that Snowflake has a `GROUPING SETS` operator. This has unlocked quite a powerful pattern for us. Before delving into said pattern, let us start by dwelling on what this operator does.
What `GROUPING SETS` does
Let’s use a toy example to illustrate.
As a health insurance company, it is key for us to keep track of our margin. We do this by comparing premiums, which are the money we receive from the people we cover, with claims, which are the healthcare expenses we reimburse. Assuming we've collected these figures at a company level, this might result in a tidy dataset like this one:
Table definition in SQL
```sql
WITH accounts AS (
    SELECT week, country, industry, premiums, claims
    FROM VALUES
        ('2021-01-01', '🇫🇷', '🏥', 10000, 8000),
        ('2021-01-01', '🇫🇷', '🏭', 5000, 7000),
        ('2021-01-08', '🇫🇷', '🏥', 11000, 8500),
        ('2021-01-08', '🇫🇷', '🏭', 4000, 6000),
        ('2021-01-01', '🇪🇸', '🏥', 2000, 1800),
        ('2021-01-01', '🇪🇸', '🏭', 3000, 3500)
        AS accounts (week, country, industry, premiums, claims)
)

SELECT *
FROM accounts
```
Typically, we compute a loss ratio by comparing the premiums with the claims:
```sql
SELECT SUM(claims) / SUM(premiums) AS loss_ratio
FROM accounts
```
We would like to break this metric down across a few dimensions to get a better understanding. We might want to look at the evolution through time, the country the company is based in, as well as the type of industry it belongs to. But we’re greedy, so we also want to look at combinations of these dimensions.
The naïve approach in SQL would be to write down many queries with different `GROUP BY` statements. This can quickly get verbose and difficult to maintain. This is exactly what the `GROUPING SETS` operator is meant for. Let's start with a small example:
```sql
SELECT
    week,
    industry,
    SUM(claims) / SUM(premiums) AS loss_ratio
FROM accounts
WHERE country = '🇫🇷'
GROUP BY GROUPING SETS (
    (week),
    (industry),
    (week, industry)
)
```
This is in fact the concatenation of three smaller tables:

| week | loss_ratio |
| --- | --- |
| 2021-01-01 | 1.00 |
| 2021-01-08 | 0.97 |

| industry | loss_ratio |
| --- | --- |
| 🏥 | 0.79 |
| 🏭 | 1.44 |

| week | industry | loss_ratio |
| --- | --- | --- |
| 2021-01-01 | 🏥 | 0.80 |
| 2021-01-01 | 🏭 | 1.40 |
| 2021-01-08 | 🏥 | 0.77 |
| 2021-01-08 | 🏭 | 1.50 |
In fact, you could just as well implement `GROUPING SETS` yourself by concatenating many `GROUP BY` query results together via some `UNION ALL` operators. This is nicely illustrated here. That's it really: the `GROUPING SETS` operator simply performs and collates many `GROUP BY`s in one fell swoop.
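To make the equivalence concrete, here is a small sketch using Python's built-in sqlite3 module (SQLite has no `GROUPING SETS`, so the `UNION ALL` spelling is the only option there). It uses the French rows of the toy dataset above:

```python
import sqlite3

# The French rows of the toy accounts dataset from above
rows = [
    ("2021-01-01", "🏥", 10000, 8000),
    ("2021-01-01", "🏭", 5000, 7000),
    ("2021-01-08", "🏥", 11000, 8500),
    ("2021-01-08", "🏭", 4000, 6000),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (week TEXT, industry TEXT, premiums REAL, claims REAL)"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?, ?, ?)", rows)

# GROUP BY GROUPING SETS ((week), (industry), (week, industry)),
# spelled out as three GROUP BY queries glued together with UNION ALL
query = """
SELECT week, NULL AS industry, SUM(claims) / SUM(premiums) AS loss_ratio
FROM accounts GROUP BY week
UNION ALL
SELECT NULL, industry, SUM(claims) / SUM(premiums)
FROM accounts GROUP BY industry
UNION ALL
SELECT week, industry, SUM(claims) / SUM(premiums)
FROM accounts GROUP BY week, industry
"""
for week, industry, loss_ratio in conn.execute(query):
    print(week, industry, round(loss_ratio, 2))
```

Columns that don't participate in a given grouping set come back as `NULL`, which is exactly what Snowflake does too.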
A pattern for creating dashboards
Data analysis very often boils down to looking at a metric, and drilling down to understand its distribution. This is why the tidy data concept is so powerful: the data is ready to be aggregated. A dashboard is very often just an interface to display metrics grouped by various dimensions. Ideally, dashboards allow choosing which dimensions to drill down on. Sadly, this isn't available in Metabase. Moreover, the desired aggregation has to be performed live. This costs precious seconds as well as compute credits.
The nice thing with `GROUPING SETS` is that all the computation has already been performed. You can just filter the resulting table to look at the set of dimensions you're interested in. One way to do this is by checking the nullity of the columns:
```sql
WITH groups AS (
    SELECT
        week,
        industry,
        SUM(claims) / SUM(premiums) AS loss_ratio
    FROM accounts
    WHERE country = '🇫🇷'
    GROUP BY GROUPING SETS (
        (week),
        (industry),
        (week, industry)
    )
)

SELECT *
FROM groups
WHERE week IS NOT NULL
  AND industry IS NOT NULL
```
This works just fine. It drastically simplifies the query that would have to be written in the dashboard. Indeed, there's no need to write a `GROUP BY` statement.
If there are a lot of dimensions, writing many `WHERE ... IS NOT NULL` clauses can get a bit boring. A colleague at Alan found a nice trick to make this easier. The idea is to build a string that indicates which dimensions participate in a group. It's possible to do this in Snowflake by using the `GROUPING_ID` function:
```sql
WITH groups AS (
    SELECT
        week,
        industry,
        ARRAY_TO_STRING(
            ARRAY_CONSTRUCT_COMPACT(
                IFF(GROUPING_ID(week) = 0, 'week', NULL),
                IFF(GROUPING_ID(industry) = 0, 'industry', NULL)
            ),
            ' x '
        ) AS group_by,
        SUM(claims) / SUM(premiums) AS loss_ratio
    FROM accounts
    WHERE country = '🇫🇷'
    GROUP BY GROUPING SETS (
        (week),
        (industry),
        (week, industry)
    )
)

SELECT *
FROM groups
```
| week | industry | group_by | loss_ratio |
| --- | --- | --- | --- |
| 2021-01-01 | 🏥 | week x industry | 0.80 |
| 2021-01-01 | 🏭 | week x industry | 1.40 |
| 2021-01-08 | 🏥 | week x industry | 0.77 |
| 2021-01-08 | 🏭 | week x industry | 1.50 |
This is nice, because now we can access a particular group like so:
```sql
-- Before
SELECT *
FROM groups
WHERE week IS NOT NULL
  AND industry IS NOT NULL

-- After
SELECT *
FROM groups
WHERE group_by = 'week x industry'
```
That's as simple as a query can get. It's just a very readable `SELECT FROM WHERE`. This is great for us, as it minimizes the amount of SQL we put in Metabase. It also speeds up our dashboards, because the heavy lifting has already been done.
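To see the whole pattern end to end outside of Snowflake, here is a sketch with Python's built-in sqlite3. Since SQLite has neither `GROUPING SETS` nor `GROUPING_ID`, each `UNION ALL` branch tags its rows with a hand-written `group_by` literal; the query left for the "dashboard" is then a plain filter:

```python
import sqlite3

# The French rows of the toy accounts dataset from above
rows = [
    ("2021-01-01", "🏥", 10000, 8000),
    ("2021-01-01", "🏭", 5000, 7000),
    ("2021-01-08", "🏥", 11000, 8500),
    ("2021-01-08", "🏭", 4000, 6000),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (week TEXT, industry TEXT, premiums REAL, claims REAL)"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?, ?, ?)", rows)

# Precompute every group once; each branch labels its own rows
conn.execute("""
CREATE TABLE groups AS
SELECT week, NULL AS industry, 'week' AS group_by,
       SUM(claims) / SUM(premiums) AS loss_ratio
FROM accounts GROUP BY week
UNION ALL
SELECT NULL, industry, 'industry', SUM(claims) / SUM(premiums)
FROM accounts GROUP BY industry
UNION ALL
SELECT week, industry, 'week x industry', SUM(claims) / SUM(premiums)
FROM accounts GROUP BY week, industry
""")

# The query left for the dashboard is a plain SELECT ... FROM ... WHERE
dashboard_query = """
SELECT week, industry, loss_ratio
FROM groups
WHERE group_by = 'week x industry'
ORDER BY week, industry
"""
for row in conn.execute(dashboard_query):
    print(row)
```

The heavy aggregation happens once, when `groups` is materialized; every dashboard read afterwards is a cheap filter.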
Writing down a `GROUPING SETS` operator can be a bit tedious. It's also slightly error-prone if you're juggling a lot of dimensions. Thankfully, Snowflake provides a couple of operators to ease this process.
You can use `CUBE` when you want to group on all the combinations of dimensions. It's a good default when you're not sure which dimensions are going to be used in the dashboard. By the way, I really think that this notion of having prepared data that doesn't know how it's going to be used is a powerful idea. It's yet another instance of data independence.
```sql
GROUP BY CUBE (week, country, industry)

-- is short for

GROUP BY GROUPING SETS (
    (),
    (week),
    (country),
    (industry),
    (week, country),
    (week, industry),
    (country, industry),
    (week, country, industry)
)
```

Note that `CUBE` also includes the empty grouping set `()`, which is the grand total over the whole table.
You can also use the `ROLLUP` operator if you want a `GROUPING SETS` which drills down over dimensions. It's useful when your dimensions have a hierarchy.
```sql
GROUP BY ROLLUP (week, country, industry)

-- is short for

GROUP BY GROUPING SETS (
    (),
    (week),
    (week, country),
    (week, country, industry)
)
```
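One way to internalize what these shorthands expand to is to generate the grouping sets yourself. Here's a small Python sketch (the function names are mine, not Snowflake's): `CUBE` is the powerset of the dimensions, including the grand total `()`, while `ROLLUP` is the list of prefixes.

```python
from itertools import combinations


def cube(dimensions):
    """All subsets of the dimensions: the grouping sets CUBE expands to."""
    return [
        subset
        for r in range(len(dimensions) + 1)
        for subset in combinations(dimensions, r)
    ]


def rollup(dimensions):
    """Prefixes of the dimensions: the grouping sets ROLLUP expands to."""
    return [tuple(dimensions[:r]) for r in range(len(dimensions) + 1)]


print(cube(["week", "country", "industry"]))    # 2^3 = 8 grouping sets
print(rollup(["week", "country", "industry"]))  # 3 + 1 = 4 grouping sets
```

For n dimensions, `CUBE` produces 2^n grouping sets while `ROLLUP` produces n + 1, which is worth keeping in mind before cubing a wide table.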
I hope you found this post useful! Taking a step back, it does feel that this is reinventing the wheel somehow. Writing SQL to connect a data warehouse to a dashboard may seem awkward. But sometimes you’re limited by your tools, and you don’t have access to an expensive no-code interface to do all this for you.