
Jamie Skipworth


Technology Generalist | Software & Data


Cracking Window Functions

Window functions - what are they and how would you use them? It's a question often fired at data engineering interviewees by interviewers more interested in hearing crap regurgitated from the back pages of a manual than in how you'd actually solve a problem.

Cracking Windows

One “official” description of a window function is this:

A window function performs a calculation across a set of table rows that are somehow related to the current row.

This description used to annoy me when I was trying to get to grips with them because it’s completely unhelpful in describing what they do; it doesn’t bloody mean anything. If you jump straight to the manual to look at the syntax like I did, you’d probably still be in the dark:

window_func(arg1, arg2..) OVER (PARTITION BY expression ORDER BY expression)

Documentation has improved in recent years, so they’re not as obscure as they used to be. Nevertheless, here’s my attempt at explaining them.


Sometimes the records in your data might be related to the ones around them, like transaction or event records. A “window” is really just some contiguous range of records defined by a key and/or a sort order, like a customer ID and an order date.

If I defined a window as PARTITION BY customer_id (using the above syntax), then I’d end up with windows looking a bit like this:

[Image: two windows, one per customer_id block]
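
Pictured with two of the customers we'll meet in the query output below, the windows look roughly like this, one block of rows per customer_id:

 customer_id | order_date | total_due
-------------+------------+-----------
       11000 | 2011-06-21 |   3756.99   <-- window 1: customer 11000's orders
       11000 | 2013-06-20 |   2587.88
       11000 | 2013-10-03 |   2770.27
       11001 | 2011-06-17 |   3729.36   <-- window 2: customer 11001's orders
       11001 | 2013-06-18 |   2674.02
       11001 | 2014-05-12 |    650.801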

I now have a window that slides over each block of identical customer_id records. You could aggregate and rank these records, like ranking each customer’s orders by value:

DENSE_RANK( ) OVER (PARTITION BY customer_id ORDER BY total_due DESC)

I’m still not sure if I’ve explained them well, so I’m going to use some examples instead. I’ve loaded 5 tables from the Microsoft AdventureWorks database into PostgreSQL (because you’d be mental to use anything else). They look like this:

adv=# \dt
                  List of relations
 Schema |        Name        | Type  |     Owner
--------+--------------------+-------+----------------
 public | customer           | table | spandexman
 public | person             | table | spandexman
 public | product            | table | spandexman
 public | sales_order_detail | table | spandexman
 public | sales_order_header | table | spandexman
(5 rows)

To try and show why window functions are useful, here are 2 equivalent queries to get total revenue for each customer. One uses plain old aggregates, and the other uses a window function.

With a regular aggregate:

with cust_total as (
-- First, aggregate the total revenue for each customer.
  select 
    customer_id
  , sum(total_due) as cust_revenue
  from 
    sales_order_header 
  group by  
    customer_id
)
-- Then join it back to the customer & sales data.
-- This gives us an order-level view, and also the total customer revenue.
select 
  c.customer_id
, p.first_name
, p.last_name
-- Cast order_date to date to truncate time.
, cast( s.order_date as date ) as order_date
, sum(s.total_due) as total_due
, t.cust_revenue as revenue
from 
  customer c 
inner join 
  person p 
  on ( c.customer_id = p.bus_ent_id )
inner join 
  sales_order_header s 
  on( s.customer_id = c.customer_id ) 
inner join
  cust_total t
  on( c.customer_id = t.customer_id ) 
group by 
  c.customer_id
, p.first_name
, p.last_name
, s.order_date
, t.cust_revenue
order by 
  customer_id asc
, order_date desc
, revenue desc;

Output:

 customer_id | first_name  | last_name | order_date | total_due | revenue
-------------+-------------+-----------+------------+-----------+----------
       11000 | Mary        | Young     | 2013-10-03 |   2770.27 |  9115.13
       11000 | Mary        | Young     | 2013-06-20 |   2587.88 |  9115.13
       11000 | Mary        | Young     | 2011-06-21 |   3756.99 |  9115.13
       11001 | Amber       | Young     | 2014-05-12 |   650.801 |  7054.19
       11001 | Amber       | Young     | 2013-06-18 |   2674.02 |  7054.19
       11001 | Amber       | Young     | 2011-06-17 |   3729.36 |  7054.19

With a window function:

-- Using the SUM() window function, we can achieve the same thing without
-- screwing around with sub-queries and joins. It becomes a one-liner.
select 
  c.customer_id
, p.first_name
, p.last_name
-- Cast order_date to date to truncate time.
, cast( s.order_date as date ) as order_date
, s.total_due
, sum( s.total_due ) over( partition by c.customer_id ) as revenue
from 
  customer c 
inner join 
  person p 
  on ( c.customer_id = p.bus_ent_id )
inner join 
  sales_order_header s 
  on( s.customer_id = c.customer_id ) 
order by 
  customer_id asc
, order_date desc
, revenue desc;

Output:

 customer_id | first_name  | last_name | order_date | total_due | revenue
-------------+-------------+-----------+------------+-----------+---------
       11000 | Mary        | Young     | 2013-10-03 |   2770.27 |  9115.13
       11000 | Mary        | Young     | 2013-06-20 |   2587.88 |  9115.13
       11000 | Mary        | Young     | 2011-06-21 |   3756.99 |  9115.13
       11001 | Amber       | Young     | 2014-05-12 |   650.801 |  7054.19
       11001 | Amber       | Young     | 2013-06-18 |   2674.02 |  7054.19
       11001 | Amber       | Young     | 2011-06-17 |   3729.36 |  7054.19

Same same.

So instead of writing a separate sub-query to aggregate the revenue and join it back, we just call the SUM() window function with PARTITION BY customer_id, which makes it SUM( total_due ) separately for each customer_id.

Another benefit of using the window function here is performance. If you run EXPLAIN ANALYZE on these two queries you can see an execution-time improvement of around 35%.
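
To reproduce the timings yourself, just stick EXPLAIN ANALYZE in front of either query. Here's a trimmed-down sketch using the window version (minus the person join, so your numbers won't match mine exactly):

explain analyze
select 
  c.customer_id
, sum( s.total_due ) over( partition by c.customer_id ) as revenue
from 
  customer c 
inner join 
  sales_order_header s 
  on( s.customer_id = c.customer_id );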

With a regular aggregate:

 Planning time: 1.661 ms
 Execution time: 164.368 ms

With a window function:

 Planning time: 1.025 ms
 Execution time: 106.204 ms

So not only can you reduce the code you need to write, you can also improve performance! Why wouldn’t you use them?!

More Windows

Here are a few more examples that use window functions in various ways.

DENSE_RANK customers by revenue

Let's say we want to rank customers by their lifetime revenue. First we'll compute the SUM() window function the same way as above, then we'll rank customers with DENSE_RANK().

with revenue as(
  select 
    c.customer_id
  , p.first_name
  , p.last_name
  , sum( s.total_due )  over( partition by c.customer_id ) as revenue
  from 
    customer c 
  inner join 
    person p 
    on ( c.customer_id = p.bus_ent_id )
  inner join 
    sales_order_header s 
    on( s.customer_id = c.customer_id ) 
)
select
  r.*
, dense_rank() over( order by r.revenue desc ) as rnk
from
  revenue r
order by 
  rnk asc;

Results:

 customer_id | first_name  |    last_name     | revenue | rnk
-------------+-------------+------------------+---------+------
       12301 | Hannah      | Clark            | 14691.4 |    1
       12301 | Hannah      | Clark            | 14691.4 |    1
       12301 | Hannah      | Clark            | 14691.4 |    1
       12301 | Hannah      | Clark            | 14691.4 |    1
       12301 | Hannah      | Clark            | 14691.4 |    1
       12132 | Taylor      | Jones            | 14690.2 |    2
       12132 | Taylor      | Jones            | 14690.2 |    2
       12132 | Taylor      | Jones            | 14690.2 |    2
       12132 | Taylor      | Jones            | 14690.2 |    2
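
Each customer shows up once per order here, because the CTE keeps a row per order and simply repeats the window SUM() across them. If you'd rather have one row per customer, one simple option (there are others) is to make the CTE distinct:

with revenue as(
  select distinct -- collapses the repeated per-order rows
    c.customer_id
  , p.first_name
  , p.last_name
  , sum( s.total_due ) over( partition by c.customer_id ) as revenue
  from 
    customer c 
  inner join 
    person p 
    on ( c.customer_id = p.bus_ent_id )
  inner join 
    sales_order_header s 
    on( s.customer_id = c.customer_id ) 
)
select
  r.*
, dense_rank() over( order by r.revenue desc ) as rnk
from
  revenue r
order by 
  rnk asc;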

Time between orders using LAG

Let’s work out the time between orders for each customer. LAG() returns a value from the previous row in the window, so because we want it per customer we PARTITION BY customer_id, and because we want the previous order we ORDER BY order_date within each window. Sort order is obviously very important here.

select 
  c.customer_id
, p.first_name
, p.last_name
, cast( order_date as date ) as order_date -- Cast to date to truncate time
, lag( cast( order_date as date ) ) over( partition by c.customer_id order by order_date ) as last_order 
from 
  customer c 
inner join
  person p 
  on ( c.customer_id = p.bus_ent_id ) 
inner join 
  sales_order_header s 
  on( s.customer_id = c.customer_id ) 
order by 
  customer_id
, order_date asc;

So for each customer order record, we get the value of the previous order_date.

 customer_id | first_name  | last_name | order_date | last_order
-------------+-------------+-----------+------------+------------
       11000 | Mary        | Young     | 2011-06-21 |
       11000 | Mary        | Young     | 2013-06-20 | 2011-06-21
       11000 | Mary        | Young     | 2013-10-03 | 2013-06-20
       11001 | Amber       | Young     | 2011-06-17 |
       11001 | Amber       | Young     | 2013-06-18 | 2011-06-17
       11001 | Amber       | Young     | 2014-05-12 | 2013-06-18

If you wanted to extract the number of days between the current and last order, you could modify the query like this:

with orders as (  
  select 
    c.customer_id
  , p.first_name
  , p.last_name
  , cast( order_date as date  ) as order_date -- Cast to date to truncate time
  , lag( cast( order_date as date ) ) over( partition by c.customer_id order by order_date ) as last_order 
  from 
    customer c 
  inner join
    person p 
    on ( c.customer_id = p.bus_ent_id ) 
  inner join 
    sales_order_header s 
    on( s.customer_id = c.customer_id )
)
select 
    customer_id
  , first_name
  , last_name
  , order_date
  , last_order 
  -- date - date yields an integer in PostgreSQL, so cast back to timestamp
  -- to get an interval that extract( day from ... ) can pull the days from.
  , extract( day from ( cast( order_date as timestamp ) - cast( last_order as timestamp ) ) ) as lag_days
from 
  orders o;

To get this:

 customer_id | first_name  |  last_name | order_date | last_order | lag_days
-------------+-------------+------------+------------+------------+----------
       11000 | Mary        | Young      | 2011-06-21 |            |
       11000 | Mary        | Young      | 2013-06-20 | 2011-06-21 |   730
       11000 | Mary        | Young      | 2013-10-03 | 2013-06-20 |   105
       11001 | Amber       | Young      | 2011-06-17 |            |
       11001 | Amber       | Young      | 2013-06-18 | 2011-06-17 |   732
       11001 | Amber       | Young      | 2014-05-12 | 2013-06-18 |   328

From here you could do things like work out customer purchase frequency if you wanted to.
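
For example, averaging lag_days per customer gives you a rough purchase frequency. A minimal sketch building on the same orders CTE (name columns dropped for brevity):

with orders as (
  select
    c.customer_id
  , cast( order_date as date ) as order_date
  , lag( cast( order_date as date ) ) over( partition by c.customer_id order by order_date ) as last_order
  from
    customer c
  inner join
    sales_order_header s
    on( s.customer_id = c.customer_id )
)
select
    customer_id
  -- Average gap, in days, between consecutive orders
  , avg( extract( day from ( cast( order_date as timestamp ) - cast( last_order as timestamp ) ) ) ) as avg_days_between_orders
from
  orders
where
  last_order is not null -- the first order has nothing to compare against
group by
  customer_id
order by
  customer_id;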

How about a simple RFM model?

Something that’s quite commonly done (or used to be) is RFM modelling, which allows you to rank customers by Recency, Frequency and Monetary measures.

with rfm as (
  select
      c.customer_id
    , p.first_name
    , p.last_name
    -- Use 2014-12-31 as the base date. The data is old, so anchoring on
    -- today's date would blow recency out and the < 365 filter below
    -- would return no rows.
    , extract( day from ( date('2014-12-31') - max(s.order_date) ) ) as recency
    , max( order_date ) as last_order_dt
    , count( 1 )        as frequency
    , sum( total_due )  as monetary
  from 
    customer c 
  inner join 
    person p
    on ( c.customer_id = p.bus_ent_id ) 
  inner join 
    sales_order_header s 
    on( s.customer_id = c.customer_id ) 
  group by
      c.customer_id
    , p.first_name
    , p.last_name   
)
select      
    customer_id
  , first_name
  , last_name
  , ntile( 10 ) over( order by recency asc )    as rfm_recency
  , ntile( 10 ) over( order by frequency asc )  as rfm_frequency
  , ntile( 10 ) over( order by monetary asc )   as rfm_monetary
from 
  rfm
where recency < 365 -- Only customers active within a year
order by 
    customer_id
  , rfm_recency desc
  , rfm_frequency desc
  , rfm_monetary desc;

This gives us a list of customers and their various RFM rank values. You’ll be able to identify your most or least valuable customers using something like this.

 customer_id | first_name  |    last_name     | rfm_recency | rfm_frequency | rfm_monetary
-------------+-------------+------------------+-------------+---------------+--------------
       11001 | Amber       | Young            |           3 |             9 |           10
       11012 | Chloe       | Thompson         |           7 |             8 |            4
       11013 | Joe         | Raman            |           1 |             8 |            4
       11017 | Chloe       | Clark            |           7 |            10 |           10
       11018 | Joe         | Mehta            |           6 |             9 |           10
       11019 | Joe         | Belson           |           1 |            10 |            6
       11023 | Chloe       | Lee              |           1 |             8 |            5
       11025 | Marshall    | Sun              |           6 |             9 |           10
       11026 | Joe         | Martinez         |           7 |             9 |           10
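
If you want a single number to sort customers by, one quick-and-dirty option (and it's only one of many ways to do it) is to add the three deciles together. A minimal sketch, dropping the name columns for brevity:

with rfm as (
  select
      c.customer_id
    , extract( day from ( date('2014-12-31') - max(s.order_date) ) ) as recency
    , count( 1 )       as frequency
    , sum( total_due ) as monetary
  from
    customer c
  inner join
    sales_order_header s
    on( s.customer_id = c.customer_id )
  group by
    c.customer_id
),
rfm_scores as (
  select
      customer_id
    -- Same decile logic as above. Note the deciles are ranked ascending,
    -- so a *low* rfm_recency means a recent purchase; flip the ordering
    -- if you'd rather higher always mean "better".
    , ntile( 10 ) over( order by recency asc )    as rfm_recency
    , ntile( 10 ) over( order by frequency asc )  as rfm_frequency
    , ntile( 10 ) over( order by monetary asc )   as rfm_monetary
  from
    rfm
  where recency < 365 -- Only customers active within a year
)
select
    customer_id
  , rfm_recency + rfm_frequency + rfm_monetary as rfm_score
from
  rfm_scores
order by
  rfm_score desc;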

So, window functions are dead useful, and if you’re a bit of a lazy developer like me they’ll save you some typing.

Get the database

If you want to play around with PostgreSQL, download it here and install it.

For macOS, which is what I used for this post, simply get the app from postgresapp.com and run it.

The database dump is available on my GitHub here. To restore the database, do the following from the command-line:

$ psql postgres -c "create database adv encoding utf8"
CREATE DATABASE
$ gzcat adv_db.sql.gz | psql adv

You can then connect to the database and run queries with psql adv.