Window functions

A window function performs a calculation across a set of rows — all rows in a query are divided into groups (windows) and each group has its own aggregates. It does not return a single output row as an aggregate function. A window function adds an aggregated value to each row of the selection result in a separate column.

Create a table to run test queries

 
To follow the examples in this article and see how window functions work, first create the wnd_func_table table:

CREATE TABLE wnd_func_table (group_id Int32, sort_id Int32, value Int32) ENGINE = MergeTree ORDER BY group_id;
INSERT INTO wnd_func_table FORMAT Values
(1,1,10), (1,2,20), (1,3,30), (1,4,40), (1,5,50),
(2,1,1), (2,2,2), (2,3,3), (2,4,4), (2,4,5), (2,4,6), (2,5,7), (2,6,8);
┌─group_id─┬─sort_id─┬─value─┐
│        1 │       1 │    10 │
│        1 │       2 │    20 │
│        1 │       3 │    30 │
│        1 │       4 │    40 │
│        1 │       5 │    50 │
│        2 │       1 │     1 │
│        2 │       2 │     2 │
│        2 │       3 │     3 │
│        2 │       4 │     4 │
│        2 │       4 │     5 │
│        2 │       4 │     6 │
│        2 │       5 │     7 │
│        2 │       6 │     8 │
└──────────┴─────────┴───────┘

Syntax overview

The basic syntax for using a window function can be written as follows:

SELECT <window_function_name>([<argument_list>]) OVER {() | <window_name> | (<window_spec>)}
FROM <table_name>
WINDOW <window_name> AS (<window_spec>);

Use the OVER clause to define a window for a function — a set of rows over which the function is calculated. After the OVER keyword, you can specify:

  • () — empty parentheses, if you want to pass all rows from a dataset to a window function;

    Example

     
    The following query reads all rows from the wnd_func_table table and adds row numbers and the sum of all values in the value column to the additional number and sum columns:

    SELECT
        group_id,
        sort_id,
        value,
        row_number() OVER() AS number,
        sum(value) OVER() AS sum
    FROM wnd_func_table;
    ┌─group_id─┬─sort_id─┬─value─┬─number─┬─sum─┐
    │        1 │       1 │    10 │      1 │ 186 │
    │        1 │       2 │    20 │      2 │ 186 │
    │        1 │       3 │    30 │      3 │ 186 │
    │        1 │       4 │    40 │      4 │ 186 │
    │        1 │       5 │    50 │      5 │ 186 │
    │        2 │       1 │     1 │      6 │ 186 │
    │        2 │       2 │     2 │      7 │ 186 │
    │        2 │       3 │     3 │      8 │ 186 │
    │        2 │       4 │     4 │      9 │ 186 │
    │        2 │       4 │     5 │     10 │ 186 │
    │        2 │       4 │     6 │     11 │ 186 │
    │        2 │       5 │     7 │     12 │ 186 │
    │        2 │       6 │     8 │     13 │ 186 │
    └──────────┴─────────┴───────┴────────┴─────┘
  • <window_name> — name assigned to a window defined via the WINDOW clause within a query (in other words, the WINDOW clause allows you to describe a window once and then reuse it for multiple window functions in the SELECT query);

  • <window_spec> — window specification in parentheses as a set of optional clauses (see the syntax below).

Window specification

Below is the basic syntax of the window specification (<window_spec>):

[<window_name>]
[PARTITION BY <partition_column> [, ...]]
[ORDER BY <order_column> [ASC|DESC] [, ...]]
[{RANGE | ROWS} {<frame_start> | BETWEEN <frame_start> AND <frame_end>}]

where:

  • PARTITION BY — column that contains values used to divide rows into windows. A window function calculates a value for each row of the source dataset over rows of the window to which this row belongs. If you omit PARTITION BY in a query, a window function is evaluated over all rows.

    Example

     
    The following query uses the PARTITION BY clause to divide data rows into windows by values of the group_id column:

    SELECT
        group_id,
        sort_id,
        value,
        row_number() OVER (w) AS number,
        sum(value) OVER (w) AS sum
    FROM wnd_func_table
    WINDOW w AS (PARTITION BY group_id);

    The row_number and sum functions are calculated from data of each window separately (in this example, a window is a set of rows with the same values in the group_id column):

    ┌─group_id─┬─sort_id─┬─value─┬─number─┬─sum─┐
    │        1 │       1 │    10 │      1 │ 150 │
    │        1 │       2 │    20 │      2 │ 150 │
    │        1 │       3 │    30 │      3 │ 150 │
    │        1 │       4 │    40 │      4 │ 150 │
    │        1 │       5 │    50 │      5 │ 150 │
    │        2 │       1 │     1 │      1 │  36 │
    │        2 │       2 │     2 │      2 │  36 │
    │        2 │       3 │     3 │      3 │  36 │
    │        2 │       4 │     4 │      4 │  36 │
    │        2 │       4 │     5 │      5 │  36 │
    │        2 │       4 │     6 │      6 │  36 │
    │        2 │       5 │     7 │      7 │  36 │
    │        2 │       6 │     8 │      8 │  36 │
    └──────────┴─────────┴───────┴────────┴─────┘
  • ORDER BY — column that contains values used to order rows inside a window during a window function calculation.

    If ORDER BY is specified, the function is calculated for rows from the beginning of the window to the current row (including all rows with the same value in the field specified in the ORDER BY clause) according to the specified sort order. That is, for each row in the source dataset, it is selected a frame — a set of rows over which the function should be calculated (for example, aggregate functions used as window functions operate with frame rows, rather than with the entire window). You can use the RANGE or ROWS clause to specify another rule for defining a frame.

    Example

     
    The following query sets sorting by values of the sort_id column:

    SELECT
        group_id,
        sort_id,
        value,
        row_number() OVER (w) AS number,
        groupArray(value) OVER (w) AS frame_values,
        sum(value) OVER (w) AS sum
    FROM wnd_func_table
    WINDOW w AS (PARTITION BY group_id ORDER BY sort_id ASC);

    The output table includes the following additional columns with results of window functions:

    • number — row number within a window;

    • frame_values — the value column’s values in rows that belong to the frame for the current row;

    • sum — sum of the value column’s values in rows included into the frame to which the current row belongs (in this example — each value is summed up with all previous ones).

    ┌─group_id─┬─sort_id─┬─value─┬─number─┬─frame_values──────┬─sum─┐
    │        1 │       1 │    10 │      1 │ [10]              │  10 │
    │        1 │       2 │    20 │      2 │ [10,20]           │  30 │
    │        1 │       3 │    30 │      3 │ [10,20,30]        │  60 │
    │        1 │       4 │    40 │      4 │ [10,20,30,40]     │ 100 │
    │        1 │       5 │    50 │      5 │ [10,20,30,40,50]  │ 150 │
    │        2 │       1 │     1 │      1 │ [1]               │   1 │
    │        2 │       2 │     2 │      2 │ [1,2]             │   3 │
    │        2 │       3 │     3 │      3 │ [1,2,3]           │   6 │
    │        2 │       4 │     4 │      4 │ [1,2,3,4,5,6]     │  21 │
    │        2 │       4 │     5 │      5 │ [1,2,3,4,5,6]     │  21 │
    │        2 │       4 │     6 │      6 │ [1,2,3,4,5,6]     │  21 │
    │        2 │       5 │     7 │      7 │ [1,2,3,4,5,6,7]   │  28 │
    │        2 │       6 │     8 │      8 │ [1,2,3,4,5,6,7,8] │  36 │
    └──────────┴─────────┴───────┴────────┴───────────────────┴─────┘

    The sort order of rows in a window is not related to the query-level sort order.

ROWS

The ROWS clause allows you to define a frame by specifying a fixed number of rows preceding or following the current row in one of the following ways:

ROWS <frame_start>
ROWS BETWEEN <frame_start> AND <frame_end>

Bounds of a frame (<frame_start> and <frame_end>) can be defined with the following expressions.

CURRENT ROW

Frame starts or ends with the current row

UNBOUNDED PRECEDING

Frame starts with the first row of the window

UNBOUNDED FOLLOWING

Frame ends with the last row of the window

<N> PRECEDING

Frame includes <N> rows before the current row

<M> FOLLOWING

Frame includes <M> rows after the current row

If <frame_end> is omitted, a frame ends with CURRENT ROW by default.

Example — ROWS <frame_start>

 

A frame includes the current row and two previous ones:

SELECT
    group_id,
    sort_id,
    value,
    groupArray(value) OVER (PARTITION BY group_id ORDER BY sort_id ASC
            ROWS 2 PRECEDING) AS frame_values
FROM wnd_func_table;

The frame_values column contains values of the value column in all rows that fall into a frame defined for each row separately:

┌─group_id─┬─sort_id─┬─value─┬─frame_values─┐
│        1 │       1 │    10 │ [10]         │
│        1 │       2 │    20 │ [10,20]      │
│        1 │       3 │    30 │ [10,20,30]   │
│        1 │       4 │    40 │ [20,30,40]   │
│        1 │       5 │    50 │ [30,40,50]   │
│        2 │       1 │     1 │ [1]          │
│        2 │       2 │     2 │ [1,2]        │
│        2 │       3 │     3 │ [1,2,3]      │
│        2 │       4 │     4 │ [2,3,4]      │
│        2 │       4 │     5 │ [3,4,5]      │
│        2 │       4 │     6 │ [4,5,6]      │
│        2 │       5 │     7 │ [5,6,7]      │
│        2 │       6 │     8 │ [6,7,8]      │
└──────────┴─────────┴───────┴──────────────┘
Example — ROWS BETWEEN <frame_start> AND <frame_end>

 
A frame includes the current row and all subsequent rows until the end of the window:

SELECT
    group_id,
    sort_id,
    value,
    groupArray(value) OVER (PARTITION BY group_id ORDER BY sort_id ASC
        ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS frame_values
FROM wnd_func_table;
┌─group_id─┬─sort_id─┬─value─┬─frame_values──────┐
│        1 │       1 │    10 │ [10,20,30,40,50]  │
│        1 │       2 │    20 │ [20,30,40,50]     │
│        1 │       3 │    30 │ [30,40,50]        │
│        1 │       4 │    40 │ [40,50]           │
│        1 │       5 │    50 │ [50]              │
│        2 │       1 │     1 │ [1,2,3,4,5,6,7,8] │
│        2 │       2 │     2 │ [2,3,4,5,6,7,8]   │
│        2 │       3 │     3 │ [3,4,5,6,7,8]     │
│        2 │       4 │     4 │ [4,5,6,7,8]       │
│        2 │       4 │     5 │ [5,6,7,8]         │
│        2 │       4 │     6 │ [6,7,8]           │
│        2 │       5 │     7 │ [7,8]             │
│        2 │       6 │     8 │ [8]               │
└──────────┴─────────┴───────┴───────────────────┘

RANGE

The RANGE clause is also intended to specify a window frame. Unlike ROWS that operates with physical rows, RANGE operates with row ranges defined by the ORDER BY clause. This means that rows with the same values in ORDER BY columns are counted as one row. For the ROWS clause, each physical record in the dataset is considered a separate row.

There are two ways to use the RANGE clause:

RANGE <frame_start>
RANGE BETWEEN <frame_start> AND <frame_end>

You can use expressions listed in the table below to set frame bounds (<frame_start> and <frame_end>).

CURRENT ROW

Frame starts or ends with a range of rows with the same values in ORDER BY columns as the current row (all rows of the range are included in the frame)

UNBOUNDED PRECEDING

Frame starts with the first row of the window

UNBOUNDED FOLLOWING

Frame ends with the last row of the window

<offset> PRECEDING

Frame includes rows with the ORDER BY column’s values that are at most <offset> less than the value of the ORDER BY column in the current row. The <offset> PRECEDING expression can only be used if a single column is set in the ORDER BY clause

<offset> FOLLOWING

Frame includes rows with the ORDER BY column’s values that are at most <offset> greater than the value of the ORDER BY column in the current row. The <offset> FOLLOWING expression can only be used if a single column is set in the ORDER BY clause

If <frame_end> is omitted, a frame ends with CURRENT ROW by default.

If ORDER BY is specified in a query, but the frame is not explicitly described, the default frame is applied: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

Example — CURRENT ROW

 
A frame includes only the current row:

SELECT
    group_id,
    sort_id,
    value,
    groupArray(value) OVER (PARTITION BY group_id ORDER BY sort_id ASC
        RANGE BETWEEN CURRENT ROW AND CURRENT ROW) AS frame_values
FROM wnd_func_table;

In the second window (group_id=2), a range of rows with the same value in the sort_id field is treated as a single row:

┌─group_id─┬─sort_id─┬─value─┬─frame_values─┐
│        1 │       1 │    10 │ [10]         │
│        1 │       2 │    20 │ [20]         │
│        1 │       3 │    30 │ [30]         │
│        1 │       4 │    40 │ [40]         │
│        1 │       5 │    50 │ [50]         │
│        2 │       1 │     1 │ [1]          │
│        2 │       2 │     2 │ [2]          │
│        2 │       3 │     3 │ [3]          │
│        2 │       4 │     4 │ [4,5,6]      │
│        2 │       4 │     5 │ [4,5,6]      │
│        2 │       4 │     6 │ [4,5,6]      │
│        2 │       5 │     7 │ [7]          │
│        2 │       6 │     8 │ [8]          │
└──────────┴─────────┴───────┴──────────────┘
Example — <offset> PRECEDING/FOLLOWING

 

A frame for each row includes rows where the number value (the ORDER BY sort field) is at most 10 less than and at most 5 greater than the number value in the current row:

SELECT
    number,
    groupArray(number) OVER (ORDER BY number ASC
        RANGE BETWEEN 10 PRECEDING AND 5 FOLLOWING) AS frame_values
FROM values('number Int8', 10, 20, 25, 27, 30, 40, 15, 50, 60, 7, 5, 2);

In the resulting table, the frame_values column contains arrays of number values for the frame of each row:

┌─number─┬─frame_values─────┐
│      2 │ [2,5,7]          │
│      5 │ [2,5,7,10]       │
│      7 │ [2,5,7,10]       │
│     10 │ [2,5,7,10,15]    │
│     15 │ [5,7,10,15,20]   │
│     20 │ [10,15,20,25]    │
│     25 │ [15,20,25,27,30] │
│     27 │ [20,25,27,30]    │
│     30 │ [20,25,27,30]    │
│     40 │ [30,40]          │
│     50 │ [40,50]          │
│     60 │ [50,60]          │
└────────┴──────────────────┘

Function types

Window functions can be combined into the following groups:

  • Aggregate — aggregate functions (for example, sum, avg, count, min, max) can be used as window functions.

  • Ranking — functions that rank rows in a window. For example, such functions can be used to assign a sequence number to a row or to rank rows.

    Function Return value

    row_number()

    Row number within a window. Numbering starts from 1 and does not depend on whether rows have the same values in the ORDER BY column

    rank()

    Row rank. The same rank is returned for rows with the same values in the ORDER BY column, and N-1 rank values are skipped for the following rows, where N is the number of rows in the previous group of rows with the same values in the ORDER BY column (that is, ranking is with gaps)

    dense_rank()

    Row rank. The same rank is returned for rows with the same values in the ORDER BY column, and the numeric rank for the following rows is not skipped (that is, ranking is sequential)

  • Value functions — functions that allow you to access different rows of a frame relative to the current row, as well as values ​​of rows at the beginning or end of the frame.

    Function Return value

    lagInFrame(value, [N])

    leadInFrame(value, [N])

    Value of the value column from a row that precedes or follows the current row by N rows within the frame. The default value of N is 1

    first_value(value)

    last_value(value)

    Value of the value column from the first or last row of the frame

    nth_value(value, N)

    Value of the value column from the N-th row of the frame

Found a mistake? Seleсt text and press Ctrl+Enter to report it