Конференция Arenadata
Новое время — новый Greenplum
Мы приглашаем вас принять участие в конференции, посвященной будущему Open-Source Greenplum 19 сентября в 18:00:00 UTC +3. Встреча будет проходить в гибридном формате — и офлайн, и онлайн. Онлайн-трансляция будет доступна для всех желающих.
Внезапное закрытие Greenplum его владельцем — компанией Broadcom - стало неприятным сюрпризом для всех, кто использует или планирует начать использовать решения на базе этой технологии. Многие ожидают выхода стабильной версии Greenplum 7 и надеются на её дальнейшее активное развитие.
Arenadata не могла допустить, чтобы разрабатываемый годами Open-Source проект Greenplum прекратил своё существование, поэтому 19 сентября мы представим наш ответ на данное решение Broadcom, а участники сообщества получат исчерпывающие разъяснения на все вопросы о дальнейшей судьбе этой технологии.

На конференции вас ждёт обсуждение следующих тем:

  • План возрождения Greenplum;
  • Дорожная карта;
  • Экспертное обсуждение и консультации.
Осталось до события

Window and analytical functions in Hive

Hive window and analytical functions are used to perform calculations over a set of related rows (windows or partitions) in a result set. Unlike aggregate functions that collapse all column values to a single value, the outcome of a window function is one or more columns with calculation results added to the original result set. Window and analytical functions are extremely useful for analytical tasks and are essential solution to queries like "get top X products sold by category", "get maximum salary per department", "get first N rows from the result set", and alike.

All functions can be divided into the following groups:

This section provides examples and recommendations on using major functions, more details on supported Hive window and analytical functions are available in Apache Hive documentation.

Test database

Throughout this article, the following Hive table transactions is used to demonstrate the examples.

SELECT * FROM transactions;
+----------------------+----------------------+--------------------------+------------------------+
| transactions.txn_id  | transactions.acc_id  | transactions.txn_amount  | transactions.txn_date  |
+----------------------+----------------------+--------------------------+------------------------+
| 1                    | 1002                 | 10.00                    | 2023-01-01             |
| 2                    | 1002                 | 20.00                    | 2023-01-03             |
| 3                    | 1002                 | 30.00                    | 2023-01-02             |
| 4                    | 1001                 | 100.50                   | 2023-01-02             |
| 5                    | 1001                 | 150.50                   | 2023-01-04             |
| 6                    | 1001                 | 200.50                   | 2023-01-03             |
| 7                    | 1003                 | 50.00                    | 2023-01-03             |
| 8                    | 1003                 | 50.00                    | 2023-01-01             |
| 9                    | 1003                 | 75.00                    | 2023-01-04             |
+----------------------+----------------------+--------------------------+------------------------+
SQL for creating test table

 
The following SQL can be used to create and populate the test Hive table using /bin/beeline.

DROP TABLE IF EXISTS `transactions`;
CREATE TABLE `transactions`(`txn_id` int, `acc_id` int, `txn_amount` decimal(10,2), `txn_date` date);
INSERT INTO transactions VALUES
(1, 1002, 10.00, '2023-01-01'),
(2, 1002, 20.00, '2023-01-03'),
(3, 1002, 30.00, '2023-01-02'),
(4, 1001, 100.50, '2023-01-02'),
(5, 1001, 150.50, '2023-01-04'),
(6, 1001, 200.50, '2023-01-03'),
(7, 1003, 50.00, '2023-01-03'),
(8, 1003, 50.00, '2023-01-01'),
(9, 1003, 75.00, '2023-01-04');

Syntax

The basic syntax for using window and analytical functions is as follows:

SELECT <func>([[, ...]]) OVER ([<window_spec>]) [<window_name>]
FROM <table_name>;

Where:

  • <func> — a standard SQL aggregation function (e.g. AVG) or a special window/analytical function.

  • OVER — indicates that the function should be calculated for specific windows within the result set.

  • <window_spec> — an expression that defines the criteria to split the result set into windows. More details on windows specification are available below.

  • <window_name> — an alias for the column with the function calculation results.

Window specification

A window specification tells Hive how to divide the result set fetched by SELECT into windows. An empty specification indicates that the entire result set should be treated as a single window and the function should run for all the rows. The window specification syntax is as follows:

[PARTITION BY <col> [, ...]]
[ORDER BY <col> [ASC|DESC] [, ...]]
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)

Where:

  • PARTITION BY <col> — splits the result set into partitions based on the <col> values.

  • ORDER BY <col> — orders the rows within each partition before running the function.

  • ROWS | RANGE …​ — specifies the upper/lower window frame bounds.

There is also an alternative syntax with placing windows specification in a separate WINDOW clause:

SELECT <func>([[, ...]]) OVER <window_name>
FROM <table_name>
WINDOW <window_name> AS ([<window_spec>])

OVER keyword

At its simplest, a window function is defined by adding the OVER clause to a SELECT query. A typical selection may include window/analytical functions or standard SQL aggregate functions (SUM, AVG, etc.).

In the following example, the empty OVER() clause indicates that no explicit window is specified, so the entire result set returned by SELECT should be treated as a window to calculate the average transaction amount.

SELECT *, AVG(txn_amount)
OVER() AS average_txn
FROM transactions;

The result contains an extra average_txn column that holds the average value of all transactions:

+----------------------+----------------------+--------------------------+------------------------+--------------+
| transactions.txn_id  | transactions.acc_id  | transactions.txn_amount  | transactions.txn_date  | average_txn  |
+----------------------+----------------------+--------------------------+------------------------+--------------+
| 1                    | 1002                 | 10.00                    | 2023-01-01             | 76.277778    |
| 2                    | 1002                 | 20.00                    | 2023-01-03             | 76.277778    |
| 3                    | 1002                 | 30.00                    | 2023-01-02             | 76.277778    |
| 4                    | 1001                 | 100.50                   | 2023-01-02             | 76.277778    |
| 5                    | 1001                 | 150.50                   | 2023-01-04             | 76.277778    |
| 6                    | 1001                 | 200.50                   | 2023-01-03             | 76.277778    |
| 7                    | 1003                 | 50.00                    | 2023-01-03             | 76.277778    |
| 8                    | 1003                 | 50.00                    | 2023-01-01             | 76.277778    |
| 9                    | 1003                 | 75.00                    | 2023-01-04             | 76.277778    |
+----------------------+----------------------+--------------------------+------------------------+--------------+

OVER + PARTITION BY

By using the OVER (PARTITION BY col) construct, you can split the result set into partitions so the function logic is applied for each individual partition.

For example, the following query calculates average transaction amount with respect to each account.

SELECT *, AVG(txn_amount)
OVER (PARTITION BY acc_id) AS average_txn_per_acc
FROM transactions;

The result set now includes average values calculated for individual accounts:

+----------------------+----------------------+--------------------------+------------------------+----------------------+
| transactions.txn_id  | transactions.acc_id  | transactions.txn_amount  | transactions.txn_date  | average_txn_per_acc  |
+----------------------+----------------------+--------------------------+------------------------+----------------------+
| 4                    | 1001                 | 100.50                   | 2023-01-02             | 150.500000           |
| 5                    | 1001                 | 150.50                   | 2023-01-04             | 150.500000           |
| 6                    | 1001                 | 200.50                   | 2023-01-03             | 150.500000           |
| 1                    | 1002                 | 10.00                    | 2023-01-01             | 20.000000            |
| 2                    | 1002                 | 20.00                    | 2023-01-03             | 20.000000            |
| 3                    | 1002                 | 30.00                    | 2023-01-02             | 20.000000            |
| 7                    | 1003                 | 50.00                    | 2023-01-03             | 58.333333            |
| 8                    | 1003                 | 50.00                    | 2023-01-01             | 58.333333            |
| 9                    | 1003                 | 75.00                    | 2023-01-04             | 58.333333            |
+----------------------+----------------------+--------------------------+------------------------+----------------------+

OVER + ORDER BY

With the ORDER keyword, you can add sorting to your function.

For example, the following query calculates the total amount paid by each account at a specific day, sorting the rows by date.

SELECT *, SUM(txn_amount)
OVER (PARTITION BY acc_id ORDER BY txn_date) AS running_total_paid
FROM transactions;

For each account ID, the rows in the final result set are sorted by transaction date. Notice that the running_total_paid column stores running total values. This occurs because adding ORDER BY to the window specification implicitly uses the default RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW boundary. Thus, when calculating an average for each row, the function uses the current row value + all previous values. Without ORDER BY the total sum in running_total_paid would be the same for all the records.

+----------------------+----------------------+--------------------------+------------------------+---------------------+
| transactions.txn_id  | transactions.acc_id  | transactions.txn_amount  | transactions.txn_date  | running_total_paid  |
+----------------------+----------------------+--------------------------+------------------------+---------------------+
| 4                    | 1001                 | 100.50                   | 2023-01-02             | 100.50              |
| 6                    | 1001                 | 200.50                   | 2023-01-03             | 301.00              |
| 5                    | 1001                 | 150.50                   | 2023-01-04             | 451.50              |
| 1                    | 1002                 | 10.00                    | 2023-01-01             | 10.00               |
| 3                    | 1002                 | 30.00                    | 2023-01-02             | 40.00               |
| 2                    | 1002                 | 20.00                    | 2023-01-03             | 60.00               |
| 8                    | 1003                 | 50.00                    | 2023-01-01             | 50.00               |
| 7                    | 1003                 | 50.00                    | 2023-01-03             | 100.00              |
| 9                    | 1003                 | 75.00                    | 2023-01-04             | 175.00              |
+----------------------+----------------------+--------------------------+------------------------+---------------------+

Window functions

LEAD and LAG

The LEAD/LAG are similar functions that allow you to pull data from next/previous rows respectively. These can be useful for comparing adjacent column values or for calculating the difference between rows.

The following query groups the selection results by account, and for each row, returns an ID of the next transaction available in the partition group.

SELECT txn_id, txn_amount, txn_date, LEAD(txn_id)
OVER (PARTITION BY acc_id ORDER BY txn_date) AS next_txn_id
FROM transactions;

The result is shown below. NULL values in the next_txn_id indicate the window boundary i.e. no more next rows are available for the current partition group.

+---------+-------------+-------------+--------------+
| txn_id  | txn_amount  |  txn_date   | next_txn_id  |
+---------+-------------+-------------+--------------+
| 4       | 100.50      | 2023-01-02  | 6            |
| 6       | 200.50      | 2023-01-03  | 5            |
| 5       | 150.50      | 2023-01-04  | NULL         |
| 1       | 10.00       | 2023-01-01  | 3            |
| 3       | 30.00       | 2023-01-02  | 2            |
| 2       | 20.00       | 2023-01-03  | NULL         |
| 8       | 50.00       | 2023-01-01  | 7            |
| 7       | 50.00       | 2023-01-03  | 9            |
| 9       | 75.00       | 2023-01-04  | NULL         |
+---------+-------------+-------------+--------------+

By default, LEAD and LAG functions return column values that are 1 step away from the current row and return NULL if the partition boundary is reached. You can change the offset and the out-of-boundary indicating value using the following signature:

LEAD(col, i, j)
LAG(col, i, j)

Where:

  • col — the name of a column to get the next/previous value.

  • i — the offset to get the next/previous value.

  • j — the value to be returned instead of NULL when reaching a partition boundary.

ROW_NUMBER

The ROW_NUMBER() function assigns a unique sequence number to each row within a partition according to the partition and order rules.

The following sample query shows how to get the first 2 transactions committed by each account.

SELECT * FROM
     (SELECT *, ROW_NUMBER()
      OVER (PARTITION BY acc_id ORDER BY txn_date) AS row_num
      FROM transactions) tx
WHERE tx.row_num < 3;

The result tx.row_num column indicates the first two transactions committed by each account chronologically:

+------------+------------+----------------+--------------+-------------+
| tx.txn_id  | tx.acc_id  | tx.txn_amount  | tx.txn_date  | tx.row_num  |
+------------+------------+----------------+--------------+-------------+
| 4          | 1001       | 100.50         | 2023-01-02   | 1           |
| 6          | 1001       | 200.50         | 2023-01-03   | 2           |
| 1          | 1002       | 10.00          | 2023-01-01   | 1           |
| 3          | 1002       | 30.00          | 2023-01-02   | 2           |
| 8          | 1003       | 50.00          | 2023-01-01   | 1           |
| 7          | 1003       | 50.00          | 2023-01-03   | 2           |
+------------+------------+----------------+--------------+-------------+

FIRST_VALUE and LAST_VALUE

The FIRST_VALUE/LAST_VALUE functions return a column value from the first/last row of a partition.

In the example below, the query gets the lowest transaction ID values for each account partition.

SELECT txn_id, acc_id, FIRST_VALUE(txn_id)
OVER (PARTITION BY acc_id) lowest_id_per_acc
FROM transactions;

The result:

+---------+---------+--------------------+
| txn_id  | acc_id  | lowest_id_per_acc  |
+---------+---------+--------------------+
| 4       | 1001    | 4                  |
| 5       | 1001    | 4                  |
| 6       | 1001    | 4                  |
| 1       | 1002    | 1                  |
| 2       | 1002    | 1                  |
| 3       | 1002    | 1                  |
| 7       | 1003    | 7                  |
| 8       | 1003    | 7                  |
| 9       | 1003    | 7                  |
+---------+---------+--------------------+

Both FIRST_VALUE/LAST_VALUE functions have an optional Boolean parameter that defaults to false. If set to true, Hive will ignore rows with NULL column values.

Analytical functions

RANK and DENSE_RANK

The RANK function assigns rank to each row in a partition. In case of duplicates, the function returns equal ranks for such rows, skipping the subsequent rank number.

In the following example, RANK calculates the rank of each transaction among all other transactions (no PARTITION BY specified). To calculate the transaction ranks for each account, the PARTITION BY acc_id expression should be added.

SELECT txn_amount, acc_id, RANK()
OVER (ORDER BY txn_amount DESC) rank_global
-- per-account rank:
-- OVER (PARTITION BY acc_id ORDER BY txn_amount DESC) rank_per_acc
FROM transactions;

The result:

+-------------+---------+--------------+
| txn_amount  | acc_id  | rank_global  |
+-------------+---------+--------------+
| 200.50      | 1001    | 1            |
| 150.50      | 1001    | 2            |
| 100.50      | 1001    | 3            |
| 75.00       | 1003    | 4            |
| 50.00       | 1003    | 5            |
| 50.00       | 1003    | 5            |
| 30.00       | 1002    | 7            |
| 20.00       | 1002    | 8            |
| 10.00       | 1002    | 9            |
+-------------+---------+--------------+

Notice that the result set has two transactions with the same rank (5) as their txn_amount values are equal. Since these two rows have been assigned the same rank 5, the subsequent rank number 6 is absent in the result set, and the next available rank number is 7. To avoid such "holes" in the ranks sequence, use the DENSE_RANK function. The previous example returns the following result if DENSE_RANK is used instead of RANK:

+-------------+---------+--------------+
| txn_amount  | acc_id  | rank_global  |
+-------------+---------+--------------+
| 200.50      | 1001    | 1            |
| 150.50      | 1001    | 2            |
| 100.50      | 1001    | 3            |
| 75.00       | 1003    | 4            |
| 50.00       | 1003    | 5            |
| 50.00       | 1003    | 5            |
| 30.00       | 1002    | 6            |
| 20.00       | 1002    | 7            |
| 10.00       | 1002    | 8            |
+-------------+---------+--------------+

PERCENT_RANK

The PERCENT_RANK function calculates a percentage rank (a number from 0 to 1 inclusive) for each row within a partition group. The ORDER BY clause is mandatory in the window specification. The function produces a column with an ascending numeric sequence that represents the position of each row within a partition. The first row in each partition group always has the value of 0, and the last row is always 1.

For example, the following query ranks the transaction committed by the same account.

SELECT txn_id, acc_id, txn_amount, PERCENT_RANK()
OVER (PARTITION BY acc_id ORDER BY txn_amount) position_within_acc
FROM transactions;

The result set contains an extra column position_within_acc with a ranking number:

+---------+---------+-------------+-------------------+
| txn_id  | acc_id  | txn_amount  |position_within_acc|
+---------+---------+-------------+-------------------+
| 4       | 1001    | 100.50      | 0.0               |
| 5       | 1001    | 150.50      | 0.5               |
| 6       | 1001    | 200.50      | 1.0               |
| 1       | 1002    | 10.00       | 0.0               |
| 2       | 1002    | 20.00       | 0.5               |
| 3       | 1002    | 30.00       | 1.0               |
| 7       | 1003    | 50.00       | 0.0               |
| 8       | 1003    | 50.00       | 0.0               |
| 9       | 1003    | 75.00       | 1.0               |
+---------+---------+-------------+-------------------+

NTILE

The NTILE(n) function divides an ordered result set into n roughly equal parts (buckets) and for each row, assigns an integer, indicating which bucket the given row falls into. The ORDER BY clause is mandatory in the window specification. With this universal function, you can calculate statistical notions like median, percentile, quartile, etc.

The following example splits all the transactions into 4 groups (quartiles) based on the transaction amount and identifies which quartile each row falls into. Notice that the partition gets sorted by txn_amount from lowest to highest, thus, smallest transactions belong to the first quartile.

SELECT txn_id, txn_amount, NTILE(4)
OVER (ORDER BY txn_amount) AS quartile
FROM transactions;

The result:

+---------+-------------+-----------+
| txn_id  | txn_amount  | quartile  |
+---------+-------------+-----------+
| 1       | 10.00       | 1         |
| 2       | 20.00       | 1         |
| 3       | 30.00       | 1         |
| 7       | 50.00       | 2         |
| 8       | 50.00       | 2         |
| 9       | 75.00       | 3         |
| 4       | 100.50      | 3         |
| 5       | 150.50      | 4         |
| 6       | 200.50      | 4         |
+---------+-------------+-----------+
Found a mistake? Seleсt text and press Ctrl+Enter to report it