Window and analytical functions in Hive
Hive window and analytical functions are used to perform calculations over a set of related rows (windows or partitions) in a result set. Unlike aggregate functions that collapse all column values to a single value, the outcome of a window function is one or more columns with calculation results added to the original result set. Window and analytical functions are extremely useful for analytical tasks and are essential solution to queries like "get top X products sold by category", "get maximum salary per department", "get first N rows from the result set", and alike.
All functions can be divided into the following groups:
-
Window functions, such as
LEAD
,LAG
,FIRST_VALUE
, etc. -
Analytical functions like
RANK
,ROW_NUMBER
,NTILE
, etc.
This section provides examples and recommendations on using major functions, more details on supported Hive window and analytical functions are available in Apache Hive documentation.
Test database
Throughout this article, the following Hive table transactions
is used to demonstrate the examples.
SELECT * FROM transactions; +----------------------+----------------------+--------------------------+------------------------+ | transactions.txn_id | transactions.acc_id | transactions.txn_amount | transactions.txn_date | +----------------------+----------------------+--------------------------+------------------------+ | 1 | 1002 | 10.00 | 2023-01-01 | | 2 | 1002 | 20.00 | 2023-01-03 | | 3 | 1002 | 30.00 | 2023-01-02 | | 4 | 1001 | 100.50 | 2023-01-02 | | 5 | 1001 | 150.50 | 2023-01-04 | | 6 | 1001 | 200.50 | 2023-01-03 | | 7 | 1003 | 50.00 | 2023-01-03 | | 8 | 1003 | 50.00 | 2023-01-01 | | 9 | 1003 | 75.00 | 2023-01-04 | +----------------------+----------------------+--------------------------+------------------------+
The following SQL can be used to create and populate the test Hive table using /bin/beeline.
DROP TABLE IF EXISTS `transactions`;
CREATE TABLE `transactions`(`txn_id` int, `acc_id` int, `txn_amount` decimal(10,2), `txn_date` date);
INSERT INTO transactions VALUES
(1, 1002, 10.00, '2023-01-01'),
(2, 1002, 20.00, '2023-01-03'),
(3, 1002, 30.00, '2023-01-02'),
(4, 1001, 100.50, '2023-01-02'),
(5, 1001, 150.50, '2023-01-04'),
(6, 1001, 200.50, '2023-01-03'),
(7, 1003, 50.00, '2023-01-03'),
(8, 1003, 50.00, '2023-01-01'),
(9, 1003, 75.00, '2023-01-04');
Syntax
The basic syntax for using window and analytical functions is as follows:
SELECT <func>([[, ...]]) OVER ([<window_spec>]) [<window_name>]
FROM <table_name>;
Where:
-
<func>
— a standard SQL aggregation function (e.g.AVG
) or a special window/analytical function. -
OVER
— indicates that the function should be calculated for specific windows within the result set. -
<window_spec>
— an expression that defines the criteria to split the result set into windows. More details on windows specification are available below. -
<window_name>
— an alias for the column with the function calculation results.
Window specification
A window specification tells Hive how to divide the result set fetched by SELECT
into windows.
An empty specification indicates that the entire result set should be treated as a single window and the function should run for all the rows.
The window specification syntax is as follows:
[PARTITION BY <col> [, ...]]
[ORDER BY <col> [ASC|DESC] [, ...]]
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
Where:
-
PARTITION BY <col>
— splits the result set into partitions based on the<col>
values. -
ORDER BY <col>
— orders the rows within each partition before running the function. -
ROWS | RANGE …
— specifies the upper/lower window frame bounds.
There is also an alternative syntax with placing windows specification in a separate WINDOW
clause:
SELECT <func>([[, ...]]) OVER <window_name>
FROM <table_name>
WINDOW <window_name> AS ([<window_spec>])
OVER keyword
At its simplest, a window function is defined by adding the OVER
clause to a SELECT
query.
A typical selection may include window/analytical functions or standard SQL aggregate functions (SUM
, AVG
, etc.).
In the following example, the empty OVER()
clause indicates that no explicit window is specified, so the entire result set returned by SELECT
should be treated as a window to calculate the average transaction amount.
SELECT *, AVG(txn_amount)
OVER() AS average_txn
FROM transactions;
The result contains an extra average_txn
column that holds the average value of all transactions:
+----------------------+----------------------+--------------------------+------------------------+--------------+ | transactions.txn_id | transactions.acc_id | transactions.txn_amount | transactions.txn_date | average_txn | +----------------------+----------------------+--------------------------+------------------------+--------------+ | 1 | 1002 | 10.00 | 2023-01-01 | 76.277778 | | 2 | 1002 | 20.00 | 2023-01-03 | 76.277778 | | 3 | 1002 | 30.00 | 2023-01-02 | 76.277778 | | 4 | 1001 | 100.50 | 2023-01-02 | 76.277778 | | 5 | 1001 | 150.50 | 2023-01-04 | 76.277778 | | 6 | 1001 | 200.50 | 2023-01-03 | 76.277778 | | 7 | 1003 | 50.00 | 2023-01-03 | 76.277778 | | 8 | 1003 | 50.00 | 2023-01-01 | 76.277778 | | 9 | 1003 | 75.00 | 2023-01-04 | 76.277778 | +----------------------+----------------------+--------------------------+------------------------+--------------+
OVER + PARTITION BY
By using the OVER (PARTITION BY col)
construct, you can split the result set into partitions so the function logic is applied for each individual partition.
For example, the following query calculates average transaction amount with respect to each account.
SELECT *, AVG(txn_amount)
OVER (PARTITION BY acc_id) AS average_txn_per_acc
FROM transactions;
The result set now includes average values calculated for individual accounts:
+----------------------+----------------------+--------------------------+------------------------+----------------------+ | transactions.txn_id | transactions.acc_id | transactions.txn_amount | transactions.txn_date | average_txn_per_acc | +----------------------+----------------------+--------------------------+------------------------+----------------------+ | 4 | 1001 | 100.50 | 2023-01-02 | 150.500000 | | 5 | 1001 | 150.50 | 2023-01-04 | 150.500000 | | 6 | 1001 | 200.50 | 2023-01-03 | 150.500000 | | 1 | 1002 | 10.00 | 2023-01-01 | 20.000000 | | 2 | 1002 | 20.00 | 2023-01-03 | 20.000000 | | 3 | 1002 | 30.00 | 2023-01-02 | 20.000000 | | 7 | 1003 | 50.00 | 2023-01-03 | 58.333333 | | 8 | 1003 | 50.00 | 2023-01-01 | 58.333333 | | 9 | 1003 | 75.00 | 2023-01-04 | 58.333333 | +----------------------+----------------------+--------------------------+------------------------+----------------------+
OVER + ORDER BY
With the ORDER
keyword, you can add sorting to your function.
For example, the following query calculates the total amount paid by each account at a specific day, sorting the rows by date.
SELECT *, SUM(txn_amount)
OVER (PARTITION BY acc_id ORDER BY txn_date) AS running_total_paid
FROM transactions;
For each account ID, the rows in the final result set are sorted by transaction date.
Notice that the running_total_paid
column stores running total values.
This occurs because adding ORDER BY
to the window specification implicitly uses the default RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
boundary.
Thus, when calculating an average for each row, the function uses the current row value + all previous values.
Without ORDER BY
the total sum in running_total_paid
would be the same for all the records.
+----------------------+----------------------+--------------------------+------------------------+---------------------+ | transactions.txn_id | transactions.acc_id | transactions.txn_amount | transactions.txn_date | running_total_paid | +----------------------+----------------------+--------------------------+------------------------+---------------------+ | 4 | 1001 | 100.50 | 2023-01-02 | 100.50 | | 6 | 1001 | 200.50 | 2023-01-03 | 301.00 | | 5 | 1001 | 150.50 | 2023-01-04 | 451.50 | | 1 | 1002 | 10.00 | 2023-01-01 | 10.00 | | 3 | 1002 | 30.00 | 2023-01-02 | 40.00 | | 2 | 1002 | 20.00 | 2023-01-03 | 60.00 | | 8 | 1003 | 50.00 | 2023-01-01 | 50.00 | | 7 | 1003 | 50.00 | 2023-01-03 | 100.00 | | 9 | 1003 | 75.00 | 2023-01-04 | 175.00 | +----------------------+----------------------+--------------------------+------------------------+---------------------+
Window functions
LEAD and LAG
The LEAD
/LAG
are similar functions that allow you to pull data from next/previous rows respectively.
These can be useful for comparing adjacent column values or for calculating the difference between rows.
The following query groups the selection results by account, and for each row, returns an ID of the next transaction available in the partition group.
SELECT txn_id, txn_amount, txn_date, LEAD(txn_id)
OVER (PARTITION BY acc_id ORDER BY txn_date) AS next_txn_id
FROM transactions;
The result is shown below.
NULL
values in the next_txn_id
indicate the window boundary i.e. no more next rows are available for the current partition group.
+---------+-------------+-------------+--------------+ | txn_id | txn_amount | txn_date | next_txn_id | +---------+-------------+-------------+--------------+ | 4 | 100.50 | 2023-01-02 | 6 | | 6 | 200.50 | 2023-01-03 | 5 | | 5 | 150.50 | 2023-01-04 | NULL | | 1 | 10.00 | 2023-01-01 | 3 | | 3 | 30.00 | 2023-01-02 | 2 | | 2 | 20.00 | 2023-01-03 | NULL | | 8 | 50.00 | 2023-01-01 | 7 | | 7 | 50.00 | 2023-01-03 | 9 | | 9 | 75.00 | 2023-01-04 | NULL | +---------+-------------+-------------+--------------+
By default, LEAD
and LAG
functions return column values that are 1 step away from the current row and return NULL
if the partition boundary is reached.
You can change the offset and the out-of-boundary indicating value using the following signature:
LEAD(col, i, j)
LAG(col, i, j)
Where:
-
col
— the name of a column to get the next/previous value. -
i
— the offset to get the next/previous value. -
j
— the value to be returned instead ofNULL
when reaching a partition boundary.
ROW_NUMBER
The ROW_NUMBER()
function assigns a unique sequence number to each row within a partition according to the partition and order rules.
The following sample query shows how to get the first 2 transactions committed by each account.
SELECT * FROM
(SELECT *, ROW_NUMBER()
OVER (PARTITION BY acc_id ORDER BY txn_date) AS row_num
FROM transactions) tx
WHERE tx.row_num < 3;
The result tx.row_num
column indicates the first two transactions committed by each account chronologically:
+------------+------------+----------------+--------------+-------------+ | tx.txn_id | tx.acc_id | tx.txn_amount | tx.txn_date | tx.row_num | +------------+------------+----------------+--------------+-------------+ | 4 | 1001 | 100.50 | 2023-01-02 | 1 | | 6 | 1001 | 200.50 | 2023-01-03 | 2 | | 1 | 1002 | 10.00 | 2023-01-01 | 1 | | 3 | 1002 | 30.00 | 2023-01-02 | 2 | | 8 | 1003 | 50.00 | 2023-01-01 | 1 | | 7 | 1003 | 50.00 | 2023-01-03 | 2 | +------------+------------+----------------+--------------+-------------+
FIRST_VALUE and LAST_VALUE
The FIRST_VALUE
/LAST_VALUE
functions return a column value from the first/last row of a partition.
In the example below, the query gets the lowest transaction ID values for each account partition.
SELECT txn_id, acc_id, FIRST_VALUE(txn_id)
OVER (PARTITION BY acc_id) lowest_id_per_acc
FROM transactions;
The result:
+---------+---------+--------------------+ | txn_id | acc_id | lowest_id_per_acc | +---------+---------+--------------------+ | 4 | 1001 | 4 | | 5 | 1001 | 4 | | 6 | 1001 | 4 | | 1 | 1002 | 1 | | 2 | 1002 | 1 | | 3 | 1002 | 1 | | 7 | 1003 | 7 | | 8 | 1003 | 7 | | 9 | 1003 | 7 | +---------+---------+--------------------+
Both FIRST_VALUE
/LAST_VALUE
functions have an optional Boolean parameter that defaults to false
.
If set to true
, Hive will ignore rows with NULL
column values.
Analytical functions
RANK and DENSE_RANK
The RANK
function assigns rank to each row in a partition.
In case of duplicates, the function returns equal ranks for such rows, skipping the subsequent rank number.
In the following example, RANK
calculates the rank of each transaction among all other transactions (no PARTITION BY
specified).
To calculate the transaction ranks for each account, the PARTITION BY acc_id
expression should be added.
SELECT txn_amount, acc_id, RANK()
OVER (ORDER BY txn_amount DESC) rank_global
-- per-account rank:
-- OVER (PARTITION BY acc_id ORDER BY txn_amount DESC) rank_per_acc
FROM transactions;
The result:
+-------------+---------+--------------+ | txn_amount | acc_id | rank_global | +-------------+---------+--------------+ | 200.50 | 1001 | 1 | | 150.50 | 1001 | 2 | | 100.50 | 1001 | 3 | | 75.00 | 1003 | 4 | | 50.00 | 1003 | 5 | | 50.00 | 1003 | 5 | | 30.00 | 1002 | 7 | | 20.00 | 1002 | 8 | | 10.00 | 1002 | 9 | +-------------+---------+--------------+
Notice that the result set has two transactions with the same rank (5
) as their txn_amount
values are equal.
Since these two rows have been assigned the same rank 5
, the subsequent rank number 6
is absent in the result set, and the next available rank number is 7
.
To avoid such "holes" in the ranks sequence, use the DENSE_RANK
function.
The previous example returns the following result if DENSE_RANK
is used instead of RANK
:
+-------------+---------+--------------+ | txn_amount | acc_id | rank_global | +-------------+---------+--------------+ | 200.50 | 1001 | 1 | | 150.50 | 1001 | 2 | | 100.50 | 1001 | 3 | | 75.00 | 1003 | 4 | | 50.00 | 1003 | 5 | | 50.00 | 1003 | 5 | | 30.00 | 1002 | 6 | | 20.00 | 1002 | 7 | | 10.00 | 1002 | 8 | +-------------+---------+--------------+
PERCENT_RANK
The PERCENT_RANK
function calculates a percentage rank (a number from 0 to 1 inclusive) for each row within a partition group.
The ORDER BY
clause is mandatory in the window specification.
The function produces a column with an ascending numeric sequence that represents the position of each row within a partition.
The first row in each partition group always has the value of 0
, and the last row is always 1
.
For example, the following query ranks the transaction committed by the same account.
SELECT txn_id, acc_id, txn_amount, PERCENT_RANK()
OVER (PARTITION BY acc_id ORDER BY txn_amount) position_within_acc
FROM transactions;
The result set contains an extra column position_within_acc
with a ranking number:
+---------+---------+-------------+-------------------+ | txn_id | acc_id | txn_amount |position_within_acc| +---------+---------+-------------+-------------------+ | 4 | 1001 | 100.50 | 0.0 | | 5 | 1001 | 150.50 | 0.5 | | 6 | 1001 | 200.50 | 1.0 | | 1 | 1002 | 10.00 | 0.0 | | 2 | 1002 | 20.00 | 0.5 | | 3 | 1002 | 30.00 | 1.0 | | 7 | 1003 | 50.00 | 0.0 | | 8 | 1003 | 50.00 | 0.0 | | 9 | 1003 | 75.00 | 1.0 | +---------+---------+-------------+-------------------+
NTILE
The NTILE(n)
function divides an ordered result set into n
roughly equal parts (buckets) and for each row, assigns an integer, indicating which bucket the given row falls into.
The ORDER BY
clause is mandatory in the window specification.
With this universal function, you can calculate statistical notions like median, percentile, quartile, etc.
The following example splits all the transactions into 4 groups (quartiles) based on the transaction amount and identifies which quartile each row falls into.
Notice that the partition gets sorted by txn_amount
from lowest to highest, thus, smallest transactions belong to the first quartile.
SELECT txn_id, txn_amount, NTILE(4)
OVER (ORDER BY txn_amount) AS quartile
FROM transactions;
The result:
+---------+-------------+-----------+ | txn_id | txn_amount | quartile | +---------+-------------+-----------+ | 1 | 10.00 | 1 | | 2 | 20.00 | 1 | | 3 | 30.00 | 1 | | 7 | 50.00 | 2 | | 8 | 50.00 | 2 | | 9 | 75.00 | 3 | | 4 | 100.50 | 3 | | 5 | 150.50 | 4 | | 6 | 200.50 | 4 | +---------+-------------+-----------+