Aggregate functions in Hive
Hive aggregate functions are built-in functions that accept a set of values and return a single value. This section provides examples and tips on using basic and advanced Hive aggregation features. The full list of aggregate functions supported in Hive by default is available in Apache Hive documentation.
Test database
Throughout this article, the following Hive tables employees
and transactions
are used to demonstrate aggregation examples.
SELECT * FROM employees; +---------------+-----------------+----------------------+----------------------------+----------------+-------------------+--------------------+ | employees.id | employees.name | employees.last_name | employees.email | employees.age | employees.salary | employees.dept_id | +---------------+-----------------+----------------------+----------------------------+----------------+-------------------+--------------------+ | 1 | Ivan | Ivanov | ivan_ivanov123@mail.ru | 30 | 1000 | 1 | | 2 | Sarah | Connor | sarah_connor123@yahoo.com | 35 | 1500 | 2 | | 3 | Rick | Sanchez | rick_123@mail.ru | 29 | 1300 | 2 | | 4 | John | Smith | john_smith123@gmail.com | 29 | 2000 | 3 | +---------------+-----------------+----------------------+----------------------------+----------------+-------------------+--------------------+ SELECT * FROM transactions; +----------------------+----------------------+--------------------------+------------------------+ | transactions.txn_id | transactions.acc_id | transactions.txn_amount | transactions.txn_date | +----------------------+----------------------+--------------------------+------------------------+ | 1 | 101 | 10.20 | 2023-10-26 | | 2 | 101 | 102.30 | 2023-10-26 | | 3 | 102 | 95.00 | 2023-10-26 | | 4 | 103 | 900.50 | 2023-10-27 | | 5 | 102 | 910.50 | 2023-10-27 | | 6 | 101 | 850.00 | 2023-10-27 | +----------------------+----------------------+--------------------------+------------------------+
The following SQL can be used to create and populate test Hive tables using /bin/beeline.
CREATE TABLE IF NOT EXISTS employees (id int, name string, last_name string, email string, age int, salary int, dept_id int)
COMMENT 'Employee Table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
INSERT INTO employees VALUES
(1, 'Ivan', 'Ivanov', 'ivan_ivanov123@mail.ru', 30, 1000, 1),
(2, 'Sarah', 'Connor', 'sarah_connor123@yahoo.com', 35, 1500, 2),
(3, 'Rick', 'Sanchez', 'rick_123@mail.ru', 29, 1300, 2),
(4, 'John', 'Smith', 'john_smith123@gmail.com', 29, 2000, 3);
CREATE TABLE IF NOT EXISTS transactions(txn_id int, acc_id int, txn_amount decimal(10,2), txn_date date)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe';
INSERT INTO transactions VALUES
(1, 101, 10.20, '2023-10-26'),
(2, 101, 102.30, '2023-10-26'),
(3, 102, 95.00, '2023-10-26'),
(4, 103, 900.50, '2023-10-27'),
(5, 102, 910.50, '2023-10-27'),
(6, 101, 850.00, '2023-10-27');
Basic aggregation
The COUNT(col)
function returns the total number of rows in the result set.
SELECT COUNT(id) AS emp_total_cnt FROM employees;
The result:
+----------------+ | emp_total_cnt | +----------------+ | 4 | +----------------+
TIP
Using the DISTINCT keyword, you can perform aggregation only on unique column values.
|
The MIN()
and MAX()
functions return the minimum/maximum value of a numeric column.
SELECT MIN(salary) from employees;
The output:
+-------+ | _c0 | +-------+ | 1000 | +-------+
The AVG()
function returns an average value of a numeric column.
Currently, aggregate functions are not supported inside the WHERE
clause.
For example, the following query will produce an error.
SELECT e.salary, e.age
FROM employees e
WHERE e.age > AVG(e.age);
The error message:
Error: Error while compiling statement: FAILED: SemanticException [Error 10128]: Line 3:14 Not yet supported place for UDAF 'avg' (state=42000,code=10128)
As a workaround for this case, you can nest AVG()
inside a subquery as shown below.
SELECT e.age, e.last_name
FROM employees e
WHERE e.age > (SELECT AVG(age) FROM employees);
The successful result:
+--------+--------------+ | e.age | e.last_name | +--------+--------------+ | 35 | Connor | +--------+--------------+
More Hive aggregation functions
Apart from standard SQL aggregation functions like SUM
, AVG
, etc., Hive provides other functions that may come in handy for analyzing your datasets.
For example, using collect_*(col)
, you can collapse retrieved column values to a list or a set.
Consider the example below.
SELECT collect_set(age) FROM employees; (1)
SELECT d.dep_name AS department_name, collect_list(e.age) AS ages (2)
FROM employees e
JOIN departments d
ON e.dept_id=d.id
GROUP BY d.dep_name;
1 | Returns a single row with a set of unique values. |
2 | For each department, returns a list of values. |
The result:
+-------------+ | _c0 | +-------------+ | [30,35,29] | +-------------+ +------------------+----------+ | department_name | ages | +------------------+----------+ | it | [35,29] | | sales | [30] | | support | [29] | +------------------+----------+
Hive also provides special functions for calculating mathematical entities like deviation, variance, and many others. A few examples below.
SELECT variance(salary) FROM employees; (1)
SELECT var_pop(salary) FROM employees; (1)
SELECT var_samp(salary) FROM employees; (2)
SELECT stddev_pop(salary) FROM employees; (3)
SELECT stddev_samp(salary) FROM employees; (4)
1 | The queries return the statistical variance of the numeric salary column. |
2 | Returns the unbiased sample variance for the column. |
3 | Returns the statistical standard deviation of all values for the column. |
4 | Returns the unbiased sample standard deviation of the numeric column. |
The result:
SELECT variance(salary) FROM employees; +-----------+ | _c0 | +-----------+ | 132500.0 | +-----------+ SELECT var_pop(salary) FROM employees; +-----------+ | _c0 | +-----------+ | 132500.0 | +-----------+ SELECT var_samp(salary) FROM employees; +---------------------+ | _c0 | +---------------------+ | 176666.66666666666 | +---------------------+ SELECT stddev_pop(salary) FROM employees; +--------------------+ | _c0 | +--------------------+ | 364.0054944640259 | +--------------------+ SELECT stddev_samp(salary) FROM employees; +---------------------+ | _c0 | +---------------------+ | 420.31734043061635 | +---------------------+
The full list of Hive aggregate functions is available in Apache Hive documentation.
Advanced aggregation
GROUPING SETS
The GROUPING SETS
clause used with GROUP BY
allows you to apply several GROUP BY
criteria for the same record set.
Using several grouping sets (e.g. GROUPING SETS (a,b)
) is equivalent to running multiple GROUP BY a
, GROUP BY b
queries united by UNION
.
However, fetching results via grouping sets is done in one table scan as opposed to running several queries with different GROUP BY
clauses.
The following example specifies a single grouping set (GROUPING SETS (acc_id)
), thus, aggregating the results by account ID and not by transactions date.
SELECT acc_id, txn_date, SUM(txn_amount) as total_paid
FROM transactions
GROUP BY acc_id, txn_date
GROUPING SETS (acc_id);
The result of such query is the total amount paid by each account.
NULL
values in the txn_date
column indicate that no aggregation by date was applied.
+---------+-----------+-------------+ | acc_id | txn_date | total_paid | +---------+-----------+-------------+ | 101 | NULL | 962.50 | | 102 | NULL | 1005.50 | | 103 | NULL | 900.50 | +---------+-----------+-------------+
The following example specifies multiple grouping sets (GROUPING SETS (acc_id, txn_date)
).
SELECT acc_id, txn_date, SUM(txn_amount) as total_paid, grouping__id
FROM transactions
GROUP BY acc_id, txn_date
GROUPING SETS (acc_id, txn_date);
In this case, the result set consists of two subsets grouped by txn_date
and acc_id
:
+---------+-------------+-------------+ | acc_id | txn_date | total_paid | +---------+-------------+-------------+ | NULL | 2023-10-26 | 207.50 | | NULL | 2023-10-27 | 2661.00 | | 101 | NULL | 962.50 | | 102 | NULL | 1005.50 | | 103 | NULL | 900.50 | +---------+-------------+-------------+
NULL
values indicate that no aggregation was performed on the given column.
Notice that the above result is a UNION
of two result sets, returned by the queries like:
SELECT NULL, txn_date, SUM(txn_amount)
FROM transactions
GROUP BY txn_date
UNION
SELECT acc_id, NULL, SUM(txn_amount)
FROM transactions
GROUP BY acc_id;
IMPORTANT
Running individual queries with |
WITH CUBE
The CUBE
function is used with GROUP BY
to calculate all possible combinations of column groups specified in the GROUP BY
clause.
The following HiveQL constructs are identical:
... GROUP BY a, b, c WITH CUBE
-- is equivalent to:
... GROUP BY a, b, c GROUPING SETS (
(a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ( )
)
Consider the example below.
SELECT acc_id, txn_date, SUM(txn_amount) as total_sum
FROM transactions
GROUP BY acc_id, txn_date WITH CUBE;
The result includes records for all possible aggregation groups:
+---------+-------------+-------------+ | acc_id | txn_date | total_sum | +---------+-------------+-------------+ | NULL | NULL | 2868.50 | | NULL | 2023-10-26 | 207.50 | | NULL | 2023-10-27 | 2661.00 | | 101 | NULL | 962.50 | | 101 | 2023-10-26 | 112.50 | | 101 | 2023-10-27 | 850.00 | | 102 | NULL | 1005.50 | | 102 | 2023-10-26 | 95.00 | | 102 | 2023-10-27 | 910.50 | | 103 | NULL | 900.50 | | 103 | 2023-10-27 | 900.50 | +---------+-------------+-------------+
In this example, using WITH CUBE
is equivalent to the following query:
SELECT acc_id, txn_date, SUM(txn_amount) as total_sum
FROM transactions
GROUP BY acc_id, txn_date
GROUPING SETS ((acc_id, txn_date), acc_id, txn_date, ( ));
WITH ROLLUP
The ROLLUP
function is used with GROUP BY
to calculate multiple levels of subtotals across a specified group of dimensions.
The function creates subtotals which "roll up" from the most detailed level to the most generic one, following a grouping list specified in the GROUP BY
clause.
The following HiveQL constructs are identical:
... GROUP BY a, b, c, WITH ROLLUP
-- is equivalent to:
... GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (a), ( ))
Consider the example below.
SELECT acc_id, txn_date, SUM(txn_amount) as total_sum
FROM transactions
GROUP BY acc_id, txn_date WITH ROLLUP;
The result set includes records with different aggregation levels — starting from the most detailed groups (money paid by a specific account on a specific day) up to the most generic one (total money paid):
+---------+-------------+------------+ | acc_id | txn_date | total_sum | +---------+-------------+------------+ | NULL | NULL | 2868.50 | | 101 | NULL | 962.50 | | 101 | 2023-10-26 | 112.50 | | 101 | 2023-10-27 | 850.00 | | 102 | NULL | 1005.50 | | 102 | 2023-10-26 | 95.00 | | 102 | 2023-10-27 | 910.50 | | 103 | NULL | 900.50 | | 103 | 2023-10-27 | 900.50 | +---------+-------------+------------+
In this example, using ROLLUP
is equivalent to the following query:
SELECT acc_id, txn_date, SUM(txn_amount) as total_sum
FROM transactions
GROUP BY acc_id, txn_date
GROUPING SETS ((acc_id, txn_date), acc_id, ( ));