Home
Arenadata Hadoop
How to
Hive
Data querying
Query types
Aggregate functions

Aggregate functions in Hive

Konstantin Alpashkin

Contents

Test database
Basic aggregation
- More Hive aggregation functions
Advanced aggregation

Hive aggregate functions are built-in functions that accept a set of values and return a single value. This section provides examples and tips on using basic and advanced Hive aggregation features. The full list of aggregate functions supported in Hive by default is available in Apache Hive documentation.

Test database

Throughout this article, the following Hive tables employees and transactions are used to demonstrate aggregation examples.

SELECT * FROM employees;
+---------------+-----------------+----------------------+----------------------------+----------------+-------------------+--------------------+
| employees.id  | employees.name  | employees.last_name  |      employees.email       | employees.age  | employees.salary  | employees.dept_id  |
+---------------+-----------------+----------------------+----------------------------+----------------+-------------------+--------------------+
| 1             | Ivan            | Ivanov               | ivan_ivanov123@mail.ru     | 30             | 1000              | 1                  |
| 2             | Sarah           | Connor               | sarah_connor123@yahoo.com  | 35             | 1500              | 2                  |
| 3             | Rick            | Sanchez              | rick_123@mail.ru           | 29             | 1300              | 2                  |
| 4             | John            | Smith                | john_smith123@gmail.com    | 29             | 2000              | 3                  |
+---------------+-----------------+----------------------+----------------------------+----------------+-------------------+--------------------+


SELECT * FROM transactions;
+----------------------+----------------------+--------------------------+------------------------+
| transactions.txn_id  | transactions.acc_id  | transactions.txn_amount  | transactions.txn_date  |
+----------------------+----------------------+--------------------------+------------------------+
| 1                    | 101                  | 10.20                    | 2023-10-26             |
| 2                    | 101                  | 102.30                   | 2023-10-26             |
| 3                    | 102                  | 95.00                    | 2023-10-26             |
| 4                    | 103                  | 900.50                   | 2023-10-27             |
| 5                    | 102                  | 910.50                   | 2023-10-27             |
| 6                    | 101                  | 850.00                   | 2023-10-27             |
+----------------------+----------------------+--------------------------+------------------------+

SQL for creating test tables

The following SQL can be used to create and populate test Hive tables using /bin/beeline.

CREATE TABLE IF NOT EXISTS employees (id int, name string, last_name string, email string, age int, salary int, dept_id int)
COMMENT 'Employee Table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

INSERT INTO employees VALUES
(1, 'Ivan', 'Ivanov', 'ivan_ivanov123@mail.ru', 30, 1000, 1),
(2, 'Sarah', 'Connor', 'sarah_connor123@yahoo.com', 35, 1500, 2),
(3, 'Rick', 'Sanchez', 'rick_123@mail.ru', 29, 1300, 2),
(4, 'John', 'Smith', 'john_smith123@gmail.com', 29, 2000, 3);


CREATE TABLE IF NOT EXISTS transactions(txn_id int, acc_id int, txn_amount decimal(10,2), txn_date date)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe';

INSERT INTO transactions VALUES
(1, 101, 10.20, '2023-10-26'),
(2, 101, 102.30, '2023-10-26'),
(3, 102, 95.00, '2023-10-26'),
(4, 103, 900.50, '2023-10-27'),
(5, 102, 910.50, '2023-10-27'),
(6, 101, 850.00, '2023-10-27');

Basic aggregation

The COUNT(col) function returns the total number of rows in the result set.

SELECT COUNT(id) AS emp_total_cnt FROM employees;

The result:

+----------------+
| emp_total_cnt  |
+----------------+
| 4              |
+----------------+

TIP

Using the DISTINCT keyword, you can perform aggregation only on unique column values.

The MIN() and MAX() functions return the minimum/maximum value of a numeric column.

SELECT MIN(salary) from employees;

The output:

+-------+
|  _c0  |
+-------+
| 1000  |
+-------+

The AVG() function returns an average value of a numeric column. Currently, aggregate functions are not supported inside the WHERE clause. For example, the following query will produce an error.

SELECT e.salary, e.age
FROM employees e
WHERE e.age > AVG(e.age);

The error message:

Error: Error while compiling statement: FAILED: SemanticException [Error 10128]: Line 3:14 Not yet supported place for UDAF 'avg' (state=42000,code=10128)

As a workaround for this case, you can nest AVG() inside a subquery as shown below.

SELECT e.age, e.last_name
FROM employees e
WHERE e.age > (SELECT AVG(age) FROM employees);

The successful result:

+--------+--------------+
| e.age  | e.last_name  |
+--------+--------------+
| 35     | Connor       |
+--------+--------------+

More Hive aggregation functions

Apart from standard SQL aggregation functions like SUM, AVG, etc., Hive provides other functions that may come in handy for analyzing your datasets. For example, using collect_*(col), you can collapse retrieved column values to a list or a set. Consider the example below.

SELECT collect_set(age) FROM employees; (1)
SELECT d.dep_name AS department_name, collect_list(e.age) AS ages (2)
FROM employees e
    JOIN departments d
    ON e.dept_id=d.id
GROUP BY d.dep_name;

1	Returns a single row with a set of unique values.
2	For each department, returns a list of values.

The result:

+-------------+
|     _c0     |
+-------------+
| [30,35,29]  |
+-------------+

+------------------+----------+
| department_name  |   ages   |
+------------------+----------+
| it               | [35,29]  |
| sales            | [30]     |
| support          | [29]     |
+------------------+----------+

Hive also provides special functions for calculating mathematical entities like deviation, variance, and many others. A few examples below.

SELECT variance(salary) FROM employees; (1)
SELECT var_pop(salary) FROM employees; (1)
SELECT var_samp(salary) FROM employees; (2)
SELECT stddev_pop(salary) FROM employees; (3)
SELECT stddev_samp(salary) FROM employees; (4)

1	The queries return the statistical variance of the numeric `salary` column.
2	Returns the unbiased sample variance for the column.
3	Returns the statistical standard deviation of all values for the column.
4	Returns the unbiased sample standard deviation of the numeric column.

The result:

SELECT variance(salary) FROM employees;
+-----------+
|    _c0    |
+-----------+
| 132500.0  |
+-----------+

SELECT var_pop(salary) FROM employees;
+-----------+
|    _c0    |
+-----------+
| 132500.0  |
+-----------+

SELECT var_samp(salary) FROM employees;
+---------------------+
|         _c0         |
+---------------------+
| 176666.66666666666  |
+---------------------+

SELECT stddev_pop(salary) FROM employees;
+--------------------+
|        _c0         |
+--------------------+
| 364.0054944640259  |
+--------------------+

SELECT stddev_samp(salary) FROM employees;
+---------------------+
|         _c0         |
+---------------------+
| 420.31734043061635  |
+---------------------+

The full list of Hive aggregate functions is available in Apache Hive documentation.

Advanced aggregation

GROUPING SETS

The GROUPING SETS clause used with GROUP BY allows you to apply several GROUP BY criteria for the same record set. Using several grouping sets (e.g. GROUPING SETS (a,b)) is equivalent to running multiple GROUP BY a, GROUP BY b queries united by UNION. However, fetching results via grouping sets is done in one table scan as opposed to running several queries with different GROUP BY clauses.

The following example specifies a single grouping set (GROUPING SETS (acc_id)), thus, aggregating the results by account ID and not by transactions date.

SELECT acc_id, txn_date, SUM(txn_amount) as total_paid
FROM transactions
GROUP BY acc_id, txn_date
GROUPING SETS (acc_id);

The result of such query is the total amount paid by each account. NULL values in the txn_date column indicate that no aggregation by date was applied.

+---------+-----------+-------------+
| acc_id  | txn_date  | total_paid  |
+---------+-----------+-------------+
| 101     | NULL      | 962.50      |
| 102     | NULL      | 1005.50     |
| 103     | NULL      | 900.50      |
+---------+-----------+-------------+

The following example specifies multiple grouping sets (GROUPING SETS (acc_id, txn_date)).

SELECT acc_id, txn_date, SUM(txn_amount) as total_paid, grouping__id
FROM transactions
GROUP BY acc_id, txn_date
GROUPING SETS (acc_id, txn_date);

In this case, the result set consists of two subsets grouped by txn_date and acc_id:

+---------+-------------+-------------+
| acc_id  |  txn_date   | total_paid  |
+---------+-------------+-------------+
| NULL    | 2023-10-26  | 207.50      |
| NULL    | 2023-10-27  | 2661.00     |
| 101     | NULL        | 962.50      |
| 102     | NULL        | 1005.50     |
| 103     | NULL        | 900.50      |
+---------+-------------+-------------+

NULL values indicate that no aggregation was performed on the given column. Notice that the above result is a UNION of two result sets, returned by the queries like:

SELECT NULL, txn_date, SUM(txn_amount)
FROM transactions
GROUP BY txn_date
UNION
SELECT acc_id, NULL, SUM(txn_amount)
FROM transactions
GROUP BY acc_id;

IMPORTANT

Running individual queries with UNION will launch 2 jobs instead of 1 and can potentially take longer to execute.

WITH CUBE

The CUBE function is used with GROUP BY to calculate all possible combinations of column groups specified in the GROUP BY clause. The following HiveQL constructs are identical:

... GROUP BY a, b, c WITH CUBE
-- is equivalent to:
... GROUP BY a, b, c GROUPING SETS (
    (a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ( )
)

Consider the example below.

SELECT acc_id, txn_date, SUM(txn_amount) as total_sum
FROM transactions
GROUP BY acc_id, txn_date WITH CUBE;

The result includes records for all possible aggregation groups:

+---------+-------------+-------------+
| acc_id  |  txn_date   | total_sum   |
+---------+-------------+-------------+
| NULL    | NULL        | 2868.50     |
| NULL    | 2023-10-26  | 207.50      |
| NULL    | 2023-10-27  | 2661.00     |
| 101     | NULL        | 962.50      |
| 101     | 2023-10-26  | 112.50      |
| 101     | 2023-10-27  | 850.00      |
| 102     | NULL        | 1005.50     |
| 102     | 2023-10-26  | 95.00       |
| 102     | 2023-10-27  | 910.50      |
| 103     | NULL        | 900.50      |
| 103     | 2023-10-27  | 900.50      |
+---------+-------------+-------------+

In this example, using WITH CUBE is equivalent to the following query:

SELECT acc_id, txn_date, SUM(txn_amount) as total_sum
FROM transactions
GROUP BY acc_id, txn_date
GROUPING SETS ((acc_id, txn_date), acc_id, txn_date, ( ));

WITH ROLLUP

The ROLLUP function is used with GROUP BY to calculate multiple levels of subtotals across a specified group of dimensions. The function creates subtotals which "roll up" from the most detailed level to the most generic one, following a grouping list specified in the GROUP BY clause. The following HiveQL constructs are identical:

... GROUP BY a, b, c, WITH ROLLUP
-- is equivalent to:
... GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (a), ( ))

Consider the example below.

SELECT acc_id, txn_date, SUM(txn_amount) as total_sum
FROM transactions
GROUP BY acc_id, txn_date WITH ROLLUP;

The result set includes records with different aggregation levels — starting from the most detailed groups (money paid by a specific account on a specific day) up to the most generic one (total money paid):

+---------+-------------+------------+
| acc_id  |  txn_date   | total_sum  |
+---------+-------------+------------+
| NULL    | NULL        | 2868.50    |
| 101     | NULL        | 962.50     |
| 101     | 2023-10-26  | 112.50     |
| 101     | 2023-10-27  | 850.00     |
| 102     | NULL        | 1005.50    |
| 102     | 2023-10-26  | 95.00      |
| 102     | 2023-10-27  | 910.50     |
| 103     | NULL        | 900.50     |
| 103     | 2023-10-27  | 900.50     |
+---------+-------------+------------+

In this example, using ROLLUP is equivalent to the following query:

SELECT acc_id, txn_date, SUM(txn_amount) as total_sum
FROM transactions
GROUP BY acc_id, txn_date
GROUPING SETS ((acc_id, txn_date), acc_id, ( ));

Found a mistake? Seleсt text and press Ctrl+Enter to report it