Aggregate functions
Overview
An aggregate function computes a single result from a set of input values. For example, you can calculate the sum, average, maximum, or minimum over a set of rows.
To obtain the full list of predefined aggregate functions with their descriptions, refer to the List of Aggregate Functions section of the ClickHouse documentation.
To reproduce examples provided below in this article and explore behavior of various aggregate functions, first create the users
table:
CREATE TABLE users (user_id Int32, name String, age Int32, role String) ENGINE = MergeTree ORDER BY user_id;
INSERT INTO users VALUES
(1, 'john', 37, 'developer'),
(2, 'alex', 25, 'analyst'),
(3, 'mary', 30, 'manager'),
(4, 'david', 41, 'developer'),
(5, 'andrew', 41, 'analyst'),
(6, 'harry', 29, 'developer'),
(7, 'ann', 35, 'QA engineer'),
(8, 'jack', 27, 'designer');
┌─user_id─┬─name───┬─age─┬─role────────┐ │ 1 │ john │ 37 │ developer │ │ 2 │ alex │ 25 │ analyst │ │ 3 │ mary │ 30 │ manager │ │ 4 │ david │ 41 │ developer │ │ 5 │ andrew │ 41 │ analyst │ │ 6 │ harry │ 29 │ developer │ │ 7 │ ann │ 35 │ QA engineer │ │ 8 │ jack │ 27 │ designer │ └─────────┴────────┴─────┴─────────────┘
For example, the following aggregate function returns the age of the youngest user from the users
table:
SELECT min(age) FROM users;
The output:
┌─min(age)─┐ │ 25 │ └──────────┘
To utilize an aggregate function in the WHERE
clause, use a subquery. The following example returns the name of the youngest user:
SELECT name FROM users WHERE age = (SELECT min(age) FROM users);
The output:
┌─name─┐ │ alex │ └──────┘
You can also combine an aggregate function with the GROUP BY
clause. For example, the following query returns the number of users for each role:
SELECT role, count() AS number_of_users FROM users GROUP BY role;
The output:
┌─role────────┬─number_of_users─┐ │ manager │ 1 │ │ designer │ 1 │ │ analyst │ 2 │ │ developer │ 3 │ │ QA engineer │ 1 │ └─────────────┴─────────────────┘
To filter the result set, use the HAVING
clause. For example, the following query returns roles that are assigned to more than one user:
SELECT role, count() AS number_of_users FROM users GROUP BY role HAVING count()>1;
The output:
┌─role──────┬─number_of_users─┐ │ analyst │ 2 │ │ developer │ 3 │ └───────────┴─────────────────┘
NOTE
There is a fundamental difference between the |
Combinators
To extend the capabilities of aggregate functions and address a wide range of tasks without the need to change data structures, ADQM supports various combinators. A combinator is a special suffix that is added to an aggregate function name and modifies that function’s behavior. You can use multiple combinators for one function simultaneously.
Add conditions to aggregations
An aggregate function with the If
combinator accepts an additional argument — a condition of the UInt8 type that determines which rows the aggregate function should process.
Aggregate expression syntax is <aggr_func_name>If(<expr>, <if_condition>)
.
Example
The following query returns the number of users older than 30:
SELECT countIf(age>30) AS over_30 FROM users;
The output:
┌─over_30─┐ │ 4 │ └─────────┘
Aggregate on unique values
An aggregate function with the Distinct
combinator is invoked once for each unique value of an argument (or unique set of values, for multiple arguments).
Aggregate expression syntax is <aggr_func_name>Distinct(<expr>)
or <aggr_func_name>(DISTINCT <expr>)
.
Example
The following query returns the number of unique roles in the users
table:
SELECT count(DISTINCT role) AS unique_roles FROM users;
The output:
┌─unique_roles─┐ │ 5 │ └──────────────┘
Manage aggregate values for empty inputs
You can use one of the following combinators to define a value that an aggregate function should return in the case when it does not have input data:
-
OrDefault
— returns a default value of the aggregate function’s return type; -
OrNull
— converts the aggregate function’s return type to Nullable, and returns NULL if there is nothing to aggregate.
Example:
SELECT avg(number), avgOrDefault(number), avgOrNull(number) FROM numbers(0);
The output:
┌─avg(number)─┬─avgOrDefault(number)─┬─avgOrNull(number)─┐ │ nan │ 0 │ ᴺᵁᴸᴸ │ └─────────────┴──────────────────────┴───────────────────┘
Aggregate arrays
Array
Aggregate functions with the Array
combinator accept arguments of the Array(T) type (arrays) instead of T-type arguments. When processing arrays, an aggregate function works like the original aggregate function over all array elements.
Aggregate expression syntax is <aggr_func_name>Array(<arr>)
, where <arr>
is a set of arrays (for example, a column that contains arrays). If a function accepts multiple arguments, array lengths should be equal.
-
Create a table with one of the columns containing arrays:
CREATE TABLE test_arrays (id Int32, array Array(UInt32)) ENGINE = MergeTree ORDER BY id;
INSERT INTO test_arrays VALUES (1, [1, 2, 3]), (2, [2,3,7]), (3, [1,1,1]);
┌─id─┬─array───┐ │ 1 │ [1,2,3] │ │ 2 │ [2,3,7] │ │ 3 │ [1,1,1] │ └────┴─────────┘
-
Use the
sumArray
function to calculate the sum of elements of all arrays in a column of the Array(UInt32) type:SELECT sumArray(array) AS sum FROM test_arrays;
┌─sum─┐ │ 21 │ └─────┘
ForEach
The ForEach
combination converts an aggregate function for tables into an aggregate function for arrays that aggregates corresponding values from multiple arrays and returns an array of results.
Aggregate expression syntax is <aggr_func_name>ForEach(<arr>)
, where <arr>
is a set of arrays (for example, a table column with arrays).
Use the maxForEach
function to get maximum values among the corresponding elements of arrays:
SELECT maxForEach(array) FROM test_arrays;
The aggregate function finds the maximum value among the first elements of arrays and writes it to the first position of the resulting array. Then it repeats the same for the second and third elements of arrays:
┌─maxForEach(array)─┐ │ [2,3,7] │ └───────────────────┘
Aggregate values of the Map type
To aggregate values of the Map type, use the Map
combinator for aggregate functions. An aggregate function with this combinator accepts a Map type as an argument, aggregates values of each key separately using the appropriate aggregate function, and returns the result of the Map type.
-
Create a table with one of the columns containing values of the Map type:
CREATE TABLE test_map_table (a Int32, b Map(String, UInt64)) ENGINE = MergeTree ORDER BY a;
INSERT INTO test_map_table VALUES (1, {'key1':1, 'key2':10}), (2, {'key1':2,'key2':20}), (3, {'key1':3,'key2':30});
┌─a─┬─b────────────────────┐ │ 1 │ {'key1':1,'key2':10} │ │ 2 │ {'key1':2,'key2':20} │ │ 3 │ {'key1':3,'key2':30} │ └───┴──────────────────────┘
-
Use the
Map
combinator for thesum
aggregate function to calculate the sum of values by thekey2
key:select sumMap(b)['key2'] from test_map_table;
The output:
┌─arrayElement(sumMap(b), 'key2')─┐ │ 60 │ └─────────────────────────────────┘
Work with aggregate states
ADQM allows working with intermediate aggregate states instead of resulting values which take up more space. Use the State
combinator for aggregate functions to store intermediate states of aggregations as values of the AggregateFunction type. To get the final result of aggregation in the future, pass intermediate states as arguments to the same aggregate functions with combinators:
-
Merge
— combines states to finish aggregation, and returns the resulting value; -
MergeState
— merges intermediate aggregation states, similar to theMerge
combinator, but returns an intermediate aggregation state (not the resulting value), similar to theState
combinator.
For some aggregate functions (for example, sum
, max
, min
, any
), you can use the SimpleState
combinator. An aggregate function with this combinator returns the same value as the original function, but of the SimpleAggregateFunction type. SimpleAggregateFunction
has better performance than AggregateFunction
for the same aggregate function.
To store aggregation states and automatically merge them by primary key, you can use the AggregatingMergeTree table engine.
Split data into groups before aggregation
The Resample
combinator allows you to divide data into groups (for example, split column values into intervals) and perform aggregation for each group separately.
Aggregate expression syntax is <aggr_func_name>Resample(<start>, <end>, <step>)(<aggr_func_params>, <resample_key>)
, where:
-
<resample_key>
— column which values are used to split data into intervals; -
<start>
— start value of the whole interval for<resample_key>
values; -
<stop>
— end value of the whole interval for<resample_key>
values (the interval does not include the end value —[<start>, <stop>)
); -
<step>
— step for dividing the whole interval into subintervals (theaggr_func_name
function is executed over each of subintervals independently); -
<aggr_func_params>
— aggregate function parameters.
An aggregate function with the Resample
combinator returns an array of the aggr_func_name
function results for each data group.
The source users
table:
┌─name───┬─age─┐ │ john │ 37 │ │ alex │ 25 │ │ mary │ 30 │ │ david │ 41 │ │ andrew │ 41 │ │ harry │ 29 │ │ ann │ 35 │ │ jack │ 27 │ └────────┴─────┘
The following query returns names of users whose age is in the [20,35)
and [35,50)
intervals:
SELECT groupArrayResample(20, 50, 15)(name, age) from users;
where:
-
name
— column with user names (an argument of the groupArray aggregate function); -
age
— column to aggregate user names by age.
The output:
┌─groupArrayResample(20, 50, 15)(name, age)────────────────────────┐ │ [['alex','mary','harry','jack'],['john','david','andrew','ann']] │ └──────────────────────────────────────────────────────────────────┘
Process NULL
Most aggregate functions ignore NULL input values. The exception is the count()
function that returns the total number of rows in the selection.
Example
Create a table with a column of the Nullable type:
CREATE TABLE table_nulls (x Int8, y Nullable(Int8)) ENGINE = MergeTree ORDER BY x;
INSERT INTO table_nulls VALUES (1, NULL), (2, 1), (3, 2), (4, 2), (5, 3);
┌─x─┬────y─┐ │ 1 │ ᴺᵁᴸᴸ │ │ 2 │ 1 │ │ 3 │ 2 │ │ 4 │ 2 │ │ 5 │ 3 │ └───┴──────┘
The groupArray(y)
function returns an array of the y
column’s values, ignoring NULL:
SELECT groupArray(y) FROM table_nulls;
┌─groupArray(y)─┐ │ [1,2,2,3] │ └───────────────┘
The count()
function returns the total number of rows in the table:
SELECT count() FROM table_nulls;
┌─count()─┐ │ 5 │ └─────────┘
The count(y)
function returns the number of rows where y
values are not NULL:
SELECT count(y) FROM table_nulls;
┌─count(y)─┐ │ 4 │ └──────────┘
The count(DISTINCT y)
function returns the number of unique rows with non-NULL y
values:
SELECT count(DISTINCT y) FROM table_nulls;
┌─uniqExact(y)─┐ │ 3 │ └──────────────┘
You can use the coalesce
function to replace NULL values with some suitable value that will be taken into account when evaluating an aggregate function. For example, avg(coalesce(y, 0))
uses 0
instead of NULL values in the y
column when calculating an average value for the column:
SELECT avg(y), avg(coalesce(y, 0)) FROM table_nulls;
┌─avg(y)─┬─avg(coalesce(y, 0))─┐ │ 2 │ 1.6 │ └────────┴─────────────────────┘
To work around skipping NULL values, you can also use the tuple
function, for example:
SELECT groupArray(y), groupArray(tuple(y)).1 FROM table_nulls;
┌─groupArray(y)─┬─tupleElement(groupArray(tuple(y)), 1)─┐ │ [1,2,2,3] │ [NULL,1,2,2,3] │ └───────────────┴───────────────────────────────────────┘