Aggregate functions

Overview

An aggregate function computes a single result from a set of input values. For example, you can calculate the sum, average, maximum, or minimum over a set of rows.

To obtain the full list of predefined aggregate functions with their descriptions, refer to the List of Aggregate Functions section of the ClickHouse documentation.

Create a table to run test queries

 
To reproduce examples provided below in this article and explore behavior of various aggregate functions, first create the users table:

CREATE TABLE users (user_id Int32, name String, age Int32, role String) ENGINE = MergeTree ORDER BY user_id;
INSERT INTO users VALUES
(1, 'john', 37, 'developer'),
(2, 'alex', 25, 'analyst'),
(3, 'mary', 30, 'manager'),
(4, 'david', 41, 'developer'),
(5, 'andrew', 41, 'analyst'),
(6, 'harry', 29, 'developer'),
(7, 'ann', 35, 'QA engineer'),
(8, 'jack', 27, 'designer');
┌─user_id─┬─name───┬─age─┬─role────────┐
│       1 │ john   │  37 │ developer   │
│       2 │ alex   │  25 │ analyst     │
│       3 │ mary   │  30 │ manager     │
│       4 │ david  │  41 │ developer   │
│       5 │ andrew │  41 │ analyst     │
│       6 │ harry  │  29 │ developer   │
│       7 │ ann    │  35 │ QA engineer │
│       8 │ jack   │  27 │ designer    │
└─────────┴────────┴─────┴─────────────┘

For example, the following aggregate function returns the age of the youngest user from the users table:

SELECT min(age) FROM users;

The output:

┌─min(age)─┐
│       25 │
└──────────┘

To utilize an aggregate function in the WHERE clause, use a subquery. The following example returns the name of the youngest user:

SELECT name FROM users WHERE age = (SELECT min(age) FROM users);

The output:

┌─name─┐
│ alex │
└──────┘

You can also combine an aggregate function with the GROUP BY clause. For example, the following query returns the number of users for each role:

SELECT role, count() AS number_of_users FROM users GROUP BY role;

The output:

┌─role────────┬─number_of_users─┐
│ manager     │               1 │
│ designer    │               1 │
│ analyst     │               2 │
│ developer   │               3 │
│ QA engineer │               1 │
└─────────────┴─────────────────┘

To filter the result set, use the HAVING clause. For example, the following query returns roles that are assigned to more than one user:

SELECT role, count() AS number_of_users FROM users GROUP BY role HAVING count()>1;

The output:

┌─role──────┬─number_of_users─┐
│ analyst   │               2 │
│ developer │               3 │
└───────────┴─────────────────┘
NOTE

There is a fundamental difference between the WHERE and HAVING clauses. WHERE selects input rows before evaluating groups and aggregate functions — it determines which rows are included in aggregate calculations. HAVING selects rows after groups and aggregate functions are calculated. The WHERE clause cannot contain aggregate functions, and the HAVING clause always contains aggregate functions. If you need to use additional conditions, add them to the WHERE clause, not to HAVING. This approach avoids aggregation calculations for rows that do not meet the additional conditions.

Combinators

To extend the capabilities of aggregate functions and address a wide range of tasks without the need to change data structures, ADQM supports various combinators. A combinator is a special suffix that is added to an aggregate function name and modifies that function’s behavior. You can use multiple combinators for one function simultaneously.

Add conditions to aggregations

An aggregate function with the If combinator accepts an additional argument — a condition of the UInt8 type that determines which rows the aggregate function should process.

Aggregate expression syntax is <aggr_func_name>If(<expr>, <if_condition>).

Example

The following query returns the number of users older than 30:

SELECT countIf(age>30) AS over_30 FROM users;

The output:

┌─over_30─┐
│       4 │
└─────────┘

Aggregate on unique values

An aggregate function with the Distinct combinator is invoked once for each unique value of an argument (or unique set of values, for multiple arguments).

Aggregate expression syntax is <aggr_func_name>Distinct(<expr>) or <aggr_func_name>(DISTINCT <expr>).

Example

The following query returns the number of unique roles in the users table:

SELECT count(DISTINCT role) AS unique_roles FROM users;

The output:

┌─unique_roles─┐
│            5 │
└──────────────┘

Manage aggregate values for empty inputs

You can use one of the following combinators to define a value that an aggregate function should return in the case when it does not have input data:

  • OrDefault — returns a default value of the aggregate function’s return type;

  • OrNull — converts the aggregate function’s return type to Nullable, and returns NULL if there is nothing to aggregate.

Example:

SELECT avg(number), avgOrDefault(number), avgOrNull(number) FROM numbers(0);

The output:

┌─avg(number)─┬─avgOrDefault(number)─┬─avgOrNull(number)─┐
│         nan │                    0 │              ᴺᵁᴸᴸ │
└─────────────┴──────────────────────┴───────────────────┘

Aggregate arrays

Array

Aggregate functions with the Array combinator accept arguments of the Array(T) type (arrays) instead of T-type arguments. When processing arrays, an aggregate function works like the original aggregate function over all array elements.

Aggregate expression syntax is <aggr_func_name>Array(<arr>), where <arr> is a set of arrays (for example, a column that contains arrays). If a function accepts multiple arguments, array lengths should be equal.

Example
  1. Create a table with one of the columns containing arrays:

    CREATE TABLE test_arrays (id Int32, array Array(UInt32)) ENGINE = MergeTree ORDER BY id;
    INSERT INTO test_arrays VALUES (1, [1, 2, 3]), (2, [2,3,7]), (3, [1,1,1]);
    ┌─id─┬─array───┐
    │  1 │ [1,2,3] │
    │  2 │ [2,3,7] │
    │  3 │ [1,1,1] │
    └────┴─────────┘
  2. Use the sumArray function to calculate the sum of elements of all arrays in a column of the Array(UInt32) type:

    SELECT sumArray(array) AS sum FROM test_arrays;
    ┌─sum─┐
    │  21 │
    └─────┘

ForEach

The ForEach combination converts an aggregate function for tables into an aggregate function for arrays that aggregates corresponding values from multiple arrays and returns an array of results.

Aggregate expression syntax is <aggr_func_name>ForEach(<arr>), where <arr> is a set of arrays (for example, a table column with arrays).

Example

 
Use the maxForEach function to get maximum values among the corresponding elements of arrays:

SELECT maxForEach(array) FROM test_arrays;

The aggregate function finds the maximum value among the first elements of arrays and writes it to the first position of the resulting array. Then it repeats the same for the second and third elements of arrays:

┌─maxForEach(array)─┐
│ [2,3,7]           │
└───────────────────┘

Aggregate values of the Map type

To aggregate values of the Map type, use the Map combinator for aggregate functions. An aggregate function with this combinator accepts a Map type as an argument, aggregates values of each key separately using the appropriate aggregate function, and returns the result of the Map type.

Example
  1. Create a table with one of the columns containing values of the Map type:

    CREATE TABLE test_map_table (a Int32, b Map(String, UInt64)) ENGINE = MergeTree ORDER BY a;
    INSERT INTO test_map_table VALUES (1, {'key1':1, 'key2':10}), (2, {'key1':2,'key2':20}), (3, {'key1':3,'key2':30});
    ┌─a─┬─b────────────────────┐
    │ 1 │ {'key1':1,'key2':10} │
    │ 2 │ {'key1':2,'key2':20} │
    │ 3 │ {'key1':3,'key2':30} │
    └───┴──────────────────────┘
  2. Use the Map combinator for the sum aggregate function to calculate the sum of values by the key2 key:

    select sumMap(b)['key2'] from test_map_table;

    The output:

    ┌─arrayElement(sumMap(b), 'key2')─┐
    │                              60 │
    └─────────────────────────────────┘

Work with aggregate states

ADQM allows working with intermediate aggregate states instead of resulting values which take up more space. Use the State combinator for aggregate functions to store intermediate states of aggregations as values of the AggregateFunction type. To get the final result of aggregation in the future, pass intermediate states as arguments to the same aggregate functions with combinators:

  • Merge — combines states to finish aggregation, and returns the resulting value;

  • MergeState — merges intermediate aggregation states, similar to the Merge combinator, but returns an intermediate aggregation state (not the resulting value), similar to the State combinator.

For some aggregate functions (for example, sum, max, min, any), you can use the SimpleState combinator. An aggregate function with this combinator returns the same value as the original function, but of the SimpleAggregateFunction type. SimpleAggregateFunction has better performance than AggregateFunction for the same aggregate function.

To store aggregation states and automatically merge them by primary key, you can use the AggregatingMergeTree table engine.

Split data into groups before aggregation

The Resample combinator allows you to divide data into groups (for example, split column values into intervals) and perform aggregation for each group separately.

Aggregate expression syntax is <aggr_func_name>Resample(<start>, <end>, <step>)(<aggr_func_params>, <resample_key>), where:

  • <resample_key> — column which values are used to split data into intervals;

  • <start> — start value of the whole interval for <resample_key> values;

  • <stop> — end value of the whole interval for <resample_key> values (the interval does not include the end value — [<start>, <stop>));

  • <step> — step for dividing the whole interval into subintervals (the aggr_func_name function is executed over each of subintervals independently);

  • <aggr_func_params> — aggregate function parameters.

An aggregate function with the Resample combinator returns an array of the aggr_func_name function results for each data group.

Example

 
The source users table:

┌─name───┬─age─┐
│ john   │  37 │
│ alex   │  25 │
│ mary   │  30 │
│ david  │  41 │
│ andrew │  41 │
│ harry  │  29 │
│ ann    │  35 │
│ jack   │  27 │
└────────┴─────┘

The following query returns names of users whose age is in the [20,35) and [35,50) intervals:

SELECT groupArrayResample(20, 50, 15)(name, age) from users;

where:

  • name — column with user names (an argument of the groupArray aggregate function);

  • age — column to aggregate user names by age.

The output:

┌─groupArrayResample(20, 50, 15)(name, age)────────────────────────┐
│ [['alex','mary','harry','jack'],['john','david','andrew','ann']] │
└──────────────────────────────────────────────────────────────────┘

Process NULL

Most aggregate functions ignore NULL input values. The exception is the count() function that returns the total number of rows in the selection.

Example

Create a table with a column of the Nullable type:

CREATE TABLE table_nulls (x Int8, y Nullable(Int8)) ENGINE = MergeTree ORDER BY x;
INSERT INTO table_nulls VALUES (1, NULL), (2, 1), (3, 2), (4, 2), (5, 3);
┌─x─┬────y─┐
│ 1 │ ᴺᵁᴸᴸ │
│ 2 │    1 │
│ 3 │    2 │
│ 4 │    2 │
│ 5 │    3 │
└───┴──────┘

The groupArray(y) function returns an array of the y column’s values, ignoring NULL:

SELECT groupArray(y) FROM table_nulls;
┌─groupArray(y)─┐
│ [1,2,2,3]     │
└───────────────┘

The count() function returns the total number of rows in the table:

SELECT count() FROM table_nulls;
┌─count()─┐
│       5 │
└─────────┘

The count(y) function returns the number of rows where y values are not NULL:

SELECT count(y) FROM table_nulls;
┌─count(y)─┐
│        4 │
└──────────┘

The count(DISTINCT y) function returns the number of unique rows with non-NULL y values:

SELECT count(DISTINCT y) FROM table_nulls;
┌─uniqExact(y)─┐
│            3 │
└──────────────┘

You can use the coalesce function to replace NULL values with some suitable value that will be taken into account when evaluating an aggregate function. For example, avg(coalesce(y, 0)) uses 0 instead of NULL values in the y column when calculating an average value for the column:

SELECT avg(y), avg(coalesce(y, 0)) FROM table_nulls;
┌─avg(y)─┬─avg(coalesce(y, 0))─┐
│      2 │                 1.6 │
└────────┴─────────────────────┘

To work around skipping NULL values, you can also use the tuple function, for example:

SELECT groupArray(y), groupArray(tuple(y)).1 FROM table_nulls;
┌─groupArray(y)─┬─tupleElement(groupArray(tuple(y)), 1)─┐
│ [1,2,2,3]     │ [NULL,1,2,2,3]                        │
└───────────────┴───────────────────────────────────────┘
Found a mistake? Seleсt text and press Ctrl+Enter to report it