Denormalize data
Overview
Denormalization is a technique for optimizing analytical queries that involves transforming a data structure by combining fact and dimension tables into a single wide ("flat") table to reduce the need for complex joins (JOIN operations) in queries.
Denormalization shifts JOIN-related work from query execution time to data insertion or pre-processing time, which can significantly speed up read operations. At the same time, it leads to redundancy and duplication of data, which increases storage costs and complicates write operations. Thus, data denormalization provides a number of advantages, but comes with some limitations. To make a decision on the appropriateness of using this approach, it is necessary to analyze the specifics of your database usage and take into account the requirements for its maintenance.
-
When denormalization can be an appropriate strategy:
-
It is necessary to frequently repeat analytical queries to logically related data in real time, the speed of data reading is critically important.
-
Queries should use as few joins as possible (for example, when you work with large datasets).
-
Data is relatively stable: it does not change, changes rarely, or is only appended to the end of a set.
-
Storage costs are less important than high query performance.
-
Some delay in data availability is acceptable for analytical queries (data can be periodically completely reloaded in batches).
-
-
When to avoid denormalization:
-
Data needs to be updated or deleted frequently. Updating a denormalized table may require rewriting many rows, which is an expensive and slow process, while in normalized tables, you can partially update specific fields without affecting unrelated data.
-
There are complex relationships between data, such as many-to-many — in this case, changing a single source row can also result in the need to update a large number of rows. Such data is easier to maintain with a normalized model.
-
There are high-cardinality relationships. If a row in a table has thousands of related records in another table, denormalization will require representing them as an array of simple type values or tuples. Using arrays with a large number of elements (on the order of 1000 or more) can negatively impact performance.
-
-
Potential problems that denormalization can lead to:
-
Storage space can increase rapidly due to storing duplicate data (although storage efficiency may not be so problematic thanks to ClickHouse’s powerful compression algorithms).
-
Inserting and updating data becomes complex. As mentioned above, when data is denormalized, any change to a single entity (such as a product name or user information) requires rewriting the entire denormalized dataset.
-
Maintaining data integrity can be a challenge — duplicate data in rows of a denormalized table increases the risk of inconsistency (especially when changes are required).
-
If denormalization is generally appropriate for a specific case, there is usually no need to denormalize all the data — it is enough to include only a few columns that are accessed most frequently. Denormalization often involves adding one or two columns (for example, some statistics) to the main table for analytics. In case of complex objects or one-to-many relationships, you can use complex types for denormalization: Tuple, Array(Tuple), or Nested.
If denormalization is not a suitable way to optimize a database, instead of flattening all your data into one structure, you can use other built-in ADQM/ClickHouse tools that allow you to build an understandable data model while still ensuring high query performance: dictionaries, materialized views, projections.
Example
The example below shows how to minimize the use of JOIN operations in queries by denormalizing data.
Normalized structure
Sample data structure
The initial normalized data model includes four separate tables:
-
The
orderstable (a fact table):CREATE TABLE orders ( order_id UInt64, customer_id UInt64, order_date Date ) ENGINE = MergeTree() ORDER BY (order_date, order_id);Table dataINSERT INTO orders VALUES (1001, 3, '2026-04-07'), (1002, 2, '2026-04-07'), (1003, 1, '2026-04-07'), (1004, 4, '2026-04-08'), (1005, 1, '2026-04-08');┌─order_id─┬─customer_id─┬─order_date─┐ 1. │ 1001 │ 3 │ 2026-04-07 │ 2. │ 1002 │ 2 │ 2026-04-07 │ 3. │ 1003 │ 1 │ 2026-04-07 │ 4. │ 1004 │ 4 │ 2026-04-08 │ 5. │ 1005 │ 1 │ 2026-04-08 │ └──────────┴─────────────┴────────────┘
-
The
productstable (a dimension table):CREATE TABLE products ( product_id UInt64, product_name String, category String, price_per_unit Decimal(10,2) ) ENGINE = MergeTree() ORDER BY product_id;Table dataINSERT INTO products VALUES (1, 'Laptop', 'Electronics', 999.99), (2, 'Smartphone', 'Electronics', 699.99), (3, 'Headphones', 'Electronics', 49.99), (4, 'Desk Chair', 'Furniture', 149.99), (5, 'Coffee Table', 'Furniture', 249.99), (6, 'Notebook', 'Stationery', 1.99), (7, 'Pen', 'Stationery', 0.49), (8, 'Backpack', 'Accessories', 59.99), (9, 'Monitor', 'Electronics', 199.99), (10, 'Office Chair', 'Furniture', 169.99), (11, 'Keyboard', 'Electronics', 49.99), (12, 'Printer', 'Electronics', 299.99), (13, 'Cap', 'Accessories', 5.99), (14, 'Binder', 'Stationery', 2.99), (15, 'Water Bottle', 'Accessories', 9.99);┌─product_id─┬─product_name─┬─category────┬─price_per_unit─┐ 1. │ 1 │ Laptop │ Electronics │ 999.99 │ 2. │ 2 │ Smartphone │ Electronics │ 699.99 │ 3. │ 3 │ Headphones │ Electronics │ 49.99 │ 4. │ 4 │ Desk Chair │ Furniture │ 149.99 │ 5. │ 5 │ Coffee Table │ Furniture │ 249.99 │ 6. │ 6 │ Notebook │ Stationery │ 1.99 │ 7. │ 7 │ Pen │ Stationery │ 0.49 │ 8. │ 8 │ Backpack │ Accessories │ 59.99 │ 9. │ 9 │ Monitor │ Electronics │ 199.99 │ 10. │ 10 │ Office Chair │ Furniture │ 169.99 │ 11. │ 11 │ Keyboard │ Electronics │ 49.99 │ 12. │ 12 │ Printer │ Electronics │ 299.99 │ 13. │ 13 │ Cap │ Accessories │ 5.99 │ 14. │ 14 │ Binder │ Stationery │ 2.99 │ 15. │ 15 │ Water Bottle │ Accessories │ 9.99 │ └────────────┴──────────────┴─────────────┴────────────────┘ -
The
customerstable (a dimension table):CREATE TABLE customers ( customer_id UInt64, full_name String, city String, email String ) ENGINE = MergeTree() ORDER BY customer_id;Table dataINSERT INTO customers VALUES (1, 'James Wilson', 'New York', 'jwilson@example.com'), (2, 'Mary Burton', 'Chicago', 'mburton@example.com'), (3, 'Patricia Robinson', 'Los Angeles', 'probinson@example.com'), (4, 'Robert Gray', 'Seattle', 'rgray@example.com'), (5, 'Elizabeth Tucker', 'Chicago', 'etucker@example.com'), (6, 'Joseph Lewis', 'New York', 'jlewis@example.com'), (7, 'Paul Brown', 'Chicago', 'pbrown@example.com'), (8, 'Andrew Clarke', 'Los Angeles', 'aclarke@example.com'), (9, 'William Ferguson', 'Chicago', 'wferguson@example.com'), (10, 'Linda Black', 'New York', 'lblack@example.com'), (11, 'David Green', 'Chicago', 'dgreen@example.com'), (12, 'Daniel Gray', 'New York', 'dgray@example.com'), (13, 'Mark Armstrong', 'New York', 'marmstrong@example.com'), (14, 'Donald Carter', 'Seattle', 'dcarter@example.com'), (15, 'Elizabeth Collins', 'Chicago', 'ecollins@example.com');┌─customer_id─┬─full_name─────────┬─city────────┬─email──────────────────┐ 1. │ 1 │ James Wilson │ New York │ jwilson@example.com │ 2. │ 2 │ Mary Burton │ Chicago │ mburton@example.com │ 3. │ 3 │ Patricia Robinson │ Los Angeles │ probinson@example.com │ 4. │ 4 │ Robert Gray │ Seattle │ rgray@example.com │ 5. │ 5 │ Elizabeth Tucker │ Chicago │ etucker@example.com │ 6. │ 6 │ Joseph Lewis │ New York │ jlewis@example.com │ 7. │ 7 │ Paul Brown │ Chicago │ pbrown@example.com │ 8. │ 8 │ Andrew Clarke │ Los Angeles │ aclarke@example.com │ 9. │ 9 │ William Ferguson │ Chicago │ wferguson@example.com │ 10. │ 10 │ Linda Black │ New York │ lblack@example.com │ 11. │ 11 │ David Green │ Chicago │ dgreen@example.com │ 12. │ 12 │ Daniel Gray │ New York │ dgray@example.com │ 13. │ 13 │ Mark Armstrong │ New York │ marmstrong@example.com │ 14. │ 14 │ Donald Carter │ Seattle │ dcarter@example.com │ 15. │ 15 │ Elizabeth Collins │ Chicago │ ecollins@example.com │ └─────────────┴───────────────────┴─────────────┴────────────────────────┘ -
The
order_itemstable with order details (many-to-many relationships between orders and products):CREATE TABLE order_items ( order_id UInt64, product_id UInt64, quantity UInt32 ) ENGINE = MergeTree() ORDER BY (order_id, product_id);Table dataINSERT INTO order_items VALUES (1001, 1, 1), (1001, 3, 2), (1001, 7, 6), (1002, 9, 2), (1002, 10, 2), (1003, 15, 3), (1003, 8, 2), (1003, 12, 1), (1003, 9, 1), (1004, 4, 1), (1004, 13, 2), (1005, 6, 10), (1005, 9, 2), (1005, 7, 5);┌─order_id─┬─product_id─┬─quantity─┐ 1. │ 1001 │ 1 │ 1 │ 2. │ 1001 │ 3 │ 2 │ 3. │ 1001 │ 7 │ 6 │ 4. │ 1002 │ 9 │ 2 │ 5. │ 1002 │ 10 │ 2 │ 6. │ 1003 │ 8 │ 2 │ 7. │ 1003 │ 9 │ 1 │ 8. │ 1003 │ 12 │ 1 │ 9. │ 1003 │ 15 │ 3 │ 10. │ 1004 │ 4 │ 1 │ 11. │ 1004 │ 13 │ 2 │ 12. │ 1005 │ 6 │ 10 │ 13. │ 1005 │ 7 │ 5 │ 14. │ 1005 │ 9 │ 2 │ └──────────┴────────────┴──────────┘
Sample queries
With such a data structure, it may often be necessary to use multiple joins in queries, for example:
-
Find all sales of products not from the
Electronicscategory later than a specified date:SELECT o.order_id AS order_id, c.full_name AS customer, c.city, o.order_date, p.product_name AS product, p.category, oi.quantity AS units_sold, p.price_per_unit * oi.quantity AS sale_amount FROM orders o JOIN customers c ON o.customer_id = c.customer_id JOIN order_items oi ON o.order_id = oi.order_id JOIN products p ON oi.product_id = p.product_id WHERE p.category != 'Electronics' AND o.order_date >= '2026-04-07';┌─order_id─┬─customer──────────┬─city────────┬─order_date─┬─product──────┬─category────┬─units_sold─┬─sale_amount─┐ 1. │ 1001 │ Patricia Robinson │ Los Angeles │ 2026-04-07 │ Pen │ Stationery │ 6 │ 2.94 │ 2. │ 1002 │ Mary Burton │ Chicago │ 2026-04-07 │ Office Chair │ Furniture │ 2 │ 339.98 │ 3. │ 1003 │ James Wilson │ New York │ 2026-04-07 │ Backpack │ Accessories │ 2 │ 119.98 │ 4. │ 1003 │ James Wilson │ New York │ 2026-04-07 │ Water Bottle │ Accessories │ 3 │ 29.97 │ 5. │ 1004 │ Robert Gray │ Seattle │ 2026-04-08 │ Desk Chair │ Furniture │ 1 │ 149.99 │ 6. │ 1004 │ Robert Gray │ Seattle │ 2026-04-08 │ Cap │ Accessories │ 2 │ 11.98 │ 7. │ 1005 │ James Wilson │ New York │ 2026-04-08 │ Notebook │ Stationery │ 10 │ 19.9 │ 8. │ 1005 │ James Wilson │ New York │ 2026-04-08 │ Pen │ Stationery │ 5 │ 2.45 │ └──────────┴───────────────────┴─────────────┴────────────┴──────────────┴─────────────┴────────────┴─────────────┘
-
Calculate the amount of each order:
SELECT o.order_id AS order_id, o.order_date, c.full_name AS customer, round(SUM(oi.quantity * p.price_per_unit), 2) AS total_amount FROM orders o JOIN customers c ON o.customer_id = c.customer_id JOIN order_items oi ON o.order_id = oi.order_id JOIN products p ON oi.product_id = p.product_id GROUP BY o.order_id, c.full_name, o.order_date ORDER BY o.order_id;┌─order_id─┬─order_date─┬─customer──────────┬─total_amount─┐ 1. │ 1001 │ 2026-04-07 │ Patricia Robinson │ 1102.91 │ 2. │ 1002 │ 2026-04-07 │ Mary Burton │ 739.96 │ 3. │ 1003 │ 2026-04-07 │ James Wilson │ 649.93 │ 4. │ 1004 │ 2026-04-08 │ Robert Gray │ 161.97 │ 5. │ 1005 │ 2026-04-08 │ James Wilson │ 422.33 │ └──────────┴────────────┴───────────────────┴──────────────┘
Denormalized structure
Denormalization of individual columns
Suppose you most often need to query orders together with customer names and total amounts. To avoid joins in such queries, you can create the orders_with_total table based on orders extending it with the customer_name and total_amount columns:
CREATE TABLE orders_with_total (
order_id UInt64,
customer_id UInt64,
order_date Date,
customer_name String,
total_amount Decimal(10,2)
) ENGINE = MergeTree()
ORDER BY (order_date, order_id);
The work on combining data from different tables is performed once at the stage of filling the denormalized table with data:
INSERT INTO orders_with_total
SELECT
o.*,
c.full_name,
round(SUM(oi.quantity * p.price_per_unit), 2)
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
GROUP BY o.order_id, o.customer_id, o.order_date, c.full_name;
Now, a query to obtain data on orders with their amounts looks like:
SELECT order_id, order_date, customer_name, total_amount FROM orders_with_total;
┌─order_id─┬─order_date─┬─customer_name─────┬─total_amount─┐ 1. │ 1001 │ 2026-04-07 │ Patricia Robinson │ 1102.91 │ 2. │ 1002 │ 2026-04-07 │ Mary Burton │ 739.96 │ 3. │ 1003 │ 2026-04-07 │ James Wilson │ 649.93 │ 4. │ 1004 │ 2026-04-08 │ Robert Gray │ 161.97 │ 5. │ 1005 │ 2026-04-08 │ James Wilson │ 422.33 │ └──────────┴────────────┴───────────────────┴──────────────┘
Use a complex type for denormalization
If a single order includes multiple items (one-to-many relationship), you can create the orders_with_items table using the Nested type to list items in the order (without flattening nested data — flatten_nested=0) and with a materialized column to calculate the order total:
SET flatten_nested=0;
CREATE TABLE orders_with_items (
order_id UInt64,
customer_id UInt64,
order_date Date,
items Nested (
product_name String,
category String,
quantity UInt32,
price_per_unit Decimal(10,2)
),
total_amount Decimal(10,2) MATERIALIZED
arraySum(x -> x.3 * x.4, items)
) ENGINE = MergeTree()
ORDER BY order_id;
Fill the table with data:
INSERT INTO orders_with_items
SELECT
o.*,
arrayMap(
(product_name, category, quantity, price_per_unit) ->
tuple(product_name, category, quantity, price_per_unit),
groupArray(product_name),
groupArray(category),
groupArray(quantity),
groupArray(price_per_unit)
) AS items
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
GROUP BY o.order_id, o.customer_id, o.order_date;
Examples of queries to such a table:
-
Read the entire table:
SELECT order_id, order_date, items, total_amount FROM orders_with_items;┌─order_id─┬─order_date─┬─items────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─total_amount─┐ 1. │ 1001 │ 2026-04-07 │ [('Laptop','Electronics',1,999.99),('Headphones','Electronics',2,49.99),('Pen','Stationery',6,0.49)] │ 1102.91 │ 2. │ 1002 │ 2026-04-07 │ [('Monitor','Electronics',2,199.99),('Office Chair','Furniture',2,169.99)] │ 739.96 │ 3. │ 1003 │ 2026-04-07 │ [('Backpack','Accessories',2,59.99),('Monitor','Electronics',1,199.99),('Printer','Electronics',1,299.99),('Water Bottle','Accessories',3,9.99)] │ 649.93 │ 4. │ 1004 │ 2026-04-08 │ [('Desk Chair','Furniture',1,149.99),('Cap','Accessories',2,5.99)] │ 161.97 │ 5. │ 1005 │ 2026-04-08 │ [('Notebook','Stationery',10,1.99),('Pen','Stationery',5,0.49),('Monitor','Electronics',2,199.99)] │ 422.33 │ └──────────┴────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────┘ -
Find all orders that include the
Monitorproduct and display the number of units sold in each order:SELECT order_id, order_date, tupleElement(arrayFilter(x -> x.1 = 'Monitor', items), 'quantity') AS monitors_sold FROM orders_with_items WHERE notEmpty(monitors_sold);┌─order_id─┬─order_date─┬─monitors_sold─┐ 1. │ 1002 │ 2026-04-07 │ [2] │ 2. │ 1003 │ 2026-04-07 │ [1] │ 3. │ 1005 │ 2026-04-08 │ [2] │ └──────────┴────────────┴───────────────┘