Конференция Arenadata
Новое время — новый Greenplum
Мы приглашаем вас принять участие в конференции, посвященной будущему Open-Source Greenplum 19 сентября в 18:00:00 UTC +3. Встреча будет проходить в гибридном формате — и офлайн, и онлайн. Онлайн-трансляция будет доступна для всех желающих.
Внезапное закрытие Greenplum его владельцем — компанией Broadcom - стало неприятным сюрпризом для всех, кто использует или планирует начать использовать решения на базе этой технологии. Многие ожидают выхода стабильной версии Greenplum 7 и надеются на её дальнейшее активное развитие.
Arenadata не могла допустить, чтобы разрабатываемый годами Open-Source проект Greenplum прекратил своё существование, поэтому 19 сентября мы представим наш ответ на данное решение Broadcom, а участники сообщества получат исчерпывающие разъяснения на все вопросы о дальнейшей судьбе этой технологии.

На конференции вас ждёт обсуждение следующих тем:

  • План возрождения Greenplum;
  • Дорожная карта;
  • Экспертное обсуждение и консультации.
Осталось до события

Use materialized views in Hive

A materialized view (MV) is a database object that stores results of a query and serves these precomputed results later instead of fetching the data from the database. Typically, MVs are used to boost the performance by "caching" the results of long-running queries with complex aggregations, joins, filters, etc.

This section describes basic MV operations with usage examples. More information about MVs in Hive can be found in Apache Hive documentation.

Test database

Throughout this article, the following Hive tables employees and departments are used to demonstrate MV usage examples.

SELECT * FROM employees;
+---------------+-----------------+----------------------+----------------------------+----------------+-------------------+--------------------+
| employees.id  | employees.name  | employees.last_name  |      employees.email       | employees.age  | employees.salary  | employees.dept_id  |
+---------------+-----------------+----------------------+----------------------------+----------------+-------------------+--------------------+
| 1             | Ivan            | Ivanov               | ivan_ivanov123@mail.ru     | 30             | 1000              | 1                  |
| 2             | Sarah           | Connor               | sarah_connor123@yahoo.com  | 35             | 1500              | 2                  |
| 3             | Rick            | Sanchez              | rick_123@mail.ru           | 29             | 1300              | 2                  |
| 4             | John            | Smith                | john_smith123@gmail.com    | 29             | 2000              | 3                  |
+---------------+-----------------+----------------------+----------------------------+----------------+-------------------+--------------------+

SELECT * FROM departments;
+-----------------+-----------------------+-------------------------+------------------------+
| departments.id  | departments.dep_name  | departments.dep_office  | departments.dep_phone  |
+-----------------+-----------------------+-------------------------+------------------------+
| 1               | sales                 | 110                     | 9379992                |
| 2               | it                    | 115                     | 12345678               |
| 3               | support               | 125                     | 880080080              |
+-----------------+-----------------------+-------------------------+------------------------+
SQL for creating test table

 
The following SQL can be used to create and populate the test Hive table using /bin/beeline.

CREATE DATABASE IF NOT EXISTS mv_demo;
USE mv_demo;

DROP TABLE IF EXISTS employees;
CREATE TABLE employees (id int, name string, last_name string, email string, age int, salary int, dept_id int)
 STORED AS ORC
 TBLPROPERTIES ('transactional'='true');

INSERT INTO employees VALUES
(1, 'Ivan', 'Ivanov', 'ivan_ivanov123@mail.ru', 30, 1000, 1),
(2, 'Sarah', 'Connor', 'sarah_connor123@yahoo.com', 35, 1500, 2),
(3, 'Rick', 'Sanchez', 'rick_123@mail.ru', 29, 1300, 2),
(4, 'John', 'Smith', 'john_smith123@gmail.com', 29, 2000, 3);

DROP TABLE IF EXISTS departments;
CREATE TABLE departments (id int, dep_name string, dep_office int, dep_phone string)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

INSERT INTO departments VALUES
(1, 'sales', 110, '9379992'),
(2, 'it', 115, '12345678'),
(3, 'support', 125, '880080080');

Syntax

Since technically an MV is an ordinary Hive table (yet with several limitations), the basic syntax for creating MVs is similar to creating Hive tables. The MV creation syntax is as follows:

CREATE MATERIALIZED VIEW [IF NOT EXISTS] [<db_name>.]<mv_name>
  [DISABLE REWRITE]
  [COMMENT <comment>]
  [PARTITIONED ON (<col_name>, ...)]
  [CLUSTERED ON (<col_name>, ...) | DISTRIBUTED ON (<col_name>, ...) SORTED ON (<col_name>, ...)]
  [
    [ROW FORMAT <row_format>]
    [STORED AS <file_format>] | STORED BY '<storage.handler.class.name>' [WITH SERDEPROPERTIES (...)]
  ]
  [LOCATION <hdfs_path>]
  [TBLPROPERTIES (<property_name>=<property_value>, ...)]
AS
<query>;

Where:

  • <query> — the query to run for results and populate the MV with data.

  • DISABLE REWRITE — disables automatic query rewriting for the given MV.

  • PARTITIONED ON — allows partitioning the MV by columns to improve the performance.

  • CLUSTERED ON — used to divide the MV into roughly equal buckets for better maintainability and speed. Each bucket is stored as a separate file in HDFS.

  • ROW FORMAT — defines row formatting using either delimiters or a custom SerDe class.

  • STORED AS — sets the storage format. The default format is ORC.

  • STORED BY — specifies the name of a custom storage handler class to store MV data using a non-native format.

  • LOCATION — defines an HDFS path to store the MV data.

  • TBLPROPERTIES — a list of key/value properties for tuning the MV behavior.

Query rewriting

Once an MV is created, the Hive optimizer automatically rewrites incoming queries to use the MV if it believes that this will improve the query execution plan. The query rewriting mechanism is enabled by default and can be disabled at MV creation time (DISABLE REWRITE clause) or by using the following command:

ALTER MATERIALIZED VIEW <mv_name> DISABLE REWRITE;

To disable the query rewriting feature globally, set the following property:

SET hive.materializedview.rewriting=false;

If an MV becomes stale (MV content is not updated for a long time), Hive will stop using this MV for query rewriting. The time interval to consider an MV as stale can be set with the hive.materializedview.rewriting.time.window"="Nmin" property in the Custom hive-site.xml ADCM settings section.

IMPORTANT
Automatic rewriting cannot be enabled if the MV was created from a non-transactional table. The source tables must be ACID, managed tables.

Rebuild materialized view

After an MV is created, any modifications to the source tables are not automatically reflected in the MV. The MV rebuild process should be triggered manually using the following command:

ALTER MATERIALIZED VIEW <mv_name> REBUILD;

If new data was added to source tables using INSERT, Hive runs an incremental rebuild. For UPDATE and DELETE operations, the full rebuild is performed.

Usage example

The steps below show basic MV operations:

  1. Create a new MV from the test tables.

    CREATE MATERIALIZED VIEW mv_emp_dp
      AS SELECT e.name, e.last_name, e.email, d.dep_name
      FROM employees e JOIN departments d
      ON (e.dept_id = d.id);
  2. Get the list of available MVs:

    SHOW MATERIALIZED VIEWS;

    The output:

    +------------+
    |  tab_name  |
    +------------+
    | mv_emp_dp  |
    +------------+

    Also, check the HDFS warehouse location for new files.

    $ sudo -u hdfs hdfs dfs -ls /apps/hive/warehouse/mv_demo.db/mv_emp_dp

    A new ORC file with MV data is available in the Hive warehouse directory:

    -rw-r--r--   3 hive hadoop        795 2023-11-20 13:23 /apps/hive/warehouse/mv_demo.db/mv_emp_dp/000000_0
  3. Run a query similar to the one used for MV creation. Such a query gets the results from the precomputed MV instead of joining and scanning the two source tables.

    SELECT e.last_name, e.email, d.dep_name
    FROM employees e JOIN departments d
    ON (e.dept_id = d.id)
    WHERE d.dep_name='sales';

    The result:

    +--------------+-------------------------+-------------+
    | e.last_name  |         e.email         | d.dep_name  |
    +--------------+-------------------------+-------------+
    | Ivanov       | ivan_ivanov123@mail.ru  | sales       |
    +--------------+-------------------------+-------------++
  4. Run the same query with EXPLAIN EXTENDED to ensure that query rewriting took place.

    EXPLAIN EXTENDED SELECT e.last_name, e.email, d.dep_name
    FROM employees e JOIN departments d
    ON (e.dept_id = d.id)
    WHERE d.dep_name='sales';

    The output is shown below. Notice the alias alias: mv_demo.mv_emp_dp — this indicates that the results have been fetched from the mv_demo.mv_emp_dp MV with one table scan.

    +----------------------------------------------------+
    |                      Explain                       |
    +----------------------------------------------------+
    | STAGE DEPENDENCIES:                                |
    |   Stage-0 is a root stage                          |
    |                                                    |
    | STAGE PLANS:                                       |
    |   Stage: Stage-0                                   |
    |     Fetch Operator                                 |
    |       limit: -1                                    |
    |       Processor Tree:                              |
    |         TableScan                                  |
    |           alias: mv_demo.mv_emp_dp                 |
    |           GatherStats: false                       |
    |           Filter Operator                          |
    |             isSamplingPred: false                  |
    |             predicate: (dep_name = 'sales') (ty... |
    |             Select Operator                        |
    |               expressions: last_name (type: str... |
    |               outputColumnNames: _col0, _col1, ... |
    |               ListSink                             |
    |                                                    |
    +----------------------------------------------------+
  5. Update the source table.

    INSERT INTO employees VALUES
    (5,'Anna','Petrova','ann_pett@yandxed.ru', 32, 1700, 1);
  6. Run the EXPLAIN EXTENDED query again:

    EXPLAIN EXTENDED SELECT e.last_name, e.email, d.dep_name
    FROM employees e JOIN departments d
    ON (e.dept_id = d.id)
    WHERE d.dep_name='sales';

    This time the execution plan includes joining and full scanning of the source tables since their contents is no more in sync with the MV.

    +----------------------------------------------------+
    |                      Explain                       |
    +----------------------------------------------------+
    | STAGE DEPENDENCIES:                                |
    |   Stage-1 is a root stage                          |
    |   Stage-0 depends on stages: Stage-1               |
    |                                                    |
    | STAGE PLANS:                                       |
    |   Stage: Stage-1                                   |
    |     Tez                                            |
    |       DagId: hive_20231120144209_2c92e022-a865-... |
    |       Edges:                                       |
    |         Map 1 <- Map 2 (BROADCAST_EDGE)            |
    |       DagName: hive_20231120144209_2c92e022-a86... |
    |       Vertices:                                    |
    |         Map 1                                      |
    |             Map Operator Tree:                     |
    |                 TableScan                          |
    |                   alias: e                         |
    |                   Statistics: Num rows: 50 Dat ... |
    |                   ...                              |
    +----------------------------------------------------+
  7. To refresh the mv_emp_dp MV and synchronize the updates made to the source tables, rebuild the MV:

    ALTER MATERIALIZED VIEW mv_emp_dp REBUILD;
  8. Run the EXPLAIN EXTENDED query again:

    EXPLAIN EXTENDED SELECT e.last_name, e.email, d.dep_name
    FROM employees e JOIN departments d
    ON (e.dept_id = d.id)
    WHERE d.dep_name='sales';

    Now the execution plan indicates the use of MV for getting the data.

    +----------------------------------------------------+
    |                      Explain                       |
    +----------------------------------------------------+
    | STAGE DEPENDENCIES:                                |
    |   Stage-0 is a root stage                          |
    |                                                    |
    | STAGE PLANS:                                       |
    |   Stage: Stage-0                                   |
    |     Fetch Operator                                 |
    |       limit: -1                                    |
    |       Processor Tree:                              |
    |         TableScan                                  |
    |           alias: mv_demo.mv_emp_dp                 |
    |           GatherStats: false                       |
    |           Filter Operator                          |
    |             isSamplingPred: false                  |
    |             predicate: (dep_name = 'sales') (ty... |
    |             Select Operator                        |
    |               expressions: last_name (type: str... |
    |               outputColumnNames: _col0, _col1, ... |
    |               ListSink                             |
    |                                                    |
    +----------------------------------------------------+
  9. To delete the materialized view, use the command:

    DROP MATERIALIZED VIEW mv_emp_dp;
Found a mistake? Seleсt text and press Ctrl+Enter to report it