Конференция Arenadata
Новое время — новый Greenplum
Мы приглашаем вас принять участие в конференции, посвященной будущему Open-Source Greenplum 19 сентября в 18:00:00 UTC +3. Встреча будет проходить в гибридном формате — и офлайн, и онлайн. Онлайн-трансляция будет доступна для всех желающих.
Внезапное закрытие Greenplum его владельцем — компанией Broadcom - стало неприятным сюрпризом для всех, кто использует или планирует начать использовать решения на базе этой технологии. Многие ожидают выхода стабильной версии Greenplum 7 и надеются на её дальнейшее активное развитие.
Arenadata не могла допустить, чтобы разрабатываемый годами Open-Source проект Greenplum прекратил своё существование, поэтому 19 сентября мы представим наш ответ на данное решение Broadcom, а участники сообщества получат исчерпывающие разъяснения на все вопросы о дальнейшей судьбе этой технологии.

На конференции вас ждёт обсуждение следующих тем:

  • План возрождения Greenplum;
  • Дорожная карта;
  • Экспертное обсуждение и консультации.
Осталось до события

Impala with Hive

Impala can use existing Apache Hive infrastructure to perform long-running, batch-oriented SQL queries. Impala stores table definitions in MySQL or PostgreSQL database known as Metastore, the same database where Hive keeps this type of data. Impala can access tables defined or loaded by Hive, if all columns use Impala-supported data types, file formats, and compression codecs.

Use the same table in Hive and Impala

As described above, you can work with the same table in Hive and Impala. For example, create a table in Hive (see Work with Hive tables). The Hive shell is available on the host where the Hive HiveServer2 component is installed.

$ sudo -u hive hive
CREATE TABLE employees (
employee_id INT,
full_name STRING,
hours INT,
rate  INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

INSERT INTO TABLE
    employees
VALUES
     (1, 'James Wilson', 8, 500),
    (2, 'Mary Burton', 6, 450),
    (3, 'Patricia Robinson', 7, 430);

After that you can utilize impala-shell to select data:

$ impala-shell
SELECT * FROM employees;

The result:

Query: select * from employees
Query submitted at: 2023-09-05 14:32:38 (Coordinator: http://ees-1adh.ru-central1.internal:25000)
Query progress can be monitored at: http://ees-1adh.ru-central1.internal:25000/query_plan?query_id=dd477956d9bde7ca:61358fec00000000
+-------------+-------------------+-------+------+
| employee_id | full_name         | hours | rate |
+-------------+-------------------+-------+------+
| 1           | James Wilson      | 8     | 500  |
| 2           | Mary Burton       | 6     | 450  |
| 3           | Patricia Robinson | 7     | 430  |
+-------------+-------------------+-------+------+
Fetched 3 row(s) in 0.21s

Add rows to the table in impala-shell:

insert into employees values ((4,"Robert Gray", 8, 480), (5,"Elizabeth Tucker", 7, 470));

Use the Hive shell to display data:

$ sudo -u hive hive
SELECT * FROM employees;

The result:

OK
1   James Wilson    8   500
2   Mary Burton 6   450
3   Patricia Robinson   7   430
4   Robert Gray 8   480
5   Elizabeth Tucker    7   470
Time taken: 0.23 seconds, Fetched: 5 row(s)

Rows added through Impala are displayed.

The initial focus on query features and performance means that Impala can read more types of data with the SELECT statement than it can write with the INSERT statement. To query data using the Avro, RCFile, or SequenceFile file formats, you should load the data using Hive.

Collect table and column statistics for query optimization

Impala can do better optimization for complex or multi-table queries when it has statistics about the amount of data and data distribution. For example, the optimization of JOIN queries requires statistics to determine which table is larger. It can be done by comparing the number of rows and the average row size of each table.

The Impala query optimizer can use table statistics and column statistics. Originally, you gathered this information with the ANALYZE TABLE statement in Hive. In Impala, use the COMPUTE STATS statement instead. COMPUTE STATS requires less setup, is more reliable, and does not require switching back and forth between impala-shell and the Hive shell.

Table statistics

The Impala query planner uses statistics of entire tables and partitions. This information includes physical characteristics: a number of rows, a number of data files, a total size of data files, and a file format. For partitioned tables, statistics are calculated for each partition and for the entire table.

The statistics metadata is stored in the Metastore database, and can be updated by Impala or Hive. If a value is not available, -1 is used as a placeholder. Some values, for example, the number and total sizes of data files, are always kept up to date because they can be calculated cheaply as part of collecting HDFS metadata.

The example below uses the SHOW TABLE STATS statement to display table statistics. For partitioned tables you can also use the SHOW PARTITIONS statement that displays exactly the same information as SHOW TABLE STATS.

show table stats test_table1;

The result:

+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
| #Rows | #Files |  Size   | Bytes Cached | Cache Replication |  Format | Incremental stats | Location                                   |
+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
| -1    |   52   | 12.62GB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://adh/apps/hive/warehouse/test_table1 |

The values for the number and sizes of files are shown. The number of rows is not known and is displayed as -1, because it requires a potentially expensive scan through the entire table.

Execute the COMPUTE STATS statement and check the table statistics:

compute stats test_table1;

show table stats test_table1;

The result:

+-------------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
|    #Rows    | #Files |  Size   | Bytes Cached | Cache Replication |  Format | Incremental stats | Location                                   |
+-------------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
| 541666666   |   52   | 12.62GB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://adh/apps/hive/warehouse/test_table1 |

The COMPUTE STATS statement adds the missing statistics values.

You can see the Incremental stats column in the output. The column shows whether the Incremental stats feature is used to collect statistics for the table. Incremental stats is a functionality for partitioned tables. When Impala computes incremental statistics for a partitioned table, by default, Impala only processes partitions that do not yet have incremental statistics. This way, you can keep statistics up to date without incurring the overhead of reprocessing the entire table each time. For a particular table, use either COMPUTE STATS or COMPUTE INCREMENTAL STATS, but never combine the two or alternate between them. If you need to switch from COMPUTE STATS to COMPUTE INCREMENTAL STATS or vice versa for a table, drop all statistics by running DROP STATS before changing the statistics type. For more information, see the Generating Table and Column Statistics section of the Table and Column Statistics article.

Column statistics

The Impala query planner can utilize statistics of individual columns when the metadata is available in the Metastore database. This technique is most valuable for columns compared across tables in JOIN queries to estimate how many rows the query will retrieve from each table. These statistics are also important for subqueries that use the EXISTS and IN operators. These operators are internally processed as JOIN queries.

The following example displays the column statistics for an unpartitioned Parquet table:

show column stats test_table2;

The result:

+--------+---------+------------------+--------+----------+----------+--------+---------+
| Column | Type    | #Distinct Values | #Nulls | Max Size | Avg Size | #Trues | #Falses |
+--------+---------+------------------+--------+----------+----------+--------+---------+
| field1 | INT     | -1               | -1     |  4       | 4.0      |  -1    |  -1     |
| field2 | BOOLEAN | -1               | -1     |  1       | 1.0      |  -1    |  -1     |
| field3 | STRING  | -1               | -1     | -1       | -1       |  -1    |  -1     |
+--------+---------+------------------+--------+----------+----------+--------+---------+

In the output, the maximum and average size values of numeric and fixed-size types are available because they are constants. The number of Distinct Values is not determined, since it requires a potentially expensive scan through the entire table, and the value is displayed as -1. The same applies to maximum and average sizes of variable-sized types, such as STRING. The COMPUTE STATS statement adds the missing values and updates statistics.

compute stats test_table2;

show column stats test_table2;

The result:

+--------+---------+------------------+--------+----------+----------+--------+---------+
| Column | Type    | #Distinct Values | #Nulls | Max Size | Avg Size | #Trues | #Falses |
+--------+---------+------------------+--------+----------+----------+--------+---------+
| field1 | INT     | 6                | 0      | 4        | 4.0      | -1     | -1      |
| field2 | BOOLEAN | 2                | 0      | 1        | 1.0      | 14     | 7       |
| field3 | STRING  | 6                | 0      | 4        | 4.0      | -1     | -1      |
+--------+---------+------------------+--------+----------+----------+--------+---------+

The Trues and Falses values are calculated only for fields of the BOOLEAN type.

For effective using of column statistics, Impala also needs to have table statistics. When you use the Impala COMPUTE STATS statement, both table and column statistics are gathered at the same time. If you run the Hive statement ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS, column statistics are computed. Impala can only use the column statistics if the table is unpartitioned. Impala is not able to utilize Hive-generated column statistics for a partitioned table.

Found a mistake? Seleсt text and press Ctrl+Enter to report it