Work with Iceberg tables in Hive
Apache Iceberg is an open, high-performance format for large analytic tables. The ADH Hive service adopts this format to enhance the performance of Hive tables.
NOTE
Hive supports Iceberg tables starting with ADH 3.2.4.3.
In Hive, the Iceberg table format is available out-of-the-box. The support for Iceberg tables is based on Hive StorageHandlers: to create an Iceberg table in Hive, you specify the org.apache.iceberg.mr.hive.HiveIcebergStorageHandler storage handler in the STORED BY clause.
Iceberg table support is enabled by default and can be toggled using the iceberg.engine.hive.enabled property (Clusters → <clusterName> → Services → Hive → Primary Configuration → hive-site.xml).
The engine.hive.enabled property can also be set programmatically as a table property when an Iceberg table is created. An example is below:
Catalog catalog = ...;
Map<String, String> tableProperties = Maps.newHashMap();
tableProperties.put(TableProperties.ENGINE_HIVE_ENABLED, "true");
catalog.createTable(tableId, schema, spec, tableProperties);
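The same property can also be set from Hive SQL via the TBLPROPERTIES clause at table creation. The sketch below assumes this clause is honored the same way as the programmatic property; the table name and columns are illustrative:
-- Hypothetical table; engine.hive.enabled set per table
CREATE TABLE sample_iceberg (id int, value string)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
TBLPROPERTIES ('engine.hive.enabled'='true');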
NOTE
The table property configuration overrides the global ADH configuration.
As of ADH 3.2.4.3, the support for Iceberg tables in Hive comprises the following capabilities:
- Create a table.
- Drop a table.
- Select data.
- Insert data.
DDL commands
Create an Iceberg table
To create an Iceberg table, append the STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' clause to the standard Hive CREATE TABLE statement.
Example:
CREATE TABLE transactions (txn_id int, acc_id int, txn_amount decimal(10,2), txn_date date)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
Run the DESCRIBE FORMATTED command to retrieve information about the newly created table:
DESCRIBE FORMATTED transactions;
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| col_name                      | data_type                                          | comment                                            |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name                    | data_type                                          | comment                                            |
| id                            | int                                                | from deserializer                                  |
| value                         | string                                             | from deserializer                                  |
|                               | NULL                                               | NULL                                               |
| # Detailed Table Information  | NULL                                               | NULL                                               |
| Database:                     | default                                            | NULL                                               |
| OwnerType:                    | USER                                               | NULL                                               |
| Owner:                        | hive                                               | NULL                                               |
| CreateTime:                   | Wed Jun 05 14:09:36 UTC 2024                       | NULL                                               |
| LastAccessTime:               | UNKNOWN                                            | NULL                                               |
| Retention:                    | 0                                                  | NULL                                               |
| Location:                     | hdfs://adh/apps/hive/warehouse/test_iceberg_table  | NULL                                               |
| Table Type:                   | MANAGED_TABLE                                      | NULL                                               |
| Table Parameters:             | NULL                                               | NULL                                               |
|                               | bucketing_version                                  | 2                                                  |
|                               | current-schema                                     | {\"type\":\"struct\",\"schema-id\":0,\"fields\":[{\"id\":1,\"name\":\"id\",\"required\":false,\"type\":\"int\"},{\"id\":2,\"name\":\"value\",\"required\":false,\"type\":\"string\"}]} |
|                               | engine.hive.enabled                                | true                                               |
|                               | external.table.purge                               | TRUE                                               |
|                               | metadata_location                                  | hdfs://adh/apps/hive/warehouse/test_iceberg_table/metadata/00000-fa65f392-3b34-4e51-915b-8e2aae261098.metadata.json |
|                               | numFiles                                           | 0                                                  |
|                               | numRows                                            | 0                                                  |
|                               | rawDataSize                                        | 0                                                  |
|                               | snapshot-count                                     | 0                                                  |
|                               | storage_handler                                    | org.apache.iceberg.mr.hive.HiveIcebergStorageHandler |
|                               | table_type                                         | ICEBERG                                            |
|                               | totalSize                                          | 0                                                  |
|                               | transient_lastDdlTime                              | 1717596576                                         |
|                               | uuid                                               | f9da6135-095e-491f-919e-f2194253b8f8               |
|                               | write.parquet.compression-codec                    | zstd                                               |
|                               | NULL                                               | NULL                                               |
| # Storage Information         | NULL                                               | NULL                                               |
| SerDe Library:                | org.apache.iceberg.mr.hive.HiveIcebergSerDe        | NULL                                               |
| InputFormat:                  | org.apache.iceberg.mr.hive.HiveIcebergInputFormat  | NULL                                               |
| OutputFormat:                 | org.apache.iceberg.mr.hive.HiveIcebergOutputFormat | NULL                                               |
| Compressed:                   | No                                                 | NULL                                               |
| Num Buckets:                  | 0                                                  | NULL                                               |
| Bucket Columns:               | []                                                 | NULL                                               |
| Sort Columns:                 | []                                                 | NULL                                               |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
The table_type parameter in the output indicates that the table has been created as an Iceberg table.
CREATE TABLE AS SELECT
You can also use Hive's native CREATE TABLE <tbl_name> AS SELECT (CTAS) syntax to create an Iceberg table based on an existing Hive table.
In this case, the major distinction is that the Iceberg table is created at the very beginning of the query execution, whereas the data is inserted after the query completes.
Thus, for some period of time, the table may exist but contain no data.
CREATE TABLE transactions_ctas
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' AS
SELECT * FROM transactions;
DESCRIBE FORMATTED transactions_ctas;
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| col_name                      | data_type                                          | comment                                            |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name                    | data_type                                          | comment                                            |
| id                            | int                                                | from deserializer                                  |
|                               | NULL                                               | NULL                                               |
| # Detailed Table Information  | NULL                                               | NULL                                               |
| Database:                     | default                                            | NULL                                               |
| OwnerType:                    | USER                                               | NULL                                               |
| Owner:                        | hive                                               | NULL                                               |
| CreateTime:                   | Mon Jun 03 20:59:25 UTC 2024                       | NULL                                               |
| LastAccessTime:               | UNKNOWN                                            | NULL                                               |
| Retention:                    | 0                                                  | NULL                                               |
| Location:                     | hdfs://adh/apps/hive/warehouse/test_iceberg_table1 | NULL                                               |
| Table Type:                   | MANAGED_TABLE                                      | NULL                                               |
| Table Parameters:             | NULL                                               | NULL                                               |
|                               | current-schema                                     | {\"type\":\"struct\",\"schema-id\":0,\"fields\":[{\"id\":1,\"name\":\"id\",\"required\":false,\"type\":\"int\"}]} |
|                               | engine.hive.enabled                                | true                                               |
|                               | external.table.purge                               | TRUE                                               |
|                               | metadata_location                                  | hdfs://adh/apps/hive/warehouse/test_iceberg_table1/metadata/00000-df221cec-ab76-4de9-95cd-ef7605c8e79f.metadata.json |
|                               | numFiles                                           | 0                                                  |
|                               | numRows                                            | 0                                                  |
|                               | rawDataSize                                        | 0                                                  |
|                               | snapshot-count                                     | 0                                                  |
|                               | storage_handler                                    | org.apache.iceberg.mr.hive.HiveIcebergStorageHandler |
|                               | table_type                                         | ICEBERG                                            |
|                               | totalSize                                          | 0                                                  |
|                               | transient_lastDdlTime                              | 1717448365                                         |
|                               | uuid                                               | eb1933a0-43d9-4dd2-9426-cf7ace5063c8               |
|                               | write.parquet.compression-codec                    | zstd                                               |
|                               | NULL                                               | NULL                                               |
| # Storage Information         | NULL                                               | NULL                                               |
| SerDe Library:                | org.apache.iceberg.mr.hive.HiveIcebergSerDe        | NULL                                               |
| InputFormat:                  | org.apache.iceberg.mr.hive.HiveIcebergInputFormat  | NULL                                               |
| OutputFormat:                 | org.apache.iceberg.mr.hive.HiveIcebergOutputFormat | NULL                                               |
| Compressed:                   | No                                                 | NULL                                               |
| Num Buckets:                  | 0                                                  | NULL                                               |
| Bucket Columns:               | []                                                 | NULL                                               |
| Sort Columns:                 | []                                                 | NULL                                               |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
CREATE TABLE LIKE
You can also use the CREATE TABLE LIKE <tbl_name> syntax to create an empty Iceberg table based on the definition of another Hive table.
Example:
CREATE TABLE transactions_ctlt LIKE transactions
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
DESCRIBE FORMATTED transactions_ctlt;
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| col_name                      | data_type                                          | comment                                            |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name                    | data_type                                          | comment                                            |
| txn_id                        | int                                                | from deserializer                                  |
| acc_id                        | int                                                | from deserializer                                  |
| txn_amount                    | decimal(10,2)                                      | from deserializer                                  |
| txn_date                      | date                                               | from deserializer                                  |
|                               | NULL                                               | NULL                                               |
| # Detailed Table Information  | NULL                                               | NULL                                               |
| Database:                     | default                                            | NULL                                               |
| OwnerType:                    | USER                                               | NULL                                               |
| Owner:                        | hive                                               | NULL                                               |
| CreateTime:                   | Fri Jun 07 10:08:08 UTC 2024                       | NULL                                               |
| LastAccessTime:               | UNKNOWN                                            | NULL                                               |
| Retention:                    | 0                                                  | NULL                                               |
| Location:                     | hdfs://adh/apps/hive/warehouse/transactions_ctlt   | NULL                                               |
| Table Type:                   | MANAGED_TABLE                                      | NULL                                               |
| Table Parameters:             | NULL                                               | NULL                                               |
|                               | current-schema                                     | {\"type\":\"struct\",\"schema-id\":0,\"fields\":[{\"id\":1,\"name\":\"txn_id\",\"required\":false,\"type\":\"int\"},{\"id\":2,\"name\":\"acc_id\",\"required\":false,\"type\":\"int\"},{\"id\":3,\"name\":\"txn_amount\",\"required\":false,\"type\":\"decimal(10, 2)\"},{\"id\":4,\"name\":\"txn_date\",\"required\":false,\"type\":\"date\"}]} |
|                               | engine.hive.enabled                                | true                                               |
|                               | external.table.purge                               | TRUE                                               |
|                               | metadata_location                                  | hdfs://adh/apps/hive/warehouse/transactions_ctlt/metadata/00000-72d6ad62-e7c4-4e53-97e9-be7f82d5ced9.metadata.json |
|                               | numFiles                                           | 0                                                  |
|                               | numRows                                            | 0                                                  |
|                               | rawDataSize                                        | 0                                                  |
|                               | snapshot-count                                     | 0                                                  |
|                               | storage_handler                                    | org.apache.iceberg.mr.hive.HiveIcebergStorageHandler |
|                               | table_type                                         | ICEBERG                                            |
|                               | totalSize                                          | 0                                                  |
|                               | transient_lastDdlTime                              | 1717754888                                         |
|                               | uuid                                               | d4e3811a-7a36-47b5-bb12-0948f251f6dc               |
|                               | write.parquet.compression-codec                    | zstd                                               |
|                               | NULL                                               | NULL                                               |
| # Storage Information         | NULL                                               | NULL                                               |
| SerDe Library:                | org.apache.iceberg.mr.hive.HiveIcebergSerDe        | NULL                                               |
| InputFormat:                  | org.apache.iceberg.mr.hive.HiveIcebergInputFormat  | NULL                                               |
| OutputFormat:                 | org.apache.iceberg.mr.hive.HiveIcebergOutputFormat | NULL                                               |
| Compressed:                   | No                                                 | NULL                                               |
| Num Buckets:                  | 0                                                  | NULL                                               |
| Bucket Columns:               | []                                                 | NULL                                               |
| Sort Columns:                 | []                                                 | NULL                                               |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
Create a partitioned table
To create a partitioned table, use the following syntax:
CREATE TABLE transactions_partitioned (
txn_id int, txn_amount decimal(10,2), txn_date date)
PARTITIONED BY (acc_id int)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
Run the DESCRIBE FORMATTED command for the newly created table and search the output for the default-partition-spec property. Its value stores the partition specification in the Iceberg format.
DESCRIBE FORMATTED transactions_partitioned;
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| col_name                      | data_type                                          | comment                                            |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name                    | data_type                                          | comment                                            |
| txn_id                        | int                                                | from deserializer                                  |
| txn_amount                    | decimal(10,2)                                      | from deserializer                                  |
| txn_date                      | date                                               | from deserializer                                  |
| acc_id                        | int                                                | from deserializer                                  |
|                               | NULL                                               | NULL                                               |
| # Detailed Table Information  | NULL                                               | NULL                                               |
| Database:                     | default                                            | NULL                                               |
| OwnerType:                    | USER                                               | NULL                                               |
| Owner:                        | hive                                               | NULL                                               |
| CreateTime:                   | Thu Jun 06 13:57:31 UTC 2024                       | NULL                                               |
| LastAccessTime:               | UNKNOWN                                            | NULL                                               |
| Retention:                    | 0                                                  | NULL                                               |
| Location:                     | hdfs://adh/apps/hive/warehouse/transactions        | NULL                                               |
| Table Type:                   | MANAGED_TABLE                                      | NULL                                               |
| Table Parameters:             | NULL                                               | NULL                                               |
|                               | bucketing_version                                  | 2                                                  |
|                               | current-schema                                     | {\"type\":\"struct\",\"schema-id\":0,\"fields\":[{\"id\":1,\"name\":\"txn_id\",\"required\":false,\"type\":\"int\"},{\"id\":2,\"name\":\"txn_amount\",\"required\":false,\"type\":\"decimal(10, 2)\"},{\"id\":3,\"name\":\"txn_date\",\"required\":false,\"type\":\"date\"},{\"id\":4,\"name\":\"acc_id\",\"required\":false,\"type\":\"int\"}]} |
|                               | default-partition-spec                             | {\"spec-id\":0,\"fields\":[{\"name\":\"acc_id\",\"transform\":\"identity\",\"source-id\":4,\"field-id\":1000}]} |
|                               | engine.hive.enabled                                | true                                               |
|                               | external.table.purge                               | TRUE                                               |
|                               | metadata_location                                  | hdfs://adh/apps/hive/warehouse/transactions/metadata/00000-b6dde9b3-d1cc-4f92-828e-5dab83b1e5e8.metadata.json |
|                               | snapshot-count                                     | 0                                                  |
|                               | storage_handler                                    | org.apache.iceberg.mr.hive.HiveIcebergStorageHandler |
|                               | table_type                                         | ICEBERG                                            |
|                               | transient_lastDdlTime                              | 1717682251                                         |
|                               | uuid                                               | c6205d6d-ddca-4b04-a780-8b8e1048db0b               |
|                               | write.parquet.compression-codec                    | zstd                                               |
|                               | NULL                                               | NULL                                               |
| # Storage Information         | NULL                                               | NULL                                               |
| SerDe Library:                | org.apache.iceberg.mr.hive.HiveIcebergSerDe        | NULL                                               |
| InputFormat:                  | org.apache.iceberg.mr.hive.HiveIcebergInputFormat  | NULL                                               |
| OutputFormat:                 | org.apache.iceberg.mr.hive.HiveIcebergOutputFormat | NULL                                               |
| Compressed:                   | No                                                 | NULL                                               |
| Num Buckets:                  | 0                                                  | NULL                                               |
| Bucket Columns:               | []                                                 | NULL                                               |
| Sort Columns:                 | []                                                 | NULL                                               |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
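Note that in the DESCRIBE output the partition column appears as a regular column at the end of the schema. Assuming this column ordering, an insert into the partitioned table would supply the partition value last (the values below are illustrative):
INSERT INTO transactions_partitioned
VALUES (1, 10.00, '2023-01-01', 1001);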
DML commands
As of ADH 3.2.4.3, Hive supports the SELECT and INSERT DML operations for Iceberg tables.
NOTE
To run DML commands on Iceberg tables, Hive must use the MapReduce execution engine.
To change the Hive execution engine, use the hive.execution.engine property in ADCM (Clusters → <clusterName> → Services → Hive → Primary Configuration → hive-site.xml). After changing the property, restart the Hive service.
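Alternatively, for a single Beeline session, the engine can be switched with a SET command (this affects only the current session, not the cluster-wide ADCM configuration):
SET hive.execution.engine=mr;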
SELECT
The syntax of the SELECT statement run on an Iceberg table is identical to a regular Hive SELECT command.
However, under the hood Hive runs a different set of actions to benefit from the Iceberg format. The distinctions are present in both the query compilation and execution phases.
Below is a brief description of the Iceberg-specific actions that take place under the hood when Hive selects data from an Iceberg table:
- Hive requests the Iceberg catalog for the table's metadata file that stores the table definition. The Iceberg catalog returns a pointer to the metadata file.
- Hive reads the metadata file and gets the manifest list location. This entry carries information about the manifest files that make up the snapshot, define partitions and partition boundaries, and so on. If the requested table is partitioned, Hive also reads the partition specification and can thus apply certain partition optimizations at this step.
- Hive analyzes each individual manifest file from the manifest list. Each manifest file carries information about the data files that store the actual physical data.
- Finally, Hive scans the data files. If the table is partitioned, the engine deals only with the data files that match the WHERE clause or other query filters. The major advantage of the Iceberg format lies in this phase: by this time, the engine has enough information to scan only those data files that are relevant to the given SELECT query. This is possible because data is tracked at the file level (rather than the directory level, as in the traditional Hive model).
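For example, for the partitioned table created earlier, a filter on the partition column lets the engine skip whole data files (the literal value below is illustrative):
-- Only the data files belonging to the acc_id = 1001 partition are scanned
SELECT txn_id, txn_amount
FROM transactions_partitioned
WHERE acc_id = 1001;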
INSERT
For Iceberg tables, Hive supports the traditional INSERT INTO operation.
Example:
INSERT INTO transactions VALUES
(1, 1002, 10.00, '2023-01-01'),
(2, 1002, 20.00, '2023-01-03'),
(3, 1002, 30.00, '2023-01-02'),
(4, 1001, 100.50, '2023-01-02'),
(5, 1001, 150.50, '2023-01-04'),
(6, 1001, 200.50, '2023-01-03'),
(7, 1003, 50.00, '2023-01-03'),
(8, 1003, 50.00, '2023-01-01'),
(9, 1003, 75.00, '2023-01-04');
Multi-table inserts are also supported, but they are not atomic. The commits are made on a per-table basis. During the commit process, clients may observe partial changes to the table contents and failures can leave partial changes committed. Changes within a single table remain atomic.
The following example shows how to populate two Iceberg tables with data from an existing Hive table using one multi-insert operation:
FROM transactions
INSERT INTO transactions11 SELECT txn_id, txn_date
INSERT INTO transactions22 SELECT txn_id, acc_id;
Custom Hive catalogs
From the Hive operational perspective, there is only one global data catalog, HiveCatalog, which stores data in HDFS under the /apps/hive/warehouse/ location.
However, when using Iceberg tables, you can work with several different data catalog types, such as Hive, Hadoop, or a custom catalog implementation.
Iceberg also allows loading a table directly based on its path in HDFS. In this case, a table does not belong to any catalog. Such cross-catalog, path-based tables can be used as temporary or intermediate tables, for example, for JOIN cases.
The following example creates another Hive catalog and specifies a custom warehouse location for the catalog.
SET iceberg.catalog.my_test_hive.type=hive;
SET iceberg.catalog.my_test_hive.warehouse=hdfs://adh/apps/hive/custom_catalog/warehouse;
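A table can then be associated with the registered catalog through the iceberg.catalog table property. The sketch below assumes the my_test_hive catalog defined above; the table name and columns are illustrative:
-- Hypothetical table bound to the my_test_hive catalog
CREATE TABLE custom_catalog_table (id int, value string)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
TBLPROPERTIES ('iceberg.catalog'='my_test_hive');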
Limitations
There are several limitations to Iceberg table support in Hive. These restrictions stem from the Hive 3.x version used in ADH. In upcoming ADH releases, the shift to Hive 4.x will extend Iceberg table support and eliminate the limitations listed below.
- As of ADH 3.2.4.3, Hive supports the following DML/DDL operations:
  - CREATE TABLE
  - DROP TABLE
  - SELECT
  - INSERT
- As of ADH 3.2.4.3, Iceberg tables in Hive are created using the STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' clause rather than STORED BY ICEBERG.
- Starting with Apache Iceberg 0.11.0, when using Hive with Tez you also have to disable vectorization (hive.vectorized.execution.enabled=false).