Work with Iceberg tables in Hive
Apache Iceberg is an open, high-performance format for large analytic tables. The ADH Hive service adopts this format to enhance the performance of Hive tables.
NOTE
Hive supports Iceberg tables starting with ADH 3.2.4.3.
In Hive, the Iceberg table format is available out-of-the-box. The support for Iceberg tables is based on Hive StorageHandlers: to create an Iceberg table in Hive, you specify the org.apache.iceberg.mr.hive.HiveIcebergStorageHandler storage handler in the STORED BY clause.
Iceberg table support is enabled by default and can be toggled using the iceberg.engine.hive.enabled property (Clusters → <clusterName> → Services → Hive → Primary Configuration → hive-site.xml).
The engine.hive.enabled property can also be set programmatically as a table property when an Iceberg table is created. An example is below:
Catalog catalog = ...;
Map<String, String> tableProperties = Maps.newHashMap();
tableProperties.put(TableProperties.ENGINE_HIVE_ENABLED, "true");
catalog.createTable(tableId, schema, spec, tableProperties);
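The same property can also be set from Hive SQL via the TBLPROPERTIES clause at table creation. The sketch below assumes this clause is honored the same way as the programmatic property; the table name and columns are illustrative:
-- Hypothetical table; engine.hive.enabled set per table
CREATE TABLE sample_iceberg (id int, value string)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
TBLPROPERTIES ('engine.hive.enabled'='true');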
NOTE
The table property configuration overrides the global ADH configuration.
As of ADH 3.2.4.3, the support for Iceberg tables in Hive comprises the following capabilities:
- Create a table.
- Drop a table.
- Select data.
- Insert data.
DDL commands
Create an Iceberg table
To create an Iceberg table, append the STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' clause to the standard Hive CREATE TABLE statement.
Example:
CREATE TABLE transactions (txn_id int, acc_id int, txn_amount decimal(10,2), txn_date date)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
Run the DESCRIBE FORMATTED command to retrieve information about the newly created table:
DESCRIBE FORMATTED transactions;
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| col_name                      | data_type                                          | comment                                            |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name                    | data_type                                          | comment                                            |
| id                            | int                                                | from deserializer                                  |
| value                         | string                                             | from deserializer                                  |
|                               | NULL                                               | NULL                                               |
| # Detailed Table Information  | NULL                                               | NULL                                               |
| Database:                     | default                                            | NULL                                               |
| OwnerType:                    | USER                                               | NULL                                               |
| Owner:                        | hive                                               | NULL                                               |
| CreateTime:                   | Wed Jun 05 14:09:36 UTC 2024                       | NULL                                               |
| LastAccessTime:               | UNKNOWN                                            | NULL                                               |
| Retention:                    | 0                                                  | NULL                                               |
| Location:                     | hdfs://adh/apps/hive/warehouse/test_iceberg_table  | NULL                                               |
| Table Type:                   | MANAGED_TABLE                                      | NULL                                               |
| Table Parameters:             | NULL                                               | NULL                                               |
|                               | bucketing_version                                  | 2                                                  |
|                               | current-schema                                     | {\"type\":\"struct\",\"schema-id\":0,\"fields\":[{\"id\":1,\"name\":\"id\",\"required\":false,\"type\":\"int\"},{\"id\":2,\"name\":\"value\",\"required\":false,\"type\":\"string\"}]} |
|                               | engine.hive.enabled                                | true                                               |
|                               | external.table.purge                               | TRUE                                               |
|                               | metadata_location                                  | hdfs://adh/apps/hive/warehouse/test_iceberg_table/metadata/00000-fa65f392-3b34-4e51-915b-8e2aae261098.metadata.json |
|                               | numFiles                                           | 0                                                  |
|                               | numRows                                            | 0                                                  |
|                               | rawDataSize                                        | 0                                                  |
|                               | snapshot-count                                     | 0                                                  |
|                               | storage_handler                                    | org.apache.iceberg.mr.hive.HiveIcebergStorageHandler |
|                               | table_type                                         | ICEBERG                                            |
|                               | totalSize                                          | 0                                                  |
|                               | transient_lastDdlTime                              | 1717596576                                         |
|                               | uuid                                               | f9da6135-095e-491f-919e-f2194253b8f8               |
|                               | write.parquet.compression-codec                    | zstd                                               |
|                               | NULL                                               | NULL                                               |
| # Storage Information         | NULL                                               | NULL                                               |
| SerDe Library:                | org.apache.iceberg.mr.hive.HiveIcebergSerDe        | NULL                                               |
| InputFormat:                  | org.apache.iceberg.mr.hive.HiveIcebergInputFormat  | NULL                                               |
| OutputFormat:                 | org.apache.iceberg.mr.hive.HiveIcebergOutputFormat | NULL                                               |
| Compressed:                   | No                                                 | NULL                                               |
| Num Buckets:                  | 0                                                  | NULL                                               |
| Bucket Columns:               | []                                                 | NULL                                               |
| Sort Columns:                 | []                                                 | NULL                                               |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
The table_type parameter in the output indicates that the table has been created as an Iceberg table.
CREATE TABLE AS SELECT
You can also use Hive's native CREATE TABLE <tbl_name> AS SELECT (CTAS) syntax to create an Iceberg table based on an existing Hive table.
In this case, the major distinction is that the Iceberg table is created at the very beginning of the query execution, whereas the data is inserted after the query completes.
Thus, for some period of time, the table may exist but contain no data.
CREATE TABLE transactions_ctas
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' AS
SELECT * FROM transactions;
DESCRIBE FORMATTED transactions_ctas;
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| col_name                      | data_type                                          | comment                                            |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name                    | data_type                                          | comment                                            |
| id                            | int                                                | from deserializer                                  |
|                               | NULL                                               | NULL                                               |
| # Detailed Table Information  | NULL                                               | NULL                                               |
| Database:                     | default                                            | NULL                                               |
| OwnerType:                    | USER                                               | NULL                                               |
| Owner:                        | hive                                               | NULL                                               |
| CreateTime:                   | Mon Jun 03 20:59:25 UTC 2024                       | NULL                                               |
| LastAccessTime:               | UNKNOWN                                            | NULL                                               |
| Retention:                    | 0                                                  | NULL                                               |
| Location:                     | hdfs://adh/apps/hive/warehouse/test_iceberg_table1 | NULL                                               |
| Table Type:                   | MANAGED_TABLE                                      | NULL                                               |
| Table Parameters:             | NULL                                               | NULL                                               |
|                               | current-schema                                     | {\"type\":\"struct\",\"schema-id\":0,\"fields\":[{\"id\":1,\"name\":\"id\",\"required\":false,\"type\":\"int\"}]} |
|                               | engine.hive.enabled                                | true                                               |
|                               | external.table.purge                               | TRUE                                               |
|                               | metadata_location                                  | hdfs://adh/apps/hive/warehouse/test_iceberg_table1/metadata/00000-df221cec-ab76-4de9-95cd-ef7605c8e79f.metadata.json |
|                               | numFiles                                           | 0                                                  |
|                               | numRows                                            | 0                                                  |
|                               | rawDataSize                                        | 0                                                  |
|                               | snapshot-count                                     | 0                                                  |
|                               | storage_handler                                    | org.apache.iceberg.mr.hive.HiveIcebergStorageHandler |
|                               | table_type                                         | ICEBERG                                            |
|                               | totalSize                                          | 0                                                  |
|                               | transient_lastDdlTime                              | 1717448365                                         |
|                               | uuid                                               | eb1933a0-43d9-4dd2-9426-cf7ace5063c8               |
|                               | write.parquet.compression-codec                    | zstd                                               |
|                               | NULL                                               | NULL                                               |
| # Storage Information         | NULL                                               | NULL                                               |
| SerDe Library:                | org.apache.iceberg.mr.hive.HiveIcebergSerDe        | NULL                                               |
| InputFormat:                  | org.apache.iceberg.mr.hive.HiveIcebergInputFormat  | NULL                                               |
| OutputFormat:                 | org.apache.iceberg.mr.hive.HiveIcebergOutputFormat | NULL                                               |
| Compressed:                   | No                                                 | NULL                                               |
| Num Buckets:                  | 0                                                  | NULL                                               |
| Bucket Columns:               | []                                                 | NULL                                               |
| Sort Columns:                 | []                                                 | NULL                                               |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
CREATE TABLE LIKE
You can also use the CREATE TABLE LIKE <tbl_name> syntax to create an empty Iceberg table based on the definition of another Hive table.
Example:
CREATE TABLE transactions_ctlt LIKE transactions
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
DESCRIBE FORMATTED transactions_ctlt;
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| col_name                      | data_type                                          | comment                                            |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name                    | data_type                                          | comment                                            |
| txn_id                        | int                                                | from deserializer                                  |
| acc_id                        | int                                                | from deserializer                                  |
| txn_amount                    | decimal(10,2)                                      | from deserializer                                  |
| txn_date                      | date                                               | from deserializer                                  |
|                               | NULL                                               | NULL                                               |
| # Detailed Table Information  | NULL                                               | NULL                                               |
| Database:                     | default                                            | NULL                                               |
| OwnerType:                    | USER                                               | NULL                                               |
| Owner:                        | hive                                               | NULL                                               |
| CreateTime:                   | Fri Jun 07 10:08:08 UTC 2024                       | NULL                                               |
| LastAccessTime:               | UNKNOWN                                            | NULL                                               |
| Retention:                    | 0                                                  | NULL                                               |
| Location:                     | hdfs://adh/apps/hive/warehouse/transactions_ctlt   | NULL                                               |
| Table Type:                   | MANAGED_TABLE                                      | NULL                                               |
| Table Parameters:             | NULL                                               | NULL                                               |
|                               | current-schema                                     | {\"type\":\"struct\",\"schema-id\":0,\"fields\":[{\"id\":1,\"name\":\"txn_id\",\"required\":false,\"type\":\"int\"},{\"id\":2,\"name\":\"acc_id\",\"required\":false,\"type\":\"int\"},{\"id\":3,\"name\":\"txn_amount\",\"required\":false,\"type\":\"decimal(10, 2)\"},{\"id\":4,\"name\":\"txn_date\",\"required\":false,\"type\":\"date\"}]} |
|                               | engine.hive.enabled                                | true                                               |
|                               | external.table.purge                               | TRUE                                               |
|                               | metadata_location                                  | hdfs://adh/apps/hive/warehouse/transactions_ctlt/metadata/00000-72d6ad62-e7c4-4e53-97e9-be7f82d5ced9.metadata.json |
|                               | numFiles                                           | 0                                                  |
|                               | numRows                                            | 0                                                  |
|                               | rawDataSize                                        | 0                                                  |
|                               | snapshot-count                                     | 0                                                  |
|                               | storage_handler                                    | org.apache.iceberg.mr.hive.HiveIcebergStorageHandler |
|                               | table_type                                         | ICEBERG                                            |
|                               | totalSize                                          | 0                                                  |
|                               | transient_lastDdlTime                              | 1717754888                                         |
|                               | uuid                                               | d4e3811a-7a36-47b5-bb12-0948f251f6dc               |
|                               | write.parquet.compression-codec                    | zstd                                               |
|                               | NULL                                               | NULL                                               |
| # Storage Information         | NULL                                               | NULL                                               |
| SerDe Library:                | org.apache.iceberg.mr.hive.HiveIcebergSerDe        | NULL                                               |
| InputFormat:                  | org.apache.iceberg.mr.hive.HiveIcebergInputFormat  | NULL                                               |
| OutputFormat:                 | org.apache.iceberg.mr.hive.HiveIcebergOutputFormat | NULL                                               |
| Compressed:                   | No                                                 | NULL                                               |
| Num Buckets:                  | 0                                                  | NULL                                               |
| Bucket Columns:               | []                                                 | NULL                                               |
| Sort Columns:                 | []                                                 | NULL                                               |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
Create a partitioned table
To create a partitioned table, use the following syntax:
CREATE TABLE transactions_partitioned (
txn_id int, txn_amount decimal(10,2), txn_date date)
PARTITIONED BY (acc_id int)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
Run the DESCRIBE FORMATTED command for the newly created table and search the output for the default-partition-spec property. Its value stores the partition specification in the Iceberg format.
DESCRIBE FORMATTED transactions_partitioned;
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| col_name                      | data_type                                          | comment                                            |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name                    | data_type                                          | comment                                            |
| txn_id                        | int                                                | from deserializer                                  |
| txn_amount                    | decimal(10,2)                                      | from deserializer                                  |
| txn_date                      | date                                               | from deserializer                                  |
| acc_id                        | int                                                | from deserializer                                  |
|                               | NULL                                               | NULL                                               |
| # Detailed Table Information  | NULL                                               | NULL                                               |
| Database:                     | default                                            | NULL                                               |
| OwnerType:                    | USER                                               | NULL                                               |
| Owner:                        | hive                                               | NULL                                               |
| CreateTime:                   | Thu Jun 06 13:57:31 UTC 2024                       | NULL                                               |
| LastAccessTime:               | UNKNOWN                                            | NULL                                               |
| Retention:                    | 0                                                  | NULL                                               |
| Location:                     | hdfs://adh/apps/hive/warehouse/transactions        | NULL                                               |
| Table Type:                   | MANAGED_TABLE                                      | NULL                                               |
| Table Parameters:             | NULL                                               | NULL                                               |
|                               | bucketing_version                                  | 2                                                  |
|                               | current-schema                                     | {\"type\":\"struct\",\"schema-id\":0,\"fields\":[{\"id\":1,\"name\":\"txn_id\",\"required\":false,\"type\":\"int\"},{\"id\":2,\"name\":\"txn_amount\",\"required\":false,\"type\":\"decimal(10, 2)\"},{\"id\":3,\"name\":\"txn_date\",\"required\":false,\"type\":\"date\"},{\"id\":4,\"name\":\"acc_id\",\"required\":false,\"type\":\"int\"}]} |
|                               | default-partition-spec                             | {\"spec-id\":0,\"fields\":[{\"name\":\"acc_id\",\"transform\":\"identity\",\"source-id\":4,\"field-id\":1000}]} |
|                               | engine.hive.enabled                                | true                                               |
|                               | external.table.purge                               | TRUE                                               |
|                               | metadata_location                                  | hdfs://adh/apps/hive/warehouse/transactions/metadata/00000-b6dde9b3-d1cc-4f92-828e-5dab83b1e5e8.metadata.json |
|                               | snapshot-count                                     | 0                                                  |
|                               | storage_handler                                    | org.apache.iceberg.mr.hive.HiveIcebergStorageHandler |
|                               | table_type                                         | ICEBERG                                            |
|                               | transient_lastDdlTime                              | 1717682251                                         |
|                               | uuid                                               | c6205d6d-ddca-4b04-a780-8b8e1048db0b               |
|                               | write.parquet.compression-codec                    | zstd                                               |
|                               | NULL                                               | NULL                                               |
| # Storage Information         | NULL                                               | NULL                                               |
| SerDe Library:                | org.apache.iceberg.mr.hive.HiveIcebergSerDe        | NULL                                               |
| InputFormat:                  | org.apache.iceberg.mr.hive.HiveIcebergInputFormat  | NULL                                               |
| OutputFormat:                 | org.apache.iceberg.mr.hive.HiveIcebergOutputFormat | NULL                                               |
| Compressed:                   | No                                                 | NULL                                               |
| Num Buckets:                  | 0                                                  | NULL                                               |
| Bucket Columns:               | []                                                 | NULL                                               |
| Sort Columns:                 | []                                                 | NULL                                               |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
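Note that in the DESCRIBE output the partition column appears as a regular column at the end of the schema. Assuming this column ordering, an insert into the partitioned table would supply the partition value last (the values below are illustrative):
INSERT INTO transactions_partitioned
VALUES (1, 10.00, '2023-01-01', 1001);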
DML commands
As of ADH 3.2.4.3, Hive supports the SELECT and INSERT DML operations for Iceberg tables.
NOTE
To run DML commands on Iceberg tables, Hive must use the MapReduce execution engine.
To change the Hive execution engine, use the hive.execution.engine property in ADCM (Clusters → <clusterName> → Services → Hive → Primary Configuration → hive-site.xml). After changing the property, restart the Hive service.
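Alternatively, for a single Beeline session, the engine can be switched with a SET command (this affects only the current session, not the cluster-wide ADCM configuration):
SET hive.execution.engine=mr;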
SELECT
The syntax of the SELECT statement run on an Iceberg table is identical to a regular Hive SELECT command.
However, under the hood Hive runs a different set of actions to benefit from the Iceberg format. The distinctions are present in both the query compilation and execution phases.
Below is a brief description of the Iceberg-specific actions that take place under the hood when Hive selects data from an Iceberg table:
- Hive requests the Iceberg catalog for the table's metadata file that stores the table definition. The Iceberg catalog returns a pointer to the metadata file.
- Hive reads the metadata file and gets the manifest list location. This entry carries information about the manifest files that make up the snapshot, define partitions and partition boundaries, and so on. If the requested table is partitioned, Hive also reads the partition specification and can thus apply certain partition optimizations at this step.
- Hive analyzes each individual manifest file from the manifest list. Each manifest file carries information about the data files that store the actual physical data.
- Finally, Hive scans the data files. If the table is partitioned, the engine deals only with the data files that match the WHERE clause or other query filters. The major advantage of the Iceberg format lies in this phase: by this time, the engine has enough information to scan only those data files that are relevant to the given SELECT query. This is possible because data is tracked at the file level (rather than the directory level, as in the traditional Hive model).
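For example, for the partitioned table created earlier, a filter on the partition column lets the engine skip whole data files (the literal value below is illustrative):
-- Only the data files belonging to the acc_id = 1001 partition are scanned
SELECT txn_id, txn_amount
FROM transactions_partitioned
WHERE acc_id = 1001;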
INSERT
For Iceberg tables, Hive supports the traditional INSERT INTO operation.
Example:
INSERT INTO transactions VALUES
(1, 1002, 10.00, '2023-01-01'),
(2, 1002, 20.00, '2023-01-03'),
(3, 1002, 30.00, '2023-01-02'),
(4, 1001, 100.50, '2023-01-02'),
(5, 1001, 150.50, '2023-01-04'),
(6, 1001, 200.50, '2023-01-03'),
(7, 1003, 50.00, '2023-01-03'),
(8, 1003, 50.00, '2023-01-01'),
(9, 1003, 75.00, '2023-01-04');
Multi-table inserts are also supported, but they are not atomic. The commits are made on a per-table basis. During the commit process, clients may observe partial changes to the table contents and failures can leave partial changes committed. Changes within a single table remain atomic.
The following example shows how to populate two Iceberg tables with data from an existing Hive table using one multi-insert operation:
FROM transactions
INSERT INTO transactions11 SELECT txn_id, txn_date
INSERT INTO transactions22 SELECT txn_id, acc_id;
Custom Hive catalogs
From the Hive operational perspective, there is only one global data catalog, HiveCatalog, which stores data in HDFS under the /apps/hive/warehouse/ location.
However, when using Iceberg tables, you can work with several different data catalog types, such as Hive, Hadoop, or a custom catalog implementation.
Iceberg also allows loading a table directly based on its path in HDFS. In this case, a table does not belong to any catalog. Such cross-catalog, path-based tables can be used as temporary or intermediate tables, for example, for JOIN cases.
The following example creates another Hive catalog and specifies a custom warehouse location for the catalog.
SET iceberg.catalog.my_test_hive.type=hive;
SET iceberg.catalog.my_test_hive.warehouse=hdfs://adh/apps/hive/custom_catalog/warehouse;
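A table can then be associated with the registered catalog through the iceberg.catalog table property. The sketch below assumes the my_test_hive catalog defined above; the table name and columns are illustrative:
-- Hypothetical table bound to the my_test_hive catalog
CREATE TABLE custom_catalog_table (id int, value string)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
TBLPROPERTIES ('iceberg.catalog'='my_test_hive');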
Limitations
There are several limitations to Iceberg table support in Hive. These restrictions stem from the Hive 3.x version used in ADH. In upcoming ADH releases, the shift to Hive 4.x will extend Iceberg table support and eliminate the limitations listed below.
- As of ADH 3.2.4.3, Hive supports the following DML/DDL operations:
  - CREATE TABLE
  - DROP TABLE
  - SELECT
  - INSERT
- As of ADH 3.2.4.3, Iceberg tables in Hive are created using the STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' clause rather than STORED BY ICEBERG.
- Starting with Apache Iceberg 0.11.0, when using Hive with Tez you also have to disable vectorization (hive.vectorized.execution.enabled=false).