Hive catalog in Trino

The Trino service comes with a built-in catalog configuration for working with Hive. This catalog uses the Hive connector and is ready out of the box to interact with the Hive service available in your ADH cluster.

The architecture of the Trino connector for Hive differs from DBMS-oriented and other connector implementations in that it does not utilize the Hive runtime for read/write operations. That is, when Trino executes a query through the Hive connector, it does not invoke any functions of the Hive engine. Instead, Trino interacts directly with the Hive Metastore and employs an HDFS client to perform operations on the raw files stored in a Hive warehouse location in HDFS. This imposes a requirement: the HDFS Client component must be installed on each host that runs a Trino Worker.

In the built-in Hive catalog, all the configuration properties are auto-generated to ensure access to the Hive Metastore and Hive warehouse directories of your ADH cluster. The built-in Hive catalog name is hive-adh. ADH creates this catalog automatically and generates the hive-adh.properties file at /etc/trino/conf/catalog/ (an illustrative example follows the list below) if:

  • Your ADH cluster has the Hive service installed. If Hive was installed after Trino, the Trino service must be restarted.

  • The Hive catalog configuration is active.
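For reference, the generated file might look similar to the following sketch. The Metastore host and port are placeholders; the actual values are derived from your cluster configuration, and the exact set of properties depends on the ADH and Trino versions:

connector.name=hive
hive.metastore.uri=thrift://<metastore-host>:9083
hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml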

Enable Hive catalog

By default, the built-in Hive catalog of the Trino service is disabled and can be activated using ADCM. For this, toggle the Hive configuration option of the Trino service (Clusters → <clusterName> → Services → Trino → Configuration) and then restart the Trino service.

Once enabled, verify that the catalog is accessible. For this, connect to the Trino Coordinator (for example, via the Trino CLI or an SQL client like DBeaver) and run the command:

SHOW CATALOGS;

The built-in Hive catalog appears in the list:

Catalog |
--------+
hive-adh|
system  |

TIP
The up-to-date JDBC connection string for connecting to the Trino service is available on the service configuration page in ADCM (Clusters → <clusterName> → Services → Trino → Info).
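For example, a Trino CLI connection might look as follows (the host, port, and user name are illustrative placeholders):

trino --server http://<coordinator-host>:8080 --catalog hive-adh --user admin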

Supported Hive formats

Trino’s Hive catalog can work with Hive data files in the following formats:

  • ORC

  • Parquet

  • Avro

In addition, the following file formats and SerDes (serializers/deserializers) are supported (see the example after the list):

  • TextFile

  • CSV using org.apache.hadoop.hive.serde2.OpenCSVSerde

  • JSON using org.apache.hive.hcatalog.data.JsonSerDe

  • RCFile using ColumnarSerDe

  • RCFile using LazyBinaryColumnarSerDe

  • SequenceFile
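For instance, the following sketch creates a CSV-backed table using the format table property (the table name is illustrative; note that Trino requires all columns of a CSV Hive table to be of type VARCHAR):

CREATE TABLE "hive-adh"."trino_demo_schema_hive"."transactions_csv" (
    txn_id VARCHAR,
    txn_amount VARCHAR
) WITH (format = 'CSV');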

Hive catalog configuration properties

The Trino connector for Hive supports numerous configuration parameters for fine-tuning the interaction with Hive data files. The up-to-date list of properties is available in the Trino documentation. You can set new properties or override the default values using ADCM. For this, use the Custom Hive configuration settings section (Clusters → <clusterName> → Services → Trino → Configuration), as shown in the example below.
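A couple of entries that could be added to this section are shown below. Both are standard Hive connector properties; the values here are assumptions for the example:

hive.storage-format=ORC
hive.recursive-directories=true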

SQL support

IMPORTANT

To allow Trino to write to a Hive database, the user name specified when connecting to Trino (for example, via the Trino CLI or DBeaver) must match a Unix user on the ADH hosts that has sufficient permissions to write to the Hive warehouse directories in HDFS. Otherwise, you may observe an error similar to the following:

Failed to create external path hdfs://adh/apps/hive/warehouse/trino_demo_schema1.db for database trino_demo_schema1. This may result in access not being allowed if the StorageBasedAuthorizationProvider is enabled: null
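To check the ownership and permissions of the warehouse directories, you can run a command like the following on an ADH host (the path is illustrative and may differ in your cluster):

hdfs dfs -ls /apps/hive/warehouse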

Schema management

A schema in Trino is used to group several tables/views and is equivalent to a database in Hive parlance. Below are examples of base commands for managing Trino schemas:

CREATE SCHEMA "hive-adh"."trino_demo_schema_hive"; (1)
USE "hive-adh"."trino_demo_schema_hive"; (2)
SHOW SCHEMAS;
DROP "hive-adh"."trino_demo_schema_hive"; (3)
1 Creates a new schema (Hive database). hive-adh is the name of the built-in Hive catalog. Notice the double quotes used for identifiers.
2 Specifies a session schema. All the subsequent Trino SQL commands will run against the selected Hive database.
3 Deleting a non-empty schema is prohibited. To delete a schema, you have to remove all nested tables first.

To run Trino queries against arbitrary tables that belong to different schemas, you can specify a catalog, schema, and table name within your SQL command, for example:

SELECT * FROM "hive-adh"."trino_demo_schema_hive"."transactions" txns_hive
JOIN "iceberg-adh"."trino_demo_schema_iceberg"."transactions" txns_icb
ON txns_hive.txn_id = txns_icb.txn_id;

TIP
Using Trino UI, you can view and analyze SQL queries submitted to Trino.

DDL examples

Trino allows running most of the Hive DDL operations to manage Hive entities like tables, databases, materialized views, etc. Trino also supports a set of Hive table properties that specify metadata for Hive tables. These properties can be applied to a Hive table using the WITH clause as shown in the examples below. The following snippet provides basic DDL operations on Hive tables using the Trino catalog for Hive.

CREATE TABLE "hive-adh"."trino_demo_schema_hive"."transactions" (id BIGINT, value VARCHAR); (1)

CREATE TABLE "hive-adh"."trino_demo_schema_hive"."transactions_with" (id BIGINT, value VARCHAR) (2)
WITH (
    format = 'ORC',
    transactional=true
);

CREATE TABLE "hive-adh"."trino_demo_schema_hive"."transactions_pttnd" ( (3)
    txn_id BIGINT,
    txn_amount DOUBLE,
    acc_id BIGINT
 ) WITH (
    partitioned_by=ARRAY['acc_id']
);

CREATE TABLE "hive-adh"."trino_demo_schema_hive"."transactions_bckttd" ( (4)
    txn_id BIGINT,
    txn_amount DOUBLE,
    acc_id BIGINT
 ) WITH (
    bucketed_by=ARRAY['acc_id'],
    bucket_count=20
);

CREATE TABLE "hive-adh"."trino_demo_schema_hive"."transactions_ext" (
    id BIGINT,
    payload VARCHAR,
    request_time TIMESTAMP
) WITH (
  format = 'TEXTFILE',
  external_location = 'hdfs://adh/apps/hive/ext/test_ext_table' (5)
);

SET SESSION "hive-adh".non_transactional_optimize_enabled=true; (6)
ALTER TABLE "hive-adh"."trino_demo_schema_hive"."transactions_pttnd"
EXECUTE optimize (file_size_threshold => '128MB'); (7)

ANALYZE "hive-adh"."trino_demo_schema_hive"."transactions_pttnd"; (8)

CALL "hive-adh".system.create_empty_partition( (9)
    schema_name => 'trino_demo_schema_hive',
    table_name => 'transactions_pttnd',
    partition_columns => ARRAY['acc_id'],
    partition_values => ARRAY['1005']
);
1 Creates a basic Hive table. Notice that Trino accepts queries in ANSI SQL with the corresponding data types, whereas Hive uses HiveQL, which differs from traditional ANSI SQL. For example, when this query runs, the VARCHAR data type is mapped to Hive’s STRING.
2 Creates a transactional, ORC-formatted table. Creating transactional Hive tables using Trino requires the ORC format to be used. If a table format is not specified explicitly, Trino creates a non-transactional table using the format specified by the hive.storage-format catalog configuration property.
3 Creates a partitioned Hive table. Notice the ARRAY['col1', 'col2'] syntax for defining partitions.
4 Creates a table with a specific number of buckets.
5 Creates an external table that points to a custom HDFS location.
6 Sets a session property. Particularly, non_transactional_optimize_enabled=true is required to enable the OPTIMIZE command.
7 Runs the OPTIMIZE command that rewrites the content of the table so that it is merged into fewer but larger files.
8 Computes statistics for a table. Trino does not support gathering table statistics for transactional Hive tables.
9 Calls a system procedure to create an empty partition. This operation creates an empty partition directory in the Hive warehouse, e.g. /apps/hive/warehouse/trino_demo_schema_hive.db/transactions_pttnd/acc_id=1005. For more information about procedures supported by the Hive catalog, see Trino documentation.

DML examples

Trino supports most of the Hive DML operations, like INSERT, UPDATE, DELETE, MERGE, etc. The actual behavior of these commands depends on the table format and the Metastore being used. Below are some known limitations and peculiarities of running Hive CRUD operations via Trino:

  • The UPDATE command is only supported for transactional (ACID) tables stored in the ORC format (see the sketch after this list).

  • The DELETE command applied to a non-transactional table is only supported if the table is partitioned and the WHERE clause fully matches a partition. Transactional Hive tables in the ORC format support row-by-row deletion, in which WHERE may match arbitrary sets of rows.

  • MERGE is only supported for ACID tables.

  • ACID tables created with Hive Streaming Ingest are not supported.

  • Gathering table statistics using ANALYZE is not supported for Hive transactional tables.

  • TRUNCATE is not supported by the Hive catalog.
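For example, an UPDATE succeeds only against a transactional table, such as the transactions_with table created in the DDL examples above (a sketch):

UPDATE "hive-adh"."trino_demo_schema_hive"."transactions_with"
SET value = 'processed'
WHERE id = 1;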

For more information about best practices and limitations on using SQL with the Trino catalog for Hive, see Trino documentation. The following snippet provides examples of DML operations with comments.

INSERT INTO "hive-adh"."trino_demo_schema_hive"."transactions_pttnd"
    VALUES (1, 15, 1001),
    (2, 100, 1002),
    (3, 75, 1001);

SELECT * FROM "hive-adh"."trino_demo_schema_hive"."transactions_pttnd"; (1)

SELECT * FROM "hive-adh"."trino_demo_schema_hive"."transactions_pttnd$properties"; (2)

SELECT * FROM "hive-adh"."trino_demo_schema_hive"."transactions_pttnd$partitions"; (3)

SELECT *, "$path", "$file_size", "$partition" FROM "hive-adh"."trino_demo_schema_hive"."transactions_pttnd"; (4)

SELECT *, "$path", "$file_size" FROM "hive-adh"."trino_demo_schema_hive"."transactions_pttnd"
WHERE "$partition" = 'acc_id=1001'; (5)

SELECT * FROM "hive-adh"."trino_demo_schema_hive"."transactions" txns_hv (6)
JOIN "iceberg-adh"."trino_demo_schema_iceberg"."transactions" txns_icb
ON txns_hv.txn_id = txns_icb.txn_id;

DELETE FROM "hive-adh"."trino_demo_schema_hive"."transactions_pttnd" (7)
WHERE txn_id = 2;
1 An ANSI SQL SELECT against the selected Hive database.
2 Returns table metadata from the hidden $properties table.
3 Returns a list of all partitions for a partitioned table.
4 Returns the values of the hidden metadata columns ($path, $file_size, $partition) alongside each row.
5 Returns the file path and file size of all the row records that belong to a specific partition (acc_id=1001).
6 Selects data from two independent Trino catalogs — the Hive catalog hive-adh and the Iceberg catalog iceberg-adh. After Trino gets result sets from both catalogs, it applies a JOIN operation and returns the final result set to the client.
7 Deletes a row. The target table must be transactional. Otherwise, Trino throws an error.
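As a further sketch, the following MERGE uses the transactional transactions_with table as the target and the plain transactions table as the source (only the target table must be ACID):

MERGE INTO "hive-adh"."trino_demo_schema_hive"."transactions_with" t
USING "hive-adh"."trino_demo_schema_hive"."transactions" s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET value = s.value
WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value);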

For detailed information on Hive DML operations supported by Trino, see Trino documentation.

Performance optimization

Trino provides several optimizations to speed up the interaction with Hive tables, including the following:

  • Table statistics. Trino supports gathering table statistics to improve query processing performance. Note that gathering statistics is not supported for transactional Hive tables.

  • Dynamic filtering. Provides a significant performance boost for queries with selective JOIN operations by avoiding reading the "unnecessary" data that would be filtered out by the JOIN condition. An example is shown after this list.

  • File system caching. Caches chunks of data that are frequently read from HDFS and accessed by workers while executing similar queries.

  • Table redirection. A technique that allows redirecting CRUD operations to a different catalog.
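As an illustration, dynamic filtering is enabled by default and can be toggled per session, assuming the standard Trino session property name:

SET SESSION enable_dynamic_filtering = true;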
