Read Hive data from ADQM

To read data from remote Hive tables via SELECT queries, ClickHouse provides the Hive table engine. It currently supports the following input formats:

  • Text — supports only simple scalar column types, except binary;

  • ORC — supports simple scalar column types (except char) and only the array complex type;

  • Parquet — supports all simple scalar column types and only the array complex type.

This article describes how to work with the Hive table engine using the example of ADQM and ADH (Arenadata Hadoop), which provides the Hive service.

Create a Hive table

The basic syntax of a query that creates a Hive table in ADQM is:

CREATE TABLE <table_name> (<column_name> <column_type> [ALIAS <expr>], ...)
ENGINE = Hive('thrift://<host>:<port>', '<hive_database>', '<hive_table>')

Hive table engine parameters:

  • thrift://<host>:<port> — address of an ADH host where the Hive Metastore component is installed;

  • <hive_database> — name of a remote Hive database;

  • <hive_table> — name of a remote Hive table.

An ADQM table’s structure can differ from the structure of an original Hive table you are connecting to:

  • Column names in the ADQM table should be the same as in the original Hive table, but you can use only a subset of these columns, in any order. In the ADQM table, you can also use the ALIAS expression to define columns calculated from other columns.

  • Column types in the ADQM table should be the same as types in the original Hive table.

  • The PARTITION BY key of the ADQM table should be consistent with the PARTITIONED BY key of the original Hive table, and columns used in the PARTITION BY expression should be defined in the table structure.
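For instance, given a remote Hive table with the columns id, value, and name, partitioned by date, an ADQM table could read only some of those columns and add a computed column via ALIAS. A sketch of such a definition (the table and column names, the host, and the port are illustrative):

    CREATE TABLE hive_subset
    (
        name String,
        id Int32,
        id_doubled Int32 ALIAS id * 2,  -- computed from the id column, not read from Hive
        date String
    )
    ENGINE = Hive('thrift://<host>:9083', 'default', '<hive_table>')
    PARTITION BY date;

Note that the value column is omitted entirely, the remaining columns are listed in a different order than in the Hive table, and the date column is kept because it is used in the PARTITION BY expression.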

HDFS high availability (HA)

If the high availability (HA) mode for HDFS is enabled on an ADH cluster, perform the following configuration to be able to connect to Hive tables from ADQM:

  1. Copy the /etc/hadoop/conf/hdfs-site.xml file from an ADH host to the /etc/clickhouse-server/ folder on an ADQM host.

  2. Add the following setting to the /etc/clickhouse-server/config.xml configuration file of ClickHouse:
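A sketch of this setting, based on the libhdfs3_conf option described in the ClickHouse documentation for the Hive engine; it points ClickHouse at the hdfs-site.xml file copied in the previous step (the path assumes the file was placed in /etc/clickhouse-server/):

    <hdfs>
        <libhdfs3_conf>/etc/clickhouse-server/hdfs-site.xml</libhdfs3_conf>
    </hdfs>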

  3. Restart the ClickHouse server:

    $ sudo systemctl restart clickhouse-server

The internal name of an HDFS cluster specified as the internal nameservice should be in lower case. Otherwise, when attempting to retrieve data from a remote Hive table, ADQM may return the following error message:

Unable to connect to HDFS: InvalidParameter: Cannot parse URI: hdfs://<internal_nameservice_in_lower_case>, missing port or invalid HA configuration
Caused by: HdfsConfigNotFound: Config key: dfs.ha.namenodes.<internal_nameservice_in_lower_case> not found. (NETWORK_ERROR)

where <internal_nameservice_in_lower_case> is the internal nameservice specified in the dfs.internal.nameservices parameter of the hdfs-site.xml configuration file, converted to lower case.


The example below shows how to use the Hive table engine in ADQM to select data from a remote Hive table located in an ADH cluster (high availability for HDFS is not enabled).

  1. On an ADH host with the Hive HiveServer2 component installed, create a Hive table (for example, test_table) in the default database:

    CREATE TABLE test_table(id int, value int, name string) PARTITIONED BY (date string);
  2. Insert test data into the table:

    INSERT INTO test_table partition(date='2023-11-23') SELECT 0, 1, 'one';
  3. On an ADQM host, run the clickhouse-client console client and create a table based on the Hive engine (replace <adh_host> with an IP address of an ADH host with the Hive Metastore component):

    CREATE TABLE hive_table(id Int8, value Int16, name String, date String)
    ENGINE = Hive('thrift://<adh_host>:9083', 'default', 'test_table')
    PARTITION BY date;
  4. Select data from the table:

    SELECT * FROM hive_table;
    ┌─id─┬─value─┬─name─┬─date───────┐
    │  0 │     1 │ one  │ 2023-11-23 │
    └────┴───────┴──────┴────────────┘