Read Hive data from ADQM
To read data from remote Hive tables via SELECT queries, ClickHouse provides the Hive table engine. It currently supports the following input formats:
- Text — supports only simple scalar column types except binary;
- ORC — supports simple scalar column types except char; array is the only supported complex type;
- Parquet — supports all simple scalar column types; array is the only supported complex type.
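As a rough illustration of these format constraints, the Hive-side DDL below sketches one table per storage format (table and column names are hypothetical, not taken from this article):

```sql
-- Text format: simple scalar columns only (no binary)
CREATE TABLE t_text (id int, name string)
STORED AS TEXTFILE;

-- ORC format: scalar columns (no char); array is the only complex type
CREATE TABLE t_orc (id int, tags array<string>)
STORED AS ORC;

-- Parquet format: all scalar columns; array is the only complex type
CREATE TABLE t_parquet (id int, tags array<string>)
STORED AS PARQUET;
```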
This article describes how to work with the Hive table engine using the example of ADQM and ADH (Arenadata Hadoop), which provides the Hive service.
Create a Hive table
The basic syntax of a query that creates a Hive table in ADQM is:
CREATE TABLE <table_name> (<column_name> <column_type> [ALIAS <expr>], ...)
ENGINE = Hive('thrift://<host>:<port>', '<hive_database>', '<hive_table>')
PARTITION BY <expr>;
Hive table engine parameters:
- thrift://<host>:<port> — address of an ADH host where the Hive Metastore component is installed;
- <hive_database> — name of a remote Hive database;
- <hive_table> — name of a remote Hive table.
An ADQM table’s structure can differ from the structure of the original Hive table you are connecting to:

- Column names in the ADQM table should be the same as in the original Hive table, but you can use only some of these columns, in any order. In the ADQM table, you can also use the ALIAS expression to define columns calculated from other columns.
- Column types in the ADQM table should be the same as the types in the original Hive table.
- The PARTITION BY key of the ADQM table should be consistent with the PARTITIONED BY key of the original Hive table, and columns used in the PARTITION BY expression should be defined in the table structure.
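To illustrate these rules, the sketch below assumes a Hive table with columns id, value, and name, partitioned by date; the ADQM table takes a subset of its columns in a different order and adds a calculated column. The table name hive_subset and the value_doubled column are illustrative:

```sql
CREATE TABLE hive_subset
(
    name String,
    value Int16,
    value_doubled Int16 ALIAS value * 2,  -- calculated from the value column
    date String
)
ENGINE = Hive('thrift://<host>:9083', '<hive_database>', '<hive_table>')
PARTITION BY date;
```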
HDFS high availability (HA)
If the high availability (HA) mode for HDFS is enabled on an ADH cluster, perform the following configuration steps to be able to connect to Hive tables from ADQM:

- Copy the /etc/hadoop/conf/hdfs-site.xml file from an ADH host to the /etc/clickhouse-server/ folder on an ADQM host.
- Add the following setting to the /etc/clickhouse-server/config.xml configuration file of ClickHouse:

  <hdfs>
      <libhdfs3_conf>/etc/clickhouse-server/hdfs-site.xml</libhdfs3_conf>
  </hdfs>

- Restart the ClickHouse server:

  $ sudo systemctl restart clickhouse-server
The internal nameservice name of the HDFS cluster should be in lower case. Otherwise, when attempting to retrieve data from a remote Hive table, ADQM may return the following error message:

Unable to connect to HDFS: InvalidParameter: Cannot parse URI: hdfs://<internal_nameservice_in_lower_case>, missing port or invalid HA configuration
Caused by: HdfsConfigNotFound: Config key: dfs.ha.namenodes.<internal_nameservice_in_lower_case> not found. (NETWORK_ERROR)

where <internal_nameservice_in_lower_case> is the internal nameservice specified in the dfs.internal.nameservices parameter of the hdfs-site.xml configuration file, but in lower case.
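For reference, a hypothetical hdfs-site.xml fragment with the nameservice name already in lower case (adhcluster is an illustrative name) could look like this:

```xml
<property>
    <name>dfs.internal.nameservices</name>
    <value>adhcluster</value>
</property>
<property>
    <name>dfs.ha.namenodes.adhcluster</name>
    <value>nn1,nn2</value>
</property>
```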
Example
The example below shows how to use the Hive table engine in ADQM to select data from a remote Hive table located on an ADH cluster (high availability for HDFS is not enabled).
- On an ADH host with the HiveServer2 component installed, create a Hive table (for example, test_table) in the default database:

  CREATE TABLE test_table(id int, value int, name string) PARTITIONED BY (date string);

- Insert test data into the table:

  INSERT INTO test_table PARTITION(date='2023-11-23') SELECT 0, 1, 'one';
- On an ADQM host, run the clickhouse-client console client and create a table based on the Hive engine (replace <adh_host> with the IP address of an ADH host where the Hive Metastore component is installed):

  CREATE TABLE hive_table(id Int8, value Int16, name String, date String)
  ENGINE = Hive('thrift://<adh_host>:9083', 'default', 'test_table')
  PARTITION BY date;
- Select data from the table:

  SELECT * FROM hive_table;

  ┌─id─┬─value─┬─name─┬─date───────┐
  │  0 │     1 │ one  │ 2023-11-23 │
  └────┴───────┴──────┴────────────┘
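Since date is the partition key on both the Hive and ADQM sides, a query that filters by it should only need to read the matching Hive partition; a minimal sketch:

```sql
SELECT id, name
FROM hive_table
WHERE date = '2023-11-23';
```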