Hive with HBase
Hive/HBase integration allows Hive QL statements to access HBase tables for both read (SELECT
) and write (INSERT
) operations. You can create a new HBase table in Hive or access an existing table.
Create a new HBase table in Hive
To create a new HBase table that can be managed by Hive, use the STORED BY
clause in the CREATE TABLE
statement. It allows you to specify a storage handler required for integration.
Example:
$ sudo -u hive hive
CREATE TABLE hbase_table_test(key int, value1 string, value2 int, value3 int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,col_family1:val1,col_family1:val2,col_family2:val3")
TBLPROPERTIES ("hbase.table.name" = "hive_test", "hbase.mapred.output.outputtable" = "hive_test");
Where:
-
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
— specifies a storage handler that allows Hive to access data stored in HBase. -
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,col_family1:val1,col_family1:val2,col_family2:val3")
— controls the mapping of HBase columns to Hive. Thehbase.columns.mapping
can contain the following entries::key
,:timestamp
, or<column-family-name>:[<column-name>][#(binary|string)]
. The type specification is delimited by#
. Ifhbase.columns.mapping
does not include the type specification, the value fromhbase.table.default.storage.type
is used. -
TBLPROPERTIES ("hbase.table.name" = "hive_test", "hbase.mapred.output.outputtable" = "hive_test")
— specifies table properties. Thehbase.table.name
property defines the table name in HBase. In this example, the table is known ashbase_table_test
within Hive, and ashive_test
within HBase. Ifhbase.table.name
is not set, the table has the same name in Hive and HBase. Thehbase.mapred.output.outputtable
property is required if you plan to insert data into the table (it is used by the hbase.mapreduce.TableOutputFormat class).
Execute the list
command to check if HBase has information about the new table:
$ hbase shell
$ list
The list
output should contain the hive_test
table.
In HBase shell, utilize the put
command to add data to this table:
put 'hive_test', '1' , 'col_family1:val1', 'test'
put 'hive_test', '1' , 'col_family1:val2', '1'
put 'hive_test', '1' , 'col_family2:val3', '1'
put 'hive_test', '2' , 'col_family1:val1', 'test2'
put 'hive_test', '2' , 'col_family1:val2', '2'
put 'hive_test', '2' , 'col_family2:val3', '2'
Use the scan
command to check the table data:
scan 'hive_test'
The result:
ROW COLUMN+CELL 1 column=col_family1:val1, timestamp=1694606917613, value=test 1 column=col_family1:val2, timestamp=1694606917647, value=1 1 column=col_family2:val3, timestamp=1694606922996, value=1 2 column=col_family1:val1, timestamp=1694607054746, value=test2 2 column=col_family1:val2, timestamp=1694607054763, value=2 2 column=col_family2:val3, timestamp=1694607056127, value=2 2 row(s) Took 0.0175 seconds
The data is added successfully to the table created in Hive.
Insert a new row to the same table through the Hive shell:
INSERT INTO hbase_table_test VALUES(3,'test3', 3, 3);
Check the table data in the HBase shell:
scan 'hive_test'
The result:
ROW COLUMN+CELL 1 column=col_family1:val1, timestamp=1698334191476, value=test 1 column=col_family1:val2, timestamp=1698334191504, value=1 1 column=col_family2:val3, timestamp=1698334191538, value=1 2 column=col_family1:val1, timestamp=1698334191591, value=test2 2 column=col_family1:val2, timestamp=1698334191616, value=2 2 column=col_family2:val3, timestamp=1698334193906, value=2 3 column=col_family1:val1, timestamp=1698334255870, value=test3 3 column=col_family1:val2, timestamp=1698334255870, value=3 3 column=col_family2:val3, timestamp=1698334255870, value=3 3 row(s) Took 0.0133 seconds
This example demonstrates that Hive and HBase can work with the same table.
Use an existing HBase table in Hive
You can also operate with an existing HBase table in Hive. The example below uses the people_ages
table from the Bulk loading via built-in MapReduce jobs article.
For an existing HBase table, create an external table in the Hive shell and specify other properties according to the table structure:
CREATE EXTERNAL TABLE hbase_people_ages (name string, age int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,basic:age")
TBLPROPERTIES ("hbase.table.name" = "people_ages", "hbase.mapred.output.outputtable" = "people_ages");
Select data to check whether the table is available:
SELECT * FROM hbase_people_ages LIMIT 5;
The result:
OK Abbott Delia 62 Abbott Howard 24 Abbott Jack 29 Adams Clyde 29 Aguilar Myrtie 23 Time taken: 0.161 seconds, Fetched: 5 row(s)
It is also possible to combine access to HBase tables with native Hive tables via JOIN
and UNION
clauses.
Create another table in Hive:
CREATE TABLE people (key int, name string, points int);
INSERT INTO people
VALUES (1,'Phelps Rena', 35), (2,'Quinn Brian', 48), (1,'Reyes Danny', 52);
Execute the query with a JOIN
clause:
SELECT hbase_people_ages.*, people.points FROM hbase_people_ages JOIN people ON (hbase_people_ages.name = people.name);
The result:
OK Phelps Rena 18 35 Quinn Brian 20 48 Reyes Danny 41 52 Time taken: 10.003 seconds, Fetched: 3 row(s)