Hive with HBase

Hive/HBase integration allows Hive QL statements to access HBase tables for both read (SELECT) and write (INSERT) operations. You can create a new HBase table in Hive or access an existing table.

Create a new HBase table in Hive

To create a new HBase table that can be managed by Hive, use the STORED BY clause in the CREATE TABLE statement. It allows you to specify a storage handler required for integration.

Example:

$ sudo -u hive hive
CREATE TABLE hbase_table_test(key int, value1 string, value2 int, value3 int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,col_family1:val1,col_family1:val2,col_family2:val3")
TBLPROPERTIES ("hbase.table.name" = "hive_test", "hbase.mapred.output.outputtable" = "hive_test");

Where:

  • STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' — specifies a storage handler that allows Hive to access data stored in HBase.

  • WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,col_family1:val1,col_family1:val2,col_family2:val3") — controls the mapping of HBase columns to Hive. The hbase.columns.mapping can contain the following entries: :key, :timestamp, or <column-family-name>:[<column-name>][#(binary|string)]. The type specification is delimited by #. If hbase.columns.mapping does not include the type specification, the value from hbase.table.default.storage.type is used.

  • TBLPROPERTIES ("hbase.table.name" = "hive_test", "hbase.mapred.output.outputtable" = "hive_test") — specifies table properties. The hbase.table.name property defines the table name in HBase. In this example, the table is known as hbase_table_test within Hive, and as hive_test within HBase. If hbase.table.name is not set, the table has the same name in Hive and HBase. The hbase.mapred.output.outputtable property is required if you plan to insert data into the table (it is used by the hbase.mapreduce.TableOutputFormat class).

Execute the list command to check if HBase has information about the new table:

$ hbase shell

$ list

The list output should contain the hive_test table.

In HBase shell, utilize the put command to add data to this table:

put 'hive_test', '1' , 'col_family1:val1', 'test'
put 'hive_test', '1' , 'col_family1:val2', '1'
put 'hive_test', '1' , 'col_family2:val3', '1'

put 'hive_test', '2' , 'col_family1:val1', 'test2'
put 'hive_test', '2' , 'col_family1:val2', '2'
put 'hive_test', '2' , 'col_family2:val3', '2'

Use the scan command to check the table data:

scan 'hive_test'

The result:

ROW                                                          COLUMN+CELL
1                                                           column=col_family1:val1, timestamp=1694606917613, value=test
1                                                           column=col_family1:val2, timestamp=1694606917647, value=1
1                                                           column=col_family2:val3, timestamp=1694606922996, value=1
2                                                           column=col_family1:val1, timestamp=1694607054746, value=test2
2                                                           column=col_family1:val2, timestamp=1694607054763, value=2
2                                                           column=col_family2:val3, timestamp=1694607056127, value=2
2 row(s)
Took 0.0175 seconds

The data is added successfully to the table created in Hive.

Insert a new row to the same table through the Hive shell:

INSERT INTO hbase_table_test VALUES(3,'test3', 3, 3);

Check the table data in the HBase shell:

scan 'hive_test'

The result:

ROW                   COLUMN+CELL
 1                    column=col_family1:val1, timestamp=1698334191476, value=test
 1                    column=col_family1:val2, timestamp=1698334191504, value=1
 1                    column=col_family2:val3, timestamp=1698334191538, value=1
 2                    column=col_family1:val1, timestamp=1698334191591, value=test2
 2                    column=col_family1:val2, timestamp=1698334191616, value=2
 2                    column=col_family2:val3, timestamp=1698334193906, value=2
 3                    column=col_family1:val1, timestamp=1698334255870, value=test3
 3                    column=col_family1:val2, timestamp=1698334255870, value=3
 3                    column=col_family2:val3, timestamp=1698334255870, value=3
3 row(s)
Took 0.0133 seconds

This example demonstrates that Hive and HBase can work with the same table.

Use an existing HBase table in Hive

You can also operate with an existing HBase table in Hive. The example below uses the people_ages table from the Bulk loading via built-in MapReduce jobs article.

For an existing HBase table, create an external table in the Hive shell and specify other properties according to the table structure:

CREATE EXTERNAL TABLE hbase_people_ages  (name string, age int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,basic:age")
TBLPROPERTIES ("hbase.table.name" = "people_ages", "hbase.mapred.output.outputtable" = "people_ages");

Select data to check whether the table is available:

SELECT * FROM hbase_people_ages LIMIT 5;

The result:

OK
Abbott Delia    62
Abbott Howard   24
Abbott Jack     29
Adams Clyde     29
Aguilar Myrtie  23
Time taken: 0.161 seconds, Fetched: 5 row(s)

It is also possible to combine access to HBase tables with native Hive tables via JOIN and UNION clauses.

Create another table in Hive:

CREATE TABLE people (key int, name string, points int);

INSERT INTO people
    VALUES (1,'Phelps Rena', 35), (2,'Quinn Brian', 48), (1,'Reyes Danny', 52);

Execute the query with a JOIN clause:

SELECT hbase_people_ages.*, people.points FROM hbase_people_ages JOIN people ON (hbase_people_ages.name = people.name);

The result:

OK
Phelps Rena     18      35
Quinn Brian     20      48
Reyes Danny     41      52
Time taken: 10.003 seconds, Fetched: 3 row(s)
Found a mistake? Seleсt text and press Ctrl+Enter to report it