Impala with HBase

Impala allows you to execute SQL queries on HBase tables. From the perspective of an Impala user who has an RDBMS background, HBase is a key/value store where the value consists of multiple fields. The key is mapped to one column in the Impala table, and the various fields of the value are mapped to other columns in the Impala table. Since Impala and Hive share the same Metastore database, once you create the table in Hive, you can query it through Impala.

Example of using an HBase table

The example below demonstrates how to operate with HBase tables. In this example, the HBase people_ages table from the Bulk loading via built-in MapReduce jobs article is used.

  1. To use an HBase table in Impala, it is necessary to create an external table in Hive shell.

    Specify the following options:

    • The STORED BY clause with the org.apache.hadoop.hive.hbase.HBaseStorageHandler storage handler.

    • The TBLPROPERTIES clause to map the created table to a corresponding table that exists in HBase. The hbase.mapred.output.outputtable property is required if you need to insert data into the table.

    • The type of the column corresponding to the HBase row key as STRING.

      $ sudo -u hive hive
      CREATE EXTERNAL TABLE hbase_people_ages (name string, age int)
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,basic:age")
      TBLPROPERTIES ("hbase.table.name" = "people_ages", "hbase.mapred.output.outputtable" = "people_ages");
  2. After creating a new table through Hive, execute the INVALIDATE METADATA statement in impala-shell to make Impala aware of the new table.

    $ impala-shell
    INVALIDATE METADATA hbase_people_ages;
  3. Insert a new row to the table:

    INSERT INTO hbase_people_ages VALUES ('Gary Morgan', 35), ('Jim Brian', 48);
  4. Select data from the table:

    SELECT * from hbase_people_ages WHERE name LIKE 'Gar%';

    The result:

    +------------------+-----+
    | name             | age |
    +------------------+-----+
    | Garcia Nathaniel | 61  |
    | Gardner Helena   | 54  |
    | Garner Andre     | 37  |
    | Garner Juan      | 28  |
    | Garner Mark      | 47  |
    | Garrett Isabelle | 52  |
    | Garrett Steven   | 24  |
    | Garrett Willie   | 42  |
    | Gary Morgan      | 35  |
    | Garza Manuel     | 30  |
    | Garza Paul       | 51  |
    +------------------+-----+
    Fetched 11 row(s) in 5.74s

For efficient queries, use the WHERE clause to find a single key value or a range of key values. Try to avoid queries that do full-table scans, which are efficient for regular Impala tables but inefficient in HBase.

NOTE
The Impala INSERT …​ VALUES syntax suits HBase tables because inserting a single row is an efficient operation for an HBase table. When you use the INSERT …​ SELECT syntax, HBase only stores the most recent version of each unique row key. If an INSERT …​ SELECT statement copies over multiple rows containing the same key value, the query only returns one row with each key column value. Although Impala does not have an UPDATE statement, you can achieve the same result by executing the INSERT statement with the same key value.

Performance considerations

Impala uses the HBase client API via Java Native Interface (JNI) to query data stored in HBase. This operation does not read HFiles directly. The extra communication overhead makes it important to choose what data to store in HBase, and construct queries that can retrieve the HBase data efficiently. To do this, consider the following recommendations:

  • Use the HBase table for queries that return a single row or a small range of rows, not queries that perform a full table scan of an entire table. If a query references an HBase table and does not contain the WHERE clause, that is a strong indicator of an inefficient query.

  • HBase may offer acceptable performance for storing small dimension tables. A table should be small enough that executing a full table scan for every query is efficient enough. HDFS is also appropriate for dimension tables that do not need to support update, delete, or insert queries with small numbers of rows.

  • Query predicates are applied to row keys as start and stop keys, thereby limiting the scope of a particular search. If a row key field is not mapped to a column type of STRING, comparison operations do not work.

  • If predicates are used for non-key columns, Impala returns more rows than HBase, which can utilize the SingleColumnValueFilter filter. When a row key predicate is applied to a single row, HBase finds and returns that row. When a non-key predicate is used, even if it only applies to a single row, HBase must scan the entire table to find the correct result.

  • For performance reasons, it is important to declare a key column as STRING in the CREATE EXTERNAL TABLE statement. You can also define other columns with the INT or BOOLEAN types as STRING since Impala can optimize this type more effectively in HBase tables. For more information, see the Interpreting EXPLAIN Output for HBase Queries section of the Using Impala to Query HBase Tables.

Use сases

The following are representative use cases for using Impala to query HBase tables:

  • Store rapidly incrementing counters. For example, how many times a web page has been viewed, the number of user connections or votes for a post on social networks. HBase is efficient for capturing this type of data: the append-only storage mechanism is efficient for writing each change to disk, and a query always returns the latest value.

  • Store wide tables. Wide tables include many columns (thousands), which contain many attributes of an entity, for example, an online service user. Very often, in these tables, most of the column values are NULL, 0, FALSE, empty string, or other blank and placeholder values. For example, a site user may never use some site features, not fill out a special field in the profile, or not visit a certain part of the site. A typical query against this type of table is to retrieve a single row with all the information about a subject, rather than calculating the sum, average, or filtering millions of rows as in Impala-managed tables.

Limitations of the Impala and HBase integration

The Impala integration with HBase has the following limitations and restrictions:

  • If you run the DROP TABLE command for an Impala-managed table that is mapped to an HBase table, the underlying table is not removed in HBase.

  • The INSERT OVERWRITE statement is not available for HBase tables. To add new data, or modify an existing row, insert a new row. To modify an existing row, use the same key value.

  • If you execute the CREATE TABLE LIKE statement for a table mapped to an HBase table, the new table has the same underlying HBase table name. The new table is an alias for the old one, not a new table with an identical column structure.

  • Copying data into an HBase table using the Impala INSERT …​ SELECT syntax might produce fewer new rows than are in the SELECT query result set. If the result set contains multiple rows with the same value for the key column, each row overwrites a previous row with the same key value. Since the order of the inserted rows is unpredictable, you cannot rely on this technique to preserve the latest version of a particular key value.

  • Since the complex data types (ARRAY, STRUCT, and MAP) are currently only supported in Parquet tables, you cannot use these types in HBase tables that are queried through Impala.

  • The LOAD DATA statement cannot be used with HBase tables.

  • The TABLESAMPLE clause of the SELECT statement cannot be applied to HBase tables.

Found a mistake? Seleсt text and press Ctrl+Enter to report it