Impala architecture
Overview
Impala provides fast, interactive SQL queries on data stored in HDFS, HBase, or S3. In addition to the unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), and JDBC driver as Apache Hive. It makes Impala a unified platform for real-time or batch-oriented queries.
Impala is targeted for integration with standard business intelligence environments and supports the most relevant industry standards: clients can connect via JDBC, the authentication is accomplished with Kerberos or LDAP.
Impala has the following advantages:
-
The SQL interface that data scientists and analysts already know.
-
Ability to query high volumes of data.
-
Distributed queries in a clustered environment for easy scaling and the use of cost-effective commodity hardware.
-
Ability to share data files between different components without copying or export/import operations. For example, you can write and transform data with Hive, and query with Impala.
The Impala-based solution consists of the following parts:
-
Clients — Hue, JDBC clients, and the Impala Client can interact with Impala. These interfaces are used to issue queries or complete administrative tasks, for example, connecting to Impala.
-
Hive Metastore — stores information about the data available to Impala. For example, Metastore lets Impala know what databases are available and what the structure of those databases is. If you create, drop, and alter objects, and load data into tables through Impala SQL statements, the relevant metadata are changed automatically.
-
Impala — processes that run on DataNodes, coordinate and execute queries. Each instance of Impala can receive, plan, and coordinate queries from Impala clients. Queries are distributed among Impala nodes, and these nodes then act as workers, executing parallel query fragments.
-
HBase and HDFS — the storage for data that can be queried.
User applications send SQL queries to Impala through JDBC. The user application may connect to any Impala daemon in the cluster. This daemon becomes the coordinator for the query. Impala parses the query and analyzes it to determine what tasks need to be performed by Impala instances across the cluster. The execution is planned to achieve optimal efficiency. Local Impala instances access HDFS and HBase services to get data. Each Impala daemon returns data to the coordinator that aggregates the results and sends them to the client.
Physical representation
By default, Impala stores data files of an unpartitioned table in the root directory. However, all data files that are located in any directory below the root are part of the table data set. This is a common approach to working with unpartitioned tables, and Apache Hive also uses this technique.
Data files of a partitioned table are placed in subdirectories. The path reflects the partition column values. For example, for day 17
, month 2
of table T
, all data files will be located in the <root>/day=17/month=2/
directory. Note that this form of partitioning does not imply a co-location of individual partition data. The blocks of the partition’s data files are distributed across HDFS DataNodes.
Impala also gives the user great flexibility when choosing file formats. It supports the most popular file formats: Avro, RC, Sequence, plain text, and Parquet. These formats can be combined with different compression algorithms, such as snappy, gzip, and bz2. You can specify the storage format in the CREATE TABLE
or ALTER TABLE
statements. It is also possible to select a separate format for each table partition.
In most use cases, the Apache Parquet is recommended as an up-to-date, open-source columnar file format offering both high compression and high scan efficiency. Most Hadoop-based frameworks including Hive, Pig, MapReduce, and Cascading can process Parquet.
Components of the Impala service in ADH
Impala is a distributed, massively parallel processing (MPP) database engine. It consists of different daemon processes that run on specific hosts within your cluster.
Impala Daemon
The core Impala component is the Impala Daemon. It is represented by the impalad
process. The Impala daemon receives a client request, becomes a coordinator, parses a query request, splits it into different tasks, and distributes them to other impalad
node processes. After each impalad
worker node process receives the request, it starts to execute a local query (for example, querying the HDFS DataNode or an HBase region server), and returns the query results to the central coordinator. The coordinator collects query results from other impalad
processes, combines them, and sends the result to the client.
The Impala daemon consists of three modules:
-
Query planner — receives query requests from SQL APP, JDBC, and other clients, and converts the query into many sub-queries (creates an execution plan).
-
Query coordinator — distributes these sub-queries among the Impala nodes.
-
Query executor — is responsible for the execution of the sub-queries and returning the sub-query results to the coordinator.
Impala daemons can be deployed in one of the following ways:
-
HDFS and Impala are co-located, and an Impala daemon runs on each host that is a DataNode.
-
Impala is deployed separately in a compute cluster and reads remotely from HDFS, S3, etc.
The Impala daemons are in constant communication with Statestore to confirm which daemons are healthy and can accept new queries.
They also receive broadcast messages from the Impala Catalog Service whenever any Impala daemon in the cluster creates, alters, or drops any type of object, or when Impala processes an INSERT
or LOAD DATA
statement.
You can control which hosts operate as query coordinators and which work as query executors to improve the scalability of highly concurrent workloads on large clusters. For more information, see How to configure Impala with dedicated coordinators.
Impala Statestore
The Impala Statestore component collects the health status of all Impala daemons and continuously forwards results to all impalad
process nodes. Statestore is represented by the statestored
daemon process. The cluster only needs this process on one host. If an Impala daemon goes offline due to a hardware failure, network error, software issue, or other reasons, the Statestore informs all other Impala daemons, and new queries are not sent to the unreachable Impala daemon.
Since the Statestore helps when something goes wrong and broadcasts the metadata to the coordinators, this is not always critical to the normal operation of the Impala cluster. If the Statestore becomes unavailable, the Impala daemons continue to run and distribute work among themselves as they normally do when working with data known to Impala. The cluster just becomes less reliable if some Impala daemons fail. The metadata becomes also less consistent since it changes when the Statestore is offline. When the Statestore becomes available, it reestablishes the communication with the Impala daemons and resumes monitoring and broadcasting functions.
If you execute a DDL statement while the Statestore is down, the queries that access the new DDL-created object will fail.
Impala Catalog Service
The Impala Catalog Service component sends broadcast messages with metadata changes made by Impala SQL statements to all the Impala daemons in a cluster. For example, when you create a table, load data, or perform other table-changing or data-changing operation through Impala. It is represented by the catalogd
daemon process.
You only need this process on one host in a cluster. Since the metadata changes are passed through the Statestore daemon, it makes sense to install Impala Catalog Service and Impala Statestore components on the same host.
You can use the load_catalog_in_background parameter to determine when the metadata of a table should be loaded. The load_catalog_in_background checkbox is located on the CLUSTERS → ADH cluster → Services → Impala → Components → Impala Catalog Service → Configuration tab.
If it is set to false
, the table metadata is loaded when it is referenced for the first time. In this case, the first run of a query can be slower than subsequent runs.
If load_catalog_in_background is set to true
, the Catalog Service loads metadata for a table even if that metadata is not required for any query. So metadata can be already loaded when the first requested query is executed. However, it is not recommended to set this option to true
. Background load can interfere with query-specific metadata loading and cause occasional long-running queries that are difficult to diagnose. Impala also may load metadata for tables that are never used, increasing catalog size and memory usage for both the Catalog Service and Impala daemon.
Impala Client
The Impala Client component is represented by impala-shell that is a command-line interface for issuing queries to the Impala daemon. You can install this on one or more hosts anywhere on your network. It can connect remotely to any instance of the Impala daemon.
Impala role in ADH
A major Impala goal is to make SQL-on-Hadoop operations fast and efficient. Impala can interchange data with other Hadoop components, as both a consumer and a producer, so it can fit flexibly into your ETL and ELT pipelines.
Impala and Hive
Where practical, Impala uses existing Apache Hive infrastructure that many Hadoop users already have in place to perform long-running, batch-oriented SQL queries. Impala stores table definitions in MySQL or PostgreSQL database known as Metastore, the same database where Hive keeps this type of data. Thus, Impala can access tables defined or loaded by Hive, if all columns use Impala-supported data types, file formats, and compression codecs.
The initial focus on query features and performance means that Impala can read more types of data with the SELECT
statement than it can write with the INSERT
statement. To query data using the Avro, RCFile, or SequenceFile file formats, you should load the data using Hive.
The Impala query optimizer can also use table statistics and column statistics. Originally, you gathered this information with the ANALYZE TABLE
statement in Hive. In Impala, use the COMPUTE STATS
statement instead. COMPUTE STATS
requires less setup, is more reliable, and does not require switching back and forth between impala-shell and the Hive shell.
Impala Metadata and Metastore
As described above, Impala maintains information about table definitions in a central database known as Metastore. Impala also tracks metadata for the low-level characteristics of data files — the physical locations of blocks within HDFS.
For tables with a large volume of data and/or many partitions, retrieving all the metadata can be time-consuming. Each Impala node caches this metadata to reuse for future queries on the same table.
If the table definition or the data in the table is updated, all Impala daemons in the cluster must receive the latest metadata, and replace the obsolete cached metadata, before issuing a query on that table. The metadata update is automatically coordinated through the Impala Catalog Service component, for all DDL and DML statements issued through Impala. For DDL and DML issued through Hive, or changes made manually on files in HDFS, you need to use the REFRESH
statement (when new data files are added to existing tables) or the INVALIDATE METADATA
statement (for new tables, after dropping a table, performing an HDFS rebalance operation, or deleting data files). INVALIDATE METADATA
retrieves metadata for all the tables tracked by Metastore. If you know which tables have been changed outside of Impala, you can run REFRESH <table_name>
for each affected table to get the latest metadata for those tables.
Impala and HDFS
Impala uses HDFS as its primary data storage. Impala relies on the redundancy provided by HDFS to protect against hardware or network failures on individual nodes. Impala table data is represented as data files in HDFS using standard HDFS file formats and compression codecs. When data files are present in the directory for a new table, Impala reads them all, regardless of file names. New data is added to files with names controlled by Impala.
Impala and HBase
HBase is an alternative to HDFS as a storage for Impala data. It is a database system built on top of HDFS, without built-in SQL support. To query the contents of the HBase tables through Impala, define tables in Impala and map them to equivalent tables in HBase. This way you can also execute JOIN queries including both Impala and HBase tables.