Phoenix performance tuning

This article contains recommendations on how to optimize the performance of Phoenix in ADH.

Configuration tuning

Connection

Parameters described below are related to connection details in Phoenix.

Connection parameters
Parameter Default and description Recommendation

phoenix.query.timeoutMs

Default: 600000 (in milliseconds).

Defines the time the query is allowed to run before it is terminated on the client side. This prevents long-running or runaway queries from consuming excessive resources and potentially causing performance issues

Set a relatively conservative value (30000 or 60000) and monitor the actual execution times of your queries. Use the HBase web UI, the Phoenix EXPLAIN command, or application-level logging for that purpose. Identify the queries that consistently exceed the current timeout. If the timeout is too restrictive and causes queries that should succeed to fail, gradually increase the value in small increments (e.g. 30 seconds or 1 minute at a time) and monitor the results.

This parameter has to be added manually in the Custom hbase-site.xml section of the HBase service configuration page

phoenix.query.threadPoolSize

Default: 128.

This parameter defines the maximum number of threads available in the thread pool dedicated to executing incoming Phoenix queries. Each thread handles a single query (or a portion of a complex query)

Start with the default value and run the typical workload. Monitor the key metrics:

  • CPU usage;

  • RAM usage;

  • query latency;

  • thread pool metrics.

If the CPUs are underutilized, the query latency is high, or query queue depth is consistently high, increase the parameter by 25%. If the CPUs are maxed out and the RAM is exhausted, decrease the parameter by 25%. After each adjustment, rerun the workload and monitor the metrics again.

This parameter has to be added manually in the Custom hbase-site.xml section of the HBase service configuration page

hbase.client.scanner.timeout.period

Default: 60000 (in milliseconds).

This parameter defines how long a scanner can remain inactive before the Region server closes it

Start with the default value and run the typical workload. Search the logs for scanner timed out exceptions and/or long-running queries being interrupted. If such errors occur frequently, increase the parameter value. Rerun the workload and check the logs again.

This parameter is located in the hbase-site.xml section of the HBase service configuration page

zookeeper.session.timeout

Default: 90000 (in milliseconds).

This parameter defines the timeout for ZooKeeper sessions. HBase (and therefore Phoenix) relies on ZooKeeper for coordination. If a Region server loses its ZooKeeper session, it can lead to connection issues and instability

In environments with unstable networks or high ZooKeeper load, consider increasing this value. However, a higher value means it will take longer to detect a truly failed Region server. Do not set this too low, as transient network hiccups can cause unnecessary disconnections. This needs to be coordinated with the ZooKeeper configuration itself.

This parameter is located in the hbase-site.xml section of the HBase service configuration page
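For the parameters that have to be added manually to the Custom hbase-site.xml section, the resulting hbase-site.xml entry takes the standard Hadoop property form. A sketch for phoenix.query.timeoutMs (the value shown is illustrative, not a recommendation):

```xml
<!-- Client-side query timeout; 60000 ms is an example starting point -->
<property>
    <name>phoenix.query.timeoutMs</name>
    <value>60000</value>
</property>
```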

Query

Parameters described below are related to query performance in Phoenix.

Query parameters
Parameter Default and description Recommendation

phoenix.query.spoolThresholdBytes

Default: 20971520 (in bytes).

This parameter limits the amount of memory a query can use for sorting or grouping results before spilling data to disk for temporary storage. It is a critical parameter for queries that use ORDER BY, GROUP BY, or JOIN clauses, especially when dealing with large datasets. Insufficient memory can lead to excessive disk I/O, significantly impacting query performance

Set this parameter to 10485760 (10 MB) and run the typical workload. Monitor the key metrics:

  • Query latency. Measure the average and the 95th/99th percentile query latency.

  • Disk I/O utilization.

  • Garbage Collector (GC) activity.

If any of these metrics is too high, double the parameter value, then rerun the workload and monitor the metrics again.

This parameter has to be added manually in the Custom hbase-site.xml section of the HBase service configuration page

phoenix.query.maxRows

Default: -1 (no limit).

This parameter limits the maximum number of rows that a single query can return. If a query attempts to return more rows than this limit, Phoenix will throw an exception (org.apache.phoenix.exception.PhoenixIOException) and terminate the query

Start with the default value and run the typical workload. Monitor query execution and identify any queries that return unexpectedly large result sets or cause performance problems. Examine query logs to understand the typical result set sizes for different types of queries. Based on your findings, set a relatively conservative value (e.g. 10000 or 100000) that is high enough to accommodate most of your legitimate queries but low enough to prevent runaway queries from overwhelming the system, and rerun the workload. Look for queries that are terminated for exceeding the phoenix.query.maxRows limit. If legitimate queries are being terminated, increase the limit slightly and rerun the workload again.

This parameter has to be added manually in the Custom hbase-site.xml section of the HBase service configuration page

phoenix.query.resultset.cache.enabled

Default: false.

This parameter defines whether the client-side caching of query results is enabled. If enabled, Phoenix will cache the results of queries, allowing subsequent identical queries to return results from the cache instead of re-executing on the server

Enable this parameter if you have frequently executed, read-heavy queries with stable data. This can drastically improve read performance. However, it increases client memory consumption and introduces a potential for stale data if the underlying data changes frequently and the cache is not invalidated. Monitor client memory and data freshness.

This parameter has to be added manually in the Custom hbase-site.xml section of the HBase service configuration page

phoenix.query.optimizeMaxParallelScans

Default: 10.

This parameter defines the maximum number of parallel scans that Phoenix can initiate for certain types of queries (e.g. queries that involve multiple parallel scans across different regions or tables)

If you have many regions per table or a highly distributed dataset, increasing this value might improve performance for complex analytical queries that can benefit from parallelism. Increase with caution, as a very high value can put significant load on the cluster meta-data services and on the network. Monitor Region server and ZooKeeper activity.

This parameter has to be added manually in the Custom hbase-site.xml section of the HBase service configuration page

phoenix.mutate.batchSize

Default: 1000.

This parameter specifies the number of mutation operations (INSERT, UPDATE, or DELETE) that are accumulated on the client side before being sent to the HBase server in a single batch. A larger batch size reduces the number of round trips to the server, which can improve write performance

Increase this value for higher write throughput if you are performing bulk loading or batch updates. Start by doubling the default value and monitor performance.

If you are experiencing OutOfMemoryError on the client side or on Region servers, or if Region servers become overloaded by large batches, decrease the value.

This parameter has to be added manually in the Custom hbase-site.xml section of the HBase service configuration page

phoenix.mutate.maxSizeBytes

Default: 10485760 (in bytes).

Defines the maximum size of a single mutation batch. This prevents very large mutations from overwhelming the Region servers

If you are inserting rows with very large values (e.g. large binary data) and the default limit is consistently exceeded, increase this value. Be careful not to set it too high, as that can lead to Region server issues.

If you are seeing errors related to oversized mutations, decrease this value.

This parameter has to be added manually in the Custom hbase-site.xml section of the HBase service configuration page

CAUTION
The last two properties (phoenix.mutate.batchSize and phoenix.mutate.maxSizeBytes) work together. The batch will be sent when either limit is reached.

JDBC URL

Parameters described below can be used in the JDBC URL string when connecting to Phoenix.

JDBC URL parameters
Parameter Default and description Recommendation

phoenix.query.client.fetchSize

Default: 1000.

This parameter specifies the number of rows that the client requests from the server in each fetch operation during query execution. A larger fetch size reduces the number of network round trips, potentially improving performance for queries that return many rows

Increase the value for queries that retrieve a large number of rows to reduce network latency overhead. Decrease the value if the memory on the client becomes an issue or if network conditions are poor, causing large fetches to time out. It is often beneficial to match this value with the phoenix.statement.prefetchRows parameter value if you use it (described next)

phoenix.statement.prefetchRows

Default: 0 (disabled).

This parameter specifies the number of rows pre-fetched into the client local buffer. When set to a value greater than 0, Phoenix will attempt to fetch the specified number of rows ahead of what is currently being consumed by the application, smoothing out the execution of the next() calls

Specify a value greater than 0 (e.g. 1000 or 5000) for queries where the client iterates through results using next(). This can significantly improve perceived query performance by reducing latency between row fetches. A value too high might consume excessive client memory

phoenix.schema.isNamespaceMappingEnabled

Default: false.

This parameter determines whether Phoenix should use HBase namespaces to represent Phoenix schemas

Set this parameter to true if you want to use HBase namespaces for schema management. This provides better separation of concerns and can improve performance in multi-tenant environments by leveraging HBase namespace-based security and resource management. This requires proper setup and configuration of HBase namespaces. If HBase namespaces are not used, leave it as false

phoenix.use.stats

Default: true.

This parameter defines whether the use of statistics is enabled when planning queries

Leave this value as true to allow Phoenix to make informed decisions about query execution based on table statistics. If you have a very small dataset or know that your statistics are significantly out of date, you might temporarily set it to false to force Phoenix to use a different query plan. However, in most cases, it is best to keep statistics enabled and ensure they are regularly updated. You can force the statistics update with the UPDATE STATISTICS <table_name> query

async

Default: false.

This parameter defines whether asynchronous queries are enabled. This allows the client to submit queries and receive results later

Set this parameter to true if you want to submit queries without blocking. The results can be processed later. It requires additional code and infrastructure to handle the asynchronous results
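For illustration, such parameters can be appended to the JDBC URL as key=value pairs separated by semicolons. The ZooKeeper quorum hosts and the values below are placeholders:

```
jdbc:phoenix:zk1,zk2,zk3:2181:/hbase;phoenix.query.client.fetchSize=5000;phoenix.schema.isNamespaceMappingEnabled=true
```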

Region server

Parameters described below are related to the functioning of Region servers with respect to Phoenix.

Region server parameters
Parameter Default and description Recommendation

hbase.rpc.timeout

Default: 60000 (in milliseconds).

This parameter defines the general timeout for RPC calls to HBase region servers. If a client does not receive a response from a Region server within this time, the RPC call is considered failed. Phoenix uses RPC calls for virtually all operations

If you are frequently seeing RpcTimeoutException or CallTimeoutException errors, which often happens when a Region server is under heavy load or there is network latency, increase this value. Increase by small increments (e.g. by 10000 at a time) and monitor your logs.

This parameter has to be added manually in the Custom hbase-site.xml section of the HBase service configuration page

phoenix.coprocessor.maxServerCacheBytes

Default: 0 (disabled).

This parameter defines the server-side cache size for query results within the coprocessor on the Region servers. This can significantly improve performance for frequently executed queries with relatively small result sets. The value is defined in bytes

If you have many repeated queries, start with a modest value (e.g. 104857600 for 100 MB). Closely monitor the memory usage on your Region servers. A value too high can cause memory pressure and performance problems. Also monitor the cache hit rate: if it is too low, you are wasting memory.

This parameter has to be added manually in the Custom hbase-site.xml section of the HBase service configuration page

phoenix.index.builder.threads

Default: 1.

This parameter specifies the number of threads that will be used to build global indexes. Global indexes are built asynchronously in the background, and this parameter controls the level of parallelism used for index building

If you have a large number of rows to index and want to speed up the index building process, increase this value up to the number of available CPU cores on your Region servers.

If the index building process is putting too much load on the Region servers, decrease this value.

This parameter has to be added manually in the Custom hbase-site.xml section of the HBase service configuration page

Database tuning

Row key design

The row key is the most important design consideration in HBase and, therefore, in Phoenix. A well-designed row key can significantly improve query performance and data locality. A poorly designed row key can lead to hotspots, slow queries, and overall system inefficiency.

Below is the list of key considerations for row key design:

  • Query patterns. The row key should be chosen based on how you intend to query the data. Identify the most common and performance-critical queries and design the row key to optimize for those. Consider the following:

    • Range scans. If you frequently need to retrieve data within a specific range, the row key should include the range attribute (e.g. a date) or a derivative of it as a prefix.

    • Point lookups. If you need to retrieve individual rows quickly based on a unique identifier, make that identifier the row key.

    • Prefix-based queries. If you often use queries based on a prefix of a specific attribute, include that attribute at the beginning of the row key.

  • Hotspotting. Avoid row key designs that lead to hotspotting. It occurs when a disproportionate amount of data is written to or read from a single Region server, creating a bottleneck. Common causes:

    • Monotonically increasing keys. Using timestamps, sequence numbers, or other monotonically increasing values as the row key will direct all new writes to a single region, overloading that server.

    • Sequential keys. Similar to monotonically increasing keys, sequential keys can also cause hotspots if they are concentrated within a small range.

  • Salting. This is a technique to mitigate hotspotting by adding a prefix to the row key to distribute writes across multiple regions.

  • Pre-splitting. Pre-splitting involves creating multiple regions upfront to distribute the initial load more evenly. This is especially important for large datasets.

  • Key length. Keep row keys as short as possible without sacrificing uniqueness or query performance. Shorter keys reduce storage space and improve scan performance.

  • Composite keys. Phoenix supports composite row keys, which combine multiple columns into a single key. This is useful when you need to execute queries based on a combination of attributes.

    • Order of columns. The order of columns in the composite key is significant. Place the most frequently queried columns at the beginning.

    • Data types. Choose appropriate data types for each column in the composite key to optimize storage space and performance.

Example

Consider a table storing sensor data with the following columns:

  • sensor_id

  • timestamp

  • temperature

  • humidity

If you frequently query for sensor data within a specific time range, you might design the composite row key as a combination of sensor_id and timestamp. However, if the number of sensors is small and writes to specific time ranges are concentrated, you will run into hotspotting. In that case, you would need to salt the row key across some number of salt buckets.
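A sketch of such a table in Phoenix SQL, assuming illustrative column types and eight salt buckets:

```sql
-- Composite row key (sensor_id, ts); SALT_BUCKETS spreads writes
-- across 8 regions to mitigate hotspotting on a small set of sensors
CREATE TABLE sensor_data (
    sensor_id   VARCHAR NOT NULL,
    ts          TIMESTAMP NOT NULL,
    temperature DOUBLE,
    humidity    DOUBLE,
    CONSTRAINT pk PRIMARY KEY (sensor_id, ts)
) SALT_BUCKETS = 8;
```

With this key, a query for one sensor over a time window (WHERE sensor_id = ? AND ts BETWEEN ? AND ?) becomes a range scan rather than a full scan.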

Column family design

In HBase and Phoenix, the data is stored in column families. Choosing the right column family design is crucial for optimizing storage efficiency, query performance, and overall system scalability.

Below is the list of key considerations for column family design:

  • Locality of reference. Columns that are accessed together frequently should be grouped into the same column family. This improves read performance because HBase retrieves entire column families at once. Conversely, columns that are rarely accessed together should be placed in separate column families.

  • Number of column families. Minimize the number of column families in a table. Each column family has associated overhead in terms of storage and memory. A small number of well-designed column families is generally preferable to a large number of fragmented families. Aim for a few (2-3) column families.

  • Storage characteristics. Consider the storage characteristics of the data within each column family:

    • Data type. Columns with similar data types (e.g. strings, numbers, binaries) can be grouped together to optimize storage and compression.

    • Data size. Columns containing large objects (e.g. images, documents) should be placed in their own column family to avoid impacting the performance of other frequently accessed data.

  • Schema evolution. Column families provide a degree of flexibility for schema evolution. You can add new columns to an existing column family without affecting other families. However, consider the impact of adding a large number of columns to a single column family, as it can affect performance.

  • Column family naming. Choose descriptive and meaningful names for your column families, but keep them as short as possible.

  • Compression. Column families can be configured with different compression algorithms. Choose the appropriate compression algorithm based on the data type and access patterns within each family.

  • In-memory storage. You can configure a column family to be stored in memory for faster access. This is useful for frequently accessed data that is relatively small. However, be mindful of memory usage on your region servers.

Example

Consider a table storing user profile information with the following columns: user_id, username, email, first_name, last_name, address, city, state, zip, and profile_picture. You might design the column families as follows:

  • info: username, email, first_name, last_name

  • address: address, city, state, zip

  • media: profile_picture

This design groups the frequently accessed user information into the info column family, separates the address information into its own family, and places the large profile picture into a separate family to avoid impacting the performance of other queries.
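In Phoenix DDL, the column family is expressed as a prefix on the column name. A sketch of this design with assumed data types:

```sql
-- info, address, and media become separate HBase column families;
-- user_id (the primary key) is stored in the row key itself
CREATE TABLE user_profiles (
    user_id               VARCHAR PRIMARY KEY,
    info.username         VARCHAR,
    info.email            VARCHAR,
    info.first_name       VARCHAR,
    info.last_name        VARCHAR,
    address.address       VARCHAR,
    address.city          VARCHAR,
    address.state         VARCHAR,
    address.zip           VARCHAR,
    media.profile_picture VARBINARY
);
```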

Indexing strategies

Phoenix provides powerful indexing capabilities to accelerate query performance on HBase data. Unlike traditional relational databases, where indexes are typically created on a small number of columns, Phoenix allows you to create indexes on any combination of columns, including expressions and functions. Choosing the right indexing strategy is crucial for optimizing query performance and minimizing index maintenance overhead.

There are two main types of indexes in Phoenix: global and local.

Global indexes

Global indexes are stored in a separate HBase table and cover all rows in the original table. When a query uses a global index, Phoenix retrieves the data from the index table instead of the original table.

Global indexes are best suited for read-heavy workloads where queries frequently filter on the indexed columns. They are updated synchronously during data writes. This can impact write performance, especially for large datasets.

Global indexes are not usable if the underlying table is not available.

Local indexes

Local indexes are stored on the same region servers as the original table data. When a query uses a local index, Phoenix retrieves the data from the index table and the original table within the same region server.

Local indexes are ideal for write-heavy workloads where queries need to filter on non-leading columns of the row key or columns that are not part of a global index. They are updated asynchronously after data writes. This minimizes the impact on write performance.

Local indexes are available even if the underlying table is not available.

Syntax examples for creating a global and a local index:

CREATE INDEX my_global_index ON my_table (column1, column2);
CREATE LOCAL INDEX my_local_index ON my_table (column1, column2);

Data compression

Data compression is a crucial technique for reducing storage space and improving I/O performance in HBase and Phoenix. By compressing data before it is stored, you can significantly reduce the amount of disk space required and the time it takes to read and write data. However, compression also introduces overhead due to the CPU cycles required to compress and decompress data. Choosing the right compression algorithm is a trade-off between storage savings, I/O performance, and CPU usage.

HBase supports several compression algorithms listed below, each with its own characteristics:

  • NONE. No compression is used. This is the default algorithm. It provides the fastest read and write performance but consumes the most storage space.

  • LZO. A fast and widely used compression algorithm that provides a good balance between compression ratio and CPU usage. Requires separate installation and configuration.

  • GZIP. A popular compression algorithm that offers a high compression ratio but consumes more CPU cycles than LZO. Suitable for data that is not frequently accessed.

  • Snappy. A fast compression algorithm that is designed for speed rather than high compression ratios. Ideal for data that is frequently accessed and requires low latency.

  • LZ4. Extremely fast compression algorithm but offers lower compression ratios than other algorithms.

The best compression algorithm for your data depends on several factors:

  • Data type. Some compression algorithms are better suited for certain data types than others. For example, GZIP is often a good choice for text data, while Snappy is suitable for binary data.

  • Access patterns. If your data is frequently accessed, you should choose a compression algorithm that offers low latency, such as Snappy or LZ4. If your data is rarely accessed, you can choose a compression algorithm that provides a high compression ratio, such as GZIP.

  • CPU usage. Be mindful of the CPU overhead of compression. If your system is already CPU-bound, you may want to choose a compression algorithm that is less CPU-intensive, such as LZO or Snappy.

  • Storage capacity. If your storage capacity is limited, you should choose a compression algorithm that provides a high compression ratio.
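In Phoenix, the compression codec is set as a table option at creation time and is passed through to the underlying HBase column family. A sketch, assuming the Snappy libraries are available on the Region servers (note that HBase names the GZIP codec GZ):

```sql
-- Snappy: a low-latency codec for frequently accessed data
CREATE TABLE events (
    event_id VARCHAR PRIMARY KEY,
    payload  VARCHAR
) COMPRESSION = 'SNAPPY';
```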

Data locality

Data locality is a key concept in HBase and Phoenix that refers to the proximity of data to the compute resources that need to access it. Maximizing data locality is crucial for achieving high performance because it reduces the amount of data that needs to be transferred over the network, which can be a significant bottleneck.

HBase stores data in regions, which are distributed across Region servers in the cluster. Each Region server is responsible for managing a subset of the data. When a client application requests data, the request is routed to the Region server that hosts the corresponding region; performance is best when that server also stores the underlying data blocks locally.

Data locality is affected by the following factors:

  • Row key design. The row key is the most important factor affecting data locality. A well-designed row key will group related data together on the same region server, making it easier to access the data efficiently.

  • Region splitting and merging. HBase automatically splits and merges regions based on data size and load. If regions are not split and merged appropriately, data locality can suffer. It is important to monitor region sizes and configure the region splitting and merging policies accordingly.

  • Data placement. HBase attempts to place data on Region servers that are close to the data consumers. However, factors such as network topology and rack awareness can affect data placement.

  • Compaction. Compaction is the process of merging and rewriting data files in HBase. Compaction can improve data locality by reorganizing data on disk.

  • Bulk loading. Using bulk loading to load data into HBase allows you to pre-sort your data which can increase data locality.

You can use the following strategies for improving the data locality:

  • Optimal row key design. A well-designed row key is essential for data locality. Choose a row key that groups related data together on the same region server. Consider salting to prevent hotspotting, while still maintaining locality for related data within the same salted region.

  • Region pre-splitting. Pre-splitting the table into multiple regions can improve data locality by distributing the data across multiple Region servers from the start.

  • Rack awareness. Configure HBase to be aware of the rack topology of your cluster. This allows HBase to place data on region servers that are in the same rack as the data consumers.

  • HDFS configuration. Properly configure HDFS to ensure that data is stored close to the region servers.
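Pre-splitting can be expressed directly in Phoenix DDL via SPLIT ON. A sketch with hypothetical split points chosen for an even key distribution:

```sql
-- Creates four regions up front instead of one, so the initial
-- load is distributed across the cluster from the start
CREATE TABLE users_by_name (
    name  VARCHAR PRIMARY KEY,
    email VARCHAR
) SPLIT ON ('g', 'm', 't');
```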

Query optimization techniques

Query execution plan

Understanding the query execution plan is fundamental to optimizing SQL queries in Phoenix. The execution plan outlines the steps Phoenix will take to retrieve and process the data. By analyzing this plan, you can identify potential bottlenecks and areas for improvement.

You can obtain the query execution plan using the EXPLAIN statement in Phoenix. For example:

EXPLAIN SELECT * FROM users WHERE age > 30 AND city = 'New York';

Result example:

CLIENT 1-WAY FULL SCAN OVER USERS
    SERVER FILTER BY "AGE" > 30 AND "CITY" = 'New York'

The execution plan is typically presented as a tree structure, with the root node representing the final result and the leaf nodes representing the data sources. Each node in the tree describes an operation that Phoenix will perform. Below is a table with the description of the execution plan operations.

Execution plan operations
Operation Description

SERIAL

Indicates that the operation will be performed serially on a single region server. This is generally less efficient than parallel execution

PARALLEL <n>-WAY

Indicates that the operation will be performed in parallel across multiple region servers, with <n> specifying the number of ways (degree of parallelism). Look for opportunities to increase parallelism where appropriate, but be mindful of resource constraints

FULL SCAN OVER <table_name>

Indicates a full table scan, meaning that Phoenix will read every row in the table to find the matching rows. Full table scans are generally inefficient, especially for large tables. A full scan over an index table is not necessarily inefficient

RANGE SCAN OVER <table_name> <startRow, stopRow>

Indicates a range scan, where Phoenix will read only a subset of the rows in the table based on a range of row keys. This is more efficient than a full table scan, but it can still be inefficient if the range is larger than necessary. An index table will define its ranges differently from the original table

POINT LOOKUP OVER <table_name> <rowKey>

Indicates a point lookup, where Phoenix will directly retrieve a single row based on its row key. This is the most efficient type of read operation

CLIENT <op type>

Indicates that the query execution is happening on the client side

JOIN <join type>

Indicates the type of the JOIN operation performed (e.g. STAR JOIN). Analyzing join operations is crucial for optimization

FILTER <predicate>

Indicates that Phoenix is filtering the data based on a predicate (the WHERE clause). Ensure the filter is as selective as possible

AGGREGATE <aggregation function>

Indicates that Phoenix is performing an aggregation function (e.g. SUM, AVG, COUNT)

SORT

Indicates that the results are being sorted. Sorting can be expensive, so avoid it if possible. If you can use an index to provide the required sort order, do so

Based on the analysis of the execution plan, you can take steps to improve query performance. In the example above, you could create an index on the age and city columns like this:

CREATE INDEX idx_users_age_city ON users (age, city);

After creating the index, the execution plan might look like the following:

CLIENT 1-WAY RANGE SCAN OVER IDX_USERS_AGE_CITY [30,]
    SERVER FILTER BY "CITY" = 'New York'

SQL efficiency

Writing efficient SQL queries is paramount for optimal performance in Phoenix. While Phoenix leverages HBase for storage, inefficient SQL can still lead to slow query execution, even with properly indexed tables. This section focuses on crafting SQL queries that minimize the workload on Phoenix and maximize resource utilization.

Key principles for writing efficient SQL:

  • Select only necessary columns. Explicitly specify the columns you need in the SELECT clause. Using the SELECT * clause retrieves all columns from the table, even if you only need a few. This increases the amount of data that needs to be read, transferred, and processed.

  • Filter early and efficiently. Applying filters late in the query execution (e.g. after a full table scan) can be very inefficient. Use the WHERE clause to filter data as early as possible. Ensure that your WHERE clause leverages available indexes. Place the most restrictive conditions in the WHERE clause first.

  • Use the correct data types. Inconsistent data types in comparisons and predicates can prevent Phoenix from using indexes effectively and may lead to type conversion overhead. Ensure that the data types in your SQL queries match the data types of the corresponding columns in the table schema.

  • Leverage built-in functions efficiently. Some built-in functions can be more efficient than others. Understand the performance characteristics of different built-in functions. Use functions that are optimized for HBase/Phoenix.

  • Avoid unnecessary subqueries. Subqueries can often be rewritten as joins, which can be more efficient in Phoenix. Correlated subqueries (subqueries that depend on the outer query) can be particularly problematic. Try to rewrite subqueries as joins whenever possible. Consider denormalization as an alternative if the subquery is used frequently.

  • Limit the result set. Retrieving large result sets can put a strain on the client and network. Use the LIMIT clause to restrict the number of rows returned. If pagination is needed, use LIMIT and OFFSET in conjunction.

  • Use COUNT(*) carefully. COUNT(*) on a large table without a WHERE clause can be very expensive, as it requires scanning the entire table. If you need to count all rows in a table, consider using a pre-aggregated count or a lightweight approach if the table is not excessively large. If possible, add a WHERE clause to restrict the count to a smaller subset of the data.

  • Leverage UPSERT SELECT for efficient data updates and inserts. Inserting data row by row can be very inefficient. Use UPSERT SELECT to insert or update data in batches. This is especially useful for loading data from other tables or external sources.
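Several of the tips above can be combined in practice. The sketch below uses a hypothetical orders table (all table and column names are assumptions for illustration, not part of any standard schema):

```sql
-- Hypothetical schema: orders(order_id, customer_id, order_date, status, total)

-- Select only the needed columns, filter as early as possible,
-- and cap the result set:
SELECT order_id, customer_id, total
FROM orders
WHERE order_date >= TO_DATE('2024-01-01')
  AND status = 'SHIPPED'
LIMIT 100;

-- Load data in bulk with UPSERT SELECT instead of row-by-row upserts
-- (orders_archive is a hypothetical target table):
UPSERT INTO orders_archive (order_id, customer_id, total)
SELECT order_id, customer_id, total
FROM orders
WHERE order_date < TO_DATE('2023-01-01');
```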

JOIN optimization

Joins in Phoenix can be relatively expensive operations compared to point lookups or range scans. Therefore, optimizing joins is crucial for achieving good performance when working with related data across multiple tables. HBase's distributed nature makes join performance a more significant concern than it is in traditional relational databases. This section offers strategies for optimizing joins in Phoenix, including choosing the right join type, leveraging data locality, and, when appropriate, avoiding joins altogether through denormalization.

Phoenix supports several join types and related strategies, each with its own performance characteristics:

  • Star joins. Optimized for scenarios where one large table (the fact table) is joined with multiple smaller dimension tables. Phoenix attempts to broadcast only the parts of the dimension tables required by the query to all relevant region servers (those containing data from the fact table in the join), allowing for efficient local joins. This is often the preferred join strategy in Phoenix when the data model and query patterns support it. Phoenix can typically determine if it is beneficial to execute a star join based on the data statistics available to it.

  • Broadcast joins. Suited specifically for scenarios where one table is genuinely small (fits comfortably in memory on each region server) and is joined with one or more larger tables. Phoenix broadcasts the entire small table to all region servers. This is different from a star join where only relevant parts of the dimension tables are sent. This approach can yield excellent performance because the join operations happen locally on each region server, minimizing data transfer. Phoenix makes use of this optimization automatically when applicable.

  • Sort-merge joins. While Phoenix supports sort-merge joins, they are generally less efficient than star or broadcast joins, especially in distributed environments like HBase. They involve sorting the input tables based on the join keys and then merging the sorted results. The performance bottleneck arises from the sorting overhead and the potential for data shuffling across the network. Phoenix typically resorts to sort-merge joins only when other more efficient join strategies cannot be applied. This might occur if the tables are very large and cannot be effectively broadcast, or if the join keys are not conducive to a hash-based join.
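As a sketch of the star-join scenario described above, consider a large fact table joined with two small dimension tables (all names below are hypothetical):

```sql
-- sales is the large fact table; products and stores are small
-- dimension tables. Phoenix can send the required parts of the
-- dimension tables to the region servers holding sales data.
SELECT s.sale_id, p.product_name, st.city, s.amount
FROM sales s
JOIN products p ON s.product_id = p.product_id
JOIN stores st ON s.store_id = st.store_id
WHERE s.sale_date >= TO_DATE('2024-06-01');

-- Use EXPLAIN to check which join strategy Phoenix actually chose:
EXPLAIN SELECT s.sale_id, p.product_name
FROM sales s JOIN products p ON s.product_id = p.product_id;
```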

Possible strategies for optimizing the joins:

  • Denormalization. Often the best optimization for joins is to avoid them entirely. Consider denormalizing data by embedding related information directly into the fact table or creating pre-joined views. For example, instead of joining orders and customers tables to retrieve customer information with each order, you could include relevant customer details (e.g. customer name, city) directly in the orders table. However, denormalization can lead to data redundancy and potential consistency issues, so carefully weigh the pros and cons based on your specific use case.

  • Data locality. Place related tables on the same region servers to minimize network traffic during joins. This can be achieved through co-location strategies in HBase.

  • Choosing the correct join order. The order in which tables are joined can significantly impact performance. Generally, it is best to start with the table that has the most restrictive filter and join it with smaller tables first. Filtering early reduces the amount of data that needs to be processed in subsequent join operations.

  • Filtering before joining. Apply the WHERE clauses to filter data from each table before the JOIN operation. Reducing the size of the tables being joined improves join performance.

  • Broadcast joins for small tables. For joins involving a small dimension table and a large fact table, use a broadcast join. Broadcast joins copy the smaller table to all region servers, allowing for local joins with the larger table. Phoenix may automatically choose a broadcast join when it detects a small table. You can often influence this decision using hints.

  • Leverage the star join optimization. If your data model conforms to a star schema (one large fact table joined with multiple smaller dimension tables), take advantage of Phoenix's star join optimization.
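The "filter before joining" strategy above can be sketched with a derived table that reduces one side of the join before it runs (table and column names are hypothetical):

```sql
-- Filter customers inside a derived table so that only the rows
-- matching the predicate participate in the join:
SELECT o.order_id, c.customer_name
FROM orders o
JOIN (SELECT customer_id, customer_name
      FROM customers
      WHERE region = 'EU') c
  ON o.customer_id = c.customer_id
WHERE o.order_date >= TO_DATE('2024-01-01');
```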

Hints

Hints are directives that you can add to your SQL queries to influence Phoenix's query execution plan. They provide a way to guide the optimizer and potentially improve performance when the default execution plan is not optimal. While Phoenix automatically optimizes queries, hints allow you to override its decisions in specific situations. Use hints carefully, as they can make your queries less portable and harder to maintain.

Hints are placed after the SELECT, UPSERT, or DELETE keywords (Phoenix uses UPSERT instead of the INSERT and UPDATE statements) and have the following syntax:

SELECT /*+ hint1, hint2, ... */ column1, column2 FROM table_name WHERE condition;
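Two hints from the Phoenix hint list are shown below; the table and column names are hypothetical:

```sql
-- Force a sort-merge join instead of the default hash-based join:
SELECT /*+ USE_SORT_MERGE_JOIN */ o.order_id, c.customer_name
FROM orders o JOIN customers c ON o.customer_id = c.customer_id;

-- Bypass any secondary index and scan the data table directly:
SELECT /*+ NO_INDEX */ order_id, total
FROM orders
WHERE status = 'SHIPPED';
```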

Considerations when using hints:

  • Use with caution. Hints override the default query optimizer, potentially masking underlying performance issues or leading to sub-optimal plans if used incorrectly.

  • Test thoroughly. Always test the performance of your queries with and without hints to ensure that the hint is actually improving performance.

  • Retain maintainability. Hints can make your queries less portable and harder to maintain. Document the reasons for using a hint and monitor its effectiveness over time.

  • Keep in mind the optimizer changes. The Phoenix query optimizer might change in future versions, rendering your hints ineffective or even detrimental. Regularly review and update your hints as needed.

  • Understand the execution plan. Use the EXPLAIN statement to understand how your hints are affecting the query execution plan.

Query server

The Phoenix Query Server (PQS) provides a way to access Phoenix from clients that do not have direct access to the HBase cluster. It acts as a proxy, allowing clients to submit SQL queries over a thin JDBC driver, abstracting away the complexities of interacting directly with HBase. This approach is beneficial in scenarios where security restrictions, network configurations, or client-side dependencies prevent direct connections to the HBase cluster.

The Phoenix Query Server follows a client-server architecture:

  1. Client. The client application uses a thin JDBC driver to connect to the PQS. This driver is often more lightweight than the standard Phoenix JDBC driver because it does not need to handle direct HBase connections.

  2. Phoenix Query Server. The PQS is a standalone process that runs outside of the HBase cluster (although it is often co-located with HBase nodes). It receives SQL queries from clients, translates them into Phoenix queries, executes them against the HBase cluster, and returns the results to the client. The PQS handles tasks such as:

    • Authentication and authorization.

    • Query parsing and optimization (though it leverages Phoenix's internal optimizer).

    • Connection pooling to HBase.

    • Result set management.

  3. HBase Cluster. The HBase cluster stores the data and executes the actual data access operations. The PQS connects to the HBase cluster using the standard Phoenix JDBC driver.

Benefits of using the Phoenix Query Server:

  • Simplified client access. Clients can access Phoenix data using a standard JDBC interface, without needing to install the full Phoenix client libraries or have direct access to the HBase cluster.

  • Security. The PQS provides a central point for authentication and authorization, allowing you to control access to Phoenix data. You can restrict direct access to the HBase cluster, enhancing security.

  • Network isolation. The PQS can be deployed in a DMZ or other secure network zone, isolating the HBase cluster from direct client access. This is crucial for security in many organizations.

  • Resource management. The PQS can manage connections to the HBase cluster, preventing clients from overloading the cluster with too many connections. It enables connection pooling and resource management.

  • Language independence. Clients can be written in any language that supports JDBC, providing flexibility in application development.

  • Thin JDBC driver. The JDBC client is lightweight, making it easy to deploy on various platforms.
