Hive ACID transactions
Hive supports ACID transactions that provide the following features:
- Atomicity. Ensures that transactions are either executed fully or not executed at all. In other words, transactions are indivisible: if one part of a transaction fails, the entire transaction is rolled back, leaving the system in its original state.
- Consistency. Ensures that the database remains in a valid state before and after a transaction is committed. Transactions preserve the integrity of the data and the constraints defined on it.
- Isolation. Ensures that multiple transactions running in parallel do not interfere with each other.
- Durability. Once a transaction is committed, its effects are permanent and will not be lost, even in the case of a system failure.
Together, these properties ensure reliability, integrity, and robustness of Hive transactions. Hive supports ACID features at the row level, meaning that one application (for example, a streaming Kafka-powered app) can continuously add rows while other apps can still read from the same partition without interfering with each other.
Delta files and compactions
When you insert or update data in a transactional table, Hive creates a set of delta files each reflecting the changes made to the table (partition) in scope of one transaction. When Hive clients request data, Hive analyzes the delta files to include the updates from individual transactions into the result set, thus returning the most up-to-date table state — this is the basic idea of Hive transactions.
To avoid accumulation of too many delta files (which inevitably leads to performance degradation), Hive performs regular cleanups called compactions. The compaction process automatically runs in the background at regular intervals without affecting concurrent reads/writes.
There are two types of compactions:
- Minor. Merges several delta files into a single delta file.
- Major. Rewrites one or more delta files and the base file into a new base file. This process is far more resource-intensive than a minor compaction.
Besides automatic regular compaction runs, you can trigger a compaction manually using the COMPACT command.
Enable ACID transactions
By default, Hive ACID transactions are disabled.
To work with transactional tables, enable the ACID Transactions option in ADCM (Clusters → <clusterName> → Services → Hive → Primary configuration) and then restart your Hive service.
When you activate this option, ADCM automatically updates all the Hive configuration properties required to work with ACID transactions, including the transaction manager (hive.txn.manager).
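For reference, ACID support in upstream Hive is driven by a handful of core properties. The sketch below shows the standard upstream settings; the exact set and values that ADCM applies may differ, and manual changes are normally not needed:

```sql
-- Standard upstream Hive properties behind ACID support (illustrative;
-- ADCM manages these automatically when the ACID Transactions option is enabled)
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on=true;   -- enables the background compaction initiator
SET hive.compactor.worker.threads=1;    -- at least one worker thread to run compactions
```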
Work with transactional tables
Create a table
To create a full-ACID transactional table, the following is required:
- Hive ACID transactions are enabled in ADCM.
- The created table is bucketed.
- ORC is used as the storage format. Currently, this is the only format supported for transactional tables.
- The TBLPROPERTIES ("transactional"="true") clause is used.
For example:
CREATE TABLE `transactions`
(`txn_id` int,
`acc_id` int,
`txn_amount` decimal(10,2),
`txn_date` date)
CLUSTERED BY (acc_id) INTO 3 BUCKETS (1)
STORED AS ORC (2)
TBLPROPERTIES ("transactional"="true"); (3)
1 | Splits the new table into 3 buckets. |
2 | Sets ORC as the storage format. |
3 | Sets the property to create a transactional table. |
Using DESCRIBE FORMATTED <table_name>, you can verify that the new table has been created as transactional. The following output row indicates this.
... | transactional | true | ...
If you observe the following error while creating a transactional table, ensure that the ACID Transactions option is enabled in ADCM and that your Hive service has been restarted.
This command is not allowed on an ACID table <table_name> with a non-ACID transaction manager
Convert non-ACID tables to ACID
You can convert a non-ACID Hive table into a full-ACID one. The major requirement is that the source table must be a managed Hive table stored as ORC. For example:
ALTER TABLE <non-acid_table> SET TBLPROPERTIES ("transactional"="true");
To make an external table ACID, you first have to convert it into a managed table. For example:
ALTER TABLE <external_table_name> SET TBLPROPERTIES ("EXTERNAL"="false");
NOTE
Once a Hive table is created as an ACID table, you cannot convert it back to a non-ACID table.
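Putting the two steps together, converting a hypothetical external ORC table (the table name below is illustrative) into a full-ACID one could look like this:

```sql
-- Step 1: convert the external table into a managed one
ALTER TABLE sales_archive SET TBLPROPERTIES ("EXTERNAL"="false");

-- Step 2: enable ACID semantics on the now-managed table
ALTER TABLE sales_archive SET TBLPROPERTIES ("transactional"="true");

-- Verify the result: the output should contain the "transactional | true" row
DESCRIBE FORMATTED sales_archive;
```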
Insert and update data
Writing or updating data in a transactional table requires no special syntax. The standard INSERT/UPDATE/DELETE statements are used, for example:
INSERT INTO transactions VALUES
(1, 1002, 10.00, '2023-01-01'),
(2, 1001, 100.50, '2023-01-02'),
(3, 1003, 50.00, '2023-01-03');
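UPDATE and DELETE are just as plain. For instance, against the transactions table created above (the values below are illustrative):

```sql
-- Correct the amount of one transaction; Hive records the change as a delta
UPDATE transactions SET txn_amount = 120.00 WHERE txn_id = 2;

-- Remove a transaction; Hive writes a delete delta instead of rewriting files
DELETE FROM transactions WHERE txn_id = 3;
```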
The key difference is that when you write to a transactional table, Hive stores the changes as delta files in the HDFS warehouse directory.
To view the delta files created as a result of the INSERT query above, run the command:
$ hdfs dfs -ls /apps/hive/warehouse/transactions/delta_0000001_0000001_0000
TIP
The first two numbers in delta_0000001_0000001_0000 indicate the minimum and maximum transaction (write) IDs covered by the delta; the last number is the statement ID.
The output:
Found 3 items
-rw-r--r--   3 hive hadoop      1 2024-03-28 17:21 /apps/hive/warehouse/transactions/delta_0000001_0000001_0000/_orc_acid_version (1)
-rw-r--r--   3 hive hadoop    906 2024-03-28 17:21 /apps/hive/warehouse/transactions/delta_0000001_0000001_0000/bucket_00000 (2)
-rw-r--r--   3 hive hadoop    959 2024-03-28 17:21 /apps/hive/warehouse/transactions/delta_0000001_0000001_0000/bucket_00002 (2)
1 | Stores the ORC format version. |
2 | These files store the inserted/updated data split by buckets. |
Special Hive transaction commands
Below are several commands specific to the Hive transaction mechanism.
COMPACT
You can manually trigger a major/minor compaction using the following syntax:
ALTER TABLE <table_name> COMPACT 'major|minor';
Running this command enqueues a compaction operation and outputs the queued compaction ID.
... Compaction enqueued with id {id}
To view the list of all compactions, use the command:
SHOW COMPACTIONS;
The sample output:
+---------------+-----------+----------------+------------+--------+------------+-----------+-------------+---------------+--------------+
| compactionid  |  dbname   |    tabname     |  partname  |  type  |   state    | workerid  |  starttime  |   duration    | hadoopjobid  |
+---------------+-----------+----------------+------------+--------+------------+-----------+-------------+---------------+--------------+
| CompactionId  | Database  | Table          | Partition  | Type   | State      | Worker    | Start Time  | Duration(ms)  | HadoopJobId  |
| 1             | default   | transactions   | ---        | MINOR  | initiated  | ---       | ---         | ---           | None         |
+---------------+-----------+----------------+------------+--------+------------+-----------+-------------+---------------+--------------+
SHOW TRANSACTIONS
This command returns a list of all currently open and aborted transactions in the system, for example:
+-----------------+--------------------+----------------+----------------------+-------+--------------------------------+
| txnid           | state              | startedtime    | lastheartbeattime    | user  | host                           |
+-----------------+--------------------+----------------+----------------------+-------+--------------------------------+
| Transaction ID  | Transaction State  | Started Time   | Last Heartbeat Time  | User  | Hostname                       |
| 10              | ABORTED            | 1711646536000  | 1711646536000        | hive  | ka-adh-2.ru-central1.internal  |
| 29              | OPEN               | 1711970603000  | 1711970603000        | hive  | ka-adh-2.ru-central1.internal  |
+-----------------+--------------------+----------------+----------------------+-------+--------------------------------+
SHOW LOCKS
This command displays the locks acquired on a table or a partition. A sample output is shown below.
+----------+------------------+----------------+------------+-------------+-------------+--------------+-----------------+-----------------+----------------+-------+--------------------------------+------------------------------------------------------------+
| lockid   | database         | table          | partition  | lock_state  | blocked_by  | lock_type    | transaction_id  | last_heartbeat  | acquired_at    | user  | hostname                       | agent_info                                                 |
+----------+------------------+----------------+------------+-------------+-------------+--------------+-----------------+-----------------+----------------+-------+--------------------------------+------------------------------------------------------------+
| Lock ID  | Database         | Table          | Partition  | State       | Blocked By  | Type         | Transaction ID  | Last Heartbeat  | Acquired At    | User  | Hostname                       | Agent Info                                                 |
| 41.1     | _dummy_database  | _dummy_table   | NULL       | ACQUIRED    |             | SHARED_READ  | 50              | 0               | 1711992560000  | hive  | ka-adh-2.ru-central1.internal  | hive_20240401172919_fbb1738c-5736-458f-956e-d89ea7d3a727   |
| 41.2     | default          | transactions2  | NULL       | ACQUIRED    |             | SHARED_READ  | 50              | 0               | 1711992560000  | hive  | ka-adh-2.ru-central1.internal  | hive_20240401172919_fbb1738c-5736-458f-956e-d89ea7d3a727   |
+----------+------------------+----------------+------------+-------------+-------------+--------------+-----------------+-----------------+----------------+-------+--------------------------------+------------------------------------------------------------+
Limitations
There are several limitations for Hive ACID transactions:
- Unlike traditional SQL transactions, the BEGIN, COMMIT, and ROLLBACK operations are not supported. All Hive transactions are auto-commit.
- Only the ORC storage format is supported.
- A Hive table must be bucketed to make use of ACID transactions.
- External tables cannot be made ACID.
- Reading from or writing to an ACID table from a non-ACID session is not allowed. In other words, the Hive transaction manager must be set to org.apache.hadoop.hive.ql.lockmgr.DbTxnManager in order to work with ACID tables.
- The existing ZooKeeper and in-memory lock managers are not compatible with transactions.
- Using Oracle as the Metastore DB together with datanucleus.connectionPoolingType=BONECP may generate intermittent "No such lock …" and "No such transaction …" errors. Setting datanucleus.connectionPoolingType=DBCP is recommended in this case.
- LOAD DATA is not supported with transactional tables.
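As a workaround for the LOAD DATA limitation, raw files can be exposed through a plain external table and then copied into the ACID table with an INSERT ... SELECT query. The table name and path below are illustrative:

```sql
-- Hypothetical non-transactional staging table defined over raw CSV files
CREATE EXTERNAL TABLE transactions_staging (
  `txn_id` int, `acc_id` int, `txn_amount` decimal(10,2), `txn_date` date)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/staging/transactions';

-- Copy the rows into the ACID table instead of running LOAD DATA against it
INSERT INTO transactions SELECT * FROM transactions_staging;
```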