Hive backup and restore

This article describes how to back up and restore Hive table data. At a high level, backing up a Hive table involves the following actions:

  • Creating a copy of the Hive table data stored in the Hive warehouse in HDFS.

  • Creating a dump of the Hive Metastore database.

Restoring a Hive table from a backup involves the reverse steps, namely:

  • Copying the backup files to the corresponding Hive warehouse directory in HDFS.

  • Importing the dump file into the Hive Metastore database.

More details about the backup and restore processes are provided below. The article also includes an end-to-end scenario that walks you through all the major operations required to back up and restore a Hive table.

Back up Hive table

To back up a managed Hive table, do the following:

  1. Stop Hive via ADCM.

  2. Back up Hive warehouse data in HDFS.

    Hive stores raw table data under the HDFS directory specified by the hive.metastore.warehouse.dir parameter (defaults to /apps/hive/warehouse). To create a copy of the warehouse directory, use HDFS snapshots, distcp, or any other method of copying an HDFS directory (a distcp sketch is shown after this list).

  3. Back up the Hive Metastore database.

    Apart from the raw table data stored in HDFS, Hive Metastore keeps table metadata in a relational database. This metadata includes schemas, partitions, buckets, indexes, and other system information that is vital for proper Hive table operation. All of this metadata resides in the hive database, which should be dumped when creating a backup.
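
For example, the warehouse copy from step 2 could be made with distcp as shown below. This is a minimal sketch: the destination path /backup/hive_warehouse is an arbitrary example, and the source path assumes the default hive.metastore.warehouse.dir value.

    $ sudo -u hdfs hadoop distcp /apps/hive/warehouse /backup/hive_warehouse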

Restore Hive table

To restore a Hive table from the backup, do the following:

  1. Stop Hive via ADCM.

  2. Copy the HDFS backup files to the Hive warehouse location.

  3. Import the SQL dump file to the database used by Hive Metastore.

    NOTE
    Importing a backup dump file into Hive Metastore overwrites the metadata in the existing hive database. As a result, after the import, your Hive cluster retains only those Hive tables that were present in the backup dump.

Example

The following scenario guides you through all the steps required to back up and restore a Hive table. The scenario assumes a full overwrite of Hive metadata during the restore, which means that only the tables from the backup will remain operable in the Hive cluster after the restore.

  1. Create a test Hive table with some data:

    SELECT * FROM transactions;
    +----------------------+----------------------+--------------------------+------------------------+
    | transactions.txn_id  | transactions.acc_id  | transactions.txn_amount  | transactions.txn_date  |
    +----------------------+----------------------+--------------------------+------------------------+
    | 1                    | 1002                 | 10.00                    | 2023-01-01             |
    | 8                    | 1003                 | 50.00                    | 2023-01-01             |
    | 3                    | 1002                 | 30.00                    | 2023-01-02             |
    | 4                    | 1001                 | 100.50                   | 2023-01-02             |
    | 2                    | 1002                 | 20.00                    | 2023-01-03             |
    | 6                    | 1001                 | 200.50                   | 2023-01-03             |
    | 7                    | 1003                 | 50.00                    | 2023-01-03             |
    | 5                    | 1001                 | 150.50                   | 2023-01-04             |
    | 9                    | 1003                 | 75.00                    | 2023-01-04             |
    +----------------------+----------------------+--------------------------+------------------------+

    Use the following SQL to create and populate the test table via /bin/beeline.

    DROP TABLE IF EXISTS transactions;
    CREATE TABLE transactions(txn_id int, acc_id int, txn_amount decimal(10,2), txn_date date);
    INSERT INTO transactions VALUES
    (1, 1002, 10.00, '2023-01-01'),
    (2, 1002, 20.00, '2023-01-03'),
    (3, 1002, 30.00, '2023-01-02'),
    (4, 1001, 100.50, '2023-01-02'),
    (5, 1001, 150.50, '2023-01-04'),
    (6, 1001, 200.50, '2023-01-03'),
    (7, 1003, 50.00, '2023-01-03'),
    (8, 1003, 50.00, '2023-01-01'),
    (9, 1003, 75.00, '2023-01-04');
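
    To run the SQL above, first connect to HiveServer2 with Beeline. The JDBC URL below is an assumption; adjust the host and port to match your deployment:

    $ /bin/beeline -u "jdbc:hive2://<hiveserver2_host>:10000"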
  2. In ADCM, stop the Hive service.

  3. Create a snapshot of the Hive warehouse directory in HDFS. In this scenario, the required warehouse directory is /apps/hive/warehouse/transactions (assuming hive.metastore.warehouse.dir points to /apps/hive/warehouse). To create a snapshot, follow the steps below.

    Make the /apps/hive/warehouse/transactions directory snapshottable:

    $ sudo -u hdfs hdfs dfsadmin -allowSnapshot /apps/hive/warehouse/transactions

    The output:

    Allowing snapshot on /apps/hive/warehouse/transactions succeeded

    Then, create a snapshot of the warehouse directory:

    $ sudo -u hdfs hdfs dfs -createSnapshot /apps/hive/warehouse/transactions snapshot_transactions

    The output:

    Created snapshot /apps/hive/warehouse/transactions/.snapshot/snapshot_transactions

    Copy the snapshot to a temporary HDFS location, since /apps/hive/warehouse/transactions will be deleted in a later step. For example:

    $ sudo -u hdfs hadoop fs -mkdir /user/tmp
    $ sudo -u hdfs hadoop fs -chown hdfs:hdfs /user/tmp
    $ sudo -u hdfs hdfs dfs -cp /apps/hive/warehouse/transactions/.snapshot/snapshot_transactions /user/tmp

    Remove the snapshot from /apps/hive/warehouse/transactions. This step is necessary because an existing snapshot would prevent Hive from deleting the HDFS warehouse directory in a later step.

    $ sudo -u hdfs hdfs dfs -deleteSnapshot /apps/hive/warehouse/transactions/ snapshot_transactions
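
    Optionally, verify that the copied snapshot data is now available in the temporary location:

    $ sudo -u hdfs hdfs dfs -ls /user/tmp/snapshot_transactions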
  4. Dump the Hive Metastore database. In this scenario, Hive uses a MariaDB database that runs as an ADH service. For other database engines, such as PostgreSQL, use the corresponding dump syntax. To dump the MariaDB database, run the following command on the ADH host that runs the MariaDB server:

    $ sudo mysqldump -uroot hive > hive_backup.sql
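
    If your Hive Metastore runs on PostgreSQL instead, the dump could look like the following sketch. The database name hive and the postgres OS user are assumptions that depend on your deployment:

    $ sudo -u postgres pg_dump hive > hive_backup.sql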
  5. Start Hive.

  6. Drop the test Hive table.

    DROP TABLE transactions;

    Ensure that the HDFS directory /apps/hive/warehouse/transactions has been removed.
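
    One possible way to check this from the command line:

    $ sudo -u hdfs hdfs dfs -test -d /apps/hive/warehouse/transactions || echo "The directory has been removed"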

  7. Restore the test Hive table from the backup. For this, complete the steps below.

    Copy the snapshot files to the Hive warehouse location.

    $ sudo -u hdfs hdfs dfs -cp /user/tmp/snapshot_transactions /apps/hive/warehouse/transactions
    $ sudo -u hdfs hdfs dfs -ls /apps/hive/warehouse/transactions

    Then, import the SQL dump file into the Metastore database. For MariaDB, use the following command:

    $ sudo mysql hive < hive_backup.sql
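
    For PostgreSQL, a similar import could look like the following sketch (with the same assumptions about the database name and OS user as in the dump sketch above):

    $ sudo -u postgres psql hive < hive_backup.sql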
  8. Restart Hive.

  9. Retrieve some data from the newly restored Hive table:

    SELECT * FROM transactions;

    The output below indicates that the test table has been successfully restored.

    +----------------------+----------------------+--------------------------+------------------------+
    | transactions.txn_id  | transactions.acc_id  | transactions.txn_amount  | transactions.txn_date  |
    +----------------------+----------------------+--------------------------+------------------------+
    | 1                    | 1002                 | 10.00                    | 2023-01-01             |
    | 2                    | 1002                 | 20.00                    | 2023-01-03             |
    | 3                    | 1002                 | 30.00                    | 2023-01-02             |
    | 4                    | 1001                 | 100.50                   | 2023-01-02             |
    | 5                    | 1001                 | 150.50                   | 2023-01-04             |
    | 6                    | 1001                 | 200.50                   | 2023-01-03             |
    | 7                    | 1003                 | 50.00                    | 2023-01-03             |
    | 8                    | 1003                 | 50.00                    | 2023-01-01             |
    | 9                    | 1003                 | 75.00                    | 2023-01-04             |
    +----------------------+----------------------+--------------------------+------------------------+