Ozone integration with cluster services

This article describes a use case of integrating the Ozone service with an up-and-running ADH cluster. The scenario shows how to configure ADH services like Spark3, Hive, Impala, and YARN to use Ozone storage instead of HDFS.

After switching to Ozone, the HDFS storage remains accessible and can be used by services along with Ozone.

Test cluster

The use case assumes the following environment:

  • An ADH cluster of version 3.3.6.2 or higher is installed and kerberized. The cluster includes the following services:

    • Core configuration

    • ZooKeeper

    • HDFS

    • YARN

    • ADPG

    • Hive

    • Impala

    • Spark3

  • The Ozone service is added to the running ADH cluster. After Ozone is installed, run the Update Core configuration action of the Core configuration service and restart the following services:

    • YARN

    • Hive

    • Impala

    • Spark3

    This is required to add the necessary Ozone dependencies to the classpath of these services.

  • The HDFS and Ozone services use the namespaces described in the following table.

    Service/Component                Configuration property     Value
    HDFS                             dfs.internal.nameservices  adh
    Ozone                            ozone.service.id           adho
    Ozone Manager                    ozone.om.service.ids       adho
    Ozone Storage Container Manager  ozone.scm.service.ids      adho
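After the restarts described above, you can double-check that the Ozone client libraries are actually visible to the Hadoop services. A minimal sketch, assuming the standard `hadoop` CLI is available on the host (the exact JAR names and locations depend on your installation):

```shell
# List the Hadoop classpath entry by entry and filter for Ozone JARs
hadoop classpath | tr ':' '\n' | grep -i ozone
```

If the command prints no Ozone-related entries, re-run the Update Core configuration action and restart the affected services.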

Configure ADH services to use Ozone

Step 1. Prepare Ozone file system

  1. Using Ozone CLI, create a volume and a bucket where ADH services will store their data:

    $ ozone sh volume create ozone
    $ ozone sh bucket create /ozone/adh

    Verify the new bucket:

    $ hdfs dfs -ls ofs://adho/ozone

    The output:

    drwxrwxrwx   - k_alpashkin_krb1 k_alpashkin_krb1          0 2025-02-05 15:41 ofs://adho/ozone/adh
  2. In the Core configuration service, set Ozone as the default file system using the following property.

    Configuration section  Configuration property  Value
    core-site.xml          fs.defaultFS            ofs://adho

  3. Run the Update Core configuration action of the Core configuration service.
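After the Update Core configuration action completes, you can verify the switch on any host with updated client configs. A sketch, assuming the volume and bucket created above:

```shell
# Should print ofs://adho once the new core-site.xml is in effect
hdfs getconf -confKey fs.defaultFS

# With Ozone as the default FS, relative paths resolve against ofs://adho
hdfs dfs -ls /ozone/adh
```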

Step 2. Configure Spark3

  1. Copy the necessary Spark archives from HDFS to Ozone and create log directories for Spark3 operation:

    $ hdfs dfs -get hdfs://adh/apps/spark/spark3-yarn-archive.tgz
    $ hdfs dfs -mkdir -p ofs://adho/ozone/adh/apps/spark
    $ hdfs dfs -put spark3-yarn-archive.tgz ofs://adho/ozone/adh/apps/spark/spark3-yarn-archive.tgz
    $ hdfs dfs -mkdir -p ofs://adho/ozone/adh/var/log/spark/apps
  2. Configure the Spark3 service to work with Ozone. Namely, update the properties listed in the table.

    Configuration section  Configuration property         Value
    spark-defaults.conf    spark.yarn.archive             ofs://adho/ozone/adh/apps/spark/spark3-yarn-archive.tgz
                           spark.eventLog.dir             ofs://adho/ozone/adh/var/log/spark/apps
                           spark.history.fs.logDirectory  ofs://adho/ozone/adh/var/log/spark/apps

  3. Restart the Spark3 service.
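To smoke-test the new Spark3 configuration, you can submit a sample job and confirm that its event logs land in Ozone. A sketch; the path to the Spark examples JAR is an assumption and may differ in your installation:

```shell
# Submit the bundled SparkPi example to YARN
# (the examples JAR location below is illustrative)
/bin/spark3-submit \
    --master yarn \
    --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    /usr/lib/spark3/examples/jars/spark-examples.jar 10

# Event logs for the finished application should appear in Ozone
hdfs dfs -ls ofs://adho/ozone/adh/var/log/spark/apps
```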

Step 3. Configure Hive

  1. Copy the necessary Hive files from HDFS to Ozone and create a new warehouse directory:

    $ hdfs dfs -get hdfs://adh/apps/tez/tez-0.10.3.tar.gz
    $ hdfs dfs -mkdir -p ofs://adho/ozone/adh/apps/tez
    $ hdfs dfs -put tez-0.10.3.tar.gz ofs://adho/ozone/adh/apps/tez/tez-0.10.3.tar.gz
    $ hdfs dfs -mkdir -p ofs://adho/ozone/adh/apps/hive/warehouse
  2. Configure the Hive service to use the new warehouse location in Ozone.

    Configuration section  Configuration property        Value
    hive-site.xml          hive.metastore.warehouse.dir  ofs://adho/ozone/adh/apps/hive/warehouse

  3. (Optional) Change the storage location of the default Hive database.

    1. On a host with the Hive client installed, open the beeline shell:

      $ beeline
    2. In the beeline shell, connect to Hive using the JDBC string for the kerberized environment, for example:

      !connect jdbc:hive2://ka-adh-1.ru-central1.internal:2181,ka-adh-2.ru-central1.internal:2181,ka-adh-3.ru-central1.internal:2181/;principal=hive/ka-adh-2.ru-central1.internal@AD.RANGER-TEST;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=arenadata/cluster/17/hiveserver2
      TIP
      You can find the up-to-date JDBC string on the Hive service Info page in ADCM (Clusters → <cluster_name> → Services → Hive → Info).
    3. Change the database location by running:

      ALTER DATABASE default SET LOCATION 'ofs://adho/ozone/adh/apps/hive/warehouse';
    4. Verify the new database location:

      DESCRIBE DATABASE EXTENDED default;

      The output:

    +----------+------------------------+--------------------------------------------+-------------------------------------------+-------------+-------------+-----------------+----------------+-------------+
    | db_name  |        comment         |                  location                  |              managedlocation              | owner_name  | owner_type  | connector_name  | remote_dbname  | parameters  |
    +----------+------------------------+--------------------------------------------+-------------------------------------------+-------------+-------------+-----------------+----------------+-------------+
    | default  | Default Hive database  | ofs://adho/ozone/adh/apps/hive/warehouse   | ofs://adho/ozone/adh/apps/hive/warehouse  | public      | ROLE        |                 |                |             |
    +----------+------------------------+--------------------------------------------+-------------------------------------------+-------------+-------------+-----------------+----------------+-------------+
  4. Restart the Hive service.
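To verify that new Hive tables are created in Ozone, you can create a throwaway table and inspect its location. A sketch; the smoke_test table name is illustrative, and <jdbc_connection_string> stands for the JDBC string from the Hive service Info page in ADCM:

```shell
# Create a test table and print its metadata;
# the Location field should point to ofs://adho/ozone/adh/apps/hive/warehouse
beeline -u "<jdbc_connection_string>" \
    -e "CREATE TABLE IF NOT EXISTS smoke_test (id INT); DESCRIBE FORMATTED smoke_test;"
```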

Step 4. Configure YARN

  1. Create the necessary directories for the YARN service:

    $ hdfs dfs -mkdir -p ofs://adho/ozone/adh/system/yarn/node-labels
  2. Configure YARN to work with Ozone. Namely, specify the properties listed in the table.

    Configuration section  Configuration property               Value
    Custom yarn-site.xml   yarn.node-labels.fs-store.root-dir   ofs://adho/ozone/adh/system/yarn/node-labels
                           yarn.nodemanager.remote-app-log-dir  ofs://adho/ozone/adh/logs

  3. Restart YARN.
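After the restart, you can confirm that YARN writes its data to Ozone. A sketch; note that the aggregated log directory is typically populated only after applications have run with log aggregation enabled:

```shell
# Node labels store created in step 1
hdfs dfs -ls ofs://adho/ozone/adh/system/yarn/node-labels

# Aggregated application logs (appears after jobs have run)
hdfs dfs -ls ofs://adho/ozone/adh/logs
```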

After completing the steps above, the Hive, Spark3, and Impala services store their data in Ozone rather than HDFS by default. The same applies to temporary and user files generated by YARN during application execution.

Use Ozone and HDFS

Hive

At the same time, the services configured to work with Ozone can still interact with HDFS. For example, all Hive tables originally created in HDFS before the transition to Ozone remain accessible. To create new Hive objects in HDFS, specify the location explicitly. For example:

CREATE DATABASE IF NOT EXISTS db_for_hdfs
COMMENT 'This is a database in HDFS.'
LOCATION 'hdfs://adh/apps/hive/warehouse/db_for_hdfs.db';
NOTE
All tables created in the db_for_hdfs database will be stored in HDFS by default.

Check the new database location:

DESCRIBE DATABASE EXTENDED db_for_hdfs;

The output:

+--------------+------------------------------+-----------------------------------------------+---------------------------------------------------------+-------------+-------------+-----------------+----------------+-------------+
|   db_name    |           comment            |                   location                    |                     managedlocation                     | owner_name  | owner_type  | connector_name  | remote_dbname  | parameters  |
+--------------+------------------------------+-----------------------------------------------+---------------------------------------------------------+-------------+-------------+-----------------+----------------+-------------+
| db_for_hdfs  | This is a database in HDFS.  | hdfs://adh/apps/hive/warehouse/db_for_hdfs.db | ofs://adho/ozone/adh/apps/hive/warehouse/db_for_hdfs.db | hive        | USER        |                 |                |             |
+--------------+------------------------------+-----------------------------------------------+---------------------------------------------------------+-------------+-------------+-----------------+----------------+-------------+
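To confirm that tables created in db_for_hdfs physically land in HDFS, you can create a test table and list the warehouse directory. A sketch; the demo_tbl table name is illustrative, and <jdbc_connection_string> stands for the JDBC string from the Hive service Info page in ADCM:

```shell
# Create a table in the HDFS-backed database
beeline -u "<jdbc_connection_string>" \
    -e "CREATE TABLE db_for_hdfs.demo_tbl (id INT);"

# The table directory should appear under the HDFS warehouse path
hdfs dfs -ls hdfs://adh/apps/hive/warehouse/db_for_hdfs.db
```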

Spark3

Having configured your kerberized Spark cluster to work with Ozone, you can still run Spark jobs that interact with HDFS. However, for this you have to explicitly instruct Spark to obtain Kerberos delegation tokens for the HDFS namespaces it needs to access. You can do this using the spark.kerberos.access.hadoopFileSystems parameter, which accepts a comma-separated list of kerberized Hadoop file systems your Spark application is going to access. For example:

/bin/spark3-submit \
    --deploy-mode cluster \
    --master yarn \
    --conf spark.kerberos.access.hadoopFileSystems=hdfs://adh \
    test.py
NOTE

When using Ozone, Spark3 writes staging data to the /user/{username} Ozone bucket. Keep in mind that in a kerberized environment, {username} resolves to the name of the Kerberos principal. Thus, if the user name used to run a Spark application does not meet the Ozone bucket naming requirements (for example, contains underscore characters), the Spark job fails. To write Spark staging data to an Ozone bucket created in advance, use the spark.yarn.stagingDir parameter. This parameter can be set in ADCM (Custom spark-defaults.conf) or passed with the --conf option, for example:

/bin/spark3-submit \
    --deploy-mode cluster \
    --master yarn \
    --conf spark.yarn.stagingDir=ofs://adho/ozone/adh/var \
    test.py