Ozone integration with cluster services

This article describes a use case of integrating the Ozone service with an up-and-running ADH cluster. The scenario shows how to configure ADH services like Spark3, Hive, Impala, and YARN to use Ozone storage instead of HDFS.

After switching to Ozone, the HDFS storage remains accessible and can be used by services along with Ozone.

Test cluster

The use case assumes the following environment:

  • An ADH cluster of version 3.3.6.2 or higher is installed and kerberized. The cluster includes the following services:

    • Core configuration

    • ZooKeeper

    • HDFS

    • YARN

    • ADPG

    • Hive

    • Impala

    • Spark3

  • The Ozone service is added to the running ADH cluster. After Ozone is installed, run the Update Core configuration action of the Core configuration service and restart the following services:

    • YARN

    • Hive

    • Impala

    • Spark3

    This is required to add the necessary Ozone dependencies to the classpath of these services.

  • The HDFS and Ozone services use the namespaces described in the following table.

    Service/Component                Configuration property     Value
    HDFS                             dfs.internal.nameservices  adh
    Ozone                            ozone.service.id           adho
    Ozone Manager                    ozone.om.service.ids       adho
    Ozone Storage Container Manager  ozone.scm.service.ids      adho
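After the restarts described above, you can double-check that the Ozone client libraries are actually visible to the Hadoop services. A minimal sketch, assuming the standard `hadoop` CLI is available on the host (the exact JAR names and locations depend on your installation):

```shell
# List the Hadoop classpath entry by entry and filter for Ozone JARs
hadoop classpath | tr ':' '\n' | grep -i ozone
```

If the command prints no Ozone-related entries, re-run the Update Core configuration action and restart the affected services.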

Configure ADH services to use Ozone

Step 1. Prepare Ozone file system

  1. Using Ozone CLI, create a volume and a bucket where ADH services will store their data:

    $ ozone sh volume create ozone
    $ ozone sh bucket create /ozone/adh

    Verify the new bucket:

    $ hdfs dfs -ls ofs://adho/ozone

    The output:

    drwxrwxrwx   - k_alpashkin_krb1 k_alpashkin_krb1          0 2025-02-05 15:41 ofs://adho/ozone/adh
  2. In the Core configuration service, set Ozone as the default file system using the following property.

    Configuration section  Configuration property  Value
    core-site.xml          fs.defaultFS            ofs://adho

  3. Run the Update Core configuration action of the Core configuration service.
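After the Update Core configuration action completes, you can verify the switch on any host with updated client configs. A sketch, assuming the volume and bucket created above:

```shell
# Should print ofs://adho once the new core-site.xml is in effect
hdfs getconf -confKey fs.defaultFS

# With Ozone as the default FS, relative paths resolve against ofs://adho
hdfs dfs -ls /ozone/adh
```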

Step 2. Configure Spark3

  1. Copy the necessary Spark archives from HDFS to Ozone and create log directories for Spark3 operation:

    $ hdfs dfs -get hdfs://adh/apps/spark/spark3-yarn-archive.tgz
    $ hdfs dfs -mkdir -p ofs://adho/ozone/adh/apps/spark
    $ hdfs dfs -put spark3-yarn-archive.tgz ofs://adho/ozone/adh/apps/spark/spark3-yarn-archive.tgz
    $ hdfs dfs -mkdir -p ofs://adho/ozone/adh/var/log/spark/apps
  2. Configure the Spark3 service to work with Ozone. Namely, update the properties listed in the table.

    Configuration section  Configuration property         Value
    spark-defaults.conf    spark.yarn.archive             ofs://adho/ozone/adh/apps/spark/spark3-yarn-archive.tgz
                           spark.eventLog.dir             ofs://adho/ozone/adh/var/log/spark/apps
                           spark.history.fs.logDirectory  ofs://adho/ozone/adh/var/log/spark/apps

  3. Restart the Spark3 service.
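To smoke-test the new Spark3 configuration, you can submit a sample job and confirm that its event logs land in Ozone. A sketch; the path to the Spark examples JAR is an assumption and may differ in your installation:

```shell
# Submit the bundled SparkPi example to YARN
# (the examples JAR location below is illustrative)
/bin/spark3-submit \
    --master yarn \
    --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    /usr/lib/spark3/examples/jars/spark-examples.jar 10

# Event logs for the finished application should appear in Ozone
hdfs dfs -ls ofs://adho/ozone/adh/var/log/spark/apps
```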

Step 3. Configure Hive

  1. Copy the necessary Hive files from HDFS to Ozone and create a new warehouse directory:

    $ hdfs dfs -get hdfs://adh/apps/tez/tez-0.10.3.tar.gz
    $ hdfs dfs -mkdir -p ofs://adho/ozone/adh/apps/tez
    $ hdfs dfs -put tez-0.10.3.tar.gz ofs://adho/ozone/adh/apps/tez/tez-0.10.3.tar.gz
    $ hdfs dfs -mkdir -p ofs://adho/ozone/adh/apps/hive/warehouse
  2. Configure the Hive service to use the new warehouse location in Ozone.

    Configuration section  Configuration property        Value
    hive-site.xml          hive.metastore.warehouse.dir  ofs://adho/ozone/adh/apps/hive/warehouse

  3. (Optional) Change the storage location of the default Hive database.

    1. On a host with the Hive client installed, open the beeline shell:

      $ beeline
    2. In the beeline shell, connect to Hive using the JDBC string for the kerberized environment, for example:

      !connect jdbc:hive2://ka-adh-1.ru-central1.internal:2181,ka-adh-2.ru-central1.internal:2181,ka-adh-3.ru-central1.internal:2181/;principal=hive/ka-adh-2.ru-central1.internal@AD.RANGER-TEST;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=arenadata/cluster/17/hiveserver2
      TIP
      You can find the up-to-date JDBC string on the Hive service Info page in ADCM (Clusters → <cluster_name> → Services → Hive → Info).
    3. Change the database location by running:

      ALTER DATABASE default SET LOCATION 'ofs://adho/ozone/adh/apps/hive/warehouse';
    4. Verify the new database location:

      DESCRIBE DATABASE EXTENDED default;

      The output:

    +----------+------------------------+--------------------------------------------+-------------------------------------------+-------------+-------------+-----------------+----------------+-------------+
    | db_name  |        comment         |                  location                  |              managedlocation              | owner_name  | owner_type  | connector_name  | remote_dbname  | parameters  |
    +----------+------------------------+--------------------------------------------+-------------------------------------------+-------------+-------------+-----------------+----------------+-------------+
    | default  | Default Hive database  | ofs://adho/ozone/adh/apps/hive/warehouse   | ofs://adho/ozone/adh/apps/hive/warehouse  | public      | ROLE        |                 |                |             |
    +----------+------------------------+--------------------------------------------+-------------------------------------------+-------------+-------------+-----------------+----------------+-------------+
  4. Restart the Hive service.
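To verify that new Hive tables are created in Ozone, you can create a throwaway table and inspect its location. A sketch; the smoke_test table name is illustrative, and <jdbc_connection_string> stands for the JDBC string from the Hive service Info page in ADCM:

```shell
# Create a test table and print its metadata;
# the Location field should point to ofs://adho/ozone/adh/apps/hive/warehouse
beeline -u "<jdbc_connection_string>" \
    -e "CREATE TABLE IF NOT EXISTS smoke_test (id INT); DESCRIBE FORMATTED smoke_test;"
```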

Step 4. Configure YARN

  1. Create the necessary directories for the YARN service:

    $ hdfs dfs -mkdir -p ofs://adho/ozone/adh/system/yarn/node-labels
  2. Configure YARN to work with Ozone. Namely, specify the properties listed in the table.

    Configuration section  Configuration property               Value
    Custom yarn-site.xml   yarn.node-labels.fs-store.root-dir   ofs://adho/ozone/adh/system/yarn/node-labels
                           yarn.nodemanager.remote-app-log-dir  ofs://adho/ozone/adh/logs

  3. Restart YARN.
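After the restart, you can confirm that YARN writes its data to Ozone. A sketch; note that the aggregated log directory is typically populated only after applications have run with log aggregation enabled:

```shell
# Node labels store created in step 1
hdfs dfs -ls ofs://adho/ozone/adh/system/yarn/node-labels

# Aggregated application logs (appears after jobs have run)
hdfs dfs -ls ofs://adho/ozone/adh/logs
```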

After completing the steps above, the Hive, Spark3, and Impala services store their data in Ozone rather than HDFS by default. The same applies to temporary and user files generated by YARN during application execution.

Use Ozone and HDFS

Hive

At the same time, the services configured to work with Ozone can still interact with HDFS. For example, all Hive tables originally created in HDFS before the transition to Ozone remain accessible. To create new Hive objects in HDFS, specify the location explicitly. For example:

CREATE DATABASE IF NOT EXISTS db_for_hdfs
COMMENT 'This is a database in HDFS.'
LOCATION 'hdfs://adh/apps/hive/warehouse/db_for_hdfs.db';
NOTE
All tables created in the db_for_hdfs database will be stored in HDFS by default.

Check the new database location:

DESCRIBE DATABASE EXTENDED db_for_hdfs;

The output:

+--------------+------------------------------+-----------------------------------------------+---------------------------------------------------------+-------------+-------------+-----------------+----------------+-------------+
|   db_name    |           comment            |                   location                    |                     managedlocation                     | owner_name  | owner_type  | connector_name  | remote_dbname  | parameters  |
+--------------+------------------------------+-----------------------------------------------+---------------------------------------------------------+-------------+-------------+-----------------+----------------+-------------+
| db_for_hdfs  | This is a database in HDFS.  | hdfs://adh/apps/hive/warehouse/db_for_hdfs.db | ofs://adho/ozone/adh/apps/hive/warehouse/db_for_hdfs.db | hive        | USER        |                 |                |             |
+--------------+------------------------------+-----------------------------------------------+---------------------------------------------------------+-------------+-------------+-----------------+----------------+-------------+
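To confirm that tables created in db_for_hdfs physically land in HDFS, you can create a test table and list the warehouse directory. A sketch; the demo_tbl table name is illustrative, and <jdbc_connection_string> stands for the JDBC string from the Hive service Info page in ADCM:

```shell
# Create a table in the HDFS-backed database
beeline -u "<jdbc_connection_string>" \
    -e "CREATE TABLE db_for_hdfs.demo_tbl (id INT);"

# The table directory should appear under the HDFS warehouse path
hdfs dfs -ls hdfs://adh/apps/hive/warehouse/db_for_hdfs.db
```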

Spark3

Having configured your kerberized Spark cluster to work with Ozone, you can still run Spark jobs that interact with HDFS. However, for this you have to explicitly instruct Spark to obtain Kerberos delegation tokens for the HDFS namespaces it needs to access. You can do this using the spark.kerberos.access.hadoopFileSystems parameter, which accepts a comma-separated list of kerberized Hadoop file systems your Spark application is going to access. For example:

/bin/spark3-submit \
    --deploy-mode cluster \
    --master yarn \
    --conf spark.kerberos.access.hadoopFileSystems=hdfs://adh \
    test.py
NOTE

When using Ozone, Spark3 writes staging data to the /user/{username} Ozone bucket. Keep in mind that in a kerberized environment, {username} resolves to the name of the Kerberos principal. Thus, if the user name used to run a Spark application does not meet the Ozone bucket naming requirements (for example, contains underscore characters), the Spark job fails. To write Spark staging data to an Ozone bucket created in advance, use the spark.yarn.stagingDir parameter. This parameter can be set in ADCM (Custom spark-defaults.conf) or passed with the --conf option, for example:

/bin/spark3-submit \
    --deploy-mode cluster \
    --master yarn \
    --conf spark.yarn.stagingDir=ofs://adho/ozone/adh/var \
    test.py