Ozone integration with cluster services
This article describes a use case of integrating the Ozone service with an up-and-running ADH cluster. The scenario shows how to configure ADH services like Spark3, Hive, Impala, and YARN to use Ozone storage instead of HDFS.
After switching to Ozone, the HDFS storage remains accessible and can be used by services along with Ozone.
Test cluster
The use case assumes the following environment:
- An ADH cluster 3.3.6.2 or higher is installed and is kerberized. The cluster includes the following services:
  - Core configuration
  - ZooKeeper
  - HDFS
  - YARN
  - ADPG
  - Hive
  - Impala
  - Spark3
- The Ozone service is added to the running ADH cluster. After Ozone is installed, run the Update Core configuration action of the Core configuration service and restart the following services:
  - YARN
  - Hive
  - Impala
  - Spark3
  This is required to pull the necessary Ozone dependencies into the classpath.
- The HDFS/Ozone services use the namespaces described in the following table.
Service/Component                Configuration property     Value
HDFS                             dfs.internal.nameservices  adh
Ozone                            ozone.service.id           adho
Ozone Manager                    ozone.om.service.ids       adho
Ozone Storage Container Manager  ozone.scm.service.ids      adho
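For reference, a service ID like adho resolves to concrete Ozone Manager nodes through additional ozone-site.xml properties of the form ozone.om.nodes.<service-id> and ozone.om.address.<service-id>.<node-id>. The fragment below is a sketch only; the node IDs and hostname are placeholders, not values from this cluster:

```xml
<!-- Sketch: how the "adho" service ID maps to hypothetical OM nodes -->
<property>
  <name>ozone.om.service.ids</name>
  <value>adho</value>
</property>
<property>
  <name>ozone.om.nodes.adho</name>
  <value>om1,om2,om3</value>
</property>
<property>
  <name>ozone.om.address.adho.om1</name>
  <value>host1.example.com</value>
</property>
```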
Configure ADH services to use Ozone
Step 1. Prepare the Ozone file system
- Using the Ozone CLI, create a volume and a bucket where ADH services will store their data:
$ ozone sh volume create ozone
$ ozone sh bucket create /ozone/adh
Verify the new bucket:
$ hdfs dfs -ls ofs://adho/ozone
The output:
drwxrwxrwx - k_alpashkin_krb1 k_alpashkin_krb1 0 2025-02-05 15:41 ofs://adho/ozone/adh
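All Ozone paths in this article follow the scheme ofs://<service-id>/<volume>/<bucket>/<key>, where adho is the ozone.service.id configured above, ozone is the volume, and adh is the bucket. The snippet below is a hypothetical helper for illustration only, not part of any Ozone client library:

```python
# Hypothetical helper: shows how the ofs:// paths used in this article
# are composed from the service ID, volume, bucket, and key parts.
def ofs_path(service_id: str, volume: str, bucket: str, *key_parts: str) -> str:
    """Build a fully qualified ofs:// path."""
    return "/".join([f"ofs://{service_id}", volume, bucket, *key_parts])

# The bucket created above:
print(ofs_path("adho", "ozone", "adh"))  # ofs://adho/ozone/adh
# The Spark log directory used later in this article:
print(ofs_path("adho", "ozone", "adh", "var", "log", "spark", "apps"))
```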
- In the Core configuration service, set Ozone as the default file system using the following property.

Configuration section  Configuration property  Value
core-site.xml          fs.defaultFS            ofs://adho
- Run the Update Core configuration action of the Core configuration service.
Step 2. Configure Spark3
- Copy the necessary Spark archives from HDFS to Ozone and create log directories for Spark3 operation:
$ hdfs dfs -get hdfs://adh/apps/spark/spark3-yarn-archive.tgz
$ hdfs dfs -mkdir -p ofs://adho/ozone/adh/apps/spark
$ hdfs dfs -put spark3-yarn-archive.tgz ofs://adho/ozone/adh/apps/spark/spark3-yarn-archive.tgz
$ hdfs dfs -mkdir -p ofs://adho/ozone/adh/var/log/spark/apps
- Configure the Spark3 service to work with Ozone by updating the properties listed in the table.

Configuration section  Configuration property         Value
spark-defaults.conf    spark.yarn.archive             ofs://adho/ozone/adh/apps/spark/spark3-yarn-archive.tgz
                       spark.eventLog.dir             ofs://adho/ozone/adh/var/log/spark/apps
                       spark.history.fs.logDirectory  ofs://adho/ozone/adh/var/log/spark/apps
- Restart the Spark3 service.
Step 3. Configure Hive
- Copy the necessary Hive files from HDFS to Ozone and create a new warehouse directory:
$ hdfs dfs -get hdfs://adh/apps/tez/tez-0.10.3.tar.gz
$ hdfs dfs -mkdir -p ofs://adho/ozone/adh/apps/tez
$ hdfs dfs -put tez-0.10.3.tar.gz ofs://adho/ozone/adh/apps/tez/tez-0.10.3.tar.gz
$ hdfs dfs -mkdir -p ofs://adho/ozone/adh/apps/hive/warehouse
- Configure the Hive service to use the new warehouse location in Ozone.

Configuration section  Configuration property        Value
hive-site.xml          hive.metastore.warehouse.dir  ofs://adho/ozone/adh/apps/hive/warehouse
- (Optional) Change the storage location of the default Hive database:
  - On a host with the Hive client installed, open the beeline shell:
    $ beeline
  - In the beeline shell, connect to Hive using the JDBC string for the kerberized environment, for example:
    !connect jdbc:hive2://ka-adh-1.ru-central1.internal:2181,ka-adh-2.ru-central1.internal:2181,ka-adh-3.ru-central1.internal:2181/;principal=hive/ka-adh-2.ru-central1.internal@AD.RANGER-TEST;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=arenadata/cluster/17/hiveserver2
    TIP: You can find the up-to-date JDBC string on the Hive service Info page in ADCM (Clusters → <cluster_name> → Services → Hive → Info).
  - Change the database location by running:
ALTER DATABASE default SET LOCATION 'ofs://adho/ozone/adh/apps/hive/warehouse';
  - Verify the new database location:
DESCRIBE DATABASE EXTENDED default;
The output:
+----------+------------------------+-------------------------------------------+-------------------------------------------+-------------+-------------+-----------------+----------------+-------------+
| db_name  | comment                | location                                  | managedlocation                           | owner_name  | owner_type  | connector_name  | remote_dbname  | parameters  |
+----------+------------------------+-------------------------------------------+-------------------------------------------+-------------+-------------+-----------------+----------------+-------------+
| default  | Default Hive database  | ofs://adho/ozone/adh/apps/hive/warehouse  | ofs://adho/ozone/adh/apps/hive/warehouse  | public      | ROLE        |                 |                |             |
+----------+------------------------+-------------------------------------------+-------------------------------------------+-------------+-------------+-----------------+----------------+-------------+
- Restart the Hive service.
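The ZooKeeper-discovery JDBC string used in the beeline step above has a regular structure: a ZooKeeper quorum, a Kerberos principal, and a discovery namespace. The helper below is hypothetical and purely illustrative (it is not part of any Hive client library):

```python
# Hypothetical helper illustrating the structure of a HiveServer2 JDBC URL
# that uses ZooKeeper service discovery in a kerberized environment.
def hive_zk_jdbc_url(zk_hosts, principal, zk_namespace, zk_port=2181):
    """Assemble a HiveServer2 JDBC URL with ZooKeeper service discovery."""
    quorum = ",".join(f"{host}:{zk_port}" for host in zk_hosts)
    return (f"jdbc:hive2://{quorum}/;principal={principal};"
            f"serviceDiscoveryMode=zooKeeper;"
            f"zooKeeperNamespace={zk_namespace}")

# Reproduces the JDBC string from the beeline step above:
print(hive_zk_jdbc_url(
    ["ka-adh-1.ru-central1.internal",
     "ka-adh-2.ru-central1.internal",
     "ka-adh-3.ru-central1.internal"],
    principal="hive/ka-adh-2.ru-central1.internal@AD.RANGER-TEST",
    zk_namespace="arenadata/cluster/17/hiveserver2",
))
```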
Step 4. Configure YARN
- Create the necessary directories for the YARN service:
$ hdfs dfs -mkdir -p ofs://adho/ozone/adh/system/yarn/node-labels
- Configure YARN to work with Ozone by specifying the properties listed in the table.

Configuration section  Configuration property               Value
Custom yarn-site.xml   yarn.node-labels.fs-store.root-dir   ofs://adho/ozone/adh/system/yarn/node-labels
                       yarn.nodemanager.remote-app-log-dir  ofs://adho/ozone/adh/logs
- Restart YARN.
After completing the steps above, the Hive, Spark3, and Impala services store their data in Ozone rather than HDFS by default. Temporary and user files generated by YARN during application execution are also stored in Ozone.
Use Ozone and HDFS
Hive
At the same time, the services configured to work with Ozone can still interact with HDFS. For example, all Hive tables originally created in HDFS before the transition to Ozone remain accessible. To create new Hive objects in HDFS, specify the location explicitly. For example:
CREATE DATABASE IF NOT EXISTS db_for_hdfs
COMMENT 'This is a database in HDFS.'
LOCATION 'hdfs://adh/apps/hive/warehouse/db_for_hdfs.db';
NOTE
All tables created in the db_for_hdfs database will be stored in HDFS by default.
Check the new database location:
DESCRIBE DATABASE EXTENDED db_for_hdfs;
The output:
+--------------+------------------------------+------------------------------------------------+----------------------------------------------------------+-------------+-------------+-----------------+----------------+-------------+
| db_name      | comment                      | location                                       | managedlocation                                          | owner_name  | owner_type  | connector_name  | remote_dbname  | parameters  |
+--------------+------------------------------+------------------------------------------------+----------------------------------------------------------+-------------+-------------+-----------------+----------------+-------------+
| db_for_hdfs  | This is a database in HDFS.  | hdfs://adh/apps/hive/warehouse/db_for_hdfs.db  | ofs://adho/ozone/adh/apps/hive/warehouse/db_for_hdfs.db  | hive        | USER        |                 |                |             |
+--------------+------------------------------+------------------------------------------------+----------------------------------------------------------+-------------+-------------+-----------------+----------------+-------------+
Spark3
Having configured your kerberized Spark cluster to work with Ozone, you can still run Spark jobs that interact with HDFS. However, for this, you have to explicitly instruct Spark to obtain Kerberos delegation tokens for the required HDFS directories. You can do this using the spark.kerberos.access.hadoopFileSystems parameter, which accepts a comma-separated list of kerberized Hadoop file systems that your Spark application is going to access. For example:
$ /bin/spark3-submit \
    --deploy-mode cluster \
    --master yarn \
    --conf spark.kerberos.access.hadoopFileSystems=hdfs://adh \
    test.py
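The contents of test.py are not shown in this article; the sketch below is a hypothetical example of such a job. The pure helper at the top builds the parameter value, while the main() part assumes a PySpark installation and uses the hdfs://adh path from this article's environment:

```python
# Hypothetical sketch of a test.py job: reads from HDFS while the
# cluster's default file system is Ozone (ofs://adho).

def kerberos_fs_conf(*filesystems: str) -> str:
    """Build the comma-separated value for spark.kerberos.access.hadoopFileSystems."""
    return ",".join(filesystems)

def main():
    from pyspark.sql import SparkSession  # requires PySpark at runtime
    spark = (SparkSession.builder
             .appName("hdfs-access-from-ozone-cluster")
             .config("spark.kerberos.access.hadoopFileSystems",
                     kerberos_fs_conf("hdfs://adh"))
             .getOrCreate())
    # The hdfs:// scheme must be explicit, since fs.defaultFS points to ofs://adho
    spark.read.text("hdfs://adh/apps/hive/warehouse").show()
    spark.stop()

# To run under spark3-submit, add:
#     if __name__ == "__main__":
#         main()
```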
NOTE
When using Ozone, Spark3 writes staging data to the /user/{username} Ozone bucket.
Keep in mind that in a kerberized environment,