Spark4 configuration parameters

To configure the service, use the following configuration parameters in ADCM.

NOTE
  • Some of the parameters become visible in the ADCM UI after the Advanced flag has been set.

  • The parameters that are set in the Custom group will overwrite the existing parameters even if they are read-only.

Common
Parameter Description Default value

Dynamic allocation (spark.dynamicAllocation.enabled)

Defines whether to use dynamic resource allocation, which scales the number of executors registered with this application up and down based on the workload

false
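
Dynamic allocation typically requires the external shuffle service (see spark.shuffle.service.enabled in the spark-defaults.conf group below). A minimal spark-defaults.conf sketch; the executor bounds are illustrative and are not defaults from this table:

  # Scale executors up and down with the workload
  spark.dynamicAllocation.enabled        true
  # Preserve shuffle files so idle executors can be removed safely
  spark.shuffle.service.enabled          true
  # Illustrative bounds; tune to your workload
  spark.dynamicAllocation.minExecutors   1
  spark.dynamicAllocation.maxExecutors   10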

Credential Encryption
Parameter Description Default value

Encryption enable

Enables or disables the credential encryption feature. When enabled, Spark4 stores configuration passwords and credentials required for interacting with other services in encrypted form

false

Credential provider path

Path to a keystore file with secrets

jceks://hdfs/apps/spark/security/spark4.jceks

Custom jceks

Set to true to use a custom JCEKS file. Set to false to use the default auto-generated JCEKS file

false
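
If Custom jceks is set to true, the keystore at the Credential provider path must already exist. A sketch of creating and inspecting such a keystore with the standard Hadoop credential CLI; the alias name is an example, not something this document prescribes:

  # Add a secret under an example alias to the keystore configured above
  hadoop credential create example.password.alias \
    -provider jceks://hdfs/apps/spark/security/spark4.jceks

  # Verify which aliases the keystore contains
  hadoop credential list -provider jceks://hdfs/apps/spark/security/spark4.jceks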

spark-defaults.conf
Parameter Description Default value

spark.yarn.archive

Archive containing all the required Spark JARs for distribution to the YARN cache. If set, this configuration replaces spark.yarn.jars and the archive is used in all the application containers. The archive should contain JAR files in its root directory. The archive can also be hosted on HDFS to speed up file distribution

hdfs:///apps/spark/spark4-yarn-archive.tgz

spark.yarn.appMasterEnv.JAVA_HOME

Value of JAVA_HOME for YARN Application Master

/usr/lib/jvm/java-arenadata-openjdk-17

spark.executorEnv.JAVA_HOME

Value of JAVA_HOME for Executor processes

/usr/lib/jvm/java-arenadata-openjdk-17

spark.yarn.historyServer.address

Spark History Server address

 — 

spark.master

Cluster manager to connect to

yarn

spark.dynamicAllocation.enabled

Defines whether to use dynamic resource allocation, which scales the number of executors registered with this application up and down based on the workload

false

spark.shuffle.service.enabled

Enables the external shuffle service. This service preserves the shuffle files written by executors so that executors can be safely removed, or so that shuffle fetches can continue in the event of executor failure. The external shuffle service must be set up in order to enable it

false

spark.eventLog.enabled

Defines whether to log Spark events, useful for reconstructing the Web UI after the application has finished

true

spark.eventLog.dir

Base directory where Spark events are logged, if spark.eventLog.enabled=true. Within this base directory, Spark creates a sub-directory for each application, and logs the events specific to the application in this directory. You may want to set this to a unified location like an HDFS directory so history files can be read by the History Server

hdfs:///var/log/spark4/apps

spark.dynamicAllocation.executorIdleTimeout

If dynamic allocation is enabled and an executor has been idle for more than this duration, the executor will be removed. For more details, see the Spark documentation

120s

spark.dynamicAllocation.cachedExecutorIdleTimeout

If dynamic allocation is enabled and an executor which has cached data blocks has been idle for more than this duration, the executor will be removed. For more details, see the Spark documentation

600s

spark.history.provider

Name of the class that implements the application history backend. Currently there is only one implementation provided with Spark that looks for application logs stored in the file system

org.apache.spark.deploy.history.FsHistoryProvider

spark.history.fs.cleaner.enabled

Specifies whether the History Server should periodically clean up event logs from storage

true

spark.history.store.path

A local directory in which to cache application history data. If set, the History Server stores application data on disk instead of keeping it in memory. The data written to disk is re-used when the History Server restarts

/var/log/spark4/history

spark.serializer

Class used for serializing objects that will be sent over the network or need to be cached in serialized form. The default Java serialization works with any serializable Java object but may be quite slow, so using org.apache.spark.serializer.KryoSerializer and configuring Kryo serialization is recommended when speed is necessary. Can be any subclass of org.apache.spark.Serializer

org.apache.spark.serializer.KryoSerializer

spark.driver.extraClassPath

Extra classpath entries to be added to the classpath of the driver

  • /usr/lib/hive/lib/hive-shims-scheduler.jar

  • /usr/lib/hadoop-yarn/hadoop-yarn-server-resourcemanager.jar

  • /usr/lib/spark4/jars/adb-spark-connector-assembly-release-1.2.0-spark-4.0.1_arenadata1.jar

  • /usr/lib/spark4/jars/adqm-spark-connector-assembly-release-1.1.0-spark-4.0.1_arenadata1.jar

spark.executor.extraClassPath

Extra classpath entries to add to the classpath of the executors

  • /usr/lib/spark4/jars/adb-spark-connector-assembly-release-1.2.0-spark-4.0.1_arenadata1.jar

  • /usr/lib/spark4/jars/adqm-spark-connector-assembly-release-1.1.0-spark-4.0.1_arenadata1.jar

spark.history.ui.port

Port number of the History Server web UI

18094

spark.ui.port

Port number of the Spark web UI

4150

spark.history.fs.logDirectory

Log directory of the History Server

hdfs:///var/log/spark4/apps

spark.sql.extensions

A comma-separated list of Iceberg SQL extension classes

org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

spark.sql.catalog.spark_catalog

Iceberg catalog implementation class

org.apache.iceberg.spark.SparkSessionCatalog

spark.sql.hive.metastore.jars

Location of the JARs that should be used to instantiate HiveMetastoreClient

path

spark.sql.hive.metastore.jars.path

A list of comma-separated paths to JARs used to instantiate HiveMetastoreClient

file:///usr/lib/hive/lib/*.jar

spark.driver.extraLibraryPath

Path to extra native libraries for driver

/usr/lib/hadoop/lib/native/

spark.yarn.am.extraLibraryPath

Path to extra native libraries for Application Master

/usr/lib/hadoop/lib/native/

spark.executor.extraLibraryPath

Path to extra native libraries for Executor

/usr/lib/hadoop/lib/native/

spark.yarn.appMasterEnv.HIVE_CONF_DIR

A directory on the Application Master with the Hive configuration files required for running Hive in cluster mode

/etc/spark4/conf

spark.yarn.historyServer.allowTracking

Allows using the Spark History Server as the tracking UI even if the web UI is disabled for a job

true

spark.connect.grpc.binding.port

Port number to connect to Spark Connect via gRPC

15012

spark.artifactory.dir.path

Path to an artifact directory used by Spark Connect

tmp

spark.sql.security.confblacklist

Prevents the listed parameters from being overridden at the application level, for example for information security reasons

spark.sql.extensions

spark.history.kerberos.enabled

Indicates whether the History Server should use Kerberos to login. This is required if the History Server is accessing HDFS files on a secure Hyperwave cluster

false

spark.acls.enable

Defines whether Spark ACLs should be enabled. If enabled, checks if the user has access permissions to view or modify jobs. Note: this requires the user to be known. If the user is null, no checks will be made. Filters can be used within the UI to authenticate and set the user

false

spark.modify.acls

Defines who has access to modify a running Spark application

spark,hdfs

spark.modify.acls.groups

A comma-separated list of user groups that have modify access to the Spark application

spark,hdfs

spark.history.ui.acls.enable

Specifies whether ACLs should be checked to authorize users viewing the applications in the History Server. If enabled, access control checks are performed regardless of what the individual applications had set for spark.ui.acls.enable. If disabled, no access control checks are made for any application UIs available through the History Server

false

spark.history.ui.admin.acls

A comma-separated list of users that have view access to all the Spark applications in the History Server

spark,hdfs,dr.who

spark.history.ui.admin.acls.groups

A comma-separated list of groups that have view access to all the Spark applications in the History Server

spark,hdfs,dr.who

spark.ui.view.acls

A comma-separated list of users that have view access to the Spark application. By default, only the user that started the Spark job has view access. Using * as a value means that any user can have view access to this Spark job

spark,hdfs,dr.who

spark.ui.view.acls.groups

A comma-separated list of groups that have view access to the Spark web UI to view the Spark job details. This can be used if you have a set of administrators, developers, or users who need to monitor submitted Spark jobs. Using * in the list means that any user in any group can view the Spark job details in the Spark web UI. The user groups are obtained from the instance of the groups mapping provider specified by spark.user.groups.mapping

spark,hdfs,dr.who

spark.ssl.keyPassword

Password to the private key in the keystore

 — 

spark.ssl.keyStore

Path to the keystore file. The path can be absolute or relative to the directory in which the process is started

 — 

spark.ssl.keyStoreType

Type of keystore used

JKS

spark.ssl.trustStorePassword

Password to the truststore

 — 

spark.ssl.trustStoreType

Type of the truststore

JKS

spark.ssl.enabled

Defines whether to use SSL for Spark

 — 

spark.ssl.protocol

Defines the TLS protocol to use. The protocol must be supported by the JVM

TLSv1.2

spark.ssl.ui.port

Port number used by the Spark web UI when SSL is enabled

4151

spark.ssl.historyServer.port

Port number used by the Spark History Server web UI when SSL is enabled

18094

spark.executorEnv.PYTHONPATH

Value of the PYTHONPATH environment variable for the Executor processes

./pyspark.zip:./py4j.zip

spark.yarn.appMasterEnv.PYTHONPATH

Value of the PYTHONPATH environment variable for Application Master

./pyspark.zip:./py4j.zip

spark.yarn.dist.archives

Comma-separated list of archives to be extracted into the working directory of each Executor

hdfs:///apps/spark4/pyspark.zip#pyspark.zip,hdfs:///apps/spark4/py4j.zip#py4j.zip
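
All values in this group end up in spark-defaults.conf and can be overridden per job from the command line. A sketch, assuming a hypothetical application app.py:

  # --conf values take precedence over spark-defaults.conf for this job
  spark-submit \
    --master yarn \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.eventLog.dir=hdfs:///var/log/spark4/apps \
    app.py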

Custom log4j.properties
Parameter Description Default value

Spark4 spark-log4j2.properties

Stores the Log4j configuration used for logging Spark4’s activity

spark-log4j2.properties
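
A minimal sketch in the style of Spark's bundled log4j2.properties template, routing all output to the console at the INFO level:

  # Log everything at INFO to the console
  rootLogger.level = info
  rootLogger.appenderRef.stdout.ref = console

  appender.console.type = Console
  appender.console.name = console
  appender.console.target = SYSTEM_ERR
  appender.console.layout.type = PatternLayout
  appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n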

Spark heap memory settings
Parameter Description Default value

Spark History Server Heap Memory

Sets the maximum Java heap size for Spark History Server

1G

Spark4 Connect Heap Memory

Sets the maximum Java heap size for a Spark Connect server

1G
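
In stock Spark, daemon heap limits are usually conveyed through environment variables; assuming ADCM maps the fields above the same way (an assumption, not stated in this document), the equivalent spark-history-env.sh entries would be:

  # Assumed equivalent of Spark History Server Heap Memory = 1G
  export SPARK_DAEMON_MEMORY=1g
  # Alternative: pass the JVM flag explicitly
  export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Xmx1g"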

Other
Parameter Description Default value

ad-runtime-utils

Java configuration to be used by the service

 — 

Custom spark-defaults.conf

In this section, you can define values for custom parameters that are not displayed in the ADCM UI but are allowed in the spark-defaults.conf configuration file

 — 

spark-env.sh

Contents of the spark-env.sh file used to initialize environment variables on worker nodes

spark-env.sh

spark-history-env.sh

Contents of the spark-history-env.sh file used to initialize environment variables for the Spark History Server

spark-history-env.sh

Ranger plugin enabled

Enables or disables the Ranger plugin

false
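
A sketch of typical spark-env.sh contents; JAVA_HOME matches the defaults above, while the remaining paths are illustrative assumptions:

  # Matches spark.yarn.appMasterEnv.JAVA_HOME and spark.executorEnv.JAVA_HOME above
  export JAVA_HOME=/usr/lib/jvm/java-arenadata-openjdk-17
  # Illustrative; point these at your actual Hadoop configuration and Python
  export HADOOP_CONF_DIR=/etc/hadoop/conf
  export PYSPARK_PYTHON=python3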

Spark4 Client component
Parameter Description Default value

adb_spark4_connector

Version of the adb-spark4-connector package to be installed

1.2.0_4.0.x

adqm_spark4_connector

Version of the adqm-spark4-connector package to be installed

1.1.0_4.0.x

adh_pyspark

Version of the adh-pyspark package to be installed

3.10.4
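
To verify the installed client packages, you can list the connector JARs under the classpath directory referenced in spark.driver.extraClassPath above and check the PySpark version; a sketch:

  # Connector JAR directory taken from the classpath settings above
  ls /usr/lib/spark4/jars/ | grep -i connector
  pyspark --version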
