Run Spark jobs on Kubernetes
Prerequisites
-
Install an ADH cluster with the HDFS, Hive, and Spark services.
-
Create a separate HDFS directory for loading test data. The user who runs spark-submit should be the owner of this directory:
$ sudo -u hdfs hdfs dfs -mkdir /user/<username>
$ sudo -u hdfs hdfs dfs -chown <username>:<username> /user/<username>
If the Spark3 Ranger plugin is enabled, complete the following steps:
-
Configure Ranger UserSync to pull users and groups from LDAP.
-
In the Spark service’s configuration in Ranger, click Manage Service → Edit Service. In the Add New Configurations section, set the userstore.download.auth.users parameter to *.
Setup
Kubernetes
-
Generate the kubeconfig file on a host with the Spark Client component.
-
Define a service account and bind roles required to run Spark executors (adjust the namespace if needed) in the sa.yaml file.
sa.yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-submit-sa
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-submit-sa-role
  namespace: default
rules:
  - apiGroups:
      - ""
    resources:
      - pods
      - configmaps
      - persistentvolumeclaims
      - services
      - secrets
    verbs:
      - get
      - list
      - watch
      - create
      - update
      - patch
      - delete
      - deletecollection
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-submit-sa-rb
  namespace: default
subjects:
  - kind: ServiceAccount
    name: spark-submit-sa
    namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: spark-submit-sa-role
-
Create the defined service account:
$ kubectl create -f sa.yaml
The expected output is:
serviceaccount/spark-submit-sa created
role.rbac.authorization.k8s.io/spark-submit-sa-role created
rolebinding.rbac.authorization.k8s.io/spark-submit-sa-rb created
-
If SSL is enabled, prepare a secret with the SSL certificates that were generated when SSL was enabled (adjust the namespace if needed). The secret should contain the /etc/ssl/truststore.jks file. Note that truststore.jks is expected to contain all trusted CA certificates.
ssl.yaml
apiVersion: v1
kind: Secret
metadata:
  name: ssl-secret
  namespace: default
data:
  truststore.jks: <base64 content of /etc/ssl/truststore.jks>
$ kubectl create -f ssl.yaml
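The data field of a Kubernetes Secret holds base64-encoded values, so the truststore has to be encoded before it is placed in ssl.yaml. The following is a minimal Python sketch of that step, not a definitive implementation. The helper name render_ssl_secret is hypothetical, and the default secret name and namespace are simply taken from the ssl.yaml example.

```python
import base64


def render_ssl_secret(truststore: bytes, name: str = "ssl-secret",
                      namespace: str = "default") -> str:
    """Render a Secret manifest with the truststore base64-encoded.

    Hypothetical helper: the defaults mirror the ssl.yaml example and
    should be adjusted to match your cluster.
    """
    # Kubernetes expects every value under `data` to be base64 text.
    encoded = base64.b64encode(truststore).decode("ascii")
    return "\n".join([
        "apiVersion: v1",
        "kind: Secret",
        "metadata:",
        f"  name: {name}",
        f"  namespace: {namespace}",
        "data:",
        f"  truststore.jks: {encoded}",
        "",
    ])
```

For example, passing the bytes read from /etc/ssl/truststore.jks to render_ssl_secret returns text that can be saved as ssl.yaml and applied with kubectl create -f ssl.yaml.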
Spark Client hosts
-
If Kerberos is enabled, generate a ticket cache and create a secret for it:
$ kinit -k -c FILE:<krb5_ccache> -t <keytab> <principal>
$ kubectl create secret generic user-krb-cache --from-file=<krb5_ccache>
where:
-
<krb5_ccache> is the path to a ticket cache file.
-
<keytab> is the path to a user keytab file.
-
<principal> is the principal name in the <username>@<realm> format.
-
-
Configure Spark files.
If the Spark3 Ranger plugin is enabled, append the following lines to ranger-spark-security.xml (can be done using the Custom ranger-spark-security.xml Spark parameter in ADCM):
<property>
  <name>ranger.plugin.hive.use.rangerGroups</name>
  <value>True</value>
</property>
<property>
  <name>ranger.plugin.hive.use.only.rangerGroups</name>
  <value>True</value>
</property>
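If these properties are maintained by a script rather than typed by hand, the fragment can be generated with the Python standard library. The sketch below is only an illustration under that assumption. The function name render_properties is hypothetical, and it produces nothing beyond the property elements themselves.

```python
import xml.etree.ElementTree as ET


def render_properties(props: dict[str, str]) -> str:
    """Render one Hadoop-style <property> element per dictionary entry."""
    chunks = []
    for name, value in props.items():
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        ET.indent(prop, space="  ")  # pretty-printing, Python 3.9+
        chunks.append(ET.tostring(prop, encoding="unicode"))
    return "\n".join(chunks)


# The two settings listed in this section.
ranger_fragment = render_properties({
    "ranger.plugin.hive.use.rangerGroups": "True",
    "ranger.plugin.hive.use.only.rangerGroups": "True",
})
```

The resulting ranger_fragment string can then be pasted into the Custom ranger-spark-security.xml parameter.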
Submit a job
The following file, write_hdfs.py, is used as an example job:
from pyspark.sql import SparkSession


def main():
    spark = SparkSession.builder.appName("HDFSWriteExample").getOrCreate()

    # Build a small DataFrame to write.
    data = [("row1", 1), ("row2", 2), ("row3", 3)]
    df = spark.createDataFrame(data, ["col1", "col2"])

    # Write the DataFrame to HDFS as CSV, replacing any existing output.
    hdfs_output_path = "hdfs://hdfs/user/admin/tmp/test/"
    df.write.format("csv").mode("overwrite").save(hdfs_output_path)
    print(f"Data successfully written to {hdfs_output_path}")

    spark.stop()


if __name__ == "__main__":
    main()
Run spark-submit:
$ /bin/spark3-submit \
    --deploy-mode cluster \
    --master <master> \
    --conf spark.kubernetes.container.image=<image> \
    --conf spark.kubernetes.namespace=<namespace> \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=<sa> \
    --conf spark.kubernetes.file.upload.path=<upload_path> \
    write_hdfs.py
where:
-
<master> is a Kubernetes API endpoint accessible from a Spark Client host, e.g. k8s://https://10.92.14.35.
-
<image> is the Spark driver and executor image. It should be unpacked from an offline pack and uploaded to a registry. This parameter should contain a link to the image in the registry, e.g. hub.arenadata.io/adh-enterprise/spark3-docker:3.5.4_arenadata3-adh-4.2.0-x86_64.
-
<namespace> is the Kubernetes namespace in which the driver and executors run, e.g. default.
-
<sa> is the Kubernetes service account defined during the second step.
-
<upload_path> is a path to an HDFS or Ozone directory, e.g. hdfs://hdfs/user/admin/tmp/test/ or ofs://adhom/ozone/adh/apps/spark-submit/upload.
-
Add the following parameters if SSL is enabled:
-
--conf spark.kubernetes.driver.secrets.<ssl-secret>: mounts the SSL secret into the driver, where <ssl-secret> is the secret with SSL data created during the third step.
-
--conf spark.kubernetes.executor.secrets.<ssl-secret>: mounts the SSL secret into the executor, where <ssl-secret> is the secret with SSL data created during the third step.
-
--conf spark.driver.extraJavaOptions: makes the driver use the mounted truststore for all requests, e.g. -Djavax.net.ssl.trustStore=/etc/ssl/truststore.jks -Djavax.net.ssl.trustStorePassword=<password>, where <password> must match the truststore password.
-
-
Add the following parameters if Kerberos is enabled:
-
--conf spark.kerberos.access.hadoopFileSystems: a comma-separated list of available file systems, e.g. hdfs://<hdfs_nameservice>,ofs://<ozone_mgr>/ozone, where <hdfs_nameservice> is the value of the dfs.internal.nameservices HDFS parameter and <ozone_mgr> is the value of the ozone.om.service.ids Ozone Manager parameter.
-
--principal: the Kerberos principal of a user, e.g. user@RU-CENTRAL1.INTERNAL.
-
--keytab: the Kerberos keytab file of a user, e.g. ./user.keytab.
-
--conf spark.kubernetes.kerberos.krb5.path: the path to the krb5.conf file inside the container, e.g. /etc/krb5.conf.
-
--conf spark.kubernetes.driverEnv.KRB5CCNAME: the path to a ticket cache.
-
--conf spark.kubernetes.driverEnv.KRB5PRINCIPAL: the Kerberos principal name in the <username>@<realm> format.
-
--conf spark.kubernetes.driver.secrets.<krb5_ccache_secret>: the mount path for the ticket cache secret created in the first step.
-
You can monitor the job in the Spark History Server web UI.
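When the job is launched from a script, the spark-submit parameters described above can be assembled as an argument list rather than a single shell string. The sketch below covers the base parameters and the optional SSL ones; the Kerberos options are left out for brevity. The function name build_submit_args is hypothetical. The default mount path /etc/ssl is an assumption inferred from the trustStore path in the example above, so adjust it if your secret is mounted elsewhere.

```python
from typing import List, Optional


def build_submit_args(
    master: str,
    image: str,
    namespace: str,
    service_account: str,
    upload_path: str,
    app: str,
    ssl_secret: Optional[str] = None,
    ssl_mount_path: str = "/etc/ssl",  # assumption, see the note above
    truststore_password: Optional[str] = None,
) -> List[str]:
    """Return the spark3-submit argument list for a cluster-mode job."""
    conf = {
        "spark.kubernetes.container.image": image,
        "spark.kubernetes.namespace": namespace,
        "spark.kubernetes.authenticate.driver.serviceAccountName": service_account,
        "spark.kubernetes.file.upload.path": upload_path,
    }
    if ssl_secret is not None:
        # Mount the secret into both the driver and the executor pods.
        conf[f"spark.kubernetes.driver.secrets.{ssl_secret}"] = ssl_mount_path
        conf[f"spark.kubernetes.executor.secrets.{ssl_secret}"] = ssl_mount_path
        # Point the driver JVM at the mounted truststore.
        opts = f"-Djavax.net.ssl.trustStore={ssl_mount_path}/truststore.jks"
        if truststore_password is not None:
            opts += f" -Djavax.net.ssl.trustStorePassword={truststore_password}"
        conf["spark.driver.extraJavaOptions"] = opts

    args = ["/bin/spark3-submit", "--deploy-mode", "cluster", "--master", master]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args
```

The returned list can be passed to subprocess.run(args, check=True). Keeping the arguments as a list avoids shell quoting problems with the space inside the extraJavaOptions value.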