Run Spark jobs on Kubernetes

Sergei Tikhomirov

Contents

Prerequisites
Setup
- Kubernetes
- Spark Client hosts
Submit a job

Prerequisites

Install an ADH cluster with the HDFS, Hive, and Spark services.
Create a separate HDFS directory for loading test data. The user who runs spark-submit should be the owner of this directory:
```
$ sudo -u hdfs hdfs dfs -mkdir /user/<username>
$ sudo -u hdfs hdfs dfs -chown <username>:<username> /user/<username>
```

If the Spark3 Ranger plugin is enabled, complete the following steps:

Configure Ranger UserSync to pull users and groups from LDAP.
In the Spark service’s configuration in Ranger, click Manage Service → Edit Service. In the Add New Configurations section, set the userstore.download.auth.users parameter to *.

Setup

Kubernetes

Generate the kubeconfig file on a host with the Spark Client component.

In the sa.yaml file, define a service account and bind the roles required to run Spark executors (adjust the namespace if needed).

sa.yaml

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-submit-sa
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-submit-sa-role
  namespace: default
rules:
  - apiGroups:
      - ""
    resources:
      - pods
      - configmaps
      - persistentvolumeclaims
      - services
      - secrets
    verbs:
      - get
      - list
      - watch
      - create
      - update
      - patch
      - delete
      - deletecollection
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-submit-sa-rb
  namespace: default
subjects:
  - kind: ServiceAccount
    name: spark-submit-sa
    namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: spark-submit-sa-role

Create the defined service account:

$ kubectl create -f sa.yaml

The expected output is:

serviceaccount/spark-submit-sa created
role.rbac.authorization.k8s.io/spark-submit-sa-role created
rolebinding.rbac.authorization.k8s.io/spark-submit-sa-rb created
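
As an optional check that is not part of the original procedure, the bound permissions can be probed with `kubectl auth can-i`. The sketch below only builds and runs those commands. It assumes the `spark-submit-sa` account and `default` namespace from sa.yaml, and it assumes that the current kubeconfig user is allowed to impersonate service accounts. The helper names `can_i_commands` and `missing_permissions` are hypothetical.

```python
import subprocess
from typing import List

# Resources and verbs copied from the Role in sa.yaml.
RESOURCES = ["pods", "configmaps", "persistentvolumeclaims", "services", "secrets"]
VERBS = ["get", "list", "watch", "create", "update", "patch", "delete", "deletecollection"]


def can_i_commands(service_account: str = "spark-submit-sa",
                   namespace: str = "default") -> List[List[str]]:
    """Build one `kubectl auth can-i` command per (verb, resource) pair."""
    user = f"system:serviceaccount:{namespace}:{service_account}"
    return [
        ["kubectl", "auth", "can-i", verb, resource,
         f"--as={user}", f"--namespace={namespace}"]
        for resource in RESOURCES
        for verb in VERBS
    ]


def missing_permissions(commands: List[List[str]]) -> List[str]:
    """Run each command and return the "verb resource" pairs that are not allowed."""
    denied = []
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True, text=True)
        # kubectl prints "yes" when the action is allowed.
        if result.stdout.strip() != "yes":
            denied.append(f"{cmd[3]} {cmd[4]}")
    return denied
```

Calling `missing_permissions(can_i_commands())` on a host with a working kubeconfig should return an empty list if the role and binding were applied as defined above.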

If SSL is enabled, prepare a secret with the SSL certificates that were created when SSL was enabled (adjust the namespace if needed). The secret should contain the /etc/ssl/truststore.jks file, and truststore.jks is expected to include all trusted CA certificates.

ssl.yaml

---
apiVersion: v1
kind: Secret
metadata:
  name: ssl-secret
  namespace: default
data:
  truststore.jks: <base64 content of /etc/ssl/truststore.jks>

Create the secret:
```
$ kubectl create -f ssl.yaml
```
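
The `data` field of a Kubernetes Secret holds base64-encoded values, so the truststore has to be encoded before it goes into ssl.yaml. The following is a minimal sketch of one way to render the manifest from the file contents. The function names are hypothetical, and the default secret name and namespace are taken from the example above.

```python
import base64


def render_ssl_secret(truststore_bytes: bytes,
                      name: str = "ssl-secret",
                      namespace: str = "default") -> str:
    """Return a Secret manifest with the truststore embedded as base64."""
    encoded = base64.b64encode(truststore_bytes).decode("ascii")
    return (
        "apiVersion: v1\n"
        "kind: Secret\n"
        "metadata:\n"
        f"  name: {name}\n"
        f"  namespace: {namespace}\n"
        "data:\n"
        f"  truststore.jks: {encoded}\n"
    )


def write_ssl_manifest(truststore_path: str = "/etc/ssl/truststore.jks",
                       manifest_path: str = "ssl.yaml") -> None:
    """Read the truststore from disk and write the rendered manifest."""
    with open(truststore_path, "rb") as src:
        manifest = render_ssl_secret(src.read())
    with open(manifest_path, "w", encoding="ascii") as dst:
        dst.write(manifest)
```

Running `write_ssl_manifest()` on a host that has the truststore produces an ssl.yaml that can be passed to the `kubectl create -f ssl.yaml` command shown above.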

Spark Client hosts

If Kerberos is enabled, generate a ticket cache and create a secret for it:
```
$ kinit -k -c FILE:<krb5_ccache> -t <keytab> <principal>
$ kubectl create secret generic user-krb-cache --from-file=<krb5_ccache>
```
where:
- <krb5_ccache> — path to a ticket cache file.
- <keytab> — path to a user keytab file.
- <principal> — principal name in the <username>@<realm> format.

Configure Spark files.

If the Spark3 Ranger plugin is enabled, append the following lines to ranger-spark-security.xml (can be done using the Custom ranger-spark-security.xml Spark parameter in ADCM):

<property>
	<name>ranger.plugin.hive.use.rangerGroups</name>
	<value>True</value>
</property>
<property>
	<name>ranger.plugin.hive.use.only.rangerGroups</name>
	<value>True</value>
</property>

Submit a job

The following file is used as an example job:

write_hdfs.py

from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("HDFSWriteExample").getOrCreate()

    data = [("row1", 1), ("row2", 2), ("row3", 3)]
    df = spark.createDataFrame(data, ["col1", "col2"])

    hdfs_output_path = "hdfs://hdfs/user/admin/tmp/test/"

    df.write.format("csv").mode("overwrite").save(hdfs_output_path)

    print(f"Data successfully written to {hdfs_output_path}")

    spark.stop()

if __name__ == "__main__":
    main()

Run spark-submit:

$ /bin/spark3-submit \
    --deploy-mode cluster \
    --master <master> \
    --conf spark.kubernetes.container.image=<image> \
    --conf spark.kubernetes.namespace=<namespace> \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=<sa> \
    --conf spark.kubernetes.file.upload.path=<upload_path> \
    write_hdfs.py

where:

<master> is the Kubernetes API endpoint accessible from a Spark Client host, e.g. k8s://https://10.92.14.35.
<image> is the Spark driver and executor image. Unpack it from an offline pack, upload it to a registry, and provide the registry link to the image in this parameter, e.g. hub.arenadata.io/adh-enterprise/spark3-docker:3.5.4_arenadata3-adh-4.2.0-x86_64.
<namespace> is the Kubernetes namespace in which the driver and executors run, e.g. default.
<sa> is the Kubernetes service account defined in the second step of the Kubernetes setup.
<upload_path> is a path to an HDFS or Ozone directory, e.g. hdfs://hdfs/user/admin/tmp/test/ or ofs://adhom/ozone/adh/apps/spark-submit/upload.
Add the following parameters if SSL is enabled:
- --conf spark.kubernetes.driver.secrets.<ssl-secret>: mounts the SSL secret into the driver, where <ssl-secret> is the secret with SSL data created in the third step of the Kubernetes setup.
- --conf spark.kubernetes.executor.secrets.<ssl-secret>: mounts the SSL secret into the executors, where <ssl-secret> is the same secret.
- --conf spark.driver.extraJavaOptions: makes the driver use the mounted truststore for all requests, e.g. -Djavax.net.ssl.trustStore=/etc/ssl/truststore.jks -Djavax.net.ssl.trustStorePassword=<password>, where <password> must match the truststore password.
Add the following parameters if Kerberos is enabled:
- --conf spark.kerberos.access.hadoopFileSystems: a comma-separated list of available file systems, e.g. hdfs://<hdfs_nameservice>,ofs://<ozone_mgr>/ozone, where <hdfs_nameservice> is the value of the dfs.internal.nameservices HDFS parameter and <ozone_mgr> is the value of the ozone.om.service.ids Ozone Manager parameter.
- --principal: the Kerberos principal of a user, e.g. user@RU-CENTRAL1.INTERNAL.
- --keytab: the Kerberos keytab file of a user, e.g. ./user.keytab.
- --conf spark.kubernetes.kerberos.krb5.path: path to the krb5.conf file inside the container, e.g. /etc/krb5.conf.
- --conf spark.kubernetes.driverEnv.KRB5CCNAME: path to a ticket cache.
- --conf spark.kubernetes.driverEnv.KRB5PRINCIPAL: Kerberos principal name in the <username>@<realm> format.
- --conf spark.kubernetes.driver.secrets.<krb5_ccache_secret>: mount path for the ticket cache secret created in the first step of the Spark Client hosts setup.
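
Because the option list grows once SSL is enabled, it can help to assemble the command programmatically. The sketch below covers only the base options and the SSL options described above; the Kerberos options could be added in the same way. The function name `build_submit_command` is hypothetical, and the `/etc/ssl` mount path is an assumption chosen so that the truststore appears at /etc/ssl/truststore.jks inside the containers.

```python
import shlex
from typing import List, Optional


def build_submit_command(master: str,
                         image: str,
                         namespace: str,
                         service_account: str,
                         upload_path: str,
                         app_file: str,
                         ssl_secret: Optional[str] = None,
                         ssl_mount_path: str = "/etc/ssl",  # assumption, see lead-in
                         truststore_password: Optional[str] = None) -> List[str]:
    """Return the spark3-submit argument list for a cluster-mode job."""
    conf = {
        "spark.kubernetes.container.image": image,
        "spark.kubernetes.namespace": namespace,
        "spark.kubernetes.authenticate.driver.serviceAccountName": service_account,
        "spark.kubernetes.file.upload.path": upload_path,
    }
    if ssl_secret is not None:
        # Mount the same secret into the driver and the executors.
        conf[f"spark.kubernetes.driver.secrets.{ssl_secret}"] = ssl_mount_path
        conf[f"spark.kubernetes.executor.secrets.{ssl_secret}"] = ssl_mount_path
        java_opts = f"-Djavax.net.ssl.trustStore={ssl_mount_path}/truststore.jks"
        if truststore_password is not None:
            java_opts += f" -Djavax.net.ssl.trustStorePassword={truststore_password}"
        conf["spark.driver.extraJavaOptions"] = java_opts

    argv = ["/bin/spark3-submit", "--deploy-mode", "cluster", "--master", master]
    for key, value in conf.items():
        argv += ["--conf", f"{key}={value}"]
    argv.append(app_file)
    return argv


if __name__ == "__main__":
    command = build_submit_command(
        master="k8s://https://10.92.14.35",
        image="<image>",
        namespace="default",
        service_account="spark-submit-sa",
        upload_path="hdfs://hdfs/user/admin/tmp/test/",
        app_file="write_hdfs.py",
        ssl_secret="ssl-secret",
        truststore_password="<password>",
    )
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(command))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting, while `shlex.join` gives a printable form for review.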

You can monitor the job in the Spark History Server web UI.
