Run Spark jobs on Kubernetes

Prerequisites

  • Install an ADH cluster with the HDFS, Hive, and Spark services.

  • Create a separate HDFS directory for loading test data. The user who runs spark-submit should be the owner of this directory:

    $ sudo -u hdfs hdfs dfs -mkdir /user/<username>
    $ sudo -u hdfs hdfs dfs -chown <username>:<username> /user/<username>

If the Spark3 Ranger plugin is enabled, complete the following steps:

  1. Configure Ranger UserSync to pull users and groups from LDAP.

  2. In the Spark service’s configuration in Ranger, click Manage Service → Edit Service. In the Add New Configurations section, set the userstore.download.auth.users parameter to *.

Setup

Kubernetes

  1. Generate the kubeconfig file on a host with the Spark Client component.

  2. In the sa.yaml file, define a service account and bind the roles required to run Spark executors (adjust the namespace if needed).

    sa.yaml
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: spark-submit-sa
      namespace: default
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: spark-submit-sa-role
      namespace: default
    rules:
      - apiGroups:
          - ""
        resources:
          - pods
          - configmaps
          - persistentvolumeclaims
          - services
          - secrets
        verbs:
          - get
          - list
          - watch
          - create
          - update
          - patch
          - delete
          - deletecollection
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: spark-submit-sa-rb
      namespace: default
    subjects:
      - kind: ServiceAccount
        name: spark-submit-sa
        namespace: default
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: spark-submit-sa-role
  3. Create the defined service account:

    $ kubectl create -f sa.yaml

    The expected output is:

    serviceaccount/spark-submit-sa created
    role.rbac.authorization.k8s.io/spark-submit-sa-role created
    rolebinding.rbac.authorization.k8s.io/spark-submit-sa-rb created
  4. If SSL is enabled, prepare a secret with the SSL certificates that were created when SSL was enabled (adjust the namespace if needed). The secret should contain the /etc/ssl/truststore.jks file, and truststore.jks is expected to include all trusted CAs.

    ssl.yaml
    apiVersion: v1
    kind: Secret
    metadata:
      name: ssl-secret
      namespace: default
    data:
      truststore.jks: <base64 content of /etc/ssl/truststore.jks>

    Create the secret:

    $ kubectl create -f ssl.yaml
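The data field of a Kubernetes Secret holds base64-encoded values, so the truststore has to be encoded before it goes into ssl.yaml. The following is a minimal sketch of one way to render that manifest. The function name render_ssl_secret and its default arguments are illustrative assumptions and are not part of ADH or kubectl.

```python
import base64
from pathlib import Path


def render_ssl_secret(truststore_path, name="ssl-secret", namespace="default"):
    """Return a Secret manifest with the truststore embedded as base64.

    The name and namespace defaults mirror the ssl.yaml example;
    adjust them to match your cluster.
    """
    raw = Path(truststore_path).read_bytes()
    # Values under `data` must be single-line standard base64.
    encoded = base64.b64encode(raw).decode("ascii")
    return (
        "apiVersion: v1\n"
        "kind: Secret\n"
        "metadata:\n"
        f"  name: {name}\n"
        f"  namespace: {namespace}\n"
        "data:\n"
        f"  truststore.jks: {encoded}\n"
    )
```

For example, writing the result of render_ssl_secret("/etc/ssl/truststore.jks") to ssl.yaml should produce a file that can be passed to kubectl create -f.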

Spark Client hosts

  1. If Kerberos is enabled, generate a ticket cache and create a secret for it:

    $ kinit -k -c FILE:<krb5_ccache> -t <keytab> <principal>
    $ kubectl create secret generic user-krb-cache --from-file=<krb5_ccache>

    where:

    • <krb5_ccache> — path to a ticket cache file.

    • <keytab> — path to a user keytab file.

    • <principal> — principal name in the <username>@<realm> format.

  2. Configure Spark files.

    If the Spark3 Ranger plugin is enabled, append the following lines to ranger-spark-security.xml (this can be done via the Custom ranger-spark-security.xml Spark parameter in ADCM):

    <property>
    	<name>ranger.plugin.hive.use.rangerGroups</name>
    	<value>True</value>
    </property>
    <property>
    	<name>ranger.plugin.hive.use.only.rangerGroups</name>
    	<value>True</value>
    </property>

Submit a job

The following file is used as an example job:

write_hdfs.py
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("HDFSWriteExample").getOrCreate()

    data = [("row1", 1), ("row2", 2), ("row3", 3)]
    df = spark.createDataFrame(data, ["col1", "col2"])

    hdfs_output_path = "hdfs://hdfs/user/admin/tmp/test/"

    df.write.format("csv").mode("overwrite").save(hdfs_output_path)

    print(f"Data successfully written to {hdfs_output_path}")

    spark.stop()

if __name__ == "__main__":
    main()

Run spark-submit:

$ /bin/spark3-submit \
    --deploy-mode cluster \
    --master <master> \
    --conf spark.kubernetes.container.image=<image> \
    --conf spark.kubernetes.namespace=<namespace> \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=<sa> \
    --conf spark.kubernetes.file.upload.path=<upload_path> \
    write_hdfs.py

where:

  • <master> is a Kubernetes API endpoint accessible from a Spark Client host, e.g. k8s://https://10.92.14.35.

  • <image> is the Spark driver and executor image. Unpack the image from the offline pack, upload it to a registry, and specify its registry reference in this parameter, e.g. hub.arenadata.io/adh-enterprise/spark3-docker:3.5.4_arenadata3-adh-4.2.0-x86_64.

  • <namespace> is the Kubernetes namespace in which the driver and executors run, e.g. default.

  • <sa> is the Kubernetes service account defined in step 2 of the Kubernetes setup.

  • <upload_path> is a path to an HDFS or Ozone directory, e.g. hdfs://hdfs/user/admin/tmp/test/ or ofs://adhom/ozone/adh/apps/spark-submit/upload.

  • Add the following parameters if SSL is enabled:

    • --conf spark.kubernetes.driver.secrets.<ssl-secret> — mount the SSL secret into the driver, where <ssl-secret> is the name of the secret with SSL data created in step 4 of the Kubernetes setup.

    • --conf spark.kubernetes.executor.secrets.<ssl-secret> — mount the SSL secret into the executors, where <ssl-secret> is the name of the same secret.

    • --conf spark.driver.extraJavaOptions — use the mounted truststore for all requests, e.g. -Djavax.net.ssl.trustStore=/etc/ssl/truststore.jks -Djavax.net.ssl.trustStorePassword=<password>, where <password> must match the truststore password.

  • Add the following parameters if Kerberos is enabled:

    • --conf spark.kerberos.access.hadoopFileSystems — a comma-separated list of available file systems, e.g. hdfs://<hdfs_nameservice>,ofs://<ozone_mgr>/ozone, where <hdfs_nameservice> is the value of the dfs.internal.nameservices HDFS parameter and <ozone_mgr> is the value of the ozone.om.service.ids Ozone Manager parameter.

    • --principal — a Kerberos principal of a user, e.g. user@RU-CENTRAL1.INTERNAL.

    • --keytab — a Kerberos keytab file of a user, e.g. ./user.keytab.

    • --conf spark.kubernetes.kerberos.krb5.path — path to the krb5.conf file inside the container, e.g. /etc/krb5.conf.

    • --conf spark.kubernetes.driverEnv.KRB5CCNAME — path to a ticket cache.

    • --conf spark.kubernetes.driverEnv.KRB5PRINCIPAL — Kerberos principal name in the <username>@<realm> format.

    • --conf spark.kubernetes.driver.secrets.<krb5_ccache_secret> — mount path for the ticket cache secret created in step 1 of the Spark Client hosts setup.
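Because the command line grows quickly once SSL or Kerberos options are added, it can help to assemble it programmatically. The following is a minimal sketch, not an ADH tool: the helper names build_submit_args, ssl_conf, and to_shell are hypothetical, and the mount path passed to ssl_conf is an assumption chosen so that the truststore appears at /etc/ssl/truststore.jks as in the example above.

```python
import shlex


def build_submit_args(master, image, namespace, service_account,
                      upload_path, app_file, extra_conf=None):
    """Assemble the spark3-submit argument list from the parameters above."""
    # Spark expects the Kubernetes master URL to carry the k8s:// prefix.
    if not master.startswith("k8s://"):
        raise ValueError("master must start with k8s://")
    conf = {
        "spark.kubernetes.container.image": image,
        "spark.kubernetes.namespace": namespace,
        "spark.kubernetes.authenticate.driver.serviceAccountName": service_account,
        "spark.kubernetes.file.upload.path": upload_path,
    }
    # extra_conf carries optional SSL or Kerberos settings, if any.
    conf.update(extra_conf or {})
    args = ["/bin/spark3-submit", "--deploy-mode", "cluster", "--master", master]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app_file)
    return args


def ssl_conf(secret_name, mount_path, truststore_password):
    """SSL settings; mount_path is where the secret is mounted (assumed)."""
    return {
        f"spark.kubernetes.driver.secrets.{secret_name}": mount_path,
        f"spark.kubernetes.executor.secrets.{secret_name}": mount_path,
        "spark.driver.extraJavaOptions": (
            f"-Djavax.net.ssl.trustStore={mount_path}/truststore.jks "
            f"-Djavax.net.ssl.trustStorePassword={truststore_password}"
        ),
    }


def to_shell(args):
    """Render the argument list as a single, safely quoted shell command."""
    return shlex.join(args)
```

For instance, to_shell(build_submit_args("k8s://https://10.92.14.35", "<image>", "default", "spark-submit-sa", "hdfs://hdfs/user/admin/tmp/test/", "write_hdfs.py", extra_conf=ssl_conf("ssl-secret", "/etc/ssl", "<password>"))) yields a command that can be reviewed before it is run. Kerberos --conf settings can be passed through extra_conf in the same way; the --principal and --keytab flags are not covered by this sketch.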

You can monitor the job in the Spark History Server web UI.
