Spark Connect usage

This article shows examples of working with a Spark3 cluster using Spark Connect.

Spark3 Connect component

To interact with your Spark3 cluster via Spark Connect, the corresponding component (Spark3 Connect) of the Spark3 service has to be installed in your ADH cluster. You can add this component using Spark3 service actions in ADCM. After the component is installed, it can be used out-of-the-box without any additional configuration.

Use Spark Connect with Spark applications

A client Spark application that employs Spark Connect is very similar to a regular Spark app. The key difference is the use of a remote SparkSession object, which is responsible for connecting to a remote Spark Connect server. With a remote session, the Spark application code runs on the YARN cluster where the Spark Connect server is installed. Once a remote SparkSession is created, it can be used as if it were a regular Spark session object. All communication with the remote Spark Connect server is handled automatically by the Spark Connect client library.

The main ways of creating a remote Spark session are shown below.

Use spark3-submit

When using /bin/spark3-submit to launch Spark client applications, a remote session can be created in the application code as shown below.

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.setAppName('spark-yarn')

sparkSession = (SparkSession
    .builder
    .config(conf=conf)
    .remote("sc://<sc_host>:<sc_port>") (1)
    .getOrCreate())

data = [('First', 1), ('Second', 2), ('Third', 3)]
df = sparkSession.createDataFrame(data)
df.write.csv("hdfs:///user/admin/tmp/test") (2)
quit()
1 Creates a remote Spark session. <sc_host>:<sc_port> is the gRPC endpoint of the Spark3 Connect component. You can find the up-to-date endpoint on the Clusters → <your_cluster_name> → Services → Spark3 → Info page in ADCM.
2 Writes the DataFrame to HDFS. All the operations are performed on the Spark3 cluster with the Spark3 Connect component.

Once submitted, the following output indicates that the application has successfully connected to the Spark Connect server:

Client connected to the Spark Connect server at ka-adh-1.ru-central1.internal:15002
NOTE
Only one SparkSession object can exist within a Spark application at a time. You can stop the active session with SparkSession.stop() and then create a remote session as shown in the example above.
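The pattern from the note above can be sketched as follows. This is an illustrative example, not part of the product documentation: the helper names are hypothetical, the host and port are placeholders for your Spark3 Connect endpoint, and pyspark (3.4 or later, which ships the Spark Connect client) is assumed to be installed when the second function is actually called.

```python
# Sketch: stop the current session (only one may exist at a time),
# then create a remote one. Helper names and endpoint are placeholders.

def connect_url(host: str, port: int) -> str:
    """Build a Spark Connect connection string, e.g. sc://myhost:15002."""
    return f"sc://{host}:{port}"

def recreate_as_remote(current_session, host: str, port: int):
    """Stop current_session (if any) and return a new remote session."""
    # Imported here so the helper above stays usable without pyspark installed.
    from pyspark.sql import SparkSession
    if current_session is not None:
        current_session.stop()  # only one SparkSession may exist at a time
    return SparkSession.builder.remote(connect_url(host, port)).getOrCreate()
```

For example, inside a running application one might call `spark = recreate_as_remote(spark, "<sc_host>", <sc_port>)` to switch from a local to a remote session.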

Use PySpark shell

When using PySpark shell (/usr/bin/pyspark3), the SparkSession object is created automatically during the shell startup.

...
SparkSession available as 'spark'.

To make the PySpark shell create a remote session instead, pass the --remote flag when running /usr/bin/pyspark3. For example:

/usr/bin/pyspark3 --remote "sc://<sc_host>:<sc_port>"

Where <sc_host>:<sc_port> is the gRPC endpoint of the Spark3 Connect component. You can find the up-to-date endpoint on the Clusters → <your_cluster_name> → Services → Spark3 → Info page in ADCM.

Authentication

Spark Connect does not provide built-in authentication, and the gRPC channel between a Spark client application and the Spark Connect server remains unsecured. One way to secure this communication channel is to use gRPC proxies.
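When the Spark Connect endpoint sits behind a TLS-terminating gRPC proxy, the Spark Connect client connection string can carry extra parameters such as use_ssl and token (the token is sent as a bearer credential for the proxy to validate; Spark Connect itself does not check it). The sketch below is illustrative: the helper name is hypothetical, and the host, port, and token values are placeholders.

```python
# Sketch, assuming a TLS-terminating gRPC proxy in front of Spark Connect.
# Parameters are appended to the connection string after "/;".

def secure_connect_url(host: str, port: int, token: str) -> str:
    """Build a connection string with TLS and bearer-token parameters."""
    return f"sc://{host}:{port}/;use_ssl=true;token={token}"
```

A remote session could then be created with `SparkSession.builder.remote(secure_connect_url("<sc_host>", <sc_port>, "<token>")).getOrCreate()`, leaving token validation to the proxy.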
