Spark architecture
Apache Spark is a distributed processing engine designed for running data engineering, data science, and machine learning tasks on large-scale data clusters. Its major features are the following:
- High processing speed due to in-memory computations.
- Unified APIs for analytics. Spark provides a single platform for batch processing, real-time streaming, and interactive queries.
- SQL support. It allows using ANSI SQL to work with your datasets (see the sketch after this list).
- Multi-language support. Spark provides APIs for popular languages like Java, Scala, Python, and R.
- Built-in machine learning tools. Spark’s MLlib library provides a wide range of ML algorithms and tools to do machine learning on large, distributed datasets.
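To make the unified API and SQL support concrete, below is a minimal Scala sketch. The object name and sample data are illustrative only, and a configured Spark environment is assumed; the sketch queries the same small dataset through SQL and through the DataFrame API.

```scala
import org.apache.spark.sql.SparkSession

object SqlExample {
  def main(args: Array[String]): Unit = {
    // Entry point for the DataFrame and SQL APIs
    val spark = SparkSession.builder()
      .appName("sql-example")
      .getOrCreate()

    import spark.implicits._

    // A small in-memory dataset; in practice it would typically be read from HDFS, S3, etc.
    val students = Seq(("Alice", 21), ("Bob", 17), ("Carol", 19)).toDF("name", "age")

    // Register a temporary view and query it with SQL
    students.createOrReplaceTempView("students")
    spark.sql("SELECT name FROM students WHERE age > 18").show()

    // The same result via the DataFrame API
    students.filter($"age" > 18).select("name").show()

    spark.stop()
  }
}
```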
Components
Below is a high-level diagram of the Spark architecture.
The major Spark concepts are as follows:
- Application. A user program written using the Spark API. Spark supports applications written in Python, Scala, Java, and R.
- Cluster manager. An external service (for example, YARN) for allocating cluster resources to Spark.
- Driver. The process that runs the `main()` method of the Spark application and creates the `SparkContext` object.
- SparkContext. The core Spark API object used for interacting with a Spark cluster from your Spark code. Every application starts with instantiating a `SparkContext`. Once created, it is used to interact with the cluster manager, track the executors' state, and keep track of job status and execution plans.
- Worker node. A machine that can run Spark executors.
- Executor. A JVM process launched on a worker node to do a portion of the computing for a Spark application. Each application gets its own set of executors.
- Task. A unit of work to be sent to one executor.
- Job. A sequence of computations that occur in order to complete a Spark action (for example, `save()`, `collect()`).
- Stage. Each job is divided into smaller sets of tasks called stages that depend on each other (similarly to map/reduce stages in MapReduce). A short sketch illustrating jobs, stages, and tasks follows this list.
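As a rough illustration of how these concepts relate, here is a Scala sketch (the object and column names are made up for the example). The `groupBy` transformation introduces a shuffle boundary, so the job triggered by `collect()` is split into stages, and each stage runs one task per partition.

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch (names and data are illustrative) showing how
// jobs, stages, and tasks relate to user code.
object JobStagesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("job-stages-example").getOrCreate()
    import spark.implicits._

    val sales = Seq(("books", 10), ("games", 25), ("books", 5)).toDF("category", "amount")

    // groupBy introduces a shuffle, so the resulting job is split into
    // at least two stages; each stage runs one task per partition.
    val totals = sales.groupBy("category").sum("amount")

    // collect() is an action: it triggers a job that executes the DAG built above.
    totals.collect().foreach(println)

    spark.stop()
  }
}
```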
Driver and executors
The Spark architecture follows the master-worker paradigm. To run an application, Spark uses the following internal processes:
- Driver (master). The process that runs the `main()` method of a Spark application, interacts with a cluster manager, and submits tasks to executors.
- Executors. An executor is a process that executes tasks from the driver, keeps data in memory or on disk, and reports computation results back to the driver. When a Spark application starts, Spark spawns multiple executors on available cluster nodes. The allocation of executors is done automatically through a cluster manager. You can control the amount of resources to be allocated per executor (see the sketch below).
The placement of the driver and executors in a Spark cluster is determined automatically and may vary depending on the Spark deployment mode.
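For example, per-executor resources can be set through standard Spark configuration properties. The following Scala sketch uses arbitrary example values; in practice, the same properties are often passed via spark-submit options (--num-executors, --executor-memory, --executor-cores) or spark-defaults.conf.

```scala
import org.apache.spark.sql.SparkSession

// A sketch of setting per-executor resources via standard Spark
// configuration properties; the values below are arbitrary examples.
object ExecutorResourcesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("executor-resources-example")
      .config("spark.executor.instances", "4")  // number of executors to request
      .config("spark.executor.memory", "4g")    // heap memory per executor
      .config("spark.executor.cores", "2")      // CPU cores per executor
      .getOrCreate()

    // ... application logic ...

    spark.stop()
  }
}
```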
Spark deployment modes
Spark supports two deployment modes, which define where the driver process runs.
The deployment mode is specified using the `--deploy-mode [client|cluster]` option when submitting an application, for example:
```
./bin/spark3-submit \
    --deploy-mode [client|cluster] \
    ... \
    mySparkApp.jar
```
The key difference between the two modes is described below.
| Characteristic | Client mode | Cluster mode |
|---|---|---|
| Driver process location | Runs on the client machine, the one where the Spark application has been submitted | Runs on one of the worker nodes of the Spark cluster. The cluster manager chooses the node automatically |
| Resource management | The client machine is responsible for managing the driver process | The cluster manager is responsible for managing the driver process |
| Network traffic | Assumes active communication between the client machine, cluster manager, and executors (shuffles, JAR/Python files with dependencies, Spark DAGs/tasks, etc.). This requires the client machine to have a fast and stable network connection to the Spark cluster infrastructure | Intense communication between the client machine and the cluster manager mainly occurs during application submission. After that, the network traffic between the client machine and the Spark cluster is minimal |
| Fault tolerance | Less fault-tolerant compared to the cluster mode. If the client machine fails along with the driver process, the entire application fails | More fault-tolerant since the driver process runs on a YARN-managed node within the cluster. If the driver node fails, YARN restarts the driver container automatically |
| Suitability | Interactive testing/development environments where quick iterations and debugging are required. Also useful for small Spark applications or when resources are limited | Production deployments where scalability, fault tolerance, and resource isolation are critical. This mode should be the choice for efficient handling of large-scale data processing tasks |

NOTE
If a deployment mode is not specified explicitly, Spark runs in the client mode.
Execution workflow
Below is a generic execution workflow of a Spark application.
- The Spark driver creates a `SparkContext` object. Depending on the deploy mode, the Spark driver runs:
  - either inside the Application Master process managed by YARN (cluster mode);
  - or in a client process, in which case the Application Master is only used for requesting resources from YARN (client mode).
- `SparkContext` requests YARN to allocate executors. YARN launches containers for executors on cluster nodes with NodeManagers.
- Every executor registers itself in the driver, notifying that it is ready to receive tasks.
- To perform a transformation, Spark builds or updates a logical plan in the form of a DAG. Since Spark transformations are lazy, no actual work is done by executors at this step. A transformation example: `val filtered_students = df.filter($"age" > 18)`.
- To perform an action, Spark executes the DAG. An action example: `filtered_students.count()`.
- The driver converts the logical plan into an optimized logical plan and then into a physical plan. During this phase, optimizations such as predicate pushdown are applied. You can inspect these plans as shown in the sketch after this list.
- The DAG scheduler splits the job into stages.
- The driver creates tasks and schedules them. Tasks are serialized and sent to executors via the TaskScheduler. The executor for each task is chosen based on data locality, available executor resources, and so on.
- Executors work on the assigned tasks. Each executor reads input blocks, runs the transformation code for each partition, and writes shuffle files when required. The completion status is sent back to the driver.
- The driver assembles results from executors, and the Spark application completes. For actions like `count()`, results from tasks are aggregated on the driver side. For write operations, executors write data directly to HDFS/S3.
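The plan-building steps above can be observed from user code. Below is a Scala sketch (sample data and names are illustrative) that prints the logical and physical plans with `explain()` and then triggers the job with `count()`.

```scala
import org.apache.spark.sql.SparkSession

// A sketch showing how to inspect the plans the driver builds
// for a query before an action executes it.
object ExplainExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("explain-example").getOrCreate()
    import spark.implicits._

    val df = Seq(("Alice", 21), ("Bob", 17)).toDF("name", "age")
    val filtered_students = df.filter($"age" > 18)

    // Prints the parsed/analyzed/optimized logical plans and the physical plan
    filtered_students.explain(extended = true)

    // count() is an action: it triggers the job that actually executes the plan
    println(filtered_students.count())

    spark.stop()
  }
}
```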
Spark distribution in ADH
In ADH, the Spark engine is available as a separate service (Spark3 or Spark4, where the number indicates the core version). Once the service is added to a cluster, you can use Spark features out of the box.
Spark services available in ADH include the following components:
- Spark client. Includes core libraries for running Spark jobs.
- Spark Connect. Acts as a gateway providing a remote connection to a Spark cluster (a connection sketch follows this list).
- Spark History Server. Provides a web UI with details about completed Spark applications.
- Spark Livy Server. A RESTful service that allows submitting Spark jobs over HTTP.
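As an example of using Spark Connect from Scala, here is a minimal sketch. It assumes a Spark Connect endpoint is reachable (the address below is a placeholder; 15002 is the default Spark Connect port) and that the Spark Connect Scala client library is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// A sketch of connecting to a Spark cluster through Spark Connect.
// The endpoint address is a placeholder; the actual host and port depend
// on how the Spark Connect server is configured in your cluster.
object ConnectExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .remote("sc://spark-connect-host:15002")  // hypothetical endpoint
      .getOrCreate()

    spark.sql("SELECT 1 AS test").show()

    spark.stop()
  }
}
```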
Spark connectors
The Spark distribution available in ADH additionally includes connectors that allow integration with other Arenadata products.
Cluster manager
A cluster manager is responsible for allocating resources (such as CPU and memory) to Spark in order to run an application. Technically, Spark can run on several cluster managers, such as Mesos and Kubernetes; however, the Spark distribution in ADH is preconfigured to work with YARN. This cluster manager is used by default and requires no configuration on the Spark side.
Spark4 vs Spark3
The Spark 4 core version introduced significant changes compared to Spark 3.x, providing enhanced performance, reliability, and developer experience. The full list of features is available in the Apache Spark release notes. Below is a summary of the major updates.
- ANSI SQL compliance by default. Spark4 has the ANSI mode enabled by default (an example follows this list). This brings in the following enhancements:
  - Runtime exceptions are thrown for invalid SQL operations instead of returning `NULL`.
  - Stricter type checking.
  - Better data quality assurance through rejection of invalid casts during inserts.
- Data type enhancements:
  - The `VARIANT` data type introduction. This is a major advancement for handling semi-structured data, particularly JSON.
  - String collation support. Provides better locale-specific string comparisons and sorting.
- Python ecosystem enhancements:
  - Python Data Sources. Native Python data source capabilities allow developers to create custom data sources entirely in Python.
  - Polymorphic Python UDTFs. Polymorphic user-defined table functions allow for more flexible data transformations and enhanced code reusability.
- Spark Connect enhancements:
  - Spark4 introduces a new lightweight Python client, dramatically reducing deployment footprints.
  - Enhanced remote connectivity patterns.
- Streaming and state management enhancements:
  - Spark 4 provides access to streaming state for debugging and monitoring.
  - Enhanced troubleshooting capabilities for complex streaming pipelines.
- Debugging enhancements. Implementation of structured JSON logging provides machine-readable logs for automated processing and enables better integration with modern logging aggregation systems.
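To illustrate the ANSI mode behavior mentioned above, here is a Scala sketch. It toggles the standard `spark.sql.ansi.enabled` property at runtime; the query and column alias are illustrative. With ANSI mode disabled, an invalid cast silently returns `NULL`; with ANSI mode enabled (the Spark 4 default), it raises a runtime error.

```scala
import org.apache.spark.sql.SparkSession

// A sketch comparing the behavior of an invalid cast with ANSI mode off and on.
object AnsiModeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ansi-mode-example").getOrCreate()

    // ANSI mode off: the invalid cast returns NULL
    spark.conf.set("spark.sql.ansi.enabled", "false")
    spark.sql("SELECT CAST('abc' AS INT) AS value").show()

    // ANSI mode on (Spark 4 default): the same cast fails at runtime
    spark.conf.set("spark.sql.ansi.enabled", "true")
    try {
      spark.sql("SELECT CAST('abc' AS INT) AS value").show()
    } catch {
      case e: Exception => println(s"ANSI mode rejected the cast: ${e.getMessage}")
    }

    spark.stop()
  }
}
```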