spark-submit
The spark-submit script is used to launch applications on a cluster. It is located in Spark's bin directory. An example of invoking spark-submit is shown below:
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
Where:
- --class — the entry point for your Spark application (e.g. org.apache.spark.examples.SparkPi).
- --master — the master URL for the cluster (e.g. spark://23.195.26.187:7077).
- --deploy-mode — whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client, the default).
- --conf — an arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap the property in quotes ("key1"="value 1"). Multiple configurations should be passed as separate arguments (for example, --conf <key>=<value> --conf <key2>=<value2>).
- <application-jar> — the path to a JAR including your application and all its dependencies. The URL must be globally visible inside your cluster, for instance an hdfs:// path or a file:// path that is present on all nodes.
- [application-arguments] — arguments to be passed to the main method of your main class, if any.
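As a concrete illustration, the template above can be filled in to run Spark's bundled SparkPi example on a standalone cluster. The master address, jar filename (which varies with your Spark and Scala versions), and configuration value here are placeholders; adjust them to your installation:

```shell
# Submit the bundled SparkPi example to a standalone cluster master
# in cluster deploy mode, setting one configuration property.
# The master URL and jar path below are illustrative placeholders.
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://203.0.113.10:7077 \
  --deploy-mode cluster \
  --conf spark.executor.memory=2g \
  examples/jars/spark-examples_2.12-3.3.0.jar \
  1000
```

The trailing 1000 is an application argument (the number of partitions SparkPi uses), passed through to the main method of the class named by --class.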
Master URL
The master URL passed to Spark can be in one of the following formats:
URL | Description
---|---
local | Runs Spark locally with one worker thread (no parallelism)
local[K] | Runs Spark locally with K worker threads (ideally, set this to the number of cores on your machine)
local[K,F] | Runs Spark locally with K worker threads and F maxFailures (see the spark.task.maxFailures configuration property)
local[*] | Runs Spark locally with as many worker threads as the number of logical cores on your machine
local[*,F] | Runs Spark locally with as many worker threads as the number of logical cores on your machine and F maxFailures
local-cluster[N,C,M] | Local-cluster mode is used only for unit tests. It emulates a distributed cluster in a single JVM with N workers, C cores per worker, and M MiB of memory per worker
spark://HOST:PORT | Connects to the given Spark cluster in Spark standalone mode. The port must be whichever one your master is configured to use, which is 7077 by default
spark://HOST1:PORT1,HOST2:PORT2 | Connects to the given Spark cluster in Spark standalone mode with standby masters with ZooKeeper. The list must include all the master hosts in the high-availability cluster set up with ZooKeeper. The port must be whichever each master is configured to use, which is 7077 by default
mesos://HOST:PORT | Connects to the given Mesos cluster. The port must be whichever one your Mesos master is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://...
yarn | Connects to a YARN cluster in client or cluster mode depending on the value of --deploy-mode
k8s://HOST:PORT | Connects to a Kubernetes cluster in client or cluster mode depending on the value of --deploy-mode. HOST and PORT refer to the Kubernetes API server
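During development it is common to start with a local master URL before moving to a cluster. A minimal sketch using the SparkPi example again (the jar filename is an assumption, as above):

```shell
# Run the application locally with 4 worker threads.
# Quote the master URL so the shell does not interpret the brackets.
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master "local[4]" \
  examples/jars/spark-examples_2.12-3.3.0.jar \
  100
```

No --deploy-mode is given here because local mode always runs the driver in the submitting process; only the master URL needs to change when the same application is later submitted to a standalone, YARN, Mesos, or Kubernetes cluster.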