spark-submit & spark-shell

There are two ways to launch Spark jobs on your cluster, and you can run either of them on any host where the Spark client is installed:

  • spark-submit. Create a Spark application that runs interactively or in batch mode, using Scala, Python, R, or Java.

  • spark-shell. Submit interactive statements through the Scala, Python, or R shell.

Alternatively, you can use Livy to submit and manage Spark applications on a cluster. Livy is a Spark service that allows local and remote applications to interact with Apache Spark over an open source REST interface. Livy offers additional multi-tenancy and security functionality.
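
For example, a packaged application can be submitted as a Livy batch session with a plain HTTP request. The following is a sketch only: the host name, jar path, and class name are placeholders, and 8998 is Livy's default port.

$ curl -X POST \
    -H "Content-Type: application/json" \
    -d '{"file": "hdfs:///apps/my-spark-app.jar", "className": "com.example.MyApp"}' \
    http://livy-host:8998/batches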

spark-submit

To launch Spark applications on a cluster, you can use the spark-submit script in Spark’s bin directory. It can use all of Spark’s supported cluster managers through a uniform interface, so you don’t have to configure your application specially for each cluster manager.

If your code depends on other projects, you have to package those dependencies alongside your application in order to run it on a Spark cluster. To do this, create an assembly jar (or uber-jar) containing your code and its dependencies; both sbt and Maven have assembly plugins for this. When creating the assembly jar, list Spark and Hadoop as provided dependencies: they do not need to be bundled, because the cluster manager provides them at runtime.
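
As a rough sketch, an sbt build using the sbt-assembly plugin can mark the Spark artifacts as provided so they are left out of the jar. The project name, Scala version, Spark version, and plugin version below are placeholders and should match your cluster and project:

// build.sbt
name := "my-spark-app"
scalaVersion := "2.12.18"

// Spark is marked "provided": the cluster supplies it at runtime,
// so it is excluded from the assembly jar.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.5.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "3.5.1" % "provided"
)

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.1")

Running sbt assembly then produces a single jar under the target directory that bundles your code together with any non-provided dependencies.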

When you have built the uber-jar, you can run it using the bin/spark-submit script. This script takes care of setting up the classpath with Spark and its dependencies, and it supports all of the cluster managers and deploy modes that Spark offers.
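
For example, a YARN cluster-mode submission might look like the following; the class name, jar path, and resource settings are placeholders for your own application:

$ ./bin/spark-submit \
    --class com.example.MyApp \
    --master yarn \
    --deploy-mode cluster \
    --executor-memory 2G \
    --num-executors 4 \
    target/my-spark-app-assembly.jar arg1 arg2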

You can find more information at the Submitting Applications page of the Spark documentation.

spark-shell

The Spark shell provides a simple way to learn the API, as well as a powerful tool for analyzing data interactively. You can use the API interactively by launching the Scala shell (the spark-shell script). Note that each interactive shell automatically creates a SparkContext in a variable called sc.

The shell is available in Scala, which runs on the Java VM and is therefore a good way to use existing Java libraries. Start it by running the following command in the Spark directory:

$ ./bin/spark-shell
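
Once the shell starts, you can exercise the API through the automatically created sc variable. As a small illustrative example, the following statements distribute a local range as an RDD and compute basic aggregates on it:

scala> val rdd = sc.parallelize(1 to 100)   // distribute a local collection as an RDD
scala> rdd.count()                          // number of elements: 100
scala> rdd.sum()                            // sum of 1..100: 5050.0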