Prerequisites

Checklist

Before installing and using the Spark connector, make sure that the following requirements are met:

  • There is access to the ADB cluster.

  • There is access to the Spark cluster.

  • There is a network connection between the ADB master node and the Spark driver.

  • There is a network connection between the ADB master node and each Spark executor node.

  • There is a network connection between each ADB segment node and each Spark executor node.
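The network items in the checklist above can be probed with a short script run on the Spark driver and on each executor node. The following is a minimal sketch, not part of the connector: the function names are made up for illustration, and the hosts and ports you pass in are your own values (the ones in the usage note are hypothetical). It only checks that a TCP connection can be opened from the machine it runs on.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_targets(targets, timeout: float = 3.0) -> dict:
    """Probe every (host, port) pair and map each pair to its result."""
    return {(host, port): can_connect(host, port, timeout) for host, port in targets}
```

For example, `check_targets([("adb-master.example.com", 5432)])` probes a hypothetical master host on port 5432, which is assumed here and may differ in your cluster. A `False` result means the connection could not be opened from the node where the script ran.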

The user that accesses ADB tables needs additional privileges, including the right to create external tables. Use the following sample, replacing <role_name> with the name of that user's role:

alter role <role_name> with createexttable(protocol='gpfdist') createexttable(type='writable') createexttable(type='readable') login;

You also need to add the required access entries for this user to pg_hba.conf, if there are none.

Memory

In general, Spark can run well with anywhere from 8 GB to hundreds of gigabytes of memory per machine. In all cases, we recommend allocating at most 75% of the memory to Spark; leave the rest for the operating system and buffer cache.
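The 75% guideline is simple arithmetic. The sketch below shows it for a single machine; the function name and the validation rules are illustrative assumptions, not part of Spark.

```python
def spark_memory_budget_gb(total_gb: float, spark_fraction: float = 0.75) -> float:
    """Return the amount of memory (in GB) to give Spark on one machine.

    spark_fraction is capped at 0.75 to follow the guideline in the text;
    the remainder is left for the operating system and buffer cache.
    """
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    if not 0 < spark_fraction <= 0.75:
        raise ValueError("spark_fraction must be in the range (0, 0.75]")
    return total_gb * spark_fraction
```

For a machine with 64 GB of RAM, `spark_memory_budget_gb(64)` gives 48.0, which leaves 16 GB for the operating system and buffer cache.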

How much memory you need depends on your application. To determine how much your application uses for a certain dataset size, load part of your dataset into a Spark RDD and use the Storage tab of Spark’s monitoring UI (http://<driver-node>:4040) to see its size in memory. Note that memory usage is greatly affected by storage level and serialization format – see the tuning guide for tips on how to reduce it.
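Once the Storage tab shows the in-memory size of the loaded part, you can roughly project it onto the full dataset. The sketch below assumes that memory usage grows linearly with the number of records, which is an assumption of this example and not a guarantee from Spark; the function name is illustrative.

```python
def estimate_full_size_gb(sample_size_gb: float, sample_fraction: float) -> float:
    """Project the in-memory size of the full dataset from a loaded sample.

    sample_size_gb  -- size reported on the Storage tab for the loaded part
    sample_fraction -- share of the dataset that was loaded, in (0, 1]

    Assumes linear scaling, so treat the result as a rough estimate only.
    """
    if sample_size_gb < 0:
        raise ValueError("sample_size_gb must not be negative")
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in the range (0, 1]")
    return sample_size_gb / sample_fraction
```

For example, if a quarter of the dataset occupies 2 GB, `estimate_full_size_gb(2.0, 0.25)` projects 8.0 GB for the whole dataset under the same storage level and serialization format.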

Finally, note that the Java VM does not always behave well with more than 200 GB of RAM. If you purchase machines with more RAM than this, you can run multiple worker JVMs per node. In Spark’s standalone mode, you can set the number of workers per node with the SPARK_WORKER_INSTANCES variable in conf/spark-env.sh, and the number of cores per worker with SPARK_WORKER_CORES.
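The sketch below illustrates one way to split a large node into several workers so that each JVM stays at or below 200 GB. The helper names and the even split are assumptions made for this example. SPARK_WORKER_MEMORY is a standalone-mode variable that the text above does not mention; it is included here only to make the per-worker memory explicit.

```python
import math


def plan_workers(spark_memory_gb: float, cores: int, max_jvm_gb: float = 200):
    """Return (instances, cores_per_worker, memory_per_worker_gb) for one node."""
    if spark_memory_gb <= 0 or cores < 1:
        raise ValueError("spark_memory_gb must be positive and cores at least 1")
    # Smallest number of workers that keeps every JVM at or below max_jvm_gb.
    instances = max(1, math.ceil(spark_memory_gb / max_jvm_gb))
    # Split memory and cores evenly; any remainder is left unused.
    memory_per_worker_gb = int(spark_memory_gb // instances)
    cores_per_worker = max(1, cores // instances)
    return instances, cores_per_worker, memory_per_worker_gb


def spark_env_lines(spark_memory_gb: float, cores: int) -> list:
    """Format the plan as lines that could be placed in conf/spark-env.sh."""
    instances, worker_cores, worker_mem = plan_workers(spark_memory_gb, cores)
    return [
        f"export SPARK_WORKER_INSTANCES={instances}",
        f"export SPARK_WORKER_CORES={worker_cores}",
        f"export SPARK_WORKER_MEMORY={worker_mem}g",
    ]
```

With 384 GB available to Spark and 64 cores, `plan_workers(384, 64)` returns `(2, 32, 192)`: two workers with 32 cores and 192 GB each.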

Network

The Spark connector uses the gpfdist protocol to transfer data between ADB segments and Spark executor nodes. By default, the connector starts an instance of the gpfdist server on a port in the range [8080 .. 65536]. Make sure that there are open ports in this range and that they are accessible from the ADB master and segment nodes.
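To check that a Spark executor node has at least one free port in this range, you can try to bind to the ports one by one. The sketch below is an illustration and does not reproduce how the connector itself selects a port; the function name is hypothetical. It only tests local availability, so firewall rules between ADB and the Spark nodes still need to be verified separately.

```python
import socket


def first_bindable_port(start: int = 8080, end: int = 65535, host: str = ""):
    """Return the first port in [start, end] that can be bound locally, or None.

    The upper bound defaults to 65535 because that is the largest valid TCP port.
    """
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                # Port is in use or not permitted; try the next one.
                continue
            return port
    return None
```

A return value of `None` means that no port in the range could be bound on this node.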
