Prerequisites

Checklist

Before installing and using ADB Spark Connector, make sure that the following requirements are met:

  • There is access to the ADB cluster.

  • There is access to the Spark cluster.

  • There is a network connection between the ADB master node and the Spark driver.

  • There is a network connection between the ADB master node and each Spark executor node.

  • There is a network connection between each ADB segment node and each Spark executor node.

You also need to grant special privileges to the user that accesses ADB tables, namely the privilege to create external tables. You can use the following sample:

ALTER ROLE <role_name>
WITH    CREATEEXTTABLE(protocol='gpfdist',type='readable')
        CREATEEXTTABLE(protocol='gpfdist',type='writable');

where <role_name> is a user name in ADB.

Also make sure that client connections for this user are allowed on the ADB side: if they are not, add the corresponding entries to the pg_hba.conf file.
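
For example, the following pg_hba.conf entry allows the user to connect to any database from the subnet where the Spark nodes run. The address and authentication method are illustrative; adjust them to your environment:

# TYPE  DATABASE  USER         ADDRESS       METHOD
host    all       <role_name>  10.0.0.0/24   md5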

Memory

In general, Spark runs fine with any amount of memory from 8 GB to hundreds of gigabytes per machine. We recommend allocating at most 75% of the memory to Spark; leave the rest for the operating system and the buffer cache.

The amount of memory you need depends on your application. To determine how much memory your application uses for a certain dataset size, load part of your dataset into a Spark RDD, then open the Storage tab of the Spark monitoring UI (http://<driver-node>:4040) to see the size of that part in memory. Memory usage is strongly affected by the storage level and the serialization format. See the tuning guide for tips on how to reduce it.
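
For example, in spark-shell you can cache a sample of the dataset and materialize it; the dataset path and the sampling fraction below are illustrative:

// Cache a 10% sample of the dataset and force its computation;
// the cached size then appears on the Storage tab of the Spark UI
val sample = sc.textFile("hdfs:///path/to/dataset").sample(false, 0.1).cache()
sample.count()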

NOTE
The Java VM does not always behave well with more than 200 GB of RAM. If you purchase machines with more RAM, you can run multiple worker JVMs per node. In Spark standalone mode, you can set the number of worker instances per node via the SPARK_WORKER_INSTANCES variable in the conf/spark-env.sh script. You can also set the number of cores per worker via the SPARK_WORKER_CORES variable.
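
For example, a node with 256 GB of RAM could be split between two workers as follows; all values are illustrative and should be adjusted to your hardware:

# conf/spark-env.sh
SPARK_WORKER_INSTANCES=2   # two worker JVMs per node
SPARK_WORKER_CORES=16      # cores available to each worker
SPARK_WORKER_MEMORY=96g    # memory available to each worker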

Network

ADB Spark Connector uses the gpfdist protocol to transfer data between ADB segments and Spark executor nodes. By default, the connector starts an instance of the gpfdist server on a port from the [8080 .. 65535] range.

IMPORTANT
Make sure there are open ports in the range specified above. The ports should be accessible from the ADB master.
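
To verify that a specific port in this range is reachable, you can run a quick probe from the ADB master; the host name and port below are placeholders:

nc -z -v <spark-executor-host> 8080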