ADB Spark 3 Connector prerequisites

Checklist

Before installing and using the ADB Spark 3 Connector, make sure that the following requirements are met:

  • There is access to the ADB cluster.

  • There is access to the Spark 3 cluster.

  • There is a network connection between the ADB master node and the Spark 3 driver.

  • There is a network connection between the ADB master node and each Spark 3 executor node.

  • There is a network connection between each ADB segment node and each Spark 3 executor node.

You need to grant special rights to the user who will access ADB tables from Spark 3. These rights include the right to create readable and writable external tables. You can use the following sample:

ALTER ROLE <role_name>
WITH    CREATEEXTTABLE(protocol='gpfdist',type='readable')
        CREATEEXTTABLE(protocol='gpfdist',type='writable');

where <role_name> is a user name in ADB.

You also need to configure the selected user’s access to ADB tables from Spark 3 hosts in the pg_hba.conf file (see ADB Spark 3 Connector usage examples).
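For example, a pg_hba.conf entry of the following shape could allow the selected user to connect from the Spark 3 hosts. The subnet address and the authentication method below are assumptions; adjust them to your environment:

```
# Hypothetical entry: allow <role_name> to connect to any database
# from the 10.0.1.0/24 subnet (Spark 3 hosts) using md5 password authentication
host    all    <role_name>    10.0.1.0/24    md5
```

After editing pg_hba.conf, reload the ADB configuration so that the new entry takes effect.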

Memory

In general, Spark 3 runs fine with any memory amount between 8 GB and hundreds of gigabytes per machine. We recommend allocating at most 75% of the memory to Spark 3; leave the rest for the operating system and the buffer cache.

The memory amount that you need depends on your application. To determine how much memory your application uses for a certain dataset size, load a part of your dataset into a Spark 3 RDD, then use the Storage tab of the Spark 3 monitoring UI (http://<driver-node>:4040) to see the in-memory size of that part. Memory usage is affected by the storage level and the serialization format. See the tuning guide for tips on how to reduce it.

NOTE
Java VM does not always behave well with more than 200 GB of RAM. If you purchase machines with more RAM, you can run multiple worker Java VMs per node. In the Spark 3 standalone mode, you can set the number of worker instances per node via the SPARK_WORKER_INSTANCES variable in the conf/spark-env.sh script. You can also set the number of cores per worker via the SPARK_WORKER_CORES variable.
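For instance, to run two workers with eight cores each on every node, conf/spark-env.sh could contain the following lines. The specific values are illustrative; tune them for your hardware:

```
# conf/spark-env.sh: illustrative values for a large-memory node
SPARK_WORKER_INSTANCES=2   # run two worker JVMs per node
SPARK_WORKER_CORES=8       # cores available to each worker
```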

Network

ADB Spark 3 Connector uses the gpfdist protocol to transfer data between ADB segments and Spark 3 execution nodes. By default, the connector starts an instance of the gpfdist server inside the [8080 .. 65536] port range.
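As a quick connectivity check, a small script along the following lines can verify that a TCP port in the gpfdist range is reachable on a given host. The helper name and the hosts in the comments are assumptions, not part of the connector:

```python
import socket

# Ports the connector may choose for its gpfdist server
# (8080 .. 65535; 65535 is the maximum valid TCP port number)
GPFDIST_PORT_RANGE = range(8080, 65536)

def check_port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    if port not in GPFDIST_PORT_RANGE:
        raise ValueError(f"port {port} is outside the gpfdist port range")
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: from the ADB master, probe a port on a Spark 3 executor host
# (host name is hypothetical)
# check_port_reachable("spark-executor-1.example.com", 18080)
```

Running such a probe from the ADB master against each Spark 3 executor host helps confirm that firewalls do not block the transfer ports before you run the first job.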

IMPORTANT
Make sure that ports in the range specified above are open and accessible from the ADB master host.