ADB Spark 3 Connector overview

ADB Spark 3 Connector enables high-speed parallel data exchange between Apache Spark 3 and Arenadata DB (ADB).

Architecture

Each Spark 3 application consists of a controlling process (driver) and a number of distributed worker processes (executors). The interaction scheme is provided below.

ADB Spark 3 Connector architecture

Work algorithms

Data reading

To load an ADB table into Spark 3, ADB Spark 3 Connector initializes a driver. The driver establishes a connection to the ADB master via JDBC and obtains the necessary metadata about the table: its type, structure, and distribution key. The distribution key allows the connector to distribute the table data across the available Spark 3 executors.

The data is stored on ADB segments. A Spark 3 application that uses the connector can transfer data from each segment into one or more Spark 3 partitions. There are five partitioning modes:

  • According to gp_segment_id. The data from ADB tables is read and distributed across Spark 3 partitions depending on the segment. The number of Spark 3 partitions corresponds to the number of active ADB segments. This mode does not require additional parameters and is used by default.

  • According to the specified column and the number of partitions. Data is split into Spark 3 partitions within the column value range according to the specified number of partitions. The number of Spark 3 partitions corresponds to the number of ranges. It is necessary to set the appropriate parameters (see ADB Spark 3 Connector options). This mode has a limitation: only integer and date/time table fields can be used as the column.

  • According to the specified column. Data is divided into Spark 3 partitions depending on the unique values of the specified column. The number of Spark 3 partitions corresponds to the number of unique values of that column. It is necessary to set the appropriate parameters (see ADB Spark 3 Connector options). This mode is recommended for columns with a small, limited set of values.

  • According to the specified number of partitions. Data is divided into Spark 3 partitions depending on the specified number of partitions. It is necessary to set the appropriate parameters (see ADB Spark 3 Connector options).

  • According to the specified hash function and the number of partitions. Data is divided into Spark 3 partitions depending on the specified hash function and the specified number of partitions. It is necessary to set the appropriate parameters (see ADB Spark 3 Connector options). This mode has a limitation: only an expression that returns an integer can be used as the hash function.

Data processing tasks are then spread across all Spark 3 executors along with the corresponding partitions created at the previous step. The data exchange is performed in parallel via external writable tables for every partition. A minimal read example is shown after the diagram below.

Data reading algorithm
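
For illustration, below is a minimal sketch of reading an ADB table into a Spark 3 DataFrame from spark3-shell. The format name adb and the spark.adb.* option names are assumptions made for this sketch; see ADB Spark 3 Connector options for the exact names and the full list of parameters.

// A minimal read sketch. The format name ("adb") and the option names below
// are assumptions; check ADB Spark 3 Connector options for the exact ones.
val df = spark.read
  .format("adb")
  .option("spark.adb.url", "jdbc:postgresql://<master_ip>:5432/<database_name>")
  .option("spark.adb.user", "<user_name>")
  .option("spark.adb.password", "<password>")
  .option("spark.adb.dbtable", "<schema>.<table_name>")
  // Optional partitioning by a column and a number of partitions (the second
  // mode described above); these option names are also assumptions:
  // .option("spark.adb.partition.column", "<integer_or_datetime_column>")
  // .option("spark.adb.partition.count", "<partitions_number>")
  .load()

df.printSchema()
df.show(10)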

Data writing

To load a table from Spark 3 into ADB, ADB Spark 3 Connector initializes a driver. The driver establishes a connection to the ADB master via JDBC and prepares ADB for loading. The loading method depends on the writing mode.

The following writing modes are supported:

  • overwrite. Depending on the spark.db.table.truncate option, the target table is either completely dropped or truncated (cleared of all data) before the data is written.

  • append. Additional data is written to the target table.

  • errorIfExists. If the target table already exists, an error is raised.

A number of spark.db.create.table options are also available. They allow you to specify additional settings when creating a target table (see ADB Spark 3 Connector options).

A task to load the data is created for each Spark 3 partition. The data exchange is performed in parallel via external readable tables for every segment.

Data writing algorithm
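
As an illustration, a minimal write sketch in the append mode is shown below. As in the reading example, the format name and the spark.adb.* option names are assumptions; consult ADB Spark 3 Connector options for the exact names.

// A minimal write sketch in the append mode. Option names are assumptions.
df.write
  .format("adb")
  .option("spark.adb.url", "jdbc:postgresql://<master_ip>:5432/<database_name>")
  .option("spark.adb.user", "<user_name>")
  .option("spark.adb.password", "<password>")
  .option("spark.adb.dbtable", "<schema>.<target_table>")
  .mode("append") // or "overwrite"/"errorIfExists", as described above
  .save()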

Supported data types

Matching of data types during a transfer is shown in the tables below.

Transfer from ADB to Spark 3
ADB             Spark 3

bigint          LongType
bigserial       LongType
bit             StringType
bytea           BinaryType
boolean         BooleanType
char            StringType
date            DateType
decimal         DecimalType
float4          FloatType
float8          DoubleType
int             IntegerType
interval        CalendarIntervalType
serial          IntegerType
smallint        ShortType
text            StringType
time            TimestampType
timestamp       TimestampType
timestamptz     TimestampType
timetz          TimestampType
varchar         StringType

Transfer from Spark 3 to ADB
Spark 3                 ADB

BinaryType              bytea
BooleanType             boolean
CalendarIntervalType    interval
DateType                date
DecimalType             numeric
DoubleType              float8
FloatType               float4
IntegerType             int
LongType                bigint
ShortType               smallint
StringType              text
TimestampType           timestamp

Supported JDBC drivers

ADB Spark 3 Connector communicates with ADB via a JDBC connection. The PostgreSQL JDBC driver comes with the connector.

A JDBC connection string for the standard driver has the following structure:

jdbc:postgresql://<master_ip>[:<port>]/<database_name>

where:

  • <master_ip> — an IP address of the ADB master host.

  • <port> — a connection port (5432 by default).

  • <database_name> — a database name in ADB.
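
For example, a connection string for a database named adb on a master host with the hypothetical IP address 10.92.40.38 and the default port can look as follows:

jdbc:postgresql://10.92.40.38/adb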

Additionally, ADB Spark 3 Connector supports the use of third-party JDBC drivers. In order to use a third-party JDBC driver, do the following:

  1. Prepare a JDBC connection string for the ADB master.

  2. Submit a JAR file with a third-party JDBC driver via one of the following ways:

    • Use the --jars command-line option when calling spark3-shell or spark3-submit.

    • Build the uber-jar file with all dependencies.

    • Install a JAR file with a third-party driver on all Spark 3 nodes.

  3. Set the full path to the driver Java class in the connector option spark.adb.driver.
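
For example, the driver class can be passed directly in the application code. The driver class name and the connection URL below are hypothetical values, and the option names other than spark.adb.driver are assumptions made for this sketch:

// A sketch of using a third-party JDBC driver with the connector.
// "com.example.jdbc.Driver" and the URL are hypothetical values.
val df = spark.read
  .format("adb")
  .option("spark.adb.driver", "com.example.jdbc.Driver") // full path to the driver Java class
  .option("spark.adb.url", "jdbc:example://<master_ip>:<port>/<database_name>")
  .option("spark.adb.dbtable", "<schema>.<table_name>")
  .load()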

Connector usage

To transfer data between Spark 3 and ADB via ADB Spark 3 Connector, you can use the standard utilities spark3-shell and spark3-submit. Examples of the spark3-shell usage can be found in the ADB Spark 3 Connector usage examples article.

When developing a standalone Spark application, combine the connector JAR file (/usr/lib/spark3/jars/adb-spark-connector-assembly-release-<connector_version>-spark-<spark_version>.jar) with the other application dependencies into an uber-jar file. For more information, see Submitting Applications in the Spark documentation.
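
For example, with sbt and the sbt-assembly plugin, a minimal build sketch can look as follows. The project name and the Scala and Spark versions are illustrative assumptions; the connector JAR path is the one mentioned above:

// build.sbt: a sketch of bundling an application together with the connector.
// Assumes the sbt-assembly plugin is enabled in project/plugins.sbt.
name := "adb-spark-app"
scalaVersion := "2.12.18" // illustrative version
// Spark is provided by the cluster, so it must not get into the uber-jar
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.4.1" % "provided"
// The connector JAR from the local Spark 3 installation as an unmanaged dependency
Compile / unmanagedJars += Attributed.blank(
  file("/usr/lib/spark3/jars/adb-spark-connector-assembly-release-<connector_version>-spark-<spark_version>.jar"))

Running sbt assembly then produces an uber-jar that can be passed to spark3-submit.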
