Spark connector overview

The ADB Spark connector enables high-speed, parallel data exchange between Apache Spark and Arenadata DB (ADB).

Architecture

Each Spark application consists of a controlling process (the driver) and a set of distributed worker processes (the executors).

[Figure: ADB Spark connector architecture]

Data reading

To load an ADB table into Spark, the connector first initializes the driver. The driver then establishes a JDBC connection to the ADB master to obtain the necessary metadata about the table: its type, structure, and distribution key. The distribution key allows the table data to be distributed efficiently across the available Spark executors.
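
A minimal read sketch in Scala is shown below. The data source name ("adb") and the connection options (spark.db.url, spark.db.user, spark.db.password, spark.db.dbtable) are illustrative assumptions, not confirmed names; the exact names are listed in Spark connector options.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("adb-read-example")
    .getOrCreate()

  // Read an ADB table into a Spark DataFrame through the connector.
  // All option names below are assumptions used for illustration only.
  val df = spark.read
    .format("adb")                                                     // assumed data source name
    .option("spark.db.url", "jdbc:postgresql://adb-master:5432/demo")  // assumed: ADB master JDBC URL
    .option("spark.db.user", "gpadmin")                                // assumed: database user
    .option("spark.db.password", "***")                                // assumed: database password
    .option("spark.db.dbtable", "public.sales")                        // assumed: source table
    .load()

  df.printSchema()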

The data is stored on ADB segments. A Spark application that uses the connector can read data from each segment into one or more Spark partitions. Five partitioning modes are currently available (a usage sketch follows the list):

  • By gp_segment_id. Data from the ADB table is read and distributed across Spark partitions according to the segment it resides on. The number of Spark partitions equals the number of active ADB segments. This mode is used by default and requires no additional parameters.

  • By a specified column and number of partitions. Data is split into Spark partitions by dividing the column's value range into the specified number of ranges, so the number of Spark partitions equals the number of ranges. The corresponding parameters must be set (Spark connector options). This mode is limited to columns of integer and date/time types.

  • By a specified column. Data is split into Spark partitions according to the distinct values of the specified column, so the number of Spark partitions equals the number of distinct values. The corresponding parameter must be set (Spark connector options). This mode is recommended for columns with a small, bounded set of values.

  • By a specified number of partitions. Data is split into the specified number of Spark partitions using the default hash function. The corresponding parameter must be set (Spark connector options).

  • By a specified hash function and number of partitions. Data is split into the specified number of Spark partitions using the specified hash function. The corresponding parameters must be set (Spark connector options). This mode is limited to hash expressions that return an integer type.
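
The sketch below shows how the column and partition-count mode might be selected, and how the default mode needs no extra options. The option names spark.db.partition.column and spark.db.partition.count are assumptions used for illustration; consult Spark connector options for the exact names in your connector version.

  // Column + partition count mode (the second mode above): the value range of
  // an integer or date/time column is split into the given number of ranges.
  // Option names are illustrative assumptions; see "Spark connector options".
  val byRange = spark.read
    .format("adb")                                    // assumed data source name
    .option("spark.db.dbtable", "public.sales")       // assumed: source table
    .option("spark.db.partition.column", "sale_id")   // assumed: integer or date/time column
    .option("spark.db.partition.count", "8")          // assumed: number of ranges = Spark partitions
    .load()

  // Without any partitioning options the default gp_segment_id mode applies:
  // one Spark partition per active ADB segment.
  val bySegment = spark.read
    .format("adb")                                    // assumed data source name
    .option("spark.db.dbtable", "public.sales")       // assumed: source table
    .load()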

A data processing task is then dispatched to each Spark executor together with the corresponding, previously created partition. Data is exchanged through external writable tables, in parallel for each partition.

[Figure: data reading scheme]

Data writing

To load a table from Spark into ADB, the connector first initializes the driver. The driver then establishes a JDBC connection to the ADB master and prepares ADB for loading according to the selected write mode.

The following write modes are currently supported:

  • Overwrite (overwrite): depending on spark.db.table.truncate, the target table is either dropped entirely or truncated (all its data removed) before writing;

  • Append (append): additional data is written to the target table;

  • Error if the target table exists (errorIfExists): the write fails when the target table already exists.

A number of spark.db.create.table options are also available for specifying additional settings when the target table is created (Spark connector options); a write sketch is shown below.
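
As a sketch, the code below selects a write mode for an existing DataFrame df. The mode names (append, overwrite, errorifexists) and the DataFrameWriter API are standard Spark; spark.db.table.truncate is the option described above, while the data source name and the target table option are assumptions used for illustration.

  // Append new rows to the target ADB table.
  df.write
    .format("adb")                                    // assumed data source name
    .option("spark.db.dbtable", "public.sales_copy")  // assumed: target table
    .mode("append")                                   // also: "overwrite", "errorifexists"
    .save()

  // Overwrite: with spark.db.table.truncate=true the target table is truncated
  // instead of being dropped and recreated before writing.
  df.write
    .format("adb")                                    // assumed data source name
    .option("spark.db.dbtable", "public.sales_copy")  // assumed: target table
    .option("spark.db.table.truncate", "true")
    .mode("overwrite")
    .save()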

Next, a data loading task is created for each Spark partition. Data is exchanged through external readable tables, in parallel for each segment.

[Figure: data writing scheme]

Supported data types

From ADB into Spark

  ADB type        Spark type
  --------------  --------------------
  bigint          LongType
  bigserial       LongType
  bit             StringType
  bytea           BinaryType
  boolean         BooleanType
  char            StringType
  date            DateType
  decimal         DecimalType
  float4          FloatType
  float8          DoubleType
  int             IntegerType
  interval        CalendarIntervalType
  serial          IntegerType
  smallint        ShortType
  text            StringType
  time            TimestampType
  timestamp       TimestampType
  timestamptz     TimestampType
  timetz          TimestampType
  varchar         StringType

From Spark into ADB

  Spark type            ADB type
  --------------------  --------
  BinaryType            bytea
  BooleanType           boolean
  CalendarIntervalType  interval
  DateType              date
  DecimalType           numeric
  DoubleType            float8
  FloatType             float4
  IntegerType           int
  LongType              bigint
  ShortType             smallint
  StringType            text
  TimestampType         timestamp
