Spark connector overview
The ADB Spark connector enables high-speed, parallel data exchange between Apache Spark and Arenadata DB.
Architecture
Each Spark application consists of a controlling process (the driver) and a set of distributed worker processes (the executors).
Data reading
To load an ADB table into Spark, the connector first initializes the driver. The driver then establishes a JDBC connection to the ADB master to obtain the necessary table metadata: the table type and structure, as well as the distribution key. The distribution key makes it possible to distribute the table data efficiently across the available Spark executors.
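For illustration, the sketch below loads an ADB table into a Spark DataFrame. The data source name (adb) and all option names in this snippet are assumptions made for the example; the actual names are listed in Spark connector options.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("adb-read-example")
  .getOrCreate()

// Option names below are illustrative assumptions; check "Spark connector options"
// for the exact names supported by your connector version.
val df = spark.read
  .format("adb")                                                     // assumed data source short name
  .option("spark.adb.url", "jdbc:postgresql://adb-master:5432/adb")  // assumed: ADB master JDBC URL
  .option("spark.adb.user", "gpadmin")                               // assumed: database user
  .option("spark.adb.password", "******")                            // assumed: database password
  .option("spark.adb.dbtable", "public.sales")                       // assumed: source table
  .load()

df.show(10)
```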
The data is stored on ADB segments. A Spark application that uses the connector can receive data from each segment into one or more Spark partitions. Five partitioning modes are currently available:
- By gp_segment_id. Data from the ADB table is read and distributed across Spark partitions according to the segment it is stored on. The number of Spark partitions equals the number of active ADB segments. This partitioning mode is used by default and requires no additional parameters.
- By the specified column and number of partitions. Data is split into Spark partitions by ranges of the column values, according to the specified number of partitions; the number of Spark partitions equals the number of ranges. The corresponding parameters must be set (see Spark connector options). Limitation: only integer and date/time columns of the table can be used as the partitioning column. A hedged sketch of this mode is shown after the list.
- By the specified column. Data is split into Spark partitions according to the unique values of the specified column; the number of Spark partitions equals the number of unique values. The corresponding parameter must be set (see Spark connector options). This mode is recommended for columns with a small, limited set of values.
- By the specified number of partitions. Data is split into the specified number of Spark partitions using the default hash function. The corresponding parameter must be set (see Spark connector options).
- By the specified hash function and number of partitions. Data is split into the specified number of Spark partitions according to the specified hash function. The corresponding parameters must be set (see Spark connector options). Limitation: only an expression that returns an integer type can be used as the hash function.
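As a minimal sketch of partitioning by a column and a number of partitions: the option names partition.column and partitions below are assumptions for this example; see Spark connector options for the actual names.

```scala
// `spark` is a SparkSession, as in the read example above.
// Option names other than the table name pattern are illustrative assumptions.
val partitionedDf = spark.read
  .format("adb")                               // assumed data source short name
  .option("spark.adb.dbtable", "public.sales") // assumed: source table
  .option("partition.column", "order_id")      // assumed: integer or date/time column
  .option("partitions", "8")                   // assumed: number of ranges
  .load()

// The resulting number of Spark partitions should match the requested number of ranges.
println(partitionedDf.rdd.getNumPartitions)
```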
A data processing task is then created for each of the resulting partitions and sent to a Spark executor. Data is exchanged via writable external tables, in parallel for each partition.
Data writing
To load a table from Spark into ADB, the connector first initializes the driver. The driver then establishes a JDBC connection to the ADB master and prepares ADB for loading, depending on the write mode.
The following write modes are currently supported:
- Overwrite (overwrite). Depending on the spark.db.table.truncate option, the target table is either dropped completely or all of its data is truncated before writing.
- Append (append). The data is added to the target table.
- Error if the target table exists (errorIfExists).
A number of spark.db.create.table options are also available; they allow you to specify additional settings when creating the target table (see Spark connector options).
Next, a data loading task is created for each Spark partition. Data is exchanged via readable external tables, in parallel for each segment.
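As an illustration, the sketch below writes a DataFrame to ADB in the overwrite mode. The spark.db.table.truncate option is the one described above; the data source name and the target table option name are assumptions for this example.

```scala
// `df` is an existing DataFrame to be written to ADB.
// Option names other than spark.db.table.truncate are illustrative assumptions.
df.write
  .format("adb")                                     // assumed data source short name
  .option("spark.adb.dbtable", "public.sales_copy")  // assumed: target table
  .option("spark.db.table.truncate", "true")         // keep the table, truncate its data before writing
  .mode("overwrite")                                 // other modes: "append", "errorIfExists"
  .save()
```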
Supported data types
| ADB | Spark |
|---|---|
| bigint | LongType |
| bigserial | LongType |
| bit | StringType |
| bytea | BinaryType |
| boolean | BooleanType |
| char | StringType |
| date | DateType |
| decimal | DecimalType |
| float4 | FloatType |
| float8 | DoubleType |
| int | IntegerType |
| interval | CalendarIntervalType |
| serial | IntegerType |
| smallint | ShortType |
| text | StringType |
| time | TimestampType |
| timestamp | TimestampType |
| timestamptz | TimestampType |
| timetz | TimestampType |
| varchar | StringType |
| Spark | ADB |
|---|---|
| BinaryType | bytea |
| BooleanType | boolean |
| CalendarIntervalType | interval |
| DateType | date |
| DecimalType | numeric |
| DoubleType | float8 |
| FloatType | float4 |
| IntegerType | int |
| LongType | bigint |
| ShortType | smallint |
| StringType | text |
| TimestampType | timestamp |
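The mapping can be observed on the schema of a loaded DataFrame. The sketch below assumes a hypothetical table public.orders with bigint, text, and timestamp columns; the data source and option names are the same assumptions as in the reading examples above.

```scala
// For an ADB table defined, for example, as:
//   CREATE TABLE public.orders (id bigint, customer text, created_at timestamp);
// the loaded DataFrame is expected to have the schema:
//   root
//    |-- id: long (nullable = true)
//    |-- customer: string (nullable = true)
//    |-- created_at: timestamp (nullable = true)
val ordersDf = spark.read
  .format("adb")                                // assumed data source short name
  .option("spark.adb.dbtable", "public.orders") // assumed: source table
  .load()

ordersDf.printSchema()
```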