ADB architecture
Arenadata DB (ADB) is a Massively Parallel Processing (MPP) database based on Greengage DB.
The MPP architecture consists of a cluster of simultaneously operating nodes. ADB implements this as a cluster of separate PostgreSQL instances.
ADB uses the shared-nothing approach where each segment has its own portion of data. Multiple segments can run on a single host, but they process data independently. The system primarily scales horizontally by adding more physical nodes to the cluster.
The ADB architecture consists of three core components:
Master
Master is the entry point to the database. End users connect to the master to submit queries.
The master functions include:
- Client authentication. The master controls client access to the system using the standard pg_hba.conf file. Segments accept SQL client connections from the master only.
- Query parsing. The master checks the original SQL statement for correct syntax. ADB follows PostgreSQL syntax with some adjustments to support its distributed design.
- Optimization. The optimizer determines the most efficient way to execute an SQL query by evaluating multiple potential execution plans and selecting the one with the lowest estimated cost (see the example after this list).
- Coordination. The master distributes workload across query executors on all segments.
- Getting results. The master aggregates local results generated by segments (such as filtered rows, aggregated values, or joined data) and returns the final result to the client.
- Administrative operations. Beyond processing queries, the master performs database management. For example, it updates system catalogs after receiving a DDL command (such as CREATE TABLE).
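As a simple illustration of the optimization step, you can ask the optimizer to print the plan it selects for a query by using the standard EXPLAIN statement; the sales table and its columns below are hypothetical:

-- Hypothetical table; EXPLAIN prints the selected execution plan
-- without running the query.
EXPLAIN
SELECT customer_id, sum(amount)
FROM sales
WHERE sale_date >= date '2024-01-01'
GROUP BY customer_id;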
The master doesn’t store user data; user data resides on segments only.
The master stores the global system catalog (tables like pg_database and pg_authid) with metadata about the database itself and about the structure and properties of user data.
When a user creates a new table, the master stores information about its schema, ownership, and distribution type, while the actual content of the table is stored on the segments.
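For example, a minimal sketch of such a DDL command (the table name and columns are hypothetical) that hash-distributes rows across segments by a chosen distribution key:

-- Hypothetical table; rows are hash-distributed across segments
-- by the customer_id distribution key.
CREATE TABLE sales (
    sale_id     bigint,
    customer_id bigint,
    sale_date   date,
    amount      numeric
)
DISTRIBUTED BY (customer_id);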
Standby master
The standby master serves as a warm standby: it is kept up to date and ready to take over, but it does not process queries unless activated. The standby master is optional but highly recommended.
If the primary master fails, there is no automatic switching to the standby master. Instead, ADB relies on administrator intervention. For this, you can use the Activate Standby cluster action in the web interface of Arenadata Cluster Manager (ADCM).
Data integrity and recovery are implemented using write-ahead logging (WAL) inherited from PostgreSQL. WAL ensures that all changes to data files are first recorded in the log and flushed to permanent storage before the actual data pages are modified on disk. This guarantees that even if a crash occurs, the database can be recovered by replaying the WAL records.
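To check which host currently acts as the master and which as the standby, you can query the gp_segment_configuration system catalog (in Greenplum-based systems, the master pair has content = -1); a minimal sketch:

-- The master and the standby master both have content = -1;
-- role 'p' marks the acting master, 'm' marks the standby.
SELECT dbid, role, preferred_role, status, hostname
FROM gp_segment_configuration
WHERE content = -1;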
Segments
Each segment is an independent PostgreSQL database that stores and processes its set of data. ADB distributes user data across segments based on the distribution key that you select when you create a table. Balanced distribution of data and workloads across segments is the key to obtaining the best performance from ADB.
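To check how evenly table rows are spread across segments, you can count rows per segment using the gp_segment_id pseudo-column; the sales table below is hypothetical:

-- Hypothetical table; gp_segment_id shows which segment stores each row.
SELECT gp_segment_id, count(*) AS rows_on_segment
FROM sales
GROUP BY gp_segment_id
ORDER BY gp_segment_id;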
TIP
It is advised that all segment hosts have identical hardware resources and configurations. This is because the master aggregates results only after all segments complete their work, so ADB performance is limited by the slowest segment server in the cluster. If one segment is significantly slower than the others, it will delay the entire query.
Primary segments
Primary segments are active nodes that store data, receive query plans, and execute queries. They run local database operations on their sets of data and send the results back to the master.
Mirror segments
Mirror segments provide high availability support for primary segments. If a primary segment becomes unavailable, the system automatically fails over to the mirror copy. The mirror then starts acting as a primary segment, and the unavailable primary segment becomes a mirror segment. The administrator must recover the failed segment, and then exchange the primary and mirror segments so they are in their initial (preferred) roles.
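After a failover, you can find segments that run outside their preferred roles (or are marked as down) by querying gp_segment_configuration; a minimal sketch:

-- Segments whose current role differs from the preferred role,
-- or which are down, need administrator attention.
SELECT content, role, preferred_role, status, hostname
FROM gp_segment_configuration
WHERE role <> preferred_role OR status = 'd';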
A mirror segment must always reside on a different host than its primary segment. Two mirror configurations exist:
Group mirroring
With group mirroring, all mirror segments for primary segments of a host are put on one backup host.
For example, in the image below, the Primary 0 and Primary 1 segments reside on Segment host 1, and their mirrors Mirror 0 and Mirror 1 reside on Segment host 2.
Key characteristics of group mirroring:
- It is the least expensive option because it can be achieved with two hosts only.
- It leads to unbalanced load in case of failure. When a host fails, its mirror host will get double the workload: in addition to its own primary segments (Primary 2 and Primary 3 on the image below), its mirrors (Mirror 0 and Mirror 1) will act as primary segments. The host performance will degrade by 50% or more.
Spread mirroring
With spread mirroring, mirrors for each host’s primary segments are spread across as many hosts as there are segments per host.
For example, in the image below, the Primary 0 and Primary 1 segments reside on Segment host 1, and their mirrors reside on different hosts: Mirror 0 on Segment host 2, and Mirror 1 on Segment host 3.
Compared to group mirroring:
- Better fault tolerance. Host failures affect only one mirror per host, minimizing workload disruption. In the image below, if Segment host 1 fails, Segment host 2 will have Mirror 0 activated, and Segment host 3 will have Mirror 1 activated. The load on each host will increase by 50%, and this extra load is evenly distributed among hosts.
- More expensive. This configuration requires the cluster to have at least one more host than there are segments per host.
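To see how primary and mirror segments are actually placed across hosts in a running cluster, you can list them from gp_segment_configuration; a minimal sketch:

-- Lists every data segment with its host, which makes the group
-- or spread mirror layout visible (content >= 0 excludes the master pair).
SELECT content, role, preferred_role, hostname
FROM gp_segment_configuration
WHERE content >= 0
ORDER BY content, role;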
Interconnect
The interconnect is a high-speed network layer that enables communication between segments during query execution. A 10-Gigabit system, or faster, is recommended.
Most database operations run across all segments in parallel and require data motion (data moving between segments). The interconnect is used for data motion, and a slow interconnect can lead to delays in such operations.
The interconnect is not required for queries executed on a single segment (targeted queries), particularly queries filtered by the distribution key, where all the necessary data can be found on that segment.
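For example, assuming the hypothetical sales table distributed by customer_id from the earlier sketch, an equality filter on the distribution key allows the query to be served by a single segment:

-- The equality filter on the distribution key targets only the segment
-- that stores rows for this customer, so no data motion is needed.
SELECT sale_id, amount
FROM sales
WHERE customer_id = 42;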
By default, the interconnect uses User Datagram Protocol with flow control (UDPIFC).
You can select the interconnect type (udpifc, tcp, or proxy) in the ADB configuration parameters.
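Assuming the interconnect type is exposed through the gp_interconnect_type server parameter (as in Greenplum-based systems), you can check the current value from a SQL session:

-- Shows the interconnect protocol currently in use (udpifc by default).
SHOW gp_interconnect_type;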
Use ADB for ETL
Production systems may deploy designated servers for the extract, transform, and load (ETL) purposes outside the ADB cluster. These external servers can connect to the ADB cluster via standard protocols like ODBC, JDBC, or libpq.
To transfer large volumes of data, you can use the File Distribution Protocol utility (gpfdist). You can run it directly on the ETL servers where external files reside.
With this architecture, ETL operations don’t compete with query resources.
To load data using gpfdist, define external tables with gpfdist as the data source. When you select data from such an external table, segments connect to gpfdist directly and load data in parallel. The master host only coordinates the process.
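A minimal sketch of such an external table, assuming a hypothetical ETL host named etl-host that serves CSV files via gpfdist on port 8081:

-- Hypothetical external table; segments pull file content from gpfdist
-- in parallel when the table is queried.
CREATE EXTERNAL TABLE ext_sales (
    sale_id     bigint,
    customer_id bigint,
    sale_date   date,
    amount      numeric
)
LOCATION ('gpfdist://etl-host:8081/sales*.csv')
FORMAT 'CSV';

-- Loading into a regular table then runs on all segments in parallel.
INSERT INTO sales SELECT * FROM ext_sales;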
Integration with other systems
For integration with diverse data sources and various data formats, ADB uses Platform Extension Framework (PXF) and specialized connectors.
PXF is a data access framework that allows you to query external systems directly.
It enables federated queries: you can access and combine data from different sources with a single query.
It also supports filter pushdown, which lets external data sources filter data for the WHERE clause on their side and thus reduces the amount of data that is transferred to ADB. PXF allows you to handle multiple data formats, including Avro, OCF, CSV, Text, ORC, and JSON.
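As an illustration with a hypothetical path and column list, a PXF external table that reads text files from HDFS might look like this:

-- Hypothetical PXF external table; WHERE-clause filters on such tables
-- can be pushed down to the external source.
CREATE EXTERNAL TABLE ext_events (
    event_id   bigint,
    event_type text,
    created_at timestamp
)
LOCATION ('pxf://data/events?PROFILE=hdfs:text')
FORMAT 'TEXT' (DELIMITER ',');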
On top of PXF, connectors are built to enable data transfer between the cluster and particular external systems (such as file storage, databases, and streams).
Built-in PXF connectors enable integration with Oracle Database, S3, HBase, Hadoop, HDFS, Hive, and JDBC sources. The Enterprise Edition of ADB provides the following additional connectors: ADB ClickHouse Connector, ADB to Kafka Connector and Kafka to ADB Connector, ADB Spark 3 Connector, and ADB to ADB Connector.
Additionally, to extend native SQL capabilities, ADB provides database extensions, such as PostGIS for support of geographic objects or plpgsql for support of the PL/pgSQL language. For the full list of extensions available in ADB, refer to Supported extensions.
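For example, an extension such as PostGIS is enabled in a database with a standard CREATE EXTENSION statement:

-- Enables PostGIS support for geographic objects in the current database.
CREATE EXTENSION postgis;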
Advanced cluster management tools
As part of the Enterprise Edition, the following solutions are available in ADB to enhance database management and monitoring:
- Arenadata DB Control (ADB Control), which provides real-time monitoring of SQL commands and transactions, sessions, resource groups, system metrics, and more.
- Arenadata DB Backup Manager (ADBM), which is a fault-tolerant system for ADB binary backup management.