Arenadata Documentation
Our passion is to build efficient, flexible solutions that scale up to dozens of petabytes
Products
Explore our range of solutions in the world of Big Data
Overview
Arenadata Hadoop (ADH) is a commercial distribution of the open-source Apache Hadoop software. It is a big data platform designed for storing, processing, and analyzing large volumes of structured and unstructured data.
Arenadata Hadoop includes tools and components from the Hadoop ecosystem, such as the Hadoop Distributed File System (HDFS), MapReduce, YARN, and other Apache projects. It also includes additional software components and tools designed to make it easier to deploy, manage, and use Hadoop in enterprise environments.
Use cases
Big data analytics

ADH can be used to process and analyze large volumes of data, such as clickstream data, sensor data, social media data, and financial data. This can help businesses gain valuable insights into customer behavior, market trends, and other important metrics.

Machine learning and artificial intelligence

ADH can be used as a data processing platform for machine learning and artificial intelligence applications. This can help businesses to build predictive models, detect anomalies, and automate decision-making processes.

Data integration

ADH can be used to integrate data from multiple sources and formats into a unified, centralized data repository. This can help businesses to eliminate data silos and provide a single, consistent view of data.

Fraud detection and prevention

ADH can be used to detect and prevent fraud by analyzing large volumes of data in real time. This can help businesses identify and respond to fraudulent activities quickly, reducing losses and protecting their reputation.

Log analytics

ADH can be used to process and analyze log data generated by IT systems and applications. This can help businesses to troubleshoot issues, identify performance bottlenecks, and improve system reliability.

Enterprise
Community
Support for key Hadoop components
High availability and disaster recovery features
Advanced security features, including encryption and role-based access control
Automated management and monitoring tools
Deploy & upgrade automation
Offline installation
Technical support 24/7
Corporate training courses
Tailored solutions
Available integrations
ADQM
Arenadata QuickMarts
  • The ADQM Spark connector enables high-speed parallel data exchange between Apache Spark in ADH and Arenadata QuickMarts (ADQM).
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • The Flink JDBC connector allows reading data from and writing data to any relational database that provides a JDBC driver.
ADB
  • The ADB Spark connector enables high-speed parallel data exchange between Apache Spark and Arenadata DB (ADB).
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • The Flink JDBC connector allows reading data from and writing data to any relational database that provides a JDBC driver.
ADS
  • Spark Streaming integrates with Apache Kafka and Arenadata Streaming (ADS), enabling seamless real-time data ingestion, processing, and analysis at scale.
  • The Flink Apache Kafka connector provides high-performance stream processing, enabling real-time data analysis, transformation, and visualization at scale.
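As a minimal sketch of how Spark Structured Streaming is pointed at an ADS/Kafka topic: the option names below follow Spark's Kafka source, while the broker addresses and topic name are placeholders. The Spark calls themselves are shown as comments, since they require a live SparkSession and cluster.

```python
# Hypothetical connection settings for consuming a Kafka/ADS topic
# from Spark Structured Streaming; hosts and topic are made up.
kafka_options = {
    "kafka.bootstrap.servers": "ads-broker1:9092,ads-broker2:9092",
    "subscribe": "clickstream",        # topic to consume
    "startingOffsets": "earliest",     # replay the topic from the beginning
}

# With a live SparkSession the stream would be wired up roughly like this:
# df = (spark.readStream.format("kafka")
#       .options(**kafka_options)
#       .load())
# query = df.writeStream.format("console").start()
```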
Oracle
  • The Spark JDBC connector connects Spark to any JDBC-compatible database, such as Oracle, unlocking new opportunities for data analysis, processing, and visualization.
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • The Flink JDBC connector allows reading data from and writing data to any relational database that provides a JDBC driver.
MS SQL
  • The Spark JDBC connector connects Spark to any JDBC-compatible database, such as MS SQL, unlocking new possibilities for data analysis, processing, and visualization.
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • The Flink JDBC connector allows reading data from and writing data to any relational database that provides a JDBC driver.
AWS S3
  • The Hadoop AWS module provides support for AWS integration.
  • The S3A connector provides a fast and efficient way to access data stored in Amazon Simple Storage Service (S3) from Spark applications.
  • The Flink S3 connector allows you to use S3 with Flink for reading and writing data, as well as with the streaming state backends.
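The S3A connector is driven by Hadoop configuration keys. The key names below are the standard `fs.s3a.*` properties; the endpoint, credentials, and bucket name are placeholders.

```python
# Standard Hadoop S3A configuration keys; values are placeholders.
s3a_conf = {
    "fs.s3a.endpoint": "s3.eu-central-1.amazonaws.com",
    "fs.s3a.access.key": "EXAMPLEACCESSKEY",    # placeholder credential
    "fs.s3a.secret.key": "EXAMPLESECRETKEY",    # placeholder credential
    "fs.s3a.connection.maximum": "64",          # pool size for parallel reads
}

# In Spark the same keys are passed with the spark.hadoop. prefix:
# for key, value in s3a_conf.items():
#     spark.conf.set("spark.hadoop." + key, value)
# df = spark.read.parquet("s3a://my-bucket/events/")   # bucket is made up
```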
Azure Storage
  • The Hadoop Azure module provides support for integration with Azure Blob Storage.
  • The Spark WASB (Windows Azure Storage Blob) connector is an Apache Spark library that enables Spark applications to read and write data in Azure Blob Storage.
Azure Data Lake
  • The Spark ABFS (Azure Blob File System) connector provides an API for Spark applications to read and write data directly in ADLS Gen2 without staging data on a local disk.
  • The Flink ABS connector allows you to use Azure Blob Storage with Flink for reading and writing data.
GCS
  • The Spark GS connector provides an API for Spark applications to read and write data directly in Google Cloud Storage (GCS) without staging data on a local disk.
  • The Flink GCS filesystem can be used for reading and writing data and for checkpoint storage.
JDBC
  • The Spark JDBC connector connects Spark to any JDBC-compatible database, unlocking new possibilities for data analysis, processing, and visualization.
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • The Flink JDBC connector allows reading data from and writing data to any relational database that provides a JDBC driver.
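The Spark JDBC source is configured through a handful of options, where `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` control how the read is split into parallel tasks. The URL, table, and credentials below are placeholders; the option names are Spark's standard JDBC options.

```python
# Hypothetical JDBC settings; URL, table, and credentials are placeholders.
jdbc_options = {
    "url": "jdbc:oracle:thin:@//db-host:1521/ORCLPDB1",
    "dbtable": "SALES.ORDERS",
    "user": "spark_reader",
    "password": "secret",
    "numPartitions": "8",            # number of parallel read tasks
    "partitionColumn": "ORDER_ID",   # numeric column to split the read on
    "lowerBound": "1",               # lower edge of the partitioning range
    "upperBound": "1000000",         # upper edge of the partitioning range
}

# With a live SparkSession:
# df = spark.read.format("jdbc").options(**jdbc_options).load()
```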
Solr
The Spark Solr integration is a library that allows Spark applications to read data from and write data to Apache Solr. With it, Spark applications can read data from Solr using SolrRDD, which parallelizes data processing across a Spark cluster.
Phoenix

The Spark Apache Phoenix integration is a library that enables Spark applications to interact with Apache Phoenix, an open-source SQL layer for Apache HBase that provides SQL-like syntax for querying and managing data stored in HBase.

With the Spark Apache Phoenix integration, Spark applications can read data from Phoenix tables using PhoenixRDD, which provides a distributed representation of the data stored in a Phoenix table.

Zeppelin

Apache Zeppelin is a web-based notebook interface for interactive data analytics with Apache Hadoop. It allows you to create and execute data-driven workflows using a variety of languages within a single, integrated environment.

Airflow
Airflow2 is a platform for creating, scheduling, and monitoring data workflows. It provides a web-based interface for creating and managing workflows, which can include tasks such as data ingestion, transformation, and loading.
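An Airflow task is ordinary Python code that the scheduler invokes. A minimal sketch, with made-up task names and logic; the DAG wiring is left as comments because it assumes an Airflow 2 installation.

```python
# Plain-Python task callables; the task logic here is illustrative only.
def extract():
    """Pretend to pull rows from a source system."""
    return [1, 2, 3]

def transform(rows):
    """Pretend to transform the extracted rows."""
    return [row * 10 for row in rows]

# With Airflow 2 installed, the same callables would be wired into a DAG:
# from airflow import DAG
# from airflow.operators.python import PythonOperator
# with DAG("etl_sketch", schedule="@daily") as dag:
#     PythonOperator(task_id="extract", python_callable=extract)

# Running the pipeline by hand, outside the scheduler:
result = transform(extract())
# result == [10, 20, 30]
```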
AVRO
Avro is a binary data format designed to be compact and fast. It supports schema evolution, which allows data schemas to change over time without requiring data to be rewritten or reloaded.
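Schema evolution boils down to resolving a record written with an older schema against a newer reader schema, filling added fields from defaults. Avro's wire format is binary, but the resolution idea can be sketched in plain Python (field names and values are made up):

```python
# A record written with schema v1 (no "location" field).
record_v1 = {"id": 42, "name": "sensor-a"}

# The reader's newer schema v2, expressed here as field -> default value.
schema_v2_defaults = {"id": None, "name": None, "location": "unknown"}

def read_with_schema(record, defaults):
    """Resolve a record against a newer schema by filling defaulted fields."""
    return {field: record.get(field, default)
            for field, default in defaults.items()}

resolved = read_with_schema(record_v1, schema_v2_defaults)
# resolved == {"id": 42, "name": "sensor-a", "location": "unknown"}
```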
PARQUET
Parquet is a columnar storage format optimized for processing large datasets. It stores data column by column, which allows faster access to individual columns and better compression ratios.
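The difference between row-oriented and columnar layout can be shown in a few lines of plain Python (the sample data is made up):

```python
# Three records in row-oriented form, as a JSON or Avro file stores them.
rows = [
    {"id": 1, "city": "Berlin", "amount": 10.0},
    {"id": 2, "city": "Berlin", "amount": 12.5},
    {"id": 3, "city": "Munich", "amount": 7.0},
]

# The same data pivoted into columnar form, as Parquet lays it out on disk.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Scanning a single column now touches only that column's values,
# and runs of equal values ("Berlin", "Berlin") compress well.
total = sum(columns["amount"])   # 29.5
```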
ORC
ORC (Optimized Row Columnar) is another columnar storage format, designed to be highly efficient and scalable. It supports compression and predicate push-down, which can greatly improve query performance.
DELTA
Delta is a transactional storage format built on top of Parquet that provides ACID transactions. It also supports schema evolution and offers features such as versioning and time travel.
XML
XML is a markup language used for representing structured data. Spark can handle XML data using libraries such as spark-xml.
JSON
JSON (JavaScript Object Notation) is a lightweight data format commonly used for exchanging data between applications. Spark has built-in support for reading and writing JSON data.
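For example, reading and writing a single JSON line with Python's standard library, in the same one-object-per-line shape Spark's JSON source expects (the record content is made up):

```python
import json

# One record per line, the shape Spark's JSON data source expects.
line = '{"user": "alice", "event": "click", "ts": 1700000000}'

record = json.loads(line)        # parse the JSON line into a dict
record["event"] = "purchase"     # modify a field
line_out = json.dumps(record)    # serialize back to a JSON string
```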
Operating systems
AltLinux 8.4 SP
Supported
CentOS 7
Supported
RedHat 7
Supported
AstraLinux
Currently in development
Support for key Hadoop components
High availability and disaster recovery features
Advanced security features, including encryption and role-based access control
Automated management and monitoring tools
Deploy & upgrade automation
Offline installation
Technical support 24/7
Corporate training courses
Tailored solutions
Available integrations
ADQM
Arenadata QuickMarts
Available only for Enterprise
ADB
Available only for Enterprise
ADS
  • Spark Streaming integrates with Apache Kafka and Arenadata Streaming (ADS), enabling seamless real-time data ingestion, processing, and analysis at scale.
  • The Flink Apache Kafka connector provides high-performance stream processing, enabling real-time data analysis, transformation, and visualization at scale.
Oracle
  • The Spark JDBC connector connects Spark to any JDBC-compatible database, such as Oracle, unlocking new opportunities for data analysis, processing, and visualization.
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • The Flink JDBC connector allows reading data from and writing data to any relational database that provides a JDBC driver.
MS SQL
  • The Spark JDBC connector connects Spark to any JDBC-compatible database, such as MS SQL, unlocking new possibilities for data analysis, processing, and visualization.
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • The Flink JDBC connector allows reading data from and writing data to any relational database that provides a JDBC driver.
AWS S3
  • The Hadoop AWS module provides support for AWS integration.
  • The S3A connector provides a fast and efficient way to access data stored in Amazon Simple Storage Service (S3) from Spark applications.
  • The Flink S3 connector allows you to use S3 with Flink for reading and writing data, as well as with the streaming state backends.
Azure Storage
  • The Hadoop Azure module provides support for integration with Azure Blob Storage.
  • The Spark WASB (Windows Azure Storage Blob) connector is an Apache Spark library that enables Spark applications to read and write data in Azure Blob Storage.
Azure Data Lake
  • The Spark ABFS (Azure Blob File System) connector provides an API for Spark applications to read and write data directly in ADLS Gen2 without staging data on a local disk.
  • The Flink ABS connector allows you to use Azure Blob Storage with Flink for reading and writing data.
GCS
  • The Spark GS connector provides an API for Spark applications to read and write data directly in Google Cloud Storage (GCS) without staging data on a local disk.
  • The Flink GCS filesystem can be used for reading and writing data and for checkpoint storage.
JDBC
  • The Spark JDBC connector connects Spark to any JDBC-compatible database, unlocking new possibilities for data analysis, processing, and visualization.
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • The Flink JDBC connector allows reading data from and writing data to any relational database that provides a JDBC driver.
Solr
The Spark Solr integration is a library that allows Spark applications to read data from and write data to Apache Solr. With it, Spark applications can read data from Solr using SolrRDD, which parallelizes data processing across a Spark cluster.
Phoenix

The Spark Apache Phoenix integration is a library that enables Spark applications to interact with Apache Phoenix, an open-source SQL layer for Apache HBase that provides SQL-like syntax for querying and managing data stored in HBase.

With the Spark Apache Phoenix integration, Spark applications can read data from Phoenix tables using PhoenixRDD, which provides a distributed representation of the data stored in a Phoenix table.

Zeppelin

Apache Zeppelin is a web-based notebook interface for interactive data analytics with Apache Hadoop. It allows you to create and execute data-driven workflows using a variety of languages within a single, integrated environment.

Airflow
Airflow2 is a platform for creating, scheduling, and monitoring data workflows. It provides a web-based interface for creating and managing workflows, which can include tasks such as data ingestion, transformation, and loading.
AVRO
Avro is a binary data format designed to be compact and fast. It supports schema evolution, which allows data schemas to change over time without requiring data to be rewritten or reloaded.
PARQUET
Parquet is a columnar storage format optimized for processing large datasets. It stores data column by column, which allows faster access to individual columns and better compression ratios.
ORC
ORC (Optimized Row Columnar) is another columnar storage format, designed to be highly efficient and scalable. It supports compression and predicate push-down, which can greatly improve query performance.
DELTA
Delta is a transactional storage format built on top of Parquet that provides ACID transactions. It also supports schema evolution and offers features such as versioning and time travel.
XML
XML is a markup language used for representing structured data. Spark can handle XML data using libraries such as spark-xml.
JSON
JSON (JavaScript Object Notation) is a lightweight data format commonly used for exchanging data between applications. Spark has built-in support for reading and writing JSON data.
Operating systems
AltLinux 8.4 SP
Available only for Enterprise
CentOS 7
Supported
RedHat 7
Supported
AstraLinux
Currently in development
Components
Apache Impala

In development. Apache Impala is an open-source massively parallel processing (MPP) SQL query engine for processing large volumes of data in real time. It allows users to run interactive queries on Hadoop data stored in HDFS or Apache HBase. Impala was developed to provide a faster, more efficient SQL engine for big data than traditional batch-oriented engines.

Impala provides high-speed performance through its MPP architecture, which enables it to distribute processing across multiple nodes in a Hadoop cluster. It also includes support for advanced features such as complex joins, subqueries, and aggregation functions.

Impala is designed to be easy to use and integrate with existing BI and analytics tools. It supports standard SQL queries and JDBC/ODBC drivers for easy integration with a wide range of applications.

Apache Ozone

In development. Apache Ozone is an open-source, scalable, distributed object store designed for big data workloads. It is part of the Apache Hadoop ecosystem and was designed to overcome the scalability limits of the Hadoop Distributed File System (HDFS) for workloads with very large numbers of files and objects.

Ozone is designed to provide high performance and scalability for storing and processing large amounts of unstructured data such as log files, images, videos, and other large data objects. It is optimized for workloads that require high throughput and low latency, such as big data analytics, machine learning, and streaming data processing.

One of the key features of Ozone is its support for multiple storage classes, including hot, warm, and cold storage. This allows users to store data based on its access patterns and lifecycle, optimizing cost and performance.

Ozone also includes built-in data replication and distribution capabilities, enabling data to be stored across multiple nodes in a Hadoop cluster for improved availability and durability.

Apache ZooKeeper

Apache ZooKeeper is a distributed coordination service designed to help manage large distributed systems. It provides a centralized infrastructure for maintaining configuration information, naming, distributed synchronization, and group services. ZooKeeper is used extensively in Hadoop clusters to coordinate distributed components and to ensure that each node in the cluster is aware of the state of the others.

Hadoop Distributed File System (HDFS)

HDFS is a highly scalable and fault-tolerant distributed file system that forms the foundation of the ADH platform. It allows you to store large volumes of data across multiple nodes in a cluster, with built-in redundancy to ensure that data is always available, even in case of a node failure. HDFS is optimized for handling large files, making it an ideal choice for big data applications.

Apache YARN

YARN is a resource management and job scheduling framework that allows you to run multiple applications simultaneously on a Hadoop cluster. YARN enables you to allocate cluster resources dynamically, based on the needs of each application, and to monitor and manage those resources to ensure optimal performance.

Apache HBase

This is a NoSQL database that provides real-time read/write access to large datasets stored in Hadoop. HBase is designed to handle massive volumes of data and is optimized for random, real-time access to data, making it a popular choice for big data applications that require low-latency access to large datasets.

Apache Phoenix

Apache Phoenix is an open-source, SQL-like query engine for Hadoop that is designed to provide fast and efficient querying of large datasets. Phoenix is built on top of HBase, which means that it can handle massive amounts of data with low latency and provides support for real-time updates and access to data.

Apache Spark

Apache Spark is a fast and powerful open-source data processing engine that provides scalable, fault-tolerant data processing capabilities for big data workloads. The Apache Spark component of Arenadata Hadoop provides a high-performance and distributed computing framework that can process large datasets in parallel across a cluster of nodes. With its advanced analytics capabilities, including machine learning, graph processing, and SQL-like querying, Apache Spark can help businesses extract valuable insights from their data.
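Spark's processing model generalizes MapReduce. Stripped of the cluster machinery, the idea can be sketched in plain Python, with no Spark required (the input lines are made up):

```python
from collections import Counter
from functools import reduce

lines = ["big data platform", "big data analytics"]

# "map" step: split every line into words (flatMap in Spark terms).
words = [word for line in lines for word in line.split()]

# "reduce" step: merge per-word counts, the way reduceByKey merges
# partial results across partitions.
counts = reduce(lambda acc, word: acc + Counter([word]), words, Counter())
# counts["big"] == 2, counts["data"] == 2, counts["platform"] == 1
```

On a cluster, Spark runs the same two steps in parallel over partitions of the data; the sequential sketch only illustrates the dataflow.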

Apache Hive

Apache Hive is an open-source data warehouse infrastructure that provides data summarization, query, and analysis capabilities for large datasets stored in Hadoop. The Apache Hive component of Arenadata Hadoop provides a SQL-like interface for querying data in Hadoop, enabling businesses to perform ad-hoc queries, data analysis, and reporting. Hive translates SQL queries into MapReduce jobs, which can be executed on a Hadoop cluster. With its support for partitioning, indexing, and compression, Hive can help businesses optimize data storage and processing in Hadoop.

Apache Tez

Apache Tez is an open-source data processing framework that provides a flexible, efficient, and scalable way to execute complex data processing tasks on a Hadoop cluster. When used together with Apache Hive, Tez provides a faster and more efficient way to execute Hive queries, by replacing the MapReduce execution engine with a more optimized one. The Hive + Tez combination in Arenadata Hadoop provides a powerful and scalable platform for data warehousing, allowing businesses to perform ad-hoc queries, data analysis, and reporting at scale. With Tez's support for dynamic task scheduling and data partitioning, it can accelerate query processing by optimizing the data flow between Hive operators.

Apache Flink

Apache Flink is an open-source stream processing framework that enables the processing of large volumes of real-time data with low latency. The Apache Flink component of Arenadata Hadoop provides a distributed computing framework for real-time data processing that can be seamlessly integrated with batch processing. Flink supports event-driven processing and provides a unified programming model for both batch and stream processing, making it ideal for building end-to-end data processing pipelines. With its advanced features, including support for stateful streaming, windowing, and machine learning, Apache Flink can help businesses gain real-time insights from their data.
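A core Flink concept mentioned above is windowing: assigning events to time buckets before aggregating them. Stripped of Flink's API, a tumbling-window sum reduces to the following plain-Python sketch (event data and window size are made up):

```python
from collections import defaultdict

# (timestamp_seconds, value) events arriving on a stream.
events = [(1, 10), (4, 20), (6, 5), (11, 7), (13, 3)]

WINDOW = 5  # tumbling window size in seconds

# Assign each event to the window containing its timestamp and sum values.
windows = defaultdict(int)
for ts, value in events:
    window_start = (ts // WINDOW) * WINDOW
    windows[window_start] += value

# windows == {0: 30, 5: 5, 10: 10}
```

Flink additionally handles out-of-order events, state, and checkpointing around this basic bucketing idea.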

Apache Solr

Apache Solr is an open-source, enterprise-level search platform that is built on top of the Apache Lucene search library. Solr provides a robust and scalable search solution that is used by organizations of all sizes to power search functionality on their websites, mobile apps, and other applications.

Features
Time-saving
Reduced installation and configuration time compared to manual installation
Easy to use
Users can easily install and configure Hadoop without requiring extensive technical knowledge
Standardization
Standardized installation across multiple machines, reducing the risk of errors and inconsistencies
Increased efficiency
Reduced risk of system downtime and overall improved system efficiency
Expertise
Our team evaluates bug fixes and enhancements from the broader Hadoop community and determines which ones to incorporate into the product
Arenadata Platform Security
Enterprise edition
Arenadata Platform Security (ADPS) is a combination of two security components:
Apache Ranger
Apache Ranger is an open-source security framework that provides centralized policy management for Hadoop and other big data ecosystems. The Arenadata platform integrates with Apache Ranger to provide policy-based access control and fine-grained authorization for data and analytics applications.
Apache Knox
Apache Knox is an open-source gateway that provides secure access to Hadoop clusters and other big data systems. The Arenadata platform integrates with Apache Knox to provide secure access to the platform and its services.
Together, ADPS provides a comprehensive security framework that includes policy-based access control, fine-grained authorization, and secure access to the platform and its services. This helps organizations protect sensitive data and ensure compliance with regulations.
ADB Spark Connector
The ADB Spark connector enables high-speed, parallel data exchange between Apache Spark and Arenadata DB.
It has great flexibility in configuration and a multitude of features, including:
  • high speed of data transmission;
  • automatic data schema generation;
  • flexible partitioning;
  • support for push-down operators;
  • support for batch operations.
ADQM Spark Connector
A multifunctional connector supporting parallel read/write operations between Apache Spark and Arenadata QuickMarts.
It has great flexibility in configuration and a multitude of features, including:
  • high speed of data transmission;
  • automatic data schema generation;
  • flexible partitioning;
  • support for push-down operators;
  • support for batch operations.
Roadmap
2023
ADH 2.1.8
  • Airflow2: added the high availability mode
  • Airflow2: added LDAP authentication/authorization support
  • Airflow2: added support for external broker configuration
  • Hive version updated to 3.1.3 with some important fixes
ADH 2.1.7
  • Added the livy-spark3 component to the Spark3 service
  • Added the Apply configs from ADCM checkbox for all services
  • Flink build 1.15.1 is available
  • Added the ability to connect to Flink JobManager in the high availability mode
  • Added package checks optimizations for the installation
ADH 2.1.6
  • Added support for AltLinux 8.4
  • Added support for FreeIPA kerberization
  • Added support for customization of krb5.conf via ADCM
  • Added support for customization of ldap.conf via ADCM
ADH 2.1.4_b11
  • Added the ability to specify external nameservices
  • Added the ability to connect to HiveServer2 in the fault-tolerant mode
ADH 2.1.4_b10
  • The Rewrite current service SSL parameters checkbox is added for the Enable SSL action
  • Custom authentication (LDAP/AD) is enabled for HiveServer2
  • The Ranger plugin for Solr authorization is added
  • The ability to remove services from the cluster is added
  • The ability to customize configuration files via ADCM is added
  • The support of Kerberos REALM is added
ADH 2.1.4_b9
  • The Kerberos authentication is enabled for Web UI
  • The ability to configure SSL in the Hadoop clusters is added
ADH 2.1.4_b5
  • The ability to use Active Directory as Kerberos storage is implemented
  • The AD/LDAP/SIMPLE authorization is added for Zeppelin
ADH 2.1.4_b3
  • The MIT Kerberos integration is implemented in ADCM
  • The Ranger plugin is made operable on kerberized services
ADH 2.1.4_b2
  • Host actions are added
ADH 2.1.4_b1
  • The ability to use external PostgreSQL in Hive Metastore is added
  • Spark 3.1.1 is implemented for ADH 2.X
  • The offline installation is implemented for ADH
ADH 2.1.3
  • Implemented integration with Ranger 2.0.0
ADH 2.1.2.5
  • Client components for Flink are added
  • Client components for HDFS are added
  • Client components for YARN are added
ADH 2.1.2.3
  • The ADH bundle is divided into community and enterprise versions
  • The High Availability for NameNodes is implemented
ADH 2.1.2.2
  • The epel-release installation is disabled
  • Nginx is copied from the EPEL repository to the ADH2 repository
ADH 2.1.2.1
  • Solr 8.2.0 is added for ADH 2.2
  • Sqoop is added into the ADH bundle
ADH 2.1.2.0
  • The ability to configure Hive ACID is added
  • Flink is added into the ADH bundle
  • GPU support is enabled for YARN
  • Airflow is added into the ADH bundle
ADH 2.1.1
  • YARN Scheduler configuration is implemented
  • HDFS mover is implemented
  • The cluster-wide Install button is added to the ADCM UI
ADH 2.1.0
Implemented service management for the following services:
  • Livy Server
  • Zeppelin
  • Spark Thrift Server
  • Spark Server
  • Phoenix Server
  • HBase Thrift
  • HBase Region Server
  • HBase Master
  • Node Manager
  • Resource Manager
  • Timeline Service
  • WebHCat
  • MySQL
  • Hive Metastore
  • Hive Server
  • DataNodes
  • Secondary NameNodes
  • NameNodes