Arenadata Documentation
Our passion is to build efficient, flexible solutions that scale up to dozens of petabytes
Products
Explore our range of solutions in the world of Big Data
Overview
Arenadata Hadoop (ADH) is a commercial distribution of the open-source Apache Hadoop software. It is a big data platform designed for storing, processing, and analyzing large volumes of structured and unstructured data.
Arenadata Hadoop includes tools and components from the Hadoop ecosystem, such as the Hadoop Distributed File System (HDFS), MapReduce, YARN, and other Apache projects. It also includes additional software components and tools designed to make it easier to deploy, manage, and use Hadoop in enterprise environments.
Use cases
Big data analytics

ADH can be used to process and analyze large volumes of data, such as clickstream data, sensor data, social media data, and financial data. This can help businesses gain valuable insights into customer behavior, market trends, and other important metrics.

Machine learning and artificial intelligence

ADH can be used as a data processing platform for machine learning and artificial intelligence applications. This can help businesses to build predictive models, detect anomalies, and automate decision-making processes.

Data integration

ADH can be used to integrate data from multiple sources and formats into a unified, centralized data repository. This can help businesses to eliminate data silos and provide a single, consistent view of data.

Fraud detection and prevention

ADH can be used to detect and prevent fraud by analyzing large volumes of data in real time. This can help businesses identify and respond to fraudulent activities quickly, reducing losses and protecting their reputation.

Log analytics

ADH can be used to process and analyze log data generated by IT systems and applications. This can help businesses to troubleshoot issues, identify performance bottlenecks, and improve system reliability.

Enterprise
Community
Support for key Hadoop components
High availability and disaster recovery features
Advanced security features, including encryption and role-based access control
Automated management and monitoring tools
Deploy & upgrade automation
Offline installation
Technical support 24/7
Corporate training courses
Tailored solutions
Available integrations
ADQM
Arenadata QuickMarts
  • The ADQM Spark connector enables high-speed parallel data exchange between Apache Spark in ADH and Arenadata QuickMarts (ADQM).
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • The Flink JDBC connector allows reading data from and writing data to any relational database that provides a JDBC driver.
ADB
  • The ADB Spark connector enables high-speed parallel data exchange between Apache Spark and Arenadata DB (ADB).
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • The Flink JDBC connector allows reading data from and writing data to any relational database that provides a JDBC driver.
ADS
  • Spark Streaming integrates with Apache Kafka and Arenadata Streaming (ADS), enabling seamless real-time data ingestion, processing, and analysis at scale.
  • The Flink Apache Kafka connector provides high-performance stream processing, enabling real-time data analysis, transformation, and visualization at scale.
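As a minimal sketch of how Spark Structured Streaming is pointed at an ADS/Kafka topic: the option names below follow Spark's Kafka source, while the broker addresses and topic name are placeholders. The Spark calls themselves are shown as comments, since they require a live SparkSession and cluster.

```python
# Hypothetical connection settings for consuming a Kafka/ADS topic
# from Spark Structured Streaming; hosts and topic are made up.
kafka_options = {
    "kafka.bootstrap.servers": "ads-broker1:9092,ads-broker2:9092",
    "subscribe": "clickstream",        # topic to consume
    "startingOffsets": "earliest",     # replay the topic from the beginning
}

# With a live SparkSession the stream would be wired up roughly like this:
# df = (spark.readStream.format("kafka")
#       .options(**kafka_options)
#       .load())
# query = df.writeStream.format("console").start()
```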
Oracle
  • The Spark JDBC connector connects Spark to any JDBC-compatible database, such as Oracle, unlocking new opportunities for data analysis, processing, and visualization.
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • The Flink JDBC connector allows reading data from and writing data to any relational database that provides a JDBC driver.
MS SQL
  • The Spark JDBC connector connects Spark to any JDBC-compatible database, such as MS SQL, unlocking new possibilities for data analysis, processing, and visualization.
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • The Flink JDBC connector allows reading data from and writing data to any relational database that provides a JDBC driver.
AWS S3
  • The Hadoop AWS module provides support for AWS integration.
  • The S3A connector provides a fast and efficient way to access data stored in Amazon Simple Storage Service (S3) from Spark applications.
  • The Flink S3 connector allows you to use S3 with Flink for reading and writing data, as well as with the streaming state backends.
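The S3A connector is driven by Hadoop configuration keys. The key names below are the standard `fs.s3a.*` properties; the endpoint, credentials, and bucket name are placeholders.

```python
# Standard Hadoop S3A configuration keys; values are placeholders.
s3a_conf = {
    "fs.s3a.endpoint": "s3.eu-central-1.amazonaws.com",
    "fs.s3a.access.key": "EXAMPLEACCESSKEY",    # placeholder credential
    "fs.s3a.secret.key": "EXAMPLESECRETKEY",    # placeholder credential
    "fs.s3a.connection.maximum": "64",          # pool size for parallel reads
}

# In Spark the same keys are passed with the spark.hadoop. prefix:
# for key, value in s3a_conf.items():
#     spark.conf.set("spark.hadoop." + key, value)
# df = spark.read.parquet("s3a://my-bucket/events/")   # bucket is made up
```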
Azure Storage
  • The Hadoop Azure module provides support for integration with Azure Blob Storage.
  • The Spark WASB (Windows Azure Storage Blob) connector is an Apache Spark library that enables Spark applications to read and write data in Azure Blob Storage.
Azure Data Lake
  • The Spark ABFS (Azure Blob File System) connector provides an API for Spark applications to read and write data directly in ADLS Gen2 without staging data on a local disk.
  • The Flink ABS connector allows you to use Azure Blob Storage with Flink for reading and writing data.
GCS
  • The Spark GS connector provides an API for Spark applications to read and write data directly in Google Cloud Storage (GCS) without staging data on a local disk.
  • The Flink GCS filesystem can be used for reading and writing data and for checkpoint storage.
JDBC
  • The Spark JDBC connector connects Spark to any JDBC-compatible database, unlocking new possibilities for data analysis, processing, and visualization.
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • The Flink JDBC connector allows reading data from and writing data to any relational database that provides a JDBC driver.
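The Spark JDBC source is configured through a handful of options, where `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` control how the read is split into parallel tasks. The URL, table, and credentials below are placeholders; the option names are Spark's standard JDBC options.

```python
# Hypothetical JDBC settings; URL, table, and credentials are placeholders.
jdbc_options = {
    "url": "jdbc:oracle:thin:@//db-host:1521/ORCLPDB1",
    "dbtable": "SALES.ORDERS",
    "user": "spark_reader",
    "password": "secret",
    "numPartitions": "8",            # number of parallel read tasks
    "partitionColumn": "ORDER_ID",   # numeric column to split the read on
    "lowerBound": "1",               # lower edge of the partitioning range
    "upperBound": "1000000",         # upper edge of the partitioning range
}

# With a live SparkSession:
# df = spark.read.format("jdbc").options(**jdbc_options).load()
```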
Solr
The Spark Solr integration is a library that allows Spark applications to read data from and write data to Apache Solr. With it, Spark applications can read data from Solr using SolrRDD, which parallelizes data processing across a Spark cluster.
Phoenix

The Spark Apache Phoenix integration is a library that enables Spark applications to interact with Apache Phoenix, an open-source SQL layer for Apache HBase that provides SQL-like syntax for querying and managing data stored in HBase.

With the Spark Apache Phoenix integration, Spark applications can read data from Phoenix tables using PhoenixRDD, which provides a distributed representation of the data stored in a Phoenix table.

Zeppelin

Apache Zeppelin is a web-based notebook interface for interactive data analytics with Apache Hadoop. It allows you to create and execute data-driven workflows using a variety of languages within a single, integrated environment.

Airflow
Airflow2 is a platform for creating, scheduling, and monitoring data workflows. It provides a web-based interface for creating and managing workflows, which can include tasks such as data ingestion, transformation, and loading.
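An Airflow task is ordinary Python code that the scheduler invokes. A minimal sketch, with made-up task names and logic; the DAG wiring is left as comments because it assumes an Airflow 2 installation.

```python
# Plain-Python task callables; the task logic here is illustrative only.
def extract():
    """Pretend to pull rows from a source system."""
    return [1, 2, 3]

def transform(rows):
    """Pretend to transform the extracted rows."""
    return [row * 10 for row in rows]

# With Airflow 2 installed, the same callables would be wired into a DAG:
# from airflow import DAG
# from airflow.operators.python import PythonOperator
# with DAG("etl_sketch", schedule="@daily") as dag:
#     PythonOperator(task_id="extract", python_callable=extract)

# Running the pipeline by hand, outside the scheduler:
result = transform(extract())
# result == [10, 20, 30]
```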
AVRO
Avro is a binary data format designed to be compact and fast. It supports schema evolution, which allows data schemas to change over time without requiring data to be rewritten or reloaded.
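Schema evolution boils down to resolving a record written with an older schema against a newer reader schema, filling added fields from defaults. Avro's wire format is binary, but the resolution idea can be sketched in plain Python (field names and values are made up):

```python
# A record written with schema v1 (no "location" field).
record_v1 = {"id": 42, "name": "sensor-a"}

# The reader's newer schema v2, expressed here as field -> default value.
schema_v2_defaults = {"id": None, "name": None, "location": "unknown"}

def read_with_schema(record, defaults):
    """Resolve a record against a newer schema by filling defaulted fields."""
    return {field: record.get(field, default)
            for field, default in defaults.items()}

resolved = read_with_schema(record_v1, schema_v2_defaults)
# resolved == {"id": 42, "name": "sensor-a", "location": "unknown"}
```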
PARQUET
Parquet is a columnar storage format optimized for processing large datasets. It stores data column by column, which allows faster access to individual columns and better compression ratios.
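The difference between row-oriented and columnar layout can be shown in a few lines of plain Python (the sample data is made up):

```python
# Three records in row-oriented form, as a JSON or Avro file stores them.
rows = [
    {"id": 1, "city": "Berlin", "amount": 10.0},
    {"id": 2, "city": "Berlin", "amount": 12.5},
    {"id": 3, "city": "Munich", "amount": 7.0},
]

# The same data pivoted into columnar form, as Parquet lays it out on disk.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Scanning a single column now touches only that column's values,
# and runs of equal values ("Berlin", "Berlin") compress well.
total = sum(columns["amount"])   # 29.5
```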
ORC
ORC (Optimized Row Columnar) is another columnar storage format, designed to be highly efficient and scalable. It supports compression and predicate push-down, which can greatly improve query performance.
DELTA
Delta is a transactional storage format built on top of Parquet that provides ACID transactions. It also supports schema evolution and offers features such as versioning and time travel.
XML
XML is a markup language used for representing structured data. Spark can handle XML data using libraries such as spark-xml.
JSON
JSON (JavaScript Object Notation) is a lightweight data format commonly used for exchanging data between applications. Spark has built-in support for reading and writing JSON data.
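For example, reading and writing a single JSON line with Python's standard library, in the same one-object-per-line shape Spark's JSON source expects (the record content is made up):

```python
import json

# One record per line, the shape Spark's JSON data source expects.
line = '{"user": "alice", "event": "click", "ts": 1700000000}'

record = json.loads(line)        # parse the JSON line into a dict
record["event"] = "purchase"     # modify a field
line_out = json.dumps(record)    # serialize back to a JSON string
```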
Operating systems
AltLinux 8.4 SP
Supported
CentOS 7
Supported
RedHat 7
Supported
AstraLinux
Currently in development
Support for key Hadoop components
High availability and disaster recovery features
Advanced security features, including encryption and role-based access control
Automated management and monitoring tools
Deploy & upgrade automation
Offline installation
Technical support 24/7
Corporate training courses
Tailored solutions
Available integrations
ADQM
Arenadata QuickMarts
Available only for Enterprise
ADB
Available only for Enterprise
ADS
  • Spark Streaming integrates with Apache Kafka and Arenadata Streaming (ADS), enabling seamless real-time data ingestion, processing, and analysis at scale.
  • The Flink Apache Kafka connector provides high-performance stream processing, enabling real-time data analysis, transformation, and visualization at scale.
Oracle
  • The Spark JDBC connector connects Spark to any JDBC-compatible database, such as Oracle, unlocking new opportunities for data analysis, processing, and visualization.
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • The Flink JDBC connector allows reading data from and writing data to any relational database that provides a JDBC driver.
MS SQL
  • The Spark JDBC connector connects Spark to any JDBC-compatible database, such as MS SQL, unlocking new possibilities for data analysis, processing, and visualization.
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • The Flink JDBC connector allows reading data from and writing data to any relational database that provides a JDBC driver.
AWS S3
  • The Hadoop AWS module provides support for AWS integration.
  • The S3A connector provides a fast and efficient way to access data stored in Amazon Simple Storage Service (S3) from Spark applications.
  • The Flink S3 connector allows you to use S3 with Flink for reading and writing data, as well as with the streaming state backends.
Azure Storage
  • The Hadoop Azure module provides support for integration with Azure Blob Storage.
  • The Spark WASB (Windows Azure Storage Blob) connector is an Apache Spark library that enables Spark applications to read and write data in Azure Blob Storage.
Azure Data Lake
  • The Spark ABFS (Azure Blob File System) connector provides an API for Spark applications to read and write data directly in ADLS Gen2 without staging data on a local disk.
  • The Flink ABS connector allows you to use Azure Blob Storage with Flink for reading and writing data.
GCS
  • The Spark GS connector provides an API for Spark applications to read and write data directly in Google Cloud Storage (GCS) without staging data on a local disk.
  • The Flink GCS filesystem can be used for reading and writing data and for checkpoint storage.
JDBC
  • The Spark JDBC connector connects Spark to any JDBC-compatible database, unlocking new possibilities for data analysis, processing, and visualization.
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • The Flink JDBC connector allows reading data from and writing data to any relational database that provides a JDBC driver.
Solr
The Spark Solr integration is a library that allows Spark applications to read data from and write data to Apache Solr. With it, Spark applications can read data from Solr using SolrRDD, which parallelizes data processing across a Spark cluster.
Phoenix

The Spark Apache Phoenix integration is a library that enables Spark applications to interact with Apache Phoenix, an open-source SQL layer for Apache HBase that provides SQL-like syntax for querying and managing data stored in HBase.

With the Spark Apache Phoenix integration, Spark applications can read data from Phoenix tables using PhoenixRDD, which provides a distributed representation of the data stored in a Phoenix table.

Zeppelin

Apache Zeppelin is a web-based notebook interface for interactive data analytics with Apache Hadoop. It allows you to create and execute data-driven workflows using a variety of languages within a single, integrated environment.

Airflow
Airflow2 is a platform for creating, scheduling, and monitoring data workflows. It provides a web-based interface for creating and managing workflows, which can include tasks such as data ingestion, transformation, and loading.
AVRO
Avro is a binary data format designed to be compact and fast. It supports schema evolution, which allows data schemas to change over time without requiring data to be rewritten or reloaded.
PARQUET
Parquet is a columnar storage format optimized for processing large datasets. It stores data column by column, which allows faster access to individual columns and better compression ratios.
ORC
ORC (Optimized Row Columnar) is another columnar storage format, designed to be highly efficient and scalable. It supports compression and predicate push-down, which can greatly improve query performance.
DELTA
Delta is a transactional storage format built on top of Parquet that provides ACID transactions. It also supports schema evolution and offers features such as versioning and time travel.
XML
XML is a markup language used for representing structured data. Spark can handle XML data using libraries such as spark-xml.
JSON
JSON (JavaScript Object Notation) is a lightweight data format commonly used for exchanging data between applications. Spark has built-in support for reading and writing JSON data.
Operating systems
AltLinux 8.4 SP
Available only for Enterprise
CentOS 7
Supported
RedHat 7
Supported
AstraLinux
Currently in development
Components
Apache Impala

In development. Apache Impala is an open-source massively parallel processing (MPP) SQL query engine for processing large volumes of data in real time. It allows users to run interactive queries on Hadoop data stored in HDFS or Apache HBase. Impala was developed to provide a faster, more efficient SQL engine for big data than traditional batch-oriented engines.

Impala provides high-speed performance through its MPP architecture, which enables it to distribute processing across multiple nodes in a Hadoop cluster. It also includes support for advanced features such as complex joins, subqueries, and aggregation functions.

Impala is designed to be easy to use and integrate with existing BI and analytics tools. It supports standard SQL queries and JDBC/ODBC drivers for easy integration with a wide range of applications.

Apache Ozone

In development. Apache Ozone is an open-source, scalable, distributed object store designed for big data workloads. It is part of the Apache Hadoop ecosystem and was designed to overcome the scalability limits of the Hadoop Distributed File System (HDFS) for workloads with very large numbers of files and objects.

Ozone is designed to provide high performance and scalability for storing and processing large amounts of unstructured data such as log files, images, videos, and other large data objects. It is optimized for workloads that require high throughput and low latency, such as big data analytics, machine learning, and streaming data processing.

One of the key features of Ozone is its support for multiple storage classes, including hot, warm, and cold storage. This allows users to store data based on its access patterns and lifecycle, optimizing cost and performance.

Ozone also includes built-in data replication and distribution capabilities, enabling data to be stored across multiple nodes in a Hadoop cluster for improved availability and durability.

Apache ZooKeeper

Apache ZooKeeper is a distributed coordination service designed to help manage large distributed systems. It provides a centralized infrastructure for maintaining configuration information, naming, distributed synchronization, and group services. ZooKeeper is used extensively in Hadoop clusters to coordinate distributed components and to ensure that each node in the cluster is aware of the state of the others.

Hadoop Distributed File System (HDFS)

HDFS is a highly scalable and fault-tolerant distributed file system that forms the foundation of the ADH platform. It allows you to store large volumes of data across multiple nodes in a cluster, with built-in redundancy to ensure that data is always available, even in case of a node failure. HDFS is optimized for handling large files, making it an ideal choice for big data applications.

Apache YARN

YARN is a resource management and job scheduling framework that allows you to run multiple applications simultaneously on a Hadoop cluster. YARN enables you to allocate cluster resources dynamically, based on the needs of each application, and to monitor and manage those resources to ensure optimal performance.

Apache HBase

This is a NoSQL database that provides real-time read/write access to large datasets stored in Hadoop. HBase is designed to handle massive volumes of data and is optimized for random, real-time access to data, making it a popular choice for big data applications that require low-latency access to large datasets.

Apache Phoenix

Apache Phoenix is an open-source, SQL-like query engine for Hadoop that is designed to provide fast and efficient querying of large datasets. Phoenix is built on top of HBase, which means that it can handle massive amounts of data with low latency and provides support for real-time updates and access to data.

Apache Spark

Apache Spark is a fast and powerful open-source data processing engine that provides scalable, fault-tolerant data processing capabilities for big data workloads. The Apache Spark component of Arenadata Hadoop provides a high-performance and distributed computing framework that can process large datasets in parallel across a cluster of nodes. With its advanced analytics capabilities, including machine learning, graph processing, and SQL-like querying, Apache Spark can help businesses extract valuable insights from their data.
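Spark's processing model generalizes MapReduce. Stripped of the cluster machinery, the idea can be sketched in plain Python, with no Spark required (the input lines are made up):

```python
from collections import Counter
from functools import reduce

lines = ["big data platform", "big data analytics"]

# "map" step: split every line into words (flatMap in Spark terms).
words = [word for line in lines for word in line.split()]

# "reduce" step: merge per-word counts, the way reduceByKey merges
# partial results across partitions.
counts = reduce(lambda acc, word: acc + Counter([word]), words, Counter())
# counts["big"] == 2, counts["data"] == 2, counts["platform"] == 1
```

On a cluster, Spark runs the same two steps in parallel over partitions of the data; the sequential sketch only illustrates the dataflow.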

Apache Hive

Apache Hive is an open-source data warehouse infrastructure that provides data summarization, query, and analysis capabilities for large datasets stored in Hadoop. The Apache Hive component of Arenadata Hadoop provides a SQL-like interface for querying data in Hadoop, enabling businesses to perform ad-hoc queries, data analysis, and reporting. Hive translates SQL queries into MapReduce jobs, which can be executed on a Hadoop cluster. With its support for partitioning, indexing, and compression, Hive can help businesses optimize data storage and processing in Hadoop.

Apache Tez

Apache Tez is an open-source data processing framework that provides a flexible, efficient, and scalable way to execute complex data processing tasks on a Hadoop cluster. When used together with Apache Hive, Tez provides a faster and more efficient way to execute Hive queries, by replacing the MapReduce execution engine with a more optimized one. The Hive + Tez combination in Arenadata Hadoop provides a powerful and scalable platform for data warehousing, allowing businesses to perform ad-hoc queries, data analysis, and reporting at scale. With Tez's support for dynamic task scheduling and data partitioning, it can accelerate query processing by optimizing the data flow between Hive operators.

Apache Flink

Apache Flink is an open-source stream processing framework that enables the processing of large volumes of real-time data with low latency. The Apache Flink component of Arenadata Hadoop provides a distributed computing framework for real-time data processing that can be seamlessly integrated with batch processing. Flink supports event-driven processing and provides a unified programming model for both batch and stream processing, making it ideal for building end-to-end data processing pipelines. With its advanced features, including support for stateful streaming, windowing, and machine learning, Apache Flink can help businesses gain real-time insights from their data.
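A core Flink concept mentioned above is windowing: assigning events to time buckets before aggregating them. Stripped of Flink's API, a tumbling-window sum reduces to the following plain-Python sketch (event data and window size are made up):

```python
from collections import defaultdict

# (timestamp_seconds, value) events arriving on a stream.
events = [(1, 10), (4, 20), (6, 5), (11, 7), (13, 3)]

WINDOW = 5  # tumbling window size in seconds

# Assign each event to the window containing its timestamp and sum values.
windows = defaultdict(int)
for ts, value in events:
    window_start = (ts // WINDOW) * WINDOW
    windows[window_start] += value

# windows == {0: 30, 5: 5, 10: 10}
```

Flink additionally handles out-of-order events, state, and checkpointing around this basic bucketing idea.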

Apache Solr

Apache Solr is an open-source, enterprise-level search platform that is built on top of the Apache Lucene search library. Solr provides a robust and scalable search solution that is used by organizations of all sizes to power search functionality on their websites, mobile apps, and other applications.

Features
Time-saving
Reduced installation and configuration time compared to manual installation
Easy to use
Users can easily install and configure Hadoop without requiring extensive technical knowledge
Standardization
Standardized installation across multiple machines, reducing the risk of errors and inconsistencies
Increased efficiency
Reduced risk of system downtime and overall improved system efficiency
Expertise
Our team evaluates bug fixes and enhancements from the broader Hadoop community and determines which ones to incorporate into the product
Arenadata Platform Security
Enterprise edition
Arenadata Platform Security (ADPS) is a combination of two security components:
Apache Ranger
Apache Ranger is an open-source security framework that provides centralized policy management for Hadoop and other big data ecosystems. The Arenadata platform integrates with Apache Ranger to provide policy-based access control and fine-grained authorization for data and analytics applications.
Apache Knox
Apache Knox is an open-source gateway that provides secure access to Hadoop clusters and other big data systems. The Arenadata platform integrates with Apache Knox to provide secure access to the platform and its services.
Together, ADPS provides a comprehensive security framework that includes policy-based access control, fine-grained authorization, and secure access to the platform and its services. This helps organizations protect sensitive data and ensure compliance with regulations.
ADB Spark Connector
The ADB Spark connector enables high-speed, parallel data exchange between Apache Spark and Arenadata DB.
It has great flexibility in configuration and a multitude of features, including:
  • high speed of data transmission;
  • automatic data schema generation;
  • flexible partitioning;
  • support for push-down operators;
  • support for batch operations.
ADQM Spark Connector
A multifunctional connector supporting parallel read/write operations between Apache Spark and Arenadata QuickMarts.
It has great flexibility in configuration and a multitude of features, including:
  • high speed of data transmission;
  • automatic data schema generation;
  • flexible partitioning;
  • support for push-down operators;
  • support for batch operations.
Roadmap
2023
ADH 2.1.8
  • Airflow2: added the high availability mode
  • Airflow2: added LDAP authentication/authorization support
  • Airflow2: added support for external broker configuration
  • Hive version updated to 3.1.3 with some important fixes
ADH 2.1.7
  • Added the livy-spark3 component to the Spark3 service
  • Added the Apply configs from ADCM checkbox for all services
  • Flink build 1.15.1 is available
  • Added the ability to connect to Flink JobManager in the high availability mode
  • Added package checks optimizations for the installation
ADH 2.1.6
  • Added support for AltLinux 8.4
  • Added support for FreeIPA kerberization
  • Added support for customization of krb5.conf via ADCM
  • Added support for customization of ldap.conf via ADCM
ADH 2.1.4_b11
  • Added the ability to specify external nameservices
  • Added the ability to connect to HiveServer2 in the fault-tolerant mode
ADH 2.1.4_b10
  • The Rewrite current service SSL parameters checkbox is added for the Enable SSL action
  • Custom authentication (LDAP/AD) is enabled for HiveServer2
  • The Ranger plugin for Solr authorization is added
  • The ability to remove services from the cluster is added
  • The ability to customize configuration files via ADCM is added
  • The support of Kerberos REALM is added
ADH 2.1.4_b9
  • The Kerberos authentication is enabled for Web UI
  • The ability to configure SSL in the Hadoop clusters is added
ADH 2.1.4_b5
  • The ability to use Active Directory as Kerberos storage is implemented
  • The AD/LDAP/SIMPLE authorization is added for Zeppelin
ADH 2.1.4_b3
  • The MIT Kerberos integration is implemented in ADCM
  • The Ranger plugin is made operable on kerberized services
ADH 2.1.4_b2
  • Host actions are added
ADH 2.1.4_b1
  • The ability to use external PostgreSQL in Hive Metastore is added
  • Spark 3.1.1 is implemented for ADH 2.X
  • The offline installation is implemented for ADH
ADH 2.1.3
  • Implemented integration with Ranger 2.0.0
ADH 2.1.2.5
  • Client components for Flink are added
  • Client components for HDFS are added
  • Client components for YARN are added
ADH 2.1.2.3
  • The ADH bundle is divided into community and enterprise versions
  • The High Availability for NameNodes is implemented
ADH 2.1.2.2
  • The epel-release installation is disabled
  • Nginx is copied from the EPEL repository to the ADH2 repository
ADH 2.1.2.1
  • Solr 8.2.0 is added for ADH 2.2
  • Sqoop is added into the ADH bundle
ADH 2.1.2.0
  • The ability to configure Hive ACID is added
  • Flink is added into the ADH bundle
  • GPU support is enabled for YARN
  • Airflow is added into the ADH bundle
ADH 2.1.1
  • YARN Scheduler configuration is implemented
  • HDFS mover is implemented
  • The cluster-wide Install button is added to the ADCM UI
ADH 2.1.0
Implemented service management for the following services:
  • Livy Server
  • Zeppelin
  • Spark Thrift Server
  • Spark Server
  • Phoenix Server
  • HBase Thrift
  • HBase Region Server
  • HBase Master
  • Node Manager
  • Resource Manager
  • Timeline Service
  • WebHCat
  • MySQL
  • Hive Metastore
  • Hive Server
  • DataNodes
  • Secondary NameNodes
  • NameNodes