Add services

In ADCM, a service is software that performs a specific function. Examples of services for ADH clusters include HDFS, HBase, and Hive. To add services to a cluster, follow the steps below:

  1. Select a cluster on the Clusters page. To do this, click a cluster name in the Name column.

    Select a cluster
  2. Open the Services tab on the cluster page and click Add service.

    Switch to adding services
  3. In the dialog that opens, select the services to add to the cluster and click Add.

    Select services

    Brief descriptions of the available services are listed below.

    Services that can be added to the ADH cluster
    Service Purpose

    Airflow

    A service for creating, scheduling, and monitoring workflows in the form of Directed Acyclic Graphs (DAGs) of tasks. Can be used in Hadoop clusters for building ETL/ELT processes

    Airflow2

    In comparison with the Airflow 1.x version, it offers the following features: High Availability, lowered task latency, full REST API, TaskFlow API, task groups, independent providers and others

    Flink

    A distributed platform used in high-load Big Data applications for analyzing data stored in Hadoop clusters. It can be used in different streaming use cases: event-driven applications, stream and batch analytics, data pipelines and ETL, etc.

    HBase

    A non-relational, distributed database written in Java that runs on top of HDFS. It belongs to the class of column-oriented key-value stores and is useful for random, real-time read/write access to Big Data

    HDFS

    A distributed file system used in Hadoop clusters for storing large files. Provides streaming access to data that is distributed block by block across cluster nodes

    Hive

    Software designed for building data warehouses (DWH) and analyzing Big Data. It runs on top of HDFS and other compatible systems, such as Apache HBase. It facilitates writing, reading, and managing large datasets stored in distributed systems

    Impala

    Impala provides fast, interactive SQL queries on data stored in HDFS, HBase, or S3. In addition to the unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), and JDBC driver as Apache Hive. It makes Impala a unified platform for real-time or batch-oriented queries

    Kyuubi

    Apache Kyuubi is a distributed and multi-tenant gateway to provide SQL on Data Warehouses and Lakehouses. It builds distributed SQL query engines on top of various kinds of modern computing frameworks, e.g. Apache Spark, Flink, Hive, Impala, etc., to query massive datasets distributed over fleets of machines from heterogeneous data sources

    Monitoring

    A service that should be added if monitoring of the ADH cluster is planned

    MariaDB

    A relational database based on MySQL and compatible with it. Some MariaDB commands and interfaces are closer to NoSQL than to SQL. For example, it provides storage engines such as ColumnStore (for columnar data storage and distributed architecture support) and OQGRAPH (for storing tree and graph structures)

    Solr

    A search platform based on the Apache Lucene project. Its main features include full-text search, faceted search, highlighting search results, distributed indexing, integration with databases, processing documents with a complex format (Word, PDF, etc.), load-balanced querying, centralized configuration, and others

    Spark

    Spark 2.x. A fast analytics engine used for large-scale data processing and compatible with Hadoop data. It can run in Hadoop clusters using YARN or Spark’s standalone mode. It can process data in HDFS, HBase, Cassandra, Hive, and other Hadoop input formats. Supports both batch processing and newer workloads such as streaming, machine learning, interactive queries, etc.

    Spark3

    Spark 3.x. In comparison with the Spark 2.x version, it offers such new features as adaptive execution of Spark SQL, Dynamic Partition Pruning (DPP), graph processing, enhanced support for Deep Learning, and others

    Sqoop

    A service designed to transfer bulk data between Hadoop and relational databases or mainframes. You can use it, for example, to import data from MySQL, Oracle or other relational database management systems (RDBMS) to Hadoop clusters, convert the data in a certain way, and then export the data back to the RDBMS

    SSM

    Smart Storage Manager is a service that aims to optimize the efficiency of storing and managing data in the Hadoop Distributed File System. SSM collects HDFS operation data and system state information and, based on the collected metrics, can automatically apply techniques such as caching, storage policies, heterogeneous storage management (HSM), data compression, and Erasure Coding. In addition, SSM can be configured to asynchronously replicate data and namespaces to a backup cluster for disaster recovery (DR)

    YARN

    A service needed for managing cluster resources and scheduling/monitoring jobs. Uses a special daemon (ResourceManager) that abstracts all the computing resources of the cluster and manages their provision to distributed applications

    Zeppelin

    A web-based notebook service that enables interactive data analytics. It lets you query data in Hadoop clusters and display the results as tables, graphs, charts, etc.

    Zookeeper

    A centralized coordination service for distributed applications. It is used in Hadoop clusters for failure detection, active NameNode election, health monitoring, session management, etc.

    The minimal set of services recommended for ADH clusters is described below:

    • HDFS;

    • YARN;

    • Zookeeper (optional for the Community Edition of ADH).

    These services make up the core of Hadoop and are sufficient to organize distributed data storage and processing. The full list of services depends on the requirements of a particular project.
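    As a rough illustration of the recommendation above, the following Python sketch checks a planned service selection against the minimal set before you add the services in the ADCM UI. It is only a sketch: the function `missing_core_services`, the constant `RECOMMENDED_CORE`, and the `community_edition` flag are hypothetical names introduced here and are not part of ADCM or any Arenadata tooling.

```python
# Hypothetical helper for planning a service selection; it does not talk to ADCM.
# The service names mirror the recommended minimal set listed above.
RECOMMENDED_CORE = ("HDFS", "YARN", "Zookeeper")


def missing_core_services(selected, community_edition=False):
    """Return the recommended core services absent from `selected`, in list order."""
    # Compare names case-insensitively and ignore surrounding whitespace.
    chosen = {name.strip().lower() for name in selected}
    missing = []
    for service in RECOMMENDED_CORE:
        if service == "Zookeeper" and community_edition:
            continue  # Zookeeper is optional for the Community Edition of ADH
        if service.lower() not in chosen:
            missing.append(service)
    return missing


if __name__ == "__main__":
    plan = ["HDFS", "Hive", "Spark3"]
    print(missing_core_services(plan))                          # ['YARN', 'Zookeeper']
    print(missing_core_services(plan, community_edition=True))  # ['YARN']
```

    An empty result means the planned selection already covers the recommended core; any other services in the plan depend on the requirements of the project.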

  4. As a result, the added services are displayed on the Services tab.

    The result of successfully adding services to a cluster
NOTE
You can also add services later. The process of adding new services to an already running cluster does not differ from installing a service from scratch.