ADH architecture

Konstantin Alpashkin, Albert Bagdasaryan

Hadoop implements the following two main features:

Data management and storage.
Data processing and computation.

The components that make up the core of the Hadoop architecture are:

Hadoop Common is a set of the main libraries and utilities that Hadoop modules require.
Hadoop Distributed File System (HDFS) is a distributed file system that stores data on commodity machines and provides very high aggregate bandwidth across the cluster. It is patterned after the UNIX file system and provides POSIX-like semantics.
Hadoop YARN is a platform that manages computing resources in clusters and uses them to schedule user applications.
Hadoop MapReduce is an implementation of the MapReduce programming model for large-scale data processing.
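
To make the MapReduce programming model more concrete, here is a minimal single-process sketch of its three stages (map, shuffle, reduce) applied to word counting. It does not use the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and chosen only for illustration. In a real cluster, each stage would run in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three stages in sequence (illustrative, not distributed)."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

The map and reduce functions are independent of each other and of the data volume, which is what lets a framework distribute them over a cluster.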

Both MapReduce and HDFS were inspired by Google's papers "MapReduce: Simplified Data Processing on Large Clusters" and "The Google File System".

High-level ADH architecture

The Hadoop ecosystem is neither a programming language nor a service; it is a platform (framework) that solves big data problems. You can think of Hadoop as a set of services for loading, storing, analyzing, and maintaining big data. The diagram below shows the services supported by Hadoop.

Hadoop ecosystem

Data access:

Apache Zeppelin — a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python, R, and more.
Apache Kyuubi — a distributed multi-tenant gateway that provides SQL on data warehouses and lakehouses.
HUE — an SQL assistant with support for many databases, including Apache Impala and Hive. Its intelligent query editor, with features such as autocomplete, makes it a convenient tool for working with data stores.

Data processing:

Apache Spark — a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.
Apache Hive — data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and a JDBC driver are provided to connect to Hive.
Apache Flink — a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink is designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale.
Apache Impala — a fast database for querying data stored in a Hadoop cluster. Impala shares the metadata, SQL syntax (Hive SQL), and the JDBC driver with Apache Hive, which makes it a unified platform for real-time or batch-oriented queries.
Apache HBase — a Hadoop database, which is a distributed, scalable big data store. Use Apache HBase when you need random, real-time read/write access to your big data.
Apache Solr — a highly reliable, scalable, and fault tolerant platform, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration, and more. Solr powers the search and navigation features of many of the world’s largest internet sites.
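
The Hive description above mentions that structure can be projected onto data already in storage. The following sketch illustrates that idea in plain Python, without using Hive itself: untyped delimited text gets column names and types only at read time. The sample data `RAW`, the `SCHEMA` list, and the `project` and `total` helpers are all hypothetical and exist only for this example.

```python
import csv
import io

# Hypothetical raw file contents, as they might already sit in storage.
RAW = "1,alice,34.5\n2,bob,12.0\n"

# Hypothetical schema, declared after the data was written.
SCHEMA = [("id", int), ("name", str), ("amount", float)]


def project(raw_text, schema):
    """Apply column names and types to untyped delimited rows at read time."""
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        rows.append({name: cast(value) for (name, cast), value in zip(schema, record)})
    return rows


def total(rows, column):
    """Aggregate one numeric column, similar in spirit to SELECT SUM(column)."""
    return sum(row[column] for row in rows)


if __name__ == "__main__":
    table = project(RAW, SCHEMA)
    print(table)
    print(total(table, "amount"))
```

The stored text is never rewritten; a different schema could be projected onto the same bytes later.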

Security:

Apache Ranger — a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform.
Apache Knox — an application gateway for interacting with the REST APIs and UIs of Apache Hadoop deployments. Knox provides perimeter security so that an enterprise can confidently extend Hadoop access to more users while maintaining compliance with its security policies.

Data orchestration:

Apache Airflow — a platform to programmatically author, schedule, and monitor workflows.
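
A workflow of the kind an orchestrator schedules can be thought of as a directed acyclic graph of tasks. The sketch below shows only that underlying idea using the Python standard library (`graphlib`, Python 3.9+); it is not the Airflow API. The `WORKFLOW` mapping, the task names, and the `execution_order` and `run_workflow` helpers are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
WORKFLOW = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def execution_order(workflow):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(workflow).static_order())


def run_workflow(workflow, actions):
    """Call each task's action in dependency order and collect the results."""
    return [actions[task]() for task in execution_order(workflow)]


if __name__ == "__main__":
    # Each hypothetical action just reports that it ran.
    actions = {name: (lambda name=name: f"ran {name}") for name in WORKFLOW}
    for line in run_workflow(WORKFLOW, actions):
        print(line)
```

A real orchestrator adds scheduling, retries, and monitoring on top of this dependency ordering.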

Service management:

SSM — a service that aims to optimize the efficiency of storing and managing data in the Hadoop Distributed File System. It can also be configured to asynchronously replicate data and namespaces to a backup cluster for disaster recovery (DR).
