ADH architecture

Hadoop provides two main capabilities:

  • Data storage and management.

  • Data processing and computation.

The components that make up the core of the Hadoop architecture are:

  • Hadoop Common is a set of shared libraries and utilities required by other Hadoop modules.

  • Hadoop Distributed File System (HDFS) is a distributed file system that stores data on commodity machines, provides very high aggregate bandwidth across the cluster, is patterned after the UNIX file system, and provides POSIX-like semantics. A minimal usage sketch appears below.

  • Hadoop YARN is a platform responsible for managing computing resources in clusters and scheduling user applications on them.

  • Hadoop MapReduce is an implementation of the MapReduce programming model for large-scale data processing.

Both MapReduce and HDFS were inspired by Google’s papers MapReduce: Simplified Data Processing on Large Clusters and The Google File System.
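To illustrate how applications typically interact with HDFS, here is a minimal Java sketch that writes a file and reads it back through the Hadoop FileSystem API. The NameNode address and the file path are placeholders for demonstration, not part of an actual ADH configuration.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address -- adjust to your cluster
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/tmp/hdfs-example.txt");

            // Write a small file (overwriting it if it already exists)
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back and print the contents to stdout
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```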

High-level ADH architecture

The Hadoop ecosystem is neither a programming language nor a service; it is a platform (framework) that solves big data problems. You can think of Hadoop as a set of services for loading, storing, analyzing, and maintaining big data. The diagram below shows the services supported by ADH.

ADH services

Data access and processing:

  • Apache Spark — a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing. A minimal Spark SQL example appears after this list.

  • Apache Hive — a data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and a JDBC driver are provided to connect to Hive; a JDBC connection sketch follows this list.

  • Apache Phoenix — an open source, massively parallel, relational database engine that supports online transaction processing (OLTP) for Hadoop and uses Apache HBase as its backing store. Phoenix provides a JDBC driver that hides the intricacies of the NoSQL store and allows the following:

    • create, delete, and alter SQL tables, views, indexes, and sequences;

    • insert and delete rows in bulk;

    • query data through SQL.

    Phoenix compiles queries and other statements into native NoSQL store APIs rather than using MapReduce. This makes it possible to build low-latency applications on top of NoSQL stores. A Phoenix JDBC sketch also appears after this list.

  • Apache Solr — a highly reliable, scalable, and fault-tolerant platform, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration, and more. Solr powers the search and navigation features of many of the world’s largest internet sites. An indexing and query sketch follows this list as well.

  • Apache Tez — an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves on the MapReduce paradigm by dramatically increasing processing speed while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third-party data access applications developed for the broader Hadoop ecosystem.

  • Apache HBase — the Hadoop database: a distributed, scalable big data store. Use Apache HBase when you need random, real-time read/write access to your big data (a put/get sketch follows this list).
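
The following minimal Java sketch illustrates the Spark SQL APIs mentioned above: it loads a JSON file into a DataFrame and queries it with plain SQL. The local[*] master and the input path are placeholders for demonstration; on a real cluster the application would typically be submitted through spark-submit and YARN.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlExample {
    public static void main(String[] args) {
        // local[*] runs Spark in-process; on a cluster the master is set by spark-submit/YARN
        SparkSession spark = SparkSession.builder()
                .appName("spark-sql-example")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical input path -- any JSON file with "name" and "age" fields would do
        Dataset<Row> people = spark.read().json("/tmp/people.json");
        people.createOrReplaceTempView("people");

        // Run a plain SQL query over the DataFrame
        Dataset<Row> adults = spark.sql("SELECT name, age FROM people WHERE age >= 18");
        adults.show();

        spark.stop();
    }
}
```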
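
The next sketch shows a connection to Hive through the JDBC driver mentioned above. It assumes a reachable HiveServer2 instance and the Hive JDBC driver on the classpath; the host, port, database, credentials, and table name are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 address and database -- adjust to your deployment
        String url = "jdbc:hive2://hiveserver2.example.com:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // Create a table if it does not exist yet
            stmt.execute("CREATE TABLE IF NOT EXISTS demo (id INT, name STRING)");

            // Query it back over JDBC
            try (ResultSet rs = stmt.executeQuery("SELECT id, name FROM demo LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + "\t" + rs.getString("name"));
                }
            }
        }
    }
}
```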
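
The Phoenix JDBC usage described above can be sketched as follows; the ZooKeeper quorum address and the table definition are illustrative only. Note that Phoenix uses UPSERT rather than INSERT and does not auto-commit by default.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixJdbcExample {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper quorum -- Phoenix connects through the HBase ZooKeeper ensemble
        String url = "jdbc:phoenix:zookeeper.example.com:2181";

        try (Connection conn = DriverManager.getConnection(url)) {
            // DDL: create an SQL table backed by HBase
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE TABLE IF NOT EXISTS users (id BIGINT PRIMARY KEY, name VARCHAR)");
            }

            // Phoenix uses UPSERT for inserts and updates
            try (PreparedStatement ps = conn.prepareStatement("UPSERT INTO users VALUES (?, ?)")) {
                ps.setLong(1, 1L);
                ps.setString(2, "alice");
                ps.executeUpdate();
            }
            conn.commit();   // Phoenix connections do not auto-commit by default

            // Query the data through SQL
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT id, name FROM users")) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + "\t" + rs.getString("name"));
                }
            }
        }
    }
}
```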
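
A minimal SolrJ sketch for indexing and querying a document might look like this; the Solr URL, collection name, and field names are assumptions for illustration.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class SolrExample {
    public static void main(String[] args) throws Exception {
        // Placeholder Solr URL and collection name -- adjust to your installation
        try (SolrClient client =
                 new HttpSolrClient.Builder("http://solr.example.com:8983/solr/demo").build()) {

            // Index one document
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("title_s", "Hello Solr");
            client.add(doc);
            client.commit();

            // Query it back and print the matching document IDs
            QueryResponse response = client.query(new SolrQuery("title_s:Hello"));
            response.getResults().forEach(d -> System.out.println(d.getFieldValue("id")));
        }
    }
}
```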
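
Finally, a minimal sketch of random read/write access with the HBase Java client. It assumes a table named users with a column family info already exists; the ZooKeeper quorum address is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Placeholder ZooKeeper quorum -- adjust to your cluster
        conf.set("hbase.zookeeper.quorum", "zookeeper.example.com");

        // Assumes a table 'users' with column family 'info' already exists
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Read it back
            Result result = table.get(new Get(Bytes.toBytes("row-1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```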

Security:

  • Apache Ranger — a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform.

  • Apache Atlas — a scalable and extensible set of core foundational governance services that enable enterprises to effectively and efficiently meet their compliance requirements within Hadoop. Atlas allows integration with the whole enterprise data ecosystem.

  • Apache Knox — an application gateway for interacting with the REST APIs and UIs of Apache Hadoop deployments. Knox provides perimeter security, so the enterprise can confidently extend Hadoop access to new users while maintaining compliance with enterprise security policies. A gateway call sketch follows this list.
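
As an illustration of perimeter access through Knox, the following sketch lists an HDFS directory by calling the WebHDFS REST API through the gateway. The gateway host, topology name (default), credentials, and path are placeholders, and the example assumes the gateway's TLS certificate is trusted by the JVM.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class KnoxWebHdfsExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical gateway host, topology, credentials, and path -- adjust for your deployment
        String url = "https://knox.example.com:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS";
        String auth = Base64.getEncoder()
                .encodeToString("user:password".getBytes(StandardCharsets.UTF_8));

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Authorization", "Basic " + auth);

        // Print the JSON directory listing returned by WebHDFS via the Knox gateway
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```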

Data orchestration:

  • Apache Airflow — a platform to programmatically author, schedule, and monitor workflows.

  • Apache Flink — a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink is designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale. A minimal DataStream sketch follows this list.

  • Apache Zeppelin — a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python, R, and more.
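
Here is a minimal Flink DataStream sketch in Java that counts words in a small bounded stream. In a real deployment the source would typically be Kafka, files, or another connector, and the job would be submitted to a Flink or YARN cluster rather than run from a main method like this.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FlinkWordCountExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A small bounded stream for demonstration; real sources would be Kafka, files, etc.
        DataStream<String> words = env.fromElements("hadoop", "flink", "hadoop");

        // Count occurrences of each word -- a minimal stateful computation
        words.map(new MapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(String word) {
                        return Tuple2.of(word, 1);
                    }
                })
                .keyBy(t -> t.f0)
                .sum(1)
                .print();

        env.execute("word-count-example");
    }
}
```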
