NiFi overview

Features

NiFi is a platform created to automate the management of data flows between systems.

The main functionality of NiFi is listed below:

  • Data flows are managed using dedicated processors, each of which has its own functional purpose. To ensure data availability, guaranteed delivery, and resilience across restarts and unexpected system failures, the following functions are supported:

    • Use of a Write-Ahead Log (WAL) that records every change to a FlowFile as a transactional unit: successful metadata changes are written to the FlowFile repository, and in case of a system failure the state is restored from the log.

    • The ability to flexibly configure data buffering in a queue using backpressure when the queue reaches configured thresholds (see the configuration sketch after this list).

    • The ability to configure prioritization schemes that determine the order in which data is retrieved from a queue.

  • Features that make flow management easy:

    • The NiFi user interface provides the ability to visually create and manage data flows in real time. New components can be added and existing ones changed without stopping the whole flow.

    • Flow templates allow you to save flow designs and share them.

    • The Data Provenance repository stores information about the origin of the data and all operations applied to it as it passes through the system. This allows you to audit the system for compliance with specified requirements, as well as to optimize and troubleshoot flows.

    • The ability to view content stored in the Content Repository at any point in the flow processing cycle.

  • Features that ensure system security:

    • Support for encrypted protocols (two-way SSL) when transferring data between systems.

    • NiFi supports SSL authentication and provides pluggable authorization in the user interface or via Apache Ranger. Multi-tenant authorization can also be configured, allowing different groups of users to be granted access to different parts of the data flow with different levels of authorization.

  • Extension options:

    • NiFi provides the ability to connect extensions through the following points: processors, controller services, reporting tasks, prioritizers, and client user interfaces.

    • A limited set of dependencies is needed to create extensions, and NiFi uses its own class loader model, which isolates the dependencies of different extensions from each other.

    • The ability to connect NiFi instances using the Site-to-Site (S2S) communication protocol, which provides secure and efficient data transfer, as well as client libraries for sending data to and receiving data from NiFi.

  • Scaling options:

    • NiFi is designed to scale out by clustering multiple nodes. This allows you to configure load balancing as well as failover between nodes.

    • Throughput can be scaled up or throttled by changing the number of concurrent tasks on the SCHEDULING tab when configuring a processor.
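
As a sketch of the backpressure thresholds mentioned above: in recent NiFi versions (1.8 and later), nifi.properties defines the default thresholds applied to newly created connections; per-connection values can still be changed in the UI. The values below are the stock defaults:

  # Default backpressure thresholds for newly created connections
  nifi.queue.backpressure.count=10000
  nifi.queue.backpressure.size=1 GB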

Architecture

Core components of a NiFi node

NiFi runs inside a JVM on the host operating system.
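
The JVM memory limits for NiFi are usually set in the bootstrap.conf file. A minimal sketch with the stock default heap settings (the argument numbering follows the stock bootstrap.conf; tune the values for your workload):

  # JVM heap settings in conf/bootstrap.conf (stock defaults)
  java.arg.2=-Xms512m
  java.arg.3=-Xmx512m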

The components of a NiFi node are shown in the figure below.

NiFi node

The NiFi node contains the following components:

  • Web Server is a server that accepts HTTP requests from clients to manage NiFi flows and returns the result either visually, as flow changes rendered in the user interface, or as HTTP responses.

  • FlowFile is a NiFi object representing a unit of data moving through the system. For each FlowFile, NiFi keeps track of its attributes (key/value pairs) and the content associated with it.

  • Processor is a NiFi component that performs a specific data processing action on FlowFiles, for example, retrieving or publishing data. A processor can operate on one or more FlowFiles in a single task and has access to the attributes and content of each FlowFile it processes.

  • Extension is a library that extends or changes the functionality of NiFi and is loaded at the corresponding extension point. Extensions in NiFi run inside the JVM. Extension points include processors, controller services, reporting tasks, prioritizers, and client user interfaces.

  • Flow Controller is a NiFi component that manages the exchange of FlowFiles between processors and controls the scheduling and execution of extensions. The Flow Controller manages the pool of threads that these components use and can be configured to allocate a specific number of threads.

  • NiFi repositories store the data of running NiFi flows. Repositories are directories in the NiFi node's local storage (stock locations are shown in the sketch after this list):

    • FlowFile Repository is a repository that contains metadata for all FlowFiles currently in a flow and tracks each FlowFile's path through the system. The repository implementation is pluggable; by default, a Write-Ahead Log is used to store the data. The path to the repository on disk is specified by the nifi.flowfile.repository.directory parameter.

    • Content Repository is a repository that contains the actual content bytes of each FlowFile. The repository implementation is pluggable; by default, blocks of data are stored on the file system. The path to the repository on disk is specified by the nifi.content.repository.directory parameter. You can specify more than one file system storage location.

    • Provenance Repository is a repository that contains data about all provenance events, i.e., the history of each FlowFile. Event data is indexed and searchable. The repository implementation is pluggable. The path to the repository on disk is specified by the nifi.provenance.repository.directory parameter. You can specify one or more physical disk volumes.
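
A sketch of how the repository locations above look in nifi.properties with the stock relative paths. The content and provenance directories carry a .default suffix, which allows additional named locations to be declared (the second content location below is a hypothetical example):

  nifi.flowfile.repository.directory=./flowfile_repository
  nifi.content.repository.directory.default=./content_repository
  # Hypothetical additional content storage location on another disk
  nifi.content.repository.directory.disk2=/mnt/disk2/content_repository
  nifi.provenance.repository.directory.default=./provenance_repository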

NiFi cluster architecture

Starting with version 1.0, NiFi uses the Zero-Leader Clustering paradigm, in which there is no single system leader, and accordingly, there is no single point of failure.

ZooKeeper is used to coordinate the operation of NiFi nodes in a distributed system (a cluster); it ensures the availability of flow data in case of node failures or maintenance.

NOTE
  • The concepts and main components of ZooKeeper are described in the ZooKeeper article.

  • The mechanism of the leader election using ZooKeeper is described in the Leader election in ZooKeeper article.

The figure below shows how the NiFi nodes interact with each other and with ZooKeeper in the cluster.

NiFi cluster

When NiFi nodes start up as part of a cluster, each one registers itself in ZooKeeper — creates an ephemeral ZooKeeper node (znode) that specifies a node ID, hostname, and port. ZooKeeper monitors node availability.
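
These znodes are created under a configurable root path in ZooKeeper; the stock default is shown below:

  # Root znode under which NiFi keeps its cluster state in ZooKeeper
  nifi.zookeeper.root.node=/nifi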

Changes to the NiFi flow can be made by the DataFlow Manager (DFM) via the NiFi user interface or by connecting to the web server of a NiFi node (for example, using the NiFi REST API). The DFM can use any node in the cluster; every change made is replicated to all connected nodes.

ZooKeeper, using the ZAB consensus protocol (ZooKeeper Atomic Broadcast), elects the key cluster roles:

  • Cluster Coordinator is a node that performs the following operations in the NiFi cluster:

    • Processes heartbeat signals, through which cluster nodes report their status, and disconnects nodes from which no heartbeat has been received within the allowed time; the heartbeat interval is set by the nifi.cluster.protocol.heartbeat.interval parameter (default 5 sec; see the configuration sketch after this list).

    • Provides the current flow to newly joining nodes (whether a node is allowed to join is decided based on the firewall file configured in the nifi.cluster.firewall.file parameter).

    • Determines the correct version of the flow when a NiFi cluster starts, using the leader election mechanism — voting on the flows of the nodes. The flows of the nodes are compared, and if the flows of two nodes match, a vote is cast for that flow. Flow elections are controlled by the nifi.cluster.flow.election.max.candidates and nifi.cluster.flow.election.max.wait.time parameters. On a node whose flow conflicts with that of the "election winner", a backup copy of the flow configuration file flow.xml.gz is created (the file location is defined by the nifi.flow.configuration.file parameter), and the file itself is replaced with the elected flow.

    • Coordinates flow changes across all nodes. Web requests received by any node are forwarded to the coordinator, which then replicates the request to all nodes in the cluster.

  • Primary Node is the node on which you can run isolated processors — processors that work on only one cluster node. By default, the same data flow runs on all nodes of a NiFi cluster. When a processor reads files from a remote directory, it may be necessary to restrict that processing to a single node to avoid duplicating data. Processor isolation is enabled using the Execution parameter on the SCHEDULING tab when configuring the processor.
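
A minimal sketch of the cluster settings discussed in this list, as they appear in nifi.properties (hostnames, ports, and the firewall file path are placeholder values):

  # Identify this instance as a cluster node
  nifi.cluster.is.node=true
  nifi.cluster.node.address=nifi-node1.example.com
  nifi.cluster.node.protocol.port=11443
  # Interval at which nodes send heartbeats to the cluster coordinator
  nifi.cluster.protocol.heartbeat.interval=5 sec
  # Flow election settings
  nifi.cluster.flow.election.max.candidates=3
  nifi.cluster.flow.election.max.wait.time=5 mins
  # Optional firewall file restricting which hosts may join the cluster
  nifi.cluster.firewall.file=./conf/cluster-firewall.txt
  # ZooKeeper ensemble used for coordination and leader election
  nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181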

The cluster coordinator and primary node are displayed on the Cluster page of the global UI menu.

ZooKeeper monitors the availability of nodes and automatically elects a new cluster coordinator and primary node if the previous ones become unavailable. All nodes also obtain the ID of the current cluster coordinator from ZooKeeper in order to send it heartbeat signals.

Configure NiFi

After adding and installing the NiFi service as part of an ADS cluster, you can configure NiFi parameters on the configuration page of the NiFi service in ADCM.

To configure backpressure parameters, expand the Main node in the configuration settings tree and enter new values for the parameters.

To change storage locations for NiFi repositories, expand the Directories node in the configuration settings tree and enter new values for the parameters.

To configure the NiFi cluster operating parameters, set the Show advanced switch to active, expand the nifi.properties node, and enter new values for the parameters. To change NiFi parameters that are not available in the ADCM interface, use the Add key,value field in the nifi.properties node: select Add property and enter the name of the parameter and its value.
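
For example, to change the node connection timeout, which is a standard nifi.properties parameter, you would enter the following name/value pair (the value shown is the stock default):

  nifi.cluster.node.connection.timeout=5 secs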
