Glossary

ACL

Access control list — it defines the users or groups that can access a particular object, and the operations they are allowed or forbidden to perform on it.

AD

Active Directory -- a directory service for Windows Server family operating systems. It was initially created as an LDAP-compatible implementation of a directory service. However, starting with Windows Server 2008, it includes integration capabilities with other authorization services, performing an integrating and unifying role for them.

It allows administrators to apply group policies to ensure a consistent configuration of the user work environment, deploy software on multiple computers through group policies or System Center Configuration Manager (formerly Microsoft Systems Management Server), and install operating system, application, and server software updates on all computers in the network using Windows Server Update Services. It stores data and environment settings in a centralized database. Active Directory networks can vary in size from several dozen to several million objects.

Apache Maven

A build automation, project management, and analysis tool for software projects, used primarily with Java.

API

Application programming interface — a set of ready-made classes, procedures, functions, structures, and constants, provided by an application (library, service) or operating system for use in external software products.

Broker

The Kafka server where partitions are stored.

Basic authentication

An HTTP authentication scheme in which user credentials are transmitted as Base64-encoded ID/password pairs.
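
As a minimal sketch of the encoding described above, the following Python snippet builds the value of an HTTP Authorization header. The function name basic_auth_header and the sample credentials are illustrative assumptions; the colon separator between the ID and the password follows the Basic scheme as standardized in RFC 7617.

```python
import base64


def basic_auth_header(user_id: str, password: str) -> str:
    """Return the Authorization header value for the given credentials."""
    # The ID and password are joined with a colon, then Base64-encoded.
    credentials = f"{user_id}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# Sample credentials (assumed for illustration only).
print(basic_auth_header("Aladdin", "open sesame"))
```

Base64 is an encoding, not encryption: anyone who intercepts the header can decode the credentials, so the scheme is normally used over an encrypted connection.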

Commit log

An ordered, append-only structure of messages. No data can be modified or deleted from a commit log. Large logs are divided into partitions. A log that is not divided is itself a single partition.

Connector

A class whose instance manages the integration of Kafka Connect with another system. Source connectors pull data from other systems into Kafka, and sink connectors push data from Kafka to other storage systems, ensuring a seamless data flow.

Consensus algorithm

A set of principles and rules by which all nodes participating in a cluster automatically come to a consensus on the current state of the network. Kafka can use one of two algorithms: Raft or ZAB.

Consumer

A client application that subscribes to topics, then reads and processes their messages.

Consumer group

A group of consumers that are joined together to read data from topics. The group a consumer belongs to is specified with the group.id parameter in the consumer’s configuration file.
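
For illustration, the text of a consumer configuration that sets group.id might be generated as follows. This is a sketch only: the helper render_properties, the group name billing-service, and the broker address localhost:9092 are assumptions, not values taken from this documentation.

```python
def render_properties(settings: dict) -> str:
    """Render settings as the text of a Java-style .properties file."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


consumer_settings = {
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "billing-service",          # assumed consumer group name
}

print(render_properties(consumer_settings))
```

Consumers started with the same group.id value belong to the same consumer group.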

Controller

A Kafka broker that has a controlling role in the data distribution processes within Kafka. The function of the Kafka controller and the mechanism for choosing it vary depending on the consensus algorithm used in the cluster:

  • ZAB algorithm — a ZooKeeper quorum is used to manage cluster metadata. The controller is responsible for selecting leader replicas for partitions and transmits data about new ISRs to the quorum. The broker that starts first becomes the controller.

  • Raft algorithm — a quorum of KRaft controllers is used to manage cluster metadata. Servers selected as controllers participate in the metadata quorum. Each controller can be either active, recording metadata, or a metadata replica.

CLI

Command-line interface — a kind of text user interface (TUI) in which instructions are given to the computer mainly by typing text strings (commands) on the keyboard. It is also known as a console or terminal.

Cluster

A group of servers and coordinating software, united logically, capable of processing identical requests, and used as a single resource.

Debezium

An open-source distributed platform that operates on the Change Data Capture (CDC) pattern. The Debezium connector collects changes in databases and transmits them to external applications. Debezium is used by external applications to record and process data changes in the DBMS: insert, update, and delete events.

Deserialization

Recovering a data structure from a byte stream.

DNS

Domain name system — a distributed, hierarchical system used to identify computers, services, and other resources reachable through the Internet or other Internet Protocol networks. It is most often used to resolve the name of a host (a computer or device) to an IP address, or to obtain information about mail routing, service nodes, and so on.

The distributed DNS database is maintained by a hierarchy of DNS servers that interact over a dedicated protocol.

FlowFile

A NiFi object that represents a packet of information moving through the system. For each packet, NiFi tracks attributes in the form of key/value pairs and the content associated with the packet.

Firewall

A software package used to protect a computer from unauthorized access, as well as from viruses and Trojans. It increases network security by filtering information packets, which blocks many attacks on the computer.

Follower

One of the partition replicas that is a backup for the partition leader. A follower is an ISR replica if it is a full copy of the leader.

FreeIPA

A free and open-source identity management system for Linux/UNIX networked environments. It is based on Fedora Linux, 389 Directory Server, MIT Kerberos, NTP, DNS, the DogTag certificate system, SSSD, and other free/open-source components. FreeIPA is designed to provide the same services as Active Directory.

FQDN

Fully qualified domain name — a domain name that has no ambiguities in its definition. It includes the names of all the parent domains in the DNS hierarchy.

HBase

A non-relational, distributed, open-source database written in Java. It is an analogue of Google Bigtable.

It is developed as part of the Apache Software Foundation Hadoop project. It runs on top of HDFS and provides Bigtable-like capabilities for Hadoop, that is, a fault-tolerant way to store large volumes of sparse data.

HDFS

Hadoop Distributed File System — a file system designed to store large files, distributed block by block among the nodes of a computing cluster. All blocks in HDFS (except the last block of a file) have the same size, and each block can be hosted on multiple nodes. The block size and replication factor (the number of nodes on which each block should be located) are set at the file level. Due to replication, the distributed system is resistant to failures of individual nodes.
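
The block arithmetic can be sketched in Python. The 128 MiB block size, the replication factor of 3, and the 300 MiB file are assumed example values, and hdfs_block_layout is a hypothetical helper, not part of any Hadoop API.

```python
MIB = 1024 * 1024


def hdfs_block_layout(file_size: int, block_size: int, replication: int) -> dict:
    """Split a file into equal-size blocks; only the last one may be smaller."""
    full_blocks, remainder = divmod(file_size, block_size)
    block_sizes = [block_size] * full_blocks
    if remainder:
        block_sizes.append(remainder)  # the last, shorter block
    return {
        "block_sizes": block_sizes,
        "block_count": len(block_sizes),
        # Each block is stored on `replication` nodes.
        "stored_replicas": len(block_sizes) * replication,
    }


layout = hdfs_block_layout(300 * MIB, 128 * MIB, 3)
print(layout["block_count"], layout["stored_replicas"])  # prints: 3 9
```

Under these assumptions the file occupies two full 128 MiB blocks and one 44 MiB block, and the cluster stores nine block replicas in total.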

Heap

A dynamically allocated memory area, created at the start of the JVM.

Host

A computer or other device connected to a computer network. A host may work as a server offering information resources, services, and applications to users or other hosts on the network. Hosts are assigned at least one network address.

Instance

A single copy of any software, running on a single physical or virtual server. In object-oriented programming, the term also denotes an object of a class.

IP

Internet protocol address — a unique network address of a node in a computer network, built on the basis of the TCP/IP protocol stack. The IP address consists of two parts: a network number and a node number.

ISR

In-sync replica — a replica that is a complete copy of the leader’s log, has the same offsets, and messages in the same order. Such a replica is a candidate for partition leader.

JDBC

Java Database Connectivity — an industry standard for interaction between Java applications and various DBMSs, implemented as the java.sql package, which is part of Java SE.

JMX

Java management extensions — a Java technology designed for monitoring and managing applications, system objects, devices (for example, printers), and computer networks.

Kafka Streams

A client library for creating Java and Scala applications where input and output data is stored in Kafka clusters. Developed using Apache Maven.

Kerberos KDC

Key distribution center — a third-party authentication mechanism that is used by users and services to authenticate each other.

It consists of three parts:

  • A database of users and services (known as principals) that the KDC has access to, and the corresponding Kerberos passwords.

  • Authentication server (AS) that performs the initial authentication and issues a ticket-granting ticket (TGT).

  • Ticket-granting server (TGS) — a server that issues subsequent tickets based on the initial TGT.

Kerberos authentication server

An authentication server whose main function is to receive a request containing the name of a client requesting authentication and return an encrypted ticket-granting ticket (TGT) to the client. Then, the user can use this TGT for further requests. In most Kerberos implementations, the TGT lifetime is 8-10 hours. After that, the client should request a TGT from the authentication server again.

Kerberos keytab

A file containing one or more principals and their keys. It is used for authentication in the Kerberos infrastructure and removes the need to enter usernames and passwords manually.

Kerberos principal

A unique name of a user or service.

Kerberos realm

A Kerberos network that includes a KDC and several clients.

Kerberos TGS

Ticket-granting server — a server that issues subsequent tickets to clients based on their initial TGT.

Kerberos TGT

Ticket-granting ticket — includes a copy of the session key, user name, and ticket expiration time. TGT is encrypted using the master key of the KDC and can only be decrypted by the KDC service itself.

KRaft

A mode of Kafka operation that uses a quorum of controllers and leverages the Raft event-based consensus protocol.

LDAP

Lightweight directory access protocol — a simple protocol that uses TCP/IP and allows authentication, search and compare operations, as well as operations for adding, modifying, or deleting records.

Leader

One of the replicas through which messages are written. The leader for each partition is selected using a quorum in accordance with the consensus algorithm protocol adopted in the cluster (ZAB or Raft). Partition leaders are evenly distributed between brokers. Each broker can be a leader for one partition and a follower for another.

Metadata

Structured service information about the data in use. In Kafka, it contains data about the brokers that store partitions and about partition leaders.

Message

A record of an event that has occurred, including the state of an object, the value of a physical quantity, or any other parameter that requires tracking, storage, or transmission to another system.

Message delivery guarantees

Delivery semantics that exist in Kafka:

  • At most once — messages will be processed once or not at all (lost).

  • At least once — messages will be processed at least once.

  • Exactly once — each message will be processed once and only once.
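
The difference between the first two guarantees can be illustrated with a toy simulation. This is not Kafka client code: the function consume, the in-memory message list, and the single-crash model are assumptions made for the sketch. The only thing that varies is whether the read position is committed before or after a message is processed. Exactly-once delivery is not modelled here.

```python
def consume(messages, commit_first, crash_at):
    """Simulate a consumer that crashes once while handling messages[crash_at].

    commit_first=True  -> commit the offset, then process (at most once).
    commit_first=False -> process, then commit the offset (at least once).
    Returns the list of messages that were actually processed.
    """
    processed = []
    committed = 0      # offset the consumer resumes from after a restart
    crashed = False
    position = 0
    while position < len(messages):
        if commit_first:
            committed = position + 1
            if position == crash_at and not crashed:
                crashed = True
                position = committed   # restart: the message is skipped
                continue
            processed.append(messages[position])
        else:
            processed.append(messages[position])
            if position == crash_at and not crashed:
                crashed = True
                position = committed   # restart: the message is read again
                continue
            committed = position + 1
        position += 1
    return processed


print(consume(["a", "b", "c"], commit_first=True, crash_at=1))   # ['a', 'c']
print(consume(["a", "b", "c"], commit_first=False, crash_at=1))  # ['a', 'b', 'b', 'c']
```

In the first run message b is lost, in the second it is processed twice.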

MTBF

Mean time between failures — the average time from the moment the system is restored to a functional state after a failure until the next failure occurs.
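
As a worked example, MTBF can be computed as the arithmetic mean of the observed periods of operation between a restoration and the next failure. The function name and the three sample periods below are assumptions chosen for illustration.

```python
def mean_time_between_failures(uptimes_hours):
    """Average the operating periods measured from restoration to next failure."""
    if not uptimes_hours:
        raise ValueError("at least one operating period is required")
    return sum(uptimes_hours) / len(uptimes_hours)


# Three assumed operating periods, in hours.
print(mean_time_between_failures([100.0, 150.0, 200.0]))  # prints: 150.0
```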

MySQL

An open-source relational database management system.

Node

A device connected to other devices via a network. It has its own IP address and is able to exchange data. Nodes can be computers, mobile phones, handheld computers, or special network devices such as routers, switches, and hubs.

NTP

Network time protocol — a network protocol for synchronizing the internal computer clock, using networks with variable latency.

OpenJDK

A fully compatible Java Development Kit project, consisting exclusively of free and open source code.

Partition

An ordered, immutable sequence of records, the contents of a commit log.

PostgreSQL

A free object-relational database management system.

Postgres

A superuser in PostgreSQL who has all rights in all databases, including the right to create other users. Global rights can be changed at any time by the current superuser.

Processor

A NiFi component that performs actions related to FlowFile data processing, such as retrieving or publishing data, as declared in its function. Processors may use one or more FlowFiles. Processors have access to the attributes of a given FlowFile and its contents.

Producer

A client application that publishes (writes) messages to Kafka.

Quorum

A set of specialized nodes that are responsible for storing and updating the cluster metadata in a timely manner. Kafka can use one of two types of quorum:

  • KRaft quorum — a quorum of controllers, responsible for servicing only the internal topic __cluster_metadata with Kafka metadata when using the Raft algorithm.

  • ZooKeeper quorum — a quorum of ZooKeeper nodes, responsible for storing Kafka metadata and electing a Kafka controller when using the ZAB algorithm.

RAID

Redundant array of independent disks — a data virtualization technology that combines several disks into a single logical unit for redundancy and performance enhancement.

Replica

A copy of a Kafka topic partition. Replicas can also refer to Kafka brokers in a cluster that contain identical partitions.

Replication

A mechanism that creates and distributes exact copies of each partition of a topic in Kafka to brokers. Replication ensures message availability in the event of failures or maintenance. The replication factor is a parameter that determines the number of copies (replicas) of each partition.
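
The effect of the replication factor can be sketched with a simple round-robin placement. This is a deliberately simplified model and an assumption of this example, not the placement algorithm Kafka itself uses; assign_replicas and the broker IDs are hypothetical.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on consecutive brokers, round-robin."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for partition in range(num_partitions):
        assignment[partition] = [
            brokers[(partition + offset) % len(brokers)]
            for offset in range(replication_factor)
        ]
    return assignment


# Three partitions, three brokers, two copies of every partition.
for partition, replicas in assign_replicas(3, [1, 2, 3], 2).items():
    print(partition, replicas)
```

With these inputs every partition has two replicas on different brokers, so the loss of any single broker still leaves one copy of each partition available.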

REST

Representational state transfer — an architectural style of interaction between components of a distributed application on the network via the HTTP protocol. It is a consistent set of constraints that are applied when designing a distributed hypermedia system.

Root

Superuser — a special account in Unix-like systems whose owner has the right to perform all operations without exception.

S2S

Site-to-Site (Server-to-server) — a protocol that describes the direct exchange of data or requests between two or more servers on a network.

S3

Simple storage service — a cloud object storage system originally created as part of Amazon Web Services.

SASL PLAINTEXT

A simple username/password authentication mechanism that is typically used with TLS for encryption to implement secure authentication. Key concepts:

  • SASL is a framework for authentication and data security used in Internet protocols.

  • PLAIN is a simple mechanism for transmitting passwords in clear text.

  • PLAINTEXT is an authenticator that is configured to support the PLAIN authentication mechanism.

Script

A brief description of the actions to be performed by the system. The line between programs and scripts is blurry: a script is a program that works with ready-made software components.

In a narrower sense, a scripting language can be understood as a specialized language for expanding the capabilities of a command shell, a text editor or operating system administration tools.

Self-signed certificate

A special type of certificate that is signed by its own subject. Technically, it is no different from a certificate signed by a certificate authority (CA), except that the creator signs it themselves. In this case, the creator of the certificate also acts as the certificate authority. All root certificates of trusted CAs are self-signed.

Segment

A part of a partition stored in a separate file on the disk.

Serialization

The process of converting data from a structured, human-readable form into a byte array in a binary format for machine processing. Schema Registry supports the following serialization formats: JSON, AVRO, PROTOBUF.
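
A minimal round trip through serialization and deserialization might look like this. The sketch uses JSON via Python's standard json module only; it does not involve Schema Registry, and the function names and the sample record are assumptions.

```python
import json


def serialize(record: dict) -> bytes:
    """Turn a structured record into a byte array (UTF-8 encoded JSON)."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def deserialize(payload: bytes) -> dict:
    """Recover the original data structure from the byte array."""
    return json.loads(payload.decode("utf-8"))


record = {"sensor": "t-17", "temperature": 21.5}  # assumed sample record
payload = serialize(record)
print(payload)
print(deserialize(payload) == record)  # prints: True
```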

Schema

The structure of the serialized data format. The schema describes the data written to a topic and the type of information it contains. This information links producers and consumers. Each format has its own requirements; for example, a schema for the AVRO format must comply with the AVRO format requirements.

Source code

A text of a computer program in any programming or markup language that can be read by a human. In a generalized sense — any input data for the translator.

Snapshot

A copy of files and directories of the file system (or database) at a certain point in time.

SQL

Structured query language — a declarative programming language used to create, modify, and manipulate data in a relational database managed by an underlying database management system.

SSH

Secure shell — an application layer network protocol that allows remote control of the operating system and tunneling of TCP connections (for example, for file transfer). It is similar in functionality to the Telnet and rlogin protocols, but, unlike them, encrypts all traffic, including transmitted passwords. SSH allows a choice of different encryption algorithms. SSH clients and SSH servers are available for most network operating systems.

SSL

Secure sockets layer — a cryptographic protocol that provides more secure communication. It uses asymmetric cryptography for authentication and key exchange, symmetric encryption for confidentiality, and message authentication codes for message integrity.

Sudo

Substitute user and do — a system administration program for Unix-like operating systems that delegates certain privileged resources to users while keeping a log of their actions. The main idea is to give users as few rights as possible while still allowing them to complete their tasks.

Su

Switch user — a command in Unix-like operating systems that allows a user to log in under a different name without terminating the current session. Usually, it is used to log in temporarily as the superuser to perform administrative work.

Tiered Storage

A method of assigning different categories of data to different types of storage media to reduce overall storage costs and improve the performance and availability of mission-critical applications.

Topic

The category under which messages are published to Kafka. In a topic, messages are written to the commit log.

URI

Uniform resource identifier — a unified sequence of characters, identifying an abstract or physical resource.

URL

Uniform resource locator — a URI that, in addition to identifying an abstract or physical resource, specifies where the resource is located and how to access it.
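
As an illustration of the information a URL carries, the following sketch splits one into its components with Python's standard urllib.parse module. The address itself is a made-up example.

```python
from urllib.parse import urlsplit

# A made-up URL used only for illustration.
parts = urlsplit("https://example.com:8443/docs/glossary?lang=en")

print(parts.scheme)    # access method: https
print(parts.hostname)  # where the resource is hosted: example.com
print(parts.port)      # 8443
print(parts.path)      # /docs/glossary
print(parts.query)     # lang=en
```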

WAL

Write-ahead log — a log that records every change to a FlowFile as a transactional unit, records successful metadata changes in the FlowFile repository, and is used to restore the state after a system failure.
