ADS monitoring metrics

This article describes the metrics for monitoring an ADS cluster. For information on installing monitoring, refer to the monitoring installation articles.

Metric types

Two groups of metrics are available for an ADS cluster: system metrics and service metrics.

In Graphite, system metrics are shown in the System_metrics section, service metrics in the services section.

System metrics show the general characteristics of cluster hosts, usually related to resource consumption. System metrics are shown in the table below.

System metrics
Metrics group | Description
cpu | CPU utilization
diskspace | Disk capacity
files | File statistics
iostat | Input/output operation performance
loadavg | System load averages
memory | Memory usage
netstat | Network connection statistics
network | Network interface performance
uptime | Uptime statistics

Service metrics show the characteristics of the services available to hosts in an ADS cluster.

The metrics for each service are shown in the tables below.

Kafka
Metrics group | Metric name | Description

ReplicaManager

AtMinIsrPartitionCount

Number of partitions whose number of In-Sync Replicas (ISR) equals minIsr (min.insync.replicas)

FailedIsrUpdatesPerSec

The number of failed In-Sync Replicas (ISR) updates per second.

The metric has the following attributes:

  • count;

  • fifteen minute rate;

  • five minute rate;

  • mean rate;

  • one minute rate.

IsrExpandsPerSec

In-Sync Replicas (ISR) expand rate.

The metric has the following attributes:

  • count;

  • fifteen minute rate;

  • five minute rate;

  • mean rate;

  • one minute rate.

IsrShrinksPerSec

In-Sync Replicas (ISR) shrink rate.

The metric has the following attributes:

  • count;

  • fifteen minute rate;

  • five minute rate;

  • mean rate;

  • one minute rate.

LeaderCount

Number of partition leaders

MaxLag

Maximum lag in messages between follower and leader replicas

OfflineReplicaCount

Number of offline replicas

PartitionCount

Number of partitions

ReassigningPartitions

Number of partitions that are currently being reassigned

UnderMinIsrPartitionCount

Number of partitions whose number of In-Sync Replicas (ISR) is less than minIsr

UnderReplicatedPartitions

Number of under-replicated partitions (number of ISR < number of replicas)

broker

Group metrics have the following attributes:

  • count;

  • fifteen minute rate;

  • five minute rate;

  • mean rate;

  • one minute rate.

BytesInPerSec

Number of input bytes per second

BytesOutPerSec

Number of output bytes per second

BytesRejectedPerSec

Rejected data transfer rate (bytes per second)

FailedFetchRequestsPerSec

Number of failed fetch requests per second

FailedProduceRequestsPerSec

Number of failed produce requests per second

FetchMessageConversionsPerSec

The number of fetch message conversions per second

InvalidMagicNumberRecordsPerSec

The number of message validation errors per second due to an invalid magic number. The magic byte identifies the message format version and is used for backward compatibility of the message format

InvalidMessageCrcRecordsPerSec

Errors per second when checking a message due to a bad CRC checksum

InvalidOffsetOrSequenceRecordsPerSec

Number of message validation errors per second due to a non-continuous offset or sequence number in a batch

MessagesInPerSec

Number of incoming messages per second

NoKeyCompactedTopicRecordsPerSec

Number of message validation errors per second due to no key specified for a compacted topic

ProduceMessageConversionsPerSec

Number of produce message conversions per second

ReassignmentBytesInPerSec

Number of bytes of incoming partition reassignment traffic per second

ReassignmentBytesOutPerSec

Number of bytes of outgoing partition reassignment traffic per second

ReplicationBytesInPerSec

Number of bytes received from other brokers per second

ReplicationBytesOutPerSec

Number of bytes sent to other brokers per second

TotalFetchRequestsPerSec

Number of fetch requests per second

TotalProduceRequestsPerSec

Number of produce requests per second

controller

ActiveControllerCount

Number of active controllers

ControllerState

Controller status

GlobalPartitionCount

Total number of partitions

GlobalTopicCount

Total number of topics

OfflinePartitionsCount

Number of partitions that do not have an active leader and are therefore not writable or readable

PreferredReplicaImbalanceCount

Number of partitions whose current leader is not the preferred replica

ReplicasIneligibleToDeleteCount

Number of replicas pending deletion that are currently ineligible for deletion

ReplicasToDeleteCount

Number of replicas waiting to be deleted

TopicsIneligibleToDeleteCount

Number of topics pending deletion that are currently ineligible for deletion

TopicsToDeleteCount

Number of topics waiting to be deleted

controllerStats

Group metrics have the following attributes:

  • count;

  • fifteen minute rate;

  • five minute rate;

  • mean rate;

  • one minute rate;

  • 50th, 75th, 95th, 98th, 99th, and 99.9th percentiles of the sample;

  • max;

  • mean;

  • minimum;

  • StdDev.

AutoLeaderBalanceRateAndTimeMs

The frequency and duration of automatic leader rebalancing

ControlledShutdownRateAndTimeMs

The frequency and duration of controlled shutdowns

ControllerChangeRateAndTimeMs

The frequency and duration of controller changes

ControllerShutdownRateAndTimeMs

The frequency and duration of controller shutdowns

IsrChangeRateAndTimeMs

The frequency and duration of processing In-Sync Replicas (ISR) changes by the controller

LeaderAndIsrResponseReceivedRateAndTimeMs

The frequency and duration of processing received LeaderAndIsr responses (leader and In-Sync Replicas, ISR)

LeaderElectionRateAndTimeMs

The frequency and duration of leader election

ListPartitionReassignmentRateAndTimeMs

The frequency and duration of listing partition reassignments

LogDirChangeRateAndTimeMs

The frequency and duration (CPU latency) of LogDir changes

ManualLeaderBalanceRateAndTimeMs

The frequency and duration (CPU latency) of manual leader balancing

PartitionReassignmentRateAndTimeMs

The frequency and duration (CPU latency) of partition reassignment

TopicChangeRateAndTimeMs

The frequency and duration (CPU latency) of creating new topics

TopicDeletionRateAndTimeMs

The frequency and duration (CPU latency) of deleting topics

TopicUncleanLeaderElectionEnableRateAndTimeMs

The frequency and duration (CPU latency) of processing changes to the topic-level unclean leader election setting (an unclean leader is a replica that is elected leader although it is not in the ISR)

UncleanLeaderElectionEnableRateAndTimeMs

The frequency and duration (CPU latency) of unclean leader election

UncleanLeaderElectionsPerSec

The number of unclean leader elections per second

UpdateFeaturesRateAndTimeMs

The frequency and duration (CPU latency) of feature updates

memory

HeapMemoryUsage_committed

Usage of dynamic (heap) memory — committed (in MB)

HeapMemoryUsage_init

Usage of dynamic (heap) memory — initialization (in MB)

HeapMemoryUsage_max

Usage of dynamic (heap) memory — maximum (in MB)

HeapMemoryUsage_used

Used dynamic (heap) memory (in MB)

NonHeapMemoryUsage_committed

Use of non-heap memory — committed (in MB)

NonHeapMemoryUsage_init

Non-heap memory usage — initialization (in MB)

NonHeapMemoryUsage_max

Use of non-heap memory — maximum (in MB)

NonHeapMemoryUsage_used

Used non-heap memory (in MB)

ObjectPendingFinalizationCount

Number of objects for which finalization is pending

network

NetworkProcessorAvgIdlePercent

The average percentage of time that network processor threads are idle. Values range from 0 (all resources in use) to 1 (all resources available)

RequestQueueSize

Request queue size

ResponseQueueSize

Response queue size

session

Group metrics have the following attributes:

  • count;

  • fifteen minute rate;

  • five minute rate;

  • mean rate;

  • one minute rate.

ZooKeeperAuthFailuresPerSec

The number of failed attempts to connect to the ZooKeeper cluster per second due to authentication errors

ZooKeeperDisconnectsPerSec

The number of ZooKeeper client disconnects from the cluster per second. This metric tells if the broker goes down, but does not tell if ZooKeeper is running

ZooKeeperExpiresPerSec

The number of expired ZooKeeper sessions per second. Shows system performance most accurately

ZooKeeperReadOnlyConnectsPerSec

The number of connections from a client to a server that is "Read Only" (neither "follower" nor "leader") per second. Client can only read ZooKeeper state

ZooKeeperSaslAuthenticationsPerSec

The number of client authentication successes per second

ZooKeeperSyncConnectsPerSec

The number of successful ZooKeeper client connections to the cluster per second
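The three ISR-related counters in the ReplicaManager group differ only in the comparison they apply to each partition. The following minimal sketch illustrates those comparisons as they are described in the table. The `PartitionState` type, the `count_isr_metrics` function, and the sample values are hypothetical and exist only for illustration; they are not part of Kafka or ADS, and a real broker computes these values only for the partitions it hosts.

```python
from dataclasses import dataclass


@dataclass
class PartitionState:
    """Replication state of one partition (hypothetical illustration type)."""
    replicas: int  # number of assigned replicas
    isr: int       # number of In-Sync Replicas (ISR)
    min_isr: int   # value of min.insync.replicas


def count_isr_metrics(partitions):
    """Count partitions using the comparisons given in the metric descriptions."""
    return {
        "UnderReplicatedPartitions": sum(p.isr < p.replicas for p in partitions),
        "AtMinIsrPartitionCount": sum(p.isr == p.min_isr for p in partitions),
        "UnderMinIsrPartitionCount": sum(p.isr < p.min_isr for p in partitions),
    }


# Made-up sample data, not taken from a real cluster.
partitions = [
    PartitionState(replicas=3, isr=3, min_isr=2),  # fully replicated
    PartitionState(replicas=3, isr=2, min_isr=2),  # under-replicated, at minIsr
    PartitionState(replicas=3, isr=1, min_isr=2),  # under-replicated, under minIsr
]
print(count_isr_metrics(partitions))
```

In this sample, two partitions count as under-replicated, one is exactly at minIsr, and one is below it.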

MiNiFi
Metrics group | Metric name | Description

MinifiAgentCollector/Instance

activeThreads

Number of active threads

bytesRead

Number of bytes read

bytesSent

Number of bytes sent

queuedContentSize

Queue content size

queuedCount

Number of FlowFiles in the queue

MinifiAgentCollector/SystemDiagnostics

availableProcessors

Available processors

diskUtilization

Disk usage (in %)

heapUtilization

Heap usage (in %)

loadAverage

Average load (in %)

usedHeap

Memory heap used (in MB)

usedSpaceGeneral

Disk space usage (in MB)

NiFi
Metrics group | Metric name | Description

heap

committed

Usage of heap memory — committed (in MB)

init

Usage of heap memory — initialization (in MB)

max

Usage of heap memory — maximum (in MB)

usage

Usage of heap memory (in MB)

used

Used heap memory (in MB)

non-heap

committed

Use of non-heap memory — committed (in MB)

init

Non-heap memory usage — initialization (in MB)

max

Use of non-heap memory — maximum (in MB)

usage

Use of non-heap memory (in MB)

used

Used non-heap memory (in MB)

pools → Code Cache

committed

Code-Cache memory area — committed (in MB)

init

Code-Cache memory area — initialization (in MB)

max

Code-Cache memory area — maximum (in MB)

usage

Usage of the Code-Cache memory area (in MB)

used

Used Code-Cache memory area (in MB)

pools → Compressed-Class-Space

committed

Compressed-Class-Space — committed (in MB)

init

Memory area Compressed-Class-Space — initialization (in MB)

max

Compressed-Class-Space — maximum (in MB)

usage

Usage of the Compressed-Class-Space memory area (in MB)

used

Used memory area Compressed-Class-Space (in MB)

pools → G1-Eden-Space

committed

Memory area G1-Eden-Space — committed (in MB)

init

Memory area G1-Eden-Space — initialization (in MB)

max

Memory area G1-Eden-Space — maximum (in MB)

usage

Usage of the G1-Eden-Space memory area (in MB)

used

Used memory area G1-Eden-Space (in MB)

used-after-gc

G1-Eden-Space memory area used after garbage collection (in MB)

pools → G1-Old-Gen

committed

Committed G1-Old-Gen memory area (in MB)

init

Memory area G1-Old-Gen — initialization (in MB)

max

G1-Old-Gen memory area — maximum (in MB)

usage

Usage of the G1-Old-Gen memory area (in MB)

used

Used G1-Old-Gen memory area (in MB)

used-after-gc

G1-Old-Gen memory area used after garbage collection (in MB)

pools → G1-Survivor-Space

committed

Memory area G1-Survivor-Space — committed (in MB)

init

Memory area G1-Survivor-Space — initialization (in MB)

max

G1-Survivor-Space memory area — maximum (in MB)

usage

Usage of the memory area G1-Survivor-Space (in MB)

used

Used memory area G1-Survivor-Space (in MB)

used-after-gc

G1-Survivor-Space memory area used after garbage collection (used-after-gc) (in MB)

pools → Metaspace

committed

Memory area Metaspace — committed (in MB)

init

Memory area Metaspace — initialization (in MB)

max

Memory area Metaspace — maximum (in MB)

usage

Usage of the Metaspace memory area (in MB)

used

Used Metaspace memory area (in MB)

total

committed

Committed total memory (in MB)

init

Initialization of total memory (in MB)

max

Maximum total memory (in MB)

usage

Usage of total memory (in MB)

used

Used total memory (in MB)

 — 

ActiveThreads

Number of active threads

 — 

BytesQueued

Number of bytes in the queue

 — 

BytesReadLast5Minutes

Number of bytes read in the last 5 minutes

 — 

BytesReceivedLast5Minutes

Number of bytes received in the last 5 minutes

 — 

BytesSentLast5Minutes

Number of bytes sent in the last 5 minutes

 — 

BytesWrittenLast5Minutes

Number of bytes written in the last 5 minutes

 — 

FlowFilesQueued

Number of FlowFiles in the queue

 — 

FlowFilesReceivedLast5Minutes

Number of FlowFiles received in the last 5 minutes

 — 

FlowFilesSentLast5Minutes

Number of FlowFiles sent in the last 5 minutes

 — 

TotalTaskDurationNanoSeconds

Total task duration (in nanoseconds)
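Some NiFi values are easier to read after a simple conversion. The sketch below shows two such conversions: an average per-second rate derived from a Last5Minutes counter, and a utilization percentage derived from a used/max pair. The function names and sample numbers are hypothetical, and the check for a non-positive maximum is an assumption that covers the case where a memory pool has no defined limit.

```python
WINDOW_SECONDS = 5 * 60  # the "Last5Minutes" counters cover a 5-minute window


def average_rate_per_second(total_last_5_minutes):
    """Average per-second rate implied by a Last5Minutes counter."""
    return total_last_5_minutes / WINDOW_SECONDS


def utilization_percent(used_mb, max_mb):
    """Share of the maximum in use, in percent; None if the maximum is not positive."""
    if max_mb <= 0:
        return None
    return 100.0 * used_mb / max_mb


# Made-up sample values, not taken from a real cluster.
print(average_rate_per_second(600_000_000))  # e.g. BytesSentLast5Minutes
print(utilization_percent(1536, 2048))       # e.g. heap used and heap max
```

With these sample values, the first call yields 2000000.0 bytes per second and the second yields 75.0 percent.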

View monitoring metrics

You can view monitoring metrics in the Graphite and Grafana web interfaces.

Graphite

Graphite stores numeric time series data and displays graphs of the data on demand. To view the metrics, enter the address of the monitoring cluster host in the browser address bar. Example: http://10.92.16.223. If you did not specify a different port when configuring the services of the monitoring cluster, Graphite uses the default port (80). If you set the Web-interface TCP port parameter when configuring Graphite, append the port number to the address after a colon. Example: http://10.92.16.223:8080.

In the left part of the window that opens, expand the Metrics → Arenadata → ADS node. The list displays two groups of metrics: System_metrics and services.

Monitoring metrics in Graphite
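Besides the tree in the web interface, Graphite can return the same series over HTTP through its render API. The sketch below only builds such a URL and does not send a request. The `graphite_render_url` helper is hypothetical, and the metric path `Arenadata.ADS.System_metrics.*` is an assumption derived from the tree nodes named above; the actual path segments depend on the cluster.

```python
from urllib.parse import urlencode


def graphite_render_url(host, target, port=80, period="-1h"):
    """Build a Graphite render API URL that returns the target series as JSON."""
    # Port 80 is the default, so it is omitted from the address.
    netloc = host if port == 80 else f"{host}:{port}"
    query = urlencode({"target": target, "from": period, "format": "json"})
    return f"http://{netloc}/render?{query}"


# Assumed metric path; check the actual node names in the Graphite tree.
url = graphite_render_url("10.92.16.223", "Arenadata.ADS.System_metrics.*", port=8080)
print(url)
```

The resulting URL can be opened in a browser or passed to any HTTP client; the `from` parameter limits the query to the last hour.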

Grafana

Grafana allows you to query and visualize the metrics stored in Graphite. To open the Grafana web interface, enter the address of the monitoring cluster host in the browser address bar. If you did not specify a port when configuring the services of the monitoring cluster, Grafana uses port 3000. Example: http://10.92.16.223:3000. If you set the Port parameter when configuring Grafana, append that port number to the address after a colon. Example: http://10.92.16.223:8081.

In the form that opens, click on the Home menu item. Select one of the monitoring dashboards to view the metrics.

Grafana web interface

A chart form will open.

Metrics charts available in Grafana
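Before opening the dashboards, it can be useful to confirm that Grafana answers on the expected port. The following sketch builds the base address with the default port described above and defines a check against Grafana's `/api/health` endpoint. The helper names are hypothetical, and the check assumes the endpoint is reachable from the machine that runs the script.

```python
import json
from urllib.request import urlopen


def grafana_base_url(host, port=3000):
    """Base address of the Grafana web interface (3000 is the default port)."""
    return f"http://{host}:{port}"


def grafana_is_healthy(base_url, timeout=5.0):
    """Return True if the /api/health endpoint reports a working database."""
    with urlopen(f"{base_url}/api/health", timeout=timeout) as response:
        payload = json.load(response)
    return payload.get("database") == "ok"


# Only the addresses are printed here; no request is sent.
print(grafana_base_url("10.92.16.223"))
print(grafana_base_url("10.92.16.223", port=8081))
```

Calling `grafana_is_healthy(grafana_base_url("10.92.16.223"))` sends the actual request and raises an exception if the host cannot be reached.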