Maintenance mode

Overview

ADQM supports maintenance mode for hosts, implemented in the ADCM user interface. Hosts that are temporarily unavailable or not functioning correctly can be switched to maintenance mode. This excludes them from cluster and service actions, so those actions run without errors caused by these hosts. The state of hosts in maintenance mode does not affect the state of the cluster or its services.

You can use this functionality for hardware or software maintenance, changing configuration settings, troubleshooting, decommissioning, or removing cluster nodes.

 
Specifics and limitations of maintenance mode in ADQM:

  • You cannot add components of ADQM services to a host in maintenance mode. However, if such a host is unavailable, you can still logically remove components from it.

  • If at least one host is in maintenance mode, the Install and Upgrade cluster/service actions are not available.

  • For the Zookeeper and Clickhousekeeper services, the Reconfig and restart action may fail if the leader of the ZooKeeper/ClickHouse Keeper cluster is on a host in maintenance mode.

  • Maintenance mode is also supported for actions (except Install and Upgrade) of the ADQMDB service with integrated ClickHouse Keeper.
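The rules in the list above can be summarized as a small decision model. The following is a minimal sketch, not the ADCM API: the `Host` class, the function names, and the action and component strings are hypothetical and only encode the restrictions stated in the list.

```python
# Hypothetical model of the maintenance mode rules listed above.
# Host, the action names, and the component names are illustrative
# assumptions; this is not the ADCM API.
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Host:
    name: str
    maintenance_mode: bool = False
    components: Set[str] = field(default_factory=set)


# Actions that become unavailable as soon as any host is in maintenance mode.
BLOCKED_IF_ANY_HOST_IN_MM = {"Install", "Upgrade"}


def available_actions(hosts: List[Host], actions: List[str]) -> List[str]:
    """Return the cluster/service actions that remain available."""
    any_in_mm = any(h.maintenance_mode for h in hosts)
    return [a for a in actions
            if not (any_in_mm and a in BLOCKED_IF_ANY_HOST_IN_MM)]


def can_add_component(host: Host) -> bool:
    """Components cannot be added to a host in maintenance mode."""
    return not host.maintenance_mode


def can_remove_component(host: Host, component: str) -> bool:
    """Logical removal is allowed even if the host is in maintenance mode."""
    return component in host.components


hosts = [
    Host("dev-adqm-3.ru-central1.internal"),
    Host("dev-adqm-4.ru-central1.internal", maintenance_mode=True,
         components={"Clickhouse Server"}),
]
print(available_actions(hosts, ["Install", "Check", "Upgrade", "Reconfig and restart"]))
# ['Check', 'Reconfig and restart']
print(can_add_component(hosts[1]), can_remove_component(hosts[1], "Clickhouse Server"))
# False True
```

The model is deliberately coarse: it does not capture the ZooKeeper/ClickHouse Keeper leader caveat, which depends on runtime state rather than on a static rule.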

NOTE
Maintenance mode in ADQM is currently supported for hosts only.

Examples

The examples below use an ADQM cluster deployed on four hosts with the following topology:

  • shard 1 with two replicas — dev-adqm-1.ru-central1.internal and dev-adqm-2.ru-central1.internal;

  • shard 2 with two replicas — dev-adqm-3.ru-central1.internal and dev-adqm-4.ru-central1.internal.

The ADQMDB and Zookeeper services are installed in the cluster with their components distributed across hosts as shown in the image below.

Distribution of the cluster components across hosts

For testing purposes, one replicated table is added to the cluster:

CREATE TABLE test_repl_table on cluster 'default_cluster' (id UInt64)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/test_repl_table', '{replica}')
ORDER BY id;

The table contains one data row, inserted on a shard 2 host:

INSERT INTO test_repl_table VALUES (1);
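The `{shard}` and `{replica}` macros in the engine arguments determine which hosts replicate each other: hosts whose table path in ZooKeeper is the same are replicas of one another, and each replica registers under `<table path>/replicas/<replica name>`. The sketch below illustrates this for the example topology. The macro values are assumptions (the shard macro is taken to be the shard number and the replica macro the host name); the real values come from each host's ClickHouse macros configuration.

```python
# Sketch of how the {shard} and {replica} macros in the ReplicatedMergeTree
# arguments could expand on each host of the example topology.
# The macro values are assumptions; real values come from each host's
# ClickHouse <macros> configuration.
ZK_PATH_TEMPLATE = "/clickhouse/tables/{shard}/test_repl_table"

# Example topology: shard number -> replica hosts.
TOPOLOGY = {
    "1": ["dev-adqm-1.ru-central1.internal", "dev-adqm-2.ru-central1.internal"],
    "2": ["dev-adqm-3.ru-central1.internal", "dev-adqm-4.ru-central1.internal"],
}

# Hypothetical per-host macros: the replica name is assumed to be the host name.
MACROS = {host: {"shard": shard, "replica": host}
          for shard, hosts in TOPOLOGY.items() for host in hosts}


def table_zk_path(host: str) -> str:
    """ZooKeeper path of the table as seen from `host` (after macro expansion)."""
    return ZK_PATH_TEMPLATE.format(shard=MACROS[host]["shard"])


def replica_zk_path(host: str) -> str:
    """Each replica registers itself under <table path>/replicas/<replica name>."""
    return f"{table_zk_path(host)}/replicas/{MACROS[host]['replica']}"


for host in MACROS:
    print(host, "->", table_zk_path(host))

print(replica_zk_path("dev-adqm-4.ru-central1.internal"))
# /clickhouse/tables/2/test_repl_table/replicas/dev-adqm-4.ru-central1.internal
```

Under these assumptions, dev-adqm-3 and dev-adqm-4 share the path /clickhouse/tables/2/test_repl_table, so a row inserted on either of them is replicated to the other, while the shard 1 hosts do not receive it.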

Switch a host to the maintenance mode

The following example shows how to enable the maintenance mode for an ADQMDB host to temporarily exclude it from processing.

  1. Shut down a host on which ADQMDB service components are installed, for example, the dev-adqm-4.ru-central1.internal host.

    Check the connection to the host via SSH. The Ansible [stdout] of the Check connection action shows a message that the host is unreachable.

    Result of the "Check connection" action

    Check the cluster state. The Check cluster action fails because it finds one of the hosts unreachable while checking the ADQMDB service.

    Result of the "Check" cluster action

    Message about the unavailable ADQMDB host
  2. Enable maintenance mode for the dev-adqm-4.ru-central1.internal host. To do this, click the maintenance mode icon on the Hosts tab of the cluster page.

    Enable the maintenance mode for the host

    Actions in the ADCM interface become unavailable for a host in maintenance mode.

    Host in the maintenance mode

    Now the host being completely unreachable over the network no longer causes errors in cluster actions (for example, Check) or ADQMDB service actions (Check, Start, Stop, Restart, Reconfig and restart, Manage auto core dump).

    Successful execution of the "Check" action for the cluster

    The Ansible [stdout] of the action step that checks the ADQMDB service shows a message that the host does not participate in processing because it is in maintenance mode.

    Message about the ADQMDB host in the maintenance mode
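The difference between steps 1 and 2 can be illustrated with a simplified check that fails on unreachable hosts but skips hosts in maintenance mode. This is a hypothetical analogue, not the ADCM/Ansible implementation: the `Host` class, `check_service`, and the report messages are made up, and reachability is approximated by a TCP connection to the SSH port (22 by default).

```python
# Hypothetical analogue of a service check that skips hosts in maintenance
# mode. The message texts are made up for illustration and do not reproduce
# the actual Ansible output of ADCM actions.
import socket
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Host:
    name: str
    maintenance_mode: bool = False


def tcp_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port (SSH by default) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, DNS errors
        return False


def check_service(hosts: List[Host],
                  probe: Callable[[str], bool] = tcp_reachable) -> List[str]:
    """Check every host that is not in maintenance mode.

    Raises RuntimeError if such a host is unreachable; hosts in maintenance
    mode are skipped and reported instead of failing the check.
    """
    report = []
    for host in hosts:
        if host.maintenance_mode:
            report.append(f"{host.name}: skipped (maintenance mode)")
            continue
        if not probe(host.name):
            raise RuntimeError(f"{host.name}: host is unreachable")
        report.append(f"{host.name}: ok")
    return report


# Simulate dev-adqm-4 being shut down, without touching the network.
down = {"dev-adqm-4.ru-central1.internal"}


def is_up(name: str) -> bool:
    return name not in down


hosts = [Host("dev-adqm-3.ru-central1.internal"),
         Host("dev-adqm-4.ru-central1.internal")]

try:
    check_service(hosts, is_up)  # step 1: the check fails
except RuntimeError as err:
    print("Check failed:", err)

hosts[1].maintenance_mode = True  # step 2: enable maintenance mode
print(check_service(hosts, is_up))
```

The probe is passed in as a parameter so the behavior can be demonstrated with a simulated outage; in a real environment the default `tcp_reachable` would be used.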

After the Reconfig and restart action is run for the ADQMDB service, the host in maintenance mode is removed from the cluster topology definition in the remote_servers section of the /etc/clickhouse-server/config.xml configuration file on all ADQMDB hosts except the host in maintenance mode itself. The cluster remains operational without this host: the data is available on the other hosts, and you can continue working with it.

To have ADCM actions process the ADQMDB host again, turn off maintenance mode for it and then run the Reconfig and restart action for the ADQMDB service.
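The effect of Reconfig and restart on the remote_servers section can be pictured as regenerating the cluster definition from the topology while leaving out hosts in maintenance mode. The sketch below is an illustration under assumptions: the element names follow ClickHouse's remote_servers format, but the port 9000, `internal_replication` set to true, and the function `remote_servers_xml` are hypothetical, and ADQM's actual configuration template may differ.

```python
# Sketch of regenerating the remote_servers section without hosts that are
# in maintenance mode. Element names follow ClickHouse's remote_servers
# format; the port (9000) and internal_replication=true are assumptions,
# and ADQM's actual configuration template may differ.
import xml.etree.ElementTree as ET
from typing import List, Set

# Example topology: one list of replica hosts per shard.
TOPOLOGY = [
    ["dev-adqm-1.ru-central1.internal", "dev-adqm-2.ru-central1.internal"],
    ["dev-adqm-3.ru-central1.internal", "dev-adqm-4.ru-central1.internal"],
]


def remote_servers_xml(topology: List[List[str]], maintenance: Set[str],
                       cluster: str = "default_cluster", port: int = 9000) -> str:
    root = ET.Element("remote_servers")
    cluster_el = ET.SubElement(root, cluster)
    for replicas in topology:
        active = [h for h in replicas if h not in maintenance]
        if not active:
            # This sketch does not decide how to handle a shard with no replicas left.
            raise ValueError(f"no active replicas left in shard {replicas}")
        shard = ET.SubElement(cluster_el, "shard")
        ET.SubElement(shard, "internal_replication").text = "true"
        for host in active:
            replica = ET.SubElement(shard, "replica")
            ET.SubElement(replica, "host").text = host
            ET.SubElement(replica, "port").text = str(port)
    return ET.tostring(root, encoding="unicode")


# Maintenance mode on: dev-adqm-4 disappears from shard 2.
print(remote_servers_xml(TOPOLOGY, {"dev-adqm-4.ru-central1.internal"}))
# Maintenance mode off (then Reconfig and restart): all four replicas are listed.
print(remote_servers_xml(TOPOLOGY, set()))
```

Because shard 2 still has dev-adqm-3 in the regenerated definition, queries over the cluster keep reaching the shard 2 data, which matches the statement that the cluster stays operational without the host.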

Remove a host from the cluster

If one of the ADQM hosts fails, you can remove it from the cluster. This is worth considering, for example, if replicas of its data remain in the cluster and the host itself is not critical, or if a new host will be added to replace the removed one.

In the following example, the dev-adqm-4.ru-central1.internal host with the ADQMDB service’s components installed is deleted from the cluster.

  1. Turn on the maintenance mode for the dev-adqm-4.ru-central1.internal host as described above.

  2. Remove the replica corresponding to the dev-adqm-4.ru-central1.internal host from the logical cluster topology using the Cluster Configuration setting of the ADQMDB service. Click Save and run the Reconfig and restart action for the ADQMDB service to apply the new cluster topology.

  3. Run the Add/Remove components action for the ADQMDB service and remove the Clickhouse Server and Clickhouse JDBC Bridge components from the dev-adqm-4.ru-central1.internal host.

  4. Once all components have been removed from the host, you can remove the host from the cluster: on the Hosts page, click the unlink icon.

    Remove a host from the cluster

    Confirm the action by clicking Unlink in the window that appears.
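The order of the steps above matters: each one clears a precondition for the next. The following hypothetical checklist encodes that order so it is easy to see what is still left before a host can be unlinked. It mirrors the documented steps only; `HostState` and `unlink_blockers` are illustrative names and do not reproduce the checks ADCM itself performs.

```python
# Hypothetical pre-unlink checklist that mirrors the order of the steps
# above. It does not reproduce the checks ADCM itself performs.
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class HostState:
    name: str
    maintenance_mode: bool
    in_cluster_configuration: bool  # listed as a replica in Cluster Configuration
    components: Set[str] = field(default_factory=set)


def unlink_blockers(host: HostState) -> List[str]:
    """Return the steps that still have to be done before unlinking `host`."""
    todo = []
    if not host.maintenance_mode:
        todo.append("turn on maintenance mode")
    if host.in_cluster_configuration:
        todo.append("remove the replica from Cluster Configuration, "
                    "then run Reconfig and restart")
    if host.components:
        todo.append("remove components: " + ", ".join(sorted(host.components)))
    return todo


host = HostState("dev-adqm-4.ru-central1.internal", maintenance_mode=True,
                 in_cluster_configuration=False,
                 components={"Clickhouse Server", "Clickhouse JDBC Bridge"})
print(unlink_blockers(host))
# ['remove components: Clickhouse JDBC Bridge, Clickhouse Server']
host.components.clear()
print(unlink_blockers(host))  # [] means the host can be unlinked
```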

CAUTION
When you need to delete a ZooKeeper or ClickHouse Keeper host, prepare a replacement host first, and then replace the failed host with it in a single step using the Add/Remove components action of the Zookeeper or Clickhousekeeper service, because the number of ZooKeeper/ClickHouse Keeper hosts should not be even.
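The parity requirement follows from majority quorums: both ZooKeeper and ClickHouse Keeper need a strict majority of servers to be available. The short calculation below shows why an even number of hosts adds no fault tolerance compared with the odd number just below it.

```python
# Majority-quorum arithmetic for a ZooKeeper / ClickHouse Keeper ensemble.
def quorum_size(n: int) -> int:
    """Number of servers that must agree (a strict majority of n)."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many servers can fail while a quorum is still available."""
    return n - quorum_size(n)


for n in range(1, 7):
    print(f"{n} servers: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

For example, 3 and 4 servers both tolerate one failure, and 5 and 6 both tolerate two. Dropping one host from a three-node ensemble leaves two servers, which tolerate no further failures. Swapping the failed host for its replacement in one step keeps the count odd throughout.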

Replace an ADQMDB host

The following example shows how to add a new host with the same name (dev-adqm-4.ru-central1.internal) to the cluster in place of the removed ADQMDB host and restore data replication between the new host and the existing one (dev-adqm-3.ru-central1.internal).

  1. Create a new dev-adqm-4.ru-central1.internal host in ADCM and add it to the ADQM cluster.

  2. Add this host to the cluster topology via the Cluster Configuration parameter of the ADQMDB service, specifying it as a replica of the dev-adqm-3.ru-central1.internal host in shard 2.

  3. Run the Add/Remove components action for the ADQMDB service. Add the Clickhouse Server and Clickhouse JDBC Bridge components to the new host.

  4. On the new host, create the table (without ON CLUSTER, so that the statement runs only on this host, where the table does not exist yet):

    CREATE TABLE test_repl_table (id UInt64)
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/test_repl_table', '{replica}')
    ORDER BY id;

    If a REPLICA_ALREADY_EXISTS error occurs, run the following command on the dev-adqm-3.ru-central1.internal host (the replica of dev-adqm-4.ru-central1.internal), and then create the table on the new host again:

    SYSTEM DROP REPLICA 'dev-adqm-4.ru-central1.internal';

    Make sure that the data from the dev-adqm-3.ru-central1.internal replica has been replicated to the table on the new host:

    SELECT * FROM test_repl_table;
       ┌─id─┐
    1. │  1 │
       └────┘
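A REPLICA_ALREADY_EXISTS error means that metadata registered by the old dev-adqm-4 replica is still present under the table's replicas path in ZooKeeper, because the new host reuses the same replica name. The sketch below decides whether the cleanup statement is needed and builds it. It assumes that `registered` is the list of replica names under the table's replicas node. That list could be read on the surviving replica, for example, with `SELECT name FROM system.zookeeper WHERE path = '/clickhouse/tables/2/test_repl_table/replicas'`, where the shard value 2 is an assumption about the {shard} macro. The function name and the sample list are hypothetical.

```python
# Sketch: decide whether stale replica metadata must be dropped before the
# replacement host can create its replica. `registered` is assumed to be the
# list of children of <table path>/replicas on the surviving replica; the
# sample list below is hypothetical.
from typing import Iterable, Optional


def drop_replica_statement(registered: Iterable[str], new_replica: str) -> Optional[str]:
    """Return the SYSTEM DROP REPLICA statement to run on the surviving replica,
    or None if no metadata for `new_replica` is registered."""
    if new_replica not in set(registered):
        return None
    # Escape backslashes and single quotes for a ClickHouse string literal.
    literal = new_replica.replace("\\", "\\\\").replace("'", "\\'")
    return f"SYSTEM DROP REPLICA '{literal}';"


registered = ["dev-adqm-3.ru-central1.internal", "dev-adqm-4.ru-central1.internal"]
print(drop_replica_statement(registered, "dev-adqm-4.ru-central1.internal"))
# SYSTEM DROP REPLICA 'dev-adqm-4.ru-central1.internal';
print(drop_replica_statement(registered[:1], "dev-adqm-4.ru-central1.internal"))
# None
```

The statement is run on dev-adqm-3.ru-central1.internal rather than on the new host, since ClickHouse drops only the metadata of replicas that are not active, and the old dev-adqm-4 replica no longer exists.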