Pilot cluster

This article presents a plan for a pilot ADH cluster based on the balanced workload pattern. Even a small and simple cluster needs at least three DataNodes and one NameNode. You can use one physical machine (or a virtual machine) or three different machines; the choice is yours.

For more details about workloads, see Hardware requirements depending on workload patterns.

Parameters

The pilot cluster example uses the following parameters:

  • Data volume — 500 TB.

  • Replication factor — 3.

  • Data retention period — one year.

  • Balanced workload.

  • Data formats distribution:

    • Plain text, Avro, Parquet, JSON, ORC, and so on: 20%;

    • GZIP and Snappy: 80%.

Hardware

The specification for DataNodes depends on the volume of data that is stored and analyzed.

Data volume

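With a replication factor of 3, storing 500 TB of data for one year requires $500 \times 3 = 1500$ TB of raw storage. Assume that 20% of the data is in container storage and the other 80% is in the Snappy-compressed Parquet format. Snappy-compressed Parquet reduces the data volume by approximately 70-80%. This example assumes 80%, so compressed data occupies 20% of its original volume. Below is the formula for the storage requirement calculation:

$$S_{data} = V \times R \times \left(p_{container} + p_{compressed} \times (1 - c)\right)$$

Here $V$ is the data volume, $R$ is the replication factor, $p_{container}$ and $p_{compressed}$ are the shares of data in container storage and in compressed format, and $c$ is the compression rate.

With the planned parameters, the required storage is $1500 \times (0.2 + 0.8 \times (1 - 0.8)) = 540$ TB.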

Apart from the data itself, space is also required for processing the data and for other tasks. Assume that on average only 10% of the data is processed each day and that processing creates temporary data three times the size of the processed data. This means that approximately 30% of the total data storage volume is required as extra storage.

The total storage required for data and other activities is $540 + 0.3 \times 540 = 702$ TB.

JBOD is recommended for DataNodes. An additional 20% of the data storage has to be allocated for JBOD file system needs, so the storage requirement increases by 20%. The final figure in this example is $702 \times 1.2 = 842.4$ TB, or 845 TB if rounded up.
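The storage estimate above can be sketched as a short calculation. This is a minimal illustration, not an official sizing tool: the function and parameter names are hypothetical, and it assumes that the 80% compression figure is the fraction of volume removed by compression, which is the reading that leads to the 845 TB total. Rounding up to a multiple of 5 TB is also an assumption.

```python
import math


def estimate_storage_tb(
    data_tb=500.0,            # data volume for the retention period
    replication=3,            # HDFS replication factor
    container_share=0.20,     # share of data kept in container storage
    compressed_share=0.80,    # share of data kept as Snappy-compressed Parquet
    compression_saving=0.80,  # ASSUMPTION: fraction of volume removed by compression
    temp_share=0.30,          # extra space for temporary processing data
    jbod_share=0.20,          # extra space for JBOD file system needs
):
    """Return the intermediate and final storage figures in TB."""
    replicated = data_tb * replication
    data = replicated * (
        container_share + compressed_share * (1 - compression_saving)
    )
    with_temp = data * (1 + temp_share)
    total = with_temp * (1 + jbod_share)
    return {
        "replicated": replicated,
        "data": data,
        "with_temp": with_temp,
        "total": total,
    }


def round_up_to(value, step=5):
    """Round value up to the next multiple of step (hypothetical helper)."""
    return math.ceil(value / step) * step


if __name__ == "__main__":
    figures = estimate_storage_tb()
    for name, value in figures.items():
        print(f"{name}: {value:.1f} TB")
    print(f"rounded up: {round_up_to(figures['total'])} TB")
```

With the pilot parameters, the intermediate figures are 540 TB for data, 702 TB including temporary space, and 842.4 TB including the JBOD allowance. Changing any argument shows how sensitive the total is to that assumption.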

DataNodes number

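Suppose each DataNode has a JBOD of 12 disks with a capacity of 4 TB each, that is, 48 TB per node. To calculate the number of DataNodes required for 845 TB of storage, use the following formula: $\lceil 845 / (12 \times 4) \rceil = \lceil 17.6 \rceil = 18$ DataNodes.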
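The node count is the required storage divided by the capacity of one DataNode, rounded up to a whole node. A minimal sketch, with a hypothetical function name and the disk layout from this example as defaults:

```python
import math


def datanodes_needed(total_storage_tb, disks_per_node=12, disk_tb=4):
    """Number of DataNodes needed to hold total_storage_tb of raw storage."""
    node_capacity_tb = disks_per_node * disk_tb  # 48 TB with the defaults
    # A fraction of a node still requires a whole machine, so round up.
    return math.ceil(total_storage_tb / node_capacity_tb)


if __name__ == "__main__":
    print(datanodes_needed(845))  # 845 / 48 = 17.6, so 18 nodes
```

Passing a different disk count or disk size shows how the node count changes for other hardware configurations.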

NOTE
Generally, it is not necessary to build the whole cluster at once. For example, you can start with 25% of the total nodes and scale the cluster up to 100% as the data grows.
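A phased rollout like this can be sketched as follows. The intermediate stages (50% and 75%) and the function name are illustrative assumptions. The floor of three nodes reflects the minimum number of DataNodes mentioned at the start of this article, and the target of 18 nodes corresponds to 845 TB spread over nodes with 12 disks of 4 TB each.

```python
import math


def rollout_plan(total_nodes, stages=(0.25, 0.5, 0.75, 1.0), min_nodes=3):
    """Return the DataNode count for each rollout stage.

    stages holds hypothetical fractions of the full cluster; each count is
    rounded up and never drops below min_nodes.
    """
    return [max(min_nodes, math.ceil(total_nodes * share)) for share in stages]


if __name__ == "__main__":
    print(rollout_plan(18))  # [5, 9, 14, 18]
```

Starting at 25% means beginning with 5 of the 18 DataNodes, which is about 240 TB of raw disk capacity.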

CPU and RAM

As mentioned in the recommendations, the following CPU and RAM parameters are suitable for the pilot cluster:

  • 8 CPU cores;

  • 128 GB RAM.
