As an example, let us plan a pilot Hadoop cluster. A balanced workload pattern should fit this case. Even a small and simple cluster needs at least three DataNodes and one NameNode. You can run them on a single physical or virtual machine or on three different machines; it is up to you.
For more details about workloads, see Hardware requirements depending on workload patterns.
For the pilot cluster, take the following parameters:
The volume of data is 500 TB.
The replication factor is three.
The retention period of the data is one year.
The balanced workload is used.
The data formats: 20% plain text, Avro, Parquet, JSON, ORC, and so on; 80% compressed with GZIP and Snappy.
The specification for DataNodes depends on the stored and analyzed data volume.
With a replication factor of three, we need 500 TB * 3 = 1500 TB to store the data for one year. Assume that 20% of the data is stored in container formats without compression and 80% is stored in the Snappy-compressed Parquet format. Snappy-compressed Parquet reduces the data size by approximately 70-80%; we take 80% for this calculation.
With the planned parameters, the required storage is:
1500 * 0.2 + 1500 * 0.8 * (1 - 0.8) = 540 TB
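The same arithmetic can be written as a short Python sketch. The function and parameter names here are illustrative and are not part of any Hadoop tooling; the values are the pilot parameters listed above.

```python
def stored_volume_tb(raw_tb, replication, compressed_share, compression_saving):
    """Estimate the on-disk volume in TB after replication and compression."""
    # Every block is stored `replication` times.
    replicated_tb = raw_tb * replication
    # The uncompressed share keeps its full size.
    uncompressed_tb = replicated_tb * (1 - compressed_share)
    # The compressed share shrinks by `compression_saving` (0.8 means 80% smaller).
    compressed_tb = replicated_tb * compressed_share * (1 - compression_saving)
    return uncompressed_tb + compressed_tb


stored_tb = stored_volume_tb(
    raw_tb=500,
    replication=3,
    compressed_share=0.8,
    compression_saving=0.8,
)
print(f"Required storage: {stored_tb:.0f} TB")
```

Changing `compression_saving` to 0.7 shows how sensitive the estimate is to the assumed compression ratio.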
In addition to the data itself, we need space for processing the data and for some other tasks, so we must decide how much extra space to reserve. We assume that on an average day only 10% of the data is processed and that processing creates temporary data three times the size of its input. Since 10% * 3 = 30%, we need around 30% of the total storage as extra space.
The total storage required for data and other activities is 540 + 540 * 0.3 = 702 TB.
For DataNodes, JBOD is recommended. We need to allocate 20% of the data storage to the JBOD file system needs, so the storage requirement goes up by 20%. The final figure is 702 * (1 + 0.2) = 842.4 TB, which we round up to about 845 TB of disk storage.
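The two overhead steps can be checked with a minimal Python sketch. The constant names are hypothetical; the values are taken from the text, starting from the 540 TB data volume.

```python
DATA_TB = 540                 # replicated and compressed data volume, in TB
DAILY_PROCESSED_SHARE = 0.10  # share of the data processed on an average day
TEMP_DATA_MULTIPLIER = 3      # temporary data created per unit of processed data
JBOD_FS_OVERHEAD = 0.20       # share allocated to JBOD file system needs

# Extra space for temporary data: 10% * 3 = 30% of the data volume.
extra_share = DAILY_PROCESSED_SHARE * TEMP_DATA_MULTIPLIER
with_temp_tb = DATA_TB * (1 + extra_share)

# The file system overhead is applied on top of data plus temporary space.
total_disk_tb = with_temp_tb * (1 + JBOD_FS_OVERHEAD)

print(f"Data plus temporary space: {with_temp_tb:.1f} TB")
print(f"Total disk storage: {total_disk_tb:.1f} TB")
```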
Now we need to calculate the number of DataNodes required for 845 TB of storage. Suppose each DataNode has a JBOD of 12 disks with 4 TB per disk. The capacity of one DataNode is then 12 * 4 = 48 TB.
The number of required DataNodes is 845 / 48 ≈ 17.6, rounded up to 18.
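As a sketch, the node count is a ceiling division of the total storage by the capacity of one node. The helper name `datanodes_needed` is illustrative only.

```python
import math


def datanodes_needed(total_tb, disks_per_node, disk_tb):
    """Return the number of DataNodes needed to hold total_tb of data."""
    node_capacity_tb = disks_per_node * disk_tb
    # Round up: a fraction of a node still requires a whole machine.
    return math.ceil(total_tb / node_capacity_tb)


nodes = datanodes_needed(total_tb=845, disks_per_node=12, disk_tb=4)
print(f"DataNodes required: {nodes}")
```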
NOTE: We do not need to set up the whole cluster on the first day. We can start with 25% of the total nodes and scale up to 100% as the data grows.
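A staged rollout could be sketched as follows. The intermediate 50% and 75% stages are an assumption made for illustration (the note only gives the 25% starting point and the 100% target), and the lower bound of three DataNodes comes from the minimum cluster size mentioned earlier.

```python
import math

TOTAL_DATANODES = 18                # full cluster size from the calculation above
MIN_DATANODES = 3                   # minimum for even a small cluster
STAGES = (0.25, 0.50, 0.75, 1.00)   # assumed rollout steps, not prescribed by the text


def nodes_for_stage(share):
    """Return the DataNode count for a rollout stage, never below the minimum."""
    return max(MIN_DATANODES, math.ceil(TOTAL_DATANODES * share))


rollout = [nodes_for_stage(share) for share in STAGES]
for share, count in zip(STAGES, rollout):
    print(f"{share:.0%} of the cluster: {count} DataNodes")
```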
According to our recommendations, the following CPU and RAM parameters suit the pilot cluster:
8 CPU cores;
128 GB RAM.