DataNode
The DataNode is Hadoop's primary storage and computational resource. Many specialists assume that because DataNodes play such an important role in a cluster, they deserve the best hardware available. This is not entirely true. Hadoop was designed around the idea that DataNodes are "disposable workers": fast enough to do useful work as part of the cluster, but cheap enough to be easily replaced if they fail. The frequency of hardware failures in large clusters was probably one of the most important considerations the core Hadoop developers had in mind, and Hadoop addresses it by moving redundancy from the cluster hardware into the cluster software itself.
A DataNode performs the following operations:
- writing and reading data blocks in response to client requests;
- executing MapReduce tasks;
- sending heartbeats and block reports (the states of its blocks) to the NameNode, at configurable intervals (see the sketch after this list);
- participating in the replication process.
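Both intervals are ordinary Hadoop configuration properties, normally set in hdfs-site.xml rather than in code. A minimal sketch, assuming a standard Hadoop client classpath; the values shown are the stock defaults (a heartbeat every 3 seconds, a full block report every 6 hours):

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: the DataNode heartbeat and block-report intervals as
// configuration properties. Values shown are the usual defaults.
public class DataNodeIntervals {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setLong("dfs.heartbeat.interval", 3L);                // seconds
        conf.setLong("dfs.blockreport.intervalMsec", 21_600_000L); // milliseconds (6 hours)
        System.out.println("heartbeat: " + conf.get("dfs.heartbeat.interval") + " s");
        System.out.println("block report: " + conf.get("dfs.blockreport.intervalMsec") + " ms");
    }
}
```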
Do we need RAID?
Hadoop provides redundancy at many levels. Each DataNode stores only some of the blocks of HDFS files, and those blocks are replicated to multiple nodes, so data remains accessible after a single server failure. Depending on the configuration you choose, the cluster can even tolerate the failure of multiple nodes. Hadoop goes further: it lets you specify which servers reside on which racks and tries to store copies of the data on separate racks. This significantly increases the probability that your data remains accessible even if a whole rack goes down (though this is not a strict guarantee). This design means there is no reason to invest in RAID controllers for Hadoop DataNodes.
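The replication factor is set per file at write time (the default is 3). A minimal sketch using Hadoop's FileSystem API, assuming a standard client classpath; the path is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: write a file with a replication factor of 3. HDFS then keeps
// three copies of each block on different DataNodes and, when rack
// awareness is configured, spreads them across at least two racks.
public class ReplicatedWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"), (short) 3)) {
            out.writeUTF("redundancy lives in software, not in RAID hardware");
        }
    }
}
```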
Storage options
A DataNode usually has 12-24 disks of 1 TB each. Instead of RAID for the local disks, JBOD (Just a Bunch Of Disks) is the preferred choice: it performs better for Hadoop workloads and reduces hardware costs. You don't have to worry about individual disk failures, since redundancy is provided by HDFS. In a JBOD layout, each disk is mounted separately and listed in the DataNode's configuration, as sketched below.
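A minimal sketch of how such a layout is expressed. The dfs.datanode.data.dir property is a standard one (normally set in hdfs-site.xml rather than in code); the /data/N mount points are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: with JBOD, every disk's mount point is listed in
// dfs.datanode.data.dir; the DataNode spreads new blocks across them.
public class JbodLayout {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        StringBuilder dirs = new StringBuilder();
        for (int disk = 1; disk <= 12; disk++) {    // 12 disks, one mount point each
            if (disk > 1) dirs.append(',');
            dirs.append("/data/").append(disk).append("/dfs/dn");
        }
        conf.set("dfs.datanode.data.dir", dirs.toString());
        System.out.println(conf.get("dfs.datanode.data.dir"));
    }
}
```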
The number of DataNodes you can add also depends on the NameNode's RAM: the NameNode needs roughly 1 GB of RAM per million HDFS blocks, and by default one HDFS block is 128 MB.
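A back-of-the-envelope calculation built on that rule of thumb; the 64 GB heap size is an assumption chosen for illustration:

```java
// Rough sizing estimate: ~1 GB of NameNode heap per million blocks,
// 128 MB per block (the defaults cited in the text).
public class NameNodeSizing {
    public static void main(String[] args) {
        long namenodeHeapGb = 64;                   // assumed NameNode heap size
        long blocks = namenodeHeapGb * 1_000_000L;  // ~1M blocks per GB of heap
        long blockSizeMb = 128;                     // default HDFS block size
        long addressableTb = blocks * blockSizeMb / (1024L * 1024L);
        System.out.printf("A %d GB NameNode heap tracks ~%d million blocks,%n",
                namenodeHeapGb, blocks / 1_000_000L);
        System.out.printf("or roughly %d TB of data (ignoring replication overhead).%n",
                addressableTb);
    }
}
```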
A typical deployment uses 19" racks or cabinets filled with 1U-2U servers.
Memory options
The second role of a DataNode is to serve as a data-processing node that executes custom MapReduce code. MapReduce jobs are split into many separate tasks that run in parallel on multiple DataNodes; for a job to produce logically consistent results, all of its tasks must complete.
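As an illustration of such a task, here is the canonical word-count mapper, a minimal sketch using Hadoop's org.apache.hadoop.mapreduce API. Each map task processes one input split, ideally a block stored locally on the DataNode where the task runs:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Classic word-count mapper: emits a (word, 1) pair for every token in
// the input split assigned to this task.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```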
Processors
We recommend motherboards with two CPU sockets, each holding an eight-core CPU running at 2.5-3 GHz. The Intel architecture is commonly used.
Network
Fast communication is as vital for DataNodes as it is for the NameNode, so we recommend a pair of bonded 10 Gbps connections. Bonding provides redundancy and doubles throughput to 20 Gbps. For smaller clusters (fewer than 50 nodes), you can get away with 1 Gbps connections.