NameNode

The primary NameNode and the secondary NameNode are crucial parts of any Hadoop cluster, so deploy both and make them highly available. Both servers keep the HDFS state in the fsimage file and the log of changes in the edits file.

The NameNode performs the following operations:

  • performing all operations with files in the HDFS;

  • mapping files to DataNode blocks;

  • storing metadata of HDFS files and folders;

  • storing DataNode block locations;

  • controlling data replication.

The secondary NameNode serves as reserve storage for the fsimage and edits files. It periodically merges the edits log into the fsimage file, which prevents the log from growing too large.
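The periodic merge described above can be sketched as a simple trigger check. This is an illustration rather than Hadoop's own code: the function name `should_checkpoint` is hypothetical, and the two thresholds assume the stock values of the `dfs.namenode.checkpoint.period` (3600 seconds) and `dfs.namenode.checkpoint.txns` (1,000,000 transactions) properties, which your cluster may override in hdfs-site.xml.

```python
def should_checkpoint(seconds_since_last: int,
                      uncheckpointed_txns: int,
                      period_s: int = 3600,        # assumed dfs.namenode.checkpoint.period
                      txn_limit: int = 1_000_000   # assumed dfs.namenode.checkpoint.txns
                      ) -> bool:
    """Return True when a new fsimage checkpoint is due.

    A checkpoint is due as soon as either threshold is reached:
    enough time has passed, or enough edits have accumulated.
    """
    return seconds_since_last >= period_s or uncheckpointed_txns >= txn_limit


# Ten minutes after the last checkpoint with few edits: nothing to do yet.
print(should_checkpoint(600, 20_000))
# Ten minutes after the last checkpoint, but the edits log grew quickly.
print(should_checkpoint(600, 1_500_000))
```

Lowering either threshold keeps the edits log shorter at the cost of more frequent merge work on the secondary NameNode.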

Storage options

Both NameNode servers need highly reliable storage for the namespace and the edits log. Typically, hardware RAID and reliable network storage are justifiable options.

The storage characteristics for Hadoop NameNodes are the same regardless of the number of DataNodes. Use four SAS drives of about 1 TB each with a hardware RAID controller configured for RAID 1+0. SAS drives are more expensive than SATA drives and have lower storage capacity, but they are faster and much more reliable.

Deploying your SAS drives as a RAID array ensures that the Hadoop management services have a redundant store for their mission-critical data. This gives you enough stable, fast, and redundant storage to support the management of your Hadoop cluster.
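To see what the recommended layout yields, the following minimal sketch computes the usable capacity of a RAID 1+0 array. The helper name `raid10_usable_tb` is hypothetical, and the calculation assumes two-way mirrors and ignores file system and controller overhead.

```python
def raid10_usable_tb(drive_count: int, drive_size_tb: float) -> float:
    """Usable capacity of a RAID 1+0 array built from two-way mirrors."""
    if drive_count < 4 or drive_count % 2 != 0:
        raise ValueError("RAID 1+0 needs an even number of drives, at least four")
    # Each mirrored pair stores one drive's worth of data;
    # the pairs are then striped together.
    return (drive_count // 2) * drive_size_tb


# Four drives of about 1 TB each, as recommended above.
print(raid10_usable_tb(4, 1.0))  # 2.0
```

With four 1 TB drives, half of the raw capacity (about 2 TB) is available for the namespace and the edits log, and the array survives the loss of one drive in each mirrored pair.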

Memory options

Memory requirements vary considerably depending on the scale of a Hadoop cluster. Memory is a critical factor for NameNodes, because the active and standby NameNode servers rely heavily on RAM to manage HDFS. For this reason, use error-correcting code (ECC) memory in Hadoop NameNodes. NameNodes usually require 64 to 128 GB of RAM.

The NameNode memory requirement is a direct function of the number of file blocks stored in HDFS. As a rule, the NameNode uses roughly 1 GB of RAM per million HDFS blocks.

NOTE
Remember that files are broken down into individual blocks and each block is replicated, so that you have three copies of each block rather than three copies of the whole file.
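The rule of thumb above can be turned into a rough sizing estimate. The sketch below is only an approximation under stated assumptions: the function names are hypothetical, the block size is assumed to be 128 MB (check `dfs.blocksize` on your cluster), and the rule is applied to logical blocks, with the replica count shown separately.

```python
BLOCKS_PER_GB_RAM = 1_000_000  # rule of thumb: about 1 GB of RAM per million blocks


def count_blocks(file_sizes_bytes, block_size_bytes=128 * 1024 * 1024):
    """Count HDFS blocks for the given file sizes (assumed 128 MB block size).

    Each non-empty file occupies at least one block, and the last block
    of a file may be only partially filled.
    """
    return sum((size + block_size_bytes - 1) // block_size_bytes
               for size in file_sizes_bytes if size > 0)


def namenode_ram_gb(block_count: int) -> float:
    """Estimate NameNode RAM in GB from the number of blocks."""
    return block_count / BLOCKS_PER_GB_RAM


def replica_count(block_count: int, replication: int = 3) -> int:
    """Number of block copies stored on DataNodes (three per block here)."""
    return block_count * replication


blocks = count_blocks([300 * 1024 * 1024])  # a single 300 MB file
print(blocks)                               # 3
print(replica_count(blocks))                # 9
print(namenode_ram_gb(100_000_000))         # 100.0
```

Under these assumptions, a namespace of 100 million blocks calls for about 100 GB of RAM, which is consistent with the 64 to 128 GB range mentioned earlier.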

Processors

We recommend using motherboards with two CPU sockets, each holding an eight-core processor running at 2.5-3 GHz. The Intel architecture is commonly used.

Network

Fast communication is vital for the services on NameNodes, so we recommend using a pair of bonded 10 Gbps connections. The bonded pair provides redundancy and also doubles the available throughput to 20 Gbps. For smaller clusters (fewer than 50 nodes), 1 Gbps connections can be sufficient.
