Hardware requirements depending on workload patterns

Hadoop workloads present many resource-management challenges and potential conflicts. There are different types of conflicts, for example:

  • between long-running, resource-intensive jobs and shorter interactive queries;

  • between Hadoop and non-Hadoop workloads running on the same cluster and competing for the same resources.

 
In general, there are three main types of workload:

  • Compute intensive

  • Storage intensive

  • Balanced

You may also not know your eventual workload patterns at first: the initial jobs you run on a Hadoop cluster are usually very different from the jobs that will run in your production environment. In that case, we recommend following the advice for the balanced workload in a pilot Hadoop cluster and then planning or rescaling the cluster according to the real workload.

Patterns

Same wheels

Hadoop nodes are the wheels that keep everything moving. If the wheels are identical, the ride is smooth; if they differ, the movement becomes uneven. We recommend using the same configuration on all cluster machines, with only minor differences between DataNodes and between NameNodes.

The following characteristics must be the same for all cluster machines:

  • CPU

  • RAM

  • Network

See more details in General recommendations.

Compute intensive

This workload type is CPU-bound and characterized by the need for a large number of CPU cores and a large amount of memory to store in-process data. This usage pattern is typical for natural language processing or HPCC workloads. For compute-intensive patterns, use at least 10 CPU cores per machine.

Storage intensive

For this type of workload, we recommend investing in more disks per machine. The number of machines depends on the volume of data to store and analyze and on the number of spinning disks per machine, which is usually fixed for a given cluster.

For a low-density server, the main aim is to keep the cost low enough to afford a large number of machines. 8 CPU cores meet this requirement and give reasonable processing power. Each map or reduce task utilizes a single CPU core, but since some time is spent waiting for input/output operations, it makes sense to oversubscribe the CPU cores. With 8 cores available, you can configure about 12 map and reduce slots per node.
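The following Python sketch illustrates this arithmetic only; the oversubscription factor of roughly 1.5 is an assumption used for illustration, not a value mandated by Hadoop or ADH.

import math

def estimate_task_slots(cpu_cores: int, oversubscription: float = 1.5) -> int:
    # Tasks spend part of their time waiting on disk and network I/O,
    # so it is common to configure more slots than physical cores.
    return math.floor(cpu_cores * oversubscription)

# 8 physical cores with ~1.5x oversubscription -> about 12 slots per node
print(estimate_task_slots(8))  # 12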

The optimal size of a single hard disk is 2-3 TB. Typically, you can install 12 disks in each machine.
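As a rough illustration of how the data volume translates into a node count, the Python sketch below assumes a replication factor of 3, 25% capacity headroom, and 12 x 3 TB disks per node; these figures are assumptions for illustration, so substitute your own values.

import math

def estimate_datanode_count(raw_data_tb: float,
                            replication_factor: int = 3,
                            headroom: float = 0.25,
                            disks_per_node: int = 12,
                            disk_size_tb: float = 3.0) -> int:
    # Total space needed: raw data multiplied by the replication factor,
    # plus headroom for temporary data and growth.
    required_tb = raw_data_tb * replication_factor * (1 + headroom)
    capacity_per_node_tb = disks_per_node * disk_size_tb
    return math.ceil(required_tb / capacity_per_node_tb)

# Example: 500 TB of raw data -> 53 DataNodes with the assumptions above
print(estimate_datanode_count(500))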

Balanced workload

The logic behind choosing and balancing hardware components for a high-density cluster is the same as for a lower-density one. As an example of such a configuration, you can choose a machine with 12 x 2-3 TB or 24 x 1 TB hard drives. Having lower-capacity disks per server is preferable, because it provides better input/output throughput and better fault tolerance. To increase the computational power of an individual machine, use 8 CPU cores and 128-256 GB of RAM.
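To illustrate why more, smaller disks are preferable, the Python sketch below compares two layouts of equal raw capacity (12 x 2 TB versus 24 x 1 TB). Spindle count is used here as a simple proxy for parallel I/O, which is a simplification: real throughput also depends on the disk models, controllers, and workload.

layouts = {
    "12 x 2 TB": {"disks": 12, "disk_tb": 2},
    "24 x 1 TB": {"disks": 24, "disk_tb": 1},
}

for name, cfg in layouts.items():
    raw_tb = cfg["disks"] * cfg["disk_tb"]
    print(f"{name}: {raw_tb} TB raw, {cfg['disks']} spindles")

# Both layouts hold 24 TB raw, but 24 x 1 TB offers twice as many spindles
# for concurrent I/O, and losing one disk removes a smaller share of the
# node's capacity.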

General recommendations

IMPORTANT

The following system requirements are the minimum ones. The target sizing should be calculated based on customer requirements.

The following table lists the most common hardware recommendations for an ADH cluster based on the workload pattern.

Server                 | Load pattern               | Storage                   | CPU      | RAM        | Network
DataNodes              | Balanced workload          | 12 disks, 2-3 TB, JBOD    | 8 cores  | 128-256 GB | 1 Gbps onboard, 2x10 Gbps interconnect
DataNodes              | Compute-intensive workload | 12 disks, 1-2 TB, JBOD    | 10 cores | 128-256 GB | 10 Gbps onboard, 2x10 Gbps interconnect
DataNodes              | Storage-heavy workload     | 12 disks, 4+ TB, JBOD     | 8 cores  | 128-256 GB | 10 Gbps onboard, 2x10 Gbps interconnect
NameNode               | Balanced workload          | 4+ disks, 2-3 TB, RAID 10 | 8 cores  | 128-256 GB | 10 Gbps onboard, 2x10 Gbps interconnect
Resource Manager/YARN  | Balanced workload          | 4+ disks, 2-3 TB, RAID 10 | 8 cores  | 128-256 GB | 10 Gbps onboard, 2x10 Gbps interconnect
