MapReduce overview


The MapReduce framework operates exclusively key/value pairs. Whether data is in a structured or unstructured format, the framework converts the incoming data into key/value pairs:

  • Key is a reference to the input value.

  • Value is a set of data to be processed.

Map and Reduce

The Map function produces a new list of key and value pairs:

  • An output of Map is called intermediate output.

  • Its type can be different from the input type.

  • An output of Map is stored on the local disk from where it is shuffled to nodes performing the data reduce operations.

The Reduce function takes intermediate key/value pairs from the Map function and processes them. Usually reducers compute aggregation or summation operations:

  • Input given to a reducer is generated by Map (intermediate output).

  • The key/value pairs provided to reducers are sorted by key.

The Reduce function produces a final list of key/value pairs. It is stored in HDFS and its type can be different from the input type.

In a mapper, input data is processed by a user-defined function. The required business logic is mostly implemented at the mapper level, so that all heavy processing is done by the mapper in parallel. For this reason, the number of mappers is significantly greater than the number of reducers. Mappers generate output as intermediate data that goes as input to the reducers.

This intermediate result is processed by a user-defined function working at reducer. Then the final output is generated. This final output is stored in HDFS. The replication is done as usually.

For additional information, see Quick start with MapReduce.

Found a mistake? Seleсt text and press Ctrl+Enter to report it