Quick start with MapReduce

A MapReduce application typically consists of three code parts:

  • Mapper code

  • Reducer code

  • Driver code

 
Those parts are usually based on classes with the same names:

  • Mapper. This class is responsible for the first stage of data processing. An InputSplit is the logical representation of the data: it describes a unit of work processed by a single map task in a MapReduce job. A RecordReader processes each input record and generates the corresponding key-value pair. Hadoop saves the intermediate output produced by the mappers to the local disk.

  • Reducer. This class is responsible for the second stage. The MapReduce service sends the intermediate output generated by the mappers to the reducers that process it and generate the final output to be saved in HDFS.

  • Driver. This class is responsible for setting up a MapReduce job to run in Hadoop.

In this example, we will count words in the sample.txt text file using MapReduce: we will find the unique words and the number of occurrences of each. Data processing goes through several stages, as presented in the following diagram.

MapReduce workflow

  1. We divide the input data into three splits, as shown in the diagram. This distributes the work among all the DataNodes.

  2. Each mapper tokenizes the lines of its split into words and assigns the hardcoded value 1 to every word (known as a token). The rationale behind the hardcoded value 1 is that every occurrence of a word counts as exactly one.

  3. A list of key-value pairs is created, where each key is an individual word and each value is 1. For the first line ("One Two Three") we have 3 key-value pairs: ("One", 1), ("Two", 1), and ("Three", 1). The mapping process is the same on all involved DataNodes.

  4. After the map phase, a partitioning process sorts and shuffles the intermediate data so that all pairs with the same key are sent to the same reducer.

  5. After the sorting and shuffling phase, each reducer will have a unique key and a list of values corresponding exactly to that key. For example: ("One", [1,1]), ("Zero", [1,1,1]), and so on.

  6. Each reducer sums the values present in its list of values. As shown in the diagram, a reducer gets the list of values [1,1] for the key "One". It then counts the number of ones in the list and gives the final output as ("One", 2).

  7. Finally, all the output key-value pairs are collected and written to the output file. A worked example of the whole flow is shown below.
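
To make the flow concrete, here is a hypothetical sample.txt consistent with the pairs mentioned above (the exact file contents are an assumption for illustration), together with the intermediate data each phase produces:

    sample.txt:
        One Two Three
        Zero One Zero
        Zero Two Three

    Map output:      ("One",1) ("Two",1) ("Three",1)   ("Zero",1) ("One",1) ("Zero",1)   ("Zero",1) ("Two",1) ("Three",1)
    After shuffling: ("One",[1,1]) ("Three",[1,1]) ("Two",[1,1]) ("Zero",[1,1,1])
    Reduce output:   ("One",2) ("Three",2) ("Two",2) ("Zero",3)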

Mapper code

The mapper code performs the following operations:

  1. Create a class Map that extends the Mapper class from the MapReduce library.

  2. Define the data types of the input and output keys and values in the class declaration using angle brackets. Both the input and output of the map method in the class are key-value pairs.

     
    Input:

    • The key is the byte offset of a line in the text file: LongWritable type.

    • The value is an individual line: Text type.

     
    Output:

    • The key is a tokenized word: Text type.

    • The value is the hardcoded 1 in our case: IntWritable type.

      Output example: ("Dear", 1), ("Bear", 1), and so on.

The result is Java code that breaks an input line into words and assigns each of them the hardcoded value 1.
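
A minimal sketch of such a mapper might look as follows. It uses the org.apache.hadoop.mapreduce API; the class name Map and the use of StringTokenizer are illustrative choices, not requirements:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Input: line offset (LongWritable) and line text (Text).
    // Output: individual word (Text) and the hardcoded value 1 (IntWritable).
    public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Break the line into tokens and emit (word, 1) for each of them.
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, ONE);
            }
        }
    }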

Reducer code

The reducer code performs the following operations:

  1. Create a class Reduce that extends the Reducer class, similar to what we did for the mapper.

  2. Define the data types of the input and output keys and values in the class declaration using angle brackets, as done for the mapper. Both the input and output are key-value pairs.

     
    Input:

    • The key is a unique word generated at the sorting and shuffling phase: Text type.

    • The value is the list of integers collected for that key: Iterable<IntWritable> type.

      Input example: ("Bear", [1, 1]), ("Car", [1, 1, 1]), and so on.

     
    Output:

    • The key is one of the unique words present in the input text file: Text type.

    • The value is the number of occurrences of each of the unique words: IntWritable type.

      Output example: ("Bear", 2), ("Car", 3), and so on.

The discussed code aggregates the values in the list of each key and produces the final answer. Note that the reduce method is invoked once for each unique word; the number of reducer tasks that share this work can be specified in mapred-site.xml.
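
A minimal sketch of such a reducer might look as follows (again, the class name Reduce is an illustrative choice):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Input: a unique word (Text) and the list of 1s collected for it (IntWritable).
    // Output: the word (Text) and its total number of occurrences (IntWritable).
    public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the 1s received for this key, e.g. ("Bear", [1, 1]) -> ("Bear", 2).
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }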

Driver code

The driver code performs the following operations:

  1. Set the configuration of our MapReduce job that will run in Hadoop.

  2. Specify the name of the job and the data types for input and output of the mapper and reducer.

  3. Specify the names of the mapper and reducer classes.

  4. Specify the path to the folders with input and output data.

The setInputFormatClass() method specifies how a mapper will read the input data or what will be the unit of work. Here we have chosen the TextInputFormat type, so that the mapper reads a single line at a time from the input text file.

The main() method is the entry point for the driver. In this method, we create a new Configuration object for the job.
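
Putting the pieces together, a minimal driver sketch might look as follows. The class name WordCount is an illustrative choice, and the input and output paths are assumed to be passed as command-line arguments:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCount {

        public static void main(String[] args) throws Exception {
            // Create the job configuration and give the job a name.
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);

            // Specify the mapper and reducer classes defined above
            // (assumed to be in the same package).
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);

            // Declare the output key and value data types.
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Read the input a single line at a time.
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            // Paths to the folders with input and output data.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The job can then be submitted with hadoop jar wordcount.jar WordCount <input dir> <output dir>, where the jar and directory names are placeholders.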
