Ad by Google
To get the optimal use of Map Reduce framework, you must to have to understand the internal of Map Reduce and how it works.
In my previous tutorial, I have shared a Hello world program(Word count tutorial) in Hadoop map reduce programming. And in this tutorial, I am going to explain the internals of Map Reduce and how it works.
How Map Reduce Works?
MapReduce is a framework for writing application to process the big data in parallel on large clusters of commodity hardware in reliable and fault tolerant manner. MapReduce executes on top of HDFS(Hadoop Distributed File System) in two different phases called map phase and reduce phase. Most importantly map reduce programs are parallel processing, works on key value pattern, map phase works on key-value and produce the output as another key-value and the same way reduce phase, reduce phase takes the output of map task as an input of the reduce task and produce another key-value as an output.
MapReduce processing involves lot of different activities to complete works done, let see an example in below picture to processing an input file
From the above pictorial view one can easily figure out what is happening, but the concern is when does it happens. So we have input files(on HDFS) to be process using map reduce programming.
- Step 1. When you submit a job an instance of JobSubmitter is created.
- Step 2. JobSubmitter ask the resource manager(YARN) for a new application ID used for the map reduce job-ID
- Step 3. Verify the output directory of the job, if output directory is not provided at the time of job submission or the output directory is already exist than job will not submitted and thrown an error to the map reduce program, something like output directory is already exist.
- Step 4. Based on the input file size, input format will create input splits (In picture, it's depicted as split 0,split 1, and split2). By default input split is created based on the size of block(default block size is 128 MB). Note: Default input format is Text Input format, you can create your own input format also.
- Step 5. Once input split is created, copies all the resources needed to run the job including splits to the shared file system in a directory name after the job id.
- Step 6. JobSubmitter instance will submit the job on the resource manager by calling submitApplication method.
- Step 7. Resource manager further transfer the request to the YARN scheduler to allocates a container and start the application master's(driver/main class) process on the node manager.
- Step 8. Created the map task corresponding to each input splits(in picture three map task for three input split). Also reduce task is created here by default 1 reduce task created(number of reduce task to create is configurable).
- Step 9. Map task is ready to execute but don't know how to read the records from the split, so here Record Reader comes in a picture to read one line at a time from the input split and assign to the map task.(From the picture you can see, RR exist between input split and map task).
- Step 10. Map task generates output, than Partitioner's job started to control the keys of the intermediate map output (Custom Partitioner Tutorial in MapReduce)
- Step 11. MapReduce guarantee that the output of the each mapper is sorted by key, that's why between Partitioner and Reducer Shuffe&Sort activity is processed(you can easily correlate them from the picture).
- Step 12. Reducer received the input from the mapper via shuffle and sorting phase and start reducer activity.
- Step 13. Output is generated in a specified output directory with named _SUCCESS and part-r-0000* files.
Note: The part-r-0000* file is generated as equal to the number of reducer, if three reducer is specified in the driver class than three part file will created like part-r-00000, part-r-00001 and part-r-00002
If you are new to Big data technologies and want start learning then the first step is to Install Apache Hadoop in a Single Node Cluster.