Big data technologies are getting more and more popular and very much in demand. We have already seen what big data is in my previous post, and to process that big data you need Hadoop and MapReduce. Here is a detailed description of what Hadoop is, and in this post I am going to explain what MapReduce is, using the very popular word count program as an example. For a complete Hadoop MapReduce tutorial you may follow this Hadoop MapReduce Tutorial.

What is MapReduce?
MapReduce is a programming model for processing big data in a distributed fashion (parallel processing) on large clusters of commodity hardware, in a reliable and fault-tolerant manner.

MapReduce breaks a job into two phases, a map task and a reduce task. Each task works on key-value pairs: both take key-value pairs as input and produce key-value pairs as output. The output of the map task becomes the input of the reduce task, and the key-value types are whatever your requirements call for. For more about Hadoop data types you can read this.

The map and reduce tasks are scheduled using YARN; if any task fails, it is automatically rescheduled to run again.
In the map task you provide the implementation of the map function, and in the reduce task the implementation of the reduce function.
Here is an example of a map task:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapTask extends Mapper<LongWritable, Text, Text, IntWritable> {
 @Override
 public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
  // key is the byte offset of the current line, value is the line itself
  // TODO: provide the logic of the mapper class
  context.write(new Text(), new IntWritable());
 }
}
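To make that concrete, here is a minimal sketch of how the mapper's TODO could be filled in for the word count example this post is built around: split each incoming line into words and emit (word, 1) for every word. The class name WordCountMapper is only illustrative, not part of any Hadoop API.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
 private final static IntWritable ONE = new IntWritable(1);
 private final Text word = new Text();

 @Override
 public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
  // split the current line on whitespace and emit (word, 1) for every word
  for (String token : value.toString().split("\\s+")) {
   if (!token.isEmpty()) {
    word.set(token);
    context.write(word, ONE);
   }
  }
 }
}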

And here is a reduce task:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceTask extends Reducer<Text, IntWritable, Text, IntWritable> {

 @Override
 public void reduce(Text key, Iterable<IntWritable> values, Context context)
   throws IOException, InterruptedException {
  // key is one map output key, values holds every value emitted for that key
  // TODO: provide the logic of the reducer class
  context.write(key, new IntWritable());
 }
}
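Similarly, a minimal sketch of the reducer's TODO for word count: for each word, sum up all the 1s emitted by the mappers and write out (word, total). Again, the class name WordCountReducer is only illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

 @Override
 public void reduce(Text key, Iterable<IntWritable> values, Context context)
   throws IOException, InterruptedException {
  // add up every count emitted for this word
  int sum = 0;
  for (IntWritable value : values) {
   sum += value.get();
  }
  context.write(key, new IntWritable(sum));
 }
}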
Although MapReduce is divided into the two sub-tasks, map and reduce, the complete MapReduce processing involves several more activities; the diagram below depicts all of them.
For a detailed explanation of the above diagram and how MapReduce works, you can read this.
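For completeness, here is a sketch of a driver class that wires the mapper and reducer sketches above into a job and submits it. The class name WordCountDriver and the use of command-line arguments for the input and output paths are assumptions for illustration, not something the post prescribes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
 public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "word count");

  job.setJarByClass(WordCountDriver.class);
  job.setMapperClass(WordCountMapper.class);
  job.setReducerClass(WordCountReducer.class);

  // output key-value types written by the reducer
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);

  // first argument: input path, second argument: output path (must not exist yet)
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));

  System.exit(job.waitForCompletion(true) ? 0 : 1);
 }
}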
By default, MapReduce works in a record-oriented fashion, where each line of the input file is treated as one record. Whenever you submit a job, Hadoop creates input splits based on your input file size. By default the input split size is equal to the HDFS block size, and the default HDFS block size is 128 MB.

That means if you have an input file of 10 GB (taken here as 10000 MB) to process, then:
1 block size = 128 MB
1 input split size = block size = 128 MB
10 GB = 10000 MB
10000 / 128 = 78.125, so 78 full input splits
78 input splits = 78 * 128 = 9984 MB
10000 - 9984 = 16 MB left over

For that remaining 16 MB exactly one more input split is created, so the total becomes 78 + 1.
In total, 79 input splits are created for the 10 GB input file.
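The same arithmetic in a few lines of Java, so you can plug in other file and block sizes; the numbers simply mirror the example above and are not part of any Hadoop API:

public class SplitCount {
 public static void main(String[] args) {
  long fileSizeMB = 10000;  // the 10 GB input file from the example above
  long splitSizeMB = 128;   // default HDFS block size

  long fullSplits = fileSizeMB / splitSizeMB;                 // 78 full splits
  long remainderMB = fileSizeMB % splitSizeMB;                // 16 MB left over
  long totalSplits = fullSplits + (remainderMB > 0 ? 1 : 0);  // 78 + 1 = 79

  System.out.println(totalSplits + " input splits, one mapper each");
 }
}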

Note: that is for the default block size; you can of course increase the block size to reduce the number of input splits.
Hadoop creates one mapper for each input split, i.e. a total of 79 mappers are created for the 79 input splits to process the records from the file.
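If you do want fewer, larger splits, one way to sketch it (assuming Hadoop 2.x and the new mapreduce API) is to raise the minimum split size on the job; the 256 MB figure below is just an example value, not a recommendation from this post.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
 public static void configure(Job job) {
  // force input splits of at least 256 MB, roughly halving the number of
  // mappers compared to the default 128 MB block-sized splits
  FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
 }
}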

For the word count program there is a separate tutorial; you can read this.