Most of the introductory mapreduce programs I have seen on websites and in books are wordcount. Here I am explaining how a basic mapreduce program works, without any processing logic.
By default, mapreduce accepts text input, and the java class responsible for reading it is TextInputFormat. Internally it uses a class called LineRecordReader.
The LineRecordReader reads the input split and returns it record by record. By default one record is a single line, because the text is split into records on the newline delimiter '\n'.
Each record is returned as a key-value pair, which ultimately reaches the mapper class. By default, the key is the byte offset at which the line starts in the file and the value is the line itself.
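To make the (offset, line) idea concrete, here is a small plain-Java sketch that imitates what the record reader hands to the mapper. It does not use the Hadoop API: the class OffsetDemo and its LineRecord type are made-up names for illustration, and the sketch assumes UTF-8 text with a one-byte '\n' at the end of every line.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class OffsetDemo {

    /** One (offset, line) pair, mirroring the (key, value) pair the mapper receives. */
    public static final class LineRecord {
        public final long offset;
        public final String line;

        LineRecord(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    /** Splits text on '\n' and records the byte offset at which each line starts. */
    public static List<LineRecord> readRecords(String text) {
        List<LineRecord> records = new ArrayList<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            records.add(new LineRecord(offset, line));
            // Advance past the line's bytes plus the one-byte '\n' delimiter.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "Music is an art form whose medium is sound and silence.\n"
                + "Its common elements are pitch , rhythm , dynamics, and the sonic qualities of timbre and texture.\n"
                + "The word derives from Greek.\n";
        // Prints offsets 0, 56 and 154, each followed by its line.
        for (LineRecord r : readRecords(text)) {
            System.out.println(r.offset + "\t" + r.line);
        }
    }
}
```

The offsets grow by the length of each line plus one for the newline, which is why the second line of the sample text starts at 56 and not 55.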
Here I am writing a program that is just the basic frame of a mapreduce program.
When you supply an input text to this program, you get back a set of lines and offsets, which is exactly the input that reaches the mapper.
The output will be (line, offset) rather than (offset, line), because I am swapping the key and value in the mapper class.
The program is given below.
Mapper Class
Here the mapper class is written without any processing logic. It gets its input as a key-value pair where the key is the offset and the value is the line.
It then sends its output as another key-value pair. Here I am just swapping the key and value. If you don't want that change, you can pass the pair through exactly as it came in (remember to make the corresponding changes in the mapper, the reducer and the driver's output key and value classes by changing LongWritable to Text and Text to LongWritable).
package org.amal.hadoop;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: (byte offset, line). Output: (line, byte offset).
public class MapperClass extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // No processing logic: emit the pair with key and value swapped.
        context.write(value, key);
    }
}
Reducer Class
Here also I am not adding any processing logic. The reducer just reads each key-value pair and writes it to the output.
The reducer gets its input as key, list(values). That is the reason for the Iterable and the for loop.
package org.amal.hadoop;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: (line, list of offsets). Output: one (line, offset) pair per value.
public class Reduce extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // No processing logic: write every value back out under its key.
        for (LongWritable val : values) {
            context.write(key, val);
        }
    }
}
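The key, list(values) shape is easier to see when the same line occurs more than once in the input. The following plain-Java sketch imitates the grouping that happens between the mapper and the reducer. It is not Hadoop code: GroupingDemo and the sample lines are made up for illustration, and a TreeMap stands in for the framework's sort-and-group step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupingDemo {

    /** Groups mapper output (line, offset) pairs by key, the way the reducer sees them. */
    public static Map<String, List<Long>> groupByKey(String[] lines, long[] offsets) {
        // TreeMap keeps the keys sorted, standing in for the framework's sort step.
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (int i = 0; i < lines.length; i++) {
            grouped.computeIfAbsent(lines[i], k -> new ArrayList<>()).add(offsets[i]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Hypothetical mapper output: the line "hello" appears at two offsets.
        String[] lines = {"hello", "world", "hello"};
        long[] offsets = {0L, 6L, 12L};
        for (Map.Entry<String, List<Long>> entry : groupByKey(lines, offsets).entrySet()) {
            // One entry corresponds to one reduce() call; the inner loop mirrors
            // the for loop over the values in the Reduce class.
            for (Long offset : entry.getValue()) {
                System.out.println(entry.getKey() + "\t" + offset);
            }
        }
    }
}
```

Here "hello" reaches the reducer once, with both of its offsets in the list. In a real job the order of the values within one key is not something to rely on.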
Driver Class or Runner Class
package org.amal.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class Driver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // This constructor is deprecated in Hadoop 2.x, where
        // Job.getInstance(conf, "SampleMapreduce") replaces it.
        Job job = new Job(conf, "SampleMapreduce");
        job.setJarByClass(Driver.class);

        // Key and value types written by the mapper and the reducer.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        job.setMapperClass(MapperClass.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // args[0] is the input path, args[1] is the output path (must not exist yet).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Exit with a non-zero status if the job fails.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Sample Input
Music is an art form whose medium is sound and silence.
Its common elements are pitch , rhythm , dynamics, and the sonic qualities of timbre and texture.
The word derives from Greek.
Output
Its common elements are pitch , rhythm , dynamics, and the sonic qualities of timbre and texture.	56
Music is an art form whose medium is sound and silence.	0
The word derives from Greek.	154
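So the output we got is the line and its offset (just the swapped input). 56, 0 and 154 are the offsets.
Any idea why the lines got rearranged in the output?
If you examine the input and output, you can see a change in order. This is because the mapper output is sorted by key before it reaches the reducer. The keys here are the lines themselves, and Text keys are compared byte by byte, so the line starting with "Its" comes before "Music", which comes before "The". The hash value of the key is used only to decide which reducer a key goes to, not for the ordering. That sorting is the only reason the order changed. 🙂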
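A quick way to check the ordering outside Hadoop is to sort the three sample lines by their raw bytes. The sketch below is plain Java, not the Hadoop comparator itself: KeyOrderDemo and compareUtf8 are made-up names, and the byte-by-byte comparison is written on the assumption that it matches how Text keys are ordered.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class KeyOrderDemo {

    /** Compares two strings by their UTF-8 bytes, treating each byte as unsigned. */
    public static int compareUtf8(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int diff = (x[i] & 0xff) - (y[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        // All shared bytes are equal: the shorter key sorts first.
        return x.length - y.length;
    }

    /** Returns a sorted copy of the keys, leaving the input array untouched. */
    public static String[] sortKeys(String[] keys) {
        String[] sorted = keys.clone();
        Arrays.sort(sorted, KeyOrderDemo::compareUtf8);
        return sorted;
    }

    public static void main(String[] args) {
        String[] keys = {
            "Music is an art form whose medium is sound and silence.",
            "Its common elements are pitch , rhythm , dynamics, and the sonic qualities of timbre and texture.",
            "The word derives from Greek."
        };
        // Prints the lines in the order Its..., Music..., The...
        for (String key : sortKeys(keys)) {
            System.out.println(key);
        }
    }
}
```

The sorted order matches the job output above: 'I' (byte 73) comes before 'M' (77), which comes before 'T' (84).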