A Basic Hadoop MapReduce Program

Most of the introductory MapReduce programs I have seen on websites and in books are word count examples. Here I explain how a basic MapReduce program works, without any processing logic.

By default, MapReduce reads its input as text. The Java class responsible for this is TextInputFormat, which internally uses a class called LineRecordReader.

The LineRecordReader reads the input split and returns it record by record to the mapper class. By default, one record is a single line, because the text is split into records at the newline character ('\n').

Each record reaches the mapper as a key-value pair. By default, the key is the byte offset at which the line starts in the file, and the value is the line itself.
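
To make the (offset, line) records concrete, here is a minimal plain-Java sketch that needs no Hadoop. It is not the real LineRecordReader; it only assumes UTF-8 text with '\n' line endings, and the class and method names (LineOffsetSketch, readRecords) are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class LineOffsetSketch {

    /** Splits text at '\n' and pairs each line with the byte offset where it starts. */
    public static List<Map.Entry<Long, String>> readRecords(String text) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;   // byte position of the current line's first character
        int start = 0;     // character index of the current line's first character
        while (start < text.length()) {
            int end = text.indexOf('\n', start);
            boolean hasNewline = end >= 0;
            if (!hasNewline) {
                end = text.length();
            }
            String line = text.substring(start, end);
            records.add(new AbstractMap.SimpleImmutableEntry<>(offset, line));
            // Advance by the line's byte length, plus one byte for the newline if present.
            offset += line.getBytes(StandardCharsets.UTF_8).length + (hasNewline ? 1 : 0);
            start = end + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "first line\nsecond line\nthird\n";
        for (Map.Entry<Long, String> record : readRecords(text)) {
            System.out.println(record.getKey() + "\t" + record.getValue());
        }
    }
}
```

Running readRecords on the three lines of the sample input used later in this post gives the offsets 0, 56 and 154, which are the numbers that show up in the job output.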

The program here is just the basic frame of a MapReduce program.
When you supply a text file as input, the output is a set of lines and offsets, which is exactly the input that reaches the mapper.

The output will be (line, offset) because I swap the key and value in the mapper class.
The program is given below.

Mapper Class

The mapper class is written without any processing logic. It gets its input as a key-value pair, where the key is the offset and the value is the line.

It then emits another key-value pair. Here I am just swapping the key and value. If you don't want the swap, you can pass the pair through unchanged, as (offset, line). In that case, remember to swap LongWritable and Text in the output types of the mapper, in the input and output types of the reducer, and in the output key and value classes set in the driver.


package org.amal.hadoop;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

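// Input: (byte offset, line). Output: (line, byte offset).
public class MapperClass extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // No processing logic: emit the record with key and value swapped.
        context.write(value, key);
    }
}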

Reducer Class

Here too I am not adding any processing logic. The reducer just reads each key-value pair and writes it to the output.
A reducer receives its input as (key, list of values), which is why the code uses an Iterable and a for loop.

package org.amal.hadoop;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

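// Input: (line, list of byte offsets). Output: one (line, byte offset) pair per value.
public class Reduce extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        // A key can carry several values, for example when the same line occurs more than once.
        for (LongWritable val : values) {
            context.write(key, val);
        }
    }
}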
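
The (key, list of values) shape of the reducer input can be imitated in plain Java. This is only a sketch of the idea, not how Hadoop implements it: a TreeMap stands in for the framework's sort and group step, and the class name, method name and sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceInputSketch {

    /** Groups (line, offset) pairs by line, so each line maps to the list of its offsets. */
    public static Map<String, List<Long>> groupByKey(String[] lines, long[] offsets) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (int i = 0; i < lines.length; i++) {
            grouped.computeIfAbsent(lines[i], k -> new ArrayList<>()).add(offsets[i]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Hypothetical map output in which the line "a" occurs twice in the input.
        String[] lines = {"b", "a", "a"};
        long[] offsets = {0L, 2L, 4L};
        // One "reduce call" per key, looping over that key's values as the Reduce class does.
        for (Map.Entry<String, List<Long>> entry : groupByKey(lines, offsets).entrySet()) {
            for (long offset : entry.getValue()) {
                System.out.println(entry.getKey() + "\t" + offset);
            }
        }
    }
}
```

When every input line is distinct, as in the sample input below, each key arrives with exactly one value, so the for loop in the reducer runs once per key. In a real job the order of values within one key is not guaranteed, unlike in this sketch.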

Driver Class or Runner Class


package org.amal.hadoop;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

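public class Driver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job.getInstance replaces the "new Job(conf, name)" constructor, which is deprecated in Hadoop 2 and later.
        Job job = Job.getInstance(conf, "SampleMapreduce");

        job.setJarByClass(Driver.class);
        job.setMapperClass(MapperClass.class);
        job.setReducerClass(Reduce.class);

        // Key and value types emitted by the mapper and the reducer.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // args[0] is the input path; args[1] is the output directory, which must not already exist.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Exit with a non-zero status if the job fails.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}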

Sample Input

Music is an art form whose medium is sound and silence.
Its common elements are pitch , rhythm , dynamics, and the sonic qualities of timbre and texture.
The word derives from Greek.

Output

Its common elements are pitch , rhythm , dynamics, and the sonic qualities of timbre and texture.    56
Music is an art form whose medium is sound and silence.    0
The word derives from Greek.    154

So the output is each line followed by its offset (just the swapped input). 56, 0 and 154 are the offsets.
Any idea why the lines got rearranged in the output?
If you compare the input and output, you can see a change in order. This is because the map output keys are sorted before they reach the reducer. The keys here are Text objects, which are compared byte by byte, so the lines come out in lexicographic order ("Its" < "Music" < "The"). The hash value of a key is not used for sorting; it only decides which reducer the key is sent to. 🙂
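
The ordering can be reproduced in plain Java, without Hadoop. This is a minimal sketch under two assumptions: keys are compared by their unsigned UTF-8 bytes, as Hadoop's Text does, and the reducer for a key is chosen from the key's hash in the manner of the default HashPartitioner. String.hashCode stands in for Text.hashCode here, so the partition numbers are illustrative only, and the class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class KeyOrderSketch {

    /** Compares two strings by their UTF-8 bytes, treating each byte as unsigned. */
    public static int compareBytes(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int diff = (x[i] & 0xff) - (y[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return x.length - y.length;
    }

    /** Returns a sorted copy of the keys, in the order the reducer would see them. */
    public static String[] sortKeys(String[] keys) {
        String[] sorted = keys.clone();
        Arrays.sort(sorted, KeyOrderSketch::compareBytes);
        return sorted;
    }

    /** Chooses a reducer index from the key's hash; this affects routing, not ordering. */
    public static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        String[] mapOutputKeys = {
            "Music is an art form whose medium is sound and silence.",
            "Its common elements are pitch , rhythm , dynamics, and the sonic qualities of timbre and texture.",
            "The word derives from Greek."
        };
        for (String key : sortKeys(mapOutputKeys)) {
            System.out.println("reducer " + partition(key, 1) + ": " + key);
        }
    }
}
```

With a single reducer every key lands in partition 0, so the whole output is one sorted file. With more reducers, each output file would be sorted on its own, but the files together would not form one globally sorted list.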


About amalgjose
I am an Electrical Engineer by qualification, now I am working as a Software Engineer. I am very much interested in Electrical, Electronics, Mechanical and now in Software fields. I like exploring things in these fields. I like travelling, long drives and very much addicted to music.
