Hello World is the trial program for almost all programming languages. For Hadoop MapReduce, the equivalent trial program is WordCount, the basic, simple MapReduce program. This program gives us a good understanding of the parallel processing capability of Hadoop.
It consists of three classes:
1) Driver class - the main class
2) Mapper class - which does the map functions
3) Reducer class - which does the reduce functions
Driver Class
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCountDriver {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        // Strip generic Hadoop options (-D, -fs, -jt, ...) and keep only the job arguments.
        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        args = parser.getRemainingArgs();

        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCountDriver.class);

        // Types of the final (reducer) output key-value pairs.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // args[0] is the input path; args[1] is the output path, which must not exist yet.
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Submit the job and block until it finishes; prints true on success.
        System.out.println(job.waitForCompletion(true));
    }
}
Mapper Class
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Text word = new Text();
    private final static IntWritable one = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The key is the byte offset of the line; the value is the line itself.
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            // Emit (word, 1) for every token in the line.
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
Reducer Class
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // All the 1s emitted for this word arrive together; add them up.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
The JARs needed to compile this code should be taken from the same version of the Hadoop package that is installed on the cluster. If the versions differ, the job will result in errors.
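As a minimal sketch, on Hadoop 2.x the cluster's own JARs can be put on the compile classpath with the hadoop classpath command; the directory name and the /input and /output paths below are only illustrative:

# Compile against the cluster's own JARs; `hadoop classpath` prints them.
javac -classpath "$(hadoop classpath)" -d wordcount_classes \
    WordCountDriver.java WordCountMapper.java WordCountReducer.java
# Package the classes and submit the job.
jar -cvf wordcount.jar -C wordcount_classes/ .
hadoop jar wordcount.jar WordCountDriver /input /output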
Here the mapper reads the input file line by line: TextInputFormat hands each line to the map() method, with the byte offset of the line as the key and the line itself as the value. Inside the mapper we convert the line to a String and then tokenize it, i.e., each line is split into individual words. The output of the mapper, a set of key-value pairs, is given to the reducer.
The context.write() method is what emits a key-value pair to the reducer. Here the key is the word and the value is one, a variable holding an IntWritable set to 1.
In the reducer, the framework groups these pairs by word, so each reduce() call receives one word together with all the values emitted for it, and we sum those values to get the count for that word.
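Because these per-word sums are associative and commutative, the same reducer class can optionally also be registered as a combiner in the driver, so partial sums are computed on the map side before the shuffle. This is an optional optimization, not part of the code above; a one-line sketch:

// Optional: pre-aggregate counts on the map side before the shuffle.
// Safe for WordCount because summing is associative and commutative.
job.setCombinerClass(WordCountReducer.class);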
For example, if we give an input file such as:
Hi all I am fine Hi all I am good hello world
After the mapper, the output will be:
Hi 1
all 1
I 1
am 1
fine 1
Hi 1
all 1
I 1
am 1
good 1
hello 1
world 1
After the reducer, the output will be (the framework sorts keys during the shuffle, and Text compares raw bytes, so uppercase words come first):
Hi 2
I 2
all 2
am 2
fine 1
good 1
hello 1
world 1
Hi, could you please help me? I am trying to merge many small files into one large file using MapReduce. The input will be a folder with many files inside, and the output should be a folder with only one file.
Hi,
I copy-pasted each of these programs into 3 Java files and tried to compile and run them from cmd. When I first compiled the Mapper program, I got the following errors:
WordCountMapper.java:4: error: package org.apache.hadoop.io does not exist
import org.apache.hadoop.io.IntWritable;
^
WordCountMapper.java:5: error: package org.apache.hadoop.io does not exist
import org.apache.hadoop.io.LongWritable;
^
WordCountMapper.java:6: error: package org.apache.hadoop.io does not exist
import org.apache.hadoop.io.Text;
^
WordCountMapper.java:7: error: package org.apache.hadoop.mapreduce does not exist
import org.apache.hadoop.mapreduce.Mapper;
^
WordCountMapper.java:9: error: cannot find symbol
public class WordCountMapper extends Mapper {
^
symbol: class Mapper
WordCountMapper.java:9: error: cannot find symbol
public class WordCountMapper extends Mapper {
^
symbol: class LongWritable
WordCountMapper.java:9: error: cannot find symbol
public class WordCountMapper extends Mapper {
^
symbol: class Text
WordCountMapper.java:9: error: cannot find symbol
public class WordCountMapper extends Mapper {
^
symbol: class Text
WordCountMapper.java:9: error: cannot find symbol
public class WordCountMapper extends Mapper {
^
symbol: class IntWritable
WordCountMapper.java:10: error: cannot find symbol
private Text word = new Text();
^
symbol: class Text
location: class WordCountMapper
WordCountMapper.java:11: error: cannot find symbol
private final static IntWritable one = new IntWritable(1);
^
symbol: class IntWritable
location: class WordCountMapper
WordCountMapper.java:13: error: cannot find symbol
protected void map(LongWritable key, Text value, Context context)
^
symbol: class LongWritable
location: class WordCountMapper
WordCountMapper.java:13: error: cannot find symbol
protected void map(LongWritable key, Text value, Context context)
^
symbol: class Text
location: class WordCountMapper
WordCountMapper.java:13: error: cannot find symbol
protected void map(LongWritable key, Text value, Context context)
^
symbol: class Context
location: class WordCountMapper
WordCountMapper.java:10: error: cannot find symbol
private Text word = new Text();
^
symbol: class Text
location: class WordCountMapper
WordCountMapper.java:11: error: cannot find symbol
private final static IntWritable one = new IntWritable(1);
^
symbol: class IntWritable
location: class WordCountMapper
16 errors
What could be the possible reason?
Please check whether you have added all the necessary JAR files to the classpath.
Hi, I am getting an error at this line:
GenericOptionsParser parser = new GenericOptionsParser(conf, args);
The quick fix suggests configuring the build path. Please let me know how to remove this error.
I think you missed adding one JAR. Check whether the hadoop-common JAR is added to your classpath.