Simple Word Count using Apache Pig

The basic "hello world" program in Hadoop is word count. I explained word count implementations using Java MapReduce and Hive queries in my previous posts. Here I explain how to implement the same basic word count logic using a Pig script.
TOKENIZE is a built-in function in Apache Pig that splits a line into words and returns them as a bag.
FLATTEN, which un-nests a bag into separate rows, and COUNT, which counts the tuples in a bag, are also built into Apache Pig.
You can inspect the output of any step by running the DUMP command on its relation.
The Pig script below counts the words in the input file.

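-- Load the input file with one chararray field named line
A = load 'data.txt' as (line:chararray);
-- Split each line into a bag of word tokens
B = foreach A generate TOKENIZE(line) as tokens;
-- Un-nest each bag so that every word becomes its own row
C = foreach B generate flatten(tokens) as words;
-- Group identical words together
D = group C by words;
-- Count the occurrences of each word
E = foreach D generate group, COUNT(C);
-- Sort by the count (second field), ascending
F = order E by $1;
-- Print the result to the console
dump F;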
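
To see what each step produces, the same pipeline can be run with inspection commands added. This is only a sketch under a few assumptions: the field names word and total and the descending sort are my own choices for illustration and are not part of the script above, and the row shapes in the comments assume an input line such as "hello world".

```pig
-- Same word count pipeline, with DESCRIBE and DUMP added after the intermediate steps.
A = LOAD 'data.txt' AS (line:chararray);
DESCRIBE A;   -- prints the schema of A

B = FOREACH A GENERATE TOKENIZE(line) AS tokens;
DUMP B;       -- one bag of tokens per input line, roughly ({(hello),(world)})

C = FOREACH B GENERATE FLATTEN(tokens) AS word;
DUMP C;       -- one word per row, roughly (hello) then (world)

D = GROUP C BY word;
DUMP D;       -- each word followed by a bag holding all of its occurrences

-- Naming the output fields (assumed names) lets ORDER refer to total instead of $1.
E = FOREACH D GENERATE group AS word, COUNT(C) AS total;
F = ORDER E BY total DESC;   -- most frequent words first (assumed preference)
DUMP F;
```

Each DUMP triggers execution of the pipeline up to that relation, so on large inputs it is probably better to keep only the DUMP statements you need.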
About amalgjose
I am an Electrical Engineer by qualification and now work as a Software Engineer. I am very interested in the electrical, electronics, mechanical and now software fields, and I like exploring them. I like travelling and long drives, and I am very much addicted to music.
