Simple Word Count using Apache Pig

The basic hello world program in hadoop is the word count program. I have explained the word count implementation using java mapreduce and hive queries in my previous posts. Here I am explaining the implementation of basic word count logic using pig script.
TOKENIZE is a build in function available in apache pig which tokenizes a line into words.
Similarly FLATTEN and COUNT are also built-in functions available in apache pig.
You can check the output or flow of each step by using DUMP command after every step.
The below pig scripts will do the count of words in the input file.

A = load 'data.txt' as (line:chararray);
B = foreach A generate TOKENIZE(line) as tokens;
C = foreach B generate flatten(tokens) as words;
D = group C by words;
E = foreach D generate group, COUNT(C);
F = order E by $1;
dump F;

About amalgjose
I am an Electrical Engineer by qualification, now I am working as a Software Architect. I am very much interested in Electrical, Electronics, Mechanical and now in Software fields. I like exploring things in these fields. I love travelling, long drives and music.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: