May 31, 2014
Word count is the "hello world" program of Hadoop. In my previous posts I explained how to implement it with Java MapReduce and with Hive queries. Here I explain how to implement the same basic word count logic with a Pig script.
TOKENIZE is a built-in function in Apache Pig that splits a line into a bag of words.
FLATTEN and COUNT are also built into Apache Pig: FLATTEN (strictly an operator, although it is written like a function) un-nests a bag so that each element becomes its own row, and COUNT returns the number of elements in a bag.
You can check the output of any step by running DUMP on its alias right after that step.
The Pig script below counts the occurrences of each word in the input file.
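-- read each line of the input as a single chararray
A = load 'data.txt' as (line:chararray);
-- split each line into a bag of words
B = foreach A generate TOKENIZE(line) as tokens;
-- un-nest the bag: one row per word
C = foreach B generate flatten(tokens) as words;
-- collect identical words together
D = group C by words;
-- emit each word with its number of occurrences
E = foreach D generate group, COUNT(C);
-- sort by count, ascending
F = order E by $1;
dump F;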
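To see what each step produces, DUMP can be run on every intermediate alias. The sketch below assumes a hypothetical data.txt that contains the single line "hello world hello"; the file contents and the shapes shown in the comments are for illustration only, and the row order of the unsorted relations may differ on a real run.

```pig
-- Assumption: data.txt holds one line, "hello world hello" (hypothetical sample).
A = load 'data.txt' as (line:chararray);
dump A;   -- (hello world hello)

B = foreach A generate TOKENIZE(line) as tokens;
dump B;   -- ({(hello),(world),(hello)})   one bag of words per line

C = foreach B generate flatten(tokens) as words;
dump C;   -- (hello) (world) (hello)       one row per word

D = group C by words;
dump D;   -- (hello,{(hello),(hello)}) (world,{(world)})

E = foreach D generate group, COUNT(C);
dump E;   -- (hello,2) (world,1)

F = order E by $1;
dump F;   -- (world,1) (hello,2)           ascending by count
```

Each DUMP triggers its own execution, so on large inputs it is cheaper to dump only the alias you are debugging.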