Simple Word Count using Apache Pig

The basic "hello world" program in Hadoop is the word count program. I have explained word count implementations using Java MapReduce and Hive queries in my previous posts. Here I explain the implementation of the basic word count logic using a Pig script.
TOKENIZE is a built-in function in Apache Pig that splits a line of text into words. FLATTEN and COUNT are also built-in Apache Pig functions.
You can check the output or flow of each step by running the DUMP command after that step.
The Pig script below counts the words in the input file.

A = LOAD 'data.txt' AS (line:chararray);          -- read each line of the input file
B = FOREACH A GENERATE TOKENIZE(line) AS tokens;  -- split each line into a bag of words
C = FOREACH B GENERATE FLATTEN(tokens) AS words;  -- one record per word
D = GROUP C BY words;                             -- group identical words together
E = FOREACH D GENERATE group, COUNT(C);           -- count the occurrences of each word
F = ORDER E BY $1;                                -- sort by the count
DUMP F;                                           -- print the result
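While developing the script, DUMP can be run after any step to inspect that relation, and DESCRIBE prints its schema. To write the final result to a file instead of printing it, STORE can be used; the output path below is only illustrative.

DESCRIBE C;
DUMP C;
STORE F INTO 'wordcount_output';

STORE creates the given output directory and writes the result into part files, in the same way a MapReduce job does.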

Pig – Local and Distributed Execution modes

There are currently two execution environments for Pig:

  • Local execution in a single JVM
  • Distributed execution on a Hadoop cluster.

Local mode

In local mode, Pig uses a single JVM and the local file system as its execution environment. Running in local mode does not require a Hadoop cluster. To enter local mode, type the command below in the terminal. The execution type is set using the -x or -exectype option. When you type pig -x local, you will see output similar to the one below and enter the grunt shell. On examining the INFO logs below, you can see that Pig is using the local file system.


pig -x local


2013-07-10 16:46:56,344 [main] INFO  org.apache.pig.Main - Apache Pig version 0.10.0-cdh4.1.2 (rexported) compiled Nov 01 2012, 18:38:58

2013-07-10 16:46:56,345 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/amal_george/pig_1373455016342.log

2013-07-10 16:46:56,500 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///

grunt>
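Instead of typing each statement into the grunt shell, the statements can be saved in a script file and run in one go. The script file name below is an assumption; in local mode, data.txt is read from the local file system relative to the current directory.

pig -x local wordcount.pig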

Distributed mode

On a machine with Pig installed, typing pig in the terminal enters distributed execution mode by default. In distributed mode, the job runs as MapReduce and uses HDFS as the file system, so we need a Hadoop cluster to run Pig in distributed mode.

When you type pig in the terminal, you will see output similar to the one below and enter the grunt shell. On examining the INFO logs below, you can see that Pig is connecting to a cluster.

2013-07-10 16:47:52,510 [main] INFO  org.apache.pig.Main - Apache Pig version 0.10.0-cdh4.1.2 (rexported) compiled Nov 01 2012, 18:38:58

2013-07-10 16:47:52,511 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/amal_george/pig_1373455072507.log
2013-07-10 16:47:52,797 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://master:9000

2013-07-10 16:47:53,487 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: master:9001

grunt>
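The same script works unchanged in distributed mode, but the path in the LOAD statement is now resolved against HDFS (relative paths resolve against the user's HDFS home directory), so the input file must be copied there first. The commands below are a sketch; the script file name is an assumption.

hadoop fs -put data.txt data.txt
pig wordcount.pig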