How to check the Java architecture?

To check whether a Java installation is 32-bit or 64-bit, the following commands will be helpful.

Execute the following commands in the command line and check the results.

java -d32 -version

java -d64 -version

If a command gives an error message similar to “Error: This Java instance does not support a xx-bit JVM.”, the installation does not support that data model. If it does not support 32-bit, we can say it is a 64-bit Java; if it does not support 64-bit, we can say it is a 32-bit Java.
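The same check can also be done from code. Below is a minimal sketch; note that sun.arch.data.model is a HotSpot-specific system property (it may be absent on other JVMs, where os.arch is a rough fallback).

public class ArchCheck {
	public static void main(String[] args) {
		// "32" or "64" on HotSpot JVMs; may be null elsewhere
		System.out.println("Data model : " + System.getProperty("sun.arch.data.model"));
		// Architecture the JVM was built for, e.g. x86 or amd64
		System.out.println("JVM arch   : " + System.getProperty("os.arch"));
	}
}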


Hope this will help. 🙂


Sample program with Hadoop Counters and Distributed Cache

Counters are a very useful feature in Hadoop. They help us track global events in a job, i.e., across the map and reduce phases.
When we execute a MapReduce job, we can see a lot of counters listed in the logs. Other than the default built-in counters, we can create our own custom counters. The custom counters will be listed along with the built-in counters.
This helps us in several ways. Here I am explaining a scenario where I am using a custom counter for counting the number of good words and stop words in the given text files. The stop words in this program are provided at run time using the distributed cache.
This is a mapper-only job: calling job.setNumReduceTasks(0) makes it a mapper-only job.

Here I am introducing another feature in Hadoop called the distributed cache.
The distributed cache distributes application-specific read-only files efficiently throughout the application.
My requirement is to filter the stop words from input text files. The stop word list may vary, so if I hard code the list in my program, I have to update the code every time the list changes. This is not a good practice. Instead, I used the distributed cache: the file containing the stop words is loaded into the distributed cache, which makes it available to the mappers as well as the reducers. In this program, we don’t require any reducer.

The code is attached below. You can also get the code from GitHub.
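Since the full listing is on GitHub, here is a minimal sketch of the job described above. The class and counter names (Skipper, WordCounter, SkipMapper) are illustrative assumptions of mine, not necessarily the names used in the actual code; the mechanics (a custom counter enum, a stop word file pulled from the distributed cache, zero reducers) are the point.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Skipper {

	// Custom counters; these names appear in the job logs
	// along with the built-in counters
	public enum WordCounter { GOOD_WORDS, STOP_WORDS }

	public static class SkipMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

		private final Set<String> stopWords = new HashSet<String>();

		@Override
		protected void setup(Context context) throws IOException {
			// The driver placed the stop word file in the distributed cache,
			// so it is localized (and symlinked by its base name) on every node
			URI[] cacheFiles = context.getCacheFiles();
			if (cacheFiles != null && cacheFiles.length > 0) {
				BufferedReader reader = new BufferedReader(
						new FileReader(new Path(cacheFiles[0].getPath()).getName()));
				String line;
				while ((line = reader.readLine()) != null) {
					stopWords.add(line.trim());
				}
				reader.close();
			}
		}

		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			for (String word : value.toString().split("\\s+")) {
				if (stopWords.contains(word)) {
					context.getCounter(WordCounter.STOP_WORDS).increment(1);
				} else {
					context.getCounter(WordCounter.GOOD_WORDS).increment(1);
					context.write(new Text(word), NullWritable.get());
				}
			}
		}
	}

	public static void main(String[] args) throws Exception {
		// Expected arguments: -skip <stop-word-file-in-hdfs> <input> <output>
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf, "skipper");
		job.setJarByClass(Skipper.class);
		job.setMapperClass(SkipMapper.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(NullWritable.class);
		job.setNumReduceTasks(0);                    // mapper-only job
		job.addCacheFile(new Path(args[1]).toUri()); // the stop word file
		FileInputFormat.addInputPath(job, new Path(args[2]));
		FileOutputFormat.setOutputPath(job, new Path(args[3]));
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}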

Create a Java project with the above Java classes and add the dependent Java libraries (these will be present in your Hadoop installation). Export the project as a runnable jar and execute it. The file containing the stop words should be present in HDFS, with one stop word per line. A sample format is given below.

is

the

am

are

with

was

were

Sample command to execute the program is given below.

hadoop jar <jar-name> -skip <stop-word-file-in-hdfs> <input-data-location> <output-location>

Eg: hadoop jar Skipper.jar -skip /user/hadoop/skip/skip.txt /user/hadoop/input /user/hadoop/output

In the job logs, you can also see the custom counters listed along with the built-in ones.


Program to compress a file in Snappy format

Hadoop supports various compression formats, and Snappy is one of them. I created a Snappy compressed file using the Google Snappy library and used it in Hadoop, but it gave me an error saying that the file is missing the Snappy identifier. I did a little research on this and found a workaround. The method I followed was as follows.
I compressed a file with Snappy twice: once using the Google Snappy library and once using the Snappy codec present in Hadoop. I compared the file size and checksum of the two files and found that they differ. The compressed file created with the Hadoop Snappy codec has a few more bytes than the one created with Google Snappy; those extra bytes are metadata.
The code shown below will help you create a Snappy compressed file that works perfectly in Hadoop. It requires the following dependent jars, which are available in your Hadoop installation.
1) hadoop-common.jar

2) guava-xx.jar

3) log4j.jar

4) commons-collections.jar

5) commons-logging.x.x.x.jar

You can download the code directly from GitHub.
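If the download is not handy, the sketch below shows the idea: route the data through Hadoop’s own SnappyCodec so the output carries the framing and metadata Hadoop expects. The class name SnappyCompress and the argument handling are my illustrative assumptions, and the native Snappy library must be available to the JVM for the codec to work.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class SnappyCompress {
	public static void main(String[] args) throws Exception {
		// args[0] = input file, args[1] = output file (e.g. data.snappy)
		Configuration conf = new Configuration();
		SnappyCodec codec = ReflectionUtils.newInstance(SnappyCodec.class, conf);

		InputStream in = new FileInputStream(args[0]);
		OutputStream rawOut = new FileOutputStream(args[1]);
		// The codec wraps the raw stream and writes the header/metadata
		// that Hadoop looks for when reading the file back
		CompressionOutputStream out = codec.createOutputStream(rawOut);
		IOUtils.copyBytes(in, out, 4096, false);
		out.finish();
		out.close();
		in.close();
	}
}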

“Missing artifact jdk.tools:jdk.tools:jar:1.6”

While using maven, we may face an error like
“Missing artifact jdk.tools:jdk.tools:jar:1.6”

This problem can be fixed by adding the lines below to your pom.xml file.
Replace ${JAVA_HOME} in the XML with the absolute path of JAVA_HOME.

<dependency>
  <groupId>jdk.tools</groupId>
  <artifactId>jdk.tools</artifactId>
  <version>1.6</version>
  <scope>system</scope>
  <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
</dependency>
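A variant that avoids hard coding the path uses Maven’s built-in ${java.home} property, which points at the JRE inside the JDK (hence the ../). This is a widely used workaround, though whether it fits depends on your JDK layout:

<dependency>
  <groupId>jdk.tools</groupId>
  <artifactId>jdk.tools</artifactId>
  <version>1.6</version>
  <scope>system</scope>
  <systemPath>${java.home}/../lib/tools.jar</systemPath>
</dependency>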

Simple Hive JDBC Client

Here I am explaining a sample Hive JDBC client, with which we can fire Hive queries from Java programs. The only thing is that we need to start the Hive server first; by default, the Hive server listens on port 10000. The sample program is given below. The program is self-explanatory, and you can rewrite it to execute any type of Hive query. For this program you need the Hive JDBC driver jar and its dependencies in the classpath.
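If the Hive server is not already running, it can usually be started from the Hive installation with the HiveServer1 service command (this is the server that org.apache.hadoop.hive.jdbc.HiveDriver talks to):

hive --service hiveserver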

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

  /*
   * @author
   * Amal G Jose
   * 
   */

public class HiveJdbc {
  private static String driver = "org.apache.hadoop.hive.jdbc.HiveDriver";

  /**
   * @param args
   * @throws SQLException
   */
  public static void main(String[] args) throws SQLException {
    try {
      Class.forName(driver);
    } catch (ClassNotFoundException e) {
      e.printStackTrace();
      System.exit(1);
    }

    // Assuming the Hive server runs on localhost; change the host as needed
    Connection connect = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement state = connect.createStatement();
    String tableName = "test";
    // Recreate the table named 'test'; the old Hive driver runs DDL via executeQuery
    state.executeQuery("drop table " + tableName);
    ResultSet res = state.executeQuery("create table " + tableName + " (key int, value string)");
   
   // Query to show tables
    String show = "show tables";
    System.out.println("Running: " + show);
    res = state.executeQuery(show);
    if (res.next()) {
      System.out.println(res.getString(1));
    }

    // Query to describe table
    String describe = "describe " + tableName;
    System.out.println("Running: " + describe);
    res = state.executeQuery(describe);
    while (res.next()) {
      System.out.println(res.getString(1) + "\t" + res.getString(2));
    }

  }
}

Swapping Two numbers without using a third variable

This is a simple method for swapping the values of two numeric variables without using a third variable.
The sample java code is given below.

public void Swapping(int a, int b)
	{
		System.out.println("Values Before Swapping");
		System.out.println(a+" , "+b);
		// Sum-and-subtract trick; it survives int overflow
		// because two's-complement arithmetic wraps around
		a=a+b;
		b=a-b;
		a=a-b;
		System.out.println("Values After Swapping");
		System.out.println(a+" , "+b);
	}
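For the record, a common alternative is the XOR trick; this variant is my addition, not part of the original post.

public void SwappingXor(int a, int b)
	{
		// Each XOR cancels one operand out of the combined value
		a = a ^ b;
		b = a ^ b;
		a = a ^ b;
		System.out.println("Values After Swapping");
		System.out.println(a+" , "+b);
	}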

Checking for Odd or Even without using any Conditional Statements

The other day a friend asked me to write a program that tells whether a given number is odd or even without using any conditional statements. It is very simple, and there may be several solutions; two of them are given below.

Using Array


public void OddEven(int num)
	{
		String []store = {"even","odd"};
		// Math.abs handles negative inputs, for which num%2 would be -1
		System.out.println("The number is "+store[Math.abs(num%2)]);
	}

Using try-catch

public void EvenOdd ( int num)
	{
		int temp = num%2;
		try {
			// Dividing by zero throws ArithmeticException when num is even
			int ans = 10/temp;
			System.out.println("Number is odd");
		}
		catch (ArithmeticException e) {
			System.out.println("Number is even");
		}
	}

A Simple Multithreaded Program in Java

Java provides built-in support for multithreaded programming. A multithreaded program contains two or more parts that can run concurrently. Each part of such a program is called a thread, and each thread defines a separate path of execution.

Here I am explaining a simple multi-threaded program.

The main thread writes 5000 down to 1 in a file named MainThread.txt, and the child thread writes 1 to 5000 in a file named childthread.txt.
Both will happen at the same time; that is, they run in parallel.

We create the child thread class by implementing the Runnable interface.

This class contains a method named run() where we put our functionality.

We instantiate this thread class in the main method, so it runs along with the main thread.

The child thread class is

package com.amal.thread;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class ThreadTest implements Runnable {
	Thread t;
	ThreadTest()
	{
		t=new Thread(this,"My Test");
		System.out.println("My test thread");
		t.start();

	}
	public void run() {
		File file=new File("childthread.txt");

		try {

			FileWriter fwt = new FileWriter(file.getAbsoluteFile());
			BufferedWriter bwt = new BufferedWriter(fwt);

			// Write 1 to 5000, matching the description above
			for(int i=1; i<=5000; i++)
			{
				bwt.write("thread "+i);
				bwt.newLine();
			}
			bwt.close();
		} catch (IOException e) {
			e.printStackTrace();
		}
	} 
}

The main class is

package com.amal.thread;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class MainClass {
	public static void main(String[] args) throws IOException {
		new ThreadTest();
		File file1=new File("MainThread.txt");
		FileWriter fw = new FileWriter(file1.getAbsoluteFile());
		BufferedWriter bw = new BufferedWriter(fw);

		// Write 5000 down to 1, matching the description above
		for (int i=5000; i>=1; i--)
		{
			bw.write("main "+i);
			bw.newLine();
		}
		bw.close();
	}
}

Custom Text Input Format Record Delimiter for Hadoop

By default, a MapReduce program accepts text files and reads them line by line. Technically speaking, the default input format is TextInputFormat and the default record delimiter is '\n' (new line).

In several cases we need to override this property, for example when you have a large text file and you want each read to return the contents between '.' characters.

In this case, using the default delimiter will be difficult.

For example, if you have a file like this:


Ice cream (derived from earlier iced cream or cream ice) is a frozen dessert usually made from dairy products, such as milk and cream and often combined with fruits or other ingredients and flavours. Most varieties contain sugar, although some are made with other sweeteners. In some cases, artificial flavourings and colourings are used in addition to, or instead of, the natural ingredients. The mixture of chosen ingredients is stirred slowly while cooling, in order to incorporate air and to prevent large ice crystals from forming. The result is a smoothly textured semi-solid foam that is malleable and can be scooped.

And we want each record to be


1) Ice cream (derived from earlier iced cream or cream ice) is a frozen dessert usually made from dairy products, such as milk and cream and often combined with fruits or other ingredients and flavours

2) Most varieties contain sugar, although some are made with other sweeteners

3) In some cases, artificial flavourings and colourings are used in addition to, or instead of, the natural ingredients

4) The mixture of chosen ingredients is stirred slowly while cooling, in order to incorporate air and to prevent large ice crystals from forming

5) The result is a smoothly textured semi-solid foam that is malleable and can be scooped

We can do this by overriding the property textinputformat.record.delimiter.

We can either set this property in the driver class or change the value of the delimiter in the TextInputFormat class.

The first method is the easier way.

Setting the textinputformat.record.delimiter in Driver class

The format for setting it in the program (driver class) is

conf.set("textinputformat.record.delimiter", "delimiter")

The value you set by this method ultimately goes into the TextInputFormat class, as explained below.
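For instance, a driver could set the delimiter to '.' like the sketch below. Note that the property must be set on the Configuration before the Job object is created, since the Job copies the configuration (the job name here is illustrative):

Configuration conf = new Configuration();
// Every '.' now ends a record, so each map() call gets one sentence
conf.set("textinputformat.record.delimiter", ".");
Job job = Job.getInstance(conf, "sentence records");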

Editing the TextInputFormat class

Default TextInputFormat class

public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text>
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
// By default, textinputformat.record.delimiter = '\n' (set in the configuration file)
    String delimiter = context.getConfiguration().get(
        "textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes();
    return new LineRecordReader(recordDelimiterBytes);
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    return codec == null;
  }
}

Edited TextInputFormat class


public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text>
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {

// Hardcoding this value as "."
// You can use any delimiter as per your requirement

    String delimiter = ".";
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes();
    return new LineRecordReader(recordDelimiterBytes);
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    return codec == null;
  }
}
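If you take the editing route, a cleaner variant is to copy the class under a new name instead of patching the Hadoop jar itself, and point the job at the copy in the driver. The class name CustomTextInputFormat below is an illustrative assumption:

// In the driver, with the copied class on the job's classpath
job.setInputFormatClass(CustomTextInputFormat.class);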

Simple Sentence Detector and Tokenizer Using OpenNLP

Machine learning is a branch of artificial intelligence in which we create and study systems that can learn from data. We all learn from our own experience or the experience of others; in machine learning, the system also learns from experience, which we feed to it as data.

So to get an inference about something, we first train the system with some set of data. From that data the system learns, and it becomes capable of giving inferences for new data. This is the basic principle behind machine learning.

There are a lot of machine learning toolkits available. Here I am explaining a simple program using Apache OpenNLP. The OpenNLP library is a machine learning based toolkit made for text processing, and a lot of components are available in it. Here I am explaining a simple sentence detector and a tokenizer using OpenNLP.

Sentence Detector

Download en-sent.bin from the Apache OpenNLP website and add it to the classpath.


// Requires: java.io.{InputStream, IOException} and
// opennlp.tools.sentdetect.{SentenceDetector, SentenceDetectorME, SentenceModel}
public void SentenceSplitter()
	{
	SentenceDetector sentenceDetector = null;
	InputStream modelIn = null;

	try {
		modelIn = getClass().getResourceAsStream("en-sent.bin");
		final SentenceModel sentenceModel = new SentenceModel(modelIn);
		sentenceDetector = new SentenceDetectorME(sentenceModel);
	}
	catch (final IOException ioe) {
		ioe.printStackTrace();
	}
	finally {
		// The model has been fully read by now, so the stream can be closed
		if (modelIn != null) {
			try {
				modelIn.close();
			} catch (final IOException e) {}
		}
	}
	String sentences[] = sentenceDetector.sentDetect("I am Amal. I am an engineer. I like travelling and driving");
	for (int i = 0; i < sentences.length; i++)
	{
		System.out.println(sentences[i]);
	}
	}

Instead of hard-coding the sentences inside the program, you can supply them from an input file.

Tokenizer

Download en-token.bin from the Apache OpenNLP website and add it to the classpath.

// Requires: java.io.{InputStream, FileNotFoundException, IOException} and
// opennlp.tools.tokenize.{Tokenizer, TokenizerME, TokenizerModel}
public void Tokenizer() throws FileNotFoundException
	{
	// The model can also be read from the file system:
	// InputStream modelIn = new FileInputStream("en-token.bin");
	InputStream modelIn = getClass().getResourceAsStream("en-token.bin");
	try {
		TokenizerModel model = new TokenizerModel(modelIn);
		Tokenizer tokenizer = new TokenizerME(model);
		String tokens[] = tokenizer.tokenize("Sample tokenizer program using java");

		for (int i = 0; i < tokens.length; i++)
		{
			System.out.println(tokens[i]);
		}
	}
	catch (IOException e) {
		e.printStackTrace();
	}
	finally {
		if (modelIn != null) {
			try {
				modelIn.close();
			}
			catch (IOException e) {
			}
		}
	}
	}