Simple Sentence Detector and Tokenizer Using OpenNLP

Machine learning is a branch of artificial intelligence in which we create and study systems that can learn from data. We all learn from our own experience or from the experience of others. In machine learning, the system likewise learns from experience, which we feed to it as data.

So, to get an inference about something, we first train the system with a set of data. From that data the system learns, and it becomes capable of giving inferences for new data. This is the basic principle behind machine learning.

There are a lot of machine learning toolkits available. Here I explain a simple program using Apache OpenNLP. The OpenNLP library is a machine-learning-based toolkit for text processing, and it provides many components. In this post I explain a simple sentence detector and a tokenizer built with OpenNLP.

Sentence Detector

Download en-sent.bin from the Apache OpenNLP website and add it to the class path.

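// Requires java.io.IOException, java.io.InputStream and
// opennlp.tools.sentdetect.{SentenceDetector, SentenceDetectorME, SentenceModel}
public void SentenceSplitter() {
	SentenceDetector sentenceDetector = null;
	InputStream modelIn = null;

	try {
		// Load the pre-trained English sentence model from the class path
		modelIn = getClass().getResourceAsStream("en-sent.bin");
		final SentenceModel sentenceModel = new SentenceModel(modelIn);
		sentenceDetector = new SentenceDetectorME(sentenceModel);
	} catch (final IOException ioe) {
		ioe.printStackTrace();
	} finally {
		if (modelIn != null) {
			try {
				modelIn.close();
			} catch (final IOException e) {} // nothing to do if close fails
		}
	}
	if (sentenceDetector == null) {
		return; // the model could not be loaded
	}
	String[] sentences = sentenceDetector.sentDetect("I am Amal. I am engineer. I like travelling and driving");
	for (int i = 0; i < sentences.length; i++) {
		System.out.println(sentences[i]); // one detected sentence per line
	}
}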

Instead of giving the sentences inside the program, you can supply them from an input file.
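One way to do that, as a minimal sketch: read the whole file into a String with the standard java.nio.file API and pass that String to sentDetect in place of the literal. The class name InputFileReader, the method readText and the file name input.txt are hypothetical names chosen for illustration, and UTF-8 encoding is an assumption.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class InputFileReader {

    // Reads the whole file into one String, assuming UTF-8 encoding.
    public static String readText(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // "input.txt" is a placeholder; pass a real path as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "input.txt");
        String text = readText(file);
        // With OpenNLP on the class path, this text would replace the literal:
        //   String[] sentences = sentenceDetector.sentDetect(text);
        System.out.println(text);
    }
}
```

Reading the file in one go keeps the sketch short; for very large inputs it may be better to read and process the text in pieces.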


Tokenizer

Download en-token.bin from the Apache OpenNLP website and add it to the class path.

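// Requires java.io.FileNotFoundException, java.io.IOException, java.io.InputStream and
// opennlp.tools.tokenize.{Tokenizer, TokenizerME, TokenizerModel}
public void Tokenizer() throws FileNotFoundException {
	// Alternative: InputStream modelIn = new FileInputStream("en-token.bin");
	InputStream modelIn = getClass().getResourceAsStream("en-token.bin");
	try {
		// Load the pre-trained English tokenizer model
		TokenizerModel model = new TokenizerModel(modelIn);
		Tokenizer tokenizer = new TokenizerME(model);
		String[] tokens = tokenizer.tokenize("Sample tokenizer program using java");

		for (int i = 0; i < tokens.length; i++) {
			System.out.println(tokens[i]); // one token per line
		}
	} catch (IOException e) {
		e.printStackTrace();
	} finally {
		if (modelIn != null) {
			try {
				modelIn.close();
			} catch (IOException e) {
				// nothing to do if close fails
			}
		}
	}
}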

Rhipe Installation

Rhipe was first developed by Saptarshi Guha.
Rhipe needs R and Hadoop, so install R and Hadoop first. Installing R and Hadoop is explained in my previous posts. The latest version of Rhipe as of now is Rhipe-0.73.1, and the latest available version of R is R-3.0.0. If you are using CDH4 (the Cloudera distribution of Hadoop), use Rhipe-0.73 or later, because older versions may not work with CDH4.
Rhipe is an integrated programming environment for R and Hadoop. It is well suited to statistical and analytical computation on very large data sets: because R is integrated with Hadoop, the processing runs in distributed mode, i.e. as MapReduce jobs.
Further explanations of Rhipe are available in


Hadoop, R, Protocol Buffers and rJava should be installed before installing Rhipe.
We are installing Rhipe on a Hadoop cluster, so a submitted job may execute on any of the tasktracker nodes. R and Rhipe must therefore be installed on all the tasktracker nodes; otherwise you will face an exception such as “Cannot find R”.

Installing Protocol Buffers

Download Protocol Buffers 2.4.1 from the link below

tar -xzvf protobuf-2.4.1.tar.gz

chmod -R 755 protobuf-2.4.1

cd protobuf-2.4.1

./configure

make

make install

Set the environment variable PKG_CONFIG_PATH

nano /etc/bashrc

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig

Save and exit

Then execute the following commands to check the installation

pkg-config --modversion protobuf

This will show the version number 2.4.1
Then execute

pkg-config --libs protobuf

This will display the following

-pthread -L/usr/local/lib -lprotobuf -lz -lpthread

If both commands work, protobuf is properly installed.

Set the environment variables for Hadoop

For example

nano /etc/bashrc

export HADOOP_HOME=/usr/lib/hadoop

export HADOOP_BIN=/usr/lib/hadoop/bin

export HADOOP_CONF_DIR=/etc/hadoop/conf

Save and exit


cd /etc/

nano Protobuf-x86.conf

/usr/local/lib   # add this value as the content of Protobuf-x86.conf

Save and exit


Installing rJava

Download the rJava tarball from the link below.

The latest version of rJava available as of now is rJava_0.9-4.tar.gz

Install rJava using the following command

R CMD INSTALL rJava_0.9-4.tar.gz

Installing Rhipe

Rhipe can be downloaded from the following link

R CMD INSTALL Rhipe_0.73.1.tar.gz

This will install Rhipe.

After this, type R in the terminal to enter the R console.

Then type

library(Rhipe)

#This will display


| Please call rhinit() else RHIPE will not run |



rhinit()

#This will display

Rhipe: Detected CDH4 jar files, using RhipeCDH4.jar
Initializing Rhipe v0.73
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/client-0.20/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/client/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See for an explanation.
Initializing mapfile caches

Now you can execute your Rhipe scripts.

R Installation in Linux Platforms

R is free software for statistical and analytical computation. It is also a very good tool for producing graphics.
R is used in a wide range of areas and lets us carry out statistical analysis interactively.
To use R, we first need to install it on our computer. R can be installed on Windows, Linux, Mac OS, etc.

On Linux platforms, R is usually installed by compiling the tarball.
The latest stable version of R as of now is R-3.0.0.
Installation of R on Linux platforms is explained below.

Installation Using rpm

If you are using a Redhat or CentOS distribution of Linux, you can install R using either tarballs or rpm packages.
However, the latest versions may not be available as rpm.
Installation using rpm is simple.
Just download the rpm files together with their dependencies and install each one using the command

rpm -ivh <rpm-name>

Installation Using yum

If you have an Internet connection, installation is very simple.
Just run the following command.

yum install R-core R-2*

Installing R using tarball

R is available as a tarball that is compatible with all Linux platforms.
The latest versions of R are available as tarballs.

The installation steps are given below.
Get the latest R tar file for Linux from

Extract the tarball

tar -xzvf R-xxx.tar.gz

Change the permissions of the extracted directory

chmod -R 755 R-xxx

Then go inside the extracted R directory and do the following steps

./configure  --enable-R-shlib

The above step may fail if dependent libraries are missing from your OS.
If it fails, install the missing libraries and run the above step again.
Once it completes successfully, do the following steps.

make

make install


After this, set R_HOME and PATH in /etc/bashrc (Redhat or CentOS), /etc/bash.bashrc (Ubuntu) or ~/.bashrc (if you have no root privilege)

export R_HOME=<path to R installation>
export PATH=$PATH:$R_HOME/bin


Then run the command that matches the file you edited

source /etc/bashrc  (For Redhat or CentOS)


source /etc/bash.bashrc    (For Ubuntu)


source ~/.bashrc   (If no root privilege)


Check R installation

Type R in your terminal

If the R prompt appears, your R installation is successful.

You can quit from R by using the command q()

Simple Tag Cloud Generation Using a Java Program

A tag cloud is a visual representation of text data. The tags are words, and the importance of each word is highlighted using colour or font size. Tag clouds are now very popular for analysing the contents of websites, because they help the reader quickly perceive the most important words. The importance of a word is calculated by counting its number of occurrences, and a weightage is given to each word (tag) based on that count. After the whole text has been analysed, each word is displayed according to its weightage, and this produces the tag cloud.

OpenCloud is a Java library for generating tag clouds, and I used it here. Normally we need a web server to get a good UI for the tag cloud; here we display the cloud using Swing instead. This is a sample program that generates a simple tag cloud. To run it, download the OpenCloud library.

package tagcloud;

import java.util.Random;

import javax.swing.JFrame;
import javax.swing.JLabel;
import javax.swing.JPanel;
import javax.swing.SwingUtilities;

import org.mcavallo.opencloud.Cloud;
import org.mcavallo.opencloud.Tag;

public class TestOpenCloud {

    private static final String[] WORDS = { "amal", "india", "hello", "amal", "birthday", "amal", "hello", "california", "america", "software",
        "cat", "bike", "car", "christmas", "city", "zoo", "amal", "asia", "family", "festival", "flower", "flowers", "food",
        "little", "friends", "fun", "amal", "outing", "india", "weekend", "india", "software", "me", "music", "music", "music",
        "new", "love", "night", "nikon", "morning", "love", "park", "software", "people", "portrait", "flower", "sky", "travelling",
        "spain", "summer", "sunset", "india", "city", "india", "amal", "uk", "usa", "", "water", "wedding","cool","happy","friends","best","trust","good",
    };

    protected void initUI() {
        JFrame frame = new JFrame(TestOpenCloud.class.getSimpleName());
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE); // exit when the window is closed
        JPanel panel = new JPanel();
        Cloud cloud = new Cloud();
        Random random = new Random();
        // Add every word a random number of times so the tags get different weights
        for (String s : WORDS) {
            for (int i = random.nextInt(50); i > 0; i--) {
                cloud.addTag(s);
            }
        }
        // Show each tag as a label whose font size follows the tag weight
        for (Tag tag : cloud.tags()) {
            final JLabel label = new JLabel(tag.getName());
            label.setFont(label.getFont().deriveFont((float) tag.getWeight() * 10));
            panel.add(label);
        }
        frame.add(panel);
        frame.setSize(800, 600);
        frame.setVisible(true);
    }

    public static void main(String[] args) {
        // Build the UI on the Swing event dispatch thread
        SwingUtilities.invokeLater(new Runnable() {
            public void run() {
                new TestOpenCloud().initUI();
            }
        });
    }
}
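
The weighting idea described earlier (count the occurrences of each word, then size the word by its count) can also be sketched without any library. This is only an illustration and not how OpenCloud computes its weights internally: the class TagWeights, its methods countWords and fontSize, and the linear scaling between a minimum and a maximum font size are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TagWeights {

    // Counts how many times each non-empty word occurs, keeping first-seen order.
    public static Map<String, Integer> countWords(String[] words) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String w : words) {
            if (w == null || w.isEmpty()) {
                continue; // skip blank entries
            }
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // Linearly maps a count in [1, maxCount] to a font size in [minSize, maxSize].
    // The linear mapping is an assumption made for this sketch.
    public static float fontSize(int count, int maxCount, float minSize, float maxSize) {
        if (maxCount <= 1) {
            return minSize;
        }
        return minSize + (maxSize - minSize) * (count - 1) / (maxCount - 1);
    }

    public static void main(String[] args) {
        String[] words = { "amal", "india", "hello", "amal", "music", "india", "amal" };
        Map<String, Integer> counts = countWords(words);
        int max = 0;
        for (int c : counts.values()) {
            max = Math.max(max, c);
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + " count=" + e.getValue()
                    + " size=" + fontSize(e.getValue(), max, 10f, 40f));
        }
    }
}
```

The computed size could be passed to deriveFont on a JLabel in the same way the Swing program uses the tag weight.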