Simple Sentence Detector and Tokenizer Using OpenNLP

Date: May 9, 2013
Author: Amal G Jose

Machine learning is a branch of artificial intelligence in which we build and study systems that can learn from data. We all learn from our own experience or from the experience of others. In machine learning, the system likewise learns from experience, which we supply as data.

To get an inference about something, we first train the system on a set of data. From that data the system learns, and it becomes capable of making inferences about new data. This is the basic principle behind machine learning.
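
As a toy illustration of this train-then-infer cycle, the sketch below counts, for each word seen before a period in some labelled examples, how often that period really ended a sentence, and then uses the counts to judge new text. It is only a sketch under simple assumptions: the class ToyBoundaryLearner and its methods are hypothetical names, and this counting scheme is not how OpenNLP's models work internally.

```java
import java.util.HashMap;
import java.util.Map;

public class ToyBoundaryLearner {
    // word before a period -> {times the period did NOT end a sentence, times it did}
    private final Map<String, int[]> counts = new HashMap<>();

    // Training: record one labelled example.
    public void train(String wordBeforePeriod, boolean endsSentence) {
        int[] c = counts.computeIfAbsent(wordBeforePeriod.toLowerCase(), k -> new int[2]);
        c[endsSentence ? 1 : 0]++;
    }

    // Inference: decide for a word based on what was seen during training.
    public boolean predictEndsSentence(String wordBeforePeriod) {
        int[] c = counts.get(wordBeforePeriod.toLowerCase());
        if (c == null) {
            return true; // unseen word: assume the period ends the sentence
        }
        return c[1] >= c[0];
    }

    public static void main(String[] args) {
        ToyBoundaryLearner learner = new ToyBoundaryLearner();
        learner.train("Dr", false);       // "Dr. Smith" does not end a sentence
        learner.train("Dr", false);
        learner.train("engineer", true);  // "I am engineer." does

        System.out.println(learner.predictEndsSentence("Dr"));       // false
        System.out.println(learner.predictEndsSentence("engineer")); // true
        System.out.println(learner.predictEndsSentence("driving"));  // true (unseen)
    }
}
```

The more labelled examples the learner sees, the better its guesses become, which is the same idea that the pre-trained OpenNLP model files package up for us.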

Many machine learning toolkits are available. Here I explain a simple program that uses Apache OpenNLP, a machine-learning-based toolkit for text processing. The toolkit offers many components; this post covers two of them: a simple sentence detector and a tokenizer.

Sentence Detector

Download en-sent.bin from the Apache OpenNLP website and add it to the classpath.
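
Because the code in this post loads the model with getClass().getResourceAsStream, the file is looked up on the classpath (relative to the package of the calling class), not in the working directory, and a missing file shows up only as a null stream. The following is a minimal JDK-only sketch of a fail-fast lookup. The class name ModelResources and the method open are hypothetical helpers for illustration; they are not part of OpenNLP.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

public class ModelResources {

    // Opens a classpath resource. A plain name such as "en-sent.bin" is resolved
    // relative to the package of 'owner'; a name starting with '/' is resolved
    // from the classpath root. getResourceAsStream returns null when nothing is
    // found, so we turn that into an exception with a readable message.
    public static InputStream open(Class<?> owner, String name) throws FileNotFoundException {
        InputStream in = owner.getResourceAsStream(name);
        if (in == null) {
            throw new FileNotFoundException(
                    name + " was not found on the classpath of " + owner.getName());
        }
        return in;
    }

    public static void main(String[] args) {
        try (InputStream in = open(ModelResources.class, "en-sent.bin")) {
            System.out.println("Model found");
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

By contrast, new FileInputStream("en-sent.bin") ignores the classpath and looks for the file relative to the directory the program was started from.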


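// Imports needed by the enclosing class
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetector;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public void splitSentences() {
	// Load the pre-trained English sentence model from the classpath.
	// try-with-resources closes the stream exactly once, even on failure.
	try (InputStream modelIn = getClass().getResourceAsStream("en-sent.bin")) {
		if (modelIn == null) {
			System.err.println("en-sent.bin was not found on the classpath");
			return;
		}
		SentenceModel sentenceModel = new SentenceModel(modelIn);
		SentenceDetector sentenceDetector = new SentenceDetectorME(sentenceModel);

		// Split the text and print one sentence per line
		String[] sentences = sentenceDetector.sentDetect(
				"I am Amal. I am engineer. I like travelling and driving");
		for (String sentence : sentences) {
			System.out.println(sentence);
		}
	} catch (IOException ioe) {
		ioe.printStackTrace();
	}
}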

Instead of hard-coding the text in the program, you can read it from an input file.
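
One way to do this, sketched here with only the JDK, is to read the whole file into a single string and pass that string to sentDetect. The class name TextFileInput, the method readText and the default file name input.txt are assumptions made for this example, and the file is assumed to be UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TextFileInput {

    // Reads the whole file as UTF-8 and collapses every run of whitespace
    // (including line breaks) into one space, so a sentence that wraps across
    // lines reaches the detector in one piece.
    public static String readText(Path file) throws IOException {
        String raw = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return raw.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        // "input.txt" is only a placeholder default
        Path file = Paths.get(args.length > 0 ? args[0] : "input.txt");
        if (!Files.exists(file)) {
            System.err.println("Input file not found: " + file);
            return;
        }
        String text = readText(file);
        System.out.println(text);
        // With a sentence detector created as in the method above:
        // String[] sentences = sentenceDetector.sentDetect(text);
    }
}
```

Reading the file in one go keeps the example short; for very large files it would be better to process the text in chunks.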

Tokenizer

Download en-token.bin from the Apache OpenNLP website and add it to the classpath.

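// Imports needed by the enclosing class
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public void tokenize() {
	// To load from a file path instead of the classpath:
	// InputStream modelIn = new FileInputStream("en-token.bin");
	try (InputStream modelIn = getClass().getResourceAsStream("en-token.bin")) {
		if (modelIn == null) {
			System.err.println("en-token.bin was not found on the classpath");
			return;
		}
		TokenizerModel model = new TokenizerModel(modelIn);
		Tokenizer tokenizer = new TokenizerME(model);

		// Tokenize the text and print one token per line
		String[] tokens = tokenizer.tokenize("Sample tokenizer program using java");
		for (String token : tokens) {
			System.out.println(token);
		}
	} catch (IOException e) {
		e.printStackTrace();
	}
}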
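
To get a feel for what a tokenizer returns, it can help to compare the OpenNLP output with a very rough baseline. The sketch below is such a baseline written with only the JDK: it treats every run of word characters as one token and every other non-space character as its own token. It is a swapped-in regular-expression technique, not OpenNLP's algorithm, and the class name BaselineTokenizer is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BaselineTokenizer {
    // A token is either a run of word characters or a single non-space symbol.
    // Note: \w covers ASCII letters, digits and underscore only by default.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    public static String[] tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String token : tokenize("Sample tokenizer program using java")) {
            System.out.println(token);
        }
    }
}
```

A rule this simple breaks down on contractions, abbreviations and non-English text, which is where a trained model such as en-token.bin is expected to do better.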

19 thoughts on “Simple Sentence Detector and Tokenizer Using OpenNLP”


xera says:

May 31, 2013 at 5:11 am

Hi, thanks for the good tutorial. How can we read the sentences from a text file?

String sentences[]=(sentenceDetector.sentDetect("I am Amal. I am engineer. I like travelling and driving"));

For example, I don’t want to hard-code the text like “I am Amal. I am engineer. I like travelling and driving”; I want my program to read it from a text file. I have tried, but it never works.

1. amalgjose says:
  
  May 31, 2013 at 8:51 am
  
  This will help you. 🙂
  
  import java.io.BufferedReader;
  import java.io.FileInputStream;
  import java.io.FileOutputStream;
  import java.io.FileReader;
  import java.io.IOException;
  import java.io.InputStream;

  import opennlp.tools.sentdetect.SentenceDetectorME;
  import opennlp.tools.sentdetect.SentenceModel;

  public class SenTest {
      public static void main(String[] args) throws IOException {
          // en-sent.bin and input.txt are read from the working directory;
          // try-with-resources closes all three streams
          try (InputStream modelIn = new FileInputStream("en-sent.bin");
               BufferedReader br = new BufferedReader(new FileReader("input.txt"));
               FileOutputStream fout = new FileOutputStream("output.txt")) {
              SentenceModel model = new SentenceModel(modelIn);
              SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);

              // Read every line, not just the first, and join them with spaces
              StringBuilder text = new StringBuilder();
              String line;
              while ((line = br.readLine()) != null) {
                  text.append(line).append(' ');
              }

              String[] sentences = sentenceDetector.sentDetect(text.toString());
              System.out.println(sentences.length);
              for (String sentence : sentences) {
                  System.out.println(sentence);
                  fout.write((sentence + "\n").getBytes());
              }
          }
      }
  }
  
  1. Nivethidha says:
    
    September 26, 2014 at 6:42 am
    
    Hi boss, I got an error in a name finder program. I am posting the code here. Could you please guide me on what I need to do?
    import opennlp.tools.cmdline.PerformanceMonitor;
    import opennlp.tools.cmdline.postag.POSModelLoader;
    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.sentdetect.SentenceDetector;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.tokenize.Tokenizer;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.tokenize.WhitespaceTokenizer;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.Span;
    
    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.FileInputStream;
    import java.io.DataInputStream;
    import java.io.InputStreamReader;
    import java.io.PrintStream;
    import java.io.StringReader;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.Scanner;
    import java.util.logging.Logger;
    import java.io.StringReader;
    
    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.InvalidFormatException;
    import opennlp.tools.util.Span;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.sentdetect.SentenceDetector;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    public class NameFinder {
    public static void main(String[] args){
    InputStream modelIn = new FileInputStream("/home/serendio/en-ner-person.bin");
    FileInputStream fin = new FileInputStream("/home/serendio/myschool.txt");
    DataInputStream in = new DataInputStream(fin);
    BufferedReader br = new BufferedReader(new InputStreamReader(in));
    String strLine = br.readLine();
    System.out.println(strLine);
    
    try {
    
    SentenceModel model = new SentenceModel(modelIn);
    SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);
    String sentences[] = sentenceDetector.sentDetect(strLine);
    
    TokenNameFinderModel model1 = new TokenNameFinderModel(modelIn);
    NameFinderME nameFinder = new NameFinderME(model1);
    String names[] = nameFinder.namefinder(strLine);
    
    FileOutputStream fout = new FileOutputStream("/home/serendio/myschool.txt");
    System.out.println(names.length);
    for(int i=0;i<names.length;i++)
    {
    System.out.println(names[i]);
    fout.write((names[i]+"\n").getBytes());
    }
    fout.close();
    }
    catch(IOException e)
    {
    e.printStackTrace();
    }
    finally
    {
    if(modelIn != null)
    {
    try{
    modelIn.close();
    }
    catch(IOException e)
    {
    }
    }
    fin.close();
    }
    }
    }
  2. Nivethidha says:
    
    September 26, 2014 at 6:44 am
    
    I’m continuously struggling with this code. I’m a complete beginner in Java. Please send me the correct version.
  3. amalgjose says:
    
    September 26, 2014 at 8:41 pm
    
    Send me the error that you are getting.
  4. Vinod Kumar says:
    
    June 11, 2015 at 12:25 pm
    
    Hi Amal, I am trying to learn how to train the OpenNLP APIs, but I am having issues doing that. By default, the SentenceDetector API splits sentences that end with a full stop. Now I want to train the API so that it also splits sentences that end with a semicolon.
    
    I tried doing this by running a sample file and generating a trained model, but I get the error “The maxent model is not compatible with the sentence detector!”
    
    I assume there is some problem with the sample file itself, but I am not sure what the problem is.
    
    Can you please help me resolve this? My requirement is that I want to train the SentenceDetector API to split sentences ending with a semicolon as well.
    
    Regards,
    Vinu
2. Nivethidha says:
  
  September 24, 2014 at 9:32 am
  
  Hi boss, I need to create a program that reads a text file and splits it into sentences. I applied the above code, but unfortunately it causes an error. Please help me.
  
  1. amalgjose says:
    
    September 24, 2014 at 10:24 am
    
    Please post the error.
  2. Nivethidha says:
    
    September 24, 2014 at 10:27 am
    
    Sorry boss, I got the output. Thank you so much.
  3. amalgjose says:
    
    September 24, 2014 at 10:45 am
    
    🙂
xera says:

May 31, 2013 at 4:10 pm

Thank you so much. I really appreciate it.

nikita07 says:

August 2, 2013 at 7:24 pm

I get an error with sentenceDetector:
cannot find symbol
symbol: class SentenceDetector
location: class Sentence

Surround with …

Naveen Shukla says:

January 15, 2014 at 3:36 am

Hi, I am creating a project that needs support for various languages such as German and Spanish. How can I use the OpenNLP tokenizer for that? I tried changing the bin file, but it does not give the right output.
Any help would be appreciated.

1. amalgjose says:
  
  January 26, 2014 at 1:42 am
  
  Try using GATE.
  
Raja says:

May 10, 2014 at 12:07 pm

Hi Jose, I placed “en-sent.zip” in my classpath, but my class file is not able to recognize SentenceModel and SentenceDetector.

/*
InputStream is = new FileInputStream("en-sent.bin");
SentenceModel model = new SentenceModel(is);
SentenceDetectorME sdetector = new SentenceDetectorME(model);
*/

1. amalgjose says:
  
  July 11, 2014 at 1:03 pm
  
  Please extract the zip file, then add it to the classpath. Sorry for the delay in response
  
Nimtha says:

January 5, 2015 at 9:03 am

InputStream is = new FileInputStream("en-sent.bin"); throws a file-not-found exception, even though “en-sent.bin” is on the classpath.

Adhithya says:

September 2, 2015 at 4:59 am

What setup do I need to use Apache OpenNLP? Can anyone please help me with how to use this program? Kindly tell me everything, from A to Z, that I need to do to run the program in Eclipse, so that beginners like me can understand it easily. Please help as soon as possible.

aicha42 says:

February 20, 2016 at 12:06 pm

Hi, thanks for your tutorial! Is it possible to train a new model for an Arabic sentence detector? Once again, thank you!
