Simple Sentence Detector and Tokenizer Using OpenNLP

Machine learning is a branch of artificial intelligence in which we build and study systems that can learn from data. We all learn from our own experience or from the experience of others. In machine learning, the system likewise learns from experience, which we feed to it as data.

To get an inference about something, we first train the system with a set of data. From that data the system learns, and it becomes capable of making inferences about new data. This is the basic principle behind machine learning.
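
To make the train-then-infer idea concrete, here is a minimal sketch that is not tied to OpenNLP. It assumes a made-up problem with a single numeric feature: the program learns the average value seen for each label and then assigns a new value to the label with the closest average. The class name NearestMeanClassifier, the labels and the numbers are all invented for illustration; real toolkits use far richer models.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy classifier: learns the mean feature value of each label
// during training and assigns new values to the label with the closest mean.
final class NearestMeanClassifier {
    // label -> {running sum, number of examples}
    private final Map<String, double[]> sumAndCount = new LinkedHashMap<>();

    // Training step: accumulate one labelled example.
    void train(String label, double value) {
        double[] sc = sumAndCount.computeIfAbsent(label, k -> new double[2]);
        sc[0] += value;
        sc[1] += 1;
    }

    // Inference step: return the label whose learned mean is nearest to the value.
    String predict(double value) {
        if (sumAndCount.isEmpty()) {
            throw new IllegalStateException("train the model first");
        }
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> e : sumAndCount.entrySet()) {
            double mean = e.getValue()[0] / e.getValue()[1];
            double distance = Math.abs(value - mean);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NearestMeanClassifier model = new NearestMeanClassifier();
        // Made-up training data: sentence lengths in words, labelled by hand.
        model.train("short", 3);
        model.train("short", 5);
        model.train("long", 20);
        model.train("long", 30);
        // New data the model has not seen before.
        System.out.println(model.predict(6));
        System.out.println(model.predict(22));
    }
}
```

The pre-trained .bin files used later in this post play the same role as the trained model object here: the learning has already been done, and the program only performs the inference step.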

There are many machine learning toolkits available. Here I explain a simple program that uses Apache OpenNLP. The OpenNLP library is a machine learning based toolkit made for text processing, and it offers many components. In this post I cover two of them: a simple sentence detector and a tokenizer.

Sentence Detector

Download en-sent.bin from the Apache OpenNLP website and add it to the class path.


// Requires: java.io.IOException, java.io.InputStream,
// opennlp.tools.sentdetect.SentenceDetector, SentenceDetectorME, SentenceModel
public void SentenceSplitter()
	{
	SentenceDetector sentenceDetector = null;

	// try-with-resources closes the model stream exactly once.
	// The name is resolved relative to this class's package on the class path.
	try (InputStream modelIn = getClass().getResourceAsStream("en-sent.bin")) {
		if (modelIn == null) {
			System.err.println("en-sent.bin was not found on the class path");
			return;
		}
		sentenceDetector = new SentenceDetectorME(new SentenceModel(modelIn));
	}
	catch (final IOException ioe) {
		ioe.printStackTrace();
		return; // without a model there is nothing to run
	}

	String[] sentences = sentenceDetector.sentDetect("I am Amal. I am engineer. I like travelling and driving");
	for (String sentence : sentences) {
		System.out.println(sentence);
	}
	}

Instead of hard-coding the sentence inside the program, you can read it from an input file.
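
A minimal sketch of the file-reading step, using only the standard library. The class name InputFileReader, the method readText and the file name input.txt are placeholders chosen for this example, and the UTF-8 encoding is an assumption about your file. Line breaks are collapsed into spaces so that a sentence wrapped over several lines reaches the detector in one piece.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: loads a text file so its content can be passed
// to sentenceDetector.sentDetect(...) instead of a hard-coded string.
final class InputFileReader {

    // Reads the whole file as UTF-8 (an assumption about the encoding) and
    // collapses line breaks and repeated whitespace into single spaces.
    static String readText(Path file) throws IOException {
        String raw = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return raw.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        // "input.txt" is a placeholder; pass a real path as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "input.txt");
        String text = readText(file);
        System.out.println(text);
        // Next step (needs the OpenNLP jar and model, so it is not run here):
        // String[] sentences = sentenceDetector.sentDetect(text);
    }
}
```

The returned string can be handed to sentDetect in place of the literal used in the method above.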

Tokenizer

Download en-token.bin from the Apache OpenNLP website and add it to the class path.

// Requires: java.io.IOException, java.io.InputStream,
// opennlp.tools.tokenize.Tokenizer, TokenizerME, TokenizerModel
public void Tokenizer()
	{
	// To load from the file system instead, use: new FileInputStream("en-token.bin")
	try (InputStream modelIn = getClass().getResourceAsStream("en-token.bin")) {
		if (modelIn == null) {
			System.err.println("en-token.bin was not found on the class path");
			return;
		}
		TokenizerModel model = new TokenizerModel(modelIn);
		Tokenizer tokenizer = new TokenizerME(model);
		String[] tokens = tokenizer.tokenize("Sample tokenizer program using java");

		// Print one token per line
		for (String token : tokens) {
			System.out.println(token);
		}
	}
	catch (IOException e) {
		e.printStackTrace();
	}
	}
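
Both methods above load the model from the class path, while the commented-out line shows the file-system alternative. The two are not interchangeable: a FileInputStream looks in the working directory, not on the class path. The sketch below tries the class path first and falls back to a plain file. The class name ModelLocator and its open method are hypothetical names for this example, and it only opens the stream; constructing the OpenNLP model from it is left as a comment.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper that opens a model file from either location.
final class ModelLocator {

    static InputStream open(String name) throws IOException {
        // 1. Look at the root of the class path.
        InputStream in = ModelLocator.class.getClassLoader().getResourceAsStream(name);
        if (in != null) {
            return in;
        }
        // 2. Fall back to a plain file, relative to the working directory
        //    unless an absolute path was given.
        Path file = Paths.get(name);
        if (Files.isRegularFile(file)) {
            return Files.newInputStream(file);
        }
        throw new FileNotFoundException(
                name + " is neither on the class path nor at " + file.toAbsolutePath());
    }

    public static void main(String[] args) throws IOException {
        try (InputStream modelIn = open("en-token.bin")) {
            System.out.println("Model stream opened");
            // With OpenNLP available: new TokenizerModel(modelIn)
        } catch (FileNotFoundException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

If the model cannot be found in either place, the exception message prints the absolute path that was tried, which usually shows where the program is actually looking.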

About amalgjose
I am an Electrical Engineer by qualification and now work as a Software Architect. I am very interested in the electrical, electronics and mechanical fields, and now in software as well. I like exploring things in these fields. I love travelling, long drives and music.

19 Responses to Simple Sentence Detector and Tokenizer Using OpenNLP

  1. xera says:

    hai…thanks for the good tutorial…I want to ask you how can we read the sentence from the text file?

    String sentences[]=(sentenceDetector.sentDetect("I am Amal. I am engineer. I like travelling and driving"));

    For example I don’t want to put the sentence manually like “I am Amal. I am engineer. I like travelling and driving” but I want my program read it from text file..I have tried but never work

    • amalgjose says:

      This will help u.. 🙂


      // Requires java.io.* and opennlp.tools.sentdetect.SentenceDetectorME / SentenceModel
      public class SenTest {
          public static void main(String[] args) throws IOException {

              // Read the whole input file, joining its lines with spaces
              StringBuilder text = new StringBuilder();
              try (BufferedReader br = new BufferedReader(new FileReader("input.txt"))) {
                  String strLine;
                  while ((strLine = br.readLine()) != null) {
                      text.append(strLine).append(' ');
                  }
              }
              System.out.println(text);

              // Both streams are closed automatically at the end of the try block
              try (InputStream modelIn = new FileInputStream("en-sent.bin");
                   FileOutputStream fout = new FileOutputStream("output.txt")) {
                  SentenceModel model = new SentenceModel(modelIn);
                  SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);
                  String sentences[] = sentenceDetector.sentDetect(text.toString());

                  // Print each sentence and write it to output.txt, one per line
                  System.out.println(sentences.length);
                  for (int i = 0; i < sentences.length; i++) {
                      System.out.println(sentences[i]);
                      fout.write((sentences[i] + "\n").getBytes());
                  }
              }
              catch (IOException e) {
                  e.printStackTrace();
              }
          }
      }

      • Nivethidha says:

        Hai boss .i got one error in name finder program .here i post that error code ..will u pls guide me what i need to do ?
        import opennlp.tools.cmdline.PerformanceMonitor;
        import opennlp.tools.cmdline.postag.POSModelLoader;
        import opennlp.tools.namefind.NameFinderME;
        import opennlp.tools.namefind.TokenNameFinderModel;
        import opennlp.tools.postag.POSModel;
        import opennlp.tools.postag.POSTaggerME;
        import opennlp.tools.sentdetect.SentenceDetector;
        import opennlp.tools.sentdetect.SentenceDetectorME;
        import opennlp.tools.sentdetect.SentenceModel;
        import opennlp.tools.tokenize.Tokenizer;
        import opennlp.tools.tokenize.TokenizerME;
        import opennlp.tools.tokenize.TokenizerModel;
        import opennlp.tools.tokenize.WhitespaceTokenizer;
        import opennlp.tools.util.ObjectStream;
        import opennlp.tools.util.PlainTextByLineStream;
        import opennlp.tools.util.Span;

        import java.io.BufferedReader;
        import java.io.BufferedWriter;
        import java.io.File;
        import java.io.FileOutputStream;
        import java.io.FileReader;
        import java.io.FileWriter;
        import java.io.IOException;
        import java.io.InputStream;
        import java.io.FileInputStream;
        import java.io.DataInputStream;
        import java.io.InputStreamReader;
        import java.io.PrintStream;
        import java.io.StringReader;
        import java.util.Arrays;
        import java.util.Collections;
        import java.util.Scanner;
        import java.util.logging.Logger;
        import java.io.StringReader;

        import opennlp.tools.namefind.NameFinderME;
        import opennlp.tools.namefind.TokenNameFinderModel;
        import opennlp.tools.util.InvalidFormatException;
        import opennlp.tools.util.Span;
        import opennlp.tools.postag.POSTaggerME;
        import opennlp.tools.sentdetect.SentenceDetector;
        import opennlp.tools.sentdetect.SentenceDetectorME;
        public class NameFinder {
        public static void main(String[] args){
        InputStream modelIn = new FileInputStream("/home/serendio/en-ner-person.bin");
        FileInputStream fin = new FileInputStream("/home/serendio/myschool.txt");
        DataInputStream in = new DataInputStream(fin);
        BufferedReader br = new BufferedReader(new InputStreamReader(in));
        String strLine = br.readLine();
        System.out.println(strLine);

        try {

        SentenceModel model = new SentenceModel(modelIn);
        SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);
        String sentences[] = sentenceDetector.sentDetect(strLine);

        TokenNameFinderModel model1 = new TokenNameFinderModel(modelIn);
        NameFinderME nameFinder = new NameFinderME(model1);
        String names[] = nameFinder.namefinder(strLine);

        FileOutputStream fout = new FileOutputStream("/home/serendio/myschool.txt");
        System.out.println(names.length);
        for(int i=0;i<names.length;i++)
        {
        System.out.println(names[i]);
        fout.write((names[i]+"\n").getBytes());
        }
        fout.close();
        }
        catch(IOException e)
        {
        e.printStackTrace();
        }
        finally
        {
        if(modelIn != null)
        {
        try{
        modelIn.close();
        }
        catch(IOException e)
        {
        }
        }
        fin.close();
        }
        }
        }

      • Nivethidha says:

        i’m continuesly struggling with this code ..i’m very beginner of java ..pls send the correct one to me ..

      • amalgjose says:

        Send me the error that you are getting.

      • Vinod Kumar says:

        Hi Amal, I am trying to learn training the OpenNLP APIs. But i am having issues doing that. By default SentenceDetector API splits the sentences if the sentence ends with a full stop. Now i want to train the API so that it will split the sentences ending with semi-colon as well.

        I tried doing it by running a sample file and generating a train model, but i am getting the error as “The maxent model is not compatible with the sentence detector!”

        I am assuming there is some problem with the sample file itself but not sure what the problem is.

        Can you please help me in resolving this? My requirement is that i want to train the SentenceDetector API to split sentences ending with semi-colon also.

        Regards,
        Vinu

    • Nivethidha says:

      hai boss ..i need to create a program for read a text file and split it into sentences ..I applied the above code ..but unfortunately i cause error ..help me pls boss ..

  2. xera says:

    thank you so much…I really appreciate it

  3. nikita07 says:

    I get an error with sentenceDetector:
    cannot find symbol
    symbol: class SentenceDetector
    location: class Sentence

    Surround with …

  4. Naveen Shukla says:

    Hi I am creating one project which need various language support like german , spanish etc. so how can i use opennlp tokenizer ? I tried with changing the bin file but it is not giving right output.
    Anyhelp would be appreciated ….

  5. Raja says:

    Hi Jose, I placed “en-sent.zip” in my class path. But my class file is not able to recognize SentenceModel and SentenceDetector

    /*
    InputStream is = new FileInputStream("en-sent.bin");
    SentenceModel model = new SentenceModel(is);
    SentenceDetectorME sdetector = new SentenceDetectorME(model);
    */

  6. Nimtha says:

    InputStream is = new FileInputStream("en-sent.bin"); shows file not found exception. “en-sent.bin” is there in classpath

  7. Adhithya says:

    What are all the set up do I need to make to use apache opennlp. Please can anyone help me how to use this program. Kindly tell me the a-z things I need to do to run the program in eclipse. So that people who are beginners like me will be easily understand. Pls help asap.

  8. aicha42 says:

    Hie, thanks for your tuto ! is it possible to train a new model for Arabic sentence detector ? once again thank you !
