Simple Sentence Detector and Tokenizer Using OpenNLP

Machine learning is a branch of artificial intelligence in which we build and study systems that can learn from data. We all learn from our own experience or from the experience of others. In machine learning, the system likewise learns from experience, which we feed to it as data.

To get an inference about something, we first train the system with a set of data. The system learns from that data and becomes capable of making inferences about new data. This is the basic principle behind machine learning.
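
To make the train-then-infer idea concrete, here is a toy sketch. It is only an illustration and not how OpenNLP works internally: the class PeriodBoundaryLearner and its methods are names I made up, and the "learning" is just counting. It is trained with a few labelled examples (does a period after this word end a sentence or not?) and then answers the same question for other words.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of "train first, then infer". Hypothetical class, not part of OpenNLP.
class PeriodBoundaryLearner {
    // For each word seen before a period: [times it ended a sentence, times it did not].
    private final Map<String, int[]> counts = new HashMap<>();

    // Training: record one labelled example.
    void train(String wordBeforePeriod, boolean endsSentence) {
        int[] c = counts.computeIfAbsent(wordBeforePeriod.toLowerCase(), k -> new int[2]);
        c[endsSentence ? 0 : 1]++;
    }

    // Inference: majority vote over the training examples.
    // Words never seen in training default to "ends the sentence".
    boolean endsSentence(String wordBeforePeriod) {
        int[] c = counts.get(wordBeforePeriod.toLowerCase());
        return c == null || c[0] >= c[1];
    }

    public static void main(String[] args) {
        PeriodBoundaryLearner learner = new PeriodBoundaryLearner();
        learner.train("Dr", false);       // "Dr." is an abbreviation
        learner.train("Mr", false);       // "Mr." is an abbreviation
        learner.train("engineer", true);  // "... engineer." ends a sentence
        System.out.println(learner.endsSentence("Dr"));      // prints false
        System.out.println(learner.endsSentence("driving")); // prints true (unseen word)
    }
}
```

A real sentence detector learns from much larger data and looks at far more context than the single word before the period, but the two phases are the same.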

There are many machine learning toolkits available. Here I explain a simple program using Apache OpenNLP. The OpenNLP library is a machine learning based toolkit for processing natural language text. It has many components; here I explain a simple sentence detector and a tokenizer.

Sentence Detector

Download en-sent.bin from the Apache OpenNLP website and add it to the classpath. The OpenNLP tools jar must also be on the classpath so that classes such as SentenceModel can be found.


import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetector;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public void sentenceSplitter() {
	// try-with-resources closes the model stream exactly once, even on failure
	try (InputStream modelIn = getClass().getResourceAsStream("en-sent.bin")) {
		if (modelIn == null) {
			System.err.println("en-sent.bin was not found on the classpath");
			return;
		}
		SentenceModel sentenceModel = new SentenceModel(modelIn);
		SentenceDetector sentenceDetector = new SentenceDetectorME(sentenceModel);

		// Split the text into sentences and print one per line
		String[] sentences = sentenceDetector.sentDetect("I am Amal. I am engineer. I like travelling and driving");
		for (String sentence : sentences) {
			System.out.println(sentence);
		}
	} catch (IOException ioe) {
		ioe.printStackTrace();
	}
}
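
One thing to watch for: getResourceAsStream returns null when the file is not on the classpath, and the failure then shows up later in a confusing place. Below is a small sketch of a helper that fails with a clear message instead. The class name ModelResources and the method open are my own names for illustration; only getResourceAsStream itself is standard Java.

```java
import java.io.FileNotFoundException;
import java.io.InputStream;

// Hypothetical helper, not part of OpenNLP.
class ModelResources {
    // Opens a classpath resource, throwing a clear error instead of returning null.
    static InputStream open(Class<?> owner, String name) throws FileNotFoundException {
        InputStream in = owner.getResourceAsStream(name);
        if (in == null) {
            throw new FileNotFoundException("Not found on the classpath: " + name);
        }
        return in;
    }

    public static void main(String[] args) throws Exception {
        // "/en-sent.bin" (leading slash) is looked up at the root of the classpath.
        // "en-sent.bin" (no slash) would be looked up in the package of the owner class.
        try (InputStream in = open(ModelResources.class, "/en-sent.bin")) {
            System.out.println("Model found");
        } catch (FileNotFoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The stream returned by open can be passed to new SentenceModel(...) in the same way as in the method above.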

Instead of hard-coding the text inside the program, you can read it from an input file.
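
A minimal sketch of the file-based variant follows. To keep it runnable without the model file, the detector is passed in as a function and the main method uses a naive stand-in that splits after a period; with OpenNLP you would pass sentenceDetector::sentDetect instead. The class name FileSentences, the method detect and the temporary input file are assumptions made for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Function;

// Hypothetical helper, not part of OpenNLP.
class FileSentences {
    // Reads the whole file as UTF-8 and hands the text to the given detector.
    static String[] detect(Path file, Function<String, String[]> detector) throws IOException {
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return detector.apply(text);
    }

    public static void main(String[] args) throws IOException {
        // Create a small input file so the example is self-contained.
        Path input = Files.createTempFile("input", ".txt");
        Files.write(input, "I am Amal. I am engineer.".getBytes(StandardCharsets.UTF_8));

        // Stand-in detector: splits on whitespace that follows a period.
        // With OpenNLP, pass sentenceDetector::sentDetect here instead.
        Function<String, String[]> standIn = text -> text.split("(?<=\\.)\\s+");

        for (String sentence : detect(input, standIn)) {
            System.out.println(sentence);
        }
        Files.delete(input);
    }
}
```

Reading the whole file at once is fine for small inputs; for very large files it would be better to process the text in chunks.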

Tokenizer

Download en-token.bin from the Apache OpenNLP website and add it to the classpath.

import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public void tokenize() {
	// Alternative: load from the file system with new FileInputStream("en-token.bin")
	try (InputStream modelIn = getClass().getResourceAsStream("en-token.bin")) {
		if (modelIn == null) {
			System.err.println("en-token.bin was not found on the classpath");
			return;
		}
		TokenizerModel model = new TokenizerModel(modelIn);
		Tokenizer tokenizer = new TokenizerME(model);

		// Split the text into tokens and print one per line
		String[] tokens = tokenizer.tokenize("Sample tokenizer program using java");
		for (String token : tokens) {
			System.out.println(token);
		}
	} catch (IOException e) {
		e.printStackTrace();
	}
}

About amalgjose
I am an Electrical Engineer by qualification, and I now work as a Software Engineer. I am very interested in the electrical, electronics, mechanical and now software fields, and I like exploring things in them. I like travelling and long drives, and I am very much addicted to music.

15 Responses to Simple Sentence Detector and Tokenizer Using OpenNLP

  1. xera says:

    Hi, thanks for the good tutorial. I want to ask you: how can we read the sentences from a text file?

    String sentences[]=(sentenceDetector.sentDetect("I am Amal. I am engineer. I like travelling and driving"));

    For example, I don’t want to put the sentence in manually like “I am Amal. I am engineer. I like travelling and driving”; I want my program to read it from a text file. I have tried, but it never works.

    • amalgjose says:

      This will help you. 🙂


      import java.io.FileInputStream;
      import java.io.FileOutputStream;
      import java.io.IOException;
      import java.io.InputStream;
      import java.nio.charset.StandardCharsets;
      import java.nio.file.Files;
      import java.nio.file.Paths;
      import opennlp.tools.sentdetect.SentenceDetectorME;
      import opennlp.tools.sentdetect.SentenceModel;

      public class SenTest {
        public static void main(String[] args) throws IOException {
          // Read the whole input file, not just its first line
          String text = new String(Files.readAllBytes(Paths.get("input.txt")), StandardCharsets.UTF_8);

          // Both streams are closed automatically, even if an exception is thrown
          try (InputStream modelIn = new FileInputStream("en-sent.bin");
               FileOutputStream fout = new FileOutputStream("output.txt")) {
            SentenceModel model = new SentenceModel(modelIn);
            SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);
            String[] sentences = sentenceDetector.sentDetect(text);

            // Print each sentence and also write it to output.txt, one per line
            System.out.println(sentences.length);
            for (String sentence : sentences) {
              System.out.println(sentence);
              fout.write((sentence + "\n").getBytes(StandardCharsets.UTF_8));
            }
          }
        }
      }

      • Nivethidha says:

        Hi boss, I got an error in my name finder program. I am posting the code that gives the error here. Will you please guide me on what I need to do?
        import opennlp.tools.cmdline.PerformanceMonitor;
        import opennlp.tools.cmdline.postag.POSModelLoader;
        import opennlp.tools.namefind.NameFinderME;
        import opennlp.tools.namefind.TokenNameFinderModel;
        import opennlp.tools.postag.POSModel;
        import opennlp.tools.postag.POSTaggerME;
        import opennlp.tools.sentdetect.SentenceDetector;
        import opennlp.tools.sentdetect.SentenceDetectorME;
        import opennlp.tools.sentdetect.SentenceModel;
        import opennlp.tools.tokenize.Tokenizer;
        import opennlp.tools.tokenize.TokenizerME;
        import opennlp.tools.tokenize.TokenizerModel;
        import opennlp.tools.tokenize.WhitespaceTokenizer;
        import opennlp.tools.util.ObjectStream;
        import opennlp.tools.util.PlainTextByLineStream;
        import opennlp.tools.util.Span;

        import java.io.BufferedReader;
        import java.io.BufferedWriter;
        import java.io.File;
        import java.io.FileOutputStream;
        import java.io.FileReader;
        import java.io.FileWriter;
        import java.io.IOException;
        import java.io.InputStream;
        import java.io.FileInputStream;
        import java.io.DataInputStream;
        import java.io.InputStreamReader;
        import java.io.PrintStream;
        import java.io.StringReader;
        import java.util.Arrays;
        import java.util.Collections;
        import java.util.Scanner;
        import java.util.logging.Logger;
        import java.io.StringReader;

        import opennlp.tools.namefind.NameFinderME;
        import opennlp.tools.namefind.TokenNameFinderModel;
        import opennlp.tools.util.InvalidFormatException;
        import opennlp.tools.util.Span;
        import opennlp.tools.postag.POSTaggerME;
        import opennlp.tools.sentdetect.SentenceDetector;
        import opennlp.tools.sentdetect.SentenceDetectorME;
        public class NameFinder {
        public static void main(String[] args){
        InputStream modelIn = new FileInputStream("/home/serendio/en-ner-person.bin");
        FileInputStream fin = new FileInputStream("/home/serendio/myschool.txt");
        DataInputStream in = new DataInputStream(fin);
        BufferedReader br = new BufferedReader(new InputStreamReader(in));
        String strLine = br.readLine();
        System.out.println(strLine);

        try {

        SentenceModel model = new SentenceModel(modelIn);
        SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);
        String sentences[] = sentenceDetector.sentDetect(strLine);

        TokenNameFinderModel model1 = new TokenNameFinderModel(modelIn);
        NameFinderME nameFinder = new NameFinderME(model1);
        String names[] = nameFinder.namefinder(strLine);

        FileOutputStream fout = new FileOutputStream("/home/serendio/myschool.txt");
        System.out.println(names.length);
        for(int i=0;i<names.length;i++)
        {
        System.out.println(names[i]);
        fout.write((names[i]+"\n").getBytes());
        }
        fout.close();
        }
        catch(IOException e)
        {
        e.printStackTrace();
        }
        finally
        {
        if(modelIn != null)
        {
        try{
        modelIn.close();
        }
        catch(IOException e)
        {
        }
        }
        fin.close();
        }
        }
        }

      • Nivethidha says:

        I’m continuously struggling with this code. I’m a complete beginner in Java. Please send me the correct version.

      • amalgjose says:

        Send me the error that you are getting.

    • Nivethidha says:

      Hi boss, I need to create a program that reads a text file and splits it into sentences. I applied the above code, but unfortunately I get an error. Please help me, boss.

  2. xera says:

    thank you so much…I really appreciate it

  3. Naveen Shukla says:

    Hi, I am creating a project which needs support for various languages such as German, Spanish, etc. How can I use the OpenNLP tokenizer for them? I tried changing the bin file, but it does not give the right output.
    Any help would be appreciated.

  4. Raja says:

    Hi Jose, I placed “en-sent.zip” in my class path. But my class file is not able to recognize SentenceModel and SentenceDetector

    /*
    InputStream is = new FileInputStream("en-sent.bin");
    SentenceModel model = new SentenceModel(is);
    SentenceDetectorME sdetector = new SentenceDetectorME(model);
    */

  5. aicha42 says:

    Hi, thanks for your tutorial! Is it possible to train a new model for an Arabic sentence detector? Once again, thank you!
