Machine learning is a branch of artificial intelligence concerned with building and studying systems that can learn from data. People learn from their own experience or from the experience of others. In machine learning the system likewise learns from experience, which we supply as data.
To get an inference about something, we first train the system on a set of data. From that data the system learns, and it becomes capable of making inferences about new data. This is the basic principle behind machine learning.
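To make the train-then-infer idea concrete, here is a minimal sketch in plain Java. It is a toy nearest-centroid classifier, not how OpenNLP works internally; the class name `TinyLearner`, the single numeric feature, and the training values are all made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy nearest-centroid classifier: "training" averages the examples seen for each label. */
class TinyLearner {
    private final Map<String, Double> sums = new HashMap<>();
    private final Map<String, Integer> counts = new HashMap<>();

    /** Training step: record one labeled example (a single numeric feature). */
    void train(double feature, String label) {
        sums.merge(label, feature, Double::sum);
        counts.merge(label, 1, Integer::sum);
    }

    /** Inference step: return the label whose average feature value is closest. */
    String predict(double feature) {
        String best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, Double> entry : sums.entrySet()) {
            double mean = entry.getValue() / counts.get(entry.getKey());
            double distance = Math.abs(feature - mean);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyLearner learner = new TinyLearner();
        // Made-up training data: text length in words, labeled by hand.
        learner.train(3, "short");
        learner.train(5, "short");
        learner.train(40, "long");
        learner.train(60, "long");
        // New data the learner has not seen before.
        System.out.println(learner.predict(8));  // prints "short"
        System.out.println(learner.predict(45)); // prints "long"
    }
}
```

The pre-trained .bin files used later in this post play the role of the state built up by `train` here: someone has already done the training, and the program only performs inference.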
Many machine learning toolkits are available. Here I explain a simple program that uses Apache OpenNLP, a machine-learning-based toolkit for text processing. The toolkit has many components; this post covers two of them, a sentence detector and a tokenizer.
Sentence Detector
Download en-sent.bin from the Apache OpenNLP website and add it to the classpath.
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetector;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public void splitSentences() {
    // Load the pre-trained model from the classpath; try-with-resources closes the stream.
    try (InputStream modelIn = getClass().getResourceAsStream("en-sent.bin")) {
        if (modelIn == null) {
            System.err.println("en-sent.bin was not found on the classpath");
            return;
        }
        SentenceDetector sentenceDetector = new SentenceDetectorME(new SentenceModel(modelIn));
        // Split the sample text into sentences and print one per line.
        String[] sentences = sentenceDetector.sentDetect("I am Amal. I am engineer. I like travelling and driving");
        for (String sentence : sentences) {
            System.out.println(sentence);
        }
    } catch (IOException ioe) {
        ioe.printStackTrace();
    }
}
Instead of hard-coding the text in the program, you can read it from an input file.
Tokenizer
Download en-token.bin from the Apache OpenNLP website and add it to the classpath.
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public void tokenize() {
    // To load from the file system instead, use: new FileInputStream("en-token.bin")
    try (InputStream modelIn = getClass().getResourceAsStream("en-token.bin")) {
        if (modelIn == null) {
            System.err.println("en-token.bin was not found on the classpath");
            return;
        }
        Tokenizer tokenizer = new TokenizerME(new TokenizerModel(modelIn));
        // Split the sample text into tokens and print one per line.
        String[] tokens = tokenizer.tokenize("Sample tokenizer program using java");
        for (String token : tokens) {
            System.out.println(token);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Hi, thanks for the good tutorial. I want to ask: how can we read the sentences from a text file?
String sentences[]=(sentenceDetector.sentDetect("I am Amal. I am engineer. I like travelling and driving"));
For example, I don’t want to put the sentence in manually like “I am Amal. I am engineer. I like travelling and driving”; I want my program to read it from a text file. I have tried, but it never works.
This will help you. 🙂
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SenTest {
    public static void main(String[] args) throws IOException {
        // Read the whole input file, not just its first line.
        String text = new String(Files.readAllBytes(Paths.get("input.txt")), StandardCharsets.UTF_8);
        // try-with-resources closes both the model stream and the output file.
        try (InputStream modelIn = new FileInputStream("en-sent.bin");
             PrintWriter out = new PrintWriter("output.txt", "UTF-8")) {
            SentenceDetectorME sentenceDetector = new SentenceDetectorME(new SentenceModel(modelIn));
            String[] sentences = sentenceDetector.sentDetect(text);
            System.out.println(sentences.length);
            // Print each sentence and also write it to output.txt, one per line.
            for (String sentence : sentences) {
                System.out.println(sentence);
                out.println(sentence);
            }
        }
    }
}
Hi boss, I got an error in my name finder program. I am posting the code here. Will you please guide me on what I need to do?
import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.sentdetect.SentenceDetector;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.Span;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.FileInputStream;
import java.io.DataInputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.StringReader;
import java.util.Arrays;
import java.util.Collections;
import java.util.Scanner;
import java.util.logging.Logger;
import java.io.StringReader;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.InvalidFormatException;
import opennlp.tools.util.Span;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.sentdetect.SentenceDetector;
import opennlp.tools.sentdetect.SentenceDetectorME;
public class NameFinder {
public static void main(String[] args){
InputStream modelIn = new FileInputStream("/home/serendio/en-ner-person.bin");
FileInputStream fin = new FileInputStream("/home/serendio/myschool.txt");
DataInputStream in = new DataInputStream(fin);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String strLine = br.readLine();
System.out.println(strLine);
try {
SentenceModel model = new SentenceModel(modelIn);
SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);
String sentences[] = sentenceDetector.sentDetect(strLine);
TokenNameFinderModel model1 = new TokenNameFinderModel(modelIn);
NameFinderME nameFinder = new NameFinderME(model1);
String names[] = nameFinder.namefinder(strLine);
FileOutputStream fout = new FileOutputStream("/home/serendio/myschool.txt");
System.out.println(names.length);
for(int i=0;i<names.length;i++)
{
System.out.println(names[i]);
fout.write((names[i]+"\n").getBytes());
}
fout.close();
}
catch(IOException e)
{
e.printStackTrace();
}
finally
{
if(modelIn != null)
{
try{
modelIn.close();
}
catch(IOException e)
{
}
}
fin.close();
}
}
}
I’m continuously struggling with this code. I’m very much a beginner in Java, so please send me the correct version.
Send me the error that you are getting.
Hi Amal, I am trying to learn how to train the OpenNLP APIs, but I am having issues doing that. By default the SentenceDetector API splits sentences that end with a full stop. Now I want to train the API so that it also splits sentences ending with a semicolon.
I tried doing this by running a sample file and generating a trained model, but I get the error “The maxent model is not compatible with the sentence detector!”
I assume there is some problem with the sample file itself, but I am not sure what the problem is.
Can you please help me resolve this? My requirement is that I want to train the SentenceDetector API to split sentences ending with a semicolon as well.
Regards,
Vinu
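One possible workaround, offered as a sketch and not as a fix for the training error above: keep the stock en-sent.bin model and split each detected sentence again at semicolons. `SemicolonSplitter` and `splitOnSemicolons` are hypothetical names, and the hard-coded array stands in for whatever `sentDetect` returned.

```java
import java.util.ArrayList;
import java.util.List;

class SemicolonSplitter {
    /** Splits each already-detected sentence at semicolons, keeping the ';' with the left part. */
    static List<String> splitOnSemicolons(String[] sentences) {
        List<String> result = new ArrayList<>();
        for (String sentence : sentences) {
            // The lookbehind splits right after each ';' without removing it.
            for (String part : sentence.split("(?<=;)")) {
                String trimmed = part.trim();
                if (!trimmed.isEmpty()) {
                    result.add(trimmed);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the array returned by sentenceDetector.sentDetect(...).
        String[] detected = { "I like travelling; I also like driving.", "I am Amal." };
        for (String sentence : splitOnSemicolons(detected)) {
            System.out.println(sentence);
        }
    }
}
```

This splits at every semicolon, including ones inside quotations or lists, so it is only suitable when that is acceptable. A retrained model would be the cleaner route.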
Hi boss, I need to create a program that reads a text file and splits it into sentences. I applied the above code, but unfortunately it causes an error. Help me please, boss.
Please post the error.
Sorry boss, I got the output. Thank you so much!
🙂
Thank you so much. I really appreciate it.
I get an error with sentenceDetector:
cannot find symbol
symbol: class SentenceDetector
location: class Sentence
Surround with …
Hi, I am creating a project that needs support for various languages, such as German and Spanish. How can I use the OpenNLP tokenizer for this? I tried changing the bin file, but it does not give the right output.
Any help would be appreciated.
Try using GATE.
Hi Jose, I placed “en-sent.zip” on my classpath, but my class file does not recognize SentenceModel and SentenceDetector.
/*
InputStream is = new FileInputStream("en-sent.bin");
SentenceModel model = new SentenceModel(is);
SentenceDetectorME sdetector = new SentenceDetectorME(model);
*/
Please extract the zip file, then add it to the classpath. Sorry for the delay in responding.
InputStream is = new FileInputStream("en-sent.bin"); throws a FileNotFoundException, even though “en-sent.bin” is on the classpath.
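A likely cause, assuming a standard JVM setup: `new FileInputStream("en-sent.bin")` resolves the name against the directory the program was started from and never consults the classpath, whereas `getResourceAsStream` searches the classpath. Below is a minimal sketch of a loader that tries both. `ModelLocator` and `open` are hypothetical names, not part of OpenNLP.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ModelLocator {
    /** Opens a model by name: classpath first, then the file system (relative to the working directory). */
    static InputStream open(String name) throws IOException {
        InputStream fromClasspath = ModelLocator.class.getClassLoader().getResourceAsStream(name);
        if (fromClasspath != null) {
            return fromClasspath;
        }
        Path onDisk = Paths.get(name).toAbsolutePath();
        if (Files.isRegularFile(onDisk)) {
            return Files.newInputStream(onDisk);
        }
        throw new FileNotFoundException(name + " is neither on the classpath nor at " + onDisk);
    }

    public static void main(String[] args) throws IOException {
        // Shows the directory that relative file names are resolved against.
        System.out.println("Working directory: " + Paths.get("").toAbsolutePath());
        try (InputStream in = open("en-sent.bin")) {
            System.out.println("Found en-sent.bin");
        } catch (FileNotFoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The stream returned by `open` can be passed to `new SentenceModel(...)` in the same way as the `FileInputStream` in the original snippet. In Eclipse the working directory is normally the project root, so a model placed in a source folder is found through the classpath branch, not the file branch.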
What setup do I need to use Apache OpenNLP? Can anyone please help me with how to use this program? Kindly tell me everything, from A to Z, that I need to do to run the program in Eclipse, so that beginners like me can easily understand. Please help ASAP.
Hi, thanks for your tutorial! Is it possible to train a new model for an Arabic sentence detector? Once again, thank you!