English morphological analysis like MeCab with OpenNLP


I want to do something like Japanese morphological analysis (MeCab) for English text, so I use Apache OpenNLP.


OS: Windows 7 64-bit, Language: Java 8, IDE: Eclipse 4.6.1


When I run MeCab on the command line with the Japanese sentence for "It's nice weather today.", it prints one line per morpheme (the fields are shown here in English translation):

It's nice weather today.
↓
Today    noun, adverbial, *, *, *, *, today, kyo, kyo
ha       particle, binding particle, *, *, *, *, ha, ha, wa
Good     adjective, independent, *, *, adjective/good, basic form, good, good, good
Weather  noun, general, *, *, *, *, weather, tenki, tenki
desu     auxiliary verb, *, *, *, special/desu, basic form, desu, desu, desu
ne       particle, final particle, *, *, *, *, ne, ne, ne
.        symbol, period, *, *, *, *, ., ., .

MeCab displays the information for each morpheme like this.

From this output, I take three items (the morpheme itself, its part of speech, and its base form) and use them for analysis.
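For reference, picking those three items out of one MeCab output line can be sketched as follows. This assumes the default IPADIC output layout (surface form, a tab, then comma-separated features with the part of speech first and the base form seventh). The class and method names are hypothetical, and the sample line is adapted from the translated output above.

```java
import java.util.Arrays;

class MecabLineParser {
    // Returns {surface, partOfSpeech, baseForm} for one token line.
    // Assumes the IPADIC layout: surface<TAB>pos,pos1,pos2,pos3,ctype,cform,base,reading,pronunciation
    // Lines without a tab (such as the final "EOS" line) should be skipped by the caller.
    static String[] parse(String line) {
        String[] parts = line.split("\t", 2);
        String[] features = parts[1].split(",");
        // Unknown words may have fewer feature fields, so fall back to "*"
        String baseForm = features.length > 6 ? features[6] : "*";
        return new String[] { parts[0], features[0], baseForm };
    }

    public static void main(String[] args) {
        String line = "Weather\tNoun,General,*,*,*,*,Weather,Tenki,Tenki";
        System.out.println(Arrays.toString(parse(line)));
        // prints [Weather, Noun, Weather]
    }
}
```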

I want to do the same thing in English, so I use OpenNLP to get the tokens ("morphemes"), parts of speech, and base forms (lemmas) from English sentences.

Table of contents

  1. Functions provided by OpenNLP
  2. Java implementation
     1. Preparation
     2. Tokenization (word separation)
     3. Part-of-speech tagging
     4. Lemmatization (base forms)

1. Functions provided by OpenNLP

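OpenNLP supports multiple languages and provides a number of natural language processing functions.

I want to get the "morpheme", "part of speech", and "base form", so this time I use three of them: the Tokenizer, the Part-of-Speech Tagger, and the Lemmatizer.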


2. Java implementation

1. Preparation

Create a Maven project and add the OpenNLP dependency (opennlp-tools) to pom.xml.
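
A minimal sketch of the dependency entry, assuming the standard opennlp-tools artifact from Maven Central. The version number is an assumption (the original post does not say which version was used), so pick a release that supports your Java version.

```xml
<dependencies>
  <dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <!-- assumed version; not stated in the original post -->
    <version>1.9.4</version>
  </dependency>
</dependencies>
```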


Also, download the model and dictionary files used in the code below (en-token.bin, en-pos-maxent.bin, en-lemmatizer.txt) from the OpenNLP site and put them in the project where the program can read them.

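2. Tokenization (word separation)

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

// Tokenizer settings. Java does not expand "~", so replace it with the real path to the model.
TokenizerModel model;
try (InputStream modelIn = new FileInputStream("~/en-token.bin")) {
    model = new TokenizerModel(modelIn);
}
Tokenizer tokenizer = new TokenizerME(model);

// Split the sentence into tokens (the English counterpart of morphemes)
String message = "It is a fine day today.";
String[] morphemes = tokenizer.tokenize(message);

>> [It, is, a, fine, day, today, .]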
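
For comparison, here is a small sketch of what a plain whitespace split does with the same sentence. It uses only the Java standard library (no OpenNLP), and the class and method names are hypothetical. The period stays attached to the last word, which is one reason to use a trained tokenizer instead.

```java
import java.util.Arrays;

class WhitespaceSplitDemo {
    // Naive tokenization: split on runs of whitespace only
    static String[] naiveSplit(String sentence) {
        return sentence.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(naiveSplit("It is a fine day today.")));
        // prints [It, is, a, fine, day, today.]
    }
}
```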

3. Part-of-speech tagging

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

// Part-of-speech tagger settings (replace "~" with the real path to the model)
POSModel posModel;
try (InputStream posModelIn = new FileInputStream("~/en-pos-maxent.bin")) {
    posModel = new POSModel(posModelIn);
}
POSTaggerME posTagger = new POSTaggerME(posModel);

// Tag the tokens produced by the tokenizer
String[] tags = posTagger.tag(morphemes);
>> [PRP, VBZ, DT, JJ, NN, NN, .]
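
The tags above are Penn Treebank tags. As a reading aid, the sketch below pairs each token with its tag and a short description. The lookup table covers only the tags that appear in this example, and the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PennTagGlossary {
    // Descriptions for the Penn Treebank tags used in this example only
    static final Map<String, String> DESCRIPTIONS = new LinkedHashMap<>();
    static {
        DESCRIPTIONS.put("PRP", "personal pronoun");
        DESCRIPTIONS.put("VBZ", "verb, 3rd person singular present");
        DESCRIPTIONS.put("DT", "determiner");
        DESCRIPTIONS.put("JJ", "adjective");
        DESCRIPTIONS.put("NN", "noun, singular or mass");
        DESCRIPTIONS.put(".", "sentence-final punctuation");
    }

    static String describe(String tag) {
        return DESCRIPTIONS.getOrDefault(tag, "(not in this small table)");
    }

    public static void main(String[] args) {
        // Token and tag arrays copied from the results shown in this article
        String[] tokens = { "It", "is", "a", "fine", "day", "today", "." };
        String[] tags = { "PRP", "VBZ", "DT", "JJ", "NN", "NN", "." };
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + "\t" + tags[i] + "\t" + describe(tags[i]));
        }
    }
}
```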

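4. Lemmatization (base forms)

import opennlp.tools.lemmatizer.DictionaryLemmatizer;

// Lemmatizer settings (replace "~" with the real path to the dictionary)
DictionaryLemmatizer lemmatizer;
try (InputStream dictLemmatizer = new FileInputStream("~/en-lemmatizer.txt")) {
    lemmatizer = new DictionaryLemmatizer(dictLemmatizer);
}

// Look up the base form of each token using the tokens and their POS tags
String[] lemmas = lemmatizer.lemmatize(morphemes, tags);
>> [it, be, a, fine, day, today, O]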

The lemmatizer returns "O" for any token it cannot find in the dictionary, and this happens more often than I expected, so some adjustment is needed, such as replacing "O" with the original token.
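
One way to make that adjustment is sketched below: wherever the lemma is "O", fall back to the token at the same index. The class and method names are hypothetical, and the arrays are copied from the results shown earlier in this article.

```java
import java.util.Arrays;

class LemmaFallback {
    // Replaces each "O" (no dictionary entry) with the original token at the same index
    static String[] fillMissingLemmas(String[] tokens, String[] lemmas) {
        String[] result = new String[lemmas.length];
        for (int i = 0; i < lemmas.length; i++) {
            result[i] = "O".equals(lemmas[i]) ? tokens[i] : lemmas[i];
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = { "It", "is", "a", "fine", "day", "today", "." };
        String[] lemmas = { "it", "be", "a", "fine", "day", "today", "O" };
        System.out.println(Arrays.toString(fillMissingLemmas(tokens, lemmas)));
        // prints [it, be, a, fine, day, today, .]
    }
}
```

Depending on the analysis, it may also make sense to lowercase the fallback token so that it matches the style of the dictionary lemmas.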
