Let's perform morphological analysis in Java. Since it will be a prerequisite for various other articles, I summarize the steps up to an operation check here.
Morphological analysis is the process of dividing a document into its smallest meaningful units (morphemes), such as words. It is one of the most commonly used techniques for getting machines to process language.
Many other terms appear in this article. I first walk through the operation check and explain each term in the appendix.
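To make "dividing a document into words" concrete, here is a toy sketch in plain Java. It uses greedy longest-match lookup against a tiny made-up dictionary. This is only an illustration of the idea: the class name `ToySegmenter` and the dictionary are assumptions for this example, and it is not how Kuromoji works internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration only: greedy longest-match segmentation against a tiny,
// made-up dictionary. This is NOT Kuromoji's algorithm.
class ToySegmenter {
    private final Set<String> dictionary;
    private final int maxWordLength;

    ToySegmenter(Set<String> dictionary) {
        this.dictionary = dictionary;
        int max = 1;
        for (String word : dictionary) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    // At each position, take the longest dictionary word that starts there.
    // A character not covered by the dictionary becomes a one-character token.
    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            String match = null;
            for (int e = end; e > pos; e--) {
                String candidate = text.substring(pos, e);
                if (dictionary.contains(candidate)) {
                    match = candidate;
                    break;
                }
            }
            if (match == null) {
                match = text.substring(pos, pos + 1);
            }
            tokens.add(match);
            pos += match.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("東京", "東京都", "京都", "に", "住む"));
        // Splits the sentence into 東京都 / に / 住む
        System.out.println(new ToySegmenter(dict).segment("東京都に住む"));
    }
}
```

A real analyzer also has to decide between competing splits and attach readings and parts of speech, which is why a library with a proper dictionary is used in the rest of this article.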
The approach is to add the Kuromoji library to a Spring Boot and Gradle project. If you are starting from environment setup, please refer to the following. ⇒ Introduction to Spring Boot ... It's good, so I'm sure!
Environment | Service/version |
---|---|
Execution environment | Windows 10 |
Development environment | Eclipse Oxygen.2 Release (4.7.2), Java version |
Development language | Java 8 |
Framework | Spring Boot 2.1.3 |
The Kuromoji library is also available from Maven Central, but this time I fetch it from the codelibs repository.
Add the following to repositories and dependencies, then run a Gradle refresh to update the dependencies.
build.gradle
plugins {
    id 'org.springframework.boot' version '2.1.3.RELEASE'
    id 'java'
}
apply plugin: 'io.spring.dependency-management'
group = 'com.lab.app.ketman'
version = '0.0.1-SNAPSHOT'
sourceCompatibility = '1.8'
repositories {
    mavenCentral()
    // added: codelibs repository
    maven {
        url "http://maven.codelibs.org"
    }
    // end of addition
}
dependencies {
    implementation 'org.springframework.boot:spring-boot-starter-thymeleaf'
    implementation 'org.springframework.boot:spring-boot-starter-web'
    implementation 'org.mybatis.spring.boot:mybatis-spring-boot-starter:2.0.0'
    // added: Kuromoji with the neologd dictionary
    implementation 'org.codelibs:lucene-analyzers-kuromoji-ipadic-neologd:7.6.0-20190325'
    // end of addition
    runtimeOnly 'org.springframework.boot:spring-boot-devtools'
    runtimeOnly 'org.postgresql:postgresql'
    testImplementation 'org.springframework.boot:spring-boot-starter-test'
}
The analysis result is stored in Attribute objects. Obtain the attribute for each piece of information you want and hold it in a variable.
Attribute | Overview |
---|---|
CharTermAttribute | Surface form of the token, exactly as it appears in the analyzed text |
ReadingAttribute | Reading of the morpheme |
OffsetAttribute | Character position at which the morpheme appears |
PartOfSpeechAttribute | Part-of-speech information |
BaseFormAttribute | Base (dictionary) form |
InflectionAttribute | Inflection (conjugation) form and type |
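KuromojiSample.java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
// Assumption: the codelibs neologd build places the Kuromoji classes under this package.
// Adjust the imports if your IDE resolves them elsewhere.
import org.codelibs.neologd.ipadic.lucene.analysis.ja.JapaneseTokenizer;
import org.codelibs.neologd.ipadic.lucene.analysis.ja.tokenattributes.BaseFormAttribute;
import org.codelibs.neologd.ipadic.lucene.analysis.ja.tokenattributes.InflectionAttribute;
import org.codelibs.neologd.ipadic.lucene.analysis.ja.tokenattributes.PartOfSpeechAttribute;
import org.codelibs.neologd.ipadic.lucene.analysis.ja.tokenattributes.ReadingAttribute;

public class KuromojiSample {
    // Returns a list of KuromojiEntity. For this operation check the result is only
    // printed, so the list is returned empty.
    public List<KuromojiEntity> kuromojineologd(String src) {
        List<KuromojiEntity> keList = new ArrayList<KuromojiEntity>();
        // null = no user dictionary, false = do not discard punctuation
        try (JapaneseTokenizer jt =
                new JapaneseTokenizer(null, false, JapaneseTokenizer.Mode.NORMAL)) {
            // Attributes are registered once; their values change on every incrementToken()
            CharTermAttribute ct = jt.addAttribute(CharTermAttribute.class);
            ReadingAttribute ra = jt.addAttribute(ReadingAttribute.class);
            OffsetAttribute oa = jt.addAttribute(OffsetAttribute.class);
            PartOfSpeechAttribute posa = jt.addAttribute(PartOfSpeechAttribute.class);
            BaseFormAttribute bfa = jt.addAttribute(BaseFormAttribute.class);
            InflectionAttribute ifa = jt.addAttribute(InflectionAttribute.class);
            jt.setReader(new StringReader(src));
            jt.reset();
            while (jt.incrementToken()) {
                System.out.println(
                    ct.toString()
                    + " | " + ra.getReading()
                    + " | " + oa.startOffset()
                    + " | " + posa.getPartOfSpeech()
                    + " | " + bfa.getBaseForm()
                    + " | " + ifa.getInflectionForm()
                    + " | " + ifa.getInflectionType());
            }
            jt.end(); // completes the TokenStream workflow before close()
        } catch (IOException e) {
            e.printStackTrace();
        }
        return keList;
    }
}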
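The sample returns `List<KuromojiEntity>`, but the article does not show that class. Below is a hypothetical sketch of what such a holder could look like. Every field and method name here is an assumption chosen to mirror the printed columns, not the author's actual class.

```java
// Hypothetical sketch: KuromojiEntity is not shown in the article, so all
// names below are assumptions that mirror the columns printed by the sample.
class KuromojiEntity {
    private final String surface;        // from CharTermAttribute
    private final String reading;        // from ReadingAttribute
    private final int startOffset;       // from OffsetAttribute
    private final String partOfSpeech;   // from PartOfSpeechAttribute
    private final String baseForm;       // from BaseFormAttribute, may be null
    private final String inflectionForm; // from InflectionAttribute, may be null
    private final String inflectionType; // from InflectionAttribute, may be null

    KuromojiEntity(String surface, String reading, int startOffset, String partOfSpeech,
                   String baseForm, String inflectionForm, String inflectionType) {
        this.surface = surface;
        this.reading = reading;
        this.startOffset = startOffset;
        this.partOfSpeech = partOfSpeech;
        this.baseForm = baseForm;
        this.inflectionForm = inflectionForm;
        this.inflectionType = inflectionType;
    }

    String getSurface() { return surface; }
    String getReading() { return reading; }
    int getStartOffset() { return startOffset; }
    String getPartOfSpeech() { return partOfSpeech; }
    String getBaseForm() { return baseForm; }
    String getInflectionForm() { return inflectionForm; }
    String getInflectionType() { return inflectionType; }

    // Same " | " separated layout that the sample prints to the console
    @Override
    public String toString() {
        return surface + " | " + reading + " | " + startOffset + " | " + partOfSpeech
            + " | " + baseForm + " | " + inflectionForm + " | " + inflectionType;
    }
}
```

With a class like this, the loop in the sample could add one entity per token to `keList` instead of only printing, using the same attribute getters it already calls.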
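SampleKuromojiController.java
import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;
import org.springframework.web.bind.annotation.RequestMapping;

@Controller
public class SampleKuromojiController {
    KuromojiSample ks = new KuromojiSample();

    // Accessing /kuromoji runs the analysis and prints the result to the console
    @RequestMapping("/kuromoji")
    public String index(Model model) {
        // English rendering of the input; the output shown in this article
        // corresponds to the original Japanese sentence.
        String sentence = "neologd can interpret Yuru-chara as a proper noun.";
        ks.kuromojineologd(sentence);
        return "index";
    }
}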
With the neologd dictionary, the sentence is split as follows. Notably, the reading given for "Yuru-chara" runs all the way to "Jiccoui Inkai" (executive committee).
neologd |Neologdy| 0 |noun-Proper noun-General| NEologd | null | null
You|Kun| 7 |noun-suffix-Personal name| null | null | null
Is|Ha| 8 |Particle-Binding particle| null | null | null
Yuru-chara|Yuru Chara Grand Prix Jiccoui Inkai| 9 |noun-Proper noun-Personal name-General| null | null | null
To|Wo| 14 |Particle-Case particles-General| null | null | null
Proper noun|Koyu Meishi| 15 |noun-General| null | null | null
As|Toshite| 19 |Particle-Case particles-Collocation| null | null | null
Interpretation|Kaishaku| 22 |noun-Sa-hen connection| null | null | null
Finished|Deki| 24 |verb-Independent|Can do|Continuative form|Ichidan
Masu|Masu| 26 |Auxiliary verb| null |Base form|Special-masu
。 | 。 | 28 |symbol-Period| null | null | null
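The part-of-speech column in the output above is a hyphen-separated hierarchy, with the broad category first. As a small sketch of how such values might be used afterwards, the plain-Java helper below keeps only tokens of one top-level category. The class and method names (`PosFilter`, `topLevel`, `surfacesWithCategory`) are assumptions for this example, and the rows are copied from the translated output rather than produced by Kuromoji here.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: works on plain strings shaped like the output above,
// not on Kuromoji attribute objects.
class PosFilter {
    // Returns the part before the first hyphen, or the whole tag if there is none
    static String topLevel(String partOfSpeech) {
        int i = partOfSpeech.indexOf('-');
        return i < 0 ? partOfSpeech : partOfSpeech.substring(0, i);
    }

    // Each row is {surface, partOfSpeech}; keeps surfaces whose top-level
    // category equals the requested one, preserving input order.
    static List<String> surfacesWithCategory(List<String[]> rows, String category) {
        List<String> result = new ArrayList<>();
        for (String[] row : rows) {
            if (topLevel(row[1]).equals(category)) {
                result.add(row[0]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"neologd", "noun-Proper noun-General"});
        rows.add(new String[] {"To", "Particle-Case particles-General"});
        rows.add(new String[] {"Proper noun", "noun-General"});
        rows.add(new String[] {"Masu", "Auxiliary verb"});
        // Prints the surfaces tagged as nouns
        System.out.println(surfacesWithCategory(rows, "noun"));
    }
}
```

Filtering on the top-level category is a common first step, for example when only nouns are wanted as keywords.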
Excerpt from 1. Lucene Overview
Lucene is a 100% pure Java, index-based full-text search engine developed by the Jakarta Project. (An index here is a lookup structure built to make searches fast.) Lucene itself is a library, not a complete program, but by using the API it provides you can easily create an easy-to-use full-text search program. Also, because it is written in Java, it can easily be adapted to web applications. Lucene itself cannot analyze Japanese, but Japanese text can be searched by combining it with a morphological analysis program.
Natural language evolves day and night, so keeping the information (dictionary) given to machines up to date is one of the challenges. The idea behind neologd is to tackle this by crawling the Web. Partial excerpt from neologd / mecab-ipadic-neologd
mecab-ipadic-NEologd is a system dictionary for MeCab, customized by adding new words derived from many language resources on the Web. When analyzing documents on the Web, it is recommended to use this dictionary together with the standard system dictionary (ipadic).
(Omitted)
Advantages
・ Records approximately 3.12 million pairs (including duplicate entries) of surface form (notation) and furigana for words, such as named entities, that cannot be split correctly by MeCab's standard system dictionary
・ The dictionary is updated automatically on the development server
  ・ It is updated at least twice a week, on Monday and Thursday
・ By using language resources on the Web, new named entities can be recorded at update time
  ・ The resources currently in use are:
    ・ Dump data of Hatena keyword
    ・ Download zip code data
    …
(Omitted)
Disadvantages
・ Insufficient classification of named entities
  ・ For example, some personal names and product names are classified in the same named entity category
・ Words that are not named entities are also registered as named entities
…
The sample code passes JapaneseTokenizer.Mode.NORMAL as an argument. There are also SEARCH and EXTENDED modes, each with the following characteristics.
Excerpt from About Kuromoji
Normal mode: The standard segmentation. If no other mode is chosen at initialization, morphological analysis is performed in this mode by default.
Search mode: A compound made of several words, such as "Nihon Keizai Shimbun" (Japan Economic Newspaper), is analyzed as separate words: "Nihon | Keizai | Shimbun" (Japan | Economy | Newspaper). When combined with a full-text search engine this is convenient, because "Nihon Keizai Shimbun" can then be found by searching for "Keizai" (economy) or "Shimbun" (newspaper).
Extended mode: In addition to the behavior of Search mode, unknown words are treated as uni-grams. For example, "Mobage" is split into individual characters: "Mo | ba | ge | -". This is intended to reduce the chance that a search for an unknown word fails.
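The uni-gram handling described for the last mode can be pictured with a small plain-Java sketch. It is only a toy under stated assumptions: the names `UnigramFallback`, `toUnigrams` and `expandUnknown` are made up for this example, "unknown" simply means "not in the given set", and Kuromoji's real implementation is more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy illustration of a uni-gram fallback for unknown words.
// Not Kuromoji's implementation; all names here are hypothetical.
class UnigramFallback {
    // Splits a word into one token per Unicode code point
    static List<String> toUnigrams(String word) {
        List<String> result = new ArrayList<>();
        word.codePoints().forEach(cp -> result.add(new String(Character.toChars(cp))));
        return result;
    }

    // Keeps known tokens as they are and replaces each unknown token
    // with its individual characters.
    static List<String> expandUnknown(List<String> tokens, Set<String> knownWords) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            if (knownWords.contains(token)) {
                result.add(token);
            } else {
                result.addAll(toUnigrams(token));
            }
        }
        return result;
    }
}
```

For example, `UnigramFallback.toUnigrams("モバゲー")` yields the four single-character tokens モ, バ, ゲ and ー, so a search on any of those characters can still reach the unknown word.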
If you want to get it from Maven Central instead, add a dependency like the following. [com.atilika.kuromoji » kuromoji-ipadic » 0.9.0](https://mvnrepository.com/artifact/com.atilika.kuromoji/kuromoji-ipadic/0.9.0)
// https://mvnrepository.com/artifact/com.atilika.kuromoji/kuromoji-ipadic
implementation 'com.atilika.kuromoji:kuromoji-ipadic:0.9.0'