About this page

Let's perform morphological analysis in Java. Since it is a prerequisite for various other articles, I will cover everything up to a basic operation check.

What is morphological analysis?

Morphological analysis is the process of dividing text into morphemes, the smallest meaningful units, such as words. It is one of the most commonly used techniques for getting machines to process a language.
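To make the idea concrete, here is a toy sketch of dictionary-based segmentation using a greedy longest-match rule. It is not how Kuromoji works internally (real analyzers choose among candidate splits using dictionary costs), and the class name `LongestMatchSegmenter` and the five-word dictionary are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration only: greedy longest-match over a tiny hypothetical dictionary.
final class LongestMatchSegmenter {
    private final Set<String> dictionary;
    private final int maxWordLength;

    LongestMatchSegmenter(Set<String> dictionary) {
        this.dictionary = dictionary;
        int max = 1;
        for (String word : dictionary) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    // Splits text left to right, always taking the longest dictionary word
    // that starts at the current position. A character that starts no known
    // word becomes a one-character token.
    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("東京", "東京都", "京都", "に", "住む"));
        // prints [東京都, に, 住む]
        System.out.println(new LongestMatchSegmenter(dict).segment("東京都に住む"));
    }
}
```

The input contains both "東京" and "京都" as substrings, which shows why Japanese text, written without spaces, needs a dictionary and a disambiguation rule to be split into words.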

This article uses a number of other terms as well. We will go through the operation check first, and explain each term in the appendix.

Development policy

The policy is to add the Kuromoji library on top of Spring Boot & Gradle. If you need to start from environment setup, please refer to the following. ⇒ Introduction to Spring Boot ... It's good, so I'm sure!

Environment	Service/version
Execution environment	Windows 10
Development environment	Eclipse Oxygen.2 Release (4.7.2)
Development language	Java 8
Framework	Spring Boot 2.1.3

Adding Kuromoji to the project

Kuromoji's library seems to be available from Maven Central, but this time I decided to fetch it from the codelibs repository.

Add the following to repositories and dependencies, then perform a Gradle refresh to update the dependencies.

`build.gradle`


plugins {
	id 'org.springframework.boot' version '2.1.3.RELEASE'
	id 'java'
}

apply plugin: 'io.spring.dependency-management'

group = 'com.lab.app.ketman'
version = '0.0.1-SNAPSHOT'
sourceCompatibility = '1.8'

repositories {
	mavenCentral()
	// added: from here
	maven {
		url "http://maven.codelibs.org"
	}
	// added: up to here
}

dependencies {
	implementation 'org.springframework.boot:spring-boot-starter-thymeleaf'
	implementation 'org.springframework.boot:spring-boot-starter-web'
	implementation 'org.mybatis.spring.boot:mybatis-spring-boot-starter:2.0.0'

	// added: from here
	implementation 'org.codelibs:lucene-analyzers-kuromoji-ipadic-neologd:7.6.0-20190325'
	// added: up to here

	runtimeOnly 'org.springframework.boot:spring-boot-devtools'
	runtimeOnly 'org.postgresql:postgresql'

	testImplementation 'org.springframework.boot:spring-boot-starter-test'
}

Image of Gradle refresh

Print the result to the console first

The analysis results are exposed through Attribute objects. Declare a variable for each piece of information you want and read it from the tokenizer.

Attribute	Overview
CharTermAttribute	Surface form of the token, exactly as it appears in the analyzed sentence
ReadingAttribute	Reading of the morpheme
OffsetAttribute	Character offset at which the morpheme appears
PartOfSpeechAttribute	Part-of-speech information
BaseFormAttribute	Base (dictionary) form
InflectionAttribute	Inflection (conjugation) form and type

`KuromojiSample`


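import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
// Assumption: these are the package names of the codelibs neologd build.
// The stock Lucene Kuromoji classes live under org.apache.lucene.analysis.ja instead.
import org.codelibs.neologd.ipadic.lucene.analysis.ja.JapaneseTokenizer;
import org.codelibs.neologd.ipadic.lucene.analysis.ja.tokenattributes.BaseFormAttribute;
import org.codelibs.neologd.ipadic.lucene.analysis.ja.tokenattributes.InflectionAttribute;
import org.codelibs.neologd.ipadic.lucene.analysis.ja.tokenattributes.PartOfSpeechAttribute;
import org.codelibs.neologd.ipadic.lucene.analysis.ja.tokenattributes.ReadingAttribute;

public class KuromojiSample {
	// Prints each token to the console. KuromojiEntity is not shown in this article,
	// and the returned list is left empty in this operation check.
	public List<KuromojiEntity> kuromojineologd(String src) {
		List<KuromojiEntity> keList = new ArrayList<>();
		// Arguments: no user dictionary, keep punctuation, NORMAL mode
		try (JapaneseTokenizer jt =
				new JapaneseTokenizer(null, false, JapaneseTokenizer.Mode.NORMAL)) {
			// Register the attributes once, before consuming the token stream
			CharTermAttribute ct = jt.addAttribute(CharTermAttribute.class);
			ReadingAttribute ra = jt.addAttribute(ReadingAttribute.class);
			OffsetAttribute oa = jt.addAttribute(OffsetAttribute.class);
			PartOfSpeechAttribute posa = jt.addAttribute(PartOfSpeechAttribute.class);
			BaseFormAttribute bfa = jt.addAttribute(BaseFormAttribute.class);
			InflectionAttribute ifa = jt.addAttribute(InflectionAttribute.class);

			jt.setReader(new StringReader(src));
			jt.reset();
			// Each call to incrementToken() advances to the next morpheme
			while (jt.incrementToken()) {
				System.out.println(
						ct.toString()
						+ " | " + ra.getReading()
						+ " | " + oa.startOffset()
						+ " | " + posa.getPartOfSpeech()
						+ " | " + bfa.getBaseForm()
						+ " | " + ifa.getInflectionForm()
						+ " | " + ifa.getInflectionType());
			}
			// Signal the end of the stream, as the TokenStream contract requires
			jt.end();
		} catch (IOException e) {
			e.printStackTrace();
		}
		return keList;
	}
}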
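The sample returns a `List<KuromojiEntity>`, but the article does not show that class and the loop never fills the list. As a minimal sketch of what such a class might hold, here is a hypothetical `KuromojiEntity` with one field per printed column. The field names, constructor, and `toString()` format are assumptions, not the author's actual class.

```java
// Hypothetical sketch: the original KuromojiEntity is not shown in the article.
// One field per value printed by the sample's System.out.println call.
final class KuromojiEntity {
    private final String surface;        // from CharTermAttribute
    private final String reading;        // from ReadingAttribute
    private final int startOffset;       // from OffsetAttribute
    private final String partOfSpeech;   // from PartOfSpeechAttribute
    private final String baseForm;       // from BaseFormAttribute (may be null)
    private final String inflectionForm; // from InflectionAttribute (may be null)
    private final String inflectionType; // from InflectionAttribute (may be null)

    KuromojiEntity(String surface, String reading, int startOffset, String partOfSpeech,
            String baseForm, String inflectionForm, String inflectionType) {
        this.surface = surface;
        this.reading = reading;
        this.startOffset = startOffset;
        this.partOfSpeech = partOfSpeech;
        this.baseForm = baseForm;
        this.inflectionForm = inflectionForm;
        this.inflectionType = inflectionType;
    }

    String getSurface() { return surface; }
    String getReading() { return reading; }
    int getStartOffset() { return startOffset; }
    String getPartOfSpeech() { return partOfSpeech; }
    String getBaseForm() { return baseForm; }
    String getInflectionForm() { return inflectionForm; }
    String getInflectionType() { return inflectionType; }

    // Same pipe-separated layout as the console output; null fields print as "null".
    @Override
    public String toString() {
        return surface + " | " + reading + " | " + startOffset + " | " + partOfSpeech
                + " | " + baseForm + " | " + inflectionForm + " | " + inflectionType;
    }
}
```

With a class like this, the loop could call `keList.add(new KuromojiEntity(ct.toString(), ra.getReading(), oa.startOffset(), posa.getPartOfSpeech(), bfa.getBaseForm(), ifa.getInflectionForm(), ifa.getInflectionType()))` so that the caller receives the tokens instead of an empty list.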

`SampleKuromojiController`


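import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;
import org.springframework.web.bind.annotation.RequestMapping;

@Controller
public class SampleKuromojiController {
	private final KuromojiSample ks = new KuromojiSample();

	@RequestMapping("/kuromoji")
	public String index(Model model) {
		// Note: the tokenizer expects Japanese input. This string is the English translation of the
		// sentence used in the original run, so pass Japanese text to get output like the result below.
		String sentence = "neologd can interpret Yuru-chara as a proper noun.";
		// Analyze the sentence; the tokens are printed to the console
		ks.kuromojineologd(sentence);
		// Render the "index" view
		return "index";
	}
}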

Result

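With the neologd dictionary, the sentence appears to be split as shown below. Each line lists the surface form, reading, start offset, part of speech, base form, inflection form, and inflection type. Notably, the reading given for "Yuru-chara" extends all the way to "Jiccoui Inkai" (presumably 実行委員会, "executive committee").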

neologd | Neologdy | 0 | noun-Proper noun-General | NEologd | null | null
You | Kun | 7 | noun-suffix-Personal name | null | null | null
Is | C | 8 | Particle-Particle | null | null | null
Yuru-chara | Yuru Chara Grand Prix Jiccoui Inkai | 9 | noun-Proper noun-Personal name-General | null | null | null
To | Wo | 14 | Particle-Case particles-General | null | null | null
Proper noun | Koyu Meishi | 15 | noun-General | null | null | null
As | Toshite | 19 | Particle-Case particles-Collocation | null | null | null
Interpretation | Kaishaku | 22 | noun-Change connection | null | null | null
Finished | Deki | 24 | verb-Independence | Can do | Continuous form | One step
Masu | trout | 26 | Auxiliary verb | null | Uninflected word | Special / mass
。 | 。 | 28 | symbol-Kuten | null | null | null
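In the output above, the part-of-speech column is a hyphen-separated hierarchy (for example "noun-Proper noun-General"). The following sketch splits such a string into its levels and checks the top-level category, which is handy for keeping only nouns, for instance. The class and method names are hypothetical, and the hyphen separator is an assumption based on the output shown here; in a real run the labels are Japanese.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: works on the part-of-speech string as printed by the sample.
final class PartOfSpeechLevels {
    private PartOfSpeechLevels() {}

    // "noun-Proper noun-General" -> [noun, Proper noun, General]
    static List<String> levels(String partOfSpeech) {
        if (partOfSpeech == null || partOfSpeech.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(partOfSpeech.split("-"));
    }

    // True when the first (coarsest) level equals the given category.
    static boolean hasTopLevel(String partOfSpeech, String category) {
        List<String> parts = levels(partOfSpeech);
        return !parts.isEmpty() && parts.get(0).equals(category);
    }
}
```

For example, `PartOfSpeechLevels.hasTopLevel(posa.getPartOfSpeech(), "名詞")` could be used inside the token loop to skip everything except nouns, assuming the dictionary labels nouns as 名詞.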

Appendix

① What is Lucene Analyzer?

Excerpt from 1. Lucene Overview

Lucene is a 100% pure Java, index-based full-text search engine developed by the Jakarta Project. (An index is a lookup structure built to make searching fast.) Lucene itself is a library, not a complete program, but by using the API it provides you can easily create an easy-to-use full-text search program. Also, because it is written in Java, it can be easily adapted to web applications. Lucene itself cannot analyze Japanese, but Japanese text can be searched by combining it with a morphological analysis program.

② What is ipadic-neologd?

Natural language keeps evolving day and night, so maintaining the information (the dictionary) given to machines is an ongoing problem. The idea behind ipadic-neologd is to tackle this problem by crawling the Web.

Partial excerpt from neologd / mecab-ipadic-neologd

mecab-ipadic-NEologd is a system dictionary for MeCab, customized by adding new words drawn from many language resources on the Web. When analyzing documents on the Web, it is recommended to use this dictionary together with the standard system dictionary (ipadic). (Omitted)

Advantages
・ It records approximately 3.12 million pairs (including duplicate entries) of surface form (notation) and furigana for words, such as named entities, that MeCab's standard system dictionary cannot split correctly.
・ The dictionary is updated automatically on the development server, at least twice a week (Monday and Thursday).
・ Because it uses language resources on the Web, new named entities can be recorded at update time. The resources currently in use are:
　・ Dump data of Hatena keyword
　・ Downloaded zip code data
　…
(Omitted)

Disadvantages
・ Classification of named entities is insufficient. For example, some personal names and product names are classified in the same named entity category.
・ Words that are not named entities are also registered as named entities.
…

③ About setting analysis policy

In the sample code, the argument JapaneseTokenizer.Mode.NORMAL was passed. There are also Search and Extended modes (Mode.SEARCH and Mode.EXTENDED), each with the following features.

Excerpt from About Kuromoji

Normal mode: The ordinary mode. After initialization, morphological analysis is performed in this mode by default.

Search mode: A compound made of several words, such as "Nihon Keizai Shimbun" (日本経済新聞), is split into "日本 | 経済 | 新聞" (Japan | economy | newspaper). When used in combination with a full-text search engine, this is convenient because the Nihon Keizai Shimbun can then be found by searching for "economy" or "newspaper".

Extended mode: In addition to the Search mode behavior, unknown words are treated as unigrams. For example, "Mobage" (モバゲー) is split into single characters: "モ | バ | ゲ | ー". This is intended to reduce the chance of a search missing an unknown word.
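The unigram fallback described for the last mode can be illustrated with plain Java. This is only a toy sketch of "split a word into one-character tokens", not Kuromoji's implementation; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of unigram splitting; not part of Kuromoji.
final class UnigramSketch {
    private UnigramSketch() {}

    // Splits a word into one token per Unicode code point,
    // so surrogate pairs are kept together.
    static List<String> toUnigrams(String word) {
        List<String> unigrams = new ArrayList<>();
        int i = 0;
        while (i < word.length()) {
            int codePoint = word.codePointAt(i);
            unigrams.add(new String(Character.toChars(codePoint)));
            i += Character.charCount(codePoint);
        }
        return unigrams;
    }

    public static void main(String[] args) {
        // prints [モ, バ, ゲ, ー]
        System.out.println(toUnigrams("モバゲー"));
    }
}
```

Indexing every character separately means a query for any part of the unknown word still has something to match, at the cost of noisier matches.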

④ Gradle syntax for adding the dependency

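If you want to fetch Kuromoji from Maven Central instead, the dependency can be declared as follows. Note that this is the standalone Kuromoji artifact, whose API differs from the Lucene-based classes used in the sample above. [Home » com.atilika.kuromoji » kuromoji-ipadic » 0.9.0](https://mvnrepository.com/artifact/com.atilika.kuromoji/kuromoji-ipadic/0.9.0)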

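// https://mvnrepository.com/artifact/com.atilika.kuromoji/kuromoji-ipadic
// "implementation" matches the rest of build.gradle; the older "compile" configuration is deprecated
implementation group: 'com.atilika.kuromoji', name: 'kuromoji-ipadic', version: '0.9.0'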
