This is a memo to myself as I read *Introduction to Natural Language Processing Applications in 15 Steps*. This time I note my own takeaways from Step 03 in Chapter 2.
- Personal Mac: macOS Mojave version 10.14.6
- docker version: 19.03.2 for both client and server
Understand how MeCab works and try tuning it. Also, take a look at morphological analyzers other than MeCab.
03.1 MeCab
MeCab segments text based on a dictionary. The information you get from morphological analysis with MeCab therefore depends on what is registered in the dictionary, and that registered information differs from dictionary to dictionary.
Dictionary name | Contents |
---|---|
IPAdic | ・The dictionary officially recommended by MeCab ・Based on data called the IPA corpus |
UniDic | ・Based on data called UniDic ・Splits text into small units, close to strict "morphological analysis" |
jumandic | ・A MeCab port of the dictionary used by JUMAN, a morphological analyzer distinct from MeCab ・Based on data called the Kyoto corpus ・Includes meta information such as representative spellings |
ipadic-NEologd | ・Greatly expands the number of words, based on the IPA dictionary ・The vocabulary is frequently extended by crawling words from the web, so it handles new words very well ・Text normalization is recommended as preprocessing |
unidic-NEologd | ・Like ipadic-NEologd, a dictionary with an expanded vocabulary, but based on UniDic |
The difference between IPAdic and ipadic-NEologd shows up, for example, in the analysis of the relatively new word "Deep Learning":
- IPAdic: split into "Deep" and "Learning"
- ipadic-NEologd: treated as the single word "Deep Learning"
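As a toy illustration (this is not how MeCab itself works internally, just a sketch of the dictionary-coverage effect), a greedy longest-match tokenizer with two made-up word lists splits the same string differently depending on whether the compound is registered as one word:

```python
# Toy sketch: dictionary coverage changes segmentation.
# The word lists below are invented stand-ins for IPAdic / ipadic-NEologd.

def longest_match(text, dictionary):
    """Segment `text` greedily, always taking the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character as a 1-char token
            i += 1
    return tokens

ipadic_like = {"deep", "learning"}                     # compound not registered
neologd_like = {"deep", "learning", "deeplearning"}    # compound registered

print(longest_match("deeplearning", ipadic_like))   # → ['deep', 'learning']
print(longest_match("deeplearning", neologd_like))  # → ['deeplearning']
```

The point is only that segmentation is bounded by what the dictionary knows: no analyzer can emit "Deep Learning" as one token unless some dictionary entry covers it.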
The dictionary holds not only the per-morpheme information you see in analysis results, but also the following:
- the occurrence cost of each word
- the left context ID of each word
- the right context ID of each word
- the connection cost for each combination of context IDs
For a given sentence, the analysis result is the segmentation that minimizes the sum of occurrence costs and connection costs. (In the example below, the cost of the 東大阪 / 大好き split is the lowest, so that becomes the analysis result.)
Example: 「東大阪大好き」 ("I love Higashi-Osaka")

```
# When split as 東大阪 / 大好き ("Higashi-Osaka" / "love")
connection cost between the beginning of the sentence and 東大阪
occurrence cost of 東大阪
connection cost between 東大阪 and 大好き
occurrence cost of 大好き
connection cost between 大好き and the end of the sentence

# When split as 東大 / 阪大 / 好き ("Univ. of Tokyo" / "Osaka Univ." / "like")
connection cost between the beginning of the sentence and 東大
occurrence cost of 東大
connection cost between 東大 and 阪大
occurrence cost of 阪大
connection cost between 阪大 and 好き
occurrence cost of 好き
connection cost between 好き and the end of the sentence
```
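The cost comparison above can be sketched as a tiny Python program. All the costs below are invented for illustration; real values come from the dictionary files:

```python
# Toy sketch of MeCab-style cost minimization (all costs are made up).
# A segmentation's total cost = sum of word occurrence costs plus
# connection costs between adjacent words, including BOS/EOS.

occurrence = {
    "東大阪": 3000, "大好き": 2500,            # Higashi-Osaka / love
    "東大": 2000, "阪大": 2200, "好き": 1500,  # Univ. of Tokyo / Osaka Univ. / like
}
# Connection costs keyed by (left word or BOS, right word or EOS).
connection = {
    ("BOS", "東大阪"): 100, ("東大阪", "大好き"): 200, ("大好き", "EOS"): 100,
    ("BOS", "東大"): 100, ("東大", "阪大"): 800,
    ("阪大", "好き"): 700, ("好き", "EOS"): 100,
}

def path_cost(words):
    """Total cost of one segmentation: occurrence + connection costs."""
    total = 0
    prev = "BOS"
    for w in words:
        total += connection[(prev, w)] + occurrence[w]
        prev = w
    return total + connection[(prev, "EOS")]

candidates = [["東大阪", "大好き"], ["東大", "阪大", "好き"]]
best = min(candidates, key=path_cost)
print(best)  # → ['東大阪', '大好き']
```

This brute-forces two candidate paths; MeCab instead builds a lattice of all candidate words and finds the minimum-cost path with dynamic programming, but the quantity being minimized is the same.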
If an existing dictionary does not give the results you expect, you can tune the dictionary yourself:

- adding new words
- adjusting how text is segmented
```shell
# Convert the encoding of the source files to UTF-8
$ nkf --overwrite -Ew ./mecab-ipadic-2.7.0-20070801/*

# Build the dictionary
$ mkdir build
$ $(mecab-config --libexecdir)/mecab-dict-index -d ./mecab-ipadic-2.7.0-20070801 -o build -f utf8 -t utf8
$ cp mecab-ipadic-2.7.0-20070801/dicrc ./build/.  # copy dicrc
```
nkf is an abbreviation for "Network Kanji Filter".
To add a new word, create a CSV file in the source file directory:
```
# surface form, left context ID, right context ID, occurrence cost, part of speech,
# POS subcategory 1, POS subcategory 2, POS subcategory 3,
# conjugation type, conjugation form, base form, reading, pronunciation

# Example: adding 自然言語処理 ("natural language processing")
自然言語処理,1288,1288,0,名詞,固有名詞,一般,*,*,*,自然言語処理,シゼンゲンゴショリ,シゼンゲンゴショリ
```
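As a quick sanity check before rebuilding, an entry can be parsed into the 13 fields listed above. The field names here are my own English labels for the IPAdic-format columns, not identifiers MeCab itself uses:

```python
import csv
import io

# The 13 columns of an IPAdic-format entry, in order (labels are my own).
FIELDS = [
    "surface", "left_context_id", "right_context_id", "cost",
    "pos", "pos_sub1", "pos_sub2", "pos_sub3",
    "conj_type", "conj_form", "base_form", "reading", "pronunciation",
]

line = "自然言語処理,1288,1288,0,名詞,固有名詞,一般,*,*,*,自然言語処理,シゼンゲンゴショリ,シゼンゲンゴショリ"
row = next(csv.reader(io.StringIO(line)))
assert len(row) == len(FIELDS)  # an entry with the wrong arity would fail the dictionary build
entry = dict(zip(FIELDS, row))
print(entry["surface"], entry["pos"], entry["cost"])  # → 自然言語処理 名詞 0
```

After placing the CSV alongside the other source files, re-run the `mecab-dict-index` build step above so the new word is compiled into the binary dictionary.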
However, when modifying costs as above, **be aware that it may affect results beyond the part you intended**.
There are apparently methods to adjust costs automatically, but **manually adjusting only the cost of the part you want to fix seems to keep the range of side effects small**.
Get an overview of morphological analyzers other than MeCab.
Morphological analyzer | Contents |
---|---|
MeCab | ・Dictionary-based ・The dictionary holds word information, occurrence costs, and connection costs ・Fast execution ・The dictionary is an external file, so it can be customized as needed |
JUMAN++ | ・A relatively new morphological analyzer that uses a neural network ・Considers not only grammatical correctness but also word meaning ・Considers the information of all words preceding a given word ・Handles spelling variation ・Has many advantages over MeCab, but execution speed is slower |
KyTea (pronounced "cutie") | ・Uses an SVM to predict, from the surrounding characters, whether a word boundary falls between each pair of adjacent characters ・A Python wrapper is provided by a third party |
Janome | ・Written in pure Python ・Bundles the IPA dictionary and provides an API usable from Python ・Execution speed is slow ・Dictionary options are limited |
SudachiPy | ・A Python binding of Sudachi, a morphological analyzer for Java ・As of May 2019, an official release was still pending (has it been released by now?) |
Esanpy (Kuromoji) | ・Kuromoji is a morphological analyzer implemented in Java ・Use it from Python via Esanpy ・Esanpy is a text-analysis library that internally uses Elasticsearch (a full-text search engine) |