This is a memo to myself as I read *Introduction to Natural Language Processing Applications in 15 Steps*. This time I note my own takeaways from Step 03 in Chapter 2.
- Personal Mac: macOS Mojave version 10.14.6
- docker version: 19.03.2 for both client and server
Understand how MeCab works and try tuning it. Also, take a look at morphological analyzers other than MeCab.
03.1 MeCab
MeCab segments text based on a dictionary. The information you get from morphological analysis with MeCab therefore depends on what is registered in the dictionary, and that registered information differs from dictionary to dictionary.
Dictionary name | Contents |
---|---|
IPAdic | ・The dictionary officially recommended by MeCab ・Based on data called the IPA corpus |
UniDic | ・Based on data called UniDic ・Splits text into small units, close to strict "morphological analysis" |
jumandic | ・A MeCab port of the dictionary used by JUMAN, a morphological analyzer distinct from MeCab ・Based on data called the Kyoto corpus ・Includes meta information such as representative spellings |
ipadic-NEologd | ・Greatly expands the number of words, based on the IPA dictionary ・The vocabulary is frequently extended by crawling words from the web, so it handles new words very well ・Text normalization is recommended as preprocessing |
unidic-NEologd | ・Like ipadic-NEologd, a dictionary with an expanded vocabulary, but based on UniDic |
The difference between IPAdic and ipadic-NEologd shows up, for example, in the analysis of the relatively new word "Deep Learning":
- IPAdic: split into "Deep" and "Learning"
- ipadic-NEologd: treated as the single word "Deep Learning"
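As a toy illustration (this is not how MeCab itself works internally, just a sketch of the dictionary-coverage effect), a greedy longest-match tokenizer with two made-up word lists splits the same string differently depending on whether the compound is registered as one word:

```python
# Toy sketch: dictionary coverage changes segmentation.
# The word lists below are invented stand-ins for IPAdic / ipadic-NEologd.

def longest_match(text, dictionary):
    """Segment `text` greedily, always taking the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character as a 1-char token
            i += 1
    return tokens

ipadic_like = {"deep", "learning"}                     # compound not registered
neologd_like = {"deep", "learning", "deeplearning"}    # compound registered

print(longest_match("deeplearning", ipadic_like))   # → ['deep', 'learning']
print(longest_match("deeplearning", neologd_like))  # → ['deeplearning']
```

The point is only that segmentation is bounded by what the dictionary knows: no analyzer can emit "Deep Learning" as one token unless some dictionary entry covers it.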
The dictionary holds not only the per-morpheme information you see in analysis results, but also the following:
- the occurrence cost of each word
- the left context ID of each word
- the right context ID of each word
- the connection cost for each combination of context IDs
For a given sentence, the analysis result is the segmentation that minimizes the sum of occurrence costs and connection costs. (In the example below, the cost of the 東大阪 / 大好き split is the lowest, so that becomes the analysis result.)
Example: 「東大阪大好き」 ("I love Higashi-Osaka")

```
# When split as 東大阪 / 大好き ("Higashi-Osaka" / "love")
connection cost between the beginning of the sentence and 東大阪
occurrence cost of 東大阪
connection cost between 東大阪 and 大好き
occurrence cost of 大好き
connection cost between 大好き and the end of the sentence

# When split as 東大 / 阪大 / 好き ("Univ. of Tokyo" / "Osaka Univ." / "like")
connection cost between the beginning of the sentence and 東大
occurrence cost of 東大
connection cost between 東大 and 阪大
occurrence cost of 阪大
connection cost between 阪大 and 好き
occurrence cost of 好き
connection cost between 好き and the end of the sentence
```
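The cost comparison above can be sketched as a tiny Python program. All the costs below are invented for illustration; real values come from the dictionary files:

```python
# Toy sketch of MeCab-style cost minimization (all costs are made up).
# A segmentation's total cost = sum of word occurrence costs plus
# connection costs between adjacent words, including BOS/EOS.

occurrence = {
    "東大阪": 3000, "大好き": 2500,            # Higashi-Osaka / love
    "東大": 2000, "阪大": 2200, "好き": 1500,  # Univ. of Tokyo / Osaka Univ. / like
}
# Connection costs keyed by (left word or BOS, right word or EOS).
connection = {
    ("BOS", "東大阪"): 100, ("東大阪", "大好き"): 200, ("大好き", "EOS"): 100,
    ("BOS", "東大"): 100, ("東大", "阪大"): 800,
    ("阪大", "好き"): 700, ("好き", "EOS"): 100,
}

def path_cost(words):
    """Total cost of one segmentation: occurrence + connection costs."""
    total = 0
    prev = "BOS"
    for w in words:
        total += connection[(prev, w)] + occurrence[w]
        prev = w
    return total + connection[(prev, "EOS")]

candidates = [["東大阪", "大好き"], ["東大", "阪大", "好き"]]
best = min(candidates, key=path_cost)
print(best)  # → ['東大阪', '大好き']
```

This brute-forces two candidate paths; MeCab instead builds a lattice of all candidate words and finds the minimum-cost path with dynamic programming, but the quantity being minimized is the same.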
If an existing dictionary does not give the results you expect, you can tune the dictionary yourself:

- adding new words
- adjusting how text is segmented
```shell
# Convert the encoding of the source files to UTF-8
$ nkf --overwrite -Ew ./mecab-ipadic-2.7.0-20070801/*

# Build the dictionary
$ mkdir build
$ $(mecab-config --libexecdir)/mecab-dict-index -d ./mecab-ipadic-2.7.0-20070801 -o build -f utf8 -t utf8
$ cp mecab-ipadic-2.7.0-20070801/dicrc ./build/.  # copy dicrc
```
nkf is an abbreviation for "Network Kanji Filter".
To add a new word, create a CSV file in the source file directory:
```
# surface form, left context ID, right context ID, occurrence cost, part of speech,
# POS subcategory 1, POS subcategory 2, POS subcategory 3,
# conjugation type, conjugation form, base form, reading, pronunciation

# Example: adding 自然言語処理 ("natural language processing")
自然言語処理,1288,1288,0,名詞,固有名詞,一般,*,*,*,自然言語処理,シゼンゲンゴショリ,シゼンゲンゴショリ
```
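As a quick sanity check before rebuilding, an entry can be parsed into the 13 fields listed above. The field names here are my own English labels for the IPAdic-format columns, not identifiers MeCab itself uses:

```python
import csv
import io

# The 13 columns of an IPAdic-format entry, in order (labels are my own).
FIELDS = [
    "surface", "left_context_id", "right_context_id", "cost",
    "pos", "pos_sub1", "pos_sub2", "pos_sub3",
    "conj_type", "conj_form", "base_form", "reading", "pronunciation",
]

line = "自然言語処理,1288,1288,0,名詞,固有名詞,一般,*,*,*,自然言語処理,シゼンゲンゴショリ,シゼンゲンゴショリ"
row = next(csv.reader(io.StringIO(line)))
assert len(row) == len(FIELDS)  # an entry with the wrong arity would fail the dictionary build
entry = dict(zip(FIELDS, row))
print(entry["surface"], entry["pos"], entry["cost"])  # → 自然言語処理 名詞 0
```

After placing the CSV alongside the other source files, re-run the `mecab-dict-index` build step above so the new word is compiled into the binary dictionary.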
However, when modifying costs as above, **be aware that it may affect results beyond the part you intended**.
There are apparently methods to adjust costs automatically, but **manually adjusting only the cost of the part you want to fix seems to keep the range of side effects small**.
Get an overview of morphological analyzers other than MeCab.
Morphological analyzer | Contents |
---|---|
MeCab | ・Dictionary-based ・The dictionary holds word information, occurrence costs, and connection costs ・Fast execution ・The dictionary is an external file, so it can be customized as needed |
JUMAN++ | ・A relatively new morphological analyzer that uses a neural network ・Considers not only grammatical correctness but also word meaning ・Considers the information of all words preceding a given word ・Handles spelling variation ・Has many advantages over MeCab, but execution speed is slower |
KyTea (pronounced "cutie") | ・Uses an SVM to predict, from the surrounding characters, whether a word boundary falls between each pair of adjacent characters ・A Python wrapper is provided by a third party |
Janome | ・Written in pure Python ・Bundles the IPA dictionary and provides an API usable from Python ・Execution speed is slow ・Dictionary options are limited |
SudachiPy | ・A Python binding of Sudachi, a morphological analyzer for Java ・As of May 2019, an official release was still pending (has it been released by now?) |
Esanpy (Kuromoji) | ・Kuromoji is a morphological analyzer implemented in Java ・Use it from Python via Esanpy ・Esanpy is a text-analysis library that internally uses Elasticsearch (a full-text search engine) |