[PYTHON] About cost calculation of MeCab

Introduction

I studied how MeCab calculates costs and summarized what I learned here. Please point out anything that is wrong.

Overview of MeCab morphological analysis

MeCab performs morphological analysis using a registered dictionary. When the input contains a word that is not registered in the dictionary (an unknown word), MeCab considers the possible ways of dividing it and scores each division by the costs of its words. The division with **the lowest total cost** is output as the result.

Trying it out

This time, I use the ipadic-neologd dictionary to check how the fictitious word "American German Village" is morphologically analyzed.


echo American German Village|mecab -d C:\neologd -N2

America noun,Proper noun,area,Country,*,*,America,America,America
Germany noun,Proper noun,area,Country,*,*,Germany,Germany,Germany
village noun,suffix,area,*,*,*,village,village,village
EOS

America noun,Proper noun,area,Country,*,*,America,America,America
German village noun,Proper noun,General,*,*,*,German village,German village,German village
EOS

The -d option specifies the dictionary, and the -N option (here -N2) sets how many candidates to list, best first. In this way, two ways of dividing the unknown word were listed as candidates. Both divisions look quite convincing, don't they?

MeCab dictionary

Before getting into the cost calculation, let me explain how words are registered in a MeCab dictionary. Each dictionary entry is written in the following format:

Surface form,Left context ID,Right context ID,Cost,Part of speech,Part-of-speech subcategory 1,Part-of-speech subcategory 2,Part-of-speech subcategory 3,Inflected form,Inflection type,Base form,Reading,Pronunciation

The entries are saved as a CSV file, and the dictionary is then built from that file. This format contains some unfamiliar terms:

・ **Left context ID**
・ **Right context ID**
・ **Cost**

These are the values MeCab uses for morphological analysis.
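To make the field layout concrete, here is a minimal Python sketch that splits one dictionary line into named fields. The field names and the sample line are my own assumptions for illustration: the line is pieced together from the translated MeCab output in this article and is not copied from the real dictionary file.

```python
import csv
from io import StringIO

# Field names are illustrative labels for the 13 columns described above.
FIELDS = [
    "surface", "left_id", "right_id", "cost",
    "pos", "pos_sub1", "pos_sub2", "pos_sub3",
    "inflected_form", "inflection_type",
    "base_form", "reading", "pronunciation",
]

def parse_entry(line):
    """Split one dictionary CSV line into a dict keyed by FIELDS."""
    row = next(csv.reader(StringIO(line)))
    entry = dict(zip(FIELDS, row))
    # The context IDs and the cost are integers; everything else stays text.
    for key in ("left_id", "right_id", "cost"):
        entry[key] = int(entry[key])
    return entry

# Hypothetical sample line (assumption, not taken from the actual dictionary).
sample_line = (
    "German village,1288,1288,611,Noun,Proper noun,General,*,*,*,"
    "German village,German village,German village"
)
entry = parse_entry(sample_line)
print(entry["surface"], entry["left_id"], entry["right_id"], entry["cost"])
```

Only the three numeric columns matter for the cost calculation; the remaining columns are the features that MeCab prints after the surface form.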

Occurrence cost

The **occurrence cost** expresses **how unlikely the word itself is to appear**. The higher the value, the less likely the word is to appear. It is the **cost** value of the dictionary entry described above. So what are the occurrence costs for "American German Village"?


echo American German Village|mecab -F "%m,%c,\n" -d C:\neologd -N2

America,4698,
Germany,2543,
village,8707,
EOS

America,4698,
German village,611,
EOS

%m displays the surface form and %c displays the occurrence cost. One question arises here. Judging from these values alone, the total cost is obviously lower for the second candidate, yet the first candidate is the one output first. The reason is the existence of another cost, the **connection cost**.
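As a quick check of the comparison above, the occurrence costs reported by %c can be summed per candidate. This is only arithmetic on the printed values; the variable names are my own.

```python
# Occurrence costs (%c) of each morpheme, taken from the output above.
candidates = {
    "America / Germany / village": [4698, 2543, 8707],
    "America / German village": [4698, 611],
}

# Total occurrence cost per candidate, ignoring connection costs.
totals = {name: sum(costs) for name, costs in candidates.items()}

for name, total in totals.items():
    print(f"{name}: {total}")
```

With occurrence costs alone, the second candidate totals 5309 against 15948 for the first, so something else must be tipping the balance.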

Connection cost

The **connection cost** expresses how unlikely two words are to appear next to each other, based on their context IDs. The smaller the value, the more easily the two words connect. The context IDs are the **left context ID and right context ID** of the dictionary entry. In most entries, the two IDs seem to be registered with the same value. For example, consider the word "before and after". The context ID of "before" is 1314 and the context ID of "after" is 1313. The connection cost is determined by the combination of the right context ID of the preceding word and the left context ID of the following word. The list of combinations can be found in matrix.def (or matrix.bin) in MeCab\dic\ipadic. Looking at it, we find:

1314 1313 -316
1313 1314 716

Going from "before" to "after", the connection cost is low (-316), so the two connect easily. Going from "after" to "before", the connection cost is high (716), so they rarely connect. I think this is also quite convincing. Now let's take a look at "American German Village".
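The lookup described above can be sketched in Python as follows. The function name is my own, and the sketch assumes the plain-text matrix.def layout of one "preceding ID, following ID, cost" triple per line, with a shorter size header at the top of the file. Only the two lines quoted above are used as input.

```python
def parse_matrix_def(lines):
    """Parse matrix.def-style lines into {(prev_right_id, next_left_id): cost}."""
    costs = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            # Skip blank lines and the size header, which has fewer than 3 fields.
            continue
        prev_right_id, next_left_id, cost = map(int, fields)
        costs[(prev_right_id, next_left_id)] = cost
    return costs

# The two lines quoted above ("before" is 1314, "after" is 1313).
sample = ["1314 1313 -316", "1313 1314 716"]
connection_cost = parse_matrix_def(sample)

print(connection_cost[(1314, 1313)])  # "before" followed by "after"
print(connection_cost[(1313, 1314)])  # "after" followed by "before"
```

To use it on a real file, the lines of matrix.def could be passed in instead of the sample list, for example through open() with the appropriate path and encoding.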


echo American German Village|mecab  -F"%m,%phl,%phr,%c,%pc,%pn\n" -d C:\neologd -N2

America,1294,1294,4698,3746,3746
Germany,1294,1294,2543,-141,-3887
village,1303,1303,8707,881,1022
EOS

America,1294,1294,4698,3746,3746
German village,1288,1288,611,2614,-1132
EOS

The MeCab output format specifiers used here can be summarized as follows.

| Specifier | Description |
|---|---|
| %m | Surface form |
| %phl | Left context ID |
| %phr | Right context ID |
| %c (or %pw) | Occurrence cost |
| %pc | Connection cost + occurrence cost (cumulative from the beginning of the sentence) |
| %pn | Connection cost + occurrence cost (for that morpheme alone, %pw + %pC) |

All format specifiers are listed at https://taku910.github.io/mecab/format.html. Since the raw output is hard to read, I will also put it in a table.

| Candidate | Surface form | Left context ID | Right context ID | Occurrence cost | Connection + occurrence (cumulative) | Connection + occurrence (alone) |
|---|---|---|---|---|---|---|
| 1 | America | 1294 | 1294 | 4698 | 3746 | 3746 |
| 1 | Germany | 1294 | 1294 | 2543 | -141 | -3887 |
| 1 | village | 1303 | 1303 | 8707 | 881 | 1022 |
| 2 | America | 1294 | 1294 | 4698 | 3746 | 3746 |
| 2 | German village | 1288 | 1288 | 611 | 2614 | -1132 |
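Since %pn is the occurrence cost plus the connection cost to the previous morpheme, the connection cost of each row can be recovered as %pn minus %c. The following Python sketch does that for the values in the table and also checks that the running sum of %pn matches %pc. The variable names are my own.

```python
# (surface, %c, %pc, %pn) for each morpheme of the two candidates above.
candidates = [
    [("America", 4698, 3746, 3746),
     ("Germany", 2543, -141, -3887),
     ("village", 8707, 881, 1022)],
    [("America", 4698, 3746, 3746),
     ("German village", 611, 2614, -1132)],
]

derived = []
for rows in candidates:
    running = 0
    for surface, occurrence, cumulative, alone in rows:
        # Connection cost to the previous morpheme (BOS for the first one).
        connection = alone - occurrence
        running += alone
        # The cumulative column should equal the running sum of %pn.
        assert running == cumulative
        derived.append((surface, connection))
        print(f"{surface}: connection cost {connection}")
```

The large negative connection costs are what pull the three-morpheme candidate below the two-morpheme one, despite its much higher occurrence costs.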

Note that **BOS and EOS are also given context IDs**. So the connection cost of the first "America" is given by this line of matrix.def:

0 1294 -952

Therefore, its cumulative cost is 4698 - 952 = 3746. Next, let's look at "Germany". Its left and right context IDs are both 1294, and the connection cost from 1294 to 1294 is -6430, which is quite small. (I would have thought country names rarely appear in a row ...) The cumulative cost is (3746 + 2543) - 6430 = -141, and the cost of the morpheme alone is 2543 - 6430 = -3887, both consistent with the output. Also, although it is not shown in the output, a final connection to EOS is added, because EOS also has a context ID. The connection cost of context ID 1303 → 0 is 5, and the connection cost of context ID 1288 → 0 is -919. The final cumulative costs are therefore 881 + 5 = 886 for the first candidate and 2614 - 919 = 1695 for the second. The first candidate has the lower total cost, which solves the mystery mentioned earlier.
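The whole calculation, including the final connection to EOS, can be reproduced with a short Python sketch. The function and variable names are my own. The connection costs for 1294 → 1303 and 1294 → 1288 are not quoted in this article; they are derived from the MeCab output as %pn minus %c, so treat them as assumptions rather than values read from matrix.def.

```python
BOS_EOS_ID = 0  # BOS and EOS both use context ID 0 here

# (right context ID of previous, left context ID of next) -> connection cost
connection_cost = {
    (0, 1294): -952,
    (1294, 1294): -6430,
    (1294, 1303): -7685,  # derived: 1022 - 8707 (assumption, not read from matrix.def)
    (1294, 1288): -1743,  # derived: -1132 - 611 (assumption, not read from matrix.def)
    (1303, 0): 5,
    (1288, 0): -919,
}

def path_cost(morphemes):
    """Total cost of one candidate: BOS, each morpheme in order, then EOS."""
    total = 0
    prev_right_id = BOS_EOS_ID
    for surface, left_id, right_id, occurrence in morphemes:
        total += connection_cost[(prev_right_id, left_id)] + occurrence
        prev_right_id = right_id
    # Final connection from the last morpheme to EOS.
    total += connection_cost[(prev_right_id, BOS_EOS_ID)]
    return total

first = [("America", 1294, 1294, 4698),
         ("Germany", 1294, 1294, 2543),
         ("village", 1303, 1303, 8707)]
second = [("America", 1294, 1294, 4698),
          ("German village", 1288, 1288, 611)]

print(path_cost(first), path_cost(second))
```

This only scores two given candidates. MeCab itself searches over all possible divisions for the lowest-cost path, which this sketch does not attempt.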

References

・ Understand MeCab cost calculation.
・ Cost calculation of MeCab learned at NTV Tokyo
・ Peek behind the scenes of Japanese morphological analysis! How MeCab Parses Morphological Analysis
