Thorough comparison of three Python morphological analysis libraries

Introduction

Many morphological analysis tools are available, and it pays to understand the characteristics of each before choosing one.

In this post, I compare three morphological analysis tools that can be used from Python.

MeCab

  • Parameters are estimated with CRFs (conditional random fields)
  • Both accuracy and execution speed are high, so for ordinary use MeCab is the one to pick. The library is a little heavy, though.
In[1]: import MeCab
In[2]: mecab = MeCab.Tagger()
In[3]: %time print(mecab.parse("Apples have proven to have a very positive effect on the human body"))
Apple noun,General,*,*,*,*,Apple,Apple,Apple
Is a particle,Particle,*,*,*,*,Is,C,Wow
Human noun,General,*,*,*,*,Human,Ningen,Ningen
Particles,Attributive,*,*,*,*,of,No,No
Body noun,General,*,*,*,*,body,Shintai,Shintai
Particles for,Case particles,Collocation,*,*,*,for,Nitotte,Nitotte
Very noun,Adjectival noun stem,*,*,*,*,very,Taihen,Taihen
Good adjective,Independence,*,*,Adjective, Auoudan,Uninflected word,good,Yoi,Yoi
Effect noun,General,*,*,*,*,effect,Kouka,Coca
Is a particle,Case particles,General,*,*,*,But,Moth,Moth
A verb,Independence,*,*,Five steps, La line,Uninflected word,is there,Al,Al
That noun,Non-independent,General,*,*,*,thing,Things,Things
Is a particle,Case particles,General,*,*,*,But,Moth,Moth
Proof noun,Change connection,*,*,*,*,Proof,Risho,Richaud
Sa verb,Independence,*,*,Sahen Suru,Rel connection,To do,Service,Service
Re verb,suffix,*,*,One step,Continuous form,To be,Re,Re
Particles,Connection particle,*,*,*,*,hand,Te,Te
Verb,Non-independent,*,*,One step,Continuous form,Is,I,I
Auxiliary verb,*,*,*,Special / mass,Uninflected word,Masu,trout,trout
EOS

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 240 µs
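
The string returned by `parse` has one morpheme per line, with the surface form, a tab, and then comma-separated features. A minimal sketch of turning that text into tokens follows. It assumes the default IPADIC-style layout, where the first feature is the part of speech (written `名詞` for nouns in the untranslated output); `Token`, `parse_mecab_output`, `nouns`, and `SAMPLE_OUTPUT` are names made up for this illustration, and the sample is a short hand-copied excerpt rather than the output shown above.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    surface: str         # the morpheme as it appears in the text
    features: List[str]  # comma-separated fields that follow the tab


def parse_mecab_output(output: str) -> List[Token]:
    """Split MeCab's default text output into (surface, features) tokens."""
    tokens = []
    for line in output.splitlines():
        if not line or line == "EOS":  # skip blank lines and the end marker
            continue
        surface, _, feature_str = line.partition("\t")
        tokens.append(Token(surface, feature_str.split(",")))
    return tokens


def nouns(tokens: List[Token]) -> List[str]:
    """Keep only surfaces whose first feature (part of speech) is a noun."""
    return [t.surface for t in tokens if t.features[0] == "名詞"]


# Hand-copied sample in the IPADIC layout (assumption: default dictionary).
SAMPLE_OUTPUT = (
    "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ\n"
    "も\t助詞,係助詞,*,*,*,*,も,モ,モ\n"
    "もも\t名詞,一般,*,*,*,*,もも,モモ,モモ\n"
    "EOS\n"
)

if __name__ == "__main__":
    print(nouns(parse_mecab_output(SAMPLE_OUTPUT)))
```

In a real session the argument would be `mecab.parse(text)` instead of the sample string. A different dictionary may use a different number or order of features, so the indices should be checked against its own output.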

Juman

  • Morphemes are identified using heuristics
  • Accuracy is high, and because the representative notation (a canonical spelling) of each morpheme is shown, it is well suited to analyzing text with a lot of spelling variation, such as Twitter.
In[1]: import cJuman
In[2]: cJuman.init(['-B', '-e2'])
In[3]: %time print(cJuman.parse_opt(["Apples have proven to have a very positive effect on the human body"], cJuman.SKIP_NO_RESULT))
Apple apple apple apple noun 6 appellative 1* 0 * 0 "Representative notation:Apple/Apple category:plant;Artificial object-Food domain:Cooking / meal"
Hahaha particle 9 particle 2* 0 * 0 NIL
Human human human noun 6 appellative 1* 0 * 0 "Representative notation:Human/Human category:Man"
Nono particle 9 Conjunctive particle 3* 0 * 0 NIL
Body Shintai Body Noun 6 Appellative 1* 0 * 0 "Representative notation:body/Shintai category:animal"
Ni ni ni ni particle 9 case particle 1* 0 * 0 NIL
To take verb 2*0 Consonant verb La line 10 Ta system continuous use Te form 14"Representative notation:take/Attached verb candidate to take (basic)"
@To take verb 2*0 Consonant verb La line 10 Ta system continuous use Te form 14"Representative notation:take/Domain to take:Political transitive verb:Self:Can be taken/Can be taken"
@To take verb 2*0 Consonant verb La line 10 Ta system continuous use Te form 14"Representative notation:take/Take self-transitive verb:Self:Be caught/Can be taken"
@To take verb 2*0 Consonant verb La line 10 Ta system continuous use Te form 14"Representative notation:take/Take self-transitive verb:Self:Can be harvested/Can be taken"
@To take verb 2*0 Consonant verb La line 10 Ta system continuous use Te form 14"Representative notation:Take/Domain to take:Cooking / meal self-transitive verb:Self:Can be taken/Can be taken"
@To take verb 2*0 Consonant verb La line 10 Ta system continuous use Te form 14"Representative notation:take/Domain to take:Culture / art Transitive verb:Self:Can be taken/Can be taken"
@To take verb 2*0 Consonant verb La line 10 Ta system continuous use Te form 14"Representative notation:Steal/Take"
Very very very very adverb 8* 0 * 0 * 0 "Representative notation:very/Very much"
@Very very hard adjective 3*0 na adjective 21 stem 1"Representative notation:It's hard/It's hard"
Good good good good adjective 3*0 adjective Auo Dan 18 Uninflected Word 2"Representative notation:good/Good rebellion:adjective:bad/Bad"
Effect Koka Effect Noun 6 Appellative 1* 0 * 0 "Representative notation:effect/Koka category:Abstract"
Gaga gaga particle 9 case particle 1* 0 * 0 NIL
There is there there is a verb 2*0 Consonant verb La line 10 Uninflected word 2"Representative notation:Yes/A supplementary sentence:adjective:No/Absent"
Koto Koto Koto Noun 6 Formal Noun 8* 0 * 0 NIL
Gaga gaga particle 9 case particle 1* 0 * 0 NIL
Prove Proof Noun 6 Sahen Noun 2* 0 * 0 "Representative notation:Proof/Risho category:Abstract domain:Politics"
Verb 2*0 s-irregular verb 16 imperfect form 3"Representative notation:To do/To do Attached verb candidate (basic) self-transitive verb:Self:Become/Become"
Suffix 14 Verb Suffix 7 Vowel verb 1 Ta system continuous te form 14"Representative notation:To be/To be"
Suffix 14 Verb Suffix 7 Vowel Suffix 1 Basic continuous form 8"Representative notation:Is/Is"
More and more suffixes 14 verbs suffixes 7 verbs suffixes type 31 uninflected word 2"Representative notation:Masu/Masu"
EOS

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 976 µs
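
To use the representative notation for normalizing spelling variants, it has to be pulled out of the quoted field at the end of each line. The sketch below is one way to do that with a regular expression. It assumes the untranslated output labels the field `代表表記:` (shown above as "Representative notation:") in the form `notation/reading`, and that alternative analyses start with `@`. The function name and the sample lines are made up for illustration and are not copied from a real run.

```python
import re
from typing import List, Tuple

REP_LABEL = "代表表記"  # assumed label of the representative notation field


def representative_notations(juman_output: str,
                             label: str = REP_LABEL) -> List[Tuple[str, str]]:
    """Return (surface, representative notation) pairs for the first analysis
    of each morpheme. Lines without the field fall back to the surface form."""
    pattern = re.compile(re.escape(label) + r":(\S+?)/")
    pairs = []
    for line in juman_output.splitlines():
        # Skip blanks, the end marker, and '@' alternative analyses.
        if not line.strip() or line == "EOS" or line.startswith("@"):
            continue
        surface = line.split(" ", 1)[0]
        match = pattern.search(line)
        pairs.append((surface, match.group(1) if match else surface))
    return pairs


# Illustrative lines in the assumed layout (hypothetical, not a real run).
SAMPLE_OUTPUT = (
    'りんご りんご りんご 名詞 6 普通名詞 1 * 0 * 0 "代表表記:林檎/りんご"\n'
    "は は は 助詞 9 副助詞 2 * 0 * 0 NIL\n"
    "EOS\n"
)

if __name__ == "__main__":
    for surface, notation in representative_notations(SAMPLE_OUTPUT):
        print(surface, "->", notation)
```

Mapping every spelling of a word to the same notation before counting is what makes this field useful for noisy text.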

CaboCha

  • Performs dependency parsing using an SVM (support vector machine)
  • If you use its output as training data for generating sentences automatically with Markov chains, you can probably do more interesting things than with an ordinary morphological analysis tool such as MeCab (just a hunch).
In[1]: import CaboCha
In[2]: cabocha = CaboCha.Parser()
In[3]: %time print(cabocha.parseToString("Apples have proven to have a very positive effect on the human body"))
Apples---------------D
Human-D           |
For the body-------D   |
very-D   |   |
good-D |   |
The effect is-D   |
is there-D |
That-D
Proven
EOS

CPU times: user 882 µs, sys: 84 µs, total: 966 µs
Wall time: 917 µs
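
The Markov-chain idea mentioned in the list above could be sketched roughly as follows: treat each chunk (the units on the left of the tree) as one state rather than each morpheme, and record which chunk follows which. This is only a first-order chain written for illustration, not something CaboCha provides. The corpus is hand-written English placeholder data, and `build_transitions`, `generate`, `START`, and `END` are hypothetical names.

```python
import random
from collections import defaultdict
from typing import Dict, List

START, END = "<s>", "</s>"  # hypothetical sentence boundary markers


def build_transitions(sentences: List[List[str]]) -> Dict[str, List[str]]:
    """Record, for every chunk, the chunks observed directly after it."""
    transitions = defaultdict(list)
    for chunks in sentences:
        path = [START] + list(chunks) + [END]
        for current, following in zip(path, path[1:]):
            transitions[current].append(following)
    return transitions


def generate(transitions: Dict[str, List[str]], rng: random.Random,
             max_chunks: int = 20) -> List[str]:
    """Walk the chain from START until END or the length limit is reached."""
    output = []
    state = START
    for _ in range(max_chunks):
        state = rng.choice(transitions[state])
        if state == END:
            break
        output.append(state)
    return output


# Placeholder corpus: each sentence is a list of chunks (hand-written).
CORPUS = [
    ["apples are", "for the body", "good"],
    ["walking is", "for the body", "good"],
]

if __name__ == "__main__":
    chain = build_transitions(CORPUS)
    print(" ".join(generate(chain, random.Random(0))))
```

Because chunks are longer than morphemes, the generated text tends to stay locally grammatical, at the cost of needing more training sentences before the chain produces anything new.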

The lattice format below is also available, which makes the dependency result easy to process from Python code. It is slower, though.

In[4]: %time print(cabocha.parse("Apples have proven to have a very positive effect on the human body").toString(CaboCha.FORMAT_LATTICE))
* 0 8D 0/1 -2.111879
Apple noun,General,*,*,*,*,Apple,Apple,Apple
Is a particle,Particle,*,*,*,*,Is,C,Wow
* 1 2D 0/1 1.635242
Human noun,General,*,*,*,*,Human,Ningen,Ningen
Particles,Attributive,*,*,*,*,of,No,No
* 2 6D 0/1 1.318492
Body noun,General,*,*,*,*,body,Shintai,Shintai
Particles for,Case particles,Collocation,*,*,*,for,Nitotte,Nitotte
* 3 4D 0/0 0.781377
Very noun,Adjectival noun stem,*,*,*,*,very,Taihen,Taihen
* 4 5D 0/0 1.810798
Good adjective,Independence,*,*,Adjective, Auoudan,Uninflected word,good,Yoi,Yoi
* 5 6D 0/1 2.448702
Effect noun,General,*,*,*,*,effect,Kouka,Coca
Is a particle,Case particles,General,*,*,*,But,Moth,Moth
* 6 7D 0/0 2.151727
A verb,Independence,*,*,Five steps, La line,Uninflected word,is there,Al,Al
* 7 8D 0/1 -2.111879
That noun,Non-independent,General,*,*,*,thing,Things,Things
Is a particle,Case particles,General,*,*,*,But,Moth,Moth
* 8 -1D 1/5 0.000000
Proof noun,Change connection,*,*,*,*,Proof,Risho,Richaud
Sa verb,Independence,*,*,Sahen Suru,Rel connection,To do,Service,Service
Re verb,suffix,*,*,One step,Continuous form,To be,Re,Re
Particles,Connection particle,*,*,*,*,hand,Te,Te
Verb,Non-independent,*,*,One step,Continuous form,Is,I,I
Auxiliary verb,*,*,*,Special / mass,Uninflected word,Masu,trout,trout
EOS

CPU times: user 1.29 ms, sys: 101 µs, total: 1.39 ms
Wall time: 1.91 ms
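
In the lattice format, each line that starts with `*` opens a chunk and gives the chunk number, the number of the chunk it depends on followed by `D` (`-1` for the root), the head/function word positions, and a score. The lines after it are that chunk's morphemes in MeCab's layout. A minimal sketch of reading this into Python structures is shown below. It assumes the untranslated output separates surface and features with a tab; the function names and the sample text are made up for illustration, and the scores in the sample are placeholders.

```python
from typing import Dict, List, Tuple


def parse_lattice(lattice: str) -> List[Dict]:
    """Group morphemes into chunks and record each chunk's head chunk id."""
    chunks = []
    for line in lattice.splitlines():
        if not line or line == "EOS":
            continue
        if line.startswith("* "):
            # Chunk header: "* <id> <head>D <head/func positions> <score>"
            _, chunk_id, head, _positions, score = line.split(" ")
            chunks.append({
                "id": int(chunk_id),
                "head": int(head.rstrip("D")),  # -1 means no head (root)
                "score": float(score),
                "tokens": [],
            })
        else:
            # Morpheme line: keep only the surface form before the tab.
            chunks[-1]["tokens"].append(line.split("\t", 1)[0])
    return chunks


def dependency_pairs(chunks: List[Dict]) -> List[Tuple[str, str]]:
    """Return (modifier chunk text, head chunk text) for every non-root chunk."""
    text = {c["id"]: "".join(c["tokens"]) for c in chunks}
    return [(text[c["id"]], text[c["head"]]) for c in chunks if c["head"] != -1]


# Hand-written sample in the lattice layout (scores are placeholders).
SAMPLE_LATTICE = (
    "* 0 1D 0/1 0.000000\n"
    "太郎\t名詞,固有名詞,人名,名,*,*,太郎,タロウ,タロー\n"
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n"
    "* 1 -1D 0/0 0.000000\n"
    "走る\t動詞,自立,*,*,五段・ラ行,基本形,走る,ハシル,ハシル\n"
    "EOS\n"
)

if __name__ == "__main__":
    print(dependency_pairs(parse_lattice(SAMPLE_LATTICE)))
```

With the real parser, the input would be the string returned by `toString(CaboCha.FORMAT_LATTICE)`.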

Finally

Besides these three, many other morphological analysis tools can be used from Python, such as KyTea, igo-python, ChaSen, and KAKASI. I hope you will get to know the characteristics of each and choose the right one for each case.
