Morphological analysis using Igo + mecab-ipadic-neologd in Python (with Ruby bonus)

Introduction

In this post I do morphological analysis with Python, using "Igo" as the morphological analyzer and "mecab-ipadic-neologd" as the dictionary.

Morphological analysis means splitting text into words ("word segmentation") and assigning a part of speech to each word ("part-of-speech tagging").

First, I will explain Igo and mecab-ipadic-neologd, and then set them up so that they can actually be used from Python.

What is Igo?

MeCab is well known as a morphological analysis engine, so many of you may have heard of it. This time I will use Igo instead. MeCab is written in C++, while Igo is written in Java. Igo seems to be designed to return the same analysis results as MeCab, and it appears to run at about the same speed. MeCab requires building a native binary, which can make it harder to use, so I decided to use Igo.

What is mecab-ipadic-neologd?

An important element of morphological analysis is the dictionary. The article linked below explains how morphological analysis works (for MeCab in this case), for your reference: "Peek behind the scenes of Japanese morphological analysis! How MeCab Parses Morphological Analysis"

Given a sentence such as "living in Tokyo", you can see that it cannot be analyzed well unless each word in it is registered in the dictionary.

Also, for example, whether "Sazae-san" is recognized as "Sazae-san (title)" or as "Sazae (personal name) + san (honorific)" is determined by whether "Sazae-san" is in the dictionary. So for a sentence like "Did you see Sazae-san yesterday? It was interesting.", it matters that "Sazae-san" is registered in the dictionary.
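To make the role of the dictionary concrete, here is a minimal sketch in Python. It is a toy greedy longest-match segmenter, not how MeCab or Igo actually work (they build a lattice of candidates and choose the lowest-cost path). The function name `segment`, the dictionary contents, and the part-of-speech labels are all made up for illustration.

```python
# Toy illustration only: real analyzers such as MeCab and Igo build a lattice
# of candidates and pick the lowest-cost path, not a greedy longest match.

def segment(text, dictionary):
    """Greedy longest-match segmentation against a {surface: part_of_speech} dict."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for j in range(len(text), i, -1):
            word = text[i:j]
            if word in dictionary:
                tokens.append((word, dictionary[word]))
                i = j
                break
        else:
            # No entry starts here: emit the single character as unknown.
            tokens.append((text[i], "unknown"))
            i += 1
    return tokens


# Hypothetical dictionaries: one without and one with the title entry.
base = {"Sazae": "noun (personal name)", "-san": "suffix (honorific)"}
with_title = {**base, "Sazae-san": "noun (title)"}

print(segment("Sazae-san", base))
print(segment("Sazae-san", with_title))
```

With only the base entries the text comes back as two tokens (name plus honorific). Once "Sazae-san" itself is registered it comes back as a single title token, which is the effect a richer dictionary has on the analysis.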

For Igo, the usual approach is to build MeCab's IPA dictionary for Igo, but that dictionary lacks many proper nouns. mecab-ipadic-neologd fills that gap: it contains entries such as proper nouns that the standard MeCab dictionary does not cover. In addition, the dictionary is updated twice a week to pick up new words, which is useful.

Setup

$ pip install igo-python

You can now use Igo from Python.

Clone neologd

$ git clone https://github.com/neologd/mecab-ipadic-neologd

Since neologd is a dictionary for MeCab, it has to be built for Igo before Igo can use it.

For details on neologd, see its project page: mecab-ipadic-neologd

Build dictionary for Igo

Download `igo-0.4.5.jar` from the Igo page. It is used below to build the neologd dictionary for Igo.

Go to the mecab-ipadic-neologd directory and run:

$ bin/install-mecab-ipadic-neologd

This will create a build directory.

Inside the build directory, there should be a directory like `mecab-ipadic-2.7.0-20070801-neologd-20160826` (the numbers may differ). Copy the downloaded `igo-0.4.5.jar` into that directory. Then enter that directory and run the following command to build the dictionary.

$ java -cp igo-0.4.5.jar net.reduls.igo.bin.BuildDic neologd . "utf-8"

The general form is:

java -cp igo-0.4.5.jar net.reduls.igo.bin.BuildDic (dictionary output directory name) (extracted MeCab dictionary directory name) (dictionary character encoding)

Change the dictionary output directory name if necessary; I named it neologd here. The extracted MeCab dictionary directory is the directory that was just built, mecab-ipadic-2.7.0-20070801-neologd-20160826 in this case. Since the command is run from inside that directory, it is given as `.`.

If you get the following error:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

add -Xmx1024m to raise the Java heap limit, as shown below.

$ java -Xmx1024m -cp igo-0.4.5.jar net.reduls.igo.bin.BuildDic neologd . "utf-8"

If that still fails, try a value larger than 1024.
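If you prefer to script this step, the following is a rough sketch in Python. It assumes the layout described above (a `build` directory containing a `mecab-ipadic-*-neologd-*` directory with `igo-0.4.5.jar` copied into it). The helper names `find_dictionary_dir`, `build_command`, and `run_build` are my own and are not part of Igo or neologd; the command line itself is the one shown above.

```python
import glob
import os
import subprocess


def find_dictionary_dir(build_root="build"):
    """Return the last mecab-ipadic-*-neologd-* directory (sorted by name), or None."""
    pattern = os.path.join(build_root, "mecab-ipadic-*-neologd-*")
    matches = sorted(p for p in glob.glob(pattern) if os.path.isdir(p))
    return matches[-1] if matches else None


def build_command(jar="igo-0.4.5.jar", out_dir="neologd", src_dir=".",
                  encoding="utf-8", heap_mb=None):
    """Assemble the BuildDic command line; heap_mb adds -Xmx<heap_mb>m."""
    cmd = ["java"]
    if heap_mb is not None:
        cmd.append(f"-Xmx{heap_mb}m")
    cmd += ["-cp", jar, "net.reduls.igo.bin.BuildDic", out_dir, src_dir, encoding]
    return cmd


def run_build(dic_dir, heap_mb=1024):
    """Run the build inside dic_dir, where the jar is assumed to have been copied."""
    subprocess.run(build_command(heap_mb=heap_mb), cwd=dic_dir, check=True)


# Show the commands that would be executed, without running Java.
print(" ".join(build_command()))
print(" ".join(build_command(heap_mb=2048)))
```

To actually build, call `run_build(find_dictionary_dir())` from the mecab-ipadic-neologd directory, and pass a larger `heap_mb` if the heap error appears again.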

Checking that it works

In the directory that contains the compiled neologd dictionary, try running:

$ python
>>> import igo
>>> tagger = igo.tagger.Tagger('neologd')
>>> for t in tagger.parse('Masahiro Nakai of SMAP revealed Shinichi Shinohara\'s past misunderstandings in "Masahiro Nakai\'s Mi Naru Library" (TV Asahi) broadcast on the 10th.'):
...   print(t)
...
surface:The 10th, feature:noun,Proper noun,General,*,*,*,The 10th,Junichi,Junichi, start=0
surface:broadcast, feature:noun,Change connection,*,*,*,*,broadcast,Housou,Horso, start=3
surface:of, feature:Particle,Attributive,*,*,*,*,of,No,No, start=5
surface: 「, feature:symbol,Open parentheses,*,*,*,*,「,「,「, start=6
surface:Masahiro Nakai's Mi Naru Library, feature:noun,Proper noun,General,*,*,*,Masahiro Nakai's Mi Naru Library,Masahiro Nakai's Minimal Library,Masahiro Nakai's Minimal Library, start=7
surface: 」, feature:symbol,Parentheses closed,*,*,*,*,」,」,」, start=19
surface: (, feature:symbol,Open parentheses,*,*,*,*,(,(,(, start=20
surface:TV asahi, feature:noun,Proper noun,Organization,*,*,*,TV asahi,TV asahi,TV asahi, start=21
surface:system, feature:noun,suffix,General,*,*,*,system,Kay,Kay, start=26
surface: ), feature:symbol,Parentheses closed,*,*,*,*,),),), start=27
surface:so, feature:Particle,Case particles,General,*,*,*,so,De,De, start=28
surface: 、, feature:symbol,Comma,*,*,*,*,、,、,、, start=29
surface: SMAP, feature:noun,Proper noun,Personal name,General,*,*,SMAP,SMAP,SMAP, start=30
surface:of, feature:Particle,Attributive,*,*,*,*,of,No,No, start=34
surface:Masahiro Nakai, feature:noun,Proper noun,Personal name,General,*,*,Masahiro Nakai,Nakai Masahiro,Nakai Masahiro, start=35
surface:But, feature:Particle,Case particles,General,*,*,*,But,Moth,Moth, start=39
surface: 、, feature:symbol,Comma,*,*,*,*,、,、,、, start=40
surface:Shinichi Shinohara, feature:noun,Proper noun,Personal name,General,*,*,Shinichi Shinohara,Shinohara Shinichi,Shinohara Shinichi, start=41
surface:of, feature:Particle,Attributive,*,*,*,*,of,No,No, start=45
surface:past, feature:noun,Adverbs possible,*,*,*,*,past,Kako,Kako, start=46
surface:of, feature:Particle,Attributive,*,*,*,*,of,No,No, start=48
surface:Misunderstanding, feature:noun,Change connection,*,*,*,*,Misunderstanding,Canchigai,Canchigai, start=49
surface:To, feature:Particle,Case particles,General,*,*,*,To,Wo,Wo, start=52
surface:Reveal, feature:verb,Independence,*,*,Godan / Sa line,Uninflected word,Reveal,Akas,Akas, start=53
surface:One act, feature:noun,General,*,*,*,*,One act,Hitomak,Hitomak, start=56
surface:But, feature:Particle,Case particles,General,*,*,*,But,Moth,Moth, start=58
surface:Ah, feature:verb,Independence,*,*,Godan / Ra line,Continuous connection,is there,Ah,Ah, start=59
surface:Ta, feature:Auxiliary verb,*,*,*,Special,Uninflected word,Ta,Ta,Ta, start=61
surface: 。, feature:symbol,Kuten,*,*,*,*,。,。,。, start=62

It worked! The example sentence is taken from "mecab-ipadic-NEologd: Neologism dictionary for MeCab". The start values are character offsets into the original Japanese sentence.
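Each feature string in the output above is a comma-separated list, so it is easy to split into fields and filter on them. Below is a small sketch in Python that does this. The `Morpheme` namedtuple is only a stand-in for the objects returned by `tagger.parse()`, which I am assuming expose `surface`, `feature`, and `start` as in the printed output. The field names in `FIELDS` are labels I chose for the nine IPADIC-style columns, and the sample uses the translated labels shown in this article (real output uses the Japanese tags).

```python
from collections import namedtuple

# Stand-in for the objects returned by tagger.parse(); assumed to expose
# surface, feature and start, as in the printed output above.
Morpheme = namedtuple("Morpheme", "surface feature start")

# Labels chosen here for the nine IPADIC-style feature columns.
FIELDS = ("pos", "pos_detail1", "pos_detail2", "pos_detail3",
          "conj_type", "conj_form", "base_form", "reading", "pronunciation")


def feature_dict(morpheme):
    """Split the comma-separated feature string into named fields."""
    values = morpheme.feature.split(",")
    # Some entries may carry fewer columns; pad so every key is present.
    values += ["*"] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def proper_nouns(morphemes):
    """Return the surfaces tagged as proper nouns."""
    result = []
    for m in morphemes:
        f = feature_dict(m)
        if f["pos"] == "noun" and f["pos_detail1"] == "Proper noun":
            result.append(m.surface)
    return result


# A few rows copied from the output above, using the translated labels.
sample = [
    Morpheme("SMAP", "noun,Proper noun,Personal name,General,*,*,SMAP,SMAP,SMAP", 30),
    Morpheme("of", "Particle,Attributive,*,*,*,*,of,No,No", 34),
    Morpheme("Masahiro Nakai",
             "noun,Proper noun,Personal name,General,*,*,"
             "Masahiro Nakai,Nakai Masahiro,Nakai Masahiro", 35),
]
print(proper_nouns(sample))
```

With a real tagger you would pass the result of `tagger.parse(...)` instead of `sample`, and compare against the Japanese tag strings rather than the translated ones.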

Other languages (Ruby edition)

Igo is also easy to use from other languages through ports such as `igo-php` and `igo-ruby`. I will try it in Ruby as well. Once the dictionary built above is ready, this is easy.

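$ gem install igo-ruby

Then run the following script in the directory that contains the neologd dictionary:

require 'igo-ruby'

tagger = Igo::Tagger.new('neologd')
t = tagger.parse('Masahiro Nakai of SMAP revealed Shinichi Shinohara\'s past misunderstandings in "Masahiro Nakai\'s Mi Naru Library" (TV Asahi) broadcast on the 10th.')

# Print the surface form, feature string and start offset of each morpheme.
t.each do |m|
  puts "#{m.surface} #{m.feature} #{m.start}"
end

Output:
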
The 10th noun,Proper noun,General,*,*,*,The 10th,Junichi,Junichi 0
broadcast noun,Change connection,*,*,*,*,broadcast,Housou,Horso 3
of Particle,Attributive,*,*,*,*,of,No,No 5
「Masahiro Nakai's Mi Naru Library」( noun,Proper noun,General,*,*,*,South Korea clinging to Chinese ass horse,Chugokunoshiriumanishigamitsukukankoku,Chugokunashiriumanishigamitsukukankoku 6
TV asahi noun,Proper noun,Organization,*,*,*,TV asahi,TV asahi,TV asahi 21
system noun,suffix,General,*,*,*,system,Kay,Kay 26
) symbol,Parentheses closed,*,*,*,*,),),) 27
so Particle,Case particles,General,*,*,*,so,De,De 28
、 symbol,Comma,*,*,*,*,、,、,、 29
SMAP noun,Proper noun,Personal name,General,*,*,SMAP,SMAP,SMAP 30
of Particle,Attributive,*,*,*,*,of,No,No 34
Masahiro Nakai noun,Proper noun,Personal name,General,*,*,Masahiro Nakai,Nakai Masahiro,Nakai Masahiro 35
But Particle,Case particles,General,*,*,*,But,Moth,Moth 39
、 symbol,Comma,*,*,*,*,、,、,、 40
Shinichi Shinohara noun,Proper noun,Personal name,General,*,*,Shinichi Shinohara,Shinohara Shinichi,Shinohara Shinichi 41
of Particle,Attributive,*,*,*,*,of,No,No 45
past noun,Adverbs possible,*,*,*,*,past,Kako,Kako 46
of Particle,Attributive,*,*,*,*,of,No,No 48
Misunderstanding noun,Change connection,*,*,*,*,Misunderstanding,Canchigai,Canchigai 49
To Particle,Case particles,General,*,*,*,To,Wo,Wo 52
Reveal verb,Independence,*,*,Godan / Sa line,Uninflected word,Reveal,Akas,Akas 53
One act noun,General,*,*,*,*,One act,Hitomak,Hitomak 56
But Particle,Case particles,General,*,*,*,But,Moth,Moth 58
Ah verb,Independence,*,*,Godan / Ra line,Continuous connection,is there,Ah,Ah 59
Ta Auxiliary verb,*,*,*,Special,Uninflected word,Ta,Ta,Ta 61
。 symbol,Kuten,*,*,*,*,。,。,。 62

Reference: kyow / igo-ruby

References

Use mecab-ipadic-neologd with igo-python
mecab-ipadic-NEologd : Neologism dictionary for MeCab
I tried using igo-python
