In this post I do morphological analysis with Python, using "Igo" as the morphological analyzer and "mecab-ipadic-neologd" as the dictionary.
Morphological analysis means splitting text into words and assigning a part of speech to each word.
First I explain Igo and mecab-ipadic-neologd, and then set them up so that they can actually be used from Python.
"MeCab" is well known as a morphological analysis engine, so many of you may have heard of it. This time I use "Igo" instead. MeCab is written in C++ and Igo in Java. Igo is designed to return the same analysis results as MeCab, and its speed appears to be about the same. MeCab can be cumbersome to use because you have to build its binary, so I decided to use Igo.
An important element of morphological analysis is the "dictionary". The article below explains how morphological analysis works (for MeCab in this case), for your reference: "Peek behind the scenes of Japanese morphological analysis! How MeCab Parses Morphological Analysis"
Given a sentence like "living in Tokyo", you can see that it cannot be analyzed well unless each of its component words is registered in the dictionary.
Similarly, whether "Sazae-san" is recognized as "Sazae-san (title)" or as "Sazae (personal name) + san (honorific)" depends on whether "Sazae-san" is in the dictionary. So for a sentence like "Did you see Sazae-san yesterday? It was interesting.", it matters that "Sazae-san" is registered as a title.
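To make the role of the dictionary concrete, here is a toy sketch in Python. It is not how MeCab or Igo work internally (they choose among candidate segmentations using costs stored in the dictionary). It is only a greedy longest-match over a romanized string, and the function name `longest_match` and both tiny dictionaries are made up for illustration.

```python
def longest_match(text, dictionary):
    """Greedy longest-match segmentation; returns (word, part_of_speech) pairs."""
    result = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate starting at pos first
        for end in range(len(text), pos, -1):
            word = text[pos:end]
            if word in dictionary:
                result.append((word, dictionary[word]))
                pos = end
                break
        else:
            # No dictionary entry starts here: emit one character as unknown
            result.append((text[pos], "unknown"))
            pos += 1
    return result


# Hypothetical dictionaries: one without the title, one with it
base = {"sazae": "personal name", "san": "honorific"}
extended = dict(base, sazaesan="title")

print(longest_match("sazaesan", base))
print(longest_match("sazaesan", extended))
```

With `base` the string is split into a name plus an honorific; once the title is registered in `extended`, the same string comes back as a single entry.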
With Igo, I build MeCab's IPA dictionary for Igo and use that, but it lacks many proper nouns. mecab-ipadic-neologd fills that gap: it contains entries such as proper nouns that the standard MeCab dictionary does not cover. The dictionary is also updated twice a week with new words, which is useful.
$ pip install igo-python
You can now use Igo via Python.
$ git clone https://github.com/neologd/mecab-ipadic-neologd
neologd is a dictionary for MeCab, so it has to be built for Igo before Igo can use it.
For details on neologd, the following page is helpful: mecab-ipadic-neologd
Download `igo-0.4.5.jar` from the Igo page.
Go to the `mecab-ipadic-neologd` directory and run:
$ bin/install-mecab-ipadic-neologd
This creates a `build` directory.
Inside the `build` directory there should be a directory like `mecab-ipadic-2.7.0-20070801-neologd-20160826`. (The numbers may differ.)
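Because the numbers in the directory name change from build to build, it can be convenient to look the directory up rather than type it. The following is a minimal sketch; the helper name `find_neologd_build_dir` is hypothetical, and it assumes the directory names follow the `mecab-ipadic-*-neologd-*` pattern shown above.

```python
import glob
import os


def find_neologd_build_dir(build_root="build"):
    """Return the last directory (in name order) matching mecab-ipadic-*-neologd-*."""
    pattern = os.path.join(build_root, "mecab-ipadic-*-neologd-*")
    candidates = sorted(p for p in glob.glob(pattern) if os.path.isdir(p))
    if not candidates:
        raise FileNotFoundError("no neologd build directory under %r" % build_root)
    # Assumption: names end with a date stamp, so the last one in name order
    # is the most recent build when the ipadic version prefix is the same.
    return candidates[-1]
```

Called as `find_neologd_build_dir("build")` from the `mecab-ipadic-neologd` directory, it returns the path to copy `igo-0.4.5.jar` into.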
Copy the downloaded `igo-0.4.5.jar` into that directory. Then enter the `mecab-ipadic-2.7.0-20070801-neologd-20160826` directory and run the following command to compile the dictionary.
$ java -cp igo-0.4.5.jar net.reduls.igo.bin.BuildDic neologd . "utf-8"
java -cp igo-0.4.5.jar net.reduls.igo.bin.BuildDic (Dictionary output destination directory name) (Extraction directory name of Mecab dictionary) (Dictionary character code)
Change (Dictionary output destination directory name) if necessary; I named it `neologd` here.
(Extraction directory name of Mecab dictionary) is the directory produced by the build step, here `mecab-ipadic-2.7.0-20070801-neologd-20160826`. Because the command is run from inside that directory, it is given as `.`.
If you get the following error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
add `-Xmx1024m` to raise the JVM's maximum heap size, as shown below.
java -Xmx1024m -cp igo-0.4.5.jar net.reduls.igo.bin.BuildDic neologd . "utf-8"
If that still fails, increase the value above 1024 until the build succeeds.
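If you prefer to drive the build from Python, the command above can be assembled programmatically. This is only a sketch: the helper names `build_dic_command` and `build_with_retry` are hypothetical, the larger heap sizes are arbitrary assumptions, and the retry loop reacts to any non-zero exit code, not specifically to `OutOfMemoryError`.

```python
import subprocess


def build_dic_command(jar, out_dir, src_dir=".", encoding="utf-8", heap_mb=None):
    """Return the argument list for Igo's BuildDic command."""
    cmd = ["java"]
    if heap_mb is not None:
        # -Xmx sets the JVM's maximum heap size
        cmd.append("-Xmx%dm" % heap_mb)
    cmd += ["-cp", jar, "net.reduls.igo.bin.BuildDic", out_dir, src_dir, encoding]
    return cmd


def build_with_retry(jar, out_dir, heap_sizes=(1024, 2048, 4096)):
    """Run BuildDic with increasing heap sizes until one run exits with 0."""
    for heap_mb in heap_sizes:
        if subprocess.call(build_dic_command(jar, out_dir, heap_mb=heap_mb)) == 0:
            return heap_mb
    raise RuntimeError("BuildDic failed for all heap sizes")


# Only prints the command; nothing is executed here
print(" ".join(build_dic_command("igo-0.4.5.jar", "neologd", heap_mb=1024)))
```

`build_with_retry("igo-0.4.5.jar", "neologd")` would need to be called from inside the extracted dictionary directory, just like the manual command.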
In the directory that contains the compiled `neologd` dictionary, try running:
$ python
>>> import igo
>>> tagger = igo.tagger.Tagger('neologd')
>>> for t in tagger.parse('Masahiro Nakai of SMAP revealed Shinichi Shinohara's past misunderstandings in "Masahiro Nakai's Mi Naru Library" (TV Asahi) broadcast on the 10th.')
... print(t)
...
surface:The 10th, feature:noun,Proper noun,General,*,*,*,The 10th,Junichi,Junichi, start=0
surface:broadcast, feature:noun,Change connection,*,*,*,*,broadcast,Housou,Horso, start=3
surface:of, feature:Particle,Attributive,*,*,*,*,of,No,No, start=5
surface: 「, feature:symbol,Open parentheses,*,*,*,*,「,「,「, start=6
surface:Masahiro Nakai's Mi Naru Library, feature:noun,Proper noun,General,*,*,*,Masahiro Nakai's Mi Naru Library,Masahiro Nakai's Minimal Library,Masahiro Nakai's Minimal Library, start=7
surface: 」, feature:symbol,Parentheses closed,*,*,*,*,」,」,」, start=19
surface: (, feature:symbol,Open parentheses,*,*,*,*,(,(,(, start=20
surface:TV asahi, feature:noun,Proper noun,Organization,*,*,*,TV asahi,TV asahi,TV asahi, start=21
surface:system, feature:noun,suffix,General,*,*,*,system,Kay,Kay, start=26
surface: ), feature:symbol,Parentheses closed,*,*,*,*,),),), start=27
surface:so, feature:Particle,Case particles,General,*,*,*,so,De,De, start=28
surface: 、, feature:symbol,Comma,*,*,*,*,、,、,、, start=29
surface: SMAP, feature:noun,Proper noun,Personal name,General,*,*,SMAP,SMAP,SMAP, start=30
surface:of, feature:Particle,Attributive,*,*,*,*,of,No,No, start=34
surface:Masahiro Nakai, feature:noun,Proper noun,Personal name,General,*,*,Masahiro Nakai,Nakai Masahiro,Nakai Masahiro, start=35
surface:But, feature:Particle,Case particles,General,*,*,*,But,Moth,Moth, start=39
surface: 、, feature:symbol,Comma,*,*,*,*,、,、,、, start=40
surface:Shinichi Shinohara, feature:noun,Proper noun,Personal name,General,*,*,Shinichi Shinohara,Shinohara Shinichi,Shinohara Shinichi, start=41
surface:of, feature:Particle,Attributive,*,*,*,*,of,No,No, start=45
surface:past, feature:noun,Adverbs possible,*,*,*,*,past,Kako,Kako, start=46
surface:of, feature:Particle,Attributive,*,*,*,*,of,No,No, start=48
surface:Misunderstanding, feature:noun,Change connection,*,*,*,*,Misunderstanding,Canchigai,Canchigai, start=49
surface:To, feature:Particle,Case particles,General,*,*,*,To,Wo,Wo, start=52
surface:Reveal, feature:verb,Independence,*,*,Godan / Sa line,Uninflected word,Reveal,Akas,Akas, start=53
surface:One act, feature:noun,General,*,*,*,*,One act,Hitomak,Hitomak, start=56
surface:But, feature:Particle,Case particles,General,*,*,*,But,Moth,Moth, start=58
surface:Ah, feature:verb,Independence,*,*,Five steps, La line,Continuous connection,is there,Ah,Ah, start=59
surface:Ta, feature:Auxiliary verb,*,*,*,Special,Uninflected word,Ta,Ta,Ta, start=61
surface: 。, feature:symbol,Kuten,*,*,*,*,。,。,。, start=62
It worked! The example sentence is taken from "mecab-ipadic-NEologd : Neologism dictionary for MeCab".
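Each printed morpheme carries a surface form, a comma-separated feature string, and a start offset, so picking out particular parts of speech is a matter of splitting the feature string. The sketch below uses a stand-in `Morpheme` tuple so that it runs without Igo installed; the helper name `filter_by_pos` is hypothetical, and the labels in the sample are copied from the output as printed above, so pass whatever labels your dictionary actually emits.

```python
from collections import namedtuple

# Stand-in with the same three fields that appear in the printed output
Morpheme = namedtuple("Morpheme", ["surface", "feature", "start"])


def filter_by_pos(morphemes, pos, subtype=None):
    """Keep morphemes whose first feature field is pos (and second is subtype, if given)."""
    kept = []
    for m in morphemes:
        fields = m.feature.split(",")
        if fields[0] != pos:
            continue
        if subtype is not None and (len(fields) < 2 or fields[1] != subtype):
            continue
        kept.append(m)
    return kept


sample = [
    Morpheme("SMAP", "noun,Proper noun,Personal name,General,*,*,SMAP,SMAP,SMAP", 30),
    Morpheme("of", "Particle,Attributive,*,*,*,*,of,No,No", 34),
    Morpheme("past", "noun,Adverbs possible,*,*,*,*,past,Kako,Kako", 46),
]
print([m.surface for m in filter_by_pos(sample, "noun", "Proper noun")])
```

The same function should work on the list returned by `tagger.parse(...)`, since it only reads the `surface` and `feature` attributes.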
Igo is also readily available in other languages, for example through `igo-php` and `igo-ruby`.
I will also try it in Ruby. With the dictionary built above, this is easy.
$ gem install igo-ruby
require 'igo-ruby'
tagger = Igo::Tagger.new('neologd')
t = tagger.parse('Masahiro Nakai of SMAP revealed Shinichi Shinohara's past misunderstandings in "Masahiro Nakai's Mi Naru Library" (TV Asahi) broadcast on the 10th.')
t.each do |m|
puts "#{m.surface} #{m.feature} #{m.start}"
end
The 10th noun,Proper noun,General,*,*,*,The 10th,Junichi,Junichi 0
broadcast noun,Change connection,*,*,*,*,broadcast,Housou,Horso 3
of Particle,Attributive,*,*,*,*,of,No,No 5
"Masahiro Nakai's Mi Naru Library" (noun),Proper noun,General,*,*,*,South Korea clinging to Chinese ass horse,Chugokunoshiriumanishigamitsukukankoku,Chugokunashiriumanishigamitsukukankoku 6
TV asahi noun,Proper noun,Organization,*,*,*,TV asahi,TV asahi,TV asahi 21
system noun,suffix,General,*,*,*,system,Kay,Kay 26
) symbol,Parentheses closed,*,*,*,*,),),) 27
so Particle,Case particles,General,*,*,*,so,De,De 28
、 symbol,Comma,*,*,*,*,、,、,、 29
SMAP noun,Proper noun,Personal name,General,*,*,SMAP,SMAP,SMAP 30
of Particle,Attributive,*,*,*,*,of,No,No 34
Masahiro Nakai noun,Proper noun,Personal name,General,*,*,Masahiro Nakai,Nakai Masahiro,Nakai Masahiro 35
But Particle,Case particles,General,*,*,*,But,Moth,Moth 39
、 symbol,Comma,*,*,*,*,、,、,、 40
Shinichi Shinohara noun,Proper noun,Personal name,General,*,*,Shinichi Shinohara,Shinohara Shinichi,Shinohara Shinichi 41
of Particle,Attributive,*,*,*,*,of,No,No 45
past noun,Adverbs possible,*,*,*,*,past,Kako,Kako 46
of Particle,Attributive,*,*,*,*,of,No,No 48
Misunderstanding noun,Change connection,*,*,*,*,Misunderstanding,Canchigai,Canchigai 49
To Particle,Case particles,General,*,*,*,To,Wo,Wo 52
Reveal verb,Independence,*,*,Godan / Sa line,Uninflected word,Reveal,Akas,Akas 53
One act noun,General,*,*,*,*,One act,Hitomak,Hitomak 56
But Particle,Case particles,General,*,*,*,But,Moth,Moth 58
Ah verb,Independence,*,*,Five steps, La line,Continuous connection,is there,Ah,Ah 59
Ta Auxiliary verb,*,*,*,Special,Uninflected word,Ta,Ta,Ta 61
。 symbol,Kuten,*,*,*,*,。,。,。 62
Reference: kyow / igo-ruby
Use mecab-ipadic-neologd with igo-python
mecab-ipadic-NEologd : Neologism dictionary for MeCab
I tried using igo-python