[PYTHON] Use the company name recognition dictionary "JCLdic" for MeCab

About this article

This article explains how to use the company name dictionary JCLdic with MeCab, both from the command line and from Python.

Dictionary introduction

- "JCLdic" public page

JCLdic contains over 8 million company names and their aliases. It was created because conventional dictionaries cover few company names, and because variations in how a name is written make companies hard to recognize.

Get a dictionary

Download the MeCab-format dictionary from the public page. This article uses JCL_slim (jcl_slim_mecab.dic) as an example.

Preparing the environment

Please install MeCab and mecab-python3 first.

Move the downloaded jcl_slim_mecab.dic to the specified folder.

$ mkdir -p /usr/local/lib/mecab/dic/user_dict
$ mv jcl_slim_mecab.dic /usr/local/lib/mecab/dic/user_dict

Edit the MeCab configuration file mecabrc and add the dictionary path.

$ vim /usr/local/etc/mecabrc

In mecabrc, dicdir is the system dictionary path and userdic is the user dictionary path. Write the JCLdic path in userdic.

dicdir =  /usr/local/lib/mecab/dic/ipadic
;dicdir =  /usr/local/lib/mecab/dic/mecab-ipadic-neologd
;dicdir = /usr/local/lib/mecab/dic/jumandic
;dicdir = /usr/local/lib/mecab/dic/unidic

userdic = /usr/local/lib/mecab/dic/user_dict/jcl_slim_mecab.dic
; output-format-type = wakati
; input-buffer-size = 8192

; node-format = %m\n
; bos-format = %S\n
; eos-format = EOS\n

You can also specify several user dictionaries by separating their paths with commas.

userdic = /usr/local/lib/mecab/dic/user_dict/jcl_full_mecab_1.dic,/usr/local/lib/mecab/dic/user_dict/jcl_full_mecab_2.dic

Now you're ready to go.
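
Before moving on, it can help to check which dictionaries mecabrc actually points to. The following is a minimal sketch, not part of MeCab itself: it assumes the file uses the key = value layout shown above, with ; marking comment lines. The names read_mecabrc_settings, user_dictionaries, and SAMPLE are hypothetical and introduced only for illustration.

```python
def read_mecabrc_settings(text):
    """Return the active key/value settings from mecabrc-style text.

    Assumption: lines starting with ';' are comments, as in the sample above.
    """
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and commented-out settings
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key = value line
        settings[key.strip()] = value.strip()
    return settings


def user_dictionaries(settings):
    """Split the comma-separated userdic value into a list of paths."""
    value = settings.get("userdic", "")
    return [path.strip() for path in value.split(",") if path.strip()]


# Hand-written sample mirroring the configuration shown in this article.
SAMPLE = """\
dicdir =  /usr/local/lib/mecab/dic/ipadic
;dicdir = /usr/local/lib/mecab/dic/unidic
userdic = /usr/local/lib/mecab/dic/user_dict/jcl_full_mecab_1.dic,/usr/local/lib/mecab/dic/user_dict/jcl_full_mecab_2.dic
; output-format-type = wakati
"""

settings = read_mecabrc_settings(SAMPLE)
print(settings["dicdir"])
print(user_dictionaries(settings))
```

To check a real installation, read /usr/local/etc/mecabrc with open() and pass its contents to read_mecabrc_settings instead of SAMPLE.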

Use JCLdic on the command line

Result without jcl_slim_mecab.dic:

echo "TIS Co., Ltd. of the TIS INTEC Group has released JCLdic (Japanese company name dictionary), a dictionary for recognizing company names in natural language processing, free of charge." | mecab

TIS noun,General,*,*,*,*,*
Intec noun,Proper noun,Organization,*,*,*,INTEC,INTEC,INTEC
Group noun,General,*,*,*,*,group,group,group
Particles,Attributive,*,*,*,*,of,No,No
TIS noun,General,*,*,*,*,*
Noun Co., Ltd.,General,*,*,*,*,Co., Ltd.,Kabushiki Gaisha,Kabushiki Gaisha
Is a particle,Particle,*,*,*,*,Is,C,Wow
, Symbol,Comma,*,*,*,*,、,、,、
......
EOS

Result with jcl_slim_mecab.dic:

echo "TIS Co., Ltd. of the TIS INTEC Group has released JCLdic (Japanese company name dictionary), a dictionary for recognizing company names in natural language processing, free of charge." | mecab

TIS noun,Proper noun,Organization,*,*,*,TIS Co., Ltd.,*,*
Intec noun,Proper noun,Organization,*,*,*,INTEC Inc.,*,*
Group noun,General,*,*,*,*,group,group,group
Particles,Attributive,*,*,*,*,of,No,No
TIS Co., Ltd. Noun,Proper noun,Organization,*,*,*,TIS Co., Ltd.,*,*
Is a particle,Particle,*,*,*,*,Is,C,Wow
, Symbol,Comma,*,*,*,*,、,、,、
......
EOS

You can also specify a user dictionary directly with the -u option instead of relying on mecabrc.

echo "TIS Co., Ltd. of the TIS INTEC Group has released JCLdic (Japanese company name dictionary), a dictionary for recognizing company names in natural language processing, free of charge." | mecab -u /usr/local/lib/mecab/dic/user_dict/jcl_medium_mecab.dic

Use JCLdic with Python

Recognize the company name.
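
Both methods below normalize the input with NFKC before parsing, presumably so that differently written forms of the same name match the dictionary entries. As a small illustration of what that step does (the sample strings are hypothetical and not taken from JCLdic), NFKC folds full-width alphanumerics to half-width and half-width katakana to full-width:

```python
import unicodedata

# Full-width (zenkaku) Latin letters become ordinary ASCII letters.
full_width = "ＴＩＳ株式会社"
normalized = unicodedata.normalize("NFKC", full_width)
print(normalized)

# Half-width (hankaku) katakana become standard full-width katakana.
half_kana = "ｲﾝﾃｯｸ"
normalized_kana = unicodedata.normalize("NFKC", half_kana)
print(normalized_kana)
```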

Method 1: parse method

import unicodedata
import MeCab

# Option 1: specify the user dictionary with the -u option
# tagger = MeCab.Tagger('-u /usr/local/lib/mecab/dic/user_dict/jcl_slim_mecab.dic')

# Option 2: load the dictionaries listed in mecabrc
tagger = MeCab.Tagger('-r /usr/local/etc/mecabrc')

text = 'TIS Co., Ltd. of the TIS INTEC Group has released JCLdic (Japanese company name dictionary), a dictionary for recognizing company names in natural language processing, free of charge.'

# NFKC normalization: converts full-width (zenkaku) alphanumerics to half-width (hankaku)
text = unicodedata.normalize('NFKC', text)

# parse
print(tagger.parse(text))

Result:

TIS noun,Proper noun,Organization,*,*,*,TIS Co., Ltd.,*,*
Intec noun,Proper noun,Organization,*,*,*,INTEC Inc.,*,*
Group noun,General,*,*,*,*,group,group,group
Particles,Attributive,*,*,*,*,of,No,No
TIS Co., Ltd. Noun,Proper noun,Organization,*,*,*,TIS Co., Ltd.,*,*
Is a particle,Particle,*,*,*,*,Is,C,Wow
, Symbol,Comma,*,*,*,*,、,、,、
...
EOS
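
The string returned by parse is easy to print but awkward to process further. The following is a minimal sketch of splitting it into tokens, assuming MeCab's default output format: the surface form, a tab, then comma-separated features, with an EOS line at the end. The names parse_mecab_output and SAMPLE_OUTPUT are hypothetical, and the sample is hand-written in the Japanese form MeCab emits rather than captured from a real run.

```python
def parse_mecab_output(output):
    """Split MeCab's default text output into (surface, feature_list) pairs."""
    tokens = []
    for line in output.splitlines():
        if not line or line == "EOS":
            continue  # skip blank lines and the end-of-sentence marker
        surface, sep, feature = line.partition("\t")
        if not sep:
            continue  # not a surface<TAB>features line
        tokens.append((surface, feature.split(",")))
    return tokens


# Hand-written sample in the default format (assumption, not real output).
SAMPLE_OUTPUT = (
    "TIS\t名詞,固有名詞,組織,*,*,*,TIS株式会社,*,*\n"
    "の\t助詞,連体化,*,*,*,*,の,ノ,ノ\n"
    "EOS\n"
)

tokens = parse_mecab_output(SAMPLE_OUTPUT)
for surface, features in tokens:
    print(surface, features[0], features[2])
```

With MeCab installed, tagger.parse(text) can be passed in place of SAMPLE_OUTPUT. If mecabrc sets a custom output format such as wakati, this splitting no longer applies.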

Method 2: parseToNode method

Walk the nodes one by one and pick out company names by checking each node's features for the organization tag.

import unicodedata
import MeCab

# Option 1: specify the user dictionary with the -u option
# tagger = MeCab.Tagger('-u /usr/local/lib/mecab/dic/user_dict/jcl_slim_mecab.dic')

# Option 2: load the dictionaries listed in mecabrc
tagger = MeCab.Tagger('-r /usr/local/etc/mecabrc')

text = 'TIS Co., Ltd. of the TIS INTEC Group has released JCLdic (Japanese company name dictionary), a dictionary for recognizing company names in natural language processing, free of charge.'

# NFKC normalization: converts full-width (zenkaku) alphanumerics to half-width (hankaku)
text = unicodedata.normalize('NFKC', text)

# parse
node = tagger.parseToNode(text)
result = []

while node:
    # node.feature fields: part of speech, subclassification 1, subclassification 2, subclassification 3,
    # conjugation form, conjugation type, base form, reading, pronunciation
    # example: TIS: ['名詞', '固有名詞', '組織', '*', '*', '*', 'TIS株式会社', '*', '*']
    # MeCab emits the tags in Japanese: company names carry 組織 ("organization") in subclassification 2
    if node.feature.split(",")[2] == '組織':
        result.append(node.surface)
    node = node.next

print(result)
# ['TIS', 'INTEC', 'TIS Co., Ltd.']
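
In the dictionary output shown earlier, the seventh feature field of a company entry holds the full company name, so an alias such as TIS can be mapped back to it. The following is a minimal sketch of that lookup, assuming the field layout described in the comments above. The names canonical_company_names, ORGANIZATION_TAG, and SAMPLE_TOKENS are hypothetical, and the sample pairs are hand-written stand-ins for (node.surface, node.feature).

```python
ORGANIZATION_TAG = "組織"  # "organization", the subclassification-2 value for company names


def canonical_company_names(tokens):
    """Map each recognized company surface form to its full name.

    tokens: iterable of (surface, feature_string) pairs.
    """
    names = {}
    for surface, feature in tokens:
        fields = feature.split(",")
        if len(fields) > 6 and fields[2] == ORGANIZATION_TAG:
            base_form = fields[6]
            # Fall back to the surface form when no base form is recorded.
            names[surface] = surface if base_form == "*" else base_form
    return names


# Hand-written stand-ins for (node.surface, node.feature) pairs.
SAMPLE_TOKENS = [
    ("TIS", "名詞,固有名詞,組織,*,*,*,TIS株式会社,*,*"),
    ("グループ", "名詞,一般,*,*,*,*,グループ,グループ,グループ"),
    ("TIS株式会社", "名詞,固有名詞,組織,*,*,*,TIS株式会社,*,*"),
]

company_names = canonical_company_names(SAMPLE_TOKENS)
print(company_names)
```

In the parseToNode loop, the same pairs can be collected by appending (node.surface, node.feature) for every node. Returning a dictionary keeps both the alias found in the text and the name it resolves to.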
