[PYTHON] [Morphological analysis] How to add a new dictionary to MeCab

Environment

Mac with MeCab installed

Procedure

1 Download the keyword files and create a CSV file

1-1 Download the keyword files

# Hatena Keyword
curl -L http://d.hatena.ne.jp/images/keyword/keywordlist_furigana.csv | iconv -f euc-jp -t utf-8 > keywordlist_furigana.csv
# Wikipedia
curl -L http://dumps.wikimedia.org/jawiki/latest/jawiki-latest-all-titles-in-ns0.gz | gunzip > jawiki-latest-all-titles-in-ns0

1-2 Extract nouns into a CSV file

sample.rb


require 'csv'

original_data = {
  wikipedia: 'jawiki-latest-all-titles-in-ns0',
  hatena: 'keywordlist_furigana.csv'
}

CSV.open("custom.csv", 'w') do |csv|
  original_data.each do |type, filename|
    next unless File.file? filename
    open(filename).each do |title|
      title.strip!

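      # Skip titles that start with symbols or consist only of digits
      next if title =~ %r(^[+-.$()?*/&%!"'_,]+)
      next if title =~ /^[-.0-9]+$/
      # Skip Wikipedia housekeeping pages. These patterns must stay in
      # Japanese because the titles themselves are Japanese.
      next if title =~ /曖昧さ回避/   # disambiguation pages
      next if title =~ /_\(/
      next if title =~ /^PJ:/
      next if title =~ /の登場人物/   # "characters of ..." pages
      next if title =~ /一覧/         # list pages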

      title_length = title.length

      if title_length > 3
        score = [-36000.0, -400 * (title_length ** 1.5)].max.to_i
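        # Part of speech is 名詞 (noun) / 一般 (general), as in IPADIC;
        # the left and right context IDs are left empty.
        csv << [title, nil, nil, score, '名詞', '一般', '*', '*', '*', '*', title, '*', '*', type]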
      end
    end
  end
end

Then run sample.rb:

ruby sample.rb
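
For readers who do not use Ruby, the cost calculation in sample.rb can be sketched in Python. This only illustrates the formula in the script above; the function name entry_cost is made up for this example. MeCab prefers lower-cost paths, so longer titles get a more negative cost, clamped at -36000.

```python
def entry_cost(title):
    """Cost used in sample.rb: -400 * length^1.5, never below -36000."""
    length = len(title)
    # max() keeps the value from dropping below -36000; int() truncates
    # toward zero, like Ruby's to_i.
    return int(max(-36000.0, -400 * (length ** 1.5)))


for word in ["クラウド", "a" * 10, "a" * 30]:
    print(len(word), entry_cost(word))
```

A four-character title gets a cost of -3200, and titles of 21 characters or more hit the -36000 floor.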

2 Create and add a user dictionary

Using the CSV file created above, build the user dictionary custom.dic with the mecab-dict-index command.

/usr/local/libexec/mecab/mecab-dict-index -d /usr/local/lib/mecab/dic/ipadic -u custom.dic -f utf-8 -t utf-8 custom.csv
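
If mecab-dict-index rejects the file, a quick sanity check of custom.csv may help narrow down the problem. The sketch below assumes the IPADIC layout of 13 columns (surface, left ID, right ID, cost, then nine feature columns), with the cost in the fourth column. The function names and messages are made up for this example.

```python
import csv
import io

MIN_FIELDS = 13  # IPADIC: surface, left ID, right ID, cost, 9 feature columns


def check_rows(rows):
    """Return (row number, message) pairs for rows that look malformed."""
    problems = []
    for number, row in enumerate(rows, start=1):
        if len(row) < MIN_FIELDS:
            problems.append((number, "too few fields"))
            continue
        try:
            int(row[3])  # the cost column must be an integer
        except ValueError:
            problems.append((number, "cost is not an integer"))
    return problems


def check_file(path="custom.csv"):
    """Run check_rows over a CSV file on disk."""
    with open(path, newline="", encoding="utf-8") as f:
        return check_rows(csv.reader(f))


# In-memory demo: one well-formed row and one broken row.
sample = io.StringIO(
    "クラウド,,,-3200,名詞,一般,*,*,*,*,クラウド,*,*,wikipedia\n"
    "broken,row\n"
)
print(check_rows(csv.reader(sample)))
```

Calling check_file() in the directory that contains custom.csv returns an empty list when every row passes these two checks.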

Confirm that custom.dic has been created in the current directory.

Then, in the terminal, move to /usr/local/lib/mecab/dic/ipadic and open dicrc:

$ sudo vi dicrc

Finally, add the following line to dicrc, replacing /path/to with the directory that contains the custom.dic you created:

userdic = /path/to/custom.dic
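
As an alternative to editing dicrc, the mecab command accepts a -u option that names a user dictionary for a single run, and the Python binding takes the same option string in MeCab.Tagger. A minimal sketch, assuming the MeCab Python binding is installed; the helper names and the path are hypothetical.

```python
def tagger_args(user_dic, output_format="chasen"):
    """Build the option string for MeCab.Tagger with a user dictionary."""
    return f"-O{output_format} -u {user_dic}"


def make_tagger(user_dic):
    """Create a tagger that loads user_dic in addition to the system dictionary."""
    import MeCab  # imported here so tagger_args works without MeCab installed
    return MeCab.Tagger(tagger_args(user_dic))


if __name__ == "__main__":
    # Hypothetical location; replace with the real path to custom.dic.
    print(tagger_args("/path/to/custom.dic"))
```

This leaves the system-wide dicrc untouched, which can be convenient when trying out several dictionaries.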

Result

Run the following code.

sample01.py


# coding: utf-8
import MeCab

tagger = MeCab.Tagger("-Ochasen")
result = tagger.parse("クラウド")  # "cloud"
print(result)

Before the dictionary was added, "クラウド" (cloud) was split into two words:

クラ クラ クラ 名詞-固有名詞-一般
ウド ウド 名詞-一般

After the dictionary was added, it is recognized as a single noun:

クラウド クラウド 名詞-一般

If you get this result, you're done.
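
To check the result programmatically rather than by eye, the parsed text can be split into surface forms. This is a sketch that assumes the tab-separated, EOS-terminated layout printed by -Ochasen. The helper name surfaces is made up, and the sample string is hand-written for the demo, not real MeCab output.

```python
def surfaces(parsed):
    """Return the surface form (first tab-separated column) of each token."""
    tokens = []
    for line in parsed.splitlines():
        if not line or line == "EOS":
            continue  # skip blank lines and the end-of-sentence marker
        tokens.append(line.split("\t")[0])
    return tokens


# Hand-written sample in the same layout: two tokens followed by EOS.
sample = "foo\tfoo\tfoo\tnoun\nbar\tbar\tbar\tnoun\nEOS\n"
print(surfaces(sample))
```

With a real tagger, surfaces(tagger.parse(word)) == [word] holds once the word is recognized as a single token.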
