[PYTHON] A relatively easy way to install the NEologd dictionary on Windows: system dictionary edition

Introduction

I used to install the NEologd dictionary, a dictionary for morphological analysis with MeCab, using WSL (Windows Subsystem for Linux) + Ubuntu, but it turned out to be relatively easy to install with just git for Windows and 7-zip.

I covered the user dictionary in the previous article, so this time I cover the system dictionary.
https://qiita.com/zincjp/items/c61c441426b9482b5a48

Environment

Windows 10 64-bit
Language: Japanese
MeCab 0.996 32-bit

What to install

git for Windows 2.20.1 64-bit
7-Zip 18.06 64-bit

Installation procedure

PATH to MeCab

Add the following folder, which contains the MeCab executables, to the PATH environment variable.
C:\Program Files (x86)\MeCab\bin

Install 7-zip and set environment variables

7-zip installation

The downloaded NEologd dictionary is compressed in xz format, so use 7-zip to extract it. Download and install the 64-bit version of 7-zip from the following site.
https://sevenzip.osdn.jp/

PATH to 7-zip

Add the following folder to the PATH environment variable.
C:\Program Files\7-Zip

Install git for Windows

Install the 64-bit version of git for Windows, referring to the following article.
https://qiita.com/taiponrock/items/632c117220e57d555099
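
Before moving on, it can help to confirm that the PATH settings above took effect. The following is a minimal sketch in Python that looks for each tool on the current PATH. The executable names (mecab, mecab-dict-index, 7z, git) are my assumptions based on the default installers, and the function name is only illustrative.

```python
import shutil

# Executable names are assumptions based on the default installers:
# mecab / mecab-dict-index (MeCab), 7z (7-Zip), git (git for Windows).
REQUIRED_TOOLS = ["mecab", "mecab-dict-index", "7z", "git"]


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the current PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

Run it from a newly opened command prompt, because a prompt that was already open does not pick up changes to the environment variables.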

Download NEologd dictionary

Download the dictionary with git (clone the repository)

Start the command prompt as an administrator, then move to the working folder with the following command.

cd %homepath%

Then download the NEologd dictionary with the following command. The --depth 1 option fetches only the latest commit, which keeps the download small.

git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git

Check NEologd dictionary file

From the command prompt, use the following commands to move to C:\Users\(user name)\mecab-ipadic-neologd\seed and check the files.

cd mecab-ipadic-neologd\seed
dir

Extract NEologd dictionary file

Extract these .csv.xz files with 7-zip with the following command.

7z x *.xz
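
As an alternative sketch, not what I actually did in this article, the same extraction can be done with Python's standard lzma module, which reads the xz format. The function name extract_xz_files is hypothetical, and the folder you pass is assumed to be the seed folder shown above.

```python
import lzma
import shutil
from pathlib import Path


def extract_xz_files(folder):
    """Decompress every *.xz file in `folder` next to the original archive."""
    extracted = []
    for archive in sorted(Path(folder).glob("*.xz")):
        # Drop only the final ".xz" suffix: "name.csv.xz" -> "name.csv"
        target = archive.with_suffix("")
        with lzma.open(archive, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        extracted.append(target)
    return extracted
```

Calling extract_xz_files(".") from inside the seed folder should leave the .csv files beside the .csv.xz archives, in the same way as the 7z command.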

Copy .csv file

Copy the .csv files to MeCab's dic\ipadic folder with the following commands. The mecab-user-dict-seed.(date).csv file is large and frequently updated, so I want to use it as a user dictionary instead. For that reason it is deleted from the ipadic folder in this step.

copy *.csv "c:\Program Files (x86)\MeCab\dic\ipadic"
del "c:\Program Files (x86)\MeCab\dic\ipadic\mecab-user-dict-seed.*"

Convert SHIFT-JIS files to UTF-8

Using an editor, convert all the .csv files and the unk.def file in c:\Program Files (x86)\MeCab\dic\ipadic from SHIFT-JIS (line feed CR + LF) to UTF-8 (line feed LF).

I converted the character code with EmEditor (https://jp.emeditor.com/).

This is because the .csv files of the NEologd dictionary are in UTF-8, while the .csv files of the IPA dictionary bundled with MeCab are in SHIFT-JIS. If you compile a dictionary from files with mixed character codes, some morphemes will have their part-of-speech information displayed as "??" when morphological analysis is performed.
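
If you prefer not to use an editor, the conversion can also be scripted. The following is a minimal sketch in Python, not the method I used. It assumes the bundled files are in the Windows variant of SHIFT-JIS (cp932), and it treats any file that already decodes as UTF-8 as UTF-8, which is a heuristic for leaving the NEologd files unchanged. The function names are hypothetical.

```python
from pathlib import Path


def to_utf8_lf(raw, source_encoding="cp932"):
    """Return `raw` bytes re-encoded as UTF-8 with LF line endings."""
    try:
        text = raw.decode("utf-8")  # already UTF-8 (e.g. the NEologd CSV files)
    except UnicodeDecodeError:
        text = raw.decode(source_encoding)  # assumed SHIFT-JIS (Windows cp932)
    return text.replace("\r\n", "\n").encode("utf-8")


def convert_dictionary_folder(folder, patterns=("*.csv", "unk.def")):
    """Rewrite matching files in place and return the paths that changed."""
    changed = []
    for pattern in patterns:
        for path in sorted(Path(folder).glob(pattern)):
            raw = path.read_bytes()
            converted = to_utf8_lf(raw)
            if converted != raw:
                path.write_bytes(converted)
                changed.append(path)
    return changed
```

Because the files are rewritten in place, it is safer to try this on a copy of the ipadic folder first. Writing under Program Files also requires an administrator prompt.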

Compiling dictionary files

Creating a SHIFT-JIS dictionary

Create a SHIFT-JIS system dictionary in the ipadic folder with the following command.

cd "c:\Program Files (x86)\MeCab\dic\ipadic"
mecab-dict-index -f utf-8 -t shift-jis

Creating a UTF-8 dictionary

A UTF-8 dictionary is also needed for UTF-8 environments such as Python, so create a UTF-8 system dictionary as well. First, create a new ipadic-UTF8 folder and copy all the files in the ipadic folder into it with the following commands.

mkdir "c:\Program Files (x86)\MeCab\dic\ipadic-UTF8"
copy * "c:\Program Files (x86)\MeCab\dic\ipadic-UTF8"

Create a UTF-8 system dictionary from the files in c:\Program Files (x86)\MeCab\dic\ipadic-UTF8 with the following commands.

cd "c:\Program Files (x86)\MeCab\dic\ipadic-UTF8"
mecab-dict-index -f utf-8 -t utf-8
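
In both compile commands, -f is the character code of the source .csv and .def files and -t is the character code of the compiled dictionary. As a rough sketch, the two builds could be driven from Python as shown below. The -d (source folder) and -o (output folder) options are passed explicitly instead of relying on cd, and the helper names are hypothetical.

```python
import subprocess


def build_index_command(dic_dir, source_charset="utf-8", dictionary_charset="utf-8"):
    """Build the mecab-dict-index command line for one dictionary folder."""
    return [
        "mecab-dict-index",
        "-d", dic_dir,             # folder containing the .csv and .def sources
        "-o", dic_dir,             # folder that receives the compiled dictionary
        "-f", source_charset,      # character code of the source files
        "-t", dictionary_charset,  # character code of the compiled dictionary
    ]


def compile_dictionary(dic_dir, dictionary_charset):
    """Run mecab-dict-index; raises CalledProcessError if the build fails."""
    subprocess.run(build_index_command(dic_dir, "utf-8", dictionary_charset), check=True)
```

For example, compile_dictionary(r"c:\Program Files (x86)\MeCab\dic\ipadic", "shift-jis") would correspond to the first build, and the same call with the ipadic-UTF8 folder and "utf-8" to the second.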

Analysis test

Words that were not in the user dictionary, such as "Ooi Ooi", can now be analyzed.

Before introducing NEologd system dictionary


Hey hey hey
Hey hey adverb,General,*,*,*,*,Hey hey,Oioi,Oioi
Hey verb,*,*,*,*,*,Hey,Oy,Oy
EOS

After introducing NEologd system dictionary


Hey hey hey
Hey hey hey verb,*,*,*,*,*,Hey hey,Oioioi,Oioioi
EOS
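
To run the same kind of check from Python, one option is to call the mecab command through subprocess and parse its default output, where each line is a surface form, a tab, and comma-separated features, ending with EOS. This is a minimal sketch under that assumption. The function names are hypothetical, and the simple comma split does not handle features that themselves contain quoted commas.

```python
import subprocess


def parse_mecab_output(output):
    """Turn MeCab's default output into (surface, features) pairs."""
    tokens = []
    for line in output.splitlines():
        if line == "EOS" or not line:
            continue
        surface, _, feature_text = line.partition("\t")
        tokens.append((surface, feature_text.split(",")))
    return tokens


def analyze(text, dic_dir):
    """Run mecab with the system dictionary in `dic_dir` (UTF-8 assumed)."""
    result = subprocess.run(
        ["mecab", "-d", dic_dir],             # -d selects the system dictionary folder
        input=(text + "\n").encode("utf-8"),  # send the sentence on standard input
        stdout=subprocess.PIPE,
        check=True,
    )
    return parse_mecab_output(result.stdout.decode("utf-8"))
```

Passing the ipadic-UTF8 folder created above as dic_dir keeps the input, the dictionary, and the decoded output all in UTF-8.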

Reference

URL of NEologd dictionary

https://github.com/neologd/mecab-ipadic-neologd/blob/master/ChangeLog
