I had been using WSL (Windows Subsystem for Linux) + Ubuntu to install NEologd, a dictionary for morphological analysis with MeCab, but it turned out to be relatively easy to install on Windows itself with git for Windows and 7-Zip.
I covered the user dictionary in the previous article, so this article covers the system dictionary. https://qiita.com/zincjp/items/c61c441426b9482b5a48
Environment:
Windows 10 64-bit (language: Japanese)
MeCab 0.996 32-bit
git for Windows 2.20.1 64-bit
7-Zip 18.06 64-bit
Add the following folder, which contains the MeCab executables, to the PATH environment variable.
C:\Program Files (x86)\MeCab\bin
The NEologd dictionary files are compressed in xz format, so 7-Zip is used to extract them. Download and install the 64-bit version of 7-Zip from the following site.
https://sevenzip.osdn.jp/
Add the following folder to the PATH environment variable as well.
C:\Program Files\7-Zip
Install the 64-bit version of git for Windows, referring to the following site.
https://qiita.com/taiponrock/items/632c117220e57d555099
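Before continuing, it can help to confirm that the tools are actually reachable through PATH. The following is a minimal sketch in Python, not part of the original procedure; the function name `find_tools` and the list of tool names are my own assumptions based on the commands used later in this article.

```python
import shutil


def find_tools(names=("mecab", "mecab-dict-index", "7z", "git")):
    """Map each tool name to the path found on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        # A missing entry usually means the folder was not added to PATH,
        # or the command prompt was opened before PATH was changed.
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If a tool is reported as missing, reopen the command prompt after editing the environment variables and run the check again.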
Start Command Prompt as an administrator, then move to the working folder with the following command.
cd %homepath%
Then download the NEologd dictionary with the following command
git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
From the command prompt, use the following commands to move to C:\Users\(user name)\mecab-ipadic-neologd\seed and check the files.
cd mecab-ipadic-neologd\seed
dir
Extract these .csv.xz files with 7-zip with the following command.
7z x *.xz
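If 7-Zip is not available, the same extraction can probably be done with Python's standard `lzma` module. This is only a sketch of an alternative, not the method used in this article; the function name `extract_xz_files` is hypothetical.

```python
import lzma
import shutil
from pathlib import Path


def extract_xz_files(folder):
    """Decompress every *.xz file in `folder`, writing each result next to its archive."""
    extracted = []
    for archive in sorted(Path(folder).glob("*.xz")):
        target = archive.with_suffix("")  # "foo.csv.xz" -> "foo.csv"
        with lzma.open(archive, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)  # stream, so large seed files are not loaded whole
        extracted.append(target)
    return extracted


if __name__ == "__main__":
    # Assumes the script is run from the mecab-ipadic-neologd\seed folder.
    for path in extract_xz_files("."):
        print("extracted", path.name)
```

Like `7z x`, this leaves the original .xz archives in place.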
Copy the CSV files to MeCab's dic\ipadic folder with the following commands. The mecab-user-dict-seed.(date).csv file is large and frequently updated, so I prefer to use it as a user dictionary; for that reason it is deleted from the ipadic folder in this step.
copy *.csv "c:\Program Files (x86)\MeCab\dic\ipadic"
del "c:\Program Files (x86)\MeCab\dic\ipadic\mecab-user-dict-seed.*"
Using an editor, convert the character encoding of all .csv files and the unk.def file in c:\Program Files (x86)\MeCab\dic\ipadic from Shift-JIS (CR+LF line endings) to UTF-8 (LF line endings).
I did the conversion with EmEditor (https://jp.emeditor.com/).
This is necessary because the CSV files of the NEologd dictionary are in UTF-8, whereas the CSV files of the IPA dictionary bundled by default are in Shift-JIS. If you compile a dictionary from files with mixed encodings, some morphemes will have their part-of-speech information displayed as "??" when you run morphological analysis.
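The conversion could also be scripted instead of done by hand in an editor. The sketch below is an assumption-laden alternative, not what I actually ran: it treats Shift-JIS as Python's `cp932` codec (the Windows variant), and it uses a simple heuristic that any file which already decodes as UTF-8 (such as the NEologd CSV files) only needs its line endings normalized. The function names are hypothetical.

```python
from pathlib import Path


def to_utf8_lf(data, legacy_encoding="cp932"):
    """Return `data` re-encoded as UTF-8 with LF line endings."""
    try:
        text = data.decode("utf-8")  # already UTF-8, e.g. the NEologd CSV files
    except UnicodeDecodeError:
        text = data.decode(legacy_encoding)  # assumed to be Shift-JIS (code page 932)
    return text.replace("\r\n", "\n").encode("utf-8")


def convert_folder(folder, patterns=("*.csv", "unk.def")):
    """Convert matching files in place and return the names of the files that changed."""
    changed = []
    for pattern in patterns:
        for path in sorted(Path(folder).glob(pattern)):
            original = path.read_bytes()
            converted = to_utf8_lf(original)
            if converted != original:
                path.write_bytes(converted)
                changed.append(path.name)
    return changed


if __name__ == "__main__":
    # Writing under Program Files requires an administrator prompt.
    print(convert_folder(r"c:\Program Files (x86)\MeCab\dic\ipadic"))
```

It is worth backing up the ipadic folder first, since the files are overwritten in place.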
Create a SHIFT-JIS system dictionary in the ipadic folder with the following command.
cd "c:\Program Files (x86)\MeCab\dic\ipadic"
mecab-dict-index -f utf-8 -t shift-jis
A UTF-8 dictionary is still needed for UTF-8 environments such as Python, so build a UTF-8 system dictionary as well. First, copy all the files in the ipadic folder to a newly created ipadic-UTF8 folder with the following commands.
mkdir "c:\Program Files (x86)\MeCab\dic\ipadic-UTF8"
copy * "c:\Program Files (x86)\MeCab\dic\ipadic-UTF8"
Create a UTF-8 system dictionary from the files in c:\Program Files (x86)\MeCab\dic\ipadic-UTF8 with the following commands.
cd "c:\Program Files (x86)\MeCab\dic\ipadic-UTF8"
mecab-dict-index -f utf-8 -t utf-8
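To use the UTF-8 dictionary from Python, one option is to call the mecab executable with its `-d` option pointing at the new folder. The following is a rough sketch that assumes mecab is on PATH and that the dictionary was built in the ipadic-UTF8 folder as above; the helper names `build_mecab_command` and `analyze` are hypothetical.

```python
import subprocess

# Folder created in the steps above (assumed location).
UTF8_DIC = r"C:\Program Files (x86)\MeCab\dic\ipadic-UTF8"


def build_mecab_command(dic_dir=UTF8_DIC):
    """Build the argument list for running mecab against a specific system dictionary."""
    return ["mecab", "-d", dic_dir]


def analyze(text, dic_dir=UTF8_DIC):
    """Send `text` to mecab on stdin and return its raw output, using UTF-8 both ways."""
    result = subprocess.run(
        build_mecab_command(dic_dir),
        input=text + "\n",
        capture_output=True,
        encoding="utf-8",  # must match the -t utf-8 used when building the dictionary
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(analyze("おいおいおい"))
```

Passing the Shift-JIS ipadic folder here instead would return garbled text, because the encoding of the dictionary and the encoding used by the caller have to agree.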
After this, MeCab can analyze expressions that were not in the user dictionary, such as "Oioioi" ("Hey hey hey"), as a single morpheme.
Before introducing NEologd system dictionary
Hey hey hey
Hey hey adverb,General,*,*,*,*,Hey hey,Oioi,Oioi
Hey verb,*,*,*,*,*,Hey,Oy,Oy
EOS
After introducing NEologd system dictionary
Hey hey hey
Hey hey hey verb,*,*,*,*,*,Hey hey,Oioioi,Oioioi
EOS
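To compare results like these programmatically, the output can be split into surface forms and feature lists. This is a minimal sketch that assumes MeCab's default output format, in which each line is the surface form, a tab, and comma-separated features, with EOS closing each sentence; the function name `parse_mecab_output` and the sample strings are placeholders of my own.

```python
def parse_mecab_output(output):
    """Turn raw mecab output into a list of (surface, [features...]) tuples."""
    morphemes = []
    for line in output.splitlines():
        if not line or line == "EOS":
            continue  # skip blank lines and the end-of-sentence marker
        surface, _, features = line.partition("\t")
        morphemes.append((surface, features.split(",")))
    return morphemes


if __name__ == "__main__":
    # Placeholder text in the default "surface<TAB>features" layout.
    sample = "AB\tpos1,*,*\nC\tpos2,*,*\nEOS\n"
    for surface, features in parse_mecab_output(sample):
        print(surface, features[0])
```

Counting the tuples before and after adding NEologd shows directly whether an expression is now kept as one morpheme.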
URL of NEologd dictionary
https://github.com/neologd/mecab-ipadic-neologd/blob/master/ChangeLog