[PYTHON] Morphological analysis of sentences containing recent words in Windows10 64bit environment

problem

When I tried to use MeCab from Python on 64-bit Windows 10, I stumbled mainly on the following five points and felt the urge to smash my screen, so I have summarized the procedure systematically here.

Problem 1: MeCab cannot be installed with pip install alone
Problem 2: It installs, but morphological analysis does not work
Problem 3: The NEologd dictionary seems to improve the extraction of named entities, but it is difficult to install on Windows
Problem 4: The installation instructions say to add things to PATH, but I did not understand the concept of PATH
Problem 5: The DOS commands do not run

table of contents

① Install MeCab from the unofficial version of .exe for 64bit
② Install the library for handling MeCab in Python
③ Clone NEologd from git and compile it from the command prompt to perform morphological analysis more precisely

① Install MeCab from the unofficial version of .exe for 64bit

Reference: https://qiita.com/wanko5296/items/eeb7865ee71a7b9f1a3a

Officially, only a 32-bit build is provided, so it is better to install the 64-bit build made by volunteers.

The installer is published in the following GitHub repository. https://github.com/ikegami-yukino/mecab/releases/tag/v0.996

The installer asks you to select a character encoding. Choose the one that matches the text files you want to analyze. If you have no particular preference, choose UTF-8. (* The default is SHIFT-JIS.)
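If you are unsure which encoding your target text files use, a quick check like the following may help. This is only a minimal sketch: the function name guess_encoding is hypothetical, and it only distinguishes UTF-8 from Shift-JIS (cp932 in Python) by trying to decode the bytes.

```python
def guess_encoding(data: bytes) -> str:
    """Return 'utf-8' or 'cp932' depending on which decoder accepts the bytes."""
    # UTF-8 is tried first because it is the stricter of the two formats.
    for encoding in ("utf-8", "cp932"):
        try:
            data.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return "unknown"


# The same Japanese text saved in two different encodings.
print(guess_encoding("形態素解析".encode("utf-8")))  # utf-8
print(guess_encoding("形態素解析".encode("cp932")))  # cp932
```

For a real file, pass its raw bytes, for example the result of pathlib.Path("target.txt").read_bytes(), where target.txt is a placeholder name.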

② Install the library for handling MeCab in Python

Reference: https://qiita.com/yukinoi/items/990b6933d9f21ba0fb43

In cmd or the Anaconda Prompt, run:

pip install MeCab

sys is part of the Python standard library, so it does not need to be installed with pip. If you have installed the 64-bit version of MeCab described above, this pip command works.

Then, in Jupyter Notebook or similar, run

import MeCab

to verify that the installation succeeded.

If no error occurs, morphological analysis is already possible at this stage. To give it a try:

import MeCab
# MeCab expects Japanese input; this is the classic sample sentence
# "sumomo mo momo mo momo no uchi"
m = MeCab.Tagger("-Ochasen")
print(m.parse("すもももももももものうち"))

The output shows the sentence split into morphemes.
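The text returned by parse is easier to work with once it is split into fields. The sketch below assumes the usual tab-separated -Ochasen layout (surface form, reading, base form, part of speech, ...) ending with an EOS line; the helper name parse_chasen and the sample string are hand-written for illustration and are not real MeCab output.

```python
def parse_chasen(output: str):
    """Turn -Ochasen style text into a list of (surface, part_of_speech) pairs."""
    tokens = []
    for line in output.splitlines():
        if not line or line == "EOS":
            continue  # skip blank lines and the end-of-sentence marker
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # not a token line in the assumed layout
        tokens.append((fields[0], fields[3]))
    return tokens


# Hand-written sample in the assumed layout (illustrative only).
sample = "すもも\tスモモ\tすもも\t名詞-一般\t\t\nも\tモ\tも\t助詞-係助詞\t\t\nEOS\n"
print(parse_chasen(sample))
```

With MeCab installed, the same helper could be applied to the string returned by m.parse(...).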

However, recent words (e.g., My Number, Keyakizaka46) are split into pieces such as My / Number and Keyaki / Saka / 46.

To prevent this, install the NEologd dictionary, which contains a list of recent keywords.

③ Clone NEologd from git and compile it from the command prompt to perform morphological analysis more precisely.

・ Preparation

Reference: https://qiita.com/zincjp/items/c61c441426b9482b5a48 (Basically, this section rewrites the above article for readers who do not understand PATH and DOS commands.)

Install 64-bit git and 7-Zip as required. The installation steps are omitted here.
・ git Reference: https://eng-entrance.com/git-install
・ 7-Zip Official site: https://sevenzip.osdn.jp/

You need to add the 7-Zip installation folder to the PATH environment variable.

C:\Program Files\7-Zip

Let me briefly introduce this environment variable. PATH lists the folders that cmd searches for programs, so an application in one of those folders can be run by name alone. Adding a folder to it is often described as putting it on the PATH.

To set it, search for "environment variable" in the Control Panel or similar, and the settings screen will appear.

Select "Edit environment variables" to open the editing screen. Select the entry called Path, choose Edit > New, add the 7-Zip installation folder, and select OK. To repeat, the installation folder differs from person to person; the default is as follows.

C:\Program Files\7-Zip

7-Zip is now on the PATH.

From here on, we install the NEologd dictionary.

・ Install and compile the NEologd dictionary
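One way to confirm that the PATH setting took effect is to ask Python whether it can find the commands. This is a minimal sketch using shutil.which from the standard library; the helper name is_on_path is hypothetical, and 7z and git are the command names assumed for the tools installed above. Open a new prompt or restart Jupyter first, since running programs do not pick up PATH changes.

```python
import shutil


def is_on_path(command: str) -> bool:
    """Return True if the command can be found in a folder listed in PATH."""
    return shutil.which(command) is not None


for command in ("7z", "git"):
    status = "found" if is_on_path(command) else "NOT found on PATH"
    print(f"{command}: {status}")
```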

Launch a command prompt with administrator privileges and run:

git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git

This downloads the necessary dictionary files. Next, move into the seed directory of the downloaded repository and run dir to check the download. If files whose names start with neologd are listed, everything is fine. If the seed folder cannot be found and you get an error, move to C:\Users\(user name)\mecab-ipadic-neologd\seed directly.

cd mecab-ipadic-neologd\seed
dir

Incidentally, cd means moving into the directory called mecab-ipadic-neologd\seed, and dir lists its contents.

The files need to be decompressed with 7-Zip, so execute the following command. It means extracting the .xz files with 7-Zip.

7z X *.xz
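If the 7z command is not available, the same decompression can be sketched with Python's standard lzma module. This is an alternative to the 7-Zip step, not the procedure used in this article, and the function name extract_xz_files is hypothetical.

```python
import lzma
import shutil
from pathlib import Path


def extract_xz_files(directory):
    """Decompress every *.xz file in `directory`, writing the result next to it."""
    extracted = []
    for archive in sorted(Path(directory).glob("*.xz")):
        target = archive.with_suffix("")  # drop only the trailing .xz
        with lzma.open(archive, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)  # stream the data instead of loading it all
        extracted.append(target)
    return extracted
```

Calling extract_xz_files(".") from the seed directory would leave the .csv files beside the original .xz archives.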

Then compile the dictionary with the following command, which converts it to a format that MeCab can read. There are some caveats.

① NEologd is updated daily, so replace every 20191024 below with the date in the file names you actually downloaded when you cloned
② Change C:\Program Files\MeCab\bin\mecab-dict-index to match the installation folder of your MeCab
③ This article selected UTF-8 when installing MeCab; if you installed it with SHIFT-JIS, change "-t utf-8" to "-t shift-jis"

"C:\Program Files\MeCab\bin\mecab-dict-index" -d "C:\Program Files\MeCab\dic\ipadic" -u NEologd.20191024.dic -f utf-8 -t utf-8 mecab-user-dict-seed.20191024.csv

mkdir "C:\Program Files\MeCab\dic\NEologd"

move NEologd.20191024.dic "C:\Program Files\MeCab\dic\NEologd"

Incidentally, this means: run mecab-dict-index.exe located in C:\Program Files\MeCab\bin to compile mecab-user-dict-seed.20191024.csv, a UTF-8 file in the current directory you moved to with cd, into a dictionary named NEologd.20191024.dic. After that, create a NEologd folder in C:\Program Files\MeCab\dic and move the compiled dictionary into it.
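Because the date in the file names changes with every clone, it can help to derive the command from the seed file name instead of typing the date by hand. The sketch below only builds the argument list; the function name build_dict_index_command and the default folder are assumptions that mirror the command above.

```python
import re

# Seed file names are assumed to look like mecab-user-dict-seed.YYYYMMDD.csv
SEED_PATTERN = re.compile(r"^mecab-user-dict-seed\.(\d{8})\.csv$")


def build_dict_index_command(seed_csv, mecab_dir=r"C:\Program Files\MeCab",
                             charset="utf-8"):
    """Build the mecab-dict-index argument list for one seed CSV file."""
    match = SEED_PATTERN.match(seed_csv)
    if match is None:
        raise ValueError(f"unexpected seed file name: {seed_csv}")
    date = match.group(1)
    return [
        mecab_dir + r"\bin\mecab-dict-index",
        "-d", mecab_dir + r"\dic\ipadic",
        "-u", f"NEologd.{date}.dic",  # output dictionary name
        "-f", "utf-8",                # encoding of the seed CSV
        "-t", charset,                # encoding MeCab was installed with
        seed_csv,
    ]


print(build_dict_index_command("mecab-user-dict-seed.20191024.csv"))
```

The returned list could be passed to subprocess.run(command, check=True) from the seed directory.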

At this point you are almost done. Open mecabrc, located in C:\Program Files\MeCab\etc, with Notepad and change the userdic = line to

userdic = C:\Program Files\MeCab\dic\NEologd\NEologd.20191024.dic

then save, overwriting the file. Depending on your permissions, you may not be able to overwrite it. In that case, save mecabrc to another folder first and then move it back to the original location. If Notepad appended a .txt extension, don't forget to remove it.
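The mecabrc edit can also be expressed as a small text transformation. This is a sketch under assumptions: the function name set_userdic is hypothetical, the sample text is a made-up excerpt, and it assumes the userdic line may be commented out with a leading semicolon. It works on a string only, so reading and writing the real file (and the permission issue above) are left to you.

```python
import re

# Matches an active or ;-commented "userdic = ..." line (assumed mecabrc layout).
USERDIC_LINE = re.compile(r"^[ \t]*;?[ \t]*userdic[ \t]*=.*$", re.MULTILINE)


def set_userdic(mecabrc_text: str, dic_path: str) -> str:
    """Return mecabrc text whose userdic setting points at dic_path."""
    new_line = f"userdic = {dic_path}"
    if USERDIC_LINE.search(mecabrc_text):
        # A function replacement keeps backslashes in dic_path literal.
        return USERDIC_LINE.sub(lambda _match: new_line, mecabrc_text, count=1)
    # No userdic line at all: append one at the end.
    return mecabrc_text.rstrip("\n") + "\n" + new_line + "\n"


# Hypothetical excerpt of a mecabrc file.
sample_rc = "dicdir = $(rcpath)\\..\\dic\\ipadic\n\n; userdic = /home/foo/bar/user.dic\n"
print(set_userdic(sample_rc, r"C:\Program Files\MeCab\dic\NEologd\NEologd.20191024.dic"))
```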

To check that NEologd is actually applied, run a morphological analysis in Jupyter or similar. Keyakizaka46 should now be recognized as a single proper noun.

import MeCab
# MeCab expects Japanese input; in English the sentence reads
# "Keyakizaka46 is eating a red fox."
m = MeCab.Tagger("-Ochasen")
print(m.parse("欅坂46が赤いきつねを食べている。"))

end

To improve the accuracy of morphological analysis further, you can load a publicly available Japanese stop word list, and you can register words specific to your target text in a user dictionary. Steadily excluding unnecessary words in this way should improve the results.
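Stop word filtering itself is simple once the text is tokenized. The sketch below uses hand-written tokens and a tiny stop word list purely for illustration; in practice the tokens would come from MeCab and the list from a published stop word file. The function name remove_stop_words is hypothetical.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop word list, keeping their order."""
    stop_set = set(stop_words)  # set lookup keeps the filter fast for long lists
    return [token for token in tokens if token not in stop_set]


# Hand-written tokens and stop words (illustrative only).
tokens = ["欅坂46", "が", "赤い", "きつね", "を", "食べ", "て", "いる"]
stop_words = ["が", "を", "て", "いる"]
print(remove_stop_words(tokens, stop_words))
```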
