[PYTHON] Somehow dealing with MeCab's symbol / sahen-connection (サ変接続) problem

I expected a quick search to turn up this symptom, but I couldn't find anyone describing the same thing, so here is a memo. The PC environment is Windows 10.

The problem, understandable in 3 seconds

#Okay,*,*,*,*,*
Plum Fusesa-Fu,*,*,*,*,*

What is okke! What is Fusesa!

Supplement: MeCab installation procedure

This is the environment I set up. I'm simply using the MeCab build published on GitHub.

  1. Get the 64-bit build of MeCab from the following page: https://github.com/ikegami-yukino/mecab/releases

  2. Install the Shift-JIS and UTF-8 dictionaries, using the following page as a reference: Using MeCab with Python and R-Windows10-64bit

  3. Compile NEologd for both Shift-JIS and UTF-8, referring to the following page: To use NEologd for RMeCab dictionary on Windows 10 (Linux is not included)

What I wanted to do

After morphological analysis, I filtered the tokens down to nouns, verbs, and adjectives to remove unnecessary words ... image.png **I don't think the hash sign (#) is a noun!** (The words shown are an excerpt for verification.)

In short, I wanted the symbols that were being tagged as nouns to be recognized as symbols.
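
The filtering step can be sketched in Python as follows. This is only a minimal illustration, assuming MeCab's default text output (surface form, a tab, then comma-separated features with the part of speech first, ending with an EOS line). The function name `filter_by_pos` and the sample text are hypothetical, and the part-of-speech labels are the Japanese ones used by ipadic.

```python
# Keep only tokens whose part of speech is noun, verb, or adjective.
# Assumes MeCab's default "surface<TAB>pos,sub1,..." text output.
KEEP_POS = {"名詞", "動詞", "形容詞"}  # noun, verb, adjective (ipadic labels)


def filter_by_pos(mecab_output, keep=KEEP_POS):
    """Return the surface forms whose first feature field is in `keep`."""
    kept = []
    for line in mecab_output.splitlines():
        if line == "EOS" or "\t" not in line:
            continue  # skip the end-of-sentence marker and blank lines
        surface, features = line.split("\t", 1)
        pos = features.split(",")[0]
        if pos in keep:
            kept.append(surface)
    return kept


# Hypothetical sample: the hash sign is tagged as a noun,
# so it survives the filter even though it is really a symbol.
sample = "#\t名詞,サ変接続,*,*,*,*,*\n梅\t名詞,一般,*,*,*,*,梅,ウメ,ウメ\nEOS"
print(filter_by_pos(sample))
```

With a sample like this, the hash sign passes the noun filter, which is exactly the behavior I wanted to get rid of.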

After some investigation, it turns out that MeCab's default settings classify such symbols as a noun with sahen connection (サ変接続) in the first place. The advice is to modify and rebuild the dictionary, so let's do that by referring to the following page. Reference: Add entry to MeCab dictionary on Windows

dic\ipadic\unk.def
dic\ipadic-UTF8\unk.def

Change the 9th line of the two files above as shown below. (If you aren't using both R and Python, editing just `ipadic` should be enough, I think.) Depending on where the files are saved, overwriting them in place may be blocked; in that case, copy the file to the desktop, edit it there, and copy it back.

SYMBOL,1283,1283,17585,名詞,サ変接続,*,*,*,*,*
↓
SYMBOL,1283,1283,17585,記号,一般,*,*,*,*,*
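
If you would rather script the edit than do it by hand, the change can be sketched in Python like this. It is a rough sketch, assuming the original SYMBOL entry reads 名詞,サ変接続 (noun, sahen connection) and that the file is EUC-JP encoded; both are assumptions, and the function names are made up for illustration.

```python
# Rewrite the SYMBOL entry of unk.def so unknown symbols are tagged 記号,一般.
def patch_symbol_entry(unk_def_text):
    """Return the text with 名詞,サ変接続 replaced on SYMBOL lines only."""
    patched = []
    for line in unk_def_text.splitlines(keepends=True):
        if line.startswith("SYMBOL,"):
            line = line.replace("名詞,サ変接続", "記号,一般", 1)
        patched.append(line)
    return "".join(patched)


def patch_file(path, encoding="euc_jp"):
    # Assumption: the dictionary source file is EUC-JP encoded.
    # newline="" keeps the file's existing line endings untouched.
    with open(path, encoding=encoding, newline="") as f:
        text = f.read()
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(patch_symbol_entry(text))
```

Only lines starting with `SYMBOL,` are touched, so other categories that also use 名詞,サ変接続 stay as they are.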

After that, start the command prompt as an administrator. (Note that a normal command prompt will fail with a permission error.) Move to each folder where you made the change above and run the corresponding command.

# Run in dic\ipadic
..\..\bin\mecab-dict-index -f shift-jis

# Run in dic\ipadic-UTF8
..\..\bin\mecab-dict-index -f utf-8

Then start mecab from the command prompt and enter "#plum" ... image.png

**What is "okke"! What is "Fusesa"!**

Research of cause

The output looks like garbled characters (mojibake), but of a kind I haven't seen much. When UTF-8 text gets garbled, you usually see the question mark in a diamond.

I tried a site called "garbled tester" that deliberately produces garbled characters ... image.png **This is the one!!**
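
What such a tester does can be reproduced roughly in Python: encode a string with one character encoding and decode the bytes with another. The sketch below is only an illustration; the choice of encodings and the function name `mojibake_table` are my assumptions, and it does not claim to identify which combination produced the output above.

```python
# Show how one string looks when written in one encoding and read in another.
ENCODINGS = ["utf-8", "cp932", "euc_jp"]  # cp932 is Windows Shift-JIS


def mojibake_table(text, encodings=ENCODINGS):
    """Map (written_as, read_as) to the string a reader would see."""
    table = {}
    for written_as in encodings:
        raw = text.encode(written_as)
        for read_as in encodings:
            # errors="replace" substitutes U+FFFD for undecodable bytes
            table[(written_as, read_as)] = raw.decode(read_as, errors="replace")
    return table


for (written_as, read_as), shown in mojibake_table("記号,一般").items():
    print(f"{written_as:>6} -> {read_as:<6}: {shown}")
```

Shift-JIS bytes read as UTF-8 come out with the diamond question mark (U+FFFD), while UTF-8 or EUC-JP bytes read as Shift-JIS tend to turn into unrelated kanji and katakana, which is closer to what I was seeing.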

In other words, the encoding is apparently not being handled correctly. In that case, try this!

# Run in dic\ipadic
..\..\bin\mecab-dict-index -f euc-jp -f shift-jis

Now, "#plum" ... image.png **Why ...**

Reinstalling the dictionary and trying again didn't work either. And so I entered the labyrinth. Looking on the bright side, the output is no longer "Fusesa", so the dictionary change itself is taking effect.

After wandering around lost, I tried something like this:

# Run in dic\ipadic
..\..\bin\mecab-dict-index -f shift-jis -f euc-jp

So, "#plum" ...

image.png **Huh!?** It worked.

Perhaps I was misled because, when building the dictionary with NEologd, the arguments were in the order "-f <character encoding of the source dictionary> -f <character encoding of the dictionary to create>"? To begin with, I couldn't work out the details of the arguments even after reading the official page.

In other words, my guess is that a command of the form "mecab-dict-index -f <character encoding of the dictionary to create> -f <character encoding of the source dictionary>" is the right answer. Maybe.
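
For what it's worth, the help text of mecab-dict-index describes `-f` as the character encoding of the input CSV files and `-t` as the character encoding of the compiled binary dictionary, so the two encodings are normally passed with different flags. The Python sketch below builds such a command line. That the ipadic source files are EUC-JP (which would explain why ending with `-f euc-jp` worked) is an assumption on my part, and the helper names are hypothetical.

```python
import subprocess


def dict_index_command(source_charset, target_charset,
                       exe="mecab-dict-index", dic_dir="."):
    """Build a mecab-dict-index command line as a list of arguments."""
    return [
        exe,
        "-d", dic_dir,         # directory holding the CSV/def source files
        "-o", dic_dir,         # write the compiled dictionary to the same place
        "-f", source_charset,  # encoding of the source files
        "-t", target_charset,  # encoding of the compiled dictionary
    ]


def run(cmd):
    # Needs an administrator prompt if the dictionary is under Program Files.
    subprocess.run(cmd, check=True)


# Assumption: the ipadic source files are EUC-JP.
sjis_cmd = dict_index_command("euc-jp", "shift-jis")
utf8_cmd = dict_index_command("euc-jp", "utf-8")
print(" ".join(sjis_cmd))
print(" ".join(utf8_cmd))
```

Nothing is executed until `run(...)` is called, so the printed command can be checked first.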

UTF-8 dictionary conversion (unfinished)

MeCab appears to work fine at the command prompt, but the UTF-8 dictionary displays correctly there too. Anything other than Shift-JIS should be garbled at the command prompt, so the dictionary I meant to build as UTF-8 must actually be Shift-JIS.

Since the dictionary also has to be UTF-8 to be used from Python, rebuild the UTF-8 version by referring to the following. Reference: How to insert NEologd dictionary on Windows relatively easily-System dictionary

Using an editor called EmEditor, batch-convert the CSV files by saving them all with a specified encoding → character encoding: UTF-8 (with BOM) → line endings: LF only. Then run the following command.

# Run in dic\ipadic-UTF8
mecab-dict-index -f utf-8 -t utf-8
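
The batch conversion done with EmEditor above could also be sketched in Python. This is a minimal sketch under assumptions: the source encoding default (`euc_jp`) is a guess that should be adjusted to whatever the files really use, and the function name is hypothetical. Pass `dst_enc="utf-8-sig"` if you want a BOM as in the EmEditor setting.

```python
from pathlib import Path


def convert_encoding(src_dir, dst_dir, src_enc="euc_jp", dst_enc="utf-8",
                     patterns=("*.csv",)):
    """Re-encode dictionary source files and normalise line endings to LF."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for pattern in patterns:
        for src in sorted(src_dir.glob(pattern)):
            # Reading in text mode turns CRLF into "\n" ...
            text = src.read_text(encoding=src_enc)
            dst = dst_dir / src.name
            # ... and newline="\n" stops Windows from writing CRLF again.
            with open(dst, "w", encoding=dst_enc, newline="\n") as f:
                f.write(text)
            converted.append(dst)
    return converted
```

Writing to a separate output folder keeps the original files intact in case the assumed source encoding turns out to be wrong.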

You should now have a UTF-8 dictionary. Temporarily rewrite mecabrc as shown below ...

;6th line
dicdir =  $(rcpath)\..\dic\ipadic-UTF8
;8th line
userdic = C:\Program Files (x86)\MeCab\dic\NEologd\NEologd.20200521-u.dic

Check the dictionary's character encoding with mecab -D. image.png It's OK.
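
The same check can be scripted. The sketch below assumes that the dictionary-info option (`mecab -D`) prints `key: value` lines including one `charset:` line per loaded dictionary; that output format is an assumption, and the function names are hypothetical.

```python
import subprocess


def charsets_from_info(output):
    """Collect the values of every "charset:" line in the given text."""
    found = []
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "charset":
            found.append(value.strip())
    return found


def dictionary_charsets(mecab="mecab"):
    # Runs the real command, so MeCab must be on PATH.
    result = subprocess.run([mecab, "-D"], capture_output=True, check=True)
    # Decode leniently: file paths in the report may not be ASCII.
    return charsets_from_info(result.stdout.decode("ascii", errors="replace"))
```

Calling `dictionary_charsets()` after editing mecabrc would show whether both the system dictionary and the user dictionary report UTF-8.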

From Python ... image.png It doesn't work ...? This needs a little more verification.

Postscript

This is just grumbling. At first I wanted to use MeCab from R, so I applied the same dictionary change to the MeCab downloaded from the official website. As far as I remember, the characters were not garbled that time ... which makes me wonder why. My memory is hazy. I haven't verified whether 32-bit versus 64-bit has anything to do with it, so I can't say.

If you try to use the MeCab downloaded from the official website, Python fails with an error along the lines of "this is 32-bit!", so it is safer to use a 64-bit build.
