Text mining with Python ① Morphological analysis

Let's try text mining with Python (Python 3 series). We will proceed in the following steps.

① Morphological analysis (this article)
② Visualization with Word Cloud (next time)



Morphological analysis library

Morphological analysis is required to divide a Japanese sentence into words. A well-known and easy-to-understand example is splitting "すもももももももものうち" (sumomo mo momo mo momo no uchi, "plums and peaches are both kinds of peach") into "すもも / も / もも / も / もも / の / うち".

Unlike English, Japanese has no clear word boundaries, so dividing a sentence into words is very difficult, and it is not realistic to do it with your own code.
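To make the difference concrete, here is a minimal sketch (the sample sentences are illustrative, not taken from any dataset). Splitting on whitespace is enough to get words out of an English sentence, but it returns the Japanese sentence as a single unsplit string.

```python
# Illustrative only: whitespace splitting works for English but not for Japanese.
english = "plums and peaches are both peaches"
japanese = "すもももももももものうち"

english_words = english.split()
japanese_words = japanese.split()

print(english_words)   # six separate words
print(japanese_words)  # one element: the whole sentence, unsplit
```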

Therefore, we use an open-source library called "MeCab". (It is probably the most widely used tool for Japanese morphological analysis. It is apparently pronounced "mekabu".)

Install MeCab

To use MeCab from Python, the following are required:
・ Installation of MeCab itself
・ Installation of a dictionary
・ Installation of the Python bindings

However, since the binary package for Windows includes a dictionary, you do not need to install a dictionary separately. The procedure below assumes installation on Windows.

First, download the following from the download site linked on the official site:
・ mecab-0.996.exe
・ mecab-python-0.996.tar.gz

Next, run mecab-0.996.exe to install MeCab itself. Along the way you are asked to select the character encoding of the dictionary; I chose the default, Shift-JIS. (I'm a little worried about whether I should have chosen UTF-8 instead ...)

You should be able to use the mecab command at this point, but the installer does not seem to add it to your PATH. Manually add the bin folder of the installation directory to your PATH.
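One way to confirm that the PATH change took effect is to ask Python where it finds the command. This is only a sketch: `find_command` is a hypothetical helper name, and it relies solely on the standard library's `shutil.which`. Open a new command prompt after editing PATH so the change is picked up.

```python
import shutil


def find_command(name="mecab"):
    """Return the full path of `name` if it is found on PATH, otherwise None."""
    return shutil.which(name)


location = find_command()
if location is None:
    print("mecab was not found on PATH; check that the bin directory was added.")
else:
    print("mecab found at:", location)
```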

Try using mecab on the command line. As usual, the example is "すもももももももものうち". (↓ indicates pressing Enter.)

>mecab↓
すもももももももものうち↓
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
EOS
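
Each output line consists of the surface form, a tab, and a comma-separated list of features (part of speech, base form, reading, and so on), and the sentence ends with EOS. As a rough sketch of how such text could be handled in Python without the bindings, the hypothetical helper `parse_mecab_output` below splits it into tokens. The sample string is typed in by hand to mirror the default output format for this sentence, so the code does not call MeCab itself.

```python
# Hand-typed sample in MeCab's default output format: surface<TAB>features, then EOS.
SAMPLE_OUTPUT = (
    "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ\n"
    "も\t助詞,係助詞,*,*,*,*,も,モ,モ\n"
    "もも\t名詞,一般,*,*,*,*,もも,モモ,モモ\n"
    "も\t助詞,係助詞,*,*,*,*,も,モ,モ\n"
    "もも\t名詞,一般,*,*,*,*,もも,モモ,モモ\n"
    "の\t助詞,連体化,*,*,*,*,の,ノ,ノ\n"
    "うち\t名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ\n"
    "EOS\n"
)


def parse_mecab_output(text):
    """Return a list of (surface, features) pairs, skipping blank lines and EOS."""
    tokens = []
    for line in text.splitlines():
        if not line or line == "EOS":
            continue
        surface, _, feature_str = line.partition("\t")
        tokens.append((surface, feature_str.split(",")))
    return tokens


tokens = parse_mecab_output(SAMPLE_OUTPUT)
surfaces = [surface for surface, _ in tokens]
# The first feature is the part of speech; 名詞 means noun.
nouns = [surface for surface, features in tokens if features[0] == "名詞"]

print(" / ".join(surfaces))
print(nouns)
```

Keeping only nouns, as in the last step, is a common preparation for a word cloud, although which parts of speech to keep is a design choice.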

Install MeCab Python bindings

Next, extract mecab-python-0.996.tar.gz to a suitable directory. Move into the extracted directory and run build and install as described in the README. Below is the result.

>python setup.py build
'mecab-config' is not recognized as an internal or external command,
operable program or batch file.
Traceback (most recent call last):
  File "setup.py", line 13, in <module>
    version = cmd1("mecab-config --version"),
  File "setup.py", line 7, in cmd1
    return os.popen(str).readlines()[0][:-1]
IndexError: list index out of range

The build fails right away. The mecab-config command that setup.py calls does not seem to exist. The bin directory is on my PATH, but I can't find any executable with that name in it.
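
The IndexError itself follows from the line shown in the traceback: when the command does not exist, nothing is written to standard output, so `readlines()` returns an empty list and `[0]` fails. The sketch below reproduces that behaviour with a deliberately nonexistent command name. `cmd1` copies the logic from the traceback, while `cmd1_checked` is a hypothetical variant (not part of mecab-python) that reports the real cause more clearly.

```python
import os


def cmd1(command):
    # Same logic as the line in the traceback: take the first output line, drop the newline.
    return os.popen(command).readlines()[0][:-1]


def cmd1_checked(command):
    """Hypothetical variant: raise a descriptive error when the command prints nothing."""
    with os.popen(command) as pipe:
        lines = pipe.readlines()
    if not lines:
        raise RuntimeError(
            f"'{command}' produced no output; is it installed and on PATH?"
        )
    return lines[0].rstrip("\n")


missing = "no-such-command-for-demo --version"  # deliberately nonexistent

try:
    cmd1(missing)
except IndexError as exc:
    print("cmd1 failed with IndexError:", exc)

try:
    cmd1_checked(missing)
except RuntimeError as exc:
    print("cmd1_checked failed with RuntimeError:", exc)
```

The shell's own "command not found" message still appears on standard error in both cases, because `os.popen` captures only standard output.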

After some searching, it appears that installing the Python bindings on Windows is quite a hassle. It could probably be done with enough effort, but I stopped here because the goal is to do text mining, not to get MeCab running on Windows. I decided to install it in a separate Linux environment instead.


Reference site
Building an environment using MeCab with R and Python (Windows, Mac)
