[PYTHON] Text analysis that can be done in 5 minutes [Word Cloud]

Let's do some easy text mining with Python 3.x.

**This time, as much of the processing as possible is done in the Linux terminal so that even people who have never used Python can follow along. Every command you need to type is written out, so rest assured!** (I don't know anything about Python ...)

What is text mining?

Text mining is data mining applied to character strings. It is a text data analysis method that extracts useful information by splitting ordinary sentences into words and phrases and then analyzing how often they appear, which ones appear together, how their appearance trends, and how it changes over time. Source: [Wikipedia](https://ja.m.wikipedia.org/wiki/%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%83%9E%E3%82%A4%E3%83%8B%E3%83%B3%E3%82%B0)
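To make the idea concrete, here is a minimal sketch of the "split into words and count how often they appear" step. It assumes the text is already separated by spaces; the sample sentence and the function name `count_words` are made up for illustration.

```python
from collections import Counter


def count_words(text):
    """Count how often each space-separated word appears in `text`."""
    return Counter(text.split())


# Hypothetical sample sentence, already separated by spaces.
sample = "the live was great and the songs were great"

# Show the two most frequent words with their counts.
for word, count in count_words(sample).most_common(2):
    print(word, count)
```

A word cloud is essentially a picture of these counts: the more often a word appears, the larger it is drawn.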

This time, let's use text mining to create a *word cloud*! This is what a word cloud looks like. ↓ (image: wc1-1.png)

First prepare the data

First, prepare the data to be analyzed. Since it is hard to come up with data on the spot, this time I will use the **tweet data of the online event** "Idolmaster Shiny Colors MUSIC DAWN DAY 1" held on October 31st.

Click here to download: [Text data # Shanimas MUSICDAWNday2](https://www.github.com/ysok2135/py/tree/main/%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90%E5%85%83%E3%83%86%E3%82%99%E3%83%BC%E3%82%BF_SC_DOWN_20201031_utf8.csv)

Installation of Python 3.x series

sudo apt install python3.7

Perform morphological analysis of data

Unlike English, Japanese does not separate words with spaces, so you cannot start text mining on the raw text right away. Therefore, this time we will use **MeCab, the well-known open source morphological analysis engine**.

MeCab related installation

Run the following commands in order.

sudo apt install mecab
sudo apt install libmecab-dev
sudo apt install mecab-ipadic
sudo apt install mecab-ipadic-utf8
pip install mecab-python3

If you want to improve the analysis accuracy, you should also install an additional dictionary such as NEologd, but that is a hassle, so I skip it this time.

Actually perform morphological analysis

Many sites run MeCab from Python, but I think doing it in the terminal is much easier. First, save the file to be analyzed as "test.txt". Then enter the following in the terminal:

mecab -Owakati test.txt -o sample.txt

**That's all!** When I check the file, it has been analyzed properly.
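For reference, the same word splitting can also be done from Python with the `mecab-python3` package installed earlier. The sketch below is only an approximation of the command above: the function names are hypothetical, blank lines are dropped rather than kept, and depending on the `mecab-python3` version you may need to point it at a dictionary before `MeCab.Tagger` will start.

```python
def make_wakati():
    """Return a function that splits Japanese text into space-separated words."""
    import MeCab  # provided by `pip install mecab-python3`

    tagger = MeCab.Tagger("-Owakati")  # same output format as `mecab -Owakati`
    return lambda line: tagger.parse(line).strip()


def segment_lines(lines, segment):
    """Apply `segment` to every non-empty line and return the results."""
    return [segment(line.strip()) for line in lines if line.strip()]


def convert_file(src="test.txt", dst="sample.txt"):
    """Read `src`, split each line into words with MeCab, and write `dst`."""
    with open(src, encoding="utf-8") as f:
        lines = segment_lines(f, make_wakati())
    with open(dst, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```

Calling `convert_file()` with no arguments reads "test.txt" and writes "sample.txt", the same file names used in the terminal command.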

Finally, on to WordCloud!

Wordcloud installation

pip install wordcloud

That's all.

Try to create a word cloud

Copy and paste the sample code below.

sample.py


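from wordcloud import WordCloud

# Read the space-separated text produced by MeCab
with open('sample.txt', encoding='utf-8') as f:
    text = f.read()

# Keywords to leave out of the word cloud
stop_words = [ u'https', u'co', u'Thank you', u'RT', u'Ah', u'']
# font_path must point to a Japanese font, otherwise the characters are not drawn
wc = WordCloud(background_color="white",width=1600, height=1200, font_path='GenEiLateGoP_v2.ttf', stopwords=set(stop_words))
wc.generate(text)      # build the word cloud from the text
wc.to_file('wc1.png')  # save it as an image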

Code description

**① Import wordcloud and read the file**

from wordcloud import WordCloud
with open('sample.txt', encoding='utf-8') as f:
    text = f.read()

**② Various settings**
stop_words ・・・ Keywords to exclude. **It is recommended to run it several times and adjust the keywords.**
background_color ・・・ Background color
width, height ・・・ Size of the image (in pixels)
font_path ・・・ Path to the font file (this time I am using GenEi Latemin)
↑ **[Super important! If you don't load a Japanese font, the characters turn into tofu (blank boxes)!!!]**

stop_words = [ u'https', u'co', u'Thank you', u'RT', u'Ah', u'']
wc = WordCloud(background_color="white",width=1600, height=1200, font_path='GenEiLateGoP_v2.ttf', stopwords=set(stop_words))
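Choosing stop words by trial and error is easier if you can see which tokens dominate the text. The helper below is a rough sketch of that, not part of wordcloud itself; the name `top_tokens` is hypothetical, and because WordCloud applies its own tokenization, the counts may not match exactly what ends up in the image.

```python
from collections import Counter


def top_tokens(text, stop_words=(), n=20):
    """Return the `n` most frequent space-separated tokens not in `stop_words`."""
    excluded = set(stop_words)
    tokens = [t for t in text.split() if t not in excluded]
    return Counter(tokens).most_common(n)
```

For example, reading "sample.txt" into `text` and printing `top_tokens(text, stop_words)` lists the candidates worth adding to `stop_words` before generating the image again.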

**③ Execution**

wc.generate(text)
wc.to_file('wc1.png')
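Because the font matters so much, one small safeguard is to confirm that the font file is really there before handing it to WordCloud. This is only a sketch: `require_font` is a hypothetical helper, it assumes the font file sits in the directory you run the script from, and it does not check whether the font actually contains Japanese glyphs.

```python
from pathlib import Path


def require_font(font_path):
    """Return `font_path` as a string, or raise if the file does not exist."""
    path = Path(font_path)
    if not path.is_file():
        raise FileNotFoundError(f"Font file not found: {path}")
    return str(path)
```

In the sample code this would be used as `font_path=require_font('GenEiLateGoP_v2.ttf')`, so a missing file is reported with a clear message before any drawing starts.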

Actually run it

python3 sample.py

Execution result (image: wc1.png)

Great!!! Mr. Tsuda's presence is overwhelming! (lol)

It might also be fun to try this on texts from Aozora Bunko. I hope this gets you interested in sentiment analysis and similar topics. Thank you for reading to the end.

I'm on Twitter.

* Verification environment
Ubuntu 18.04 LTS
Python 3.7
