[PYTHON] I played with wordcloud!

Introduction

I recently decided to use wordcloud, so I'm posting this as a memo to myself.

MeCab is used here, so if you're wondering "What is MeCab?", please click [here][1]!

This post covers everything from installing wordcloud to outputting an image.

What story is this?

While I'm at it, here is a quiz made with wordcloud (lol). Which story is the image below based on?

wordcloud.png

I will give the answer in the **Conclusion**!

What is wordcloud

A technique that picks out the words that appear most frequently in a text and displays each one at a size that reflects its frequency.
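To make the idea concrete, here is a minimal standard-library sketch of "size according to frequency". The function name `word_sizes`, the linear scaling, and the size range are my own assumptions for illustration; wordcloud itself uses its own layout and scaling logic, which this does not reproduce.

```python
import re
from collections import Counter


def word_sizes(text, min_size=10, max_size=60):
    """Map each word to a font size that grows linearly with its frequency.

    Hypothetical helper for illustration only, not part of wordcloud.
    """
    # Same token pattern the article lists as wordcloud's default regexp
    words = re.findall(r"\w[\w']+", text.lower())
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    return {
        word: round(min_size + (max_size - min_size) * count / top)
        for word, count in counts.items()
    }


sizes = word_sizes("wolf wolf wolf grandmother grandmother basket")
print(sizes)
```

The most frequent word gets `max_size`, and rarer words shrink toward `min_size`.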

The official site is [here][2].

You can start using it right away by installing it with pip or a similar tool.

pip install wordcloud

Actually running it

I think it is faster to explain with images, so I ran it right away. The story used here is "Little Red Riding Hood".

Program

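import MeCab
from wordcloud import WordCloud

FILE_NAME = "sample.txt"

# Read the whole story as a single string
with open(FILE_NAME, "r", encoding="utf-8") as f:
    CONTENT = f.read()

# Split the text into space-separated words (needed for Japanese)
tagger = MeCab.Tagger("-Owakati")
parse = tagger.parse(CONTENT)

# Build the word cloud from the segmented text and save it as an image
wordcloud = WordCloud()
wordcloud.generate(parse)
wordcloud.to_file("wordcloud.png")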

wordcloud = WordCloud()

Creates the WordCloud object used for generation and drawing.

wordcloud.generate("string")

Creates the word cloud from text (a string).

wordcloud.to_file("image file name")

Exports the result to an image file.

The above steps will create a wordcloud image.

image

wordcloud.png

wordcloud displays frequently used words in a larger size.

However, note that **one-letter words** such as "a" and "I" are not displayed!

You can see that "grandmother", "Little Red", and "Red Riding" appear often in "Little Red Riding Hood".

Various settings

You can pass settings to WordCloud, such as the background color and a limit on the number of words.

Here are some of the settings you will use most often.

| Parameter | Default | Description |
|---|---|---|
| width | 400 | Image width in pixels |
| height | 200 | Image height in pixels |
| background_color | "black" | Background color |
| colormap | None | Color map used for the text |
| collocations | True | Whether to include collocations (two-word phrases) |
| stopwords | None | Words to exclude (list) |
| max_words | 200 | Maximum number of words to display |
| regexp | r"\w[\w']+" | Regular expression for the words to display |

I want to change the size of the image

The previous image is a little small (it was sized for Qiita).

If you set it to 1920 wide by 1080 high, which is also a common desktop resolution, you get the following:

wordcloud = WordCloud(width=1920, height=1080)

I want to change the color

The default background and text colors are hard to read...

Declare the background color you want. Several color maps are available for the text, so declare one of those as well.

This time, the background color is white and the text color map is summer.

wordcloud = WordCloud(background_color="white", colormap="summer")

wordcloud.png

I want to break down collocations like Red Riding

"Red" often shows up more than once in the image, as in "Red Riding" and "Little Red".

So try the following setting. It is very convenient because the words in a collocation are then counted as separate words.

wordcloud = WordCloud(background_color="white", colormap="summer", collocations=False)

wordcloud.png
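For anyone wondering what a collocation is in practice, the sketch below simply counts adjacent word pairs using only the standard library. The helper name `count_bigrams` and the sample text are my own assumptions; wordcloud applies its own scoring to decide which pairs to keep, which this simplified version does not attempt.

```python
import re
from collections import Counter


def count_bigrams(text):
    """Count every pair of adjacent words (hypothetical helper)."""
    words = re.findall(r"\w[\w']+", text)
    # zip(words, words[1:]) pairs each word with the one that follows it
    return Counter(zip(words, words[1:]))


text = "Little Red Riding Hood met the wolf. Little Red Riding Hood ran."
bigrams = count_bigrams(text)
print(bigrams.most_common(3))
```

Pairs such as ("Red", "Riding") repeat as a unit, which is why they show up as one entry when collocations are enabled.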

I don't want to display certain words

It doesn't make much sense to show words like "the", "and", and "to" in a word cloud.

If you do not want to display such words, you can declare them as a list as follows. (This time, to make the effect obvious, I hide ["Little", "grandmother"].)

wordcloud = WordCloud(background_color="white", colormap="summer", collocations=False, stopwords=["Little", "grandmother"])

wordcloud.png

I want to limit the number of words that are displayed

By default, wordcloud outputs up to 200 words. You can set how many words to output as follows.

wordcloud = WordCloud(background_color="white", colormap="summer", collocations=False, stopwords=["Little", "grandmother"], max_words=10)

wordcloud.png

Looking at this, it seems you could get fairly good data by removing words like [the, and, to]?
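The combined effect of stopwords and max_words can be sketched with the standard library alone: drop the excluded words, then keep only the N most frequent ones. The function `top_words` and the sample sentence are assumptions of mine for illustration and are not how wordcloud is implemented internally.

```python
import re
from collections import Counter


def top_words(text, stopwords=(), max_words=10):
    """Return the most frequent words after removing stopwords.

    Hypothetical helper; the comparison is case-insensitive.
    """
    excluded = {word.lower() for word in stopwords}
    words = re.findall(r"\w[\w']+", text)
    counts = Counter(w for w in words if w.lower() not in excluded)
    return counts.most_common(max_words)


text = "the wolf and the grandmother and the wolf and the basket"
result = top_words(text, stopwords=["the", "and"], max_words=2)
print(result)
```

With "the" and "and" removed, the words that actually characterize the story rise to the top.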

I want to display even one-letter words

As mentioned above, wordcloud does not output one-letter words by default. Loosening the pattern with regexp lets words of one or more letters through.

wordcloud = WordCloud(background_color="white", colormap="summer", collocations=False, stopwords=["the", "and", "to"], max_words=20, regexp=r"[\w']+")

wordcloud.png

It's understandable that **a** is the most common...
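The difference between the two patterns is easy to check with Python's `re` module directly. This is only a sketch of how the patterns tokenize a sentence (the sample sentence is my own), not a trace of wordcloud's internals.

```python
import re

text = "I saw a wolf"

# Default pattern from the settings table: one word character followed by
# at least one more, so a token needs two or more characters
default_pattern = r"\w[\w']+"

# Relaxed pattern from this section: one or more characters is enough
relaxed_pattern = r"[\w']+"

default_tokens = re.findall(default_pattern, text)
relaxed_tokens = re.findall(relaxed_pattern, text)

print(default_tokens)  # ['saw', 'wolf']
print(relaxed_tokens)  # ['I', 'saw', 'a', 'wolf']
```

The leading `\w` in the default pattern is what forces the two-character minimum.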

For more details, see the [official documentation][2].

A common pitfall with Japanese

If you run the above program on Japanese text, you get the following image...

wordcloud.png

This is because the font used in wordcloud does not support Japanese.

So you need to set a font that does.

The font settings are as follows.

FONT_FILE = r"C:\Windows\Fonts\MSGOTHIC.TTC"
wordcloud = WordCloud(font_path=FONT_FILE, background_color="white", colormap="summer", collocations=False, regexp=r"[\w']+")

Huh? Why MS Gothic? Because I'm a **former COBOL programmer**! (Those who know will get it... I think.)

With that, the output looks like this:

wordcloud.png
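The MS Gothic path above only exists on Windows, so on other machines it may help to pick the first font file that is actually present. The sketch below assumes a hypothetical helper `find_font`; the non-Windows path in the candidate list is an assumption about where a Japanese-capable font might live and should be adjusted to your environment.

```python
import os


def find_font(candidates):
    """Return the first font path that exists, or None (hypothetical helper)."""
    for path in candidates:
        if os.path.exists(path):
            return path
    return None


CANDIDATE_FONTS = [
    r"C:\Windows\Fonts\MSGOTHIC.TTC",  # the path used in this article
    # Assumption: a typical location for Noto CJK fonts on some Linux systems
    "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc",
]

font_file = find_font(CANDIDATE_FONTS)
print(font_file)
```

If `font_file` is not None, it can be passed as `font_path=font_file` when creating the WordCloud object.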

Conclusion

That was a rough summary of wordcloud.

By the way, the answer to the earlier quiz is...

wordcloud.png

**The Three Little Pigs**!

In a word cloud, the words that appear often are drawn in large letters. Looking at the image,

little pig house

these three are the words that appear most often!

Turning text into a word cloud like this can also serve as an indicator of what the text is about (˘ω˘)
