I made a word cloud with Python

Introduction

I'm a beginner, but as Python practice I wanted to draw an eye-catching picture with a word cloud, so I gave it a try. This article records what I did as a memo.

Work environment

The working environment is as follows.

Ubuntu 18.04.4 LTS
Python 3.6.9
mecab-python3 0.996.5

Adjust the file names and arguments in the source code of this article to suit your own environment.

What is Word Cloud

A word cloud is a way of picking out the words that appear frequently in a text and displaying each one at a size that reflects its frequency. The term also refers to automatically laying out words that appear frequently on web pages and blogs. By varying not only the size of the characters but also their color, font, and orientation, the content of a text can be conveyed at a glance. (From the Digital Daijisen entry)

Below is the finished word cloud. The text is a speech by Apple founder Steve Jobs, which I prepared separately and passed in as an input file.

Also, using a mask image, the words are drawn inside the outline of Jobs and the Apple logo.

Words such as "myself", "life", "like", and "university" stand out. Personally I think it turned out cool, so I'm happy :clap:
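As a rough illustration of sizing by frequency, the sketch below counts words with collections.Counter, gives the most frequent word the largest font size, and shrinks the others in proportion. The function name, the sample sentence, and the size bounds are made up for this illustration; the wordcloud library uses its own, more elaborate sizing and layout rules.

```python
from collections import Counter

def font_sizes(text, min_size=10, max_size=60):
    """Map each word to a font size that grows with its frequency (illustrative only)."""
    counts = Counter(text.lower().split())
    if not counts:
        return {}
    top = max(counts.values())  # frequency of the most common word
    return {word: min_size + (max_size - min_size) * n / top
            for word, n in counts.items()}

sizes = font_sizes("stay hungry stay foolish stay")
print(sizes)
```

Here "stay" appears three times and gets the largest size, while "hungry" and "foolish" appear once each and come out smaller.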

wc_image_ja.png

Create WordCloud

Here is the final source code to create the image above.

sample4wordcloud.py


#coding: utf-8
from PIL import Image
import numpy as np
from matplotlib import pyplot as plt
from wordcloud import WordCloud
import MeCab

#Word cloud creation function(English text version)
def create_wordcloud_en(text, image):
    fontpath = 'NotoSansCJK-Regular.ttc'
    #Common English words to hide from the cloud
    stop_words_en = ['am', 'is', 'of', 'and', 'the', 'to', 'it',
                     'for', 'in', 'as', 'or', 'are', 'be', 'this', 'that', 'will', 'there', 'was']

    wordcloud = WordCloud(background_color="white",
                          font_path=fontpath,
                          width=900,
                          height=500,
                          mask=image,  #mask image passed in as an argument
                          contour_width=1,
                          contour_color="black",
                          stopwords=set(stop_words_en)).generate(text)

    #drawing
    plt.figure(figsize=(15,20))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
    #png output
    wordcloud.to_file("wc_image_en.png")

#Word cloud creation function(Japanese text version)
def create_wordcloud_ja(text, image):
    fontpath = 'NotoSansCJK-Regular.ttc'
    #Words to hide from the cloud
    stop_words_ja = ['もの', 'こと', 'とき', 'そう', 'など', 'これ', 'よう', 'これら', 'それ', 'すべて']
    #Morphological analysis
    tagger = MeCab.Tagger()
    tagger.parse('')  #called once before parseToNode to avoid UnicodeDecodeError
    node = tagger.parseToNode(text)

    #Keep only nouns that are not stop words
    word_list = []
    while node:
        word_type = node.feature.split(',')[0]  #first field is the part of speech
        word_surf = node.surface.split(',')[0]
        if word_type == '名詞' and word_surf not in stop_words_ja:  #'名詞' means noun
            word_list.append(node.surface)
        node = node.next

    word_chain = ' '.join(word_list)
    wordcloud = WordCloud(background_color="white",
                          font_path=fontpath,
                          width=900,
                          height=500,
                          mask=image,  #mask image passed in as an argument
                          contour_width=1,
                          contour_color="black",
                          stopwords=set(stop_words_ja)).generate(word_chain)

    #drawing
    plt.figure(figsize=(15,20))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
    #png output
    wordcloud.to_file("wc_image_ja.png")


#Loading the required files
#Reading the text
with open('jobs.txt', 'r', encoding='utf-8') as fi:
    text = fi.read()
#Loading the mask image to use
msk = np.array(Image.open("apple.png"))

create_wordcloud_ja(text, msk)

Two functions for creating a word cloud are defined, one for Japanese text and one for English text. They contain nearly identical code, so I'm sure there is a smarter way to write this ... Should I use a class?
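One possible way to remove the duplication, short of writing a class, is to separate "turn the text into words" from "draw the cloud" and pass the tokenizer in as an argument. The following is only a sketch under that assumption: extract_words, create_wordcloud, and their parameters are names invented here, not part of the original script. The WordCloud arguments are the same ones used above, except that stopwords is an empty set because the filtering has already been done.

```python
def extract_words(text, tokenize=None, stop_words=()):
    """Return a space-separated string of the words to draw.

    tokenize is any callable that turns text into a list of words
    (for Japanese, a MeCab-based function). When omitted, the text is
    split on whitespace, which is enough for English.
    stop_words are expected in lowercase.
    """
    tokens = tokenize(text) if tokenize else text.split()
    return ' '.join(t for t in tokens if t.lower() not in stop_words)

def create_wordcloud(words, mask, out_path, font_path='NotoSansCJK-Regular.ttc'):
    """Draw the cloud and save it as a PNG; shared by both languages."""
    # Imported here so that extract_words works without wordcloud installed
    from wordcloud import WordCloud
    wordcloud = WordCloud(background_color="white",
                          font_path=font_path,
                          width=900,
                          height=500,
                          mask=mask,
                          contour_width=1,
                          contour_color="black",
                          stopwords=set()).generate(words)
    wordcloud.to_file(out_path)
    return wordcloud

print(extract_words("I like Apple and the Mac", stop_words={'and', 'the'}))
```

With this split, the English and Japanese versions differ only in the tokenize function and the stop word list they pass in.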

About source code

The processing needed before drawing a word cloud differs between English and Japanese text.

In English, as in "I like Apple.", words are separated by spaces, so the word boundaries are never in doubt when splitting the text into parts of speech. In Japanese, however, as in 「私はAppleが好きです。」, the boundaries between words are not marked.

Therefore, Japanese text has to go through morphological analysis to split it into words. This time I used MeCab for the morphological analysis.
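The difference is easy to see with plain str.split(). The two sample sentences below are chosen for this illustration (the Japanese one is the same statement, "I like Apple.", written without spaces).

```python
english = "I like Apple."
japanese = "私はAppleが好きです。"

# Splitting on whitespace finds the word boundaries in English ...
print(english.split())   # ['I', 'like', 'Apple.']
# ... but returns the whole Japanese sentence as a single "word"
print(japanese.split())  # ['私はAppleが好きです。']
```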

A rough explanation of the morphological analysis code

sample.py


    tagger = MeCab.Tagger()
    tagger.parse('')
    node = tagger.parseToNode(text)

    word_list = []
    while node:
        word_type = node.feature.split(',')[0]
        word_surf = node.surface.split(',')[0]
        if word_type == '名詞' and word_surf not in stop_words_ja:
            word_list.append(node.surface)
        node = node.next

The above is the part where morphological analysis is performed.

tagger = MeCab.Tagger()

This sets the output mode. Passing a different argument changes the output format.

- "-Ochasen": ChaSen-compatible format
- "-Owakati": word-separated output only
- "-Oyomi": reading only

All the arguments start with O and it's cute (laughs)

tagger.parse('')

I don't fully understand this part, but it seems that calling this before passing the data to the parser avoids a UnicodeDecodeError ...

node = tagger.parseToNode(text)

The analysis result is assigned to node. Each node carries surface (the word itself) and feature (part-of-speech information), which you can access as node.surface and node.feature.
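To show why split(',')[0] gives the part of speech, here is a small sketch using a hand-written feature string instead of a real MeCab node. It assumes the comma-separated layout of the IPA dictionary, where the first field is the part of speech and the seventh is the dictionary form; other dictionaries may order the fields differently.

```python
# Hand-written stand-in for node.feature, assuming the IPA dictionary layout
feature = "名詞,一般,*,*,*,*,大学,ダイガク,ダイガク"

fields = feature.split(',')
word_type = fields[0]   # part of speech: '名詞' means noun
base_form = fields[6]   # dictionary form of the word
print(word_type, base_form)
```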

    word_list = []
    while node:
        word_type = node.feature.split(',')[0]
        word_surf = node.surface.split(',')[0]
        if word_type == '名詞' and word_surf not in stop_words_ja:
            word_list.append(node.surface)
        node = node.next

    word_chain = ' '.join(word_list)

Each node is read in turn, and words whose part of speech is noun and which are not in stop_words_ja are added to word_list.

Then the list is joined into a single string, with a space as the delimiter, to get word_chain.
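The filtering and joining step can be tried without MeCab installed by substituting plain tuples for the nodes. The (surface, feature) pairs and the short stop word set below are hand-written for illustration and assume IPA dictionary style feature strings.

```python
# (surface, feature) pairs standing in for MeCab nodes; hand-written for illustration
nodes = [
    ("私", "名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ"),
    ("は", "助詞,係助詞,*,*,*,*,は,ハ,ワ"),
    ("大学", "名詞,一般,*,*,*,*,大学,ダイガク,ダイガク"),
    ("の", "助詞,連体化,*,*,*,*,の,ノ,ノ"),
    ("こと", "名詞,非自立,一般,*,*,*,こと,コト,コト"),
]
stop_words_ja = {"こと", "それ", "よう"}

# Keep nouns that are not stop words, then join them with spaces
word_list = [surface for surface, feature in nodes
             if feature.split(',')[0] == '名詞' and surface not in stop_words_ja]
word_chain = ' '.join(word_list)
print(word_chain)
```

The particles are dropped because they are not nouns, and "こと" is dropped because it is a stop word, so only "私" and "大学" remain.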

Figure before source modification

Actually, the first picture I drew used every noun, with no stop words set. It looked like this ...

image.png

In this figure, strings such as "こと" (koto), "それ" (it), and "よう" (yo), which are not interesting to look at, are conspicuous.

This is not quite it ... :frowning2:

So I made a list of the words I did not want to show and filtered them out of the cloud.

Finally

I touched Python for the first time in a while. It really is fun :relaxed: Next, I'm thinking of scraping social media and playing with the data. If you spot any mistakes in this article or have any advice, please let me know.

Referenced articles / related articles

https://sleepless-se.net/2018/08/24/python-mecab-wakatigaki/

https://qiita.com/furipon308/items/be97abf25cf4caa0574e

https://qiita.com/yonedaco/items/27e1ad19132c9f1c9180

https://analysis-navi.com/?p=2295
