[Python] Visualize 2019 nem with WordCloud

This article is part of the nem #2 Advent Calendar 2019.

Content of this article

1. Extract only the text from nem-related documents
2. Break it down into parts of speech with MeCab and visualize it with WordCloud
3. A plain WordCloud on its own is not that interesting, so add one more twist

Environment

macOS 10.15.1, Python 3.7.4

1. Extract only the text from nem-related documents

So, what nem-related documents best sum up 2019? That's right: the Advent Calendar.

~~This time, I extracted the text from every article in this year's nem Advent Calendar~~ Doing that would only really make sense if I waited until the final day, so for now I used just the first article, @44uk_i3's "Summary of specifications that change between NEM1 and NEM2". (I got permission, of course.)

As usual, let's just paste the source code first.

scrapy.py


import urllib.request
from bs4 import BeautifulSoup

text = []

# URL of the target article
url = 'https://qiita.com/44uk_i3/items/53ad306d2c82df41803f'
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

# On Qiita, the article body is inside <div class="p-items_article">
article = soup.find_all('div', class_='p-items_article')

# Extract only the text from the article body
for i in article:
    text.append(i.text)

# Save the extracted text to nem.txt
file = open('nem.txt', 'w', encoding='utf-8')
file.writelines(text)
file.close()

I don't think there is anything special worth mentioning here. Please leave a comment if anything is unclear.
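If you want a quick sanity check that the scrape worked, a minimal sketch like this (assuming nem.txt was written by the script above) prints how much text was saved and the first few hundred characters:

# Quick sanity check of the scraped text
with open('nem.txt', encoding='utf-8') as f:
    scraped = f.read()

print(len(scraped))      # how many characters were saved
print(scraped[:200])     # a peek at the beginning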

2. Break it down into parts of speech with MeCab and visualize it with WordCloud

Yes, time to break the text apart. Writing this reminds me of how impressed I was the first time I used MeCab.

MeCab works like this (here parsing the classic tongue twister すもももももももものうち, "plums and peaches are both kinds of peaches"):

$ mecab
すもももももももものうち
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
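The same parser is available from Python via the mecab-python3 bindings, which is what the script below uses. The detail that matters later is that the first comma-separated field of node.feature is the part of speech. A minimal sketch, parsing the same sentence:

import MeCab

mecab = MeCab.Tagger('-Ochasen')
mecab.parse('')

# Walk the morphemes one by one and print surface form + part of speech
node = mecab.parseToNode('すもももももももものうち')
while node:
    print(node.surface, node.feature.split(',')[0])
    node = node.next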

Again, let the source code do the talking.

nem_wordcloud.py


import MeCab
from wordcloud import WordCloud

# Open the saved text
data = open("./nem.txt", "rb").read()
text = data.decode('utf-8')

mecab = MeCab.Tagger('-Ochasen')
mecab.parse('')

# Morphological analysis
node = mecab.parseToNode(text)

# Word list to use for WordCloud
output = []

# Pick out words by part of speech
while node:
    word = node.surface
    hinnsi = node.feature.split(",")[0]
    # Parts of speech to add to the list
    # (MeCab returns Japanese part-of-speech names: verb, adverb, adjective, noun)
    if hinnsi in ["動詞", "副詞", "形容詞", "名詞"]:
        output.append(word)
    node = node.next

text = ' '.join(output)

# Path to a Japanese font (mac)
fpath = "/System/Library/Fonts/Hiragino Mincho ProN.ttc"

# Generate the WordCloud. Specify the background color
wc = WordCloud(
    background_color="white",
    font_path=fpath,
    width=800,
    height=600).generate(text)

# Save as png
wc.to_file("./wc.png")

Change the font to whatever you like. The macOS system fonts live in /System/Library/Fonts/. Windows users, please google where your fonts live.
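If you are not sure which font files you have, a quick sketch like this lists candidates (macOS path assumed; point it at your own font directory on other systems):

import glob

# List .ttc font files in the macOS system font directory
for path in sorted(glob.glob('/System/Library/Fonts/*.ttc')):
    print(path)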

What we have so far

nem.png

Something like this came out.

3. Add one more twist

There are already plenty of Qiita articles that stop at this point, so let's add one more twist.

Spoilers first: here is the source code.

nem_wordcloud_2.py


import MeCab
import numpy as np
from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image

# Open the saved text
data = open("./nem.txt", "rb").read()
text = data.decode('utf-8')

mecab = MeCab.Tagger('-Ochasen')
mecab.parse('')

# Morphological analysis
node = mecab.parseToNode(text)

# List of words to use for WordCloud
output = []

# Pick out words by part of speech
while node:
    word = node.surface
    hinnsi = node.feature.split(",")[0]
    # Parts of speech to add to the list
    # (MeCab returns Japanese part-of-speech names: verb, adverb, adjective, noun)
    if hinnsi in ["動詞", "副詞", "形容詞", "名詞"]:
        output.append(word)
    node = node.next

# Path to a Japanese font (mac)
fpath = "/System/Library/Fonts/Hiragino Mincho ProN.ttc"

text = ' '.join(output)

# Load the mask image
imagepath = "./nem_icon_black.png"
img_color = np.array(Image.open(imagepath))

wc = WordCloud(
    width=800,
    height=800,
    font_path=fpath,
    mask=img_color,
    background_color="white",
    collocations=False).generate(text)

wc.to_file("./wc_nem.png")

First, I prepared this image: nem_icon_black.png

It's the icon you all know, filled in black. This is the ./nem_icon_black.png specified in imagepath.
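For reference, here is one way you could produce that kind of black-on-white silhouette with PIL, assuming the source icon has a transparent background (the file names are just examples, not the ones I actually used):

from PIL import Image

# Turn an icon with transparency into a black silhouette on a white background
icon = Image.open('nem_icon.png').convert('RGBA')   # example input file
white = Image.new('RGBA', icon.size, (255, 255, 255, 255))
black = Image.new('RGBA', icon.size, (0, 0, 0, 255))

# Use the icon's alpha channel as the shape: black where the icon is opaque, white elsewhere
silhouette = Image.composite(black, white, icon.split()[3])
silhouette.convert('RGB').save('nem_icon_black.png')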

So, here is the image that is created when you execute this code.

The finished product

wc_nem.png

Better than I expected.

Summary

With a bit more data, it looks like you could really analyze what mattered to nem in 2019.
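As a sketch of what "more data" could look like, the scraping step from section 1 generalizes straightforwardly to a list of article URLs (the list below is a placeholder; fill in the actual Advent Calendar articles):

import urllib.request
from bs4 import BeautifulSoup

# Placeholder list: fill in the real Advent Calendar article URLs
urls = [
    'https://qiita.com/44uk_i3/items/53ad306d2c82df41803f',
    # 'https://qiita.com/...',
]

text = []
for url in urls:
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, 'html.parser')
    for article in soup.find_all('div', class_='p-items_article'):
        text.append(article.text)

with open('nem_all.txt', 'w', encoding='utf-8') as f:   # example output file
    f.writelines(text)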

Bonus

By swapping the source image or tweaking the settings, you can also make a version like this: wc_symbol.png
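For example, the ImageColorGenerator imported (but unused) in nem_wordcloud_2.py can recolor the words using the mask image itself, which works nicely when the source image is colored rather than plain black. A minimal sketch, reusing wc and img_color from that script:

from wordcloud import ImageColorGenerator

# Take the colors from the mask image and apply them to the generated words
image_colors = ImageColorGenerator(img_color)
wc.recolor(color_func=image_colors)
wc.to_file('./wc_nem_colored.png')   # example output file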
