I tried to summarize everyone's remarks on Slack with wordcloud (Python)

This is the 12/8 article for jsys19AdventCalender (https://adventar.org/calendars/4301).

Introduction

This is the first time I have published code along with my writing. Both the text and the code may be rough, but I would appreciate it if you could read on patiently and tell me whenever you think "this would be a better way to do it!".

I analyzed and summarized everyone's remarks on Slack

This is a bit sudden, but do you know what a word cloud is?

A technique that picks out the words that appear frequently in a text and displays each one at a size that reflects its frequency. It refers to automatically arranging words that appear often on web pages, blogs, and the like. By varying not only the size of the characters but also their color, font, and orientation, the content of the text can be conveyed at a glance. https://kotobank.jp/word/%E3%83%AF%E3%83%BC%E3%83%89%E3%82%AF%E3%83%A9%E3%82%A6%E3%83%89-674221
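The core idea in this definition, sizing each word by how often it appears, can be sketched in a few lines. This is only an illustration of the principle and not how the wordcloud library computes sizes; the function name font_sizes and the size range are made up for this example.

```python
from collections import Counter


def font_sizes(text, min_size=10, max_size=60):
    """Map each space-separated word to a font size that grows with its count."""
    counts = Counter(text.split())
    if not counts:
        return {}
    top = max(counts.values())
    # The most frequent word gets max_size; the others scale down linearly
    return {
        word: min_size + (max_size - min_size) * count // top
        for word, count in counts.items()
    }


sizes = font_sizes("slack python slack wordcloud slack python")
print(sizes)
```

The most frequent word ends up at the largest size, which is the effect you see in a finished word cloud.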

That is the idea; an actual word cloud looks like the image below.

wc1.png

This is a word cloud made from typescript-eslint's GitHub page.

I had come across this way of presenting words on the net before and found it interesting, so I thought "wouldn't it be fun to do this with our Slack log?" and wrote this article.

Making the text to pass to wordcloud

wordcloud can only take text whose words are separated by spaces. Everyone's remarks are not written that way (Japanese has no spaces between words), so I use MeCab to segment them into words. Before that, I first gather all the remarks into one place.

First, get an archive of everyone's remarks on Slack from the director, who owns the workspace, and extract the sentences from it. When you open the archive, there is a folder for each channel, and the files inside store each remark in JSON format together with information such as its sender and reactions. (At this point, it is easier if you delete the folders of channels where bots do most of the talking.)

ex-2020-6-31.json


[
    {
        "client_msg_id": "hoge",
        "type": "message",
        "text": "I became a hatachi",
        "user": "hogee",
        "ts": "hooge",
        "team": "foo",
        "user_team": "foo",
        "source_team": "foo",
        "user_profile": {
            "avatar_hash": "bar",
            "image_72": "https:\/\/avatars.slack-edge.com\/ore.png",
            "first_name": "Murakami",
            "real_name": "Murakami ore",
            "display_name": "Murakami",
            "team": "piyo",
            "name": "s31051315",
            "is_restricted": false,
            "is_ultra_restricted": false
        }
    }
]
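To make the structure concrete, here is a minimal sketch that pulls the text field out of a list shaped like the sample above. The sample data is invented, and the subtype check is my own assumption rather than something the article's code does: Slack attaches a subtype key to non-ordinary messages (for example channel_join), and skipping them is one optional way to reduce noise, at the cost of also dropping any user message that carries a subtype.

```python
import json

# Invented sample in the same shape as one exported day file
sample = """
[
    {"type": "message", "text": "I became a hatachi", "user": "hogee"},
    {"type": "message", "subtype": "channel_join",
     "text": "<@hogee> has joined the channel", "user": "hogee"},
    {"type": "message", "user": "hogee"}
]
"""


def extract_texts(messages):
    """Return the text of every message that has no subtype and a non-empty text."""
    texts = []
    for message in messages:
        if "subtype" in message:      # assumption: subtype marks non-ordinary messages
            continue
        text = message.get("text")    # tolerate entries without a text key
        if text:
            texts.append(text)
    return texts


texts = extract_texts(json.loads(sample))
print(texts)
```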

Below is the code that scans every JSON file in the archive folder and collects the contents of each text property, which holds the remark itself, into a single variable.


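from pathlib import Path
import json
import re

main_text = ""

json_path=Path("src/jsys_archive")
# Recursively collect every JSON file under the archive folder
dirs=list(json_path.glob("**/*.json"))
for i in dirs:
    # Open as UTF-8 and close the file as soon as it has been read
    with open(i, encoding="utf-8") as json_open:
        json_text = json.load(json_open)
    json_dicts = len(json_text)
    for j in range(json_dicts):
        # Strip <...> and :...: noise before appending the remark
        json_text_fixed = re.sub("<.*?>|:.*?:","",json_text[j]["text"])
        main_text += json_text_fixed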

I pass the path of the folder I want to scan to Path() to make a Path object, and then pass "**/*.json" to glob() to search recursively for every JSON file under it.

json_path=Path("src/jsys_archive")
dirs=list(json_path.glob("**/*.json"))
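As a hedged sketch of how the recursive glob behaves, the example below builds a throwaway folder tree and lists the JSON files in it. The channel name bot-log, the helper find_message_files, and the idea of filtering in code instead of deleting folders by hand are all my own assumptions. As far as I know, a Slack export also places files such as users.json at the top level, which "**/*.json" matches as well, so the sketch skips those too.

```python
import json
import tempfile
from pathlib import Path

SKIP_CHANNELS = {"bot-log"}  # hypothetical channel folders to ignore


def find_message_files(archive_dir, skip=SKIP_CHANNELS):
    """Return per-channel JSON files, skipping top-level files and unwanted channels."""
    root = Path(archive_dir)
    return sorted(
        p for p in root.glob("**/*.json")
        if p.parent != root and p.parent.name not in skip
    )


# Build a throwaway archive so the sketch runs without a real export
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "users.json").write_text("[]", encoding="utf-8")
    for channel in ("general", "bot-log"):
        (root / channel).mkdir()
        (root / channel / "2019-12-08.json").write_text(json.dumps([]), encoding="utf-8")
    found = [p.relative_to(root).as_posix() for p in find_message_files(root)]

print(found)
```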

Everyone's remarks also contain noise that is not plain text: data that Slack handles internally and mention information, enclosed in <>, and reaction information, enclosed in ::. If these are left in, the resulting word cloud is filled with system messages, so I strip them out with a regular expression.

json_text_fixed = re.sub("<.*?>|:.*?:","",json_text[j]["text"])
# Remove <...> and :...: together with the text inside them
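Here is a small sketch of what this pattern does on sample strings (the strings are invented). One thing worth noticing is that :.*?: also removes ordinary text that happens to sit between two colons, such as clock times. The second pattern, STRICT, is my own assumption of a narrower alternative that only matches emoji-style names made of lowercase letters, digits, underscores, plus, and hyphen; it would miss custom emoji with other characters.

```python
import re

NOISE = re.compile(r"<.*?>|:.*?:")              # the article's pattern
STRICT = re.compile(r"<[^>]*>|:[a-z0-9_+-]+:")  # assumed narrower alternative


def clean(text, pattern=NOISE):
    """Delete every span matched by pattern from text."""
    return pattern.sub("", text)


mention = clean("<@U012ABC> thanks :pray:")
time_loose = clean("meet at 10:30 or 11:00")
time_strict = clean("meet at 10:30 or 11:00", STRICT)

print(repr(mention))
print(repr(time_loose))
print(repr(time_strict))
```

For a word cloud built from nouns the lost text is probably harmless, but it is worth knowing if the cleaned text is reused elsewhere.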

Now all the remarks are gathered in the (huge) variable main_text. What remains is to hand it to MeCab.

As noted earlier, wordcloud only accepts space-separated text, and the remarks are not written that way, so I use MeCab to segment them into words.

Here is the code.

import MeCab
words = MeCab.Tagger("-Owakati")
nodes = words.parseToNode(main_text)
s = []
while nodes:
    # feature starts with the part of speech; "名詞" is MeCab's label for "noun"
    if nodes.feature.split(",")[0] == "名詞":
        s.append(nodes.surface)
    nodes = nodes.next
To do this, pass "-Owakati" to MeCab.Tagger() so that it performs word segmentation (wakachi-gaki). The Tagger object can mainly take the following four arguments.

1. "mecabrc" (no arguments)
2. "-Ochasen" (ChaSen-compatible format)
3. "-Owakati" (word-segmented output) ←
4. "-Oyomi" (reading output)

This time I use number 3, word segmentation. ~~(MeCab's Japanese-sounding argument names are fun, but nobody actually says "wakati".)~~

Next, the Node object returned by (Tagger instance).parseToNode("string") has two properties, .surface and .feature. surface holds the node's character string, and feature holds [part of speech, part-of-speech subcategory 1, part-of-speech subcategory 2, part-of-speech subcategory 3, conjugation type, conjugation form, base form, reading, pronunciation] as a single comma-separated string. Below is an example program.

feature_example



import MeCab
mecab = MeCab.Tagger()
# "情報メディアシステム局" = "Information Media System Bureau"
nodes = mecab.parseToNode("情報メディアシステム局")
while nodes:
    print(nodes.feature)
    nodes = nodes.next

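↓ Execution result (labels and words are shown translated into English here; MeCab itself prints them in Japanese, so the part of speech "noun" appears as 名詞)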

noun,General,*,*,*,*,information,Jouhou,Joe Ho
noun,General,*,*,*,*,media,media,media
noun,General,*,*,*,*,system,system,system
noun,suffix,General,*,*,*,Station,Kyoku,Kyoku
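Since feature is a single comma-separated string, it can help to split it into named fields. The sketch below does that without needing MeCab installed. The field names are my own labels for the nine items listed above, and the sample string is an assumed untranslated form of the first result line. I believe words missing from the dictionary can come back with fewer fields, so the sketch pads with "*".

```python
from collections import namedtuple

# Field names are illustrative labels for the nine items described in the text
Feature = namedtuple(
    "Feature",
    "pos pos_detail1 pos_detail2 pos_detail3 "
    "conj_type conj_form base_form reading pronunciation",
)


def parse_feature(feature):
    """Split a MeCab feature string into named fields, padding missing ones with '*'."""
    size = len(Feature._fields)
    parts = feature.split(",")
    parts += ["*"] * (size - len(parts))
    return Feature(*parts[:size])


f = parse_feature("名詞,一般,*,*,*,*,情報,ジョウホウ,ジョーホー")
print(f.pos, f.base_form, f.reading)
```

With named fields, the noun check becomes a comparison of f.pos against "名詞" instead of slicing the raw string.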

Only nouns need to appear in the figure, so I let only the nouns through with an if statement and append their character strings to an empty list prepared beforehand. The finished list is then joined into a single string separated by half-width spaces, and the preparation is finally complete.

s = []
while nodes:
    if nodes.feature.split(",")[0] == "名詞":  # "名詞" = noun
        s.append(nodes.surface)
    nodes = nodes.next
parsed_main_text = " ".join(s)
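Before drawing anything, it can be useful to look at the most frequent words in the space-separated string, for example to decide which words belong in the stopword list used later. The following is a minimal sketch with an invented sample string; top_words is a hypothetical helper, not part of wordcloud or MeCab.

```python
from collections import Counter


def top_words(space_separated_text, n=10, stopwords=frozenset()):
    """Return the n most frequent words, ignoring anything in stopwords."""
    words = [w for w in space_separated_text.split() if w not in stopwords]
    return Counter(words).most_common(n)


sample = "request okay jsys request thing okay request thing thing thing"
before = top_words(sample, n=3)
after = top_words(sample, n=3, stopwords={"thing"})
print(before)
print(after)
```

In real use you would pass parsed_main_text instead of the sample string.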

Image output with wordcloud

Finally, the image can be made. Create a WordCloud object with wc = WordCloud(), passing in the various settings. Parameters such as width, height, and background_color are named just as you would expect and are easy to understand. There are many others, such as collocations, which controls whether two-word phrases are counted (setting it to False avoids the same word appearing repeatedly), and stopwords, which lists the words you do not want to appear, but this time I use only the ones shown here. The mask, which determines the shape of the output image, is described below.

import numpy
from PIL import Image
from wordcloud import WordCloud

# Load the mask image as an array; it decides where words may be drawn
mask_jsys = numpy.array(Image.open("jsys.jpeg"))
wc = WordCloud(width=1200, height=800,  # ignored when a mask is given
                background_color="black",
                collocations = False,
                mask=mask_jsys,
                stopwords={"thing","this","For","It","By the way",
                          "Yo","From","Mr.","but","so"},
                # A font with Japanese glyphs is needed (macOS path)
                font_path="/System/Library/Fonts/ヒラギノ角ゴシック W6.ttc")

The first line after the imports loads the image that determines the shape of the output. This time I used the image below. The font for the lettering is a matter of taste; I used Impact.

jsys.jpeg

With this, the words of the word cloud are placed only inside the "jsys" lettering of this image.
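As far as I understand the wordcloud documentation, pure white (255) pixels in the mask are treated as masked out, and every other pixel is free for drawing. The sketch below illustrates that rule on a tiny hand-made grayscale grid using plain lists; the real mask is a NumPy array loaded from jsys.jpeg, and the grid and function name here are invented.

```python
WHITE = 255


def drawable_cells(mask):
    """Return (row, col) positions where words may be placed (non-white pixels)."""
    return [
        (r, c)
        for r, row in enumerate(mask)
        for c, value in enumerate(row)
        if value != WHITE
    ]


# 3x4 grayscale grid: 0 is black lettering, 255 is white background
tiny_mask = [
    [255,   0,   0, 255],
    [255,   0, 255, 255],
    [255, 255, 255, 255],
]
cells = drawable_cells(tiny_mask)
print(cells)
```

This is why a mask image with dark lettering on a white background makes the words fill only the lettering.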

Pass the parsed_main_text created earlier to wc.generate() to generate the image, and save it with wc.to_file("filename").

wc.generate(parsed_main_text)
wc.to_file('jsys_wordcloud.png')

With that, it is finally complete. It was a long road...

Complete!

これ.png

Not bad, right? (Self-praise.) There are surely some words that make you think "did we really say that?", but there are also words that feel just like us. Personally, I find it interesting that "request" and "okay" come out so big. I'm also glad that the group name jsys appeared.

Referenced web pages

https://oku.edu.mie-u.ac.jp/~okumura/python/wordcloud.html
https://qiita.com/sea_ship/items/7c8811b5cf37d700adc4
https://www.pynote.info/entry/python-wordcloud#%E3%83%9E%E3%82%B9%E3%82%AF%E3%82%92%E4%BD%BF%E7%94%A8%E3%81%99%E3%82%8B
https://takaxtech.com/2018/11/03/article271/
https://qiita.com/amowwee/items/e63b3610ea750f7dba1b
