[PYTHON] I tried to visualize bookmarks flying to Slack with Doc2Vec and PCA

This is the article on the 11th day of the Advent calendar.

What this is

I obtained distributed representations of the article titles bookmarked by our group (4 people) and visualized them.

When someone bookmarks an article, IFTTT picks it up and posts it to Slack, so the data is processed from there.

Prerequisites

- Environment: link to Dockerfile

References / tools used

  1. [Job change meeting] Try to classify companies by processing word-of-mouth in natural language with word2vec
  2. [word2vec] Let's visualize the result of natural language processing of company reviews
  3. [Japanese preprocessing memorandum by python](https://datumstudio.jp/blog/python%E3%81%AB%E3%82%88%E3%82%8B%E6%97%A5%E6%9C%AC%E8%AA%9E%E5%89%8D%E5%87%A6%E7%90%86%E5%82%99%E5%BF%98%E9%8C%B2)
  4. Easy backup of slack chat log
  5. models.doc2vec(gensim)
  6. slack-dump
  7. slack api

What you can do

Predictions beforehand

- R-kun
  - Mostly gadgets and security
  - 79 bookmarks
- Mr. Y
  - The widest range of the four
  - Actually the only user who posts hand-picked articles for the purpose of sharing with everyone
  - 864 bookmarks
- M-kun
  - Web and machine learning, etc.
  - 240 bookmarks
- S (me)
  - Besides the Web, machine learning, and gadgets, I also throw in things like "no saury catch this year"
  - 896 bookmarks

Result

The spread and overlap are intuitively close to my expectations.

スクリーンショット 2019-12-11 19.16.44.png

Preparation

Mechanism for letting IFTTT post Hatena bookmarks to Slack

Procedure

Details are omitted, but the mechanism is built by following the flow shown in the figure below. Between the 4th and 5th steps you have to enter the URL of the RSS feed to receive; for Hatena Bookmark this is http://b.hatena.ne.jp/<username>/rss

Like this

Reason

Couldn't we just favorite each other on Hatena?

That's not a bad option (in fact you can do both), but in Slack we can casually comment on articles like this:

IMG_52CF406E5EED-1.jpeg
Couldn't we use Slack's built-in /feed command?

With IFTTT the post format can be customized, which makes projects like this one possible; posts from the /feed command also take up a lot of screen space, which is a drawback.

Get posted messages from Slack

There seem to be two easy ways to do this (a sketch of the first appears below):

  1. Slack API
    • https://api.slack.com/methods/channels.history
  2. Go tool (used this time)
    • https://github.com/joefitzgerald/slack-dump (binaries: https://github.com/PyYoshi/slack-dump/releases)

Either way you need a token, so get one first.
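For reference, a minimal sketch of what option 1 might look like, assuming the requests library and the pre-2020 channels.history endpoint (conversations.history is its modern replacement); the function name is mine:

python


import requests

# Page backwards through a channel's history until nothing older is left
def fetch_history(token, channel):
    messages, latest = [], None
    while True:
        params = {'token': token, 'channel': channel, 'count': 1000}
        if latest:
            params['latest'] = latest  # only fetch messages older than this ts
        resp = requests.get('https://slack.com/api/channels.history', params=params).json()
        messages += resp['messages']
        if not resp.get('has_more'):
            return messages
        latest = messages[-1]['ts']

This time, though, I went with option 2, slack-dump: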

$ wget https://github.com/PyYoshi/slack-dump/releases/download/v1.1.3/slack-dump-v1.1.3-linux-386.tar.gz
$ tar -zxvf slack-dump-v1.1.3-linux-386.tar.gz
$ linux-386/slack-dump -t=<token> <channel>

The dump also picks up DMs, which get in the way, so extract just the channel we need to a separate directory.

python


import zipfile, os

# Extract only the target channel's JSON files from the dump archive
os.mkdir('dumps')
with zipfile.ZipFile('./zipfile_name') as z:
    for n in z.namelist():
        if 'channel_name' in n:
            z.extract(n, './dumps')

Open the files and collect their contents. The dump is split into one JSON file per day, so merge everything into a single list.

python


import json, glob

posts = []
# One JSON file per day; concatenate them all
files = glob.glob('./dumps/channel/<channel_name>/*.json')
for file in files:
    with open(file) as f:
        posts += json.load(f)

Extract the messages and associate each article title with its user name (the details here depend on your IFTTT settings).

python


user_post_dic = {
    'Y': [],
    'S': [],
    'M': [],
    'R': [],
}

for p in posts:
    if "username" not in p or p["username"] != "IFTTT":
        continue
    for a in p["attachments"]:
        # crude way to skip anything that doesn't match the expected shape
        try:
            user_post_dic[a["text"]].append(a["title"])
        except (KeyError, TypeError):
            pass

users = user_post_dic.keys()
print([[u, len(user_post_dic[u])] for u in users])

output


[['Y', 864], ['S', 896], ['M', 240], ['R', 79]]

The main part

Preprocessing

Cleaning and tokenization

The posted messages look like the following; the site name and URL parts are unnecessary, so delete them.

Use Neovim in your browser's text area<http://Developers.IO|Developers.IO>

Security measures for front-end engineers/ #frontkansai 2019 - Speaker Deck

Japanese with matplotlib

Reintroduction to Modern JavaScript/ Re-introduction to Modern JavaScript - Speaker Deck

I don't really know my way around re, so I brute-forced it. MeCab is used for tokenization; the environment also has SudachiPy and others installed, but the tool you know best is the fastest to work with.

python


import MeCab, re
m = MeCab.Tagger("-Owakati")

# Patterns for the junk that Slack/IFTTT wraps around titles
_tag = re.compile(r'<.*?>')
_url = re.compile(r'(http|https)://([-\w]+\.)+[-\w]+(/[-\w./?%&=]*)?')
_title = re.compile(r'( - ).*$')
_par = re.compile(r'\(.*?\)')
_sla = re.compile(r'/.*$')
_qt = re.compile(r'"')
_sep = re.compile(r'\|.*$')
_twi = re.compile(r'(.*)on Twitter: ')
_lab = re.compile(r'(.*) ⇒ \(')
_last_par = re.compile(r'\)$')

def clean_text(text):
    # Normalize full-width ASCII characters to half-width first
    text = text.translate(str.maketrans({chr(0xFF01 + i): chr(0x21 + i) for i in range(94)}))
    text = re.sub(_lab, '', text)
    text = re.sub(_tag, '', text)
    text = re.sub(_url, '', text)
    text = re.sub(_title, '', text)
    text = re.sub(_sla, '', text)
    text = re.sub(_qt, '', text)
    text = re.sub(_sep, '', text)
    text = re.sub(_twi, '', text)
    text = re.sub(_par, '', text)
    text = re.sub(_last_par, '', text)
    return text

p_all = []
m_all = []
for u in users:
    user_post_dic[u] = list(map(clean_text, user_post_dic[u]))
    # Tokenized title as a list of words (MeCab wakati output)
    m_all += [m.parse(p).split() for p in user_post_dic[u]]
    p_all += [u + '**' + p for p in user_post_dic[u]]

The user name is prepended to each element of p_all because some titles become empty strings during preprocessing and the list indices would otherwise drift, so user and title are tied together in this painful way. (A title disappears when, for example, someone bookmarks a page whose title is just its URL.)
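Incidentally, a less painful alternative (just a sketch, not what I actually ran) would be to keep each user and title together in a tuple, so indices can never drift and empty titles can simply be dropped:

python


# Hypothetical alternative: carry (user, title) pairs together and
# drop pairs whose title was cleaned down to an empty string
pairs = [(u, t) for u in users for t in user_post_dic[u] if t]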

In any case, the titles are now clean:

Use Neovim in your browser's text area
 
Security measures for front-end engineers
 
Japanese with matplotlib

Reintroduction to Modern JavaScript

Doc2Vec

m_all holds the text bodies from which the distributed representations are learned; p_all holds only the labels.

The parameters were not tuned with any enthusiasm.

python


from gensim import models

# Reference article: http://qiita.com/okappy/items/32a7ba7eddf8203c9fa1
class LabeledListSentence(object):
    def __init__(self, words_list, labels):
        self.words_list = words_list
        self.labels = labels

    def __iter__(self):
        for i, words in enumerate(self.words_list):
            yield models.doc2vec.TaggedDocument(words, ['%s' % self.labels[i]])

sentences = LabeledListSentence(m_all, p_all)
model = models.Doc2Vec(
    alpha=0.025,
    min_count=5,
    vector_size=100,
    epochs=20,
    workers=4
)
# Build the vocabulary from the sentences we have
model.build_vocab(sentences)
model.train(
    sentences,
    total_examples=len(m_all),
    epochs=model.epochs
)

# Fetch the tags back, since their order can differ from the input order
tags = model.docvecs.offset2doctag
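As a quick sanity check of the learned vectors (an extra step, not part of the original flow), you can ask the model which titles are closest to a given one:

python


# The 3 titles most similar to the first document, by cosine similarity
print(model.docvecs.most_similar(tags[0], topn=3))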

PCA and drawing

This is my first time using the PCA library; it took real effort to learn PCA itself, so it's amazing that applying it takes only two lines.

python


from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import japanize_matplotlib

vecs = [model.docvecs[p] for p in tags]

# Untie the labels again
tag_users = [p.split('**')[0] for p in tags]
tag_docs = [p.split('**')[1] for p in tags]

# It was hard to find four colors of similar weight
cols = ["#0072c2", "#fc6993", "#ffaa1c", "#8bd276"]

# Forcibly written in one line
clusters = [cols[0] if u == tag_users[0] else cols[1] if u == tag_users[1] else cols[2] if u == tag_users[2] else cols[3] for u in tag_users]

# 2 components, since we draw on a plane
pca = PCA(n_components=2)
coords = pca.fit_transform(vecs)

fig, ax = plt.subplots(figsize=(16, 12))
x = [v[0] for v in coords]
y = [v[1] for v in coords]

# Plot per user so that each one gets its own legend entry
for i, u in enumerate(set(tag_users)):
    x_of_u = [v for j, v in enumerate(x) if tag_users[j] == u]
    y_of_u = [v for j, v in enumerate(y) if tag_users[j] == u]
    ax.scatter(
        x_of_u,
        y_of_u,
        label=u,
        c=cols[i],
        s=30,
        alpha=1,
        linewidth=0.2,
        edgecolors='#777777'
    )

plt.legend(
    loc='upper right',
    prop={'size': 18}
)
plt.show()
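Before trusting the 2-D picture, it can also be worth checking how much of the variance the two components actually capture (an extra check, not in the original code):

python


# Fraction of total variance explained by each of the two components
print(pca.explained_variance_ratio_)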

What came out (repost)

Predictions beforehand

- R-kun
  - Mostly gadgets and security
  - 79 bookmarks
- Mr. Y
  - The widest range of the four
  - Actually the only user who posts hand-picked articles for the purpose of sharing with everyone
  - 864 bookmarks
- M-kun
  - Web and machine learning, etc.
  - 240 bookmarks
- S (me)
  - Besides the Web, machine learning, and gadgets, I also throw in things like "no saury catch this year"
  - 896 bookmarks

Result

The spread and overlap are intuitively close to my expectations.

スクリーンショット 2019-12-11 19.16.44.png

The end

There are many duplicate bookmarks between us to begin with, so the clusters were never going to separate cleanly. If the data grows a bit more, I would like to run inference on users and build recommendations.

Sorry for being late (12/11 21:00)
