[PYTHON] [Natural language processing] I tried to visualize the remarks of each member in the Slack community

About this article

In this article, I'll show you how to visualize what each member says in the Slack community with Wordcloud.

The source code can be found here: https://github.com/sota0121/slack-msg-analysis :octocat:

See also: [Natural language processing] I tried to visualize the hot topics this week in the Slack community

Table of contents

  1. Usage and output example
  2. Get messages from Slack
  3. Preprocessing: Table creation / cleaning / morphological analysis / normalization / stopword removal
  4. Preprocessing: Important phrase extraction (tf-idf)
  5. Visualization process with Wordcloud
  6. Bonus

* I would like to summarize the preprocessing in a separate article in the future

1. Usage and output example

1.1. How to use

For more information, see Getting Started in the README. The flow is as follows (the commands are collected after the list).

  1. Build a virtual environment with docker-compose up -d
  2. Enter the shell with docker exec -it ds-py3 bash
  3. Run run_wordcloud_by_user.sh
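Putting the three steps together (a sketch; the container name ds-py3 and the script name come from the repository, and the exact invocation may differ from the README):

docker-compose up -d          # build and start the analysis environment
docker exec -it ds-py3 bash   # enter the container shell
./run_wordcloud_by_user.sh    # generate the per-member Wordclouds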

1.2. Output example

This is an example of the actual output. Each Wordcloud is built from a different member's remarks.

(anim_.gif: animation of the Wordcloud output for several members)

2. Get messages from Slack

See this article.

[[Natural language processing] I tried to visualize the hot topics this week in the Slack community - 2. Get messages from Slack](https://qiita.com/masso/items/41630aa02f1fd6cfa0a4#2-slack%E3%81%8B%E3%82%89%E3%83%A1%E3%83%83%E3%82%BB%E3%83%BC%E3%82%B8%E3%82%92%E5%8F%96%E5%BE%97)

3. Preprocessing: Table creation / cleaning / morphological analysis / normalization / stopword removal

The content is the same as in that article, so it is omitted here. Please refer to the links below for details.

- [Preprocessing: Create message mart table](https://qiita.com/masso/items/41630aa02f1fd6cfa0a4#3-%E5%89%8D%E5%87%A6%E7%90%86%E3%83%A1%E3%83%83%E3%82%BB%E3%83%BC%E3%82%B8%E3%83%9E%E3%83%BC%E3%83%88%E3%83%86%E3%83%BC%E3%83%96%E3%83%AB%E4%BD%9C%E6%88%90)
- [Preprocessing: Cleaning](https://qiita.com/masso/items/41630aa02f1fd6cfa0a4#4-%E5%89%8D%E5%87%A6%E7%90%86%E3%82%AF%E3%83%AA%E3%83%BC%E3%83%8B%E3%83%B3%E3%82%B0)
- [Preprocessing: Morphological analysis (Janome)](https://qiita.com/masso/items/41630aa02f1fd6cfa0a4#5-%E5%89%8D%E5%87%A6%E7%90%86%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90janome)
- [Preprocessing: Normalization](https://qiita.com/masso/items/41630aa02f1fd6cfa0a4#6-%E5%89%8D%E5%87%A6%E7%90%86%E6%AD%A3%E8%A6%8F%E5%8C%96)
- [Preprocessing: Stopword removal](https://qiita.com/masso/items/41630aa02f1fd6cfa0a4#7-%E5%89%8D%E5%87%A6%E7%90%86%E3%82%B9%E3%83%88%E3%83%83%E3%83%97%E3%83%AF%E3%83%BC%E3%83%89%E9%99%A4%E5%8E%BB)

4. Preprocessing: Important phrase extraction (tf-idf)

4.1. What is tf-idf?

tf-idf can be thought of as an index that scores each word in a document from the viewpoint of "how important is this word for understanding the context of the document?"

For details, please refer to this article.
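As a rough sketch of the idea, this is the smoothed formulation that scikit-learn's TfidfVectorizer (used below) applies by default, after which each document vector is L2-normalized:

\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \left( \ln \frac{1 + N}{1 + \mathrm{df}(t)} + 1 \right)

where tf(t, d) is the count of term t in document d, N is the number of documents, and df(t) is the number of documents containing t.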

4.2. Implementation of word scoring processing by tf-idf

4.2.1. What should be a document / all documents?

The purpose this time is to see the characteristics of each member's remarks. In other words, I want to know **what characterizes a given member's posts relative to every post in the Slack community**.

Therefore, I calculated tf-idf with the following definitions:

- **All documents**: all posts by all users in all channels so far
- **One document**: all posts by a single member

4.2.2. Implementation

The processing flow is roughly as follows.

  1. Grouping by member who said the message
  2. Calculate tf-idf with one group of messages as one document
  3. Extract words whose tf-idf score is above the threshold (output as a dictionary)

important_word_extraction.py


import pandas as pd
import json
from datetime import datetime, date, timedelta, timezone
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
JST = timezone(timedelta(hours=+9), 'JST')

#Group messages by user
def group_msgs_by_user(df_msgs: pd.DataFrame) -> dict:
    ser_uid = df_msgs.uid
    ser_wktmsg = df_msgs.wakati_msg
    #Get a unique uid list
    ser_uid_unique = df_msgs.drop_duplicates(subset='uid').uid
    #Grouping by uid without duplication
    dict_msgs_by_user = {}
    for uid in ser_uid_unique:
        #Get all wktmsg corresponding to the uid
        extracted = df_msgs.query('uid == @uid')
        # Add the key and value to the output dictionary
        dict_msgs_by_user[uid] = ' '.join(extracted.wakati_msg.dropna().values.tolist())        
    return dict_msgs_by_user

# Extract important words based on the tf-idf score and return them as a dictionary
def extract_important_word_by_key(feature_names: list, bow_df: pd.DataFrame, uids: list) -> dict:
    # > Look at each row and extract the important words (those whose tf-idf score exceeds the threshold)
    dict_important_words_by_user = {}
    for uid, (i, scores) in zip(uids, bow_df.iterrows()):
        #Create a table of the user's words and tfidf scores
        words_score_tbl = pd.DataFrame()
        words_score_tbl['scores'] = scores
        words_score_tbl['words'] = feature_names
        #Sort in descending order by tfidf score
        words_score_tbl = words_score_tbl.sort_values('scores', ascending=False)
        words_score_tbl = words_score_tbl.reset_index()
        # extract : tf-idf score > 0.001
        important_words = words_score_tbl.query('scores > 0.001')
        # Create a dictionary for this user, e.g. 'uid0': {'w0': 0.9, 'w1': 0.87}
        d = {}
        for i, row in important_words.iterrows():
            d[row.words] = row.scores
        # Add to the output only if the user's dictionary has at least one word
        if len(d.keys()) > 0:
            dict_important_words_by_user[uid] = d
    return dict_important_words_by_user

#Extract important words for each user
def extraction_by_user(input_root: str, output_root: str) -> dict:
    # ---------------------------------------------
    # 1. load messages (processed)
    # ---------------------------------------------
    msg_fpath = input_root + '/' + 'messages_cleaned_wakati_norm_rmsw.csv'
    print('load: {0}'.format(msg_fpath))
    df_msgs = pd.read_csv(msg_fpath)
    # ---------------------------------------------
    # 2. group messages by user
    # ---------------------------------------------
    print('group messages by user and save it.')
    msgs_grouped_by_user = group_msgs_by_user(df_msgs)
    msg_grouped_fpath = input_root + '/' + 'messages_grouped_by_user.json'
    with open(msg_grouped_fpath, 'w', encoding='utf-8') as f:
        json.dump(msgs_grouped_by_user, f, ensure_ascii=False, indent=4)
    # ---------------------------------------------
    # 4. calculate tf-idf over all documents
    # ---------------------------------------------
    print('tfidf vectorizing ...')
    # > A matrix is created whose columns are the words appearing in all documents and whose rows are the documents (= users). Each element holds a tf-idf value.
    tfidf_vectorizer = TfidfVectorizer(token_pattern=u'(?u)\\b\\w+\\b')

    bow_vec = tfidf_vectorizer.fit_transform(msgs_grouped_by_user.values())
    bow_array = bow_vec.toarray()
    bow_df = pd.DataFrame(bow_array,
                        index=msgs_grouped_by_user.keys(),
                        columns=tfidf_vectorizer.get_feature_names())
    # ---------------------------------------------
    # 5. extract important words based on tf-idf
    # ---------------------------------------------
    print('extract important words ...')
    d_word_score_by_uid = extract_important_word_by_key(tfidf_vectorizer.get_feature_names(), bow_df, msgs_grouped_by_user.keys())
    # ---------------------------------------------
    # 6. uid =>uname conversion
    # ---------------------------------------------
    print('Convert key of important word group for each user from uid to uname...')
    user_tbl = pd.read_csv('../../data/020_intermediate/users.csv')
    d_word_score_by_uname = {}
    for uid, val in d_word_score_by_uid.items():
        #Search for uname by speaker's uid (may not exist if not active user)
        target = user_tbl.query('uid == @uid')
        if target.shape[0] != 0:
            uname = target.iloc[0]['uname']
        else:
            continue
        print('uname: ', uname, 'type of uname: ', type(uname))
        d_word_score_by_uname[uname] = val
    return d_word_score_by_uname
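A minimal usage sketch of the function above (the directory paths here are hypothetical and depend on how the data folders are laid out; input_root only needs to contain messages_cleaned_wakati_norm_rmsw.csv):

from important_word_extraction import extraction_by_user

# hypothetical directories for the processed messages and the outputs
word_scores_by_user = extraction_by_user(
    input_root='../../data/030_processed',
    output_root='../../data/030_processed')

# the result maps user name -> {word: tf-idf score}
for uname, word_scores in list(word_scores_by_user.items())[:1]:
    print(uname, list(word_scores.items())[:5])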

4.2.3. Output dictionary

In the Wordcloud step explained in the next chapter, you can generate a Wordcloud that scales the display size of each word according to its score by passing in a dictionary of the form {"word": score}.

[Natural language processing] I tried to visualize the hot topics this week in the Slack community

In that article, I output a Wordcloud after grouping the messages by "period".

In this article, the messages are grouped by "member" instead, but **the output dictionary has exactly the same format**.

By doing so, everything **except** the tf-idf scoring step can be handled by the same processing, which keeps the code DRY.

Here is the dictionary that was actually output this time (user names are hidden).

important_word_tfidf_by_user.json


{
    "USER_001": {
        "Participation": 0.1608918987478819,
        "environment": 0.15024077008089046,
        "Good product": 0.1347222699467748,
        "node": 0.1347222699467748,
        "Description": 0.13378417526975775,
        "Cyber security": 0.12422689899152742,
        "r": 0.12354794954617476,
        "Choice": 0.11973696610170319,
        "Replacement": 0.11678031479185731,
        "Last": 0.11632792524420342,
        "Course": 0.11467215023122095,
        "Release": 0.11324407267324783,
        "analysis": 0.11324407267324783,
        "Deadline": 0.11100429535028021,
        "How to write": 0.10628494383660991,
        "Deep learning": 0.10229478898619786,
        :
    },
    "USER_002": {
        "data": 0.170245452132736,
        "Participation": 0.15825283334154341,
        "Course": 0.13785592895847276,
        "Please": 0.1265412327351908,
        "Recruitment": 0.12204781908784276,
        "article": 0.1197561921672133,
        "environment": 0.11083230914864184,
        "Food": 0.1091835225326696,
        "share": 0.10371152197590257,
        "corona": 0.10081254351124691,
        "Reading in a circle": 0.10025885742434383,
        "Planning": 0.09899869065055528,
        "development of": 0.09571338092513401,
        "Target": 0.09253887576557392,
        "jobs": 0.09094257214685446,
        "project": 0.08910924912513929,
        "information": 0.08772258523428605,
        "language": 0.08636683271048684,
        "channel": 0.08295159680178281,
        "release": 0.0818876418995022,
        "youtube": 0.07956948308804826,
        "team": 0.07956948308804826,
        "Basic": 0.07444492553072463,
        :
    },
    :
}

5. Visualization process with Wordcloud

Please refer to this article.

[[Natural language processing] I tried to visualize the hot topics this week in the Slack community - 9. Visualization processing with Wordcloud](https://qiita.com/masso/items/41630aa02f1fd6cfa0a4#9-wordcloud%E3%81%A7%E5%8F%AF%E8%A6%96%E5%8C%96%E5%87%A6%E7%90%86)
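For completeness, here is a minimal sketch of that step: the wordcloud package can build an image directly from a {"word": score} dictionary via generate_from_frequencies. The font path, example scores, and output file name below are assumptions (a Japanese-capable font is needed to render Japanese words).

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# e.g. one user's entry from the JSON shown above
scores = {"Participation": 0.16, "environment": 0.15, "node": 0.13}

wc = WordCloud(
    font_path='/usr/share/fonts/ipaexg.ttf',  # assumption: a Japanese font inside the container
    background_color='white',
    width=800,
    height=600)
wc.generate_from_frequencies(scores)

plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.savefig('wordcloud_user001.png')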

6. Bonus

Articles that I especially referred to

- Types of preprocessing in natural language processing and their power | Qiita

Other reference materials (a large number) are summarized here :octocat:.

Promotion

This time, I am using data from the Slack community called the Data Learning Guild. The Data Learning Guild is an online community of data analytics talent. If you are interested, please check the links below.

Data Learning Guild Official Homepage

Data Learning Guild 2019 Advent Calendar
