In this article, I'll show you how to use WordCloud to visualize what each member talks about in a Slack community.
The source code is available here: [https://github.com/sota0121/slack-msg-analysis](https://github.com/sota0121/slack-msg-analysis)
See also: [Natural language processing] I tried to visualize the hot topics this week in the Slack community
* I would like to summarize the preprocessing in a separate article in the future
For more information, see the Getting Started section of the README. The flow looks like this:
docker-compose up -d
docker exec -it ds-py3 bash
run_wordcloud_by_user.sh
Here is an example of the actual output. Each WordCloud is generated from a different member's posts.
See this article.
[[Natural language processing] I tried to visualize the hot topics this week in the Slack community - 2. Get the message from Slack](https://qiita.com/masso/items/41630aa02f1fd6cfa0a4#2-slack%E3%81%8B%E3%82%89%E3%83%A1%E3%83%83%E3%82%BB%E3%83%BC%E3%82%B8%E3%82%92%E5%8F%96%E5%BE%97)
The content is the same as in that article, so it is omitted here. Please refer to the links below for details.
- [Preprocessing: Create message mart table](https://qiita.com/masso/items/41630aa02f1fd6cfa0a4#3-%E5%89%8D%E5%87%A6%E7%90%86%E3%83%A1%E3%83%83%E3%82%BB%E3%83%BC%E3%82%B8%E3%83%9E%E3%83%BC%E3%83%88%E3%83%86%E3%83%BC%E3%83%96%E3%83%AB%E4%BD%9C%E6%88%90)
- [Preprocessing: Cleaning](https://qiita.com/masso/items/41630aa02f1fd6cfa0a4#4-%E5%89%8D%E5%87%A6%E7%90%86%E3%82%AF%E3%83%AA%E3%83%BC%E3%83%8B%E3%83%B3%E3%82%B0)
- [Preprocessing: Morphological analysis (Janome)](https://qiita.com/masso/items/41630aa02f1fd6cfa0a4#5-%E5%89%8D%E5%87%A6%E7%90%86%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90janome)
- [Preprocessing: Normalization](https://qiita.com/masso/items/41630aa02f1fd6cfa0a4#6-%E5%89%8D%E5%87%A6%E7%90%86%E6%AD%A3%E8%A6%8F%E5%8C%96)
- [Preprocessing: Stop word removal](https://qiita.com/masso/items/41630aa02f1fd6cfa0a4#7-%E5%89%8D%E5%87%A6%E7%90%86%E3%82%B9%E3%83%88%E3%83%83%E3%83%97%E3%83%AF%E3%83%BC%E3%83%89%E9%99%A4%E5%8E%BB)
tf-idf can be described as an index that scores each word in a document from the viewpoint of "how important is this word for understanding the context of the document?"
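For reference, the underlying score can be sketched as follows (this is the smoothed variant that scikit-learn's TfidfVectorizer uses by default; the vectorizer also L2-normalizes each document vector afterwards):

```math
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1
```

Here tf(t, d) is the number of times term t appears in document d, N is the total number of documents, and df(t) is the number of documents that contain t.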
For details, please refer to this article.
The goal this time is to see the characteristics of each member's posts. In other words, I want to know **what is characteristic about a given member, compared with every post made in the Slack community**.
Therefore, I calculated tf-idf with:

- **All documents**: all posts made so far, across all channels and all users
- **One document**: all posts of one member
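To make this grouping concrete, here is a minimal sketch (with made-up user IDs and word-segmented text) of handing one document per member to TfidfVectorizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: one space-separated (wakati) document per member
msgs_by_user = {
    'USER_001': 'python 環境 構築 参加',
    'USER_002': 'データ 分析 講座 参加',
}

vectorizer = TfidfVectorizer(token_pattern=u'(?u)\\b\\w+\\b')
tfidf_matrix = vectorizer.fit_transform(msgs_by_user.values())

# Rows correspond to members (documents), columns to the vocabulary of all posts
print(tfidf_matrix.shape)  # (2, 7): 2 members x 7 unique words
```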
The processing flow, briefly, is as follows.
important_word_extraction.py
import pandas as pd
import json
from datetime import datetime, date, timedelta, timezone
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer

JST = timezone(timedelta(hours=+9), 'JST')


# Group messages by user
def group_msgs_by_user(df_msgs: pd.DataFrame) -> dict:
    ser_uid = df_msgs.uid
    ser_wktmsg = df_msgs.wakati_msg
    # Get a unique uid list
    ser_uid_unique = df_msgs.drop_duplicates(subset='uid').uid
    # Group by uid without duplication
    dict_msgs_by_user = {}
    for uid in ser_uid_unique:
        # Get all wakati messages corresponding to this uid
        extracted = df_msgs.query('uid == @uid')
        # Add the key/value pair to the output dictionary
        dict_msgs_by_user[uid] = ' '.join(extracted.wakati_msg.dropna().values.tolist())
    return dict_msgs_by_user


# Extract important words based on their tf-idf scores and return them as a dictionary
def extract_important_word_by_key(feature_names: list, bow_df: pd.DataFrame, uids: list) -> dict:
    # Look at each row and extract the important words (words with high tf-idf scores)
    dict_important_words_by_user = {}
    for uid, (i, scores) in zip(uids, bow_df.iterrows()):
        # Create a table of this user's words and their tf-idf scores
        words_score_tbl = pd.DataFrame()
        words_score_tbl['scores'] = scores
        words_score_tbl['words'] = feature_names
        # Sort in descending order by tf-idf score
        words_score_tbl = words_score_tbl.sort_values('scores', ascending=False)
        words_score_tbl = words_score_tbl.reset_index()
        # extract: tf-idf score > 0.001
        important_words = words_score_tbl.query('scores > 0.001')
        # Create a dictionary for this user, e.g. 'uid0': {'w0': 0.9, 'w1': 0.87}
        d = {}
        for i, row in important_words.iterrows():
            d[row.words] = row.scores
        # Add to the result only if the user's dictionary has at least one word
        if len(d.keys()) > 0:
            dict_important_words_by_user[uid] = d
    return dict_important_words_by_user


# Extract important words for each user
def extraction_by_user(input_root: str, output_root: str) -> dict:
    # ---------------------------------------------
    # 1. load messages (processed)
    # ---------------------------------------------
    msg_fpath = input_root + '/' + 'messages_cleaned_wakati_norm_rmsw.csv'
    print('load: {0}'.format(msg_fpath))
    df_msgs = pd.read_csv(msg_fpath)

    # ---------------------------------------------
    # 2. group messages by user
    # ---------------------------------------------
    print('group messages by user and save it.')
    msgs_grouped_by_user = group_msgs_by_user(df_msgs)
    msg_grouped_fpath = input_root + '/' + 'messages_grouped_by_user.json'
    with open(msg_grouped_fpath, 'w', encoding='utf-8') as f:
        json.dump(msgs_grouped_by_user, f, ensure_ascii=False, indent=4)

    # ---------------------------------------------
    # 4. calculate tf-idf over all documents
    # ---------------------------------------------
    print('tfidf vectorizing ...')
    # Build a matrix whose columns are the words of all documents and whose rows are
    # the documents (= users); each element holds a tf-idf value
    tfidf_vectorizer = TfidfVectorizer(token_pattern=u'(?u)\\b\\w+\\b')
    bow_vec = tfidf_vectorizer.fit_transform(msgs_grouped_by_user.values())
    bow_array = bow_vec.toarray()
    bow_df = pd.DataFrame(bow_array,
                          index=msgs_grouped_by_user.keys(),
                          columns=tfidf_vectorizer.get_feature_names())

    # ---------------------------------------------
    # 5. extract important words based on tf-idf
    # ---------------------------------------------
    print('extract important words ...')
    d_word_score_by_uid = extract_important_word_by_key(
        tfidf_vectorizer.get_feature_names(), bow_df, msgs_grouped_by_user.keys())

    # ---------------------------------------------
    # 6. convert uid => uname
    # ---------------------------------------------
    print('Convert key of important word group for each user from uid to uname...')
    user_tbl = pd.read_csv('../../data/020_intermediate/users.csv')
    d_word_score_by_uname = {}
    for uid, val in d_word_score_by_uid.items():
        # Look up the uname for the speaker's uid (it may not exist if the user is inactive)
        target = user_tbl.query('uid == @uid')
        if target.shape[0] != 0:
            uname = target.iloc[0]['uname']
        else:
            continue
        print('uname: ', uname, 'type of uname: ', type(uname))
        d_word_score_by_uname[uname] = val
    return d_word_score_by_uname
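For reference, a minimal way to call this function and persist the result might look like the following. The input/output paths and the `__main__` block are assumptions for illustration, not taken verbatim from the repository:

```python
if __name__ == '__main__':
    # Hypothetical paths; adjust to your own data layout
    input_root = '../../data/030_processed'
    output_root = '../../data/040_output'

    d_word_score_by_uname = extraction_by_user(input_root, output_root)

    # Save the per-user {"word": score} dictionaries as JSON
    out_fpath = output_root + '/' + 'important_word_tfidf_by_user.json'
    with open(out_fpath, 'w', encoding='utf-8') as f:
        json.dump(d_word_score_by_uname, f, ensure_ascii=False, indent=4)
```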
With the WordCloud step explained in the next chapter, you can generate a word cloud whose word display sizes follow the scores simply by passing in a dictionary of the form {"word": score}.
In [Natural language processing] I tried to visualize the hot topics this week in the Slack community, I generated WordClouds grouped by "period". In this article the grouping is by "member", but **the output dictionaries have the same format**. Thanks to that, everything **except** the tf-idf scoring step can be handled by the same processing. DRY, as it should be.
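As a rough sketch of that step (the font path and output file name below are placeholders, not the repository's actual settings), the wordcloud package can consume such a dictionary via `generate_from_frequencies`:

```python
from wordcloud import WordCloud

# {"word": score} dictionary for one member (truncated example values)
word_scores = {'Participation': 0.16, 'environment': 0.15, 'node': 0.13}

wc = WordCloud(
    font_path='/usr/share/fonts/truetype/fonts-japanese-gothic.ttf',  # placeholder Japanese font
    background_color='white',
    width=800,
    height=600,
)
wc.generate_from_frequencies(word_scores)  # word size follows the tf-idf score
wc.to_file('wordcloud_user_001.png')       # placeholder output name
```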
Here is the dictionary that was actually output this time (user names are masked).
important_word_tfidf_by_user.json
{
    "USER_001": {
        "Participation": 0.1608918987478819,
        "environment": 0.15024077008089046,
        "Good product": 0.1347222699467748,
        "node": 0.1347222699467748,
        "Description": 0.13378417526975775,
        "Cyber security": 0.12422689899152742,
        "r": 0.12354794954617476,
        "Choice": 0.11973696610170319,
        "Replacement": 0.11678031479185731,
        "Last": 0.11632792524420342,
        "Course": 0.11467215023122095,
        "Release": 0.11324407267324783,
        "analysis": 0.11324407267324783,
        "Deadline": 0.11100429535028021,
        "How to write": 0.10628494383660991,
        "Deep learning": 0.10229478898619786,
        :
    },
    "USER_002": {
        "data": 0.170245452132736,
        "Participation": 0.15825283334154341,
        "Course": 0.13785592895847276,
        "Please": 0.1265412327351908,
        "Recruitment": 0.12204781908784276,
        "article": 0.1197561921672133,
        "environment": 0.11083230914864184,
        "Food": 0.1091835225326696,
        "share": 0.10371152197590257,
        "corona": 0.10081254351124691,
        "Reading in a circle": 0.10025885742434383,
        "Planning": 0.09899869065055528,
        "development of": 0.09571338092513401,
        "Target": 0.09253887576557392,
        "jobs": 0.09094257214685446,
        "project": 0.08910924912513929,
        "information": 0.08772258523428605,
        "language": 0.08636683271048684,
        "channel": 0.08295159680178281,
        "release": 0.0818876418995022,
        "youtube": 0.07956948308804826,
        "team": 0.07956948308804826,
        "Basic": 0.07444492553072463,
        :
    },
    :
}
Please refer to this article.
[[Natural language processing] I tried to visualize the hot topics this week in the Slack community - 9. Visualization processing with Wordcloud](https://qiita.com/masso/items/41630aa02f1fd6cfa0a4#9-wordcloud%E3%81%A7%E5%8F%AF%E8%A6%96%E5%8C%96%E5%87%A6%E7%90%86)
- Types of preprocessing in natural language processing and their power | Qiita
Other reference materials (there are quite a lot) are summarized in the GitHub repository.
This time, we used data from a Slack community called the Data Learning Guild. The Data Learning Guild is an online community for data analytics professionals. If you are interested, please check the link below.
Data Learning Guild Official Homepage