In this article, I'll show you how to use WordCloud to visualize what each member talks about in a Slack community.
The source code is available here: [https://github.com/sota0121/slack-msg-analysis](https://github.com/sota0121/slack-msg-analysis)
See also: [Natural language processing] I tried to visualize the hot topics this week in the Slack community
* I would like to summarize the preprocessing in a separate article in the future
For more information, see the Getting Started section of the README. The flow looks like this:
docker-compose up -d
docker exec -it ds-py3 bash
run_wordcloud_by_user.sh
Here is an example of the actual output. Each WordCloud is generated from a different member's posts.
See this article.
[[Natural language processing] I tried to visualize the hot topics this week in the Slack community - 2. Get the message from Slack](https://qiita.com/masso/items/41630aa02f1fd6cfa0a4#2-slack%E3%81%8B%E3%82%89%E3%83%A1%E3%83%83%E3%82%BB%E3%83%BC%E3%82%B8%E3%82%92%E5%8F%96%E5%BE%97)
The content is the same as in that article, so it is omitted here. Please refer to the links below for details.
- [Preprocessing: Create message mart table](https://qiita.com/masso/items/41630aa02f1fd6cfa0a4#3-%E5%89%8D%E5%87%A6%E7%90%86%E3%83%A1%E3%83%83%E3%82%BB%E3%83%BC%E3%82%B8%E3%83%9E%E3%83%BC%E3%83%88%E3%83%86%E3%83%BC%E3%83%96%E3%83%AB%E4%BD%9C%E6%88%90)
- [Preprocessing: Cleaning](https://qiita.com/masso/items/41630aa02f1fd6cfa0a4#4-%E5%89%8D%E5%87%A6%E7%90%86%E3%82%AF%E3%83%AA%E3%83%BC%E3%83%8B%E3%83%B3%E3%82%B0)
- [Preprocessing: Morphological analysis (Janome)](https://qiita.com/masso/items/41630aa02f1fd6cfa0a4#5-%E5%89%8D%E5%87%A6%E7%90%86%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90janome)
- [Preprocessing: Normalization](https://qiita.com/masso/items/41630aa02f1fd6cfa0a4#6-%E5%89%8D%E5%87%A6%E7%90%86%E6%AD%A3%E8%A6%8F%E5%8C%96)
- [Preprocessing: Stop word removal](https://qiita.com/masso/items/41630aa02f1fd6cfa0a4#7-%E5%89%8D%E5%87%A6%E7%90%86%E3%82%B9%E3%83%88%E3%83%83%E3%83%97%E3%83%AF%E3%83%BC%E3%83%89%E9%99%A4%E5%8E%BB)
tf-idf can be described as an index that scores each word in a document from the viewpoint of "how important is this word for understanding the context of the document?"
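For reference, the underlying score can be sketched as follows (this is the smoothed variant that scikit-learn's TfidfVectorizer uses by default; the vectorizer also L2-normalizes each document vector afterwards):

```math
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1
```

Here tf(t, d) is the number of times term t appears in document d, N is the total number of documents, and df(t) is the number of documents that contain t.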
For details, please refer to this article.
The goal this time is to see the characteristics of each member's posts. In other words, I want to know **what is characteristic about a given member, compared with every post made in the Slack community**.
Therefore, I calculated tf-idf with:

- **All documents**: all posts made so far, across all channels and all users
- **One document**: all posts of one member
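To make this grouping concrete, here is a minimal sketch (with made-up user IDs and word-segmented text) of handing one document per member to TfidfVectorizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: one space-separated (wakati) document per member
msgs_by_user = {
    'USER_001': 'python 環境 構築 参加',
    'USER_002': 'データ 分析 講座 参加',
}

vectorizer = TfidfVectorizer(token_pattern=u'(?u)\\b\\w+\\b')
tfidf_matrix = vectorizer.fit_transform(msgs_by_user.values())

# Rows correspond to members (documents), columns to the vocabulary of all posts
print(tfidf_matrix.shape)  # (2, 7): 2 members x 7 unique words
```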
The processing flow, briefly, is as follows.
important_word_extraction.py
import pandas as pd
import json
from datetime import datetime, date, timedelta, timezone
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer

JST = timezone(timedelta(hours=+9), 'JST')


# Group messages by user
def group_msgs_by_user(df_msgs: pd.DataFrame) -> dict:
    ser_uid = df_msgs.uid
    ser_wktmsg = df_msgs.wakati_msg
    # Get a unique uid list
    ser_uid_unique = df_msgs.drop_duplicates(subset='uid').uid
    # Group by uid without duplication
    dict_msgs_by_user = {}
    for uid in ser_uid_unique:
        # Get all wakati messages corresponding to this uid
        extracted = df_msgs.query('uid == @uid')
        # Add the key/value pair to the output dictionary
        dict_msgs_by_user[uid] = ' '.join(extracted.wakati_msg.dropna().values.tolist())
    return dict_msgs_by_user


# Extract important words based on their tf-idf scores and return them as a dictionary
def extract_important_word_by_key(feature_names: list, bow_df: pd.DataFrame, uids: list) -> dict:
    # Look at each row and extract the important words (words with high tf-idf scores)
    dict_important_words_by_user = {}
    for uid, (i, scores) in zip(uids, bow_df.iterrows()):
        # Create a table of this user's words and their tf-idf scores
        words_score_tbl = pd.DataFrame()
        words_score_tbl['scores'] = scores
        words_score_tbl['words'] = feature_names
        # Sort in descending order by tf-idf score
        words_score_tbl = words_score_tbl.sort_values('scores', ascending=False)
        words_score_tbl = words_score_tbl.reset_index()
        # extract: tf-idf score > 0.001
        important_words = words_score_tbl.query('scores > 0.001')
        # Create a dictionary for this user, e.g. 'uid0': {'w0': 0.9, 'w1': 0.87}
        d = {}
        for i, row in important_words.iterrows():
            d[row.words] = row.scores
        # Add to the result only if the user's dictionary has at least one word
        if len(d.keys()) > 0:
            dict_important_words_by_user[uid] = d
    return dict_important_words_by_user


# Extract important words for each user
def extraction_by_user(input_root: str, output_root: str) -> dict:
    # ---------------------------------------------
    # 1. load messages (processed)
    # ---------------------------------------------
    msg_fpath = input_root + '/' + 'messages_cleaned_wakati_norm_rmsw.csv'
    print('load: {0}'.format(msg_fpath))
    df_msgs = pd.read_csv(msg_fpath)

    # ---------------------------------------------
    # 2. group messages by user
    # ---------------------------------------------
    print('group messages by user and save it.')
    msgs_grouped_by_user = group_msgs_by_user(df_msgs)
    msg_grouped_fpath = input_root + '/' + 'messages_grouped_by_user.json'
    with open(msg_grouped_fpath, 'w', encoding='utf-8') as f:
        json.dump(msgs_grouped_by_user, f, ensure_ascii=False, indent=4)

    # ---------------------------------------------
    # 4. calculate tf-idf over all documents
    # ---------------------------------------------
    print('tfidf vectorizing ...')
    # Build a matrix whose columns are the words of all documents and whose rows are
    # the documents (= users); each element holds a tf-idf value
    tfidf_vectorizer = TfidfVectorizer(token_pattern=u'(?u)\\b\\w+\\b')
    bow_vec = tfidf_vectorizer.fit_transform(msgs_grouped_by_user.values())
    bow_array = bow_vec.toarray()
    bow_df = pd.DataFrame(bow_array,
                          index=msgs_grouped_by_user.keys(),
                          columns=tfidf_vectorizer.get_feature_names())

    # ---------------------------------------------
    # 5. extract important words based on tf-idf
    # ---------------------------------------------
    print('extract important words ...')
    d_word_score_by_uid = extract_important_word_by_key(
        tfidf_vectorizer.get_feature_names(), bow_df, msgs_grouped_by_user.keys())

    # ---------------------------------------------
    # 6. convert uid => uname
    # ---------------------------------------------
    print('Convert key of important word group for each user from uid to uname...')
    user_tbl = pd.read_csv('../../data/020_intermediate/users.csv')
    d_word_score_by_uname = {}
    for uid, val in d_word_score_by_uid.items():
        # Look up the uname for the speaker's uid (it may not exist if the user is inactive)
        target = user_tbl.query('uid == @uid')
        if target.shape[0] != 0:
            uname = target.iloc[0]['uname']
        else:
            continue
        print('uname: ', uname, 'type of uname: ', type(uname))
        d_word_score_by_uname[uname] = val
    return d_word_score_by_uname
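For reference, a minimal way to call this function and persist the result might look like the following. The input/output paths and the `__main__` block are assumptions for illustration, not taken verbatim from the repository:

```python
if __name__ == '__main__':
    # Hypothetical paths; adjust to your own data layout
    input_root = '../../data/030_processed'
    output_root = '../../data/040_output'

    d_word_score_by_uname = extraction_by_user(input_root, output_root)

    # Save the per-user {"word": score} dictionaries as JSON
    out_fpath = output_root + '/' + 'important_word_tfidf_by_user.json'
    with open(out_fpath, 'w', encoding='utf-8') as f:
        json.dump(d_word_score_by_uname, f, ensure_ascii=False, indent=4)
```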
With the WordCloud step explained in the next chapter, you can generate a word cloud whose word display sizes follow the scores simply by passing in a dictionary of the form {"word": score}.
In [Natural language processing] I tried to visualize the hot topics this week in the Slack community, I generated WordClouds grouped by "period". In this article the grouping is by "member", but **the output dictionaries have the same format**. Thanks to that, everything **except** the tf-idf scoring step can be handled by the same processing. DRY, as it should be.
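As a rough sketch of that step (the font path and output file name below are placeholders, not the repository's actual settings), the wordcloud package can consume such a dictionary via `generate_from_frequencies`:

```python
from wordcloud import WordCloud

# {"word": score} dictionary for one member (truncated example values)
word_scores = {'Participation': 0.16, 'environment': 0.15, 'node': 0.13}

wc = WordCloud(
    font_path='/usr/share/fonts/truetype/fonts-japanese-gothic.ttf',  # placeholder Japanese font
    background_color='white',
    width=800,
    height=600,
)
wc.generate_from_frequencies(word_scores)  # word size follows the tf-idf score
wc.to_file('wordcloud_user_001.png')       # placeholder output name
```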
Here is the dictionary that was actually output this time (user names are masked).
important_word_tfidf_by_user.json
{
    "USER_001": {
        "Participation": 0.1608918987478819,
        "environment": 0.15024077008089046,
        "Good product": 0.1347222699467748,
        "node": 0.1347222699467748,
        "Description": 0.13378417526975775,
        "Cyber security": 0.12422689899152742,
        "r": 0.12354794954617476,
        "Choice": 0.11973696610170319,
        "Replacement": 0.11678031479185731,
        "Last": 0.11632792524420342,
        "Course": 0.11467215023122095,
        "Release": 0.11324407267324783,
        "analysis": 0.11324407267324783,
        "Deadline": 0.11100429535028021,
        "How to write": 0.10628494383660991,
        "Deep learning": 0.10229478898619786,
        :
    },
    "USER_002": {
        "data": 0.170245452132736,
        "Participation": 0.15825283334154341,
        "Course": 0.13785592895847276,
        "Please": 0.1265412327351908,
        "Recruitment": 0.12204781908784276,
        "article": 0.1197561921672133,
        "environment": 0.11083230914864184,
        "Food": 0.1091835225326696,
        "share": 0.10371152197590257,
        "corona": 0.10081254351124691,
        "Reading in a circle": 0.10025885742434383,
        "Planning": 0.09899869065055528,
        "development of": 0.09571338092513401,
        "Target": 0.09253887576557392,
        "jobs": 0.09094257214685446,
        "project": 0.08910924912513929,
        "information": 0.08772258523428605,
        "language": 0.08636683271048684,
        "channel": 0.08295159680178281,
        "release": 0.0818876418995022,
        "youtube": 0.07956948308804826,
        "team": 0.07956948308804826,
        "Basic": 0.07444492553072463,
        :
    },
    :
}
Please refer to this article.
[[Natural language processing] I tried to visualize the hot topics this week in the Slack community - 9. Visualization processing with Wordcloud](https://qiita.com/masso/items/41630aa02f1fd6cfa0a4#9-wordcloud%E3%81%A7%E5%8F%AF%E8%A6%96%E5%8C%96%E5%87%A6%E7%90%86)
- Types of preprocessing in natural language processing and their power | Qiita
Other reference materials (there are quite a lot) are summarized in the GitHub repository.
This time, we used data from a Slack community called the Data Learning Guild. The Data Learning Guild is an online community for data analytics professionals. If you are interested, please check the link below.
Data Learning Guild Official Homepage