In this article, I'll show you how to use Wordcloud to visualize what topics have been raised in the Slack community over a period of time (1 week here).
The source code can be found [here](https://github.com/sota0121/slack-msg-analysis) :octocat:
See also: [Natural language processing] I tried to visualize the remarks of each member in the Slack community
* I would like to summarize the preprocessing in a separate article in the future.
For more information, see "Getting started" in the README. The flow is as follows:
docker-compose up -d
docker exec -it ds-py3 bash
run_wordcloud_by_term.sh
This is an example of the actual output. Each word cloud corresponds to a different period.
Get a Slack API token from the official Slack API site.
I won't cover how to get started with the Slack API here.
Obtain the tokens needed for the subsequent processing; the scripts below expect a channel API key and a user API key.
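For reference, the acquisition script later in this article reads these tokens from `../../conf/local/credentials.json` using the keys `channel_api_key` and `user_api_key`. A minimal sketch of that file (placeholder values):

```json
{
    "channel_api_key": "<your channel API token>",
    "user_api_key": "<your user API token>"
}
```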
Here, we create a **SlackApp** class that retrieves Slack information via the API. The retrieved information is saved in JSON format as-is, without any processing.
slack_app.py
# A class that gets the desired information via the Slack API (no processing)
import requests
import json
from tqdm import tqdm
import pandas as pd
class SlackApp:
ch_list_url = 'https://slack.com/api/channels.list'
ch_history_url = 'https://slack.com/api/channels.history'
usr_list_url = 'https://slack.com/api/users.list'
def __init__(self, ch_api_key, usr_api_key):
# NEW members
self.channels_info = []
self.users_info = []
self.messages_info = []
# OLD members
self.channelInfo = {} # k: ch_name, v: ch_id
self.messages_in_chs = {}
self.userInfo = {}
self.ch_api_token = str(ch_api_key)
self.usr_api_token = str(usr_api_key)
def load_save_channel_info(self, outdir: str):
#Get channel information via slack API and save to file
payload = {'token': self.ch_api_token}
response = requests.get(SlackApp.ch_list_url, params=payload)
if response.status_code == 200:
json_data = response.json()
if 'channels' in json_data.keys():
self.channels_info = json_data['channels']
with open(outdir + '/' + 'channel_info.json', 'w', encoding='utf-8') as f:
json.dump(self.channels_info, f, indent=4, ensure_ascii=False)
def load_save_user_info(self, outdir: str):
#Get user information via slack API and save to file
payload = {'token': self.usr_api_token}
response = requests.get(SlackApp.usr_list_url, params=payload)
if response.status_code == 200:
json_data = response.json()
if 'members' in json_data.keys():
self.users_info = json_data['members']
with open(outdir + '/' + 'user_info.json', 'w', encoding='utf-8') as f:
json.dump(self.users_info, f, indent=4, ensure_ascii=False)
def load_save_messages_info(self, outdir: str):
#Create channel id list
channel_id_list = []
for ch in self.channels_info:
channel_id_list.append(ch['id'])
# Get the message history of each channel via the Slack API and save it to a file
for ch_id in tqdm(channel_id_list, desc='[loading...]'):
payload = {'token': self.ch_api_token, 'channel': ch_id}
response = requests.get(SlackApp.ch_history_url, params=payload)
if response.status_code == 200:
json_data = response.json()
msg_in_ch = {}
msg_in_ch['channel_id'] = ch_id
if 'messages' in json_data.keys():
msg_in_ch['messages'] = json_data['messages']
else:
msg_in_ch['messages'] = ''
self.messages_info.append(msg_in_ch)
with open(outdir + '/' + 'messages_info.json', 'w', encoding='utf-8') as f:
json.dump(self.messages_info, f, indent=4, ensure_ascii=False)
Use the **SlackApp** class defined above to get the information.
The following three kinds of information are acquired: channel information, user information, and message information.
slack_msg_extraction.py
#A class that uses slackapi to get the desired information
import sys
import json
sys.path.append('../../src/d000_utils') #Added storage path for SlackApp scripts
import slack_app as sa
def main():
# -------------------------------------
# load api token
# -------------------------------------
credentials_root = '../../conf/local/'
credential_fpath = credentials_root + 'credentials.json'
print('load credential.json ...')
with open(credential_fpath, 'r') as f:
credentials = json.load(f)
# -------------------------------------
# start slack app
# -------------------------------------
print('start slack app ...')
app = sa.SlackApp(
credentials['channel_api_key'],
credentials['user_api_key']
)
outdir = '../../data/010_raw'
# -------------------------------------
# get channels info
# -------------------------------------
app.load_save_channel_info(outdir)
# -------------------------------------
# get user info
# -------------------------------------
app.load_save_user_info(outdir)
# -------------------------------------
# get msg info
# -------------------------------------
app.load_save_messages_info(outdir)
if __name__ == "__main__":
main()
The information acquired via the Slack API was saved in JSON format, following the Slack API specifications. Next, format it into tables that are easy to analyze. When designing the tables, keep Tidy Data in mind.
This time, I designed the tables as follows, aiming for the minimum required information plus a little extra.
message mart table

No | Column Name | Type | Content |
---|---|---|---|
0 | index | int | AUTO INCREMENT |
1 | ch_id | str | Channel ID |
2 | msg | str | Message string |
3 | uid | str | Speaker's user ID |
4 | timestamp | datetime | Time of the message |

channel mart table

No | Column Name | Type | Content |
---|---|---|---|
0 | index | int | AUTO INCREMENT |
1 | ch_id | str | Channel ID |
2 | ch_name | str | Channel name (name displayed in Slack's UI) |
3 | ch_namenorm | str | Normalized channel name |
4 | ch_membernum | int | Number of participants in the channel |

user mart table

No | Column Name | Type | Content |
---|---|---|---|
0 | index | int | AUTO INCREMENT |
1 | uid | str | User ID |
2 | uname | str | User name |
The actual code is below.
`../../data/010_raw`: storage location for the JSON data obtained from Slack

- `user_info.json`: user information (JSON) file name
- `messages_info.json`: message information (JSON) file name for all channels
- `channel_info.json`: channel information (JSON) file name

make_msg_mart_table.py
import json
import pandas as pd
def make_user_table(usr_dict: dict) -> pd.DataFrame:
uid_list = []
uname_list = []
for usr_ditem in usr_dict:
if usr_ditem['deleted'] == True:
continue
uid_list.append(usr_ditem['id'])
uname_list.append(usr_ditem['profile']['real_name_normalized'])
user_table = pd.DataFrame({'uid': uid_list, 'uname': uname_list})
return user_table
def make_msg_table(msg_dict: dict) -> pd.DataFrame:
ch_id_list = []
msg_list = []
uid_list = []
ts_list = []
for msg_ditem in msg_dict:
if 'channel_id' in msg_ditem.keys():
ch_id = msg_ditem['channel_id']
else:
continue
if 'messages' in msg_ditem.keys():
msgs_in_ch = msg_ditem['messages']
else:
continue
# get message in channel
for i, msg in enumerate(msgs_in_ch):
# if msg by bot, continue
if 'user' not in msg:
continue
ch_id_list.append(ch_id)
msg_list.append(msg['text'])
uid_list.append(msg['user']) #I don't have this key for bots
ts_list.append(msg['ts'])
df_msgs = pd.DataFrame({
'ch_id': ch_id_list,
'msg': msg_list,
'uid': uid_list,
'timestamp': ts_list
})
return df_msgs
def make_ch_table(ch_dict: dict) -> pd.DataFrame:
chid_list = []
chname_list = []
chnormname_list = []
chmembernum_list = []
for ch_ditem in ch_dict:
chid_list.append(ch_ditem['id'])
chname_list.append(ch_ditem['name'])
chnormname_list.append(ch_ditem['name_normalized'])
chmembernum_list.append(ch_ditem['num_members'])
ch_table = pd.DataFrame({
'ch_id': chid_list,
'ch_name': chname_list,
'ch_namenorm': chnormname_list,
'ch_membernum': chmembernum_list
})
return ch_table
def main():
# 1. load user/message/channels
input_root = '../../data/010_raw'
user_info_fpath = input_root + '/' + 'user_info.json'
with open(user_info_fpath, 'r', encoding='utf-8') as f:
user_info_rawdict = json.load(f)
print('load ... ', user_info_fpath)
msg_info_fpath = input_root + '/' + 'messages_info.json'
with open(msg_info_fpath, 'r', encoding='utf-8') as f:
msgs_info_rawdict = json.load(f)
print('load ... ', msg_info_fpath)
ch_info_fpath = input_root + '/' + 'channel_info.json'
with open(ch_info_fpath, 'r', encoding='utf-8') as f:
ch_info_rawdict = json.load(f)
print('load ... ', ch_info_fpath)
# 2. make and save tables
# user
output_root = '../../data/020_intermediate'
df_user_info = make_user_table(user_info_rawdict)
user_tbl_fpath = output_root + '/' + 'users.csv'
df_user_info.to_csv(user_tbl_fpath, index=False)
print('save ... ', user_tbl_fpath)
# msg
df_msg_info = make_msg_table(msgs_info_rawdict)
msg_tbl_fpath = output_root + '/' + 'messages.csv'
df_msg_info.to_csv(msg_tbl_fpath, index=False)
print('save ... ', msg_tbl_fpath)
# channel
df_ch_info = make_ch_table(ch_info_rawdict)
ch_tbl_fpath = output_root + '/' + 'channels.csv'
df_ch_info.to_csv(ch_tbl_fpath, index=False)
print('save ... ', ch_tbl_fpath)
if __name__ == "__main__":
main()
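As a rough sketch of how the tidy tables produced above can be combined later (this join itself is not part of the pipeline, and it assumes the CSVs above have already been generated; paths and column names are the ones defined above):

```python
import pandas as pd

# Hypothetical example: join the three mart tables on their key columns.
df_msgs = pd.read_csv('../../data/020_intermediate/messages.csv')
df_users = pd.read_csv('../../data/020_intermediate/users.csv')
df_chs = pd.read_csv('../../data/020_intermediate/channels.csv')

df = (df_msgs
      .merge(df_users, on='uid', how='left')
      .merge(df_chs, on='ch_id', how='left'))
print(df[['ch_name', 'uname', 'msg', 'timestamp']].head())
```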
Cleaning generally refers to removing noise from the text. The required processing varies depending on the data and the purpose. Here, the following processing was performed (see cleaning.py below): removing whitespace, URL links, mentions, emoji codes, HTML entities, and code blocks.
cleaning.py
import re
import pandas as pd
import argparse
from pathlib import Path
def clean_msg(msg: str) -> str:
# sub 'Return and Space'
result = re.sub(r'\s', '', msg)
# sub 'url link'
result = re.sub(r'(<)http.+(>)', '', result)
# sub 'mention'
result = re.sub(r'(<)@.+\w(>)', '', result)
# sub 'reaction'
result = re.sub(r'(:).+\w(:)', '', result)
# sub 'html key words'
result = re.sub(r'(&).+?\w(;)', '', result)
# sub 'multi lines code block'
result = re.sub(r'(```).+(```)', '', result)
# sub 'inline code block'
result = re.sub(r'(`).+(`)', '', result)
return result
def clean_msg_ser(msg_ser: pd.Series) -> pd.Series:
cleaned_msg_list = []
for i, msg in enumerate(msg_ser):
cleaned_msg = clean_msg(str(msg))
if 'Joined the channel' in cleaned_msg:
continue
cleaned_msg_list.append(cleaned_msg)
cleaned_msg_ser = pd.Series(cleaned_msg_list)
return cleaned_msg_ser
def get_ch_id_from_table(ch_name_parts: list, input_fpath: str) -> list:
df_ch = pd.read_csv(input_fpath)
ch_id = []
for ch_name_part in ch_name_parts:
for i, row in df_ch.iterrows():
if ch_name_part in row.ch_name:
ch_id.append(row.ch_id)
break
return ch_id
def main(input_fname: str):
input_root = '../../data/020_intermediate'
output_root = input_root
# 1. load messages.csv (including noise)
msgs_fpath = input_root + '/' + input_fname
df_msgs = pd.read_csv(msgs_fpath)
print('load :{0}'.format(msgs_fpath))
# 2. Drop Not Target Records
print('drop records (drop non-target channel\'s messages)')
non_target_ch_name = ['general', 'Announcement from management']
non_target_ch_ids = get_ch_id_from_table(non_target_ch_name, input_root + '/' + 'channels.csv')
print('=== non-target channels below ===')
print(non_target_ch_ids)
for non_target_ch_id in non_target_ch_ids:
df_msgs = df_msgs.query('ch_id != @non_target_ch_id')
# 3. clean message string list
ser_msg = df_msgs.msg
df_msgs.msg = clean_msg_ser(ser_msg)
# 4. save it
pin = Path(msgs_fpath)
msgs_cleaned_fpath = output_root + '/' + pin.stem + '_cleaned.csv'
df_msgs.to_csv(msgs_cleaned_fpath, index=False)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("input_fname", help="set input file name", type=str)
args = parser.parse_args()
input_fname = args.input_fname
main(input_fname)
Actually, I wanted to remove source code only when it appears inside a code block, but I couldn't get that to work. :sweat:
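To get a feel for what clean_msg strips, a quick sanity check like the following can help (a hypothetical message; this assumes cleaning.py above is importable from the working directory):

```python
# Hypothetical sanity check for clean_msg (not part of the pipeline).
from cleaning import clean_msg

sample = '<@U012AB3CD> see <https://example.com> :smile: `print(1)` &gt; thanks'
# Whitespace, the mention, the URL, the emoji code, the inline code
# and the HTML entity are all removed by the regexes above.
print(clean_msg(sample))
```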
Morphological analysis is, roughly, the process of finding "morphemes" in sentences. I'll leave the details of morphological analysis to other articles.
Here, the real purpose is to produce a word-separated ("wakati") representation.
Roughly speaking, this converts a sentence into a sequence of words separated by spaces. For example:
**Example sentence: "I like baseball."**
**Word-separated: "I like baseball"** (each word split by a space)
In practice, when you want to capture words that represent the topic of a sentence, as we do here, it is better to extract **nouns only**. So:
**Word-separated (nouns only): "I baseball"**
is even better.
When implementing morphological analysis, you have to decide:

- which morphological analysis library to use
- which dictionary data to use

This time, I chose the following:

- Morphological analysis library: Janome
- Dictionary data: Janome's default (NEologd would be even better for new words)

As for which parts of speech to extract, I looked at the part-of-speech system of the morphological analysis tool and considered which parts were needed for this purpose.
The official mascot is cute. (I don't know whether its name is actually Janome.)
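As a minimal illustration of what Janome returns (assuming Janome is installed): each token carries a surface form and a comma-separated part-of-speech string, which is what the class below filters on.

```python
from janome.tokenizer import Tokenizer

t = Tokenizer()
for token in t.tokenize('私は野球が好きだ'):
    # token.surface: the word as it appears in the text
    # token.part_of_speech: a comma-separated string such as '名詞,一般,*,*'
    print(token.surface, token.part_of_speech)
```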
morphological_analysis.py
from janome.tokenizer import Tokenizer
from tqdm import tqdm
import pandas as pd
import argparse
from pathlib import Path
import sys
# Part-of-speech filters passed to Janome.
# These strings are kept in Japanese so that they match the POS labels Janome returns.
exc_part_of_speech = {
    "名詞": ["非自立", "代名詞", "数"]
}
inc_part_of_speech = {
    "名詞": ["サ変接続", "一般", "固有名詞"],
}
class MorphologicalAnalysis:
def __init__(self):
self.janome_tokenizer = Tokenizer()
def tokenize_janome(self, line: str) -> list:
# list of janome.tokenizer.Token
tokens = self.janome_tokenizer.tokenize(line)
return tokens
def exists_pos_in_dict(self, pos0: str, pos1: str, pos_dict: dict) -> bool:
# Return whether pos0 and pos1 are in pos_dict or not.
# ** pos = part of speech
for type0 in pos_dict.keys():
if pos0 == type0:
for type1 in pos_dict[type0]:
if pos1 == type1:
return True
return False
def get_wakati_str(self, line: str, exclude_pos: dict,
include_pos: dict) -> str:
'''
exclude/include_pos is like this
{"noun": ["Non-independent", "代noun", "number"], "adjective": ["xxx", "yyy"]}
'''
tokens = self.janome_tokenizer.tokenize(line, stream=True) #Generator to save memory
extracted_words = []
for token in tokens:
part_of_speech0 = token.part_of_speech.split(',')[0]
part_of_speech1 = token.part_of_speech.split(',')[1]
# check for excluding words
exists = self.exists_pos_in_dict(part_of_speech0, part_of_speech1,
exclude_pos)
if exists:
continue
# check for including words
exists = self.exists_pos_in_dict(part_of_speech0, part_of_speech1,
include_pos)
if not exists:
continue
# append the surface form of the token
extracted_words.append(token.surface)
# wakati string with extracted words
wakati_str = ' '.join(extracted_words)
return wakati_str
def make_wakati_for_lines(msg_ser: pd.Series) -> pd.Series:
manalyzer = MorphologicalAnalysis()
wakati_msg_list = []
for msg in tqdm(msg_ser, desc='[mk wakati msgs]'):
wakati_msg = manalyzer.get_wakati_str(str(msg), exc_part_of_speech,
inc_part_of_speech)
wakati_msg_list.append(wakati_msg)
wakati_msg_ser = pd.Series(wakati_msg_list)
return wakati_msg_ser
def main(input_fname: str):
input_root = '../../data/020_intermediate'
output_root = '../../data/030_processed'
# 1. load messages_cleaned.csv
msgs_cleaned_fpath = input_root + '/' + input_fname
df_msgs = pd.read_csv(msgs_cleaned_fpath)
# 2. make wakati string by record
ser_msg = df_msgs.msg
df_msgs['wakati_msg'] = make_wakati_for_lines(ser_msg)
# 3. save it
pin = Path(msgs_cleaned_fpath)
msgs_wakati_fpath = output_root + '/' + pin.stem + '_wakati.csv'
df_msgs.to_csv(msgs_wakati_fpath, index=False)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("input_fname", help="set input file name", type=str)
args = parser.parse_args()
input_fname = args.input_fname
# the input file must be a cleaned one
if 'cleaned' not in input_fname:
print('input file name is invalid.: {0}'.format(input_fname))
print('input file name must include \'cleaned\'')
sys.exit(1)
main(input_fname)
Normalization in natural language preprocessing refers to processing that absorbs notational variation, such as unifying character types and number representations. It is also called "name matching."
The world of normalization is deep, so I'll stop here.
This time, for simplicity, only the following two operations were implemented: replacing runs of digits with 0, and lowercasing. It's extremely simple.
normalization.py
import re
import pandas as pd
from tqdm import tqdm
import argparse
from pathlib import Path
import sys
def normalize_text(text: str) -> str:
normalized_text = normalize_number(text)
normalized_text = lower_text(normalized_text)
return normalized_text
def normalize_number(text: str) -> str:
"""
pattern = r'\d+'
replacer = re.compile(pattern)
result = replacer.sub('0', text)
"""
#Replace consecutive numbers with 0
replaced_text = re.sub(r'\d+', '0', text)
return replaced_text
def lower_text(text: str) -> str:
return text.lower()
def normalize_msgs(wktmsg_ser: pd.Series) -> pd.Series:
normalized_msg_list = []
for wktstr in tqdm(wktmsg_ser, desc='normalize words...'):
normalized = normalize_text(str(wktstr))
normalized_msg_list.append(normalized)
normalized_msg_ser = pd.Series(normalized_msg_list)
return normalized_msg_ser
def main(input_fname: str):
input_root = '../../data/030_processed'
output_root = input_root
# 1. load wakati messages
msgs_fpath = input_root + '/' + input_fname
df_msgs = pd.read_csv(msgs_fpath)
# 2. normalize wakati_msg (update)
ser_msg = df_msgs.wakati_msg
df_msgs.wakati_msg = normalize_msgs(ser_msg)
# 3. save it
pin = Path(msgs_fpath)
msgs_ofpath = output_root + '/' + pin.stem + '_norm.csv'
df_msgs.to_csv(msgs_ofpath, index=False)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("input_fname", help="set input file name", type=str)
args = parser.parse_args()
input_fname = args.input_fname
# the input file must be a word-separated (wakati) one
if 'wakati' not in input_fname:
print('input file name is invalid.: {0}'.format(input_fname))
print('input file name must include \'wakati\'')
sys.exit(1)
main(input_fname)
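For intuition, the two operations boil down to something like this (hypothetical input string):

```python
import re

# Runs of digits collapse to a single 0 and everything is lowercased,
# mirroring normalize_number and lower_text above.
text = 'March 2020, 12:34 DeepL'
print(re.sub(r'\d+', '0', text).lower())  # -> 'march 0, 0:0 deepl'
```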
"Stop word removal" is the process of removing stop words. (Too much as it is ...)
So what is a "** stop word **"?
According to the goo Japanese dictionary
Words that are excluded from the search by themselves because they are so common in full-text search. Refers to "ha", "no", "desu" in Japanese, and a, the, of in English. Stopwords.
Natural language processing tasks often have the purpose of "understanding the context of a sentence." Stopwords are unnecessary for this purpose and should be removed.
There are two main approaches: using a predefined stop word dictionary, or using word frequency. This time we will use the dictionary-based approach.
For the dictionary data, I use the SlothLib Japanese stop word list (the URL appears in the script below).
There are several reasons for choosing the dictionary-based approach:

- With a dictionary, an existing word list can be adopted with little effort.
- With the frequency-based approach, you need to extract only "words that occur frequently in **general conversation** and therefore should be removed". If the frequency were determined from only the Slack data at hand, **there would be a risk of unintentionally removing words related to the hot topics**, so separate data would have to be prepared and aggregated, which is a bit of a hassle.

This step removes the words registered in the dictionary described above. In addition, a few extra words and symbols that I found noisy while tuning are also removed (see sw_added_list in the script).
stopword_removal.py
import pandas as pd
import urllib.request
from pathlib import Path
from tqdm import tqdm
import argparse
from pathlib import Path
import sys
def maybe_download(path: str):
stopword_def_page_url = 'http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt'
p = Path(path)
if p.exists():
print('File already exists.')
else:
print('downloading stop words definition ...')
# Download the file from `url` and save it locally under `file_name`:
urllib.request.urlretrieve(stopword_def_page_url, path)
#stop word added
sw_added_list = [
'-',
'-',
'w',
'W',
'm',
'Lol'
]
sw_added_str = '\n'.join(sw_added_list)
with open(path, 'a') as f:
print(sw_added_str, file=f)
def load_sw_definition(path: str) -> list:
with open(path, 'r', encoding='utf-8') as f:
lines = f.read()
line_list = lines.split('\n')
line_list = [x for x in line_list if x != '']
return line_list
def remove_sw_from_text(wktstr: str, stopwords: list) -> str:
words_list = wktstr.split(' ')
words_list_swrm = [x for x in words_list if x not in stopwords]
swremoved_str = ' '.join(words_list_swrm)
return swremoved_str
def remove_sw_from_msgs(wktmsg_ser: pd.Series, stopwords: list) -> pd.Series:
swremved_msg_list = []
for wktstr in tqdm(wktmsg_ser, desc='remove stopwords...'):
removed_str = remove_sw_from_text(str(wktstr), stopwords)
swremved_msg_list.append(removed_str)
swremved_msg_ser = pd.Series(swremved_msg_list)
return swremved_msg_ser
def main(input_fname: str):
input_root = '../../data/030_processed'
output_root = input_root
# 1. load stop words
sw_def_fpath = 'stopwords.txt'
maybe_download(sw_def_fpath)
stopwords = load_sw_definition(sw_def_fpath)
# 2. load messages
msgs_fpath = input_root + '/' + input_fname
df_msgs = pd.read_csv(msgs_fpath)
# 3. remove stop words
ser_msg = df_msgs.wakati_msg
df_msgs.wakati_msg = remove_sw_from_msgs(ser_msg, stopwords)
# 4. save it
pin = Path(msgs_fpath)
msgs_ofpath = output_root + '/' + pin.stem + '_rmsw.csv'
df_msgs.to_csv(msgs_ofpath, index=False)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("input_fname", help="set input file name", type=str)
args = parser.parse_args()
input_fname = args.input_fname
# the input file must be a normalized one
if 'norm' not in input_fname:
print('input file name is invalid.: {0}'.format(input_fname))
print('input file name must include \'norm\'')
sys.exit(1)
main(input_fname)
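The removal itself is just a membership check on each word of the word-separated string, roughly like this (hypothetical words; the real stop word list comes from the downloaded file):

```python
# Minimal illustration of dictionary-based stop word removal.
stopwords = ['is', 'a']
wakati = 'this is a word cloud'
print(' '.join(w for w in wakati.split(' ') if w not in stopwords))  # -> 'this word cloud'
```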
There are many great explanatory articles; I referred to this one: TF-IDF | Qiita
Here's a rough explanation.
tf-idf is the product of tf and idf. In other words: a **large tf-idf** means the word **appears frequently in one document** and **does not appear often in other documents**, which makes it **important for understanding the context of that document**.
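For reference, the usual definition is as follows (scikit-learn, used below, applies a smoothed variant of idf and L2 normalization by default):

```math
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t), \qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
```

where N is the total number of documents and df(t) is the number of documents containing the term t.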
The goal this time is to see the characteristics of the messages posted in a given period (1 week). So what we want to know is **what characterizes the messages of a given period (1 week) relative to all the posts so far**.
Therefore, I calculated tf-idf with:

- **All documents**: all posts so far
- **One document**: the posts from one period (1 week)
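As a toy illustration with a hypothetical two-period corpus (same TfidfVectorizer settings as the script below):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

docs = ['data analysis python data', 'python translation deepl']  # two "periods"
vectorizer = TfidfVectorizer(token_pattern=u'(?u)\\b\\w+\\b')
bow = vectorizer.fit_transform(docs)
# Note: newer scikit-learn versions use get_feature_names_out() instead.
df = pd.DataFrame(bow.toarray(),
                  index=['term_ago_000', 'term_ago_001'],
                  columns=vectorizer.get_feature_names())
print(df.round(3))  # words concentrated in one period score higher there
```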
The processing flow is implemented in the script below.
important_word_extraction.py
import pandas as pd
import json
from datetime import datetime, date, timedelta, timezone
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
JST = timezone(timedelta(hours=+9), 'JST')
#Grouping messages by a specified period
def group_msgs_by_term(df_msgs: pd.DataFrame, term: str) -> dict:
# set term
term_days = 8
if term == 'lm':
term_days = 31
print('group messages every {0} days'.format(term_days))
# analyze timestamp
now_in_sec = (datetime.now(JST) - datetime.fromtimestamp(0, JST)).total_seconds()
interval_days = timedelta(days=term_days)
interval_seconds = interval_days.total_seconds()
oldest_timestamp = df_msgs.min().timestamp
oldest_ts_in_sec = (datetime.fromtimestamp(oldest_timestamp, JST) - datetime.fromtimestamp(0, JST)).total_seconds()
loop_num = (abs(now_in_sec - oldest_ts_in_sec) / interval_seconds) + 1
# extract by term
dict_msgs_by_term = {}
df_tmp = df_msgs
now_tmp = now_in_sec
for i in range(int(loop_num)):
# make current term string
cur_term_s = 'term_ago_{0}'.format(str(i).zfill(3))
print(cur_term_s)
# current messages
df_msgs_cur = df_tmp.query('@now_tmp - timestamp < @interval_seconds')
df_msgs_other = df_tmp.query('@now_tmp - timestamp >= @interval_seconds')
# messages does not exist. break.
if df_msgs_cur.shape[0] == 0:
break
# add current messages to dict
dict_msgs_by_term[cur_term_s] = ' '.join(df_msgs_cur.wakati_msg.dropna().values.tolist())
# update temp value for next loop
now_tmp = now_tmp - interval_seconds
df_tmp = df_msgs_other
return dict_msgs_by_term
# Extract important words based on their tf-idf scores and return them as a dictionary
def extract_important_word_by_key(feature_names: list, bow_df: pd.DataFrame, uids: list) -> dict:
# Look at each row and extract important words (those with high tf-idf scores)
dict_important_words_by_user = {}
for uid, (i, scores) in zip(uids, bow_df.iterrows()):
#Create a table of the user's words and tfidf scores
words_score_tbl = pd.DataFrame()
words_score_tbl['scores'] = scores
words_score_tbl['words'] = feature_names
#Sort in descending order by tfidf score
words_score_tbl = words_score_tbl.sort_values('scores', ascending=False)
words_score_tbl = words_score_tbl.reset_index()
# extract : tf-idf score > 0.001
important_words = words_score_tbl.query('scores > 0.001')
# Create a dictionary for this key, e.g. 'uid0': {'w0': 0.9, 'w1': 0.87}
d = {}
for i, row in important_words.iterrows():
d[row.words] = row.scores
#Add to table only if the user's dictionary has at least one word
if len(d.keys()) > 0:
dict_important_words_by_user[uid] = d
return dict_important_words_by_user
#Extract important words in specified period units
def extraction_by_term(input_root: str, output_root: str, term: str) -> dict:
# ---------------------------------------------
# 1. load messages (processed)
# ---------------------------------------------
print('load msgs (all of history and last term) ...')
msg_fpath = input_root + '/' + 'messages_cleaned_wakati_norm_rmsw.csv'
df_msgs_all = pd.read_csv(msg_fpath)
# ---------------------------------------------
# 2. group messages by term
# ---------------------------------------------
print('group messages by term and save it.')
msgs_grouped_by_term = group_msgs_by_term(df_msgs_all, term)
msg_grouped_fpath = input_root + '/' + 'messages_grouped_by_term.json'
with open(msg_grouped_fpath, 'w', encoding='utf-8') as f:
json.dump(msgs_grouped_by_term, f, ensure_ascii=False, indent=4)
# ---------------------------------------------
# 3. compute tf-idf over all documents
# ---------------------------------------------
print('tfidf vectorizing ...')
# Build a matrix whose columns are the words appearing in all documents and whose rows are the documents (periods); each element holds a tf-idf value
tfidf_vectorizer = TfidfVectorizer(token_pattern=u'(?u)\\b\\w+\\b')
bow_vec = tfidf_vectorizer.fit_transform(msgs_grouped_by_term.values())
bow_array = bow_vec.toarray()
bow_df = pd.DataFrame(bow_array,
index=msgs_grouped_by_term.keys(),
columns=tfidf_vectorizer.get_feature_names())
# ---------------------------------------------
# 5. extract important words based on tf-idf scores
# ---------------------------------------------
print('extract important words ...')
dict_word_score_by_term = extract_important_word_by_key(
tfidf_vectorizer.get_feature_names(),
bow_df, msgs_grouped_by_term.keys())
return dict_word_score_by_term
Words with high scores are drawn large, and words with low scores are drawn small. You can use whatever value you like as the score, such as frequency of appearance or importance.
Official repository: [amueller/word_cloud](https://github.com/amueller/word_cloud) :octocat:
This time, I use the "Homemade Rounded M+" font (the rounded-l-mplus-1c-regular.ttf file referenced in the script below).
In the previous chapter "8. Preprocessing: Extracting important words (tf-idf)", the following JSON file was output.
important_word_tfidf_by_term.json
{
"term_ago_000": {
"data": 0.890021,
"game": 0.780122,
"article": 0.720025,
:
},
"term_ago_001": {
"translation": 0.680021,
"data": 0.620122,
"deepl": 0.580025,
:
},
:
}
Load this and create a Wordcloud image.
Use the WordCloud.generate_from_frequencies() method.
wordcloud_from_score.py
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import json
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
from tqdm import tqdm
import sys
import argparse
def main(input_fname: str):
input_root = '../../data/031_features'
output_root = './wordcloud_by_user' if 'by_user' in input_fname else './wordcloud_by_term'
p = Path(output_root)
if p.exists() is False:
p.mkdir()
# -------------------------------------
# 1. load tf-idf score dictionary
# -------------------------------------
d_word_score_by_user = {}
tfidf_fpath = input_root + '/' + input_fname
with open(tfidf_fpath, 'r', encoding='utf-8') as f:
d_word_score_by_user = json.load(f)
# -------------------------------------
# 2. gen word cloud from score
# -------------------------------------
fontpath = './rounded-l-mplus-1c-regular.ttf'
for uname, d_word_score in tqdm(d_word_score_by_user.items(), desc='word cloud ...'):
# img file name is user.png
uname = str(uname).replace('/', '-')
out_img_fpath = output_root + '/' + uname + '.png'
# gen
wc = WordCloud(
background_color='white',
font_path=fontpath,
width=900, height=600,
collocations=False
)
wc.generate_from_frequencies(d_word_score)
wc.to_file(out_img_fpath)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("input_fname", help="set input file name", type=str)
args = parser.parse_args()
input_fname = args.input_fname
main(input_fname)
- Types of preprocessing in natural language processing and their power | Qiita
Other reference materials (there are many) are summarized here :octocat:.
This time, we used data from a Slack community called the Data Learning Guild. The Data Learning Guild is an online community for data analytics professionals. If you are interested, please check the link below.
Data Learning Guild Official Homepage