[PYTHON] [Natural language processing] I tried to visualize the hot topics this week in the Slack community

About this article

In this article, I'll show you how to use Wordcloud to visualize what topics have been raised in the Slack community over a period of time (1 week here).

The source code can be found here: https://github.com/sota0121/slack-msg-analysis :octocat:

See also: [Natural language processing] I tried to visualize the remarks of each member in the Slack community

Table of contents

  1. Usage and output example
  2. Get messages from Slack
  3. Preprocessing: Message mart table creation
  4. Preprocessing: Cleaning
  5. Preprocessing: Morphological analysis (Janome)
  6. Preprocessing: Normalization
  7. Preprocessing: Stop word removal
  8. Preprocessing: Important word extraction (tf-idf)
  9. Visualization process with Wordcloud
  10. Bonus

* I would like to summarize the preprocessing steps in a separate article in the future

1. Usage and output example

1.1. How to use

For more information, see Getting started in the README. The flow is like this.

  1. Build a virtual environment with docker-compose up -d
  2. Enter the shell with docker exec -it ds-py3 bash
  3. Run run_wordcloud_by_term.sh

1.2. Output example

This is an example of actual output. Each Wordcloud is generated from the messages of a different period.

(anim__.gif: animated Wordcloud output, one frame per period)

2. Get messages from Slack

2.1. Use Slack API

Get Slack API tokens from the official Slack API page.

How to get started with the Slack API is not covered here.

Obtain the following tokens to run the subsequent processing.

  • A token for the Channel API
  • A token for the Users API
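These tokens are read later (in section 2.3) from `conf/local/credentials.json`. A minimal sketch of that file, with placeholder values and key names taken from the code in section 2.3:

credentials.json (sketch)


{
    "channel_api_key": "xoxp-xxxxxxxxxxxx-xxxxxxxxxxxx",
    "user_api_key": "xoxp-xxxxxxxxxxxx-xxxxxxxxxxxx"
}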

2.2. Create a class to get Slack information via API

Here we create a **SlackApp** class that gets Slack information via the API. The acquired information is saved in JSON format as-is, without any processing.

slack_app.py


# A class that uses the Slack API to get the desired information (no processing)
import requests
import json
from tqdm import tqdm
import pandas as pd


class SlackApp:
    ch_list_url = 'https://slack.com/api/channels.list'
    ch_history_url = 'https://slack.com/api/channels.history'
    usr_list_url = 'https://slack.com/api/users.list'

    def __init__(self, ch_api_key, usr_api_key):
        # NEW members
        self.channels_info = []
        self.users_info = []
        self.messages_info = []
        # OLD members
        self.channelInfo = {}  # k: ch_name, v: ch_id
        self.messages_in_chs = {}
        self.userInfo = {}
        self.ch_api_token = str(ch_api_key)
        self.usr_api_token = str(usr_api_key)

    def load_save_channel_info(self, outdir: str):
        #Get channel information via slack API and save to file
        payload = {'token': self.ch_api_token}
        response = requests.get(SlackApp.ch_list_url, params=payload)
        if response.status_code == 200:
            json_data = response.json()
            if 'channels' in json_data.keys():
                self.channels_info = json_data['channels']
            with open(outdir + '/' + 'channel_info.json', 'w', encoding='utf-8') as f:
                json.dump(self.channels_info, f, indent=4, ensure_ascii=False)

    def load_save_user_info(self, outdir: str):
        #Get user information via slack API and save to file
        payload = {'token': self.usr_api_token}
        response = requests.get(SlackApp.usr_list_url, params=payload)
        if response.status_code == 200:
            json_data = response.json()
            if 'members' in json_data.keys():
                self.users_info = json_data['members']
            with open(outdir + '/' + 'user_info.json', 'w', encoding='utf-8') as f:
                json.dump(self.users_info, f, indent=4, ensure_ascii=False)

    def load_save_messages_info(self, outdir: str):
        #Create channel id list
        channel_id_list = []
        for ch in self.channels_info:
            channel_id_list.append(ch['id'])
        # Get the message history of each channel via the Slack API and save to file
        for ch_id in tqdm(channel_id_list, desc='[loading...]'):
            payload = {'token': self.ch_api_token, 'channel': ch_id}
            response = requests.get(SlackApp.ch_history_url, params=payload)
            if response.status_code == 200:
                json_data = response.json()
                msg_in_ch = {}
                msg_in_ch['channel_id'] = ch_id
                if 'messages' in json_data.keys():
                    msg_in_ch['messages'] = json_data['messages']
                else:
                    msg_in_ch['messages'] = ''
                self.messages_info.append(msg_in_ch)
        with open(outdir + '/' + 'messages_info.json', 'w', encoding='utf-8') as f:
            json.dump(self.messages_info, f, indent=4, ensure_ascii=False)        

2.3. Get Slack information

Use the **SlackApp** class created earlier to get the information.

The following three types of information are acquired:

  1. Channel list
  2. Message list (obtained for each channel)
  3. User list

slack_msg_extraction.py


# A script that uses the SlackApp class to get the desired information
import sys
import json
sys.path.append('../../src/d000_utils')  # add the directory where the SlackApp script is stored
import slack_app as sa


def main():
    # -------------------------------------
    # load api token
    # -------------------------------------
    credentials_root = '../../conf/local/'
    credential_fpath = credentials_root + 'credentials.json'
    print('load credential.json ...')
    with open(credential_fpath, 'r') as f:
        credentials = json.load(f)
    # -------------------------------------
    # start slack app
    # -------------------------------------    
    print('start slack app ...')
    app = sa.SlackApp(
        credentials['channel_api_key'],
        credentials['user_api_key']
        )
    outdir = '../../data/010_raw'
    # -------------------------------------
    # get channels info
    # -------------------------------------
    app.load_save_channel_info(outdir)
    # -------------------------------------
    # get user info
    # -------------------------------------
    app.load_save_user_info(outdir)
    # -------------------------------------
    # get msg info
    # -------------------------------------
    app.load_save_messages_info(outdir)


if __name__ == "__main__":
    main()

3. Preprocessing: Message mart table creation

3.1. Message Mart Table Design

The information acquired via the Slack API was saved in JSON format, following the Slack API specifications. Here we format it into table data that is easy to analyze. When designing the tables, keep Tidy Data in mind.

This time, I designed the tables as follows, aiming for the minimum required information plus a little extra.

message mart table

| No | Column Name | Type | Content |
|----|-------------|------|---------|
| 0 | index | int | AUTO INCREMENT |
| 1 | ch_id | str | Channel ID |
| 2 | msg | str | Message string |
| 3 | uid | str | Speaker's user ID |
| 4 | timestamp | datetime | Time of the message |

channels table (bonus)

| No | Column Name | Type | Content |
|----|-------------|------|---------|
| 0 | index | int | AUTO INCREMENT |
| 1 | ch_id | str | Channel ID |
| 2 | ch_name | str | Channel name (as displayed in the Slack UI) |
| 3 | ch_namenorm | str | Normalized channel name |
| 4 | ch_membernum | int | Number of members in the channel |

users table (bonus)

| No | Column Name | Type | Content |
|----|-------------|------|---------|
| 0 | index | int | AUTO INCREMENT |
| 1 | uid | str | User ID |
| 2 | uname | str | User name |
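As a concrete image of the message mart table, here is a one-row sketch (all values are made up; the timestamp is kept as the raw Slack `ts` string, which is how the script in 3.2 stores it):


import pandas as pd

# one record per message; the values below are made up for illustration
df_msgs = pd.DataFrame({
    'ch_id': ['C012AB3CD'],
    'msg': ['I wrote an article about wordcloud'],
    'uid': ['U0X1Y2Z3'],
    'timestamp': ['1577804400.000200'],
})
print(df_msgs)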

3.2. Formatting the raw data into the message mart table

The actual code is below.

  • 【Note】
    • `../../data/010_raw`: where the JSON data obtained from Slack is stored
    • `user_info.json`: user information (JSON) file name
    • `messages_info.json`: message information (JSON) file name (all channels)
    • `channel_info.json`: channel information (JSON) file name

make_msg_mart_table.py


import json
import pandas as pd


def make_user_table(usr_dict: dict) -> pd.DataFrame:
    uid_list = []
    uname_list = []
    for usr_ditem in usr_dict:
        if usr_ditem['deleted'] == True:
            continue
        uid_list.append(usr_ditem['id'])
        uname_list.append(usr_ditem['profile']['real_name_normalized'])
    user_table = pd.DataFrame({'uid': uid_list, 'uname': uname_list})
    return user_table


def make_msg_table(msg_dict: dict) -> pd.DataFrame:
    ch_id_list = []
    msg_list = []
    uid_list = []
    ts_list = []
    for msg_ditem in msg_dict:
        if 'channel_id' in msg_ditem.keys():
            ch_id = msg_ditem['channel_id']
        else:
            continue
        if 'messages' in msg_ditem.keys():
            msgs_in_ch = msg_ditem['messages']
        else:
            continue
        # get message in channel
        for i, msg in enumerate(msgs_in_ch):
            # if msg by bot, continue
            if 'user' not in msg:
                continue
            ch_id_list.append(ch_id)
            msg_list.append(msg['text'])
            uid_list.append(msg['user'])  # bot messages do not have this key
            ts_list.append(msg['ts'])
    df_msgs = pd.DataFrame({
        'ch_id': ch_id_list,
        'msg': msg_list,
        'uid': uid_list,
        'timestamp': ts_list
    })
    return df_msgs


def make_ch_table(ch_dict: dict) -> pd.DataFrame:
    chid_list = []
    chname_list = []
    chnormname_list = []
    chmembernum_list = []
    for ch_ditem in ch_dict:
        chid_list.append(ch_ditem['id'])
        chname_list.append(ch_ditem['name'])
        chnormname_list.append(ch_ditem['name_normalized'])
        chmembernum_list.append(ch_ditem['num_members'])
    ch_table = pd.DataFrame({
        'ch_id': chid_list,
        'ch_name': chname_list,
        'ch_namenorm': chnormname_list,
        'ch_membernum': chmembernum_list
    })
    return ch_table


def main():
    # 1. load user/message/channels
    input_root = '../../data/010_raw'
    user_info_fpath = input_root + '/' + 'user_info.json'
    with open(user_info_fpath, 'r', encoding='utf-8') as f:
        user_info_rawdict = json.load(f)
        print('load ... ', user_info_fpath)
    msg_info_fpath = input_root + '/' + 'messages_info.json'
    with open(msg_info_fpath, 'r', encoding='utf-8') as f:
        msgs_info_rawdict = json.load(f)
        print('load ... ', msg_info_fpath)
    ch_info_fpath = input_root + '/' + 'channel_info.json'
    with open(ch_info_fpath, 'r', encoding='utf-8') as f:
        ch_info_rawdict = json.load(f)
        print('load ... ', ch_info_fpath)
    # 2. make and save tables
    # user
    output_root = '../../data/020_intermediate'
    df_user_info = make_user_table(user_info_rawdict)
    user_tbl_fpath = output_root + '/' + 'users.csv'
    df_user_info.to_csv(user_tbl_fpath, index=False)
    print('save ... ', user_tbl_fpath)
    # msg
    df_msg_info = make_msg_table(msgs_info_rawdict)
    msg_tbl_fpath = output_root + '/' + 'messages.csv'
    df_msg_info.to_csv(msg_tbl_fpath, index=False)
    print('save ... ', msg_tbl_fpath)
    # channel
    df_ch_info = make_ch_table(ch_info_rawdict)
    ch_tbl_fpath = output_root + '/' + 'channels.csv'
    df_ch_info.to_csv(ch_tbl_fpath, index=False)
    print('save ... ', ch_tbl_fpath)


if __name__ == "__main__":
    main()

4. Preprocessing: Cleaning

4.1. Contents of cleaning process

In general, cleaning refers to removing noise. What exactly needs to be done depends on the target data and the purpose. Here, the following processing was performed.

  1. Delete URL strings
  2. Delete mention strings
  3. Delete Unicode emoji
  4. Delete HTML special characters (such as `&gt;`)
  5. Delete code blocks
  6. Delete inline code blocks
  7. Delete the "○○ has joined the channel" messages
  8. Remove other noise specific to this community

4.2. Implementation of cleaning process

cleaning.py


import re
import pandas as pd
import argparse
from pathlib import Path


def clean_msg(msg: str) -> str:
    # sub 'Return and Space'
    result = re.sub(r'\s', '', msg)
    # sub 'url link'
    result = re.sub(r'(<)http.+(>)', '', result)
    # sub 'mention'
    result = re.sub(r'(<)@.+\w(>)', '', result)
    # sub 'reaction'
    result = re.sub(r'(:).+\w(:)', '', result)
    # sub 'html key words'
    result = re.sub(r'(&).+?\w(;)', '', result)
    # sub 'multi lines code block'
    result = re.sub(r'(```).+(```)', '', result)
    # sub 'inline code block'
    result = re.sub(r'(`).+(`)', '', result)
    return result


def clean_msg_ser(msg_ser: pd.Series) -> pd.Series:
    cleaned_msg_list = []
    for i, msg in enumerate(msg_ser):
        cleaned_msg = clean_msg(str(msg))
        if 'Joined the channel' in cleaned_msg:
            continue
        cleaned_msg_list.append(cleaned_msg)
    cleaned_msg_ser = pd.Series(cleaned_msg_list)
    return cleaned_msg_ser


def get_ch_id_from_table(ch_name_parts: list, input_fpath: str) -> list:
    df_ch = pd.read_csv(input_fpath)
    ch_id = []
    for ch_name_part in ch_name_parts:
        for i, row in df_ch.iterrows():
            if ch_name_part in row.ch_name:
                ch_id.append(row.ch_id)
                break
    return ch_id


def main(input_fname: str):
    input_root = '../../data/020_intermediate'
    output_root = input_root
    # 1. load messages.csv (including noise)
    msgs_fpath = input_root + '/' + input_fname
    df_msgs = pd.read_csv(msgs_fpath)
    print('load :{0}'.format(msgs_fpath))
    # 2. Drop Not Target Records
    print('drop records (drop non-target channel\'s messages)')
    non_target_ch_name = ['general', 'Announcement from management']
    non_target_ch_ids = get_ch_id_from_table(non_target_ch_name, input_root + '/' + 'channels.csv')
    print('=== non-target channels below ====')
    print(non_target_ch_ids)
    for non_target_ch_id in non_target_ch_ids:
        df_msgs = df_msgs.query('ch_id != @non_target_ch_id')
    # 3. clean message string list
    ser_msg = df_msgs.msg
    df_msgs.msg = clean_msg_ser(ser_msg)
    # 4. save it
    pin = Path(msgs_fpath)
    msgs_cleaned_fpath = output_root + '/' + pin.stem + '_cleaned.csv'
    df_msgs.to_csv(msgs_cleaned_fpath, index=False)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("input_fname", help="set input file name", type=str)
    args = parser.parse_args()
    input_fname = args.input_fname
    main(input_fname)

Actually, I wanted to remove code blocks only when they actually contained source code, but I couldn't manage it. :sweat:
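As a quick check of what `clean_msg()` does, here is a sketch that assumes it is run next to `cleaning.py`. Note that the `\s` substitution also strips spaces, which is harmless here because the text is word-separated later:


# a sketch: assumes this snippet sits in the same directory as cleaning.py
from cleaning import clean_msg

raw = '<@U012ABC> that article <https://example.com> was helpful :+1:'
print(clean_msg(raw))
# the mention, URL, reaction and whitespace are all removed -> 'thatarticlewashelpful'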

5. Preprocessing: Morphological analysis (Janome)

5.1. What is morphological analysis?

In general, morphological analysis is the process of splitting a sentence into "morphemes". I'll leave the details of morphological analysis to other articles.

Here, the real purpose is word-separation (wakati-gaki).

5.2. What is word-separation?

Roughly speaking, it is the process of converting a sentence into a sequence of space-separated words. For example:

**Example sentence: 「私は野球が好きだ。」 ("I like baseball.")**
**Word-separated: 「私 は 野球 が 好き だ」**

Practically speaking, when you want to handle words that represent the context of a sentence, as we do here, it is preferable to extract only the nouns. So,

**Word-separated (nouns only): 「私 野球」**

is even better.

  • You may be thinking "wouldn't it be better to remove 「私」 ("I") as well?" For that, see the "Stop word removal" chapter.
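Before the full implementation, here is a minimal Janome sketch (independent of the pipeline code below) that word-separates the example sentence:


from janome.tokenizer import Tokenizer

t = Tokenizer()
# wakati=True yields only the surface forms, i.e. the word-separated output
print(' '.join(t.tokenize('私は野球が好きだ。', wakati=True)))
# -> 私 は 野球 が 好き だ 。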

5.3. Implementation of morphological analysis and word-separation

When implementing morphological analysis, we have to decide:

  • which morphological analysis library to use
  • which dictionary data to use

This time, I chose the following.

  • Morphological analysis library: Janome
  • Dictionary data: Janome's default (NEologd is even better for handling new words)

In addition, the parts of speech to extract were chosen by looking at the morphological analysis tool's part-of-speech system and asking which parts of speech are needed to achieve the purpose.

The official mascot, Janome, is cute. (I don't know if Janome is the name)

morphological_analysis.py


from janome.tokenizer import Tokenizer
from tqdm import tqdm
import pandas as pd
import argparse
from pathlib import Path
import sys


exc_part_of_speech = {
    "名詞": ["非自立", "代名詞", "数"]  # noun: non-independent, pronoun, number
}
inc_part_of_speech = {
    "名詞": ["サ変接続", "一般", "固有名詞"],  # noun: suru-verb stem, general, proper noun
}


class MorphologicalAnalysis:

    def __init__(self):
        self.janome_tokenizer = Tokenizer()

    def tokenize_janome(self, line: str) -> list:
        # list of janome.tokenizer.Token
        tokens = self.janome_tokenizer.tokenize(line)
        return tokens

    def exists_pos_in_dict(self, pos0: str, pos1: str, pos_dict: dict) -> bool:
        # Return whether pos0 and pos1 are in pos_dict or not.
        # ** pos = part of speech
        for type0 in pos_dict.keys():
            if pos0 == type0:
                for type1 in pos_dict[type0]:
                    if pos1 == type1:
                        return True
        return False

    def get_wakati_str(self, line: str, exclude_pos: dict,
                       include_pos: dict) -> str:
        '''
        exclude/include_pos looks like this:
        {"名詞": ["非自立", "代名詞", "数"], "形容詞": ["xxx", "yyy"]}
        '''
        tokens = self.janome_tokenizer.tokenize(line, stream=True)  #Generator to save memory
        extracted_words = []
        for token in tokens:
            part_of_speech0 = token.part_of_speech.split(',')[0]
            part_of_speech1 = token.part_of_speech.split(',')[1]
            # check for excluding words
            exists = self.exists_pos_in_dict(part_of_speech0, part_of_speech1,
                                             exclude_pos)
            if exists:
                continue
            # check for including words
            exists = self.exists_pos_in_dict(part_of_speech0, part_of_speech1,
                                             include_pos)
            if not exists:
                continue
            # append the surface form of the token
            extracted_words.append(token.surface)
        # wakati string with extracted words
        wakati_str = ' '.join(extracted_words)
        return wakati_str


def make_wakati_for_lines(msg_ser: pd.Series) -> pd.Series:
    manalyzer = MorphologicalAnalysis()
    wakati_msg_list = []
    for msg in tqdm(msg_ser, desc='[mk wakati msgs]'):
        wakati_msg = manalyzer.get_wakati_str(str(msg), exc_part_of_speech,
                                              inc_part_of_speech)
        wakati_msg_list.append(wakati_msg)
    wakati_msg_ser = pd.Series(wakati_msg_list)
    return wakati_msg_ser


def main(input_fname: str):
    input_root = '../../data/020_intermediate'
    output_root = '../../data/030_processed'
    # 1. load messages_cleaned.csv
    msgs_cleaned_fpath = input_root + '/' + input_fname
    df_msgs = pd.read_csv(msgs_cleaned_fpath)
    # 2. make wakati string by record
    ser_msg = df_msgs.msg
    df_msgs['wakati_msg'] = make_wakati_for_lines(ser_msg)
    # 3. save it
    pin = Path(msgs_cleaned_fpath)
    msgs_wakati_fpath = output_root + '/' + pin.stem + '_wakati.csv'
    df_msgs.to_csv(msgs_wakati_fpath, index=False)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("input_fname", help="set input file name", type=str)
    args = parser.parse_args()
    input_fname = args.input_fname
    # the input file must already be cleaned
    if 'cleaned' not in input_fname:
        print('input file name is invalid.: {0}'.format(input_fname))
        print('input file name must include \'cleaned\'')
        sys.exit(1)
    main(input_fname)

6. Preprocessing: Normalization

6.1. What is normalization?

In the preprocessing for natural language processing, normalization refers to processing like the following. It is also called "name identification".

  1. Unification of character types
    • Unify kana to full-width
    • Unify the alphabet to lowercase, etc.
  2. Number replacement
    • Replace all numbers with 0, etc.
    • (There are few situations in natural language processing where the numbers themselves are important.)
  3. Unification of words using a dictionary
    • e.g. treat "ソニー" and "Sony" as the same word and unify the notation (say, to "Sony")

The world of normalization is deep, so I'll stop here.

6.2. Implementation of normalization

This time, for the sake of simplicity, only the following processing was implemented. It's insanely easy.

  1. Unify the alphabet to lowercase
  2. Replace all numbers with 0
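Before the full script, a quick sketch of what these two rules do to a word-separated string (the same regex as below):


import re

wakati_msg = 'Python3 GPU 2020'
print(re.sub(r'\d+', '0', wakati_msg).lower())
# -> 'python0 gpu 0'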

normalization.py


import re
import pandas as pd
from tqdm import tqdm
import argparse
from pathlib import Path
import sys


def normalize_text(text: str) -> str:
    normalized_text = normalize_number(text)
    normalized_text = lower_text(normalized_text)
    return normalized_text


def normalize_number(text: str) -> str:
    """
    pattern = r'\d+'
    replacer = re.compile(pattern)
    result = replacer.sub('0', text)
    """
    #Replace consecutive numbers with 0
    replaced_text = re.sub(r'\d+', '0', text)
    return replaced_text


def lower_text(text: str) -> str:
    return text.lower()


def normalize_msgs(wktmsg_ser: pd.Series) -> pd.Series:
    normalized_msg_list = []
    for wktstr in tqdm(wktmsg_ser, desc='normalize words...'):
        normalized = normalize_text(str(wktstr))
        normalized_msg_list.append(normalized)
    normalized_msg_ser = pd.Series(normalized_msg_list)
    return normalized_msg_ser


def main(input_fname: str):
    input_root = '../../data/030_processed'
    output_root = input_root
    # 1. load wakati messages
    msgs_fpath = input_root + '/' + input_fname
    df_msgs = pd.read_csv(msgs_fpath)
    # 2. normalize wakati_msg (update)
    ser_msg = df_msgs.wakati_msg
    df_msgs.wakati_msg = normalize_msgs(ser_msg)
    # 3. save it
    pin = Path(msgs_fpath)
    msgs_ofpath = output_root + '/' + pin.stem + '_norm.csv'
    df_msgs.to_csv(msgs_ofpath, index=False)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("input_fname", help="set input file name", type=str)
    args = parser.parse_args()
    input_fname = args.input_fname
    # the input file must already be word-separated (wakati)
    if 'wakati' not in input_fname:
        print('input file name is invalid.: {0}'.format(input_fname))
        print('input file name must include \'wakati\'')
        sys.exit(1)
    main(input_fname)

7. Preprocessing: Stop word removal

7.1. What is Stopword Removal?

"Stop word removal" is the process of removing stop words. (Too much as it is ...)

So what is a "** stop word **"?

According to the goo Japanese dictionary:

Words that are excluded from full-text search on their own because they are too common. In Japanese this refers to words like 「は」, 「の」, 「です」; in English, a, the, of, and so on. Also called stopwords.

Natural language processing tasks often have the purpose of "understanding the context of a sentence." Stopwords are unnecessary for this purpose and should be removed.

7.2. How to decide the stop word?

There are two main approaches. This time, we use method 1.

  1. Method using a dictionary (**← adopted this time**)
  2. Method using frequency of appearance

For the dictionary data, I use the SlothLib Japanese stop word list (the URL is in the code below).

There are several reasons for choosing method 1.

  • When using a dictionary, an existing dictionary makes it easy to get started.
  • When using appearance frequency, you need to extract only "words that appear so frequently in **general** conversation that they should be removed". If you determine the frequency using only the Slack data at hand, **there is a risk of unintentionally removing words related to the very topics that are being discussed**, so separate data would have to be prepared and aggregated. ...which is a bit of a hassle.

7.3. Implementation of stopword removal

Remove the words registered in the dictionary data introduced in the previous section. In addition, the following strings are also removed; they are ones I found noisy while tuning.

  • 「-」 and 「ー」
  • 「w」
  • 「m」
  • 「笑」 ("lol")

stopword_removal.py


import pandas as pd
import urllib.request
from pathlib import Path
from tqdm import tqdm
import argparse
from pathlib import Path
import sys


def maybe_download(path: str):
    stopword_def_page_url = 'http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt'
    p = Path(path)
    if p.exists():
        print('File already exists.')
    else:
        print('downloading stop words definition ...')
        # Download the file from `url` and save it locally under `file_name`:
        urllib.request.urlretrieve(stopword_def_page_url, path)
    # additional stop words (strings found noisy while tuning)
    sw_added_list = [
        '-',
        'ー',
        'w',
        'W',
        'm',
        '笑'
    ]
    sw_added_str = '\n'.join(sw_added_list)
    with open(path, 'a') as f:
        print(sw_added_str, file=f)


def load_sw_definition(path: str) -> list:
    with open(path, 'r', encoding='utf-8') as f:
        lines = f.read()
        line_list = lines.split('\n')
        line_list = [x for x in line_list if x != '']
    return line_list


def remove_sw_from_text(wktstr: str, stopwords: list) -> str:
    words_list = wktstr.split(' ')
    words_list_swrm = [x for x in words_list if x not in stopwords]
    swremoved_str = ' '.join(words_list_swrm)
    return swremoved_str


def remove_sw_from_msgs(wktmsg_ser: pd.Series, stopwords: list) -> pd.Series:
    swremved_msg_list = []
    for wktstr in tqdm(wktmsg_ser, desc='remove stopwords...'):
        removed_str = remove_sw_from_text(str(wktstr), stopwords)
        swremved_msg_list.append(removed_str)
    swremved_msg_ser = pd.Series(swremved_msg_list)
    return swremved_msg_ser


def main(input_fname: str):
    input_root = '../../data/030_processed'
    output_root = input_root
    # 1. load stop words
    sw_def_fpath = 'stopwords.txt'
    maybe_download(sw_def_fpath)
    stopwords = load_sw_definition(sw_def_fpath)
    # 2. load messages
    msgs_fpath = input_root + '/' + input_fname
    df_msgs = pd.read_csv(msgs_fpath)
    # 3. remove stop words
    ser_msg = df_msgs.wakati_msg
    df_msgs.wakati_msg = remove_sw_from_msgs(ser_msg, stopwords)
    # 4. save it
    pin = Path(msgs_fpath)
    msgs_ofpath = output_root + '/' + pin.stem + '_rmsw.csv'
    df_msgs.to_csv(msgs_ofpath, index=False)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("input_fname", help="set input file name", type=str)
    args = parser.parse_args()
    input_fname = args.input_fname
    # the input file must already be normalized
    if 'norm' not in input_fname:
        print('input file name is invalid.: {0}'.format(input_fname))
        print('input file name must include \'norm\'')
        sys.exit(1)
    main(input_fname)

8. Preprocessing: Important word extraction (tf-idf)

8.1. What is tf-idf?

There are many great explanatory articles; I referred to this one: TF-IDF | Qiita

Here's a rough explanation.

  • tf (Term Frequency)
    • the frequency with which a word appears in a document
    • if it is large, the word appears often in that document
  • idf (Inverse Document Frequency)
    • the reciprocal of the proportion of documents (among all documents) in which the word appears
    • if it is large, the word rarely appears in other documents

tf-idf is the product of tf and idf. In other words:

**large tf-idf** = **appears frequently in one document** and **rarely appears in other documents** = **important for understanding the context of that document**
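To make the intuition concrete, here is a toy sketch using scikit-learn's TfidfVectorizer with the same settings as the implementation below (the three "documents" are made up):


from sklearn.feature_extraction.text import TfidfVectorizer

# three made-up, already word-separated "documents"
docs = [
    'data analysis python data',
    'python game unity',
    'translation api deepl translation',
]
vectorizer = TfidfVectorizer(token_pattern=u'(?u)\\b\\w+\\b')
tfidf = vectorizer.fit_transform(docs).toarray()
# rows = documents, columns = words, values = tf-idf scores
# (get_feature_names() matches the pre-1.2 scikit-learn API used in this article)
for doc_id, row in enumerate(tfidf):
    top3 = sorted(zip(vectorizer.get_feature_names(), row), key=lambda x: -x[1])[:3]
    print(doc_id, top3)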

8.2. Implementation of word scoring processing by tf-idf

8.2.1. What should be a document / all documents?

The purpose this time is to see the characteristics of the messages in a certain period (one week). So what I want to know is **what characterizes the messages of that period relative to all the posts so far**.

Therefore, I calculated tf-idf with:

  • **All documents**: all posts so far
  • **One document**: the set of posts in a given period (one week)

8.2.2. Implementation

The processing flow is roughly as follows.

  1. Read the message mart table (preprocessed)
  2. Group the messages by period
    • divide the past data into 7-day chunks, counting back from the time the process is run
  3. Calculate tf-idf, treating each group of messages as one document
  4. Extract the words whose tf-idf score is above a threshold (output as a dictionary)

important_word_extraction.py


import pandas as pd
import json
from datetime import datetime, date, timedelta, timezone
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
JST = timezone(timedelta(hours=+9), 'JST')

# Group messages by the specified period
def group_msgs_by_term(df_msgs: pd.DataFrame, term: str) -> dict:
    # set term
    term_days = 8
    if term == 'lm':
        term_days = 31
    print('group messages every {0} days'.format(term_days))
    # analyze timestamp
    now_in_sec = (datetime.now(JST) - datetime.fromtimestamp(0, JST)).total_seconds()
    interval_days = timedelta(days=term_days)
    interval_seconds = interval_days.total_seconds()
    oldest_timestamp = df_msgs.min().timestamp
    oldest_ts_in_sec = (datetime.fromtimestamp(oldest_timestamp, JST) - datetime.fromtimestamp(0, JST)).total_seconds()
    loop_num = (abs(now_in_sec - oldest_ts_in_sec) / interval_seconds) + 1
    # extract by term
    dict_msgs_by_term = {}
    df_tmp = df_msgs
    now_tmp = now_in_sec
    for i in range(int(loop_num)):
        # make current term string
        cur_term_s = 'term_ago_{0}'.format(str(i).zfill(3))
        print(cur_term_s)
        # current messages
        df_msgs_cur = df_tmp.query('@now_tmp - timestamp < @interval_seconds')
        df_msgs_other = df_tmp.query('@now_tmp - timestamp >= @interval_seconds')
        # messages does not exist. break.
        if df_msgs_cur.shape[0] == 0:
            break
        # add current messages to dict
        dict_msgs_by_term[cur_term_s] = ' '.join(df_msgs_cur.wakati_msg.dropna().values.tolist())
        # update temp value for next loop
        now_tmp = now_tmp - interval_seconds
        df_tmp = df_msgs_other
    return dict_msgs_by_term

# Extract important words based on their tf-idf scores and return them as a dictionary
def extract_important_word_by_key(feature_names: list, bow_df: pd.DataFrame, uids: list) -> dict:
    # Look at each row and extract the important words (those with high tf-idf scores)
    dict_important_words_by_user = {}
    for uid, (i, scores) in zip(uids, bow_df.iterrows()):
        #Create a table of the user's words and tfidf scores
        words_score_tbl = pd.DataFrame()
        words_score_tbl['scores'] = scores
        words_score_tbl['words'] = feature_names
        #Sort in descending order by tfidf score
        words_score_tbl = words_score_tbl.sort_values('scores', ascending=False)
        words_score_tbl = words_score_tbl.reset_index()
        # extract : tf-idf score > 0.001
        important_words = words_score_tbl.query('scores > 0.001')
        # Create a dictionary for this key, e.g. 'uid0': {'w0': 0.9, 'w1': 0.87}
        d = {}
        for i, row in important_words.iterrows():
            d[row.words] = row.scores
        #Add to table only if the user's dictionary has at least one word
        if len(d.keys()) > 0:
            dict_important_words_by_user[uid] = d
    return dict_important_words_by_user

#Extract important words in specified period units
def extraction_by_term(input_root: str, output_root: str, term: str) -> dict:
    # ---------------------------------------------
    # 1. load messages (processed)
    # ---------------------------------------------
    print('load msgs (all of history and last term) ...')
    msg_fpath = input_root + '/' + 'messages_cleaned_wakati_norm_rmsw.csv'
    df_msgs_all = pd.read_csv(msg_fpath)
    # ---------------------------------------------
    # 2. group messages by term
    # ---------------------------------------------
    print('group messages by term and save it.')
    msgs_grouped_by_term = group_msgs_by_term(df_msgs_all, term)
    msg_grouped_fpath = input_root + '/' + 'messages_grouped_by_term.json'
    with open(msg_grouped_fpath, 'w', encoding='utf-8') as f:
        json.dump(msgs_grouped_by_term, f, ensure_ascii=False, indent=4)
    # ---------------------------------------------
    # 3. calculate tf-idf over all documents
    # ---------------------------------------------
    print('tfidf vectorizing ...')
    # Build a matrix whose columns are the words in all documents and whose rows are the documents; each element holds a tf-idf value
    tfidf_vectorizer = TfidfVectorizer(token_pattern=u'(?u)\\b\\w+\\b')

    bow_vec = tfidf_vectorizer.fit_transform(msgs_grouped_by_term.values())
    bow_array = bow_vec.toarray()
    bow_df = pd.DataFrame(bow_array,
                        index=msgs_grouped_by_term.keys(),
                        columns=tfidf_vectorizer.get_feature_names())
    # ---------------------------------------------
    # 5. extract important words based on tf-idf
    # ---------------------------------------------
    print('extract important words ...')
    dict_word_score_by_term = extract_important_word_by_key(
        tfidf_vectorizer.get_feature_names(),
        bow_df, msgs_grouped_by_term.keys())
    return dict_word_score_by_term
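The entry point that calls `extraction_by_term()` and writes the JSON file used in section 9 is omitted above. A minimal sketch of what it could look like, appended to the same script (the output directory and the 'lw' term value are assumptions based on the rest of this article):


# a minimal sketch, assuming this is appended to important_word_extraction.py above
if __name__ == "__main__":
    input_root = '../../data/030_processed'
    output_root = '../../data/031_features'  # assumed: the directory read in section 9
    dict_word_score_by_term = extraction_by_term(input_root, output_root, 'lw')
    out_fpath = output_root + '/' + 'important_word_tfidf_by_term.json'
    with open(out_fpath, 'w', encoding='utf-8') as f:
        json.dump(dict_word_score_by_term, f, ensure_ascii=False, indent=4)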

9. Visualization process with Wordcloud

9.1. What is Wordcloud?

Wordcloud draws words with high scores large and words with low scores small. You are free to choose what the score represents, such as "appearance frequency" or "importance".

Official repository: amueller / word_cloud: octocat:

9.2. Prepare Wordcloud fonts

This time, I use this font: Homemade Rounded M+.

9.3. Implementation of Wordcloud

In the previous chapter, "8. Preprocessing: Important word extraction (tf-idf)", the following JSON file was output.

important_word_tfidf_by_term.json


{
  "term_ago_000": {
    "data": 0.890021,
    "game": 0.780122,
    "article": 0.720025,
    :
  },
  "term_ago_001": {
    "translation": 0.680021,
    "data": 0.620122,
    "deepl": 0.580025,
    :
  },
  :
}

Load this file and create a Wordcloud image using the WordCloud.generate_from_frequencies() method.
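A minimal sketch of that call, using the scores from the JSON example above (`font_path` is omitted here because these sample words are ASCII; Japanese words need the font prepared in 9.2):


from wordcloud import WordCloud

# word -> tf-idf score, taken from the "term_ago_000" entry above
scores = {'data': 0.890021, 'game': 0.780122, 'article': 0.720025}
wc = WordCloud(background_color='white', width=900, height=600)
wc.generate_from_frequencies(scores)
wc.to_file('term_ago_000.png')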

wordcloud_from_score.py


from wordcloud import WordCloud
import matplotlib.pyplot as plt
import json
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
from tqdm import tqdm
import sys
import argparse


def main(input_fname: str):
    input_root = '../../data/031_features'
    output_root = './wordcloud_by_user' if 'by_user' in input_fname else './wordcloud_by_term'
    p = Path(output_root)
    if p.exists() is False:
        p.mkdir()
    # -------------------------------------
    # 1. load tf-idf score dictionary
    # -------------------------------------
    d_word_score_by_user = {}
    tfidf_fpath = input_root + '/' + input_fname
    with open(tfidf_fpath, 'r', encoding='utf-8') as f:
        d_word_score_by_user = json.load(f)
    # -------------------------------------
    # 2. gen word cloud from score
    # -------------------------------------
    fontpath = './rounded-l-mplus-1c-regular.ttf'
    for uname, d_word_score in tqdm(d_word_score_by_user.items(), desc='word cloud ...'):
        # img file name is user.png
        uname = str(uname).replace('/', '-')
        out_img_fpath = output_root + '/' + uname + '.png'
        # gen
        wc = WordCloud(
            background_color='white',
            font_path=fontpath,
            width=900, height=600,
            collocations=False
            )
        wc.generate_from_frequencies(d_word_score)
        wc.to_file(out_img_fpath)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("input_fname", help="set input file name", type=str)
    args = parser.parse_args()
    input_fname = args.input_fname
    main(input_fname)

10. Bonus

Articles that I especially referred to

  • Types of preprocessing in natural language processing and their power | Qiita

Other reference materials (quite a lot of them) are summarized here :octocat:.

Promotion

This time, we are using data from the Slack community called Data Learning Guild. The Data Learning Guild is an online community of data analytics talent. If you are interested, please check here.

Data Learning Guild Official Homepage

Data Learning Guild 2019 Advent Calendar
