I tried ranking the most frequent words in a LINE chat with Python

This is my first post on Qiita. I'm glad I managed to finish it without giving up halfway ...

Introduction

I make LINE stickers (stamps) as a hobby. When I set out to make a niche sticker set that I would use with only one specific person, I thought: "If I could rank the words I use most often in chats with this person, couldn't I use that ranking to design the stickers?"

By the way, I chose Python simply because I wanted to try a popular language.

Environment

Preparing the data for analysis

1. Export the chat content from LINE

You can export the chat content as a text file by tapping the "≡" icon at the upper right of the LINE chat screen and selecting [Others] ⇒ [Send Talk History].

2. Make the chat history usable as data

If you simply export the chat content to a text file, it will be in the following format.

sample.txt


[LINE]Talk history with 〇〇
Save date: 2020/10/19 22:31

2015/10/10(Sat)
1:04 〇〇 Good night!
6:03 △△ Good morning!
6:33 〇〇 Good morning(*´-`)
・ ・ ・
・ ・
・

The file contains a lot of data that the analysis does not need, such as dates, times, LINE display names, emoticons, and spaces, so delete them. The LINE names are fixed strings, so you can delete them directly with the replace function. The dates and times appear in various patterns, so delete those with regular expressions.

sample.txt


Good night! Good morning! Good morning·········

This completes the data to be read by the program.
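The cleanup described above could be sketched roughly as follows. This is only a minimal illustration, not the script actually used for this article: the function name clean_talk_history, the display names "Taro" and "Hanako", and the exact header wording are assumptions, and emoticon removal is left out. Check the patterns against your own export before relying on them.

```python
import re

def clean_talk_history(raw_text, names):
    """Strip headers, date lines, time stamps, and speaker names from a LINE export.

    `names` is a list of LINE display names to remove (hypothetical values below).
    """
    kept = []
    for line in raw_text.splitlines():
        line = line.strip()
        # Skip blank lines, the "[LINE]..." title line, and the "Save date" line
        if not line or line.startswith("[LINE]") or line.startswith("Save date"):
            continue
        # Skip date separator lines such as "2015/10/10(Sat)"
        if re.fullmatch(r"\d{4}/\d{1,2}/\d{1,2}\(.+?\)", line):
            continue
        # Remove a leading time stamp such as "1:04" or "22:31"
        line = re.sub(r"^\d{1,2}:\d{2}\s*", "", line)
        # Remove the speaker name at the start of the line (plain string handling)
        for name in names:
            if line.startswith(name):
                line = line[len(name):]
                break
        kept.append(line.strip())
    # Join without separators so that only the message text remains
    return "".join(kept)

raw = """[LINE]Talk history with Hanako
Save date: 2020/10/19 22:31

2015/10/10(Sat)
1:04 Taro Good night!
6:03 Hanako Good morning!
"""
cleaned = clean_talk_history(raw, ["Taro", "Hanako"])
print(cleaned)
```

Writing the result to a file with open(path, "w", encoding="utf-8") would give the input that the program below reads.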

Code

The full program is as follows.

sample.py


import MeCab as mc
from collections import Counter
import sys
import matplotlib.pyplot as plt
import seaborn as sb

# Get the input file path from the command-line arguments
input_file_name = sys.argv[1]

# Split the text into phrase-like units using MeCab
def mecab_analysis(text):
    m = mc.Tagger('')
    m_result = m.parse(text).splitlines()
    m_result = m_result[:-1]  # drop the trailing "EOS" line
    # MeCab with the IPADIC dictionary reports part-of-speech tags in Japanese,
    # so the comparisons below must use the Japanese tag names.
    # Tags that start a new unit: noun, verb, prefix, adverb, interjection,
    # adjective, adjectival verb, adnominal
    break_pos = ['名詞','動詞','接頭詞','副詞','感動詞','形容詞','形容動詞','連体詞']
    wakachi = ['']
    afterPrepos = False
    afterSahenNoun = False
    for v in m_result:
        if '\t' not in v: continue
        surface = v.split('\t')[0]
        pos = v.split('\t')[1].split(',')
        pos_detail = ','.join(pos[1:4])
        # Decide whether this token is attached to the previous unit
        noBreak = pos[0] not in break_pos
        noBreak = noBreak or '接尾' in pos_detail  # suffix
        noBreak = noBreak or (pos[0]=='動詞' and 'サ変接続' in pos_detail)  # verb with sa-hen connection
        noBreak = noBreak or '非自立' in pos_detail  # non-independent
        noBreak = noBreak or afterPrepos
        noBreak = noBreak or (afterSahenNoun and pos[0]=='動詞' and pos[4]=='サ変・スル')  # "suru" after a sa-hen noun
        if not noBreak:
            wakachi.append("")
        wakachi[-1] += surface
        afterPrepos = pos[0]=='接頭詞'
        afterSahenNoun = 'サ変接続' in pos_detail
    if wakachi[0] == '': wakachi = wakachi[1:]
    return wakachi

# Plot the most frequent words as a bar chart
def show_data():
    # Set context and font in one call; a second sb.set() call would reset the context to its default
    sb.set(context="talk", font='Yu Gothic')
    fig, ax = plt.subplots(figsize=(8, 8))
    # Read the whole input file and close it when done
    with open(input_file_name, "r", encoding="utf-8") as f:
        text = f.read()
    words = mecab_analysis(text)
    counter = Counter(words)
    # For the time being, show the top 10
    sb.countplot(y=words, order=[i[0] for i in counter.most_common(10)], ax=ax)
    plt.show()

def main():
    show_data()

if __name__ == '__main__':
    main()

For the word-splitting logic in mecab_analysis(text), I referred to the following article:
Separate Japanese into phrase units [Python] [MeCab]
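To make the parsing step inside mecab_analysis easier to follow, here is a small sketch of how one line of MeCab output is split into its surface form and part-of-speech fields. It assumes the IPADIC dictionary layout ("surface<TAB>comma-separated features", with part-of-speech tags in Japanese). The sample line is written by hand for illustration rather than captured from a real run, and split_mecab_line is a hypothetical helper name.

```python
# One line in the IPADIC output layout (hand-written sample, not real MeCab output)
sample_line = "勉強\t名詞,サ変接続,*,*,*,*,勉強,ベンキョウ,ベンキョー"

def split_mecab_line(line):
    """Return (surface, pos, pos_detail) for one 'surface<TAB>features' line."""
    surface, features = line.split("\t", 1)
    fields = features.split(",")
    # fields[0] is the main part of speech; fields[1:4] are its sub-categories
    return surface, fields[0], ",".join(fields[1:4])

surface, pos, pos_detail = split_mecab_line(sample_line)
print(surface, pos, pos_detail)
```

The function joins the sub-categories into one string so that a check such as "'サ変接続' in pos_detail" works no matter which of the three sub-category slots holds the tag.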

Now let's output the ranking. First, prepare the data you want to rank, as follows.

rank_sample.txt


Good morning. Hello. Goodbye. Good morning. Hello. Goodbye. Good morning. Hello. Goodbye. Good morning. Hello. Goodbye. Good morning. Hello. Goodbye. Good morning. Hello. Goodbye. Good morning. Hello. Goodbye. Good morning. Hello. Goodbye. It's nice weather today, isn't it? I slept well today. It's bad weather today.

Then run python [program path] [data file path] on the command line, and the output will be as follows.

Figure_1.png

In this way, the words are displayed in order from the top of the ranking.
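If you only need the numbers and not the chart, the same ranking can be printed as text. The following is a minimal sketch that uses only collections.Counter; the helper name top_words and the short word list are made up for illustration and stand in for the list returned by mecab_analysis.

```python
from collections import Counter

def top_words(words, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    return Counter(words).most_common(n)

# Hypothetical word list standing in for the output of mecab_analysis()
words = ["Good morning", "Hello", "Good morning", "Goodbye", "Hello", "Good morning"]

for rank, (word, count) in enumerate(top_words(words, 3), start=1):
    print(f"{rank}. {word} ({count})")
```

Printing the counts this way is also a quick check that the bar order in the chart matches the actual frequencies.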

Finally

This time I only wanted to see a ranking of frequently used words, so I expected it to be quick, but as a first-time Python user I struggled because of my lack of knowledge. Still, I finally got to use Python, which I had been interested in for a long time, and since I hadn't written any code at work for a while, I learned a lot. That said, I could hardly understand the natural language processing behind MeCab, which I used for the frequent-word extraction; for now, all I can say is that I "used it". Since I've come this far, I'd like to take this opportunity to study natural language processing.
