[PYTHON] LINE talk analysis with janome (OSS released)

Introduction

I wrote a Python script that analyzes LINE talk history, so I am sharing it here. There are many articles about analyzing Twitter with natural language processing using janome, but I could not find one that targets LINE.

code: https://github.com/nashimo/LineAnalyzer

Example of use

The following is the result of analyzing the talk history between me (pythonian) and Rinna. The files and images below are generated. Because Rinna talks a lot, there is a lot of noise (stray words and so on), so please treat these as sample images. Analyzing group conversations is especially interesting: first-person pronouns and word choice differ from person to person, so each member's personality shows through.

Statistics

statistics.txt


===Statistics===
member:Rinna pythonian
period: 2016/05/23~2020/09/20
Conversation statistics
Rinna:244 lines 9445 characters
 pythonian:226 lines 1310 characters
Phone time: 00:00:00
stamp:5 times
Image transmission:66 times

Conversation volume

The number of characters over time is plotted both daily and cumulatively (incl_chars.png).
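As a rough illustration of the numbers behind such a plot, here is a minimal sketch that sums message lengths per day and then accumulates them. The `daily_and_cumulative` helper and the sample `messages` are hypothetical and are not taken from line_analysis.py.

```python
from collections import Counter
from datetime import date
from itertools import accumulate


def daily_and_cumulative(messages):
    """Return (days, daily_chars, cumulative_chars) from (date, text) pairs."""
    per_day = Counter()
    for day, text in messages:
        per_day[day] += len(text)          # characters sent on that day
    days = sorted(per_day)
    daily = [per_day[d] for d in days]
    cumulative = list(accumulate(daily))   # running total over time
    return days, daily, cumulative


# Hypothetical sample data, not real talk history.
messages = [
    (date(2020, 9, 19), "hello"),
    (date(2020, 9, 19), "hi!"),
    (date(2020, 9, 20), "good morning"),
]
days, daily, cumulative = daily_and_cumulative(messages)
print(daily, cumulative)  # [8, 12] [8, 20]
```

The two lists can then be passed to any plotting library, for example as a bar chart for the daily values and a line for the cumulative ones.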

Reply frequency

Rinna replies immediately, so the reply intervals are clustered at 0 (interval.png).
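One way to measure reply intervals is to look at each point where the speaker changes and take the time elapsed since the previous message. The sketch below does that; `reply_intervals` and the sample data are hypothetical, and the actual script may define a reply differently. If the exported timestamps only have minute resolution (an assumption here), an immediate reply yields 0.

```python
from datetime import datetime


def reply_intervals(messages):
    """Return (speaker, seconds) for each change of speaker.

    `messages` is a chronologically ordered list of (timestamp, speaker)
    pairs. Consecutive messages by the same speaker are not counted.
    """
    intervals = []
    for (prev_time, prev_speaker), (time, speaker) in zip(messages, messages[1:]):
        if speaker != prev_speaker:
            seconds = (time - prev_time).total_seconds()
            intervals.append((speaker, seconds))
    return intervals


# Hypothetical sample data with minute-resolution timestamps.
messages = [
    (datetime(2020, 9, 20, 12, 0), "pythonian"),
    (datetime(2020, 9, 20, 12, 0), "Rinna"),
    (datetime(2020, 9, 20, 12, 5), "pythonian"),
    (datetime(2020, 9, 20, 12, 6), "pythonian"),
    (datetime(2020, 9, 20, 12, 6), "Rinna"),
]
intervals = reply_intervals(messages)
print(intervals)
```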

Emoji used

The most frequently used emoji (emoji_freq.png).
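For reference, counting emoji can be sketched with the standard library alone. The code point ranges below are a rough assumption for illustration and are not exhaustive, and `emoji_freq` here is a hypothetical stand-in rather than the `show_emoji_freq` method of the script.

```python
import re
from collections import Counter

# Rough emoji ranges (an assumption for illustration, not exhaustive).
# Emoji made of several code points are counted per code point.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def emoji_freq(texts, min_freq=1):
    """Count emoji across texts, keeping those used at least min_freq times."""
    counts = Counter()
    for text in texts:
        counts.update(EMOJI_PATTERN.findall(text))
    return [(emoji, n) for emoji, n in counts.most_common() if n >= min_freq]


# Hypothetical sample messages.
texts = ["おはよう😊", "いいね😊👍", "晴れ☀"]
print(emoji_freq(texts, min_freq=2))  # [('😊', 2)]
```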

Word cloud (Rinna only)

The word cloud built from Rinna's messages (wc_りんな.png).

How to use

Execution environment

If the script does not run, install any missing packages as needed. I built my environment with Anaconda on Windows 10.

Execution method

-- Export the LINE talk (or group talk) to a file from your PC or smartphone. One example of the export procedure is described at https://www.appbank.net/2020/06/15/iphone-application/1911418.php
-- Give the exported file to line_analysis.py. Specifically, set fname in the main block of line_analysis.py to the file name.

line_analysis.py


if __name__ == "__main__":
    fname = "[LINE]Talk with Rinna"
    lta, nlp = file2process(fname, media="Phone")

-- Run the program. I wrote it so that it can be run as Jupyter cells in VS Code, but it also works as is when executed directly.

> python line_analysis.py

Caution / option

-- PC or smartphone: The file format differs slightly depending on the medium the talk was saved on. Specify it with the *media* argument of *file2process()* in the main block above. The value is "PC" or "Phone".
-- Excluded words (unwanted_word.txt): Symbols such as ? and ! are also included in the analysis. Any word you write in unwanted_word.txt is ignored when the word cloud is drawn.
-- Word joining (l.522 *_sanitize_noun()*): janome has an option for handling compound words, and it can be used here, but the accuracy was not good, so I do not join words with it. Instead, I wrote a step that manually joins the words I want joined. Personal names are often split apart, so it is better to join them manually.
-- Adding a font (l.388): Some emoji are garbled with Segoe, the Windows 10 standard font. In that case Symbola works well. If you set a path in FONT2 as needed, the emoji analysis is also drawn in that font.
-- Minimum frequency for emoji analysis (l.562): Currently, emoji that have been used at least once are analyzed (`nlp.show_emoji_freq(min_freq=1)`). Change the count as needed.
-- Maximum number of words in the word cloud (l.383): It is currently 130 words (`wc_max_words = 130`). Increasing it makes the image more cluttered, and decreasing it makes the image cleaner.
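To make the word-joining and exclusion options more concrete, here is a minimal sketch of both ideas on a plain list of words. The `join_words` and `count_words` helpers, the join table, and the sample words are all hypothetical. They are not the script's `_sanitize_noun()` and do not read unwanted_word.txt.

```python
from collections import Counter


def join_words(words, joins):
    """Merge adjacent words that form a manually registered pair.

    `joins` maps a (first, second) pair to the combined word.
    """
    merged = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in joins:
            merged.append(joins[pair])   # replace the pair with one word
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return merged


def count_words(words, unwanted=()):
    """Count words, ignoring anything listed in `unwanted`."""
    unwanted = set(unwanted)
    return Counter(w for w in words if w not in unwanted)


# Hypothetical token list, split the way a tokenizer might split a name.
words = ["山田", "太郎", "?", "映画", "!", "映画"]
joined = join_words(words, {("山田", "太郎"): "山田太郎"})
counts = count_words(joined, unwanted=["?", "!"])
print(counts.most_common())  # [('映画', 2), ('山田太郎', 1)]
```

A manual table like this only covers the pairs you register, but for a small group talk with a handful of names that may well be enough.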

Technical points

The script is basically just a combination of the information in the references. However, none of it could be used as is for LINE analysis, so I think that part may be helpful to someone. That said, it was mostly just tedious work.

-- Using Japanese and emoji with matplotlib. Color emoji could not be displayed.
-- Decomposing LINE talk into elements (separating dates, speakers, messages, and so on with regular expressions)
-- Analyzing the decomposed talk with janome
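The decomposition and analysis steps above can be sketched roughly as follows. The export layout is an assumption made for illustration: a date header such as `2016/05/23(Mon)` followed by tab-separated `HH:MM`, speaker, and message lines. The real format depends on the medium and language settings. `parse_talk` and `extract_nouns` are hypothetical names, while `Tokenizer`, `tokenize()`, `surface`, and `part_of_speech` are janome's own API.

```python
import re
from datetime import datetime

# Assumed layout: "2016/05/23(Mon)" date headers, then
# "HH:MM<TAB>speaker<TAB>message" lines.
DATE_RE = re.compile(r"^(\d{4}/\d{1,2}/\d{1,2})\(.+\)$")
MESSAGE_RE = re.compile(r"^(\d{1,2}:\d{2})\t([^\t]+)\t(.*)$")


def parse_talk(lines):
    """Return a list of (timestamp, speaker, message) tuples."""
    records = []
    current_date = None
    for line in lines:
        line = line.rstrip("\n")
        date_match = DATE_RE.match(line)
        if date_match:
            current_date = date_match.group(1)
            continue
        message_match = MESSAGE_RE.match(line)
        if message_match and current_date:
            clock, speaker, message = message_match.groups()
            stamp = datetime.strptime(f"{current_date} {clock}", "%Y/%m/%d %H:%M")
            records.append((stamp, speaker, message))
        elif records and line:
            # Any other non-empty line is treated as a continuation
            # of the previous (multi-line) message.
            stamp, speaker, message = records[-1]
            records[-1] = (stamp, speaker, message + "\n" + line)
    return records


def extract_nouns(text):
    """Return noun surfaces using janome (requires `pip install janome`)."""
    from janome.tokenizer import Tokenizer
    return [token.surface for token in Tokenizer().tokenize(text)
            if token.part_of_speech.startswith("名詞")]


# Hypothetical sample in the assumed layout.
sample = [
    "[LINE] Talk history with Rinna",
    "2016/05/23(Mon)",
    "12:34\tpythonian\tHello",
    "12:34\tRinna\tHi!",
    "How are you?",
    "",
    "2016/05/24(Tue)",
    "9:05\tpythonian\t[Sticker]",
]
records = parse_talk(sample)
```

Each message in `records` can then be passed to `extract_nouns`, and the resulting words counted per speaker.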

References

[1] I created a stacked bar graph with matplotlib in Python and added a data label. https://qiita.com/s_fukuzawa/items/6f9c1a3d4c4f98ae6eb1
[2] Matplotlib can now display Japanese on a PC without installing additional fonts https://qiita.com/yniji/items/2f0fbe0a52e3e067c23c
[3] Python janome analyzer is useful https://ohke.hateblo.jp/entry/2017/11/02/230000
[4] I tried to visualize tweets in Word Cloud (python) https://qiita.com/turmericN/items/04cd0b40f91076f0ef42
[5] I analyzed the lyrics of B'z with Python and machine learning ~ Data acquisition ~ https://pira-nino.hatenablog.com/entry/2018/07/27/B%27z%E3%81%AE%E6%AD%8C%E8%A9%9E%E3%82%92Python%E3%81%A8%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%81%A7%E5%88%86%E6%9E%90%E3%81%97%E3%81%A6%E3%81%BF%E3%81%9F_%E3%80%9C%E3%83%87%E3%83%BC%E3%82%BF
