Try text mining your diary in Python

Background

I studied Janome using the page below, and in this post I run it in a local environment to text-mine a diary I wrote. https://mocobeta.github.io/janome/

Environment

- Python 3.7.4

Modules used

- Janome 0.3.10
- wordcloud 1.7.0

Start by installing the modules

pip install Janome
pip install wordcloud

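pip normally handles the whole installation. If you are working from a downloaded copy of a module's source instead, cd into that module's folder and run the following (I forgot this step)

python setup.py install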
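
To confirm that both modules actually ended up installed, a small check like the one below can help. It is only a sketch: the helper name installed_version is made up, and importlib.metadata is part of the standard library from Python 3.8 onward, so it assumes a newer interpreter than the 3.7.4 used in this post.

```python
from importlib import metadata  # standard library from Python 3.8


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the two modules this post depends on
for name in ("Janome", "wordcloud"):
    print(name, installed_version(name) or "not installed")
```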

Processing order

  1. Prepare a text file (.txt)
  2. Split the text into words
  3. Create a word cloud

from janome.tokenizer import Tokenizer
from janome.analyzer import Analyzer
from janome.charfilter import RegexReplaceCharFilter
from janome.tokenfilter import POSKeepFilter, POSStopFilter, ExtractAttributeFilter
from wordcloud import WordCloud

#Build an Analyzer that cleans up the text and keeps only selected parts of speech
def create_analyzer():
  tokenizer = Tokenizer()
  #Remove any text enclosed in 《 》 (for example, ruby readings) before tokenizing
  char_filters = [RegexReplaceCharFilter('《.*?》', '')]
  #Janome's default dictionary tags parts of speech in Japanese, so the filters need the Japanese names:
  #名詞 (noun), 形容詞 (adjective), 形容動詞 (adjectival noun), 感動詞 (interjection)
  token_filters = [
    POSKeepFilter(['名詞', '形容詞', '形容動詞', '感動詞']),  #Keep only these parts of speech
    POSStopFilter(['名詞,非自立', '名詞,代名詞']),  #Then drop dependent nouns and pronouns
    ExtractAttributeFilter('base_form'),  #Output each word's base (uninflected) form
  ]
  #Keyword arguments work whether or not the installed Janome accepts positional ones
  return Analyzer(char_filters=char_filters, tokenizer=tokenizer, token_filters=token_filters)

#Read the file given in src, split each line into words, and write the result to out
def split_text(src, out):
  a = create_analyzer()
  with open(src, encoding='utf-8') as f1, open(out, mode='w', encoding='utf-8') as f2:
    for line in f1:
      tokens = list(a.analyze(line))
      #One output line per input line, words separated by spaces
      f2.write('%s\n' % ' '.join(tokens))


split_text('data/diary.txt', 'words.txt')
with open("words.txt", encoding='utf-8') as f:
    text = f.read()

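wc = WordCloud(width=1920, height=1080,
               font_path="fonts/ipagp.ttf",  #Japanese-capable font file, downloaded separately
               max_words=100,  #Maximum number of words in the word cloud
               background_color="white",  #Background color
               #Words to exclude; each entry must match a word in words.txt exactly,
               #so list the Japanese words as they appear there
               stopwords={"myself", "Absent", "Good"})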

wc.generate(text)
wc.to_file('data/test_wordcloud.png')
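
To see what the character filter in create_analyzer removes, the same pattern can be tried with the standard re module. This is a sketch that assumes the filter applies the pattern the way re.sub does; the helper name strip_bracketed and the sample sentence are made up for illustration.

```python
import re

# Same pattern that is passed to RegexReplaceCharFilter in create_analyzer
BRACKET_PATTERN = re.compile('《.*?》')


def strip_bracketed(text):
    """Remove every span enclosed in 《 》, matching non-greedily."""
    return BRACKET_PATTERN.sub('', text)


# Made-up diary line with two bracketed readings
sample = '今日は図書館《としょかん》で本《ほん》を読んだ。'
print(strip_bracketed(sample))
```

Because the match is non-greedy (.*?), each 《...》 pair is removed on its own and the text between two pairs is left in place.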

The first function, create_analyzer, can also load a user dictionary (a CSV file that defines technical terms), but I omitted that this time. The page below covers this as well: https://mocobeta.github.io/janome/
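
A rough sketch of that omitted step follows, assuming Janome's simplified user dictionary format (one CSV row per term: surface form, part of speech, reading). The file name userdic.csv, both helper names, and the sample entry are made up for illustration and are not taken from the original script.

```python
import csv


def write_simple_userdic(path, entries):
    """Write (surface, part_of_speech, reading) rows as a simplified-format CSV."""
    with open(path, mode='w', encoding='utf-8', newline='') as f:
        csv.writer(f, lineterminator='\n').writerows(entries)


def create_tokenizer_with_userdic(path):
    """Build a Tokenizer that consults the user dictionary stored at path."""
    # Imported here so the CSV helper above can be tried without Janome installed
    from janome.tokenizer import Tokenizer
    return Tokenizer(path, udic_type='simpledic', udic_enc='utf8')


# Hypothetical entry: register a compound term so that it stays a single word
write_simple_userdic('userdic.csv', [('東京スカイツリー', 'カスタム名詞', 'トウキョウスカイツリー')])
```

Passing the result of create_tokenizer_with_userdic('userdic.csv') as the tokenizer in create_analyzer, in place of Tokenizer(), would apply the dictionary to the diary text.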

Running the script creates the following PNG file. In the future, I would like to combine this with text gathered through web scraping and APIs, read in from JSON files.

test_wordcloud.png
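
To check which words should stand out in the image, a rough frequency count of the space-separated output is usually enough. The sketch below uses collections.Counter on a made-up sample; the helper name top_words is hypothetical, and WordCloud does its own processing (stop words and so on), so its ranking may differ.

```python
from collections import Counter


def top_words(text, n=5):
    """Count whitespace-separated words and return the n most common as (word, count) pairs."""
    return Counter(text.split()).most_common(n)


# Made-up sample shaped like words.txt: one line of space-separated words per diary line
sample = '天気 散歩 公園\n散歩 楽しい\n天気 散歩\n'
print(top_words(sample, 2))
```

With the real data, the text read from words.txt can be passed to top_words in place of the sample.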
