[PYTHON] I tried to automatically create a report with Markov chain

What is a Markov chain?

Put simply, a Markov chain is a process in which the state at the next point in time is determined by the state at the current point in time. As a concrete example with text, when you see the word "tummy", it seems likely that "empty" will come next. However, "empty" is not the only correct answer; "full" may also follow. So let's express this with probabilities. Assume that the word following "tummy" has a 60% chance of being "empty" and a 40% chance of being "full". The probability of moving to each next state is called the transition probability. That was a brief explanation of Markov chains. If you want to know more about this topic, please read Basics of Markov Chain and Kolmogorov Equation (Beautiful Story of High School Mathematics).
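As a rough illustration of the 60% / 40% example above, the sketch below draws the next word many times and counts the results. The `transitions` table and the `next_word` helper are names I made up for this illustration; they are not part of the program later in this article.

```python
import random
from collections import Counter

# Hypothetical transition table for the example above:
# after "tummy", "empty" follows with probability 0.6 and "full" with 0.4.
transitions = {"tummy": {"empty": 0.6, "full": 0.4}}

def next_word(current, rng):
    """Draw the next word according to the transition probabilities."""
    candidates = transitions[current]
    words = list(candidates)
    weights = list(candidates.values())
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded only so that runs are repeatable
counts = Counter(next_word("tummy", rng) for _ in range(10_000))
print(counts)
```

Over many draws, the share of "empty" should settle close to 0.6.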

However, not every sentence can be explained by a Markov chain. For example, in one context the word after "tummy" is most likely to be "empty", while in another context it is most likely to be "full". This means that a sentence depends not only on the immediately preceding word but also on the words before it. More broadly, it depends on the context. However, since this article deals with Markov chains, I would like to cover that topic in another article.

Program

The purpose of this program is to automatically generate a new report from the data of reports I wrote myself. So, first, read the file.


import random
from janome.tokenizer import Tokenizer

with open("data.csv", "rt", encoding="utf-8_sig") as f:
    text_raws = f.read()
text_raws = text_raws.replace("\n", "@\n").split("\n")

I loaded data.csv. It contains reports that I wrote, but I would rather not publish them, so I will substitute suitable sample sentences when I post this on GitHub. After reading the file, I used replace to insert @ at the end of each line as an end-of-sentence marker.
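To see what the replace-then-split step produces, here is a small sketch on in-memory strings instead of data.csv. The sample text is invented for illustration. One thing it shows is that the last line only receives the @ marker when the text ends with a newline.

```python
# Hypothetical sample text standing in for the contents of data.csv.
with_trailing_newline = "First report.\nSecond report.\n"
without_trailing_newline = "First report.\nSecond report."

def mark_and_split(text):
    """Append "@" to every line that ends with a newline, then split into lines."""
    return text.replace("\n", "@\n").split("\n")

marked = mark_and_split(with_trailing_newline)
unmarked_last = mark_and_split(without_trailing_newline)
print(marked)
print(unmarked_last)
```

In the first case the split also leaves an empty string at the end. The loop that builds the dictionary later does nothing for a line with fewer than two tokens, so that empty entry adds no word pairs.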


text_lists = []
t = Tokenizer()
for text_raw in text_raws:
    text_list = []
    tokens = t.tokenize(text_raw, wakati=True)
    for token in tokens:
        text_list.append(token)
    text_lists.append(text_list)

Next, we perform morphological analysis using janome's Tokenizer. Morphological analysis divides a sentence into words, for example, as follows.

["I will post an article on qiita."]
↓
['I', 'is', 'qiita', 'to', 'article', 'to', 'post', 'to', '.']

By default, extra information such as the part of speech is attached to each token, so I pass wakati=True to extract only the words.


dic = {}
for text_list in text_lists:
    for i in range(len(text_list) - 1):
        if text_list[i] in dic:
            lists = dic[text_list[i]]
        else:
            lists = []
        lists.append(text_list[i + 1])
        dic[text_list[i]] = lists

Here, the correspondence between each word and the words that follow it is built as a dictionary such as {"tummy": ["empty", "full"]}. If the same word follows several times, it is appended each time, so the list keeps duplicates.
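The same dictionary can be built a little more compactly. The sketch below is one possible variant, assuming the sentences are already tokenized. The `build_successors` name and the toy token lists are my own illustration, not the real report data.

```python
from collections import defaultdict

def build_successors(token_lists):
    """Map each word to the list of words that followed it (duplicates kept)."""
    successors = defaultdict(list)
    for tokens in token_lists:
        # Pair every token with the one right after it.
        for current, following in zip(tokens, tokens[1:]):
            successors[current].append(following)
    return dict(successors)

# Hypothetical, already tokenized sentences ending with the "@" marker.
toy_token_lists = [
    ["tummy", "empty", "@"],
    ["tummy", "empty", "@"],
    ["tummy", "full", "@"],
]
toy_dic = build_successors(toy_token_lists)
print(toy_dic)
```

Here "empty" appears twice in the list for "tummy", which is what later makes it twice as likely to be picked as "full".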


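word = input("Please enter the first word: ")
generate = word
# Start the chain from the last token of the input.
word = list(t.tokenize(word, wakati=True))[-1]
limit = 10000  # upper bound on generated words, to avoid an infinite loop
cnt = 0

while cnt < limit:
    # Stop if the current word never had a successor in the data.
    if word not in dic:
        break
    # Duplicates in the list make frequent successors more likely.
    word = random.choice(dic[word])
    if word == "@":  # end-of-sentence marker
        break
    cnt += 1
    generate += word
print(generate)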

The first word is taken from user input. The input is morphologically analyzed, and the Markov chain starts from its last word. The next word is picked at random from the list in the dictionary. Because the list keeps duplicates, each word is chosen with probability proportional to the number of times it followed the current word. Generation ends when it reaches @, the end-of-sentence marker, or when it hits the upper limit that I set to avoid an infinite loop.
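The generation step can also be wrapped in a function, which makes it easier to test without typing input. This is only a sketch under my own assumptions. `generate_words` and the toy dictionary are hypothetical names and data, and it returns a list of words so that the caller decides how to join them.

```python
import random
from collections import Counter

def generate_words(successors, start, limit=10000, rng=random):
    """Walk the chain from `start` until "@", a dead end, or `limit` words."""
    words = [start]
    current = start
    for _ in range(limit):
        candidates = successors.get(current)
        if not candidates:   # the word never had a successor in the data
            break
        current = rng.choice(candidates)
        if current == "@":   # end-of-sentence marker
            break
        words.append(current)
    return words

# Hypothetical toy dictionary, not taken from the real report data.
toy = {"tummy": ["empty", "empty", "full"], "empty": ["@"], "full": ["@"]}

rng = random.Random(0)  # seeded only so that runs are repeatable
sentence = generate_words(toy, "tummy", rng=rng)
print(" ".join(sentence))  # use "".join(...) for Japanese, which needs no spaces

# Duplicates act as weights: "empty" should be picked about two times in three.
counts = Counter(rng.choice(toy["tummy"]) for _ in range(9000))
print(counts)
```

Passing a seeded random.Random instance is optional. By default the function uses the random module itself, as the program above does.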

This completes the program. Let's try it out.

Input: "Today"
Generated: "Today there was a delay in aggregating less satisfying procedures and taking advantage of unnatural techniques."

Input: "people"
Generated: "I thought it was necessary to see the flow of technical assistance for specific heavy rain data procedures in the future, which is more likely to be worn by people."

I can't tell what it is trying to say. As I mentioned at the beginning, language is not determined only by the immediately preceding word, so the output ended up with unnatural connections such as "many → → →". Next time, I would like to improve it with an LSTM or similar, so that the best word can be chosen from more of the past states. Source code here

References

I want to generate tweets in Python! -Markov chain-
Basics of Markov Chain and Kolmogorov Equation (Beautiful Story of High School Mathematics)
