[PYTHON] I tried to make an automatic character dialogue generator with an N-th order Markov chain

Introduction

My usual research is natural language processing on financial texts. Because that work is joint research and cannot be published externally, and whatever can be published goes into conference presentations or papers, I never had a reason to write it up as articles, so I decided to start writing about things I can do on the side. In this article, I introduce how to automatically generate lines for Serval from the anime [Kemono Friends](https://ja.wikipedia.org/wiki/%E3%81%91%E3%82%82%E3%81%AE%E3%83%95%E3%83%AC%E3%83%B3%E3%82%BA_(%E3%82%A2%E3%83%8B%E3%83%A1)) with a Markov chain. image.png (the reliable Irasutoya, since I don't know about the copyright of official images)

The flow of the article is:

  1. Data acquisition
  2. About the N-th order Markov chain
  3. Implementation
  4. Consideration
  5. Summary

Acquisition of text data of Kemono Friends

For the dialogue data, I used the transcription published on akahuku's GitHub. I am really grateful that this kind of data exists. Since each line of this text data is labeled with which character is speaking, I could extract only Serval's lines. Originally I was not looking for this data for automatic line generation but for studying sentence classification with PyTorch, and I wanted data that could be used for classification (I will write the article on line classification some other time). Incidentally, the reason I chose Serval for automatic generation is simply that she has the most lines (812 lines, long and short combined). I actually wanted to do it with Tsuchinoko or the Doctor, but unfortunately the amount of data was nowhere near enough. I probably should describe the fine-grained preprocessing, but it would make the article unnecessarily long, so I will only say that I did nothing special other than converting all characters to full-width. The following is an excerpt of the extracted data.

Doctor, you guys will find this.
Dr. It is necessary when running on land. Use your head.
Dr. Similar buses have been witnessed on several islands. Look for it first.
Doctor I can't help it. There is something like a bus in the amusement park.
Dr. That's right for you guys, this is good.
:
Serval After all, I thought I should follow him a little more.
Serval Let's be friends!
Serval Wow! boss! ??
Serval talked! ??

About the N-th order Markov chain

For the Markov chain and its implementation, @k-jimon's article [Python] Sentence generation with an N-th order Markov chain was extremely helpful. Honestly, I don't feel much need to rewrite it from scratch in this article. An intuitive, easy-to-follow explanation of sentence generation with Markov chains can be found in ["Sentence generation by Markov chain"](https://omedstu.jimdofree.com/2018/05/06/%E3%83%9E%E3%83%AB%E3%82%B3%E3%83%95%E9%80%A3%E9%8E%96%E3%81%AB%E3%82%88%E3%82%8B%E6%96%87%E6%9B%B8%E7%94%9F%E6%88%90/). Just in case the linked articles ever disappear, I will briefly introduce the idea here as well.

Markov chain

For the time being, a definition from [Wikipedia](https://ja.wikipedia.org/wiki/%E3%83%9E%E3%83%AB%E3%82%B3%E3%83%95%E9%80%A3%E9%8E%96):

A Markov chain is a Markov process, a type of stochastic process, whose possible states are discrete (finite or countable), i.e. a discrete-state Markov process. In particular it usually refers to the discrete-time case, where time is represented by subscripts (there are also continuous-time Markov processes). In a Markov chain, future behavior is determined only by the current value and is independent of past behavior (the Markov property). In other words, a Markov chain is a sequence in which the probability of each state change (transition) depends only on the current state and not on past states. It is applied in many fields as a particularly important class of stochastic process.

Put simply, the probability of the next state is determined by the current state alone, and past history does not affect that probability. In terms of weather: if it is sunny today, tomorrow is likely to be sunny, and if it is raining today, tomorrow is likely to be cloudy or rainy. Under the Markov assumption, if today is sunny, it makes no difference whether yesterday was sunny or rainy; that is exactly the assumption a Markov chain makes.

For sentence generation, consider the following three sentences as an example: "Serval likes hunting games", "Bag is good at thinking", "The Doctor is not good at cooking". We will not go into the details here, but since each sentence can be split into word units, we can draw the following diagram. image.png This diagram shows the state transitions, and you can build a sentence by moving from left to right. For example, by transitioning as follows, you can create a sentence such as "**Bag** is **not good at hunting**". image.png By accumulating which words follow which in the training data in this way, it becomes possible to generate a variety of sentences.
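As a rough illustration of the idea (not the code used later, which works on characters rather than words), here is a minimal word-level sketch. The transition table is my own hand-made approximation of the three example sentences above, rendered in English, not something built from the real data.

python


import random

# Hypothetical word-level transition table built by hand from the three example sentences.
# Keys are the current word; values are the words observed to follow it.
transitions = {
    "[BOS]": ["Serval", "Bag", "Doctor"],
    "Serval": ["likes"],
    "Bag": ["is"],
    "Doctor": ["is"],
    "likes": ["hunting"],
    "hunting": ["games"],
    "games": ["[EOS]"],
    "is": ["good", "not"],
    "not": ["good"],
    "good": ["at"],
    "at": ["thinking", "cooking"],
    "thinking": ["[EOS]"],
    "cooking": ["[EOS]"],
}

word = "[BOS]"
sentence = []
while True:
    word = random.choice(transitions[word])
    if word == "[EOS]":
        break
    sentence.append(word)

# May print a sentence not in the training data, e.g. "Bag is good at cooking"
print(" ".join(sentence))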

N-th order Markov chain

An N-th order Markov chain predicts the next state from the N states running from N-1 steps back up to the current state. With N = 1 it is the same as an ordinary Markov chain. With the weather example above, you can see intuitively that tomorrow's forecast should improve if the previous day's weather is also taken into account: after two sunny days in a row, tomorrow is very likely to be sunny, while after a rainy day followed by a sunny day the probability of another sunny day is lower than after two sunny days. For sentence generation, the output looks more like the training text the more preceding context you take into account. For example, if in addition to the previous sentences there is a sentence like "Curry is delicious on the second day" and N = 1, a sentence like "Curry is good at hunting" can be generated. However, the larger N is, the harder it becomes to generate genuinely new sentences. I wish I had much more data...

Implementation

Now for the long-awaited implementation. Note that this N-th order Markov chain builds its model in character units, not word units. The data is dialogue, i.e. spoken language, it contains a lot of hiragana and katakana due to the nature of the show, and there are many named entities (Japari Park and so on), so morphological analysis with MeCab and the like is unreliable. Considering that the amount of data is also small, I generated sentences character by character. For example, "博士は料理が苦手" ("The Doctor is not good at cooking") is split into the single characters "博", "士", "は", "料", "理", "が", "苦", "手", and the state transition model is built over these. In the Consideration section we will also look at how the results change with the size of N. (~~It's not that introducing MeCab or writing the word-level version was too much trouble~~)
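Splitting into characters is trivial in Python, since iterating over a string yields its individual characters; a minimal sketch, using just the example sentence above:

python


# list() over a string yields its individual characters, which is all the
# "tokenization" the character-level model needs.
chars = list("博士は料理が苦手")
print(chars)  # ['博', '士', 'は', '料', '理', 'が', '苦', '手']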

Environment

・Python 3.7

Prerequisite knowledge

To build the model, we need a data structure like ["今", "日"] → ["は", "も", "、", "は", ...], i.e. a mapping from the last N characters to the list of characters observed to follow them. This is an example with N = 2: whenever the characters "今" and "日" appear in that order, we record which character comes next. When generating sentences, the next character is chosen at random from this list, so characters that appear in it more often (duplicates) are more likely to be chosen. There are also data structures that reduce the amount of stored data, but this time we simply build the structure as described.
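In other words, the model is just a plain dict whose keys are tuples of the last N characters and whose values are lists of observed next characters. A minimal sketch of what one entry might look like (the characters and counts here are illustrative, not statistics from the real data):

python


import random

# Illustrative model entry for N = 2 (not real statistics from the corpus):
# after seeing the two characters ("今", "日"), the characters below were observed next.
model = {
    ("今", "日"): ["は", "も", "、", "は", "は"],
}

# Duplicates in the list act as weights: "は" is picked more often than "も" or "、".
next_char = random.choice(model[("今", "日")])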

Since the data maps a key (["今", "日"] above) to a list (["は", ...]), we use Python's dict and list. We also use deque to build the key. It is available in the collections module of Python's standard library, and as [Python] Sentence generation with an N-th order Markov chain also notes, it is very handy for this implementation. Briefly, what makes it different from list is that if you create it with a maximum length, then when an append would exceed that length, the value at the front is pushed out. In other words, if a deque of length 2 contains ["今", "日"] and you append "は" to the end, it becomes ["日", "は"] (really convenient). You can set it up like this.

python


from collections import deque

n_size = 2
# with maxlen=n_size, appending beyond the limit pushes out the oldest element
queue = deque([], n_size)
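For example (a quick check of the push-out behavior described above, using the characters from the earlier example):

python


from collections import deque

queue = deque(["今", "日"], 2)
queue.append("は")    # the oldest element "今" is pushed out
print(list(queue))    # ['日', 'は']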

However, a deque cannot be used directly as a dict key, so convert it to a tuple.

python


key = tuple(queue)

This should be all you need for the prerequisite knowledge.

Data reading

The extracted data is saved in "kemono_friends.txt" in the format "name line", separated by a space, so we read it line by line, strip the newline, and split on the space. Lines whose name is "Serval" are appended to text_list.

data_load.py


text_list = []
with open("./data/kemono_friends.txt") as data:
    for line in data:
        # each line has the form "name<space>dialogue"
        char, text = line.rstrip('\n').split(" ")
        if char == "Serval":
            text_list.append(text)

Modeling

Now we create the model. "[BOS]" is a tag marking the beginning of a sentence (Beginning Of Sentence) and "[EOS]" is a tag marking the end of a sentence (End Of Sentence). n_size is the N of the N-th order Markov chain.

mk_model.py


from collections import deque
import pickle

n_size = 4

def mk_model(text_list):
    model = {}
    for text in text_list:
        # the key is a tuple of up to n_size preceding characters, starting from "[BOS]"
        queue = deque([], n_size)
        queue.append("[BOS]")
        for i in range(0, len(text)):
            key = tuple(queue)
            if key not in model:
                model[key] = []
            # record which character followed this context
            model[key].append(text[i])
            queue.append(text[i])
        # the context at the end of the sentence leads to "[EOS]"
        key = tuple(queue)
        if key not in model:
            model[key] = []
        model[key].append("[EOS]")
    return model

"""
Data here_load.Copy py
"""

Automatic dialogue generation

To automatically generate sentences using the created model, add the following code to the previous program.

mk_serihu.py


import random

def mk_serihu():
    value_list = []
    # start from the "[BOS]" context, just as during model creation
    queue = deque([], n_size)
    queue.append("[BOS]")
    while True:
        key = tuple(queue)
        # pick the next character at random from those observed after this context
        value = random.choice(model[key])
        if value == "[EOS]":
            break
        value_list.append(value)
        queue.append(value)
    return value_list

# For the time being, generate 10 lines
for i in range(0, 10):
    serihu = ''.join(mk_serihu())
    print(serihu)

As a program: "[BOS]" is the first key, and a character is picked at random from the model. The key then shifts along as characters are generated one after another, and when "[EOS]" comes up, generation stops. The output looks like this.

But do you continue that much?
It's my first time. You did it.
Ah. It's cold.
Bye bye.
Yup. Hey, brown bear. I'll go.
Well, do you say idol and start dancing and singing?
Wait, wait.
No way.
I gave you the name earlier. For what.
I wonder what.

It feels pretty good! By the way, when the amount of data is large it is a pain to rebuild the model every time, so it is convenient to save it as a binary file with pickle and load it straight away. The program that loads the saved model.binaryfile and generates lines is as follows. (If you want to skip the data processing and just run the generation, DM me on Twitter or the like and I will pass you model.binaryfile.)

mk_serihu.py


from collections import deque
import random
import pickle

n_size = 4

# load the pre-built model from the pickle file
with open("data/model.binaryfile", 'rb') as f:
    model = pickle.load(f)

def mk_serihu():
    value_list = []
    queue = deque([], n_size)
    queue.append("[BOS]")
    while True:
        key = tuple(queue)
        value = random.choice(model[key])
        if value == "[EOS]":
            break
        value_list.append(value)
        queue.append(value)
    return value_list

# For the time being, generate 10 lines
for i in range(0, 10):
    serihu = ''.join(mk_serihu())
    print(serihu)

Consideration

Varying the value of N, I found that N = 4 produces reasonably convincing lines. If N is too large the output just reproduces the training data, and if N is too small the sentences stop being coherent, which is tough (sweats). Here are some good ones with N = 4 that are not in the training data.

Well, why don't you try it?
Tell me what the boss is!
Is there anything there?
Yeah. Your bag is dexterous.
You're kidding, bag-chan.
See, let's go!
Isn't the bag like this?
What's that? I wonder how to have it.
You forget about us too.

About 63% of the generated lines are not in the training data! Next, lines not in the training data with N = 3.

What's wrong? What about your bag?
all right. What should I do?
It was friends ...
It's messed up. I'm looking forward to it.
Ah! No good! Can you take it?
I think everyone is worried.
Well, that voice was eaten by someone.

About 80% of the lines are not in the training data, but they are somewhat unstable. Next, lines not in the training data with N = 2.

Okay, is it okay to run around without being deceived anymore?
somewhere. Bag ... I wonder if anyone is white. Hey, lava?
Or should I just call it?
I'm fine. Is it in the shape of friends?

About 96% of the lines are not in the training data, but... Finally, N = 5.

I fell asleep on the way yesterday.
There's a lot of sand.
Yeah, yeah. What do you mean? here?

About 35% of the lines are not in the training data. The output is more stable, but I want it to produce more lines that do not exist in the original, so N = 4 feels just right. Since words like "Bag-chan" are long, meaningless lines are easy to produce when such a long character string comes along, so a Markov chain alone only gets you so far (sweats).
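The percentages above can be checked mechanically; a minimal sketch of how such a novelty rate could be measured, assuming the mk_serihu and text_list defined earlier (the sample size of 100 is arbitrary):

python


# Generate a batch of lines and count how many do not appear verbatim in the training data.
n_samples = 100
generated = [''.join(mk_serihu()) for _ in range(n_samples)]
novel = [s for s in generated if s not in text_list]
print(f"{len(novel) / n_samples:.0%} of generated lines are not in the training data")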

Summary

For now this is a model that is easy to build, but the sentence generation is rough. There are still many easy improvements, such as registering character and place names with named entity recognition, or swapping words for similar ones with word2vec. A lot of recent sentence generation also uses deep learning, so I would like to try that too. Personally, I would like to use the automatic dialogue generator to create large amounts of training data, and build a model that takes an arbitrary line as input and converts it into a line in the style of a specified character. Ultimately I want to build something that can hold a short back-and-forth rather than only produce one-sided lines. Kemono Friends happened to have good data, but what I really want is to use novel text data obtained via the API with @RemChabot and make it possible to talk with Rem. The novel data has no speaker labels for the dialogue, so I am stuck at the data-creation stage (sweats). ~~I want someone to create the data for me.~~ In any case, creating training data for natural language processing is hard, so if you are willing to help, please get in touch.

I'm not used to writing articles, and I'm sure there were many hard-to-read parts, but thank you for reading to the end. If you felt "this was helpful" or "I want to read more", I would be grateful for a like (it keeps me motivated).

Reference site list

I tried transcribing the lines of the TV anime Kemono Friends
Wikipedia: Markov chain
[Python] Sentence generation with an N-th order Markov chain
["Sentence generation by Markov chain"](https://omedstu.jimdofree.com/2018/05/06/%E3%83%9E%E3%83%AB%E3%82%B3%E3%83%95%E9%80%A3%E9%8E%96%E3%81%AB%E3%82%88%E3%82%8B%E6%96%87%E6%9B%B8%E7%94%9F%E6%88%90/)
