Markov Chain Chatbot with Python + Janome (3) Ozaki Hosai Free-form Haiku Generator

Preface

Two previous installments: Markov Chain Chatbot with Python + Janome (1) Introduction to Janome, and last time, [Markov Chain Chatbot with Python + Janome (2) Introduction to Markov Chain](https://qiita.com/GlobeFish/items/17dc25f7920bb580d298).

Finally, we will implement sentence generation in earnest. Any material can be used for sentence generation, but if the source text is too coherent, the incoherence of the output stands out. Something airy, high-context, and moderately short is desirable. So this time I prepared the free-form haiku of Ozaki Hosai, famous for "Even when I cough, I am alone".

Data entry

First, prepare a text file. This time I prepared this. Ozaki Hosai Collection (Aozora Bunko)

Download it and put the text file in the same directory. Its contents look like this.

Ozaki Hosai Collection of haiku
Ozaki Hosai

Aozora Bunko Edition Preface

In this text file, the works of Ozaki Hosai (1885-1926), a representative poet of so-called free-form haiku along with Taneda Santoka, are arranged in chronological order. Hosai began writing haiku early, in junior high school, and the steps leading up to his death at the age of 41 are divided into ten periods.
Of course, not all of Hosai's haiku are listed here, only a selection. The choices were made as plain as possible, in the hope that they would be read by young people. Also, at the beginning of each chapter, a brief comment about Hosai at that time is attached.
Many of Hosai's haiku are published in differing notations. For this digitization, based on "Ozaki Hosai Collection" (Yao Shobo) and "Ozaki Hosai Zenkushu" (Shunjusha Publishing), both versions are included where the notation differs. A haiku whose notation is marked with () follows "Ozaki Hosai Zenkushu". <Edit: Aozora Bunko / Hamano>

[Junior high school]
Hosai Ozaki was born on January 20, 1885, in Yoshikata-cho, Omi-gun, Tottori Prefecture (present-day Tottori City), the second son of Shinzo Ozaki. His real name was Hideo. He entered Prefectural Daiichi Junior High School in 1897 (Meiji 30) and began writing haiku around this time.

Kite kite thread hanging plum branch

A quiet house with water and summer yanagi

I wonder if the fishing floor is seen from between the trees

I wonder if it's daytime depending on the desk of a good person

Hagi's huts and towns with lots of dew

The corner of the field that calls for cold chrysanthemums and chickens

Is it the second floor where young leaves are on the parapet?

Spring as a sick depression

I wrote a function that extracts only the free-form haiku and stores them in an array. In short, it keeps lines that neither contain a kuten (。) nor are written in brackets. ~~(It was troublesome to handle parts of the header and footer, so I deleted those manually from the original file.)~~

import re

def haiku_reader(textfile):
    with open(textfile, 'r') as f:
        pre_haiku_set = [line.rstrip('\n') for line in f if line != '\n']

    # drop lines containing [ or ( (chapter headings and editorial notes)
    pre_haiku_set2 = [line for line in pre_haiku_set if re.findall(r'\[.*|\(.*', line) == []]
    # drop prose lines containing the kuten 。, and append '.' as an end-of-haiku marker
    haiku_set = [line + '.' for line in pre_haiku_set2 if re.findall(r'.*?。', line) == []]
    return haiku_set

pre_haiku_set is an array of lines with the trailing newline removed. pre_haiku_set2 keeps only the lines that contain neither [] nor () notation. Finally, haiku_set collects the lines that contain no kuten, completing the extraction. Since it is convenient to have an end-of-sentence signal later, a '.' is appended to the end of each haiku here. The contents of haiku_set look like this.

['Kite kite thread hanging plum branch.', 'A quiet house with water and summer yanagi.', 'I wonder if the fishing floor is seen from between the trees.', "I wonder if it's daytime depending on the desk of a good person.", "Hagi's huts and towns with lots of dew.", 'The corner of the field that calls for cold chrysanthemums and chickens.', 'Is it the second floor where young leaves are on the parapet?.', 'Spring as a sick depression.', 'Tsukushi Koto, the beloved of Yukiharu and her mother.',
(Omitted)
'Nagisa white footing.', 'Poor and lined up in flowerpots.', 'Frost and shining birds.', 'Hot tea shop.', 'A forest with snow approaching the forest.', "It's a thick bone that makes the meat thin..", 'Put one hot water drink and let it go.', 'Put the thin body on the window and whistle the ship.', 'Become a sick person and the willow thread is blown.', 'Spring mountain smoke came out from the white.']
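The two regular-expression filters can be checked in isolation. A minimal sketch with made-up sample lines (a bracketed heading, a prose sentence containing a kuten, and two haiku):

```python
import re

# Hypothetical sample lines, standing in for the Aozora Bunko file
lines = ['[Chapter]', 'This is prose。', 'first haiku', 'second haiku']

# Drop lines containing [ or ( (headings and editorial notes)
no_brackets = [l for l in lines if re.findall(r'\[.*|\(.*', l) == []]
# Drop lines containing the kuten 。, append '.' as an end marker
haiku_set = [l + '.' for l in no_brackets if re.findall(r'.*?。', l) == []]
print(haiku_set)  # ['first haiku.', 'second haiku.']
```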

Next, let's define a function that splits these haiku into individual words and stores them in an array.

from janome.tokenizer import Tokenizer

def barashi(text):
    t = Tokenizer()
    parted_text = ''
    for haiku in haiku_reader(text):
        for token in t.tokenize(haiku):
            # join the surface forms with '|' as a separator
            parted_text += str(token.surface)
            parted_text += '|'
    word_list = parted_text.split('|')
    word_list.pop()  # drop the empty string left after the trailing '|'
    return word_list

It is the same as the previous chapter except that it has been made into a function. There is nothing special here.

Introduction to deque type

Last time we implemented a simple Markov chain with N = 1; this time, let's generalize to order N. For the Nth-order implementation we use the **deque** type from the collections module in the Python standard library. It is a data structure better suited to queues and stacks than a list, and it makes insertion and deletion at both ends easy and fast.

from collections import deque

queue = deque([9,9],3)

for i in range(10):
    print(queue)
    queue.append(i)

You create a deque object with deque(). The first argument gives the initial contents and the second the maximum number of elements (maxlen). When a bounded deque is full, appending to one end discards an element from the opposite end. Let's run the above code.

deque([9, 9], maxlen=3)
deque([9, 9, 0], maxlen=3)
deque([9, 0, 1], maxlen=3)
deque([0, 1, 2], maxlen=3)
deque([1, 2, 3], maxlen=3)
deque([2, 3, 4], maxlen=3)
deque([3, 4, 5], maxlen=3)
deque([4, 5, 6], maxlen=3)
deque([5, 6, 7], maxlen=3)
deque([6, 7, 8], maxlen=3)
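As a quick check of the claim that a deque supports cheap operations at both ends, here is a minimal standalone sketch:

```python
from collections import deque

d = deque([1, 2, 3])
d.appendleft(0)      # insert at the left end:  deque([0, 1, 2, 3])
d.append(4)          # insert at the right end: deque([0, 1, 2, 3, 4])
left = d.popleft()   # remove from the left end  -> 0
right = d.pop()      # remove from the right end -> 4
print(left, right, list(d))  # 0 4 [1, 2, 3]
```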

Nth-order Markov chain implementation (dictionary)

Let's define a function that creates a dictionary associating each group of N words with the words that follow it.

def dictionary_generator(text, order):
    dictionary = {}
    queue = deque([], order)
    word_list = barashi(text)
    for word in word_list:
        # only record a key once the queue is full and does not span a haiku boundary
        if len(queue) == order and '.' not in queue:
            key = tuple(queue)  # a deque is unhashable, so convert to a tuple
            dictionary.setdefault(key, []).append(word)
        queue.append(word)

    return dictionary

The basic structure is the same as in the previous chapter; the only difference is that the dictionary key is built from the deque. However, dictionary keys must be hashable, so the deque is converted to a hashable tuple when adding an item. The contents of the dictionary for N = 2 look like this.

{('Cut', 'kite'): ['of'], ('kite', 'of'): ['yarn'], ('of', 'yarn'): ['Take', 'But'], ('yarn', 'Take'): ['Keri'], ('Take', 'Keri'): ['plum'], ('Keri', 'plum'): ['of'], ('plum', 'of'): ['branch'], ('of', 'branch'): ['.'], ('water', 'strike'): ['hand'], ('strike', 'hand'): ['quiet'], ('hand', 'quiet'): ['Nana'], ('quiet', 'Nana'): ['House'], 
(Omitted)
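Why the tuple(queue) conversion is needed can be seen directly: a deque, being mutable, cannot serve as a dictionary key. A minimal sketch (the words are made up):

```python
from collections import deque

queue = deque(['of', 'yarn'], 2)
d = {}

caught = False
try:
    d[queue] = ['Take']      # a deque is mutable, hence unhashable
except TypeError:
    caught = True            # TypeError: unhashable type

d[tuple(queue)] = ['Take']   # converting to a tuple gives a valid key
print(caught, d)
```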

Nth-order Markov chain implementation (sentence generation)

Finally, the sentence generation function.

def text_generator(dictionary, order):
    start = random.choice(list(dictionary.keys()))
    t = Tokenizer()
    token = list(t.tokenize(start[0]))
    part_of_speech = str(token[0].part_of_speech)
    # redraw until the first word is a noun, adjective, interjection, or adnominal adjective
    # (Janome returns its part-of-speech tags in Japanese)
    while re.match(r'名詞|形容詞|感動詞|連体詞', part_of_speech) is None:
        start = random.choice(list(dictionary.keys()))
        token = list(t.tokenize(start[0]))
        part_of_speech = str(token[0].part_of_speech)
    now_word = deque(start, order)
    sentence = ''.join(now_word)
    for i in range(1000):
        if now_word[-1] == '.':
            break  # reached the end-of-haiku marker
        elif tuple(now_word) not in dictionary:
            break  # no known successor for this key
        else:
            next_word = random.choice(dictionary[tuple(now_word)])
            now_word.append(next_word)
            sentence += next_word
    return sentence

=========== Note below ===========

def text_generator(dictionary, order):
    start = random.choice(list(dictionary.keys()))
    t = Tokenizer()
    token = list(t.tokenize(start[0]))
    part_of_speech = str(token[0].part_of_speech)
    # Janome returns its part-of-speech tags in Japanese
    while re.match(r'名詞|形容詞|感動詞|連体詞', part_of_speech) is None:
        start = random.choice(list(dictionary.keys()))
        token = list(t.tokenize(start[0]))
        part_of_speech = str(token[0].part_of_speech)
    now_word = deque(start, order)
    sentence = ''.join(now_word)

We randomly pick one key from the dictionary built earlier. However, if the sentence starts with a particle, it does not read like a sentence, so it looks bad. The while loop redraws until the first word is a noun, adjective, interjection, or adnominal adjective. (Incidentally, most haiku in the original collection begin with a noun.) token holds the first word as a janome.tokenizer.Token object, and token[0].part_of_speech gives its part of speech. Once a good starting key is drawn, it is placed in the now_word queue and the joined string is stored in sentence.

    for i in range(1000):
        if now_word[-1] == '.':
            break  # reached the end-of-haiku marker
        elif tuple(now_word) not in dictionary:
            break  # no known successor for this key
        else:
            next_word = random.choice(dictionary[tuple(now_word)])
            now_word.append(next_word)
            sentence += next_word
    return sentence

Two stopping constraints come first:
・ if we arrive at '.', the haiku is complete
・ if the current key has no successor in the dictionary, give up
The second half is the same as the previous chapter: a following word is chosen at random and appended to the sentence.
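One property worth noting: because the dictionary values keep duplicate entries, random.choice automatically samples successors in proportion to how often they followed the key in the corpus. A minimal sketch with a made-up successor list:

```python
import random

# Hypothetical successor list: after some key, 'of' appeared 3 times and 'to' once
successors = ['of', 'of', 'of', 'to']

random.seed(0)  # fixed seed so the sketch is repeatable
draws = [random.choice(successors) for _ in range(1000)]
ratio = draws.count('of') / len(draws)
print(ratio)  # roughly 0.75
```

So no explicit probability table is needed; the raw frequency counts in the value lists do the weighting for free.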

Summary

from collections import deque
from janome.tokenizer import Tokenizer
import re
import random


def haiku_reader(textfile):
    with open(textfile, 'r') as f:
        pre_haiku_set = [line.rstrip('\n') for line in f if line != '\n']

    pre_haiku_set2 = [line for line in pre_haiku_set if re.findall(r'\[.*|\(.*', line) == []]
    haiku_set = [line + '.' for line in pre_haiku_set2 if re.findall(r'.*?。', line) == []]
    return haiku_set


def barashi(text):
    t = Tokenizer()
    parted_text = ''
    for haiku in haiku_reader(text):
        for token in t.tokenize(haiku):
            parted_text += str(token.surface)
            parted_text += '|'
    word_list = parted_text.split('|')
    word_list.pop()  # drop the trailing empty string
    return word_list

def dictionary_generator(text, order):
    dictionary = {}
    queue = deque([], order)
    word_list = barashi(text)
    for word in word_list:
        if len(queue) == order and '.' not in queue:
            key = tuple(queue)
            dictionary.setdefault(key, []).append(word)
        queue.append(word)

    return dictionary


def text_generator(dictionary, order):
    start = random.choice(list(dictionary.keys()))
    t = Tokenizer()
    token = list(t.tokenize(start[0]))
    part_of_speech = str(token[0].part_of_speech)
    while re.match(r'名詞|形容詞|感動詞|連体詞', part_of_speech) is None:
        start = random.choice(list(dictionary.keys()))
        token = list(t.tokenize(start[0]))
        part_of_speech = str(token[0].part_of_speech)
    now_word = deque(start, order)
    sentence = ''.join(now_word)
    for i in range(1000):
        if now_word[-1] == '.':
            break
        elif tuple(now_word) not in dictionary:
            break
        else:
            next_word = random.choice(dictionary[tuple(now_word)])
            now_word.append(next_word)
            sentence += next_word
    return sentence


text = "ozaki_hosai_senkushu.txt"
order = 2
dictionary = dictionary_generator(text,order)

print(text_generator(dictionary,order))
print(text_generator(dictionary,order))
print(text_generator(dictionary,order))

Execution results with N = 3:

Raise the face of fire.
Lanterns fly to fire Fog on the banks of the river.
Selling to the morning woman who opens the window.

It looks plausible! But just as I was delighted, I was surprised to find that the second and third were **real haiku taken verbatim from the collection**. The following is N = 2. I wonder if this is about the limit for originality.

Before.
Put the face of the beggar on the brazier.
Dead under a big stone tower.

Here too, the third one turned out to be a real haiku. Hard to tell them apart~

A stumbling point

from janome.tokenizer import Tokenizer

t = Tokenizer()
s = "Sentence"
print(type(t.tokenize(s)))
#<class 'generator'>

t.tokenize() returns a generator by default. When I checked the Janome documentation, it said:
・ it returns a list by default
・ it becomes a generator when the argument stream is set to True
but apparently stream was removed in a recent version, and tokenize now returns a generator from the start. To get a list, just wrap the result in list().
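The behavior is the same as for any Python generator, which this toy stand-in (not Janome itself) illustrates:

```python
def toy_tokens():
    # stands in for t.tokenize(): yields items lazily, one at a time
    yield 'a'
    yield 'b'

g = toy_tokens()
print(type(g).__name__)  # generator
tokens = list(g)         # materialize the generator into a list
print(tokens)            # ['a', 'b']
print(list(g))           # [] -- a generator is exhausted after one pass
```

Note that a generator can only be iterated once; if you need to index into the tokens (as with token[0] above), convert with list() first.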

Digression

Some of the generated haiku were quite good, so I'll list a few.

Throw out such a good moon alone.
A dark, big ant hangs on a tatami mat.
The body of sake is full of kudzu water..
Pomegranate is dead and lonely in the garden.
I have two ears that I want to squeeze.
The roof is heavy.

They read rather like the real thing... I generated a variety of haiku, and personally,

Release to the sea without chickens.

was my favorite.
