Using Janome, a Japanese morphological analysis engine written in pure Python, together with the Markov chain library markovify, we learn from Japanese sentences and automatically generate new ones.
This is largely based on the article "Learn Japanese sentences with markovify and generate sentences by Markov chain" (Reference 1).
It's been a while since I last touched Python, so the code is fairly loose. Please bear with it.
There are existing examples of learning from Japanese text and automatically generating sentences with markovify, but most of them use MeCab, which takes some effort to set up on Windows and in certain virtual environments (e.g. Heroku).
In that respect, Janome is easy to install even on Windows, and there is an article that actually uses it for automatic generation ([document generation by Markov chains](https://omedstu.jimdofree.com/2018/05/06/%E3%83%9E%E3%83%AB%E3%82%B3%E3%83%95%E9%80%A3%E9%8E%96%E3%81%AB%E3%82%88%E3%82%8B%E6%96%87%E6%9B%B8%E7%94%9F%E6%88%90/)), but I couldn't find the combination of markovify and Janome (too niche). markovify makes it easier to improve the naturalness of the generated sentences, so I wanted to use it if possible.
So this time I got the two working together, and I'm posting the result as a memo. If you wanted to combine them yourself you could probably rewrite the code on your own, so this really is just a memo...
Both janome and markovify can be installed with `pip install janome markovify` (`pip3` depending on your environment).
First, the full script. More than half of the textGen part is adapted from Reference 1.
janomeGen.py

```python
#!/usr/bin/env python3
# -*- coding:utf-8 -*-

from janome.tokenizer import Tokenizer
import markovify


def split(text):
    # Remove line breaks, and replace problematic characters
    table = str.maketrans({
        '\n': '',
        '\r': '',
        '(': '（',
        ')': '）',
        '[': '［',
        ']': '］',
        '"': '”',
        "'": "’",
    })
    text = text.translate(table)
    t = Tokenizer()
    # wakati=True yields surface forms only; list() also covers Janome 0.4+,
    # where tokenize returns a generator
    result = list(t.tokenize(text, wakati=True))
    # Look at each morpheme, insert a half-width space between them,
    # and insert a line break at the end of each sentence
    splitted_text = ""
    for i in range(len(result)):
        splitted_text += result[i]
        if result[i] != '。' and result[i] != '!' and result[i] != '?':
            splitted_text += ' '
        if result[i] == '。' or result[i] == '!' or result[i] == '?':
            splitted_text += '\n'
    return splitted_text


def textGen(file):
    f = open(file, 'r', encoding="utf-8")
    text = f.read()
    sentence = None
    while sentence is None:  # empty results can occur depending on the material, so retry
        # Split the text into a form markovify can process
        splitted_text = split(text)
        # Build the model
        text_model = markovify.NewlineText(splitted_text, state_size=3)
        # Generate a sentence from the model
        sentence = text_model.make_sentence()
    # Save the learned data
    with open('learned_data.json', 'w') as f:
        f.write(text_model.to_json())
    # When reusing the data:
    """
    with open('learned_data.json') as f:
        text_model = markovify.NewlineText.from_json(f.read())
    """
    # Join the words back together and return a single string
    return ''.join(sentence.split())
```
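Calling it is as simple as this (a minimal sketch; `text.txt` is a placeholder for whatever UTF-8 plain-text file you want to learn from):

```python
if __name__ == '__main__':
    # 'text.txt' is a hypothetical filename; any UTF-8 plain-text file works
    print(textGen('text.txt'))
```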
Below, let's look at each part in order.
```python
table = str.maketrans({
    '\n': '',
    '\r': '',
    '(': '（',
    ')': '）',
    '[': '［',
    ']': '］',
    '"': '”',
    "'": "’",
})
text = text.translate(table)
```
This replaces some characters so that markovify can read the text. Line breaks and spaces indicate sentence breaks and word breaks respectively, so any already in the source are removed first (Japanese text with English sentences mixed in can't be handled well this way, but that is ignored here).
Also, the "bad characters" that interfere with markovify's operation are replaced with harmless full-width equivalents. (Since markovify v0.7.2 you can tell markovify.Text whether to ignore sentences containing bad characters via the well_formed parameter, but throwing away a whole sentence is wasteful, so they are replaced in advance.)
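For reference, a minimal sketch of the well_formed route just mentioned (not what this script does; whole sentences containing bad characters would simply be discarded):

```python
# Alternative: let markovify itself reject sentences containing bad characters
# (well_formed is available since markovify v0.7.2)
text_model = markovify.NewlineText(splitted_text, state_size=3, well_formed=True)
```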
```python
t = Tokenizer()
result = list(t.tokenize(text, wakati=True))

splitted_text = ""
for i in range(len(result)):
    splitted_text += result[i]
    if result[i] != '。' and result[i] != '!' and result[i] != '?':
        splitted_text += ' '
    if result[i] == '。' or result[i] == '!' or result[i] == '?':
        splitted_text += '\n'
```
What this does is almost the same as in Reference 1, so it is more accurate to refer to that article.
Tokenizing with Janome's Tokenizer like this returns the text split into morphemes. For example, 「私はりんごを一つ食べる。」 ("I eat one apple.") comes back as something like `['私', 'は', 'りんご', 'を', '一つ', '食べる', '。']`. If all you want is the surface forms, this is simpler and more convenient than MeCab.
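A quick way to check this for yourself (a minimal sketch; the sentence is my own example, and the exact tokens depend on the dictionary):

```python
from janome.tokenizer import Tokenizer

t = Tokenizer()
# list() also covers Janome 0.4+, where tokenize returns a generator
print(list(t.tokenize('私はりんごを一つ食べる。', wakati=True)))
# => ['私', 'は', 'りんご', 'を', '一つ', '食べる', '。'] (or similar)
```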
Here, so that markovify can read the result, we walk through the morphemes one by one, joining them with half-width spaces and inserting a line break at the end of each sentence (the way an English text looks). The reference article splits sentences only on the full stop (。), but this time I also split on '!' and '?'. How to handle the comma (、) is a matter of taste; here it is simply treated as a word of its own. If you want output closer to English, where no space precedes a comma, rewrite the if statements as follows:
```python
if i+1 < len(result):
    if result[i] != '。' and result[i] != '!' and result[i] != '?' and result[i+1] != '、':
        splitted_text += ' '
    if result[i] == '。' or result[i] == '!' or result[i] == '?':
        splitted_text += '\n'
else:
    if result[i] != '。' and result[i] != '!' and result[i] != '?':
        splitted_text += ' '
    if result[i] == '。' or result[i] == '!' or result[i] == '?':
        splitted_text += '\n'
```
If you rewrite it like that, it should work. Probably.
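To illustrate the difference with a comma-containing sentence of my own (assuming the split() defined above):

```python
print(split('私は、りんごを食べる。'))
# Original version:       私 は 、 りんご を 食べる 。
# English-style variant:  私 は、 りんご を 食べる 。
```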
Next, the textGen part:

```python
def textGen(file):
    f = open(file, 'r', encoding="utf-8")
    text = f.read()
    sentence = None
    while sentence is None:  # None may be returned depending on the material, so retry
        # Split the text into a form markovify can process
        splitted_text = split(text)
        # Build the model
        text_model = markovify.NewlineText(splitted_text, state_size=3)
        # Generate a sentence from the model
        sentence = text_model.make_sentence()
    # Save the learned data
    with open('learned_data.json', 'w') as f:
        f.write(text_model.to_json())
    # Join the words back together and return a single string
    return ''.join(sentence.split())
```
This time I wrote it to generate from sources other than Aozora Bunko, so I skipped the Aozora-specific preprocessing and simply read the file. Beyond that it is the standard markovify procedure and roughly follows Reference 1.
Also, make_sentence sometimes returns None, depending on state_size and the amount of source text (markovify issues [#96](https://github.com/jsvine/markovify/issues/96) and [#22](https://github.com/jsvine/markovify/issues/22)), so I simply loop until it returns something that isn't None. With a reasonable amount of text this shouldn't turn into an infinite loop.
You can also mitigate this to some extent by specifying the number of attempts with make_sentence's keyword argument tries (code below).
```python
# Split the text into a form markovify can process
splitted_text = split(text)
# Build the model
text_model = markovify.NewlineText(splitted_text, state_size=3)
# Generate a sentence from the model, retrying up to 100 times
sentence = text_model.make_sentence(tries=100)
```
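Since textGen also writes the model out to learned_data.json, a later run can skip re-learning entirely; a minimal sketch based on the commented-out block in the full listing:

```python
import markovify

# Load the model saved by textGen instead of learning it again
with open('learned_data.json') as f:
    text_model = markovify.NewlineText.from_json(f.read())

sentence = text_model.make_sentence(tries=100)
if sentence is not None:
    print(''.join(sentence.split()))
```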
For testing, I took Botchan from Aozora Bunko, removed the ruby annotations with [delruby.exe](https://www.aokids.jp/others/delruby.html), stripped out the unnecessary parts, and generated sentences from the result.
- If you think that there is a shortage of one person, everyone in the world is like this student, the one who does not like insects is kind and elegant, but there is no choice but to come with a poor samurai, so make a loud voice I'll put it out.
It seems that the purpose has been achieved.
Janome and MeCab offer similar functionality as Japanese morphological analyzers, so only minor rewriting was needed to get this working. It should also come in handy when building bots and the like.