[Python3] Automatic sentence generation using janome and markovify

Introduction

Using Janome, a Japanese morphological analysis engine written in pure Python, together with the Markov chain library markovify, we learn from Japanese text and automatically generate Japanese sentences.

This is essentially based on Reference 1, "Learn Japanese sentences with markovify and generate sentences by Markov chain".

It's been a while since I last touched Python, so the code is fairly loose. Please bear that in mind.

Background and purpose

There is actually plenty of precedent for learning and automatically generating Japanese sentences with markovify, but most of it uses MeCab, which takes some time and effort to install on Windows and in some virtual environments (such as Heroku).

In that respect, Janome is easy to install even on Windows, and there is an article that actually uses it for automatic generation ([Reference 3](https://omedstu.jimdofree.com/2018/05/06/%E3%83%9E%E3%83%AB%E3%82%B3%E3%83%95%E9%80%A3%E9%8E%96%E3%81%AB%E3%82%88%E3%82%8B%E6%96%87%E6%9B%B8%E7%94%9F%E6%88%90/)), but I couldn't find a combination of markovify and Janome (too niche). markovify makes it easier to improve the naturalness of the generated sentences, so I wanted to use it if possible.

So this time I got sentence generation working with the two combined, and I'm posting it here as a memo. If you want to combine them yourself, you could probably rewrite things on your own, so this really is just a memo...

Preparation

Both janome and markovify can be installed with pip install (pip3 depending on your environment).
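For example:

    pip install janome markovify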

Code

First, the full code. The textGen part is mostly adapted from Reference 1.

janomeGen.py


#!/usr/bin/env python3
# -*- coding:utf-8 -*-
from janome.tokenizer import Tokenizer
import markovify

def split(text):
    # Remove line breaks and spaces, and replace characters that are
    # problematic for markovify with harmless full-width equivalents
    table = str.maketrans({
        '\n': '',
        '\r': '',
        '(': '（',
        ')': '）',
        '[': '［',
        ']': '］',
        '"': '”',
        "'": "’",
    })
    text = text.translate(table)
    t = Tokenizer()
    # wakati=True returns plain surface forms (a generator in janome 0.4+)
    result = list(t.tokenize(text, wakati=True))
    # Look at each morpheme, insert a half-width space between morphemes,
    # and insert a line break at the end of each sentence
    splitted_text = ""
    for i in range(len(result)):
        splitted_text += result[i]
        if result[i] != '。' and result[i] != '!' and result[i] != '?':
            splitted_text += ' '
        if result[i] == '。' or result[i] == '!' or result[i] == '?':
            splitted_text += '\n'
    return splitted_text

def textGen(file):
    with open(file, 'r', encoding="utf-8") as f:
        text = f.read()
    sentence = None
    while sentence is None:  # make_sentence may return None depending on the material, so retry
        # Split the text into a form markovify can process
        splitted_text = split(text)

        # Build the model
        text_model = markovify.NewlineText(splitted_text, state_size=3)

        # Generate a sentence from the model
        sentence = text_model.make_sentence()

    # Save the training data
    with open('learned_data.json', 'w') as f:
        f.write(text_model.to_json())

    # When reusing the saved data:
    """
    with open('learned_data.json') as f:
        text_model = markovify.NewlineText.from_json(f.read())
    """

    # Join the tokens back together and return a single string
    return ''.join(sentence.split())
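A minimal usage sketch (sample.txt is a hypothetical placeholder for your own UTF-8 text file, not something from the original article):

    print(textGen('sample.txt'))  # generates one sentence and saves learned_data.json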

Below, we will look at them in order.

Text preparation

    table = str.maketrans({
        '\n': '',
        '\r': '',
        '(': '（',
        ')': '）',
        '[': '［',
        ']': '］',
        '"': '”',
        "'": "’",
    })
    text = text.translate(table)

Replace some characters so that markovify can read the text. Line breaks and spaces are used to indicate sentence breaks and word breaks respectively, so remove the existing ones first. (Japanese text with English sentences mixed in cannot be handled well this way, but that is ignored here.)

Also, replace the "bad characters", which interfere with markovify's operation, with harmless double-byte equivalents. (Since markovify v0.7.2 you can tell markovify.Text whether to ignore sentences containing bad characters via the well_formed parameter, but it seems a waste to throw away whole sentences, so the characters are replaced in advance instead.)
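For reference, here is a sketch of what relying on well_formed instead would look like; this is not what this article's code does:

    # Sketch: let markovify itself filter ill-formed sentences (v0.7.2+) instead
    # of replacing the characters beforehand; with well_formed=True (the default),
    # whole sentences containing bad characters are silently discarded.
    text_model = markovify.NewlineText(splitted_text, state_size=3, well_formed=True)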

Split text

    t = Tokenizer()
    result = list(t.tokenize(text, wakati=True))
    splitted_text = ""
    for i in range(len(result)):
        splitted_text += result[i]
        if result[i] != '。' and result[i] != '!' and result[i] != '?':
            splitted_text += ' '
        if result[i] == '。' or result[i] == '!' or result[i] == '?':
            splitted_text += '\n'

What this does is almost the same as Reference 1, so that article is the more accurate reference.

Tokenizing with Janome's Tokenizer like this returns the text split into morphemes (wrap it in list() on janome 0.4+, where wakati=True yields a generator). For example, "私はりんごをひとつ食べる。" ("I eat an apple.") comes back as ['私', 'は', 'りんご', 'を', 'ひとつ', '食べる', '。']. If you only want the surface forms of the morphemes, this is simpler and more convenient than MeCab.
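As a quick check (the exact split may vary slightly with janome's dictionary version):

    from janome.tokenizer import Tokenizer

    t = Tokenizer()
    print(list(t.tokenize('私はりんごをひとつ食べる。', wakati=True)))
    # ['私', 'は', 'りんご', 'を', 'ひとつ', '食べる', '。']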

Here, we walk through the morphemes one by one so that markovify can read them, putting a half-width space between morphemes and a line break at the end of each sentence (as in English text). The reference article splits only on the full stop '。', but this time I also split on '!' and '?'. How to handle the comma '、' is a matter of taste; here it is treated as a word of its own. If you want something closer to English (no space before a comma), replace the if statements with the following:

        if i+1 < len(result):
            # Skip the space when the next morpheme is the comma '、'
            if result[i] != '。' and result[i] != '!' and result[i] != '?' and result[i+1] != '、':
                splitted_text += ' '
            if result[i] == '。' or result[i] == '!' or result[i] == '?':
                splitted_text += '\n'
        else:
            # Last morpheme: there is no next token to check
            if result[i] != '。' and result[i] != '!' and result[i] != '?':
                splitted_text += ' '
            if result[i] == '。' or result[i] == '!' or result[i] == '?':
                splitted_text += '\n'

If you rewrite it along those lines, it should work. Probably.

Sentence generation

def textGen(file):
    with open(file, 'r', encoding="utf-8") as f:
        text = f.read()
    sentence = None
    while sentence is None:  # make_sentence may return None depending on the material, so retry
        # Split the text into a form markovify can process
        splitted_text = split(text)

        # Build the model
        text_model = markovify.NewlineText(splitted_text, state_size=3)

        # Generate a sentence from the model
        sentence = text_model.make_sentence()

    # Save the training data
    with open('learned_data.json', 'w') as f:
        f.write(text_model.to_json())
    # Join the tokens back together and return a single string
    return ''.join(sentence.split())

This time I wrote it to generate from sources other than Aozora Bunko, so I skipped the Aozora-specific preprocessing and simply read the file. Other than that, it roughly follows Reference 1, since this is the standard markovify procedure.

Also, make_sentence sometimes returns None depending on state_size and the amount of source text (markovify [Issue #96](https://github.com/jsvine/markovify/issues/96), [Issue #22](https://github.com/jsvine/markovify/issues/22)), so I simply loop until it returns something other than None. With a reasonable amount of text, this should not turn into an infinite loop.
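If you would rather guard against that explicitly, a bounded retry is one possible sketch (the cap of 100 is an arbitrary choice of mine, not from the original code):

    # Sketch: build the model once and retry generation a bounded number of times
    text_model = markovify.NewlineText(split(text), state_size=3)
    sentence = None
    for _ in range(100):  # arbitrary retry cap instead of an unbounded while loop
        sentence = text_model.make_sentence()
        if sentence is not None:
            break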

In addition, you can deal with this to some extent by specifying the number of attempts with make_sentence's keyword argument tries (code below):

    # Split the text into a form markovify can process
    splitted_text = split(text)

    # Build the model
    text_model = markovify.NewlineText(splitted_text, state_size=3)

    # Generate a sentence from the model
    sentence = text_model.make_sentence(tries=100)

Generation result

For testing, I took Botchan from Aozora Bunko, removed the ruby annotations with [delruby.exe](https://www.aokids.jp/others/delruby.html), stripped the unneeded parts, and generated text from the result.

  • If you think that there is a shortage of one person, everyone in the world is like this student, the one who does not like insects is kind and elegant, but there is no choice but to come with a poor samurai, so make a loud voice I'll put it out.

It seems that the purpose has been achieved.

Afterword

Janome and MeCab offer similar functionality in the sense that both perform Japanese morphological analysis, so I was able to get this working with only minor rewriting. It seems like it could be useful for building a bot.

References

  1. Learn Japanese sentences with markovify and generate sentences by Markov chain
  2. [Python] How to learn sentences and generate them automatically using MeCab and the Markov chain library markovify
  3. [Sentence generation by Markov chains - Salad bowl of knowledge](https://omedstu.jimdofree.com/2018/05/06/%E3%83%9E%E3%83%AB%E3%82%B3%E3%83%95%E9%80%A3%E9%8E%96%E3%81%AB%E3%82%88%E3%82%8B%E6%96%87%E6%9B%B8%E7%94%9F%E6%88%90/)
