Using Janome, a Japanese morphological analysis engine written in pure Python, together with the Markov chain library markovify, we learn from Japanese sentences and automatically generate new ones.
This is largely based on the article "Learn Japanese sentences with markovify and generate sentences by Markov chain" (Reference 1).
It's been a while since I last touched Python, so the code is fairly loose. Please bear with it.
There are existing examples of learning from Japanese text and automatically generating sentences with markovify, but most of them use MeCab, which takes some effort to set up on Windows and in certain virtual environments (e.g. Heroku).
In that respect, Janome is easy to install even on Windows, and there is an article that actually uses it for automatic generation ([document generation by Markov chains](https://omedstu.jimdofree.com/2018/05/06/%E3%83%9E%E3%83%AB%E3%82%B3%E3%83%95%E9%80%A3%E9%8E%96%E3%81%AB%E3%82%88%E3%82%8B%E6%96%87%E6%9B%B8%E7%94%9F%E6%88%90/)), but I couldn't find the combination of markovify and Janome (too niche). markovify makes it easier to improve the naturalness of the generated sentences, so I wanted to use it if possible.
So this time I got the two working together, and I'm posting the result as a memo. If you wanted to combine them yourself you could probably rewrite the code on your own, so this really is just a memo...
Both janome and markovify can be installed with `pip install janome markovify` (`pip3` depending on your environment).
First, the full script. More than half of the textGen part is adapted from Reference 1.
janomeGen.py

```python
#!/usr/bin/env python3
# -*- coding:utf-8 -*-

from janome.tokenizer import Tokenizer
import markovify


def split(text):
    # Remove line breaks, and replace problematic characters
    table = str.maketrans({
        '\n': '',
        '\r': '',
        '(': '（',
        ')': '）',
        '[': '［',
        ']': '］',
        '"': '”',
        "'": "’",
    })
    text = text.translate(table)
    t = Tokenizer()
    # wakati=True yields surface forms only; list() also covers Janome 0.4+,
    # where tokenize returns a generator
    result = list(t.tokenize(text, wakati=True))
    # Look at each morpheme, insert a half-width space between them,
    # and insert a line break at the end of each sentence
    splitted_text = ""
    for i in range(len(result)):
        splitted_text += result[i]
        if result[i] != '。' and result[i] != '!' and result[i] != '?':
            splitted_text += ' '
        if result[i] == '。' or result[i] == '!' or result[i] == '?':
            splitted_text += '\n'
    return splitted_text


def textGen(file):
    f = open(file, 'r', encoding="utf-8")
    text = f.read()
    sentence = None
    while sentence is None:  # empty results can occur depending on the material, so retry
        # Split the text into a form markovify can process
        splitted_text = split(text)
        # Build the model
        text_model = markovify.NewlineText(splitted_text, state_size=3)
        # Generate a sentence from the model
        sentence = text_model.make_sentence()
    # Save the learned data
    with open('learned_data.json', 'w') as f:
        f.write(text_model.to_json())
    # When reusing the data:
    """
    with open('learned_data.json') as f:
        text_model = markovify.NewlineText.from_json(f.read())
    """
    # Join the words back together and return a single string
    return ''.join(sentence.split())
```
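Calling it is as simple as this (a minimal sketch; `text.txt` is a placeholder for whatever UTF-8 plain-text file you want to learn from):

```python
if __name__ == '__main__':
    # 'text.txt' is a hypothetical filename; any UTF-8 plain-text file works
    print(textGen('text.txt'))
```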
Below, let's look at each part in order.
```python
table = str.maketrans({
    '\n': '',
    '\r': '',
    '(': '（',
    ')': '）',
    '[': '［',
    ']': '］',
    '"': '”',
    "'": "’",
})
text = text.translate(table)
```
This replaces some characters so that markovify can read the text. Line breaks and spaces indicate sentence breaks and word breaks respectively, so any already in the source are removed first (Japanese text with English sentences mixed in can't be handled well this way, but that is ignored here).
Also, the "bad characters" that interfere with markovify's operation are replaced with harmless full-width equivalents. (Since markovify v0.7.2 you can tell markovify.Text whether to ignore sentences containing bad characters via the well_formed parameter, but throwing away a whole sentence is wasteful, so they are replaced in advance.)
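For reference, a minimal sketch of the well_formed route just mentioned (not what this script does; whole sentences containing bad characters would simply be discarded):

```python
# Alternative: let markovify itself reject sentences containing bad characters
# (well_formed is available since markovify v0.7.2)
text_model = markovify.NewlineText(splitted_text, state_size=3, well_formed=True)
```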
```python
t = Tokenizer()
result = list(t.tokenize(text, wakati=True))

splitted_text = ""
for i in range(len(result)):
    splitted_text += result[i]
    if result[i] != '。' and result[i] != '!' and result[i] != '?':
        splitted_text += ' '
    if result[i] == '。' or result[i] == '!' or result[i] == '?':
        splitted_text += '\n'
```
What this does is almost the same as in Reference 1, so it is more accurate to refer to that article.
Tokenizing with Janome's Tokenizer like this returns the text split into morphemes. For example, 「私はりんごを一つ食べる。」 ("I eat one apple.") comes back as something like `['私', 'は', 'りんご', 'を', '一つ', '食べる', '。']`. If all you want is the surface forms, this is simpler and more convenient than MeCab.
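A quick way to check this for yourself (a minimal sketch; the sentence is my own example, and the exact tokens depend on the dictionary):

```python
from janome.tokenizer import Tokenizer

t = Tokenizer()
# list() also covers Janome 0.4+, where tokenize returns a generator
print(list(t.tokenize('私はりんごを一つ食べる。', wakati=True)))
# => ['私', 'は', 'りんご', 'を', '一つ', '食べる', '。'] (or similar)
```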
Here, so that markovify can read the result, we walk through the morphemes one by one, joining them with half-width spaces and inserting a line break at the end of each sentence (the way an English text looks). The reference article splits sentences only on the full stop (。), but this time I also split on '!' and '?'. How to handle the comma (、) is a matter of taste; here it is simply treated as a word of its own. If you want output closer to English, where no space precedes a comma, rewrite the if statements as follows:
```python
if i+1 < len(result):
    if result[i] != '。' and result[i] != '!' and result[i] != '?' and result[i+1] != '、':
        splitted_text += ' '
    if result[i] == '。' or result[i] == '!' or result[i] == '?':
        splitted_text += '\n'
else:
    if result[i] != '。' and result[i] != '!' and result[i] != '?':
        splitted_text += ' '
    if result[i] == '。' or result[i] == '!' or result[i] == '?':
        splitted_text += '\n'
```
If you rewrite it like that, it should work. Probably.
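To illustrate the difference with a comma-containing sentence of my own (assuming the split() defined above):

```python
print(split('私は、りんごを食べる。'))
# Original version:       私 は 、 りんご を 食べる 。
# English-style variant:  私 は、 りんご を 食べる 。
```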
Next, the textGen part:

```python
def textGen(file):
    f = open(file, 'r', encoding="utf-8")
    text = f.read()
    sentence = None
    while sentence is None:  # None may be returned depending on the material, so retry
        # Split the text into a form markovify can process
        splitted_text = split(text)
        # Build the model
        text_model = markovify.NewlineText(splitted_text, state_size=3)
        # Generate a sentence from the model
        sentence = text_model.make_sentence()
    # Save the learned data
    with open('learned_data.json', 'w') as f:
        f.write(text_model.to_json())
    # Join the words back together and return a single string
    return ''.join(sentence.split())
```
This time I wrote it to generate from sources other than Aozora Bunko, so I skipped the Aozora-specific preprocessing and simply read the file. Beyond that it is the standard markovify procedure and roughly follows Reference 1.
Also, make_sentence sometimes returns None, depending on state_size and the amount of source text (markovify issues [#96](https://github.com/jsvine/markovify/issues/96) and [#22](https://github.com/jsvine/markovify/issues/22)), so I simply loop until it returns something that isn't None. With a reasonable amount of text this shouldn't turn into an infinite loop.
You can also mitigate this to some extent by specifying the number of attempts with make_sentence's keyword argument tries (code below).
```python
# Split the text into a form markovify can process
splitted_text = split(text)
# Build the model
text_model = markovify.NewlineText(splitted_text, state_size=3)
# Generate a sentence from the model, retrying up to 100 times
sentence = text_model.make_sentence(tries=100)
```
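Since textGen also writes the model out to learned_data.json, a later run can skip re-learning entirely; a minimal sketch based on the commented-out block in the full listing:

```python
import markovify

# Load the model saved by textGen instead of learning it again
with open('learned_data.json') as f:
    text_model = markovify.NewlineText.from_json(f.read())

sentence = text_model.make_sentence(tries=100)
if sentence is not None:
    print(''.join(sentence.split()))
```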
For testing, I took Botchan from Aozora Bunko, removed the ruby annotations with [delruby.exe](https://www.aokids.jp/others/delruby.html), stripped out the unnecessary parts, and generated sentences from the result.
- If you think that there is a shortage of one person, everyone in the world is like this student, the one who does not like insects is kind and elegant, but there is no choice but to come with a poor samurai, so make a loud voice I'll put it out.
It seems that the purpose has been achieved.
Janome and MeCab offer similar functionality as Japanese morphological analyzers, so only minor rewriting was needed to get this working. It should also come in handy when building bots and the like.