There have been many intriguing applications of deep learning in recent years, and one of them is automatic sentence generation. It is fun to generate sentences in the style of a particular author's books, or to learn lyrics and create new songs. I wanted to generate some sentences myself, but books and songs have already been done many times, so I thought: "Wouldn't it be interesting to build a JoJo quote generator by learning JoJo's world view?" I knew that the main reason this is rarely attempted is the lack of training data, but I gave it a try anyway, so I will share the results.
**- Do you remember how many breads you have ever eaten?**
**- But I refuse.**
Even those who say "I don't know JoJo" have probably heard the lines above at least once. JoJo is "JoJo's Bizarre Adventure", a shōnen manga that Hirohiko Araki has serialized since 1986, a story with a unique world view built around the theme of a "hymn to humanity" ([Wikipedia](https://ja.wikipedia.org/wiki/ジョジョの奇妙な冒険)). Because of its distinctive style of expression, I feel it is the manga whose lines and sound effects ("Zukyuuun", "Memetaa", and so on) are quoted most often on the net.
LSTM

The algorithm used this time is LSTM (Long Short-Term Memory). LSTM is a kind of RNN (Recurrent Neural Network), a type of neural network that can handle time-series data. By introducing three gates called the input gate, the output gate, and the forget gate, it can keep information in memory far longer than earlier RNNs could, which lets it handle long sequences such as sentences. For example, to learn the sentence "Do you remember how many breads you have ever eaten?", you extract a window of N characters as a training sample and use the character that immediately follows as its label; then you shift the window (by 3 characters, say) to create the next sample, and repeat this over the whole text. As this example shows, sentence generation needs a huge amount of text as training data. With only a small set of quotes, character-level learning produced broken Japanese, so I trained word by word instead of character by character: a window such as "you have eaten until now" gets the next word, "bread", as its label. For details on LSTM, refer to this article. LSTM is provided by frameworks such as Keras, so you can easily build a model without implementing it yourself.
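To make the windowing concrete, here is a minimal sketch (my own illustration, not the article's code, using a hypothetical English word split rather than real MeCab output) of how sample/label pairs are cut out of one quote:

```python
# Hypothetical word-split version of the bread quote (illustration only).
tokens = ['Do', 'you', 'remember', 'how', 'many', 'breads', 'you', 'have', 'eaten']
maxlen = 3   # words per training sample
step = 1     # how far the window slides between samples

samples, labels = [], []
for i in range(0, len(tokens) - maxlen, step):
    samples.append(tokens[i: i + maxlen])   # e.g. ['Do', 'you', 'remember']
    labels.append(tokens[i + maxlen])       # the word that follows: 'how'

for s, l in zip(samples, labels):
    print(s, '->', l)
```

The real code later in the article does exactly this with maxlen = 3 and step = 1, only with MeCab-split Japanese words.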
Let's implement it. The code is largely based on lstm_text_generation.py published by the Keras team. The execution environment is Google Colaboratory.
First, import the required libraries.
Library import
from __future__ import print_function  # must come before any other statement
import os
import re
import bs4
import requests
from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.optimizers import Adam
from keras.utils.data_utils import get_file
import numpy as np
import random
import sys
import io
import matplotlib.pyplot as plt

!apt install aptitude
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3==0.7
import MeCab
This time we fetch the data from a website and run morphological analysis on it, so a few extra packages are installed on top of the usual machine learning libraries. The last four lines set up MeCab, an open-source morphological analysis tool that splits Japanese sentences into words.
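Just to show what MeCab does (a minimal sketch, separate from the article's code, assuming the installation above succeeded), here is how it might split the bread quote into words; the exact split depends on the dictionary:

```python
import MeCab

# '-Owakati' prints the words separated by spaces; the training code below
# uses '-Ochasen' together with parseToNode instead.
tagger = MeCab.Tagger('-Owakati')
print(tagger.parse('おまえは今まで食ったパンの枚数をおぼえているのか').strip())
# -> roughly: おまえ は 今 まで 食っ た パン の 枚数 を おぼえ て いる の か
```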
Next, get the quotes by scraping. This time, I took them from this website.
Get Quotations
'''Get the JoJo quotes'''
with open('jojo.txt', 'a') as f:
    url = 'http://kajipon.sakura.ne.jp/art/jojo9.htm'
    res = requests.get(url)
    res.raise_for_status()

    '''Parse the fetched HTML and keep only the quote lines'''
    soup = bs4.BeautifulSoup(res.content, 'html.parser')
    soup = soup.find_all(text=re.compile("「.+?」+（.+?）+第"))

    for s in soup:
        txt = s.__str__()
        '''Remove extra parts'''
        txt = re.sub('第.+?巻', '', txt)
        txt = re.sub('（.+?）', '', txt)
        txt = re.sub('※.+', '', txt)
        txt = txt.replace('「', '')
        txt = txt.replace('」', '')
        print(txt)
        f.write(txt)
The execution result looks like this (only the first 20 lines).
Nah! What are you doing! Yuru-san!
As expected, Dio! Do what we can't do! It feels numb there! I yearn for it!
Dioooooooo! You are! Until you cry! I won't stop hitting you!
Dio! If your stupid kiss was aiming for this, it would have worked better than expected!
The motive for fighting is different from you guys!
Don't get rid of it! Rich Ama-chan!
I'm quitting humans! Jojo! !!
No! That father's spirit is ... his son Jonathan Joestar has inherited it! It will be his strong will, pride, and future! !!
Damn it! Let's get caught up in the crime of trespassing, and I'll enter this room and celebrate it to the fullest! !!
Speedwagon is leaving cool
Oh dear! What! Me with a broken arm! I always support you
Is it fate ... Maybe it's fate that people meet ...
Uhohohohohoho!
Remove the joints and extend your arms! The severe pain is softened by ripple energy!
Papau! Pow Pow! Ripples cutter! !!
Do you remember how many breads you have ever eaten?
"Ripples"? What is "breathing"? If you blow hoo hoo ... It suits me to blow even fanfare for me!
Surprising! That's hair!
Think the other way around. I think it's okay to give it
Shake your heart! Heat to the point of burning out! !! Oh oh, I'll chop the beat of blood! Mountain-blown ripple sprint! !!
Read the jojo.txt file created above and build the training data. Unlike ordinary prose or lyrics, the quotes are basically independent of each other, so there is no time series running across them. Therefore, the training data is created line by line, so that the end of one quote never gets connected to the start of the next. First, let's see how many quotes there are.
Read line by line
'''File reading'''
path = './jojo.txt'

'''Get the lines one by one'''
nline = 0
with io.open(path, encoding='utf-8') as f:
    lines = f.readlines()
for line in lines:
    nline += 1
print('Number of lines:', nline)
result
Number of lines: 283
There were 283 quotes in all. Next, decompose every quote into words and fill three containers: one for every word that appears (corpus), one for the word sequences that will be the training data (sentences), and one for their labels (next_chars). Each training sample is 3 words long, and the window is shifted by 1 word to create the next sample: "you are now" → "until", then "are now until" → "eat", and so on (word for word in the Japanese original).
python
'''Set the sentence size and interval to learn'''
corpus = []
sentences = []
next_chars = []
maxlen = 3
step = 1

mecab = MeCab.Tagger('-Ochasen')
mecab.parse('')
for line in lines:
    '''Get part of speech line by line'''
    corpusl = []
    nodel = mecab.parseToNode(line)
    while nodel:
        corpusl.append(nodel.surface)
        corpus.append(nodel.surface)
        nodel = nodel.next
    '''Generate learning sentences and teacher labels'''
    for i in range(0, len(corpusl) - maxlen, step):
        sentences.append(corpusl[i: i + maxlen])
        next_chars.append(corpusl[i + maxlen])

print('Number of sentences', len(sentences))
print('Number of words: ', len(corpus))

'''Generate a corpus with duplicate words removed'''
chars = set(corpus)
print('Corpus size: ', len(chars))
result
Number of sentences 8136
Number of words: 8984
Corpus size: 1883
For how to use MeCab, I referred to this article. corpusl is a temporary corpus that holds the words within a single line. The total number of distinct words this time is 1883, and the generated sentences are built from these words. A computer cannot handle words as they are, so create dictionaries that map word → index and index → word. Finally, create the training data x and the teacher labels y as one-hot vectors.
python
'''Create dictionaries mapping words → numbers and numbers → words'''
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

'''Vectorization'''
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
Make an LSTM model. The number of hidden layer units is 128, the cost function is categorical cross entropy, and the optimization method is Adam.
LSTM
'''LSTM model creation'''
model = Sequential()
model.add(LSTM(128, input_shape = (maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))
optimizer = Adam()
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
model.summary()
result
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm_1 (LSTM) (None, 128) 1030144
_________________________________________________________________
dense_1 (Dense) (None, 1883) 242907
=================================================================
Total params: 1,273,051
Trainable params: 1,273,051
Non-trainable params: 0
_________________________________________________________________
Now that we are ready, let's train the model.
Learning
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

'''Function to display the lines generated at the end of each epoch'''
def on_epoch_end(epoch, _):
    print()
    print('----- %d epoch:' % epoch)

    start_index = random.randint(0, len(corpus) - maxlen - 1)
    for diversity in [8.0, 16.0, 32.0, 64.0, 128.0, 256, 512, 1024]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = corpus[start_index: start_index + maxlen]
        generated += ''.join(sentence)
        print('-----seed"' + ''.join(sentence) + '"Generated by:')
        sys.stdout.write(generated)

        for i in range(10):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            # slide the seed window forward by one word
            sentence = sentence[1:] + [next_char]
            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()
sample is a function that samples an index from the probability distribution; the higher the temperature, the closer the distribution gets to uniform and the more random the predictions become (https://www.freecodecamp.org/news/applied-introduction-to-lstms-for-text-generation-380158b29fb3/). The original example uses low values such as 0.2 to 1.2, but on this data the model kept outputting the same words, probably because the dataset is too small (it plays it safe with the most frequent words), so I use much larger values to force it to pick a wider variety of words. At each epoch, 3 words are selected at random as a seed, the next 10 words are predicted, and the resulting sentence is displayed.
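To see what the temperature (diversity) actually does to the softmax output, here is a small self-contained sketch with made-up probabilities; reweight() repeats the rescaling done inside sample() above.

```python
import numpy as np

def reweight(preds, temperature):
    # the same rescaling sample() applies before drawing a word
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    return exp_preds / np.sum(exp_preds)

probs = [0.70, 0.20, 0.05, 0.05]   # made-up softmax output over 4 words
for t in [0.2, 1.0, 8.0]:
    print(t, np.round(reweight(probs, t), 3))
# low temperature -> almost all probability mass on the most likely word
# high temperature -> nearly uniform, so rare words get sampled as well
```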
Pass on_epoch_end defined above to fit as a callback.
python
print_callback = LambdaCallback(on_epoch_end = on_epoch_end)
history = model.fit(x, y, batch_size=128, epochs=60, callbacks=[print_callback])
The execution result is as follows. (Excerpt)
Execution result after 5 epochs
~~~
Epoch 5/60
8136/8136 [==============================] - 1s 98us/step - loss: 5.5118
-----Dialogue generated after 4 epoch:
----- diversity: 8.0
-----seed"Gah!"Generated by:
Gah! It ’s refreshing, everyone burns out, but I ’m glad to hear that Germany
----- diversity: 16.0
-----seed"Gah!"Generated by:
Gah! It feels like I'm going to call you
----- diversity: 32.0
-----seed"Gah!"Generated by:
Gah! It's possible to look for a monkey, but it's shining
----- diversity: 64.0
-----seed"Gah!"Generated by:
Gah! Just before you know it, annoying progressive rock band forgive me
----- diversity: 128.0
-----seed"Gah!"Generated by:
Gah! Difficult to come, I'm sorry to bury heraherasanzy, forgive me! Eh
----- diversity: 256
-----seed"Gah!"Generated by:
Gah! Disaster Zamamiro Mt. Fuji Sea Venice Jornot World Disaster
----- diversity: 512
-----seed"Gah!"Generated by:
Gah! Respect First Mijimeho Courage W Isagi
----- diversity: 1024
-----seed"Gah!"Generated by:
Gah! Combat F Shark Slapstick Right Love Betting Maybe
~~~
The individual words feel JoJo-like, but the output is incoherent and clearly something is off ...
Execution result after 33 epochs
~~~
Epoch 33/60
8136/8136 [==============================] - 1s 98us/step - loss: 2.3640
-----Dialogue generated after 32 epoch:
----- diversity: 8.0
-----seed"Grandfather Joseph"Generated by:
Grandfather Joseph Cho Story Cho Ne Cho Stars Entering First Love I
----- diversity: 16.0
-----seed"Grandfather Joseph"Generated by:
My grandfather Joseph's brain has a cute god's life in the land.
----- diversity: 32.0
-----seed"Grandfather Joseph"Generated by:
Grandfather Joseph, warrior aaaaaaa--.. of!
----- diversity: 64.0
-----seed"Grandfather Joseph"Generated by:
You were shot by your grandfather Joseph, left Lakai handsome oooohehehehe
----- diversity: 128.0
-----seed"Grandfather Joseph"Generated by:
My grandfather Joseph remembers my life Kuranenza formed
----- diversity: 256
-----seed"Grandfather Joseph"Generated by:
Grandfather's Joseph Pig Joseph Being Protected Here Jojooooooo Desert Suka
----- diversity: 512
-----seed"Grandfather Joseph"Generated by:
Grandfather Joseph Do it with a real face Sui Eee Eee Eee A pharmacy secret that goes in the way
----- diversity: 1024
-----seed"Grandfather Joseph"Generated by:
My grandfather Joseph Teme wins and shakes the desert I'm having a crush on the black jam
~~~
It does feel as if the output has become slightly more Japanese. Even so, there are some terrifying phrases, like "my grandfather Joseph's brain ...".
Execution result after 60 epochs
~~~
Epoch 60/60
8136/8136 [==============================] - 1s 104us/step - loss: 0.7271
-----Dialogue generated after 59 epoch:
----- diversity: 8.0
-----seed"Problem until just before"Generated by:
Problem until just before O Scary amount Cherry received Shooting over irrelevant hurt
----- diversity: 16.0
-----seed"Problem until just before"Generated by:
Problem until just before Jojo, always strengthen mysterious king centimeter beauty relationship Asahi
----- diversity: 32.0
-----seed"Problem until just before"Generated by:
Until just before the problem Passenger guts convinced big wheels useful old daughter to stand on your side
----- diversity: 64.0
-----seed"Problem until just before"Generated by:
Until just before, it seems to be a problem.
----- diversity: 128.0
-----seed"Problem until just before"Generated by:
Until just before the problem I don't care until I take away my feelings
----- diversity: 256
-----seed"Problem until just before"Generated by:
Until just before the problem Mother Coke 216 What is the Gedged person Papau pedigree Honor lined up
----- diversity: 512
-----seed"Problem until just before"Generated by:
Make a problem needle decision until just before Star Valkyrie Search for it
----- diversity: 1024
-----seed"Problem until just before"Generated by:
Until just before the problem Happiness child understands the old man and get a command
~~~

It reads like the string of words that floats through your head when you cannot fall asleep because of a cold. I ran 60 epochs, but I never quite got sentences that were not strange as Japanese; given the number of words and the amount of data, that is probably to be expected. Still, phrases like "Papau pedigree" and "Star Valkyrie" make you wonder whether such things might actually exist, which was great fun as a JoJo fan. There are many points that could be improved, but improving them would be hard, so I will stop here; it was a reasonable result for a quick trial. Sentence generation really is fun. I put the code on GitHub, so if you are interested, please play with it. Finally, plot the cost function.
Cost function plot
'''loss visualization'''
plt.figure(figsize=(10,7))
loss = history.history['loss']
plt.plot(loss, color='b', linewidth=3)
plt.tick_params(labelsize=18)
plt.ylabel('loss', fontsize=20)
plt.xlabel('epoch', fontsize=20)
plt.legend(['training'], loc='best', fontsize=20)
plt.show()