[PYTHON] Sentence generation with GRU (keras)

Introduction

Last time, I wrote an article in which I tried to generate a report automatically using Markov chains. Because a Markov chain only looks at the immediately preceding words, the resulting sentences ignored the overall flow of the text. So this time, I will try to generate sentences that take the context into account, using a technique called GRU.

What is GRU?

RNNs and LSTMs are probably the best-known approaches to sentence generation, but I chose the GRU because it takes less time to train than an LSTM. I don't know its internal structure in detail, so I won't describe it here, but in practice you can write it with almost the same code as an LSTM.
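As a minimal sketch of that point (the layer sizes and input shape below are placeholders, not the model built later in this article), switching from an LSTM to a GRU in Keras is a one-line change:


from keras.models import Sequential
from keras.layers.core import Dense
from keras.layers.recurrent import LSTM, GRU

#LSTM version
model_lstm = Sequential()
model_lstm.add(LSTM(128, input_shape=(10, 100)))
model_lstm.add(Dense(1))

#GRU version: only the recurrent layer changes
model_gru = Sequential()
model_gru.add(GRU(128, input_shape=(10, 100)))
model_gru.add(Dense(1))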

Implementation

Let's implement it now.

First, import the libraries we will use.


import re
import MeCab
import numpy as np
from sklearn.model_selection import train_test_split
from gensim.models import word2vec

from keras.layers.core import Activation
from keras.layers.core import Dense, Dropout
from keras.layers.core import Masking
from keras.models import Sequential
from keras.layers.recurrent import GRU
from keras.layers.embeddings import Embedding
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from keras.optimizers import RMSprop

Reading / morphological analysis

The data used this time is from Aozora Bunko. I used "The Fiend with Twenty Faces" (Kaijin Nijumenso) by Edogawa Ranpo.

Define a function that strips the unnecessary parts from this text and runs morphological analysis on it.


def convert(text, train):
    if train:
        #Delete unnecessary items such as ruby and the Aozora Bunko header/footer
        text = re.split(r"\-{5,}", text)[2]     #drop the header block delimited by dashed lines
        text = re.split(r"底本：", text)[0]      #drop everything after the bibliographic footer
        text = re.sub(r"《.*?》", "", text)      #ruby readings
        text = re.sub(r"［.*?］", "", text)      #editorial notes in full-width brackets
        text = re.sub(r"[｜]", "", text)         #ruby start markers
        text = re.sub("(\n){2,}", "\n", text)    #collapse consecutive blank lines
        text = re.sub(r"\u3000", "", text)       #remove full-width spaces

    #Morphological analysis (word segmentation)
    mecab = MeCab.Tagger("-Owakati")
    return mecab.parse(text).split()

Since I also want to use this function to preprocess the prediction input, I added a boolean argument called train so that the ruby markup and other Aozora Bunko notation are removed only when processing the training text. Now let's run this function.


path = "text_raw/kaijin_nijumenso.txt"
with open(path, "r") as f:
    text = f.read()
text_split = convert(text, True)

When read in like this, text begins with something like "The Fiend with Twenty Faces\nEdogawa Ranpo\n\n-----------..." (title, author, then the header block), and after conversion it is segmented into a word list along the lines of text_split = ['hashi', 'ga', 'ki', 'sono', 'koro', ',', ...].
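To sanity-check the preprocessing, you can peek at the raw text and the first few tokens (a minimal sketch; the exact output depends on your copy of the file):


#Quick check of the raw text and the segmented word list
print(text[:100])       #the title/author header of the Aozora Bunko file
print(text_split[:10])  #the first ten morphemes after cleaning and segmentation
print(len(text_split))  #total number of tokens available for training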

Creating a dataset

First, convert the morphologically analyzed words into numbers. Each word simply gets an arbitrary index, assigned as follows. Since 0 is reserved for masking (padded positions are treated as 0), the indices are offset by +1.


#Creating a dictionary of words and numbers
vocab_r = dict((key + 1, word) for (key, word) in enumerate(set(text_split)))
vocab = dict((word, key) for (key, word) in vocab_r.items())

text_num = list(map(lambda x:vocab[x], text_split))

Next, create the dataset for training. As described later, the input words are vectorized in the Embedding layer, so the input should have the shape (n_sample, n_seq) = (number of samples, number of context words). The correct labels will be converted to one-hot vectors later, so they should have the shape (n_sample,) = (number of samples,).


n_seq = 10 #How many words to learn
num_char = len(set(text_split)) + 1 #Word type (including masking)
n_sample = len(text_num) - n_seq #Number of input data

#Data creation
input_data = np.zeros((n_sample, n_seq))
correct_data = np.zeros((n_sample))

for i in range(n_sample):
    input_data[i] = text_num[i:i + n_seq]
    correct_data[i] = text_num[i + n_seq]

x_train, x_test, y_train, y_test = train_test_split(input_data, correct_data, test_size=0.1, shuffle=True)

Distributed representation of words

The input data is vectorized in the Embedding layer, but simply using one-hot vectors would make the data enormous, so we use a distributed representation instead. This vectorizes words while taking the relationships between them into account, which makes it possible to represent each word in a small number of dimensions.


model_w2v = word2vec.Word2Vec([text_split], size=100, min_count=1, window=5, iter=10)
embedding_matrix = np.zeros((num_char, 100))
for w, vec in zip(model_w2v.wv.index2word, model_w2v.wv.vectors):
    embedding_matrix[vocab[w]] = vec

model_w2v.wv.index2word and model_w2v.wv.vectors give the vector learned for each word, for example:

'hashi': [ 0.04145062, -0.01713059, -0.0444374 , ...],
'ga':    [ 0.554178  , -0.19353375, -0.56997895, ...]

Combining this with the word-number dictionary created earlier gives a weight matrix like the following (row 0 is left as zeros for masking):

[[ 0.        ,  0.        ,  0.        , ...,  0.        ],
 [ 0.00860965, -0.00045624, -0.00883528, ..., -0.00861127],
 ...
 [ 0.00662873, -0.00499085, -0.01188819, ..., -0.01252057]]
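As a quick sanity check that the distributed representation actually captures word relationships, you can query the Word2Vec model for the neighbors of a word from the corpus (a hedged sketch; the query word and its neighbors depend on the corpus and the random seed):


#Look up the most similar words to one token from the corpus
query = text_split[0]  #any in-vocabulary word works
print(query, model_w2v.wv.most_similar(query, topn=5))

#The embedding matrix should have one row per word index, with row 0 all zeros for masking
print(embedding_matrix.shape)     # (num_char, 100)
print(embedding_matrix[0].sum())  # 0.0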

Creating a model

Now let's train the GRU. The model can be trained with the parameters as they are; you only need to change the LSTM layer to a GRU.


#Creating a model
model = Sequential()
model.add(Embedding(num_char, 100, weights=[embedding_matrix], trainable=False, input_length=n_seq))
model.add(BatchNormalization(axis=-1))
model.add(Masking(mask_value=0, input_shape=(n_seq, 100)))
model.add(GRU(128, input_shape=(n_seq, 100), kernel_initializer='random_uniform'))
model.add(Dropout(0.2))
model.add(Dense(num_char))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.01), metrics=['categorical_accuracy'])
model.fit(x_train, np_utils.to_categorical(y_train, num_char), batch_size=128, validation_split=0.05, epochs=100,
          shuffle=True)

I am not very familiar with the hyperparameters, so I referred to one of the articles in the references.

Let's take a look at the test data.


y_pred = model.predict(x_test)
print(list(map(lambda x:vocab_r[x], np.argmax(y_pred, axis=1))))

The output looks like this:

Hahaha ……. ", Praise constantly"To"
I looked up. Indeed, the long hand"You guys"
You are a natural adventurer and you are in junior high school"、"

The results feel a bit questionable.
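To look at this a little more closely, you can also print the predicted next word next to the actual next word for a few test samples (a minimal sketch using the y_test split created above):


#Compare the predicted next word with the actual next word for a few test samples
y_true_words = [vocab_r.get(int(i), "<mask>") for i in y_test[:10]]
y_pred_words = [vocab_r.get(int(i), "<mask>") for i in np.argmax(y_pred[:10], axis=1)]
for actual, predicted in zip(y_true_words, y_pred_words):
    print("actual:", actual, "predicted:", predicted)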

Prediction from an input string

Let's see what kind of sentences are actually generated when we feed in an arbitrary input string.

First, the input string is morphologically analyzed and segmented. I will try it with a news headline that happened to be trending on Twitter.


input_text = "Dementia-causing substances accumulate in periodontal disease Kyushu University and others announce"
input_text_split = convert(input_text, False)

Convert the words to numbers. Words that did not appear in the training data are not registered in the dictionary and would raise an error, so we add exception handling that returns 0 for them.


def word2vec_input(x):
    """
    Convert words to numbers.
    Words not in the dictionary return 0 (the masking index).
    """
    try:
        return vocab[x]
    except KeyError:
        return 0


vocab_r[0] = "<nodata>"

#Convert words to numbers
input_text_num = list(map(lambda x:word2vec_input(x), input_text_split))

When making a prediction, the input data must have the shape (number of samples, number of context words) = (1, n_seq).


#Pad with 0 at the front if the number of input words is short
if len(input_text_num) < n_seq:
    input_text_num = [0]*(n_seq - len(input_text_num)) + input_text_num

#Function that predicts the next word and appends it to the sequence
def prediction(input_text_num):
    input_test = np.zeros((1, n_seq))
    input_test[0] = input_text_num[-n_seq:]  #use the last n_seq words as context
    y_pred = model.predict(input_test)
    #Append the index of the highest-probability word.
    #(Written this way because I was experimenting with also picking the second-highest
    # probability, but np.argmax(y_pred) is enough.)
    return np.append(input_text_num, np.where(y_pred == np.sort(y_pred)[:, -1].reshape(-1, 1))[1])

#Run the prediction for 1000 words
for i in range(1000):
    input_text_num = prediction(input_text_num)
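Since argmax is enough here, as noted in the comment, the prediction step can also be written in this simpler greedy form (a minimal sketch that makes the same next-word choice):


#A simpler greedy version of the prediction step: just take the argmax
def prediction_greedy(input_text_num):
    input_test = np.zeros((1, n_seq))
    input_test[0] = input_text_num[-n_seq:]
    y_pred = model.predict(input_test)
    next_word = np.argmax(y_pred, axis=1)  #index of the most probable next word
    return np.append(input_text_num, next_word)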

After the prediction loop finishes, the numbers are converted back into words to build the sentence.


test_pred = list(map(lambda x:vocab_r[x], input_text_num[len(input_text_split):])) #Drop the input portion and keep only the generated words
pred_text = input_text + "".join(test_pred) #Prepend the original input string as-is
print(pred_text)

The created string looks like this.

Accumulation of dementia-causing substances in periodontal disease Kyushu University and others have announced. "You, you, how, you, what kind of thing, I looked inside." Yes, you, how, like you, look at that kind of thing, "see. In addition, to the subordinates of the Twenty Faces, one of them said, "I'm thrilled, that's it." Hahaha ... "Akechi said," Yes, what, "Evening ..." "..." "Oh, what about Kobayashi? I looked inside you and saw the nigiri." Oh, how did you come? It settled in the art of the day and that He had a cute face on his face like a thief. "... In today's day, that's it." The art said, "Then, the shelf." "And the thief grabbed his complexion, and Mr. Sotaro said.

My Japanese has gone crazy ...

After that, save the model as needed.


#Save the model
model.save('test.h5')
model.save_weights('param.hdf5')

#Load it again (adjust the paths to match where you saved the files)
from keras.models import load_model
model = load_model('test.h5')
model.load_weights('param.hdf5')

Finally

As you can see, the accuracy is very low. If you have any ideas for improvement, please let me know.

References

- [3. Natural language processing with Python 1-2. How to create a corpus: Aozora Bunko](https://qiita.com/y_itoh/items/fa04c1e2f3df2e807d61)
- Understanding Word2Vec
- How to embed word2vec learned in keras into the embedding layer
- Try Keras LSTM sentence generation on a word-by-word basis
- Automatic writing with deep learning: Document generation using Keras (Part 1)
