[PYTHON] [TensorFlow] I feel like I managed to mass-produce "posthumous judgment" style messages using Seq2Seq

Introduction

In a previous article, I experimented with generating "Christ signboard"-style text using a character-based LSTM: [TensorFlow] I tried mass-producing "posthumous judgment" style messages with LSTM --Qiita

--Only characters that appear in the training data can be output
--Quite a few generated sentences are exactly the same as the training data
--Sentences that are grammatically broken or semantically strange appear
--The training data itself is very small

Since these issues remain, I would like to try a model that takes an entire sentence as input. Eventually it would be nice to try a GAN-based generative model (SeqGAN [^1], etc.), but jumping straight to that seems likely to end in frustration, so first I will try a Seq2Seq model, which also serves as practice for handling input and output in sentence units.

Verification environment

Training is difficult without a GPU this time, so I will run it on Google Colaboratory, which can be used for free.

Strategy

Let's use the Seq2Seq model used in machine translation. The basics follow this tutorial: Keras: Ex-Tutorials: Seq2Seq Intro to Learning – PyTorch. Since the input this time is words rather than characters, I modified the code by referring to the section "If you want to use a word-level model with an integer sequence".

When generating sentences with Seq2Seq, there is the question of what to use as input; here I follow the method described in another article:

--Enter the first sentence and get an output
--Feed that output in as the next input and get another output, repeating as needed

That is the approach I will take: Sentence generation using Seq2Seq (trained on Wikipedia data) --Qiita

For an explanation of the Encoder-Decoder model, including Seq2Seq, this article was easy to understand, so I will defer to it entirely: Encoder-Decoder with PyTorch --Qiita

Following that article, each word in the input data is first converted into a vector by the Embedding layer, and the model is then trained on the sequence of vector representations. Ideally, the vector representation of each word would also be learned from the data, but since there is not much data, we substitute a pre-trained Word2Vec model.

This time, I used this Word2Vec model. [^w2v] shiroyagicorp/japanese-word2vec-model-builder: A tool for building gensim word2vec model for Japanese.

[^w2v]: There are various pre-trained Japanese Word2Vec models, so as long as the tokenization (e.g. how inflected endings are split) matches the training data, you should be able to use others as well. → About pre-trained Japanese word2vec and its evaluation --Blog of Hoxom Co., Ltd.

A trained word2vec model is available at http://public.shiroyagi.s3.amazonaws.com/latest-ja-word2vec-gensim-model.zip, so download this zip file and use it.
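
If you prefer to fetch it from within the notebook, here is a minimal sketch using only the standard library (the URL is the one above; adjust the extraction path if the archive layout differs):

import urllib.request
import zipfile

# Download the pre-trained shiroyagi Word2Vec model (URL from the text above)
url = "http://public.shiroyagi.s3.amazonaws.com/latest-ja-word2vec-gensim-model.zip"
urllib.request.urlretrieve(url, "latest-ja-word2vec-gensim-model.zip")

# Extract it; this should produce the latest-ja-word2vec-gensim-model directory used later
with zipfile.ZipFile("latest-ja-word2vec-gensim-model.zip") as zf:
    zf.extractall(".")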

This model contains about 330,000 words, each represented by a 50-dimensional vector. Any word contained in the model can appear in the output. Since the model works on Word2Vec representations, words with similar meanings may appear with a certain probability even if they are not in the training data. **Maybe it is not just a dream to "reconcile with a cat"?**
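
As a quick way to see what "words with similar meanings" looks like here (a sketch; it assumes the model has already been downloaded and extracted as described above), gensim can list the nearest neighbours of any word in the embedding space:

from gensim.models import Word2Vec

# Load the pre-trained model (same path as used later in this article)
model = Word2Vec.load("latest-ja-word2vec-gensim-model/word2vec.gensim.model")

# The five words closest to "猫" (cat) among the roughly 330,000-word vocabulary
print(model.wv.most_similar("猫", topn=5))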

Program

First, `import` the required modules.

import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Activation, LSTM, Embedding
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
import numpy as np
import random
import sys
import pickle
from gensim.models import Word2Vec

Data preparation

In a sense, the data is the most important part.

**We will use the text of the signboards posted by the "Bdljapan Association" and the placards they hold up during street activities (a transcription). (Same as last time)**

text = """
Oh, listen to the words of the world god
The reward of evil is death
Evil people do not recognize God
The one who sows evil reaps disaster
Your god is the only one
Get ready to meet your maker
Remember your maker
Jesus Christ is your Creator
Jesus Christ gives eternal hope
Jesus Christ is the only Son of God
Jesus Christ is the Son of God
Jesus Christ is the only God
Jesus Christ judges the world correctly
Those who call on Jesus Christ will be saved
eternal life
Source of eternal life
Eternal god
Source of eternal salvation
Hope for eternity
Stand before God in the last days
In the last days man stands before God
The day when God will judge mankind is near
Reconcile with God
Repent of your sins against God
Seek God's Kingdom and God's Justice
Seek the kingdom of God and justice
The kingdom of God is approaching
The kingdom of God is approaching, repent
Those who reject God's Word prefer death
God's judgment comes suddenly
The day of God's righteous judgment is near
Jesus Christ, the only Son of God, is the Savior
God says that it is not the rule to die here
God sees the heart
God punishes sin
God sent Christ the only Son to the world
God sent His Son Christ to the world
God is the only
God has fooled the wisdom of the world
God set the day to judge the world
Repent of your attitude toward God
Fear God
Fear God
Those who move away from God enter the path of evil
Acknowledge God
Seek God
Think about the destination after death
Today is the day of salvation
There is no salvation other than Christ
The day when Christ judges people is near
Christ is the true god
Christ is the way truth life
The Second Coming of Christ is near
Purify the blood of Christ
The blood of Christ cleanses sin
The blood of Christ cleanses sin
The blood of Christ cleanses sin
The blood of Christ removes sin
There is no god but Christ
The resurrection of Christ is a confirmation of salvation
Christ gives you eternal life
Christ justifies you
Christ frees you from sin
Christ gives eternal life
Christ is the Son of God
Christ sinned man on the cross
Christ is the alter ego of the true God
Christ will come soon
Christ will come soon
Christ cancels sin and gives life
Christ revokes sin
Christ forgives sin and gives eternal life
Christ rose from the tomb
Christ has risen from the tomb
Christ sinned on behalf of man
Christ will come again and judge the world
Christ will come again and judge the world
Christ will come again
Christ will come again
Christ was resurrected
Christ is the true god
Christ is the alter ego of the true God
Christ sinned in his place
Christ is the way truth life
Christ resurrected and gives eternal life
Christ revived and overcame death
Those who believe in Christ will be saved
Those who believe in Christ have eternal life
Those who call on Christ will be saved
Repent
Repent
Idolatry is a sin
Fortunately for those who have a pure heart
Believe in God from the bottom of my heart
God judges the sins of the heart
Escape the indelible fire of hell
Hell is eternal suffering
Hell is the second death
There is a judgment after death
Meet the afterlife
Meet the afterlife
Think about your destination after death
Think about your destination after death
God sees private life
There is a way of death and a way of life
Death is the reward of sin
The Second Coming of the Lord Jesus Christ is near
The resurrection of the Lord Jesus Christ is a confirmation of salvation
Lord Jesus Christ is the Creator of all things
Lord's day comes suddenly
Life is short, heaven is long
No one is right
Just believe
Just believe
Corrupted society does not recognize God
The earth and the person are of God
Don't worship "made" things
Fortunately those who have been cleansed of their sins
Fortunately those who have been cleansed from sin
Enlightenment about sin, justice, and judgment
If you die in sin, you will go to eternal hell
If you die in sin, you will go to eternal hell
The reward of sin is death
The reward of sin is the gift of the god of death eternal life
The reward of sin is the gift of the god of death eternal life in Christ
Get forgiveness of sin
Get forgiveness of sins
Seek forgiveness of sins
God punishes sin
Fortunately those who have been cleansed of their sins
Repent of your sins
Repent of your sins
Heaven or hell or your destination
Heaven is eternal life Hell is the sea of fire
Even if the heavens and the earth perish, my words will not perish
The maker of all things
The land of heaven, hell, or all people will be resurrected
The kingdom of heaven repents of near sins
Watch out for fake
In the beginning God created the heavens and the earth
What man made is not a god
Get rid of human evil
Sin dwells in people
Get rid of a person's sin
God sees people's ways and deeds
The pond of fire is the second death
Redeem unrighteousness
God judges immorality and adultery
God judges adultery and adultery
There is a way of death and a way of life
God is not recognized in the crooked times
The true God loves people and removes their sins
Believe in the true god
Believe in the true god
Way truth life
Behold I will come soon
The world and the greed of the world are dying
The end of the world is near
The end of the world comes suddenly
The resurrected Christ gives eternal life
Judge the world correctly
Is it a disaster? A person who calls evil good
Worshipers who worship idols
Woe to the hero of drinking alcohol
I wonder if it will be a disaster
I am the way truth life
There is eternal life in my words
I am life
I'm the bread of life
I'm the bread of life
I have the keys to death and hell
I am the truth
I am the truth
I am the light of the world
Those who believe in me have eternal life
Those who believe in me have eternal life
Those who believe in me have eternal life
Those who believe in me live even if they die
I am the truth and the life
Remember your Creator
There is no salvation other than Jesus Christ
Call on God before it's too late
In the last days man stands before God
Christ judges hidden things
Christ sent by God is the savior
If you return to God, God will forgive you abundantly
Repent of your sins against God
The kingdom of God is approaching, repent
The day of God's judgment is near
The gift of God is eternal life
God loves people and removes their sins
Fear God and obey the word
Recognize God and be afraid to leave sin
Christ has the word of eternal life
Accept the salvation of Christ
The blood of Christ cleanses sin
Christ will come again and judge the world
Christ sinned man on the cross
Christ saves sinners
Christ gives man eternal life
Christ was punished for the substitution of man
Christ frees people from sin
Believe in Christ and be saved
Those who call on Christ will be saved
Repent and believe in the gospel
Repent
Be new from the bottom of your heart
Think about your destination after death
Fortunately those who have been cleansed from sin
The reward of sin is death
Please forgive me for my sins
Get rid of sin
Acknowledge your sins and return to God
Christ was resurrected for the sake of man justice
God sees people's hearts and thoughts
People meet afterlife judgment
The world and the desires of the world are gone
There is no god other than me
I am the way truth life
Those who believe in me have eternal life
"""

input_texts = [[w for w in s.split(" ")] for s in text.strip().split("\n")] * 10

Here, text basically contains the original sentences segmented with mecab -Owakati and pasted in. However, I manually corrected incorrect segmentation results and removed punctuation marks and symbols. In addition, words that were not included in the Word2Vec model because of notational variation or conjugation were replaced. For example:

―― "Worry" → "Disaster" --"Aganau" → "Redemption" ―― "Takaburu" → "Takaburu" ―― "Ma" "Ta" → "Mama" "Just" ―― "Repent" → "Repent" ("Yo" is an imperative conjugation ending, so it is not originally divided, but it was not in the Word2Vec model used by "Repent")

These are examples of such replacements (not an exhaustive list).
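
For reference, here is a minimal sketch of the word-segmentation step (it assumes mecab-python3 is installed; in practice I ran mecab -Owakati on the command line and corrected the result by hand):

import MeCab

# -Owakati outputs the sentence as space-separated surface forms
tagger = MeCab.Tagger("-Owakati")
print(tagger.parse("神の国は近づいた").strip())
# e.g. "神 の 国 は 近づい た" (the exact split depends on the dictionary used)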

In addition, `input_texts` repeats each sentence 10 times. As we will see later, when one of these sentences is entered, the model is trained to return a randomly chosen sentence from the data, so this prepares a data set in which one input can map to roughly 10 different outputs.

Word2Vec model

Load the Word2Vec model. Assume that the model above has been extracted into the latest-ja-word2vec-gensim-model directory; change the path as appropriate. I think it is convenient to put the model on Google Drive, mount it, and use it from there: Mount Google Colaboratory on Google Drive and run Python --Qiita
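
If you go the Google Drive route, the mount itself is one call (standard Colab API; the directory name under My Drive is just an example and should be adjusted):

from google.colab import drive

# Mount Google Drive under /content/drive
drive.mount('/content/drive')

# Example path; point this at wherever you put the extracted model on Drive
model_dir = "/content/drive/My Drive/latest-ja-word2vec-gensim-model"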

# shiroyagi ("white goat") Word2Vec model
model = Word2Vec.load("latest-ja-word2vec-gensim-model/word2vec.gensim.model")
words = ["<PAD>"] + model.wv.index2word
mat_embedding = np.insert(model.wv.vectors, 0, 0, axis=0)
input_token_index = target_token_index = dict((w, i) for i, w in enumerate(words))
num_encoder_tokens = num_decoder_tokens = mat_embedding.shape[0] # for Masking
max_encoder_seq_length = max(len(txt) for txt in input_texts)
max_decoder_seq_length = max_encoder_seq_length + 1 # BOS/EOS
latent_dim = mat_embedding.shape[1]

When you load the model with gensim, you get two pieces of information: model.wv.index2word and model.wv.vectors.

-- model.wv.index2word holds the vocabulary (word list) contained in the model.
-- model.wv.vectors is a matrix (ndarray) whose rows are the vector representations of each word.

For later convenience, I want to reserve ID 0 for padding (filling in the missing part of variable-length input), so the following code inserts a dummy word and a dummy (all-zero) vector at ID 0. This shifts every word ID up by one compared to the original Word2Vec model.

words = ["<PAD>"] + model.wv.index2word
mat_embedding = np.insert(model.wv.vectors, 0, 0, axis=0)
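
A quick check of the resulting shift (a sketch; the exact vocabulary size depends on the downloaded model):

# ID 0 is now the padding token and every original word is shifted up by one
assert words[0] == "<PAD>"
assert (mat_embedding[0] == 0).all()        # dummy all-zero vector for padding
assert words[1] == model.wv.index2word[0]   # original word 0 is now ID 1
print(len(words), mat_embedding.shape)      # vocabulary size and (vocab, 50)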

The vocabulary size of this model and the dimensionality of its vectors determine the size of the model to be trained (apart from the dimensionality of the LSTM's hidden layer / internal state). Concretely, num_encoder_tokens and num_decoder_tokens are the roughly 330,000 words plus the padding token, and latent_dim is 50 (as of 05/01/2020).

Here, max_decoder_seq_length is 1 larger than max_encoder_seq_length because <BOS> needs to be added to the beginning of the decoder input during training. [^3]

num_encoder_tokens and num_decoder_tokens are the input and output vocabulary sizes (number of word types), respectively. In a translation task, num_encoder_tokens would be the source-language vocabulary size and num_decoder_tokens the target-language vocabulary size, so the two are generally different, but this time the input and output are the same language, so they take the same value.

[^3]: Just in case: <BOS> and <PAD> are different. <BOS> is a symbol that represents the beginning of a sentence and is part of what is learned, while <PAD> is a convenience symbol used to pad short sequences to a common length; it is not a learning target.

Model definition

hidden_dim = 64

# Embedding
layer_emb = Embedding(num_encoder_tokens, latent_dim, trainable=False, mask_zero=True)

# Encoder
encoder_inputs = Input(shape=(None,), dtype=tf.int32)
x = layer_emb(encoder_inputs)
_, state_h, state_c = LSTM(hidden_dim, return_sequences=True, return_state=True)(x)
encoder_states = [state_h, state_c]

# Decoder
decoder_inputs = Input(shape=(None,), dtype=tf.int32)
x = layer_emb(decoder_inputs)
x, _, _ = LSTM(hidden_dim, return_sequences=True, return_state=True)(x, initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens)(x)

def accuracy_masking(y_true, y_pred):
    # Evaluate accuracy only on positions that are not masked out as padding
    mask_indices = tf.where(y_pred._keras_mask)
    return tf.keras.metrics.sparse_categorical_accuracy(
        tf.gather_nd(y_true, mask_indices),
        tf.gather_nd(y_pred, mask_indices))

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
opt = RMSprop(lr=0.01)
model.compile(optimizer=opt, loss=lambda y_true, y_pred: tf.nn.softmax_cross_entropy_with_logits(tf.one_hot(tf.cast(y_true, tf.int32), num_decoder_tokens), y_pred),
              metrics=[accuracy_masking])
# set embedding matrix
layer_emb.set_weights([mat_embedding])

The input word ID sequence is converted to vector representations in the `Embedding` layer and passed to the LSTM. Since this embedding matrix (the vector representation of each word) uses the values taken from the pre-trained Word2Vec model as described above [^2], set `trainable=False` so that the embedding matrix is not updated during training. Also, add `mask_zero=True` so that ID 0 is treated as padding.

[^2]: Immediately after creating the layer, the weight matrix is not yet initialized and set_weights() cannot be called, so set it after building the model.

The rest of the model definition is largely the same as the Keras tutorial. However, I want to supply the labels as word IDs (rather than one-hot vectors), so I define the loss function myself. Also, since I want to manipulate the scores later to make less likely words easier to sample, I decided not to put a Softmax in the output layer (Dense) of the model, and instead compute the loss by converting the labels to one-hot vectors and applying tf.nn.softmax_cross_entropy_with_logits(). [^4]

[^4]: If you don't need to manipulate the scores, you can use loss='sparse_categorical_crossentropy'. In that case, put a Softmax in the output layer of the model.
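
For reference, here is a sketch of that simpler alternative (not what this article uses; it assumes the Dense output layer is given a softmax activation and reuses the decoder LSTM output x from the model definition above):

# Alternative (not used here): softmax inside the model plus the built-in sparse loss
decoder_outputs = Dense(num_decoder_tokens, activation="softmax")(x)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer=RMSprop(lr=0.01),
              loss="sparse_categorical_crossentropy",
              metrics=[accuracy_masking])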

Also, regarding the accuracy value displayed during training: positions that should be treated as padding (and ignored) would otherwise be included in the calculation, so I wrote a custom metric to correct for this. You can get the mask by referring to the _keras_mask property of the Tensor. Masking and padding with Keras | TensorFlow Core

Data shaping for learning

# create data
encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length),
    dtype='int32')
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length),
    dtype='int32')
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length),
    dtype='int32')

output_texts = list(input_texts)
random.shuffle(output_texts)
# input one text and randomly output another text
for i, (input_text, target_text) in enumerate(zip(input_texts, output_texts)):
    for t, w in enumerate(input_text):
        encoder_input_data[i, t] = input_token_index[w]

    decoder_input_data[i, 0] = target_token_index['@'] # BOS
    for t, w in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t + 1] = target_token_index[w]
        decoder_target_data[i, t] = target_token_index[w]
    decoder_target_data[i, t + 1:] = target_token_index['。'] # EOS

Create three ndarrays: `encoder_input_data`, `decoder_input_data`, and `decoder_target_data`. Since all of them hold word ID sequences, `dtype='int32'` is specified.

As mentioned above, `input_texts` contains each input sentence repeated 10 times. By shuffling a copy of it into `output_texts`, we create training data in which one output sentence is randomly associated with each input sentence (so up to 10 different output sentences can be associated with the same input sentence).

When training the Decoder side of an Encoder-Decoder model, the target sequence must be one word ahead of the Decoder's input sequence. As mentioned above, the Decoder input starts with <BOS>; this time I decided to use the word '@' to represent the beginning of the sentence (is that really okay?). Conversely, the end of the output sequence is represented by '。'.
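
As a concrete illustration of that one-step shift with a hypothetical three-word sentence (in the real arrays these are word IDs, padded out to max_decoder_seq_length):

# Hypothetical example of the decoder input/target shift
target_text = ["神", "の", "国"]
dec_in = ['@'] + target_text      # decoder input starts with the BOS word '@'
dec_tgt = target_text + ['。']    # target is one step ahead and ends with the EOS word '。'
print(dec_in)    # ['@', '神', 'の', '国']
print(dec_tgt)   # ['神', 'の', '国', '。']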

Learning

Just run the fit () method and wait for a while. It takes about 10 minutes in the GPU environment of Google Colab.

batch_size = 64  # Batch size for training.
epochs = 50  # Number of epochs to train for.

cp_cb = ModelCheckpoint(
    filepath="model.{epoch:02d}.hdf5",
    verbose=1,
    mode="auto")

reduce_cb = ReduceLROnPlateau()

# Run training
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2,
          callbacks=[cp_cb, reduce_cb])

Learning will proceed as follows.

Epoch 1/50
28/28 [==============================] - ETA: 0s - loss: 2.7266 - accuracy_masking: 0.1383
Epoch 00001: saving model to model.01.hdf5
28/28 [==============================] - 11s 390ms/step - loss: 2.7266 - accuracy_masking: 0.1383 - val_loss: 1.9890 - val_accuracy_masking: 0.1918 - lr: 0.0100
(Omitted)
Epoch 50/50
28/28 [==============================] - ETA: 0s - loss: 0.2723 - accuracy_masking: 0.8259
Epoch 00050: saving model to model.50.hdf5
28/28 [==============================] - 9s 329ms/step - loss: 0.2723 - accuracy_masking: 0.8259 - val_loss: 0.4014 - val_accuracy_masking: 0.7609 - lr: 1.0000e-03

Inference

Now we prepare to actually generate sentences using the trained model.

encoder_model = Model(encoder_inputs, encoder_states)

decoder_lstm = model.layers[4]
decoder_dense = model.layers[5]

decoder_state_input_h = Input(shape=(hidden_dim,))
decoder_state_input_c = Input(shape=(hidden_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(
    layer_emb(decoder_inputs), initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

Create the Encoder and Decoder separately. The Encoder side takes a word sequence as input and outputs its state after reading the whole input. The Decoder side takes a word sequence and a state as input (in practice, one word is fed in at a time) and outputs a sequence of scores and a new state.
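
As a single-step illustration before the full loop (a sketch that mirrors the first iteration of decode_sequence below and assumes the encoder_model, decoder_model, and data arrays defined earlier):

# Encode one training sentence and ask the decoder for the scores of the first output word
states = encoder_model.predict([encoder_input_data[:1]])
step_input = np.zeros((1, 1, 1), dtype=np.int32)
step_input[0, 0, 0] = target_token_index['@']        # start from the BOS word '@'
scores, h, c = decoder_model.predict([step_input] + states)
print(words[int(np.argmax(scores[0, 0, :]))])        # greedy pick of the first word, for illustration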

def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict([input_seq])
    # Generate empty target sequence of length 1. (batch x sample x time)
    target_seq = np.zeros((1, 1, 1), dtype=np.int32)
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, 0] = target_token_index['@']

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = []
    temperature = 1.2
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Sample a token (with temperature)
        logits = output_tokens[0, 0, :]
        prob = np.exp(logits / temperature)
        prob /= (prob.sum() * 1.00001)
        sampled_token_index = np.argmax(np.random.multinomial(1, prob))
        sampled_word = words[sampled_token_index]
        decoded_sentence.append(sampled_word)

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_word == '。' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq[0, 0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]

    return decoded_sentence

The input sequence is processed using the inference Encoder and Decoder above. states_value is the Encoder's state after the input sequence has been fed in. The Decoder is given this state and the first <BOS> word '@'. The input word is set in target_seq; since the model assumes batched input and outputs one sample, one word at a time, its shape is (1, 1, 1).

The Decoder returns scores for the output word (the output of the Dense layer) given the input word and the previous state. These could be converted to probabilities with a plain Softmax, but to make low-scoring words easier to sample, the scores are first divided by the constant temperature and then passed through Softmax. One word is sampled according to the resulting probabilities and used as the output. If this output word is the end-of-sentence symbol '。', the loop exits.
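
As a small standalone illustration of the temperature trick (hypothetical scores, not taken from the model): dividing by a temperature above 1 flattens the distribution, so lower-scoring words are sampled more often.

import numpy as np

def softmax_with_temperature(logits, temperature):
    # Divide the scores by the temperature before exponentiating;
    # T > 1 flattens the distribution, T < 1 sharpens it
    p = np.exp(logits / temperature)
    return p / p.sum()

logits = np.array([3.0, 1.0, 0.2])                    # hypothetical word scores
print(softmax_with_temperature(logits, 1.0))          # approx. [0.84 0.11 0.05]
print(softmax_with_temperature(logits, 1.2))          # flatter: approx. [0.78 0.15 0.08]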

Let's do it!

Everything is finally ready. Generate as much as you want! Choose the seed sentence however you like.

s = ["cat", "When", "settlement", "Seyo"]  # seed sentence: "Reconcile with the cat", split into words
for i in range(50):
    gen = decode_sequence([input_token_index[w] for w in s])
    print("".join(gen))
    s = gen

50 sentences are output like this.

Christ gives you the life of Christ.
Repent of your sins against God.
Idolatry is a sin.
I was guilty of the destination of Christ.
There is no god but Christ.
Repent of your sins.
Christ is nearing you.
A corrupt society does not recognize God.
Repent and believe in the gospel.
Get rid of sin.
Idolatry is a sin.
Heaven is eternal life The gift of hell.
The gift of God is eternal life.
Christ was resurrected because man was justified.
God sins the Hasse principle.
In the beginning God has heavenly things.
Lord of Evian's sins.
Fire gives way and eternal life.
Repent and believe in the gospel.
The day of whac-a-mole comes suddenly.
The hidden era is near.
Even if the heavens and the earth perish, the hell of man does not perish.
Christ rose from the tomb.
Fortunately, those who are clean fake.
God is the only one.
Christ will come again.
A lustful person does not recognize God.
God is the only one.
Christ is the true God.
Think about your destination after death.
There is a way of Anastasia and a way of life.
I have the keys to death and hell.
God is the only one.
God sees the heart.
God sent His Son Christ to the world.
Heaven or hell or your destination?
Judge the world correctly.
Today is the day of salvation.
God is the only one.
Christ saves sinners.
A person who worships an idol.
Meet the judgment after death.
Christ was resurrected.
God punishes sin.
Get rid of human evil.
Christ is the only god.
Believe in the true God.
Give the blood of Christ.
There is no god but Christ.
The bottom of the heavens and the earth calls the god of death Monday Night.

There is quite a bit of output similar to the original text, but **some words are not in the training data**. **"The day of whac-a-mole comes suddenly"** made me laugh. **"Even if the heavens and the earth perish, the hell of man does not perish"** offers no salvation at all. **"The bottom of the heavens and the earth calls the god of death Monday Night"** sounds like a comedian's bit or something.

Since the output side of the training data consists only of the signboard phrases, honestly it feels like **whatever you enter, signboard-style phrases come out**. **The seed sentence is almost irrelevant. Probably.** That may be fine (I don't think this is the original intended use of Seq2Seq, but never mind).

Summary

Although there is more variation than last time, there are still outputs that fall apart as Japanese, and similar sentences are generated many times. Even so, the fact that outrageous phrases appear and come at you can be said to be exactly as intended.

In this spirit, I hope to keep playing with various models in the future, since it's a long vacation. Training on a laptop is pretty tough, but Colab gives you a free GPU, so you can casually try things out as a hobby. Really generous.

[Image: Resurrection of the Christ Sign Generator (v2)]
