[PYTHON] Preprocessing to build a seq2seq model using the Keras Functional API

What kind of article?

This article is for people who want to try various ideas in deep learning modeling but don't know how to implement them. Using the Keras Functional API, a framework that is relatively flexible while still reasonably abstracted, we implement seq2seq, which is hard to build with the Sequential model, as simply as possible.

Table of contents

  1. Overview
  2. Preprocessing (you are here)
  3. Model Construction & Learning
  4. Inference
  5. Model improvement (not yet written)

Motivation for this article

We know that deep learning models can be implemented with Keras, and we know that deep learning requires preprocessing. So how do we convert the data into a format that Keras's deep learning capabilities can actually consume? Answering that question is the main point of this article.

What you need for preprocessing

Adding start / end tokens

When the translation model infers the first word of a sentence, it takes the virtual start token <start> as input. Similarly, we append the end token <end> so that when the model estimates that the next word is the end token, the sentence can end there.

Converting word strings to numbers

In order to feed data into a machine learning model, the loaded string data has to be converted into numbers somehow. Bag-of-words and one-hot encoding of each word are well-known methods. This time I want to use a Keras Embedding layer at the beginning of the network, so I assign each word a word ID and convert each sentence into a sequence of word IDs.

Embedding layer https://keras.io/ja/layers/embeddings/
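As a rough illustration (a minimal sketch assuming tensorflow.keras; the vocabulary size and embedding dimension below are placeholder values, not taken from this article), an Embedding layer maps integer word IDs to dense vectors:

import numpy as np
from tensorflow import keras

# Placeholder values: vocabulary of 1000 words, 64-dimensional embeddings
embedding = keras.layers.Embedding(input_dim=1000, output_dim=64, mask_zero=True)

# One sentence already converted to word IDs and zero-padded
word_ids = np.array([[2, 6, 42, 20, 0, 0]])
vectors = embedding(word_ids)
print(vectors.shape)  # (1, 6, 64): one 64-dimensional vector per word ID

mask_zero=True makes downstream layers ignore the zero-padded positions, which fits the zero-padding scheme described in the next section.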

Unifying word string lengths

The word sequences within the dataset should, if possible, have a uniform length so that they are easier to feed into the LSTM later. This time, each word sequence is zero-padded to match the maximum length in the dataset.
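As a small illustration of what this padding looks like (the word IDs here are made up), keras.preprocessing.sequence.pad_sequences with padding='post' appends zeros up to the length of the longest sequence:

from tensorflow import keras

# Two sentences of different lengths (made-up word IDs)
sequences = [[2, 6, 42, 3], [2, 151, 3]]
padded = keras.preprocessing.sequence.pad_sequences(sequences, padding='post')
print(padded)
# [[  2   6  42   3]
#  [  2 151   3   0]]  <- the shorter sentence is padded with 0 at the end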

Processing for teacher forcing

Teacher forcing is a technique used when training seq2seq models. Normally, the decoder uses its estimate of the previous word to estimate the next word. During training, however, the correct data is available, so the next word is estimated from the previous correct word instead of from the previous estimate. (Figure: LSTM-Page-2.png, illustrating the teacher forcing flow.) Even if the model wrongly infers "that" instead of "this", or "pencil" instead of "pen", the next input is corrected back to the right answer. To achieve this, prepare a word sequence shifted by one word relative to the target word sequence as the decoder input, as in the example below.

Example: if the estimation target is "This is a pen . <end>", prepare the word string "<start> This is a pen ." as the decoder input.
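A minimal sketch of this shift with a toy token list (not the actual dataset): the decoder input and the decoder target are the same sequence offset by one position.

# Toy example: the full target-side token sequence including start/end tokens
full = ['<start>', 'This', 'is', 'a', 'pen', '.', '<end>']

decoder_input  = full[:-1]  # ['<start>', 'This', 'is', 'a', 'pen', '.']
decoder_target = full[1:]   # ['This', 'is', 'a', 'pen', '.', '<end>']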

Summary of processing flow

The above flow can be summarized as follows.

  1. Add the special words <start> and <end> to the start and end of each sentence, respectively, so that the model can determine where a sentence begins and ends.
  2. Define a conversion rule that maps each word to a unique ID, and convert each word string into a sequence of word IDs.
  3. Zero-pad each word ID sequence to match the maximum length in the dataset.
  4. Because training uses a technique called teacher forcing, shift the decoder input by one position relative to the decoder's correct output data.

When the above processing is performed, a conversion like the following takes place.

Dataset word string: <start> i can 't tell who will arrive first . <end>
↓
Word ID sequence: [2, 6, 42, 20, 151, 137, 30, 727, 234, 4, 3, 0, 0, 0, 0, 0, 0, 0] (18 elements, zero-padded)

Implementation of preprocessing

Adding start / end tokens

Define the following two functions to read the dataset line by line and add the start / end tokens.

def preprocess_sentence(w):
    # Strip the trailing newline and surrounding whitespace
    w = w.rstrip().strip()

    # Add sentence start and end tokens
    # so the model knows where to start and where to stop predicting
    w = '<start> ' + w + ' <end>'
    return w

def create_dataset(path, num_examples):
    # Read the dataset file line by line and add start/end tokens to each sentence
    with open(path) as f:
        word_pairs = f.readlines()
    word_pairs = [preprocess_sentence(sentence) for sentence in word_pairs]

    return word_pairs[:num_examples]

Although it is called preprocess_sentence, it only adds the start / end tokens, so it is not a very good name for the function. The variable in create_dataset is called word_pairs because that name is left over from the TensorFlow sample code I referenced; it is not actually pairs at all, it simply returns num_examples word strings with start / end tokens added.
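A hedged usage sketch (the file names and the number of examples below are placeholders, not taken from the article): since small_parallel_enja stores one tokenized sentence per line, each language file is loaded separately.

# Placeholder paths and example count for illustration only
num_examples = 20000
input_lang = create_dataset('train.ja', num_examples)   # source-language sentences
target_lang = create_dataset('train.en', num_examples)  # target-language sentences

print(target_lang[0])  # e.g. "<start> i can 't tell who will arrive first . <end>"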

Defining the word-to-ID conversion rule

Converting to sequences of word IDs

Here, keras.preprocessing.text.Tokenizer from Keras is very convenient and saves us a lot of work.

def tokenize(lang):
    # Build the word-to-ID conversion rule; out-of-vocabulary words map to <unk>
    lang_tokenizer = keras.preprocessing.text.Tokenizer(filters='', oov_token='<unk>')
    lang_tokenizer.fit_on_texts(lang)

    # Convert each word string into a sequence of word IDs
    tensor = lang_tokenizer.texts_to_sequences(lang)

    # Zero-pad every sequence to the maximum length in the dataset
    tensor = keras.preprocessing.sequence.pad_sequences(tensor, padding='post')

    return tensor, lang_tokenizer

The fit_on_texts method determines the conversion rule between words and word IDs from the list of word strings given to it. The texts_to_sequences method then converts the list of word strings into a list of word ID sequences. Zero padding is done with keras.preprocessing.sequence.pad_sequences.
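A hedged usage sketch (the variable names are illustrative): applying tokenize to the sentence lists returned by create_dataset gives the padded ID tensors plus the fitted tokenizers, whose word_index dictionaries hold the word-to-ID rule.

# Illustrative only: input_lang / target_lang come from create_dataset above
input_tensor, inp_tokenizer = tokenize(input_lang)
target_tensor, targ_tokenizer = tokenize(target_lang)

print(input_tensor.shape)                    # (num_examples, max_sequence_length)
print(targ_tokenizer.word_index['<start>'])  # the word ID assigned to <start>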

Processing for teacher forcing

Let `input_tensor` be the input-side word ID sequences obtained by the method described above, and `target_tensor` the correct-answer word ID sequences processed the same way. Process them as follows.

encoder_input_tensor = input_tensor
decoder_input_tensor = target_tensor[:, :-1]   # every position except the last
decoder_target_tensor = target_tensor[:, 1:]   # every position except the first; this one-step shift realizes teacher forcing
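As a quick sanity check (a sketch; the exact sizes depend on the dataset), the two decoder tensors end up one time step shorter than target_tensor, while the encoder tensor is unchanged:

# Illustrative shape check; actual dimensions depend on the loaded dataset
print(encoder_input_tensor.shape)   # (num_examples, max_input_length)
print(decoder_input_tensor.shape)   # (num_examples, max_target_length - 1)
print(decoder_target_tensor.shape)  # (num_examples, max_target_length - 1)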

You now have the data to use in the seq2seq model. Model building and training will be covered in the next article.

reference

The preprocessing part is based on: Neural machine translation with attention https://www.tensorflow.org/tutorials/text/nmt_with_attention

The code base for the training / inference part: Sequence to sequence example in Keras (character-level) https://keras.io/examples/lstm_seq2seq/

The data used for training: https://github.com/odashi/small_parallel_enja

Repository containing the code for this article: https://github.com/nagiton/simple_NMT
