[PYTHON] TensorFlow Tutorial-Sequence Transformation Model (Translation)

This is a translation of the TensorFlow Tutorial (Sequence-to-Sequence Models): https://www.tensorflow.org/versions/master/tutorials/seq2seq/index.html#sequence-to-sequence-models. We would appreciate it if you could point out any translation errors.


Recurrent neural networks can learn to model language, as described in the RNN Tutorial (if you haven't read it, please do so before proceeding). This raises an interesting question: could we generate a meaningful response conditioned on some input words? For example, could we train a neural network to translate from English to French? The answer turns out to be yes.

This tutorial will walk you through how to build and train such a system end to end. It assumes you have installed TensorFlow via the pip package, have cloned the tensorflow git repository, and are at the root of the git tree.

Start by running the translation program:

cd tensorflow/models/rnn/translate
python translate.py --data_dir [your_data_directory]

This program downloads English-French translation data from the WMT'15 website, prepares it for training, and trains a model. It requires about 20GB of disk space, and downloading and preparing the data take a while (see below for details: https://www.tensorflow.org/versions/master/tutorials/seq2seq/index.html#run_it), so it is a good idea to start it now and keep it running while you read this tutorial.

This tutorial references the following files from models/rnn:

seq2seq.py: library for building sequence transformation models
translate/seq2seq_model.py: neural translation sequence transformation model
translate/data_utils.py: helper functions for preparing translation data
translate/translate.py: binary that trains and runs the translation model

Basics of sequence transformation

A basic sequence transformation model consists of two recurrent neural networks (RNNs), as introduced in Cho et al., 2014: an encoder that processes the input and a decoder that generates the output. This basic architecture is shown below.

[Figure: basic encoder-decoder sequence transformation architecture]

Each box in the figure above represents an RNN cell, most commonly a GRU cell or an LSTM cell (see the RNN Tutorial (https://www.tensorflow.org/versions/master/tutorials/recurrent/index.html) for a description of them). The encoder and decoder can share weights or, more commonly, use separate sets of parameters. Multi-layer cells have also been used successfully in sequence transformation models, for example for translation in Sutskever et al., 2014 (http://arxiv.org/abs/1409.3215).

In the basic model shown above, every input must be encoded into a fixed-size state vector, as that is the only thing passed to the decoder. To give the decoder more direct access to the input, an attention mechanism was introduced in Bahdanau et al., 2014. We will not go into the details of the attention mechanism (see the paper); suffice it to say that it allows the decoder to peek into the input at every decoding step. A multi-layer sequence transformation network with LSTM cells and attention in the decoder looks like this:

[Figure: multi-layer sequence transformation network with LSTM cells and attention in the decoder]

TensorFlow seq2seq library

As you can see above, there are many different sequence transformation models. Each of these models can use different RNN cells, but all of them accept encoder inputs and decoder inputs. This motivates the interfaces in TensorFlow's seq2seq library (models/rnn/seq2seq.py). The basic RNN encoder-decoder sequence transformation model works as follows.

outputs, states = basic_rnn_seq2seq(encoder_inputs, decoder_inputs, cell)

In the above call, encoder_inputs is a list of tensors representing the inputs to the encoder, i.e. the letters A, B, C in the first figure above. Similarly, decoder_inputs is a list of tensors representing the inputs to the decoder: GO, W, X, Y, Z in the first figure.

The cell argument is an instance of the models.rnn.rnn_cell.RNNCell class, which determines the cell used inside the model. You can use existing cells such as GRUCell or LSTMCell, or you can write your own. In addition, rnn_cell provides wrappers for constructing multi-layer cells, adding dropout to cell inputs and outputs, and other transformations. See the RNN Tutorial (https://www.tensorflow.org/versions/master/tutorials/recurrent/index.html) for examples.

The call to basic_rnn_seq2seq returns two values: outputs and states. Both are lists of tensors of the same length as decoder_inputs. Naturally, the outputs correspond to the decoder's output at each time step: W, X, Y, Z, EOS in the first figure above. The returned states represent the internal state of the decoder at every time step.
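For concreteness, here is a minimal sketch of how such a call might be wired up, assuming the models/rnn import layout used by this version of the tutorial (later TensorFlow versions moved these symbols); the sizes and sequence lengths below are made up for illustration.

import tensorflow as tf
from tensorflow.models.rnn import rnn_cell, seq2seq

batch_size = 32
input_size = 20          # size of each input vector (hypothetical)
encoder_length = 3       # e.g. A, B, C
decoder_length = 5       # e.g. GO, W, X, Y, Z

# One placeholder per time step: lists of [batch_size, input_size] tensors.
encoder_inputs = [tf.placeholder(tf.float32, [batch_size, input_size])
                  for _ in range(encoder_length)]
decoder_inputs = [tf.placeholder(tf.float32, [batch_size, input_size])
                  for _ in range(decoder_length)]

# A single GRU cell used by both the encoder and the decoder.
cell = rnn_cell.GRUCell(num_units=128)

# In the version described here, outputs and states are lists with one
# entry per decoder time step.
outputs, states = seq2seq.basic_rnn_seq2seq(encoder_inputs, decoder_inputs, cell)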

In many applications of the sequence transformation model, the output of the decoder at time t is fed back as the input of the decoder at time t + 1. This is how sequences are constructed when decoding at test time. During training, on the other hand, the correct input is usually fed to the decoder at every time step, even if the decoder made a mistake at the previous step. The functions in seq2seq.py support both modes through the feed_previous argument. As an example, let's analyze the following use of the embedding RNN model.

outputs, states = embedding_rnn_seq2seq(
    encoder_inputs, decoder_inputs, cell,
    num_encoder_symbols, num_decoder_symbols,
    output_projection=None, feed_previous=False)

In the embedding_rnn_seq2seq model, all inputs (both encoder_inputs and decoder_inputs) are integer tensors representing discrete values. They are embedded into a dense representation (see the Vector Representations Tutorial (https://www.tensorflow.org/versions/master/tutorials/word2vec/index.html) for more details on embeddings). To construct these embeddings, you need to specify the maximum number of discrete symbols that can appear: num_encoder_symbols on the encoder side and num_decoder_symbols on the decoder side.

In the above call, we set feed_previous to False. This means that the decoder will use the decoder_inputs tensors as provided. If we set feed_previous to True, the decoder would only use the first element of decoder_inputs. All other tensors in the list would be ignored, and the previous output of the decoder would be used instead. This is used for decoding in our translation model, but it can also be used during training to make the model more robust to its own mistakes, similar to Bengio et al., 2015.
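As a sketch of how the two modes might sit side by side, the following builds a training graph with feed_previous=False and a variable-sharing copy with feed_previous=True, following the call signature shown above; the vocabulary sizes, scope name, and cell size are hypothetical.

import tensorflow as tf
from tensorflow.models.rnn import rnn_cell, seq2seq

batch_size, encoder_length, decoder_length = 32, 5, 10
cell = rnn_cell.GRUCell(num_units=128)

# For the embedding model the inputs are lists of int32 tensors of shape
# [batch_size] holding symbol ids, not dense vectors.
encoder_inputs = [tf.placeholder(tf.int32, [batch_size]) for _ in range(encoder_length)]
decoder_inputs = [tf.placeholder(tf.int32, [batch_size]) for _ in range(decoder_length)]

with tf.variable_scope("embedding_seq2seq"):
    # Training graph: the decoder is fed the ground-truth decoder_inputs.
    train_outputs, _ = seq2seq.embedding_rnn_seq2seq(
        encoder_inputs, decoder_inputs, cell,
        num_encoder_symbols=40000, num_decoder_symbols=40000,
        output_projection=None, feed_previous=False)

with tf.variable_scope("embedding_seq2seq", reuse=True):
    # Decoding graph: only decoder_inputs[0] (the GO symbol) is read; later
    # steps consume the embedding of the previous step's own output.
    decode_outputs, _ = seq2seq.embedding_rnn_seq2seq(
        encoder_inputs, decoder_inputs, cell,
        num_encoder_symbols=40000, num_decoder_symbols=40000,
        output_projection=None, feed_previous=True)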

Another important argument used above is output_projection. If it is not specified, the outputs of the embedding model are tensors of shape batch-size by num_decoder_symbols, representing the logits for each generated symbol. When training models with a large output vocabulary, i.e. a large num_decoder_symbols, it is not practical to store these large tensors. Instead, it is better to return smaller output tensors, which are later projected onto the large output tensors using output_projection. This allows our seq2seq models to be used with a sampled softmax loss, as described in Jean et al., 2015.

In addition to basic_rnn_seq2seq and embedding_rnn_seq2seq, there are a few more sequence transformation models in seq2seq.py; take a look at them. They all have similar interfaces, so we will not describe them in detail. The translation model below uses embedding_attention_seq2seq.

Neural translation model

While the core of the sequence transformation model is built from the functions in models/rnn/seq2seq.py, there are a couple of tricks worth mentioning that are used in the translation model in models/rnn/translate/seq2seq_model.py.

Sampled softmax and output projection

For the reasons mentioned above, we use a sampled softmax to handle the large output vocabulary. Since we decode from an output projection, we have to keep track of it. Both the sampled softmax loss and the output projection are constructed by the following code in seq2seq_model.py.

  if num_samples > 0 and num_samples < self.target_vocab_size:
    w = tf.get_variable("proj_w", [size, self.target_vocab_size])
    w_t = tf.transpose(w)
    b = tf.get_variable("proj_b", [self.target_vocab_size])
    output_projection = (w, b)

    def sampled_loss(inputs, labels):
      labels = tf.reshape(labels, [-1, 1])
      return tf.nn.sampled_softmax_loss(w_t, b, inputs, labels, num_samples,
                                        self.target_vocab_size)

First, note that we only construct the sampled softmax if the number of samples (512 by default) is smaller than the target vocabulary size. For vocabularies smaller than 512 it is better to just use the standard softmax loss.

Then, as you can see in the code, we build an output projection. It is a pair consisting of a weight matrix and a bias vector. When it is used, the RNN cell returns vectors of shape batch-size by size, rather than batch-size by target_vocab_size. To recover the logits, the output must be multiplied by the weight matrix and have the bias added, as is done on lines 124-126 of seq2seq_model.py.

if output_projection is not None:
  self.outputs[b] = [tf.matmul(output, output_projection[0]) +
                     output_projection[1] for ...]

Bucketing and padding

In addition to the sampled softmax, our translation model uses bucketing, a method to efficiently handle sentences of different lengths. Let us first clarify the problem. When translating English to French, we have English sentences of varying length L1 as input and French sentences of varying length L2 as output. Since the English sentence is passed as encoder_inputs and the French sentence comes in as decoder_inputs (prefixed by a GO symbol), we would in principle need a seq2seq model for every pair (L1, L2 + 1) of English and French sentence lengths. This would result in an enormous graph consisting of many very similar subgraphs. On the other hand, we could just pad every sentence with a special PAD symbol. Then we would need only one seq2seq model, for the padded length. But for short sentences this model would be inefficient, as it would have to encode and decode many unnecessary PAD symbols.

As a compromise between constructing a graph for every pair of lengths and padding to a single length, we use a number of buckets and pad each sentence to the length of the smallest bucket it fits into. translate.py uses the following default buckets:

buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]

This means that if the input is an English sentence with 3 tokens and the corresponding output is a French sentence with 6 tokens, they are put into the first bucket, with the encoder input padded to length 5 and the decoder input padded to length 10. If we have an English sentence with 8 tokens and the corresponding French sentence has 18 tokens, they do not fit into the (10, 15) bucket, so the (20, 25) bucket is used: the English sentence is padded to 20 and the French sentence to 25.

Recall that when constructing the decoder inputs, we prepend a special GO symbol to the input data. This is done in the get_batch() function in seq2seq_model.py, which also reverses the input English sentence. Reversing the inputs was shown in Sutskever et al., 2014 to improve the results of the neural translation model. To put it all together, suppose we have the sentence "I go." as input, tokenized as ["I", "go", "."], and the sentence "Je vais." as output, tokenized as ["Je", "vais", "."]. In that case the pair goes into the (5, 10) bucket, with the encoder input represented as [PAD PAD "." "go" "I"] and the decoder input as [GO "Je" "vais" "." EOS PAD PAD PAD PAD PAD].
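To make the bucket selection, padding, and input reversal concrete, here is a small self-contained sketch that reproduces the example above. It only mirrors what get_batch() does conceptually; the helper name and the symbol constants are hypothetical, not the tutorial's actual ones.

# Illustrative sketch of bucketing, padding and reversal for the
# "I go." -> "Je vais." example; symbol names below are placeholders.
PAD, GO, EOS = "PAD", "GO", "EOS"
buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]

def bucket_and_pad(source_tokens, target_tokens):
    # Pick the smallest bucket that fits the source and the target plus GO/EOS.
    for encoder_size, decoder_size in buckets:
        if len(source_tokens) <= encoder_size and len(target_tokens) + 2 <= decoder_size:
            break
    else:
        raise ValueError("sentence pair too long for all buckets")

    # Encoder input: reverse the source, then left-pad up to encoder_size.
    encoder_pad = [PAD] * (encoder_size - len(source_tokens))
    encoder_input = encoder_pad + list(reversed(source_tokens))

    # Decoder input: GO, the target, EOS, then PAD up to decoder_size.
    decoder_input = [GO] + list(target_tokens) + [EOS]
    decoder_input += [PAD] * (decoder_size - len(decoder_input))
    return encoder_input, decoder_input

enc, dec = bucket_and_pad(["I", "go", "."], ["Je", "vais", "."])
# enc == [PAD, PAD, ".", "go", "I"]
# dec == [GO, "Je", "vais", ".", EOS, PAD, PAD, PAD, PAD, PAD]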

Let's run

To train the model described above, we need a large English-French corpus. We will use the 10^9-French-English corpus from the WMT'15 website for training, and the 2013 news test from the same site as a development set. When you run the command below, these datasets are downloaded to data_dir, training begins, and checkpoints are stored in train_dir.

python translate.py
  --data_dir [your_data_directory] --train_dir [checkpoints_directory]
  --en_vocab_size=40000 --fr_vocab_size=40000

It takes about 18GB of disk space and several hours to prepare the training corpus. The preparation unpacks the datasets, creates vocabulary files in data_dir, and then tokenizes the corpus and converts it into integer ids. Pay attention to the parameters that determine the vocabulary size. In the example above, all words outside the 40,000 most common ones are converted to an UNK token representing unknown words. If you change the vocabulary size, the binary will re-tokenize the corpus and re-map it to ids.
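Conceptually, the token-to-id conversion works like the following sketch; this is not the actual data_utils.py API, and the UNK id and toy vocabulary are made up for illustration.

# Conceptual sketch of the word-to-id mapping performed during data
# preparation; the special id below is illustrative, not the constant
# actually defined in data_utils.py.
UNK_ID = 3

def sentence_to_ids(sentence, vocab):
    """Map a whitespace-tokenized sentence to integer ids, falling back to
    UNK_ID for any word outside the (e.g. 40,000-word) vocabulary."""
    return [vocab.get(word, UNK_ID) for word in sentence.split()]

vocab = {"the": 4, "president": 5, "of": 6}   # tiny toy vocabulary
print(sentence_to_ids("the president of France", vocab))  # [4, 5, 6, 3]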

When the data is prepared, training begins. The default parameters in translate.py are set to large values, and large models trained for a long time give good results. However, they can take too long to train or use too much GPU memory, so you can request training of a smaller model, as in the following example.

python translate.py
  --data_dir [your_data_directory] --train_dir [checkpoints_directory]
  --size=256 --num_layers=2 --steps_per_checkpoint=50

The above command trains a model with 2 layers (the default is 3), with 256 units in each layer (the default is 1024), and saves a checkpoint every 50 steps (the default is 200). You can experiment with these parameters to find out how large a model can fit in your GPU's memory.

During training, every steps_per_checkpoint steps the binary prints statistics from the most recent steps. With the default parameters (3 layers of size 1024), the first messages look like this:

global step 200 learning rate 0.5000 step-time 1.39 perplexity 1720.62
  eval: bucket 0 perplexity 184.97
  eval: bucket 1 perplexity 248.81
  eval: bucket 2 perplexity 341.64
  eval: bucket 3 perplexity 469.04
global step 400 learning rate 0.5000 step-time 1.38 perplexity 379.89
  eval: bucket 0 perplexity 151.32
  eval: bucket 1 perplexity 190.36
  eval: bucket 2 perplexity 227.46
  eval: bucket 3 perplexity 238.66

You can see that each step takes a bit under 1.4 seconds, along with the perplexity on the training set and the perplexities on the development set for each bucket. After about 30,000 steps, the perplexities of the short sentences (buckets 0 and 1) go into single digits. Since the training corpus contains about 22M sentences, one epoch (one pass over the training data) takes about 340,000 steps with a batch size of 64. At this point the model can be used to translate English sentences into French with the --decode option.

python translate.py --decode
  --data_dir [your_data_directory] --train_dir [checkpoints_directory]

Reading model parameters from /tmp/translate.ckpt-340000
>  Who is the president of the United States?
 Qui est le président des États-Unis ?

What next?

The example above shows how to build your own end-to-end English-to-French translator. Run it and see how the model performs. It has reasonable quality, but the default parameters will not give you the best translation model. Here are a few things you can improve.

First of all, we used a very primitive tokenizer, the basic_tokenizer function in data_utils. A better tokenizer can be found on the WMT'15 website (http://www.statmt.org/wmt15/translation-task.html). Using that tokenizer, together with a larger vocabulary, should improve the translations.

Also, the default parameters of the translation model are not tuned. You can try changing the learning rate or its decay, or initializing the model weights in a different way. You can also swap the default GradientDescentOptimizer in seq2seq_model.py for something more advanced, such as AdagradOptimizer. Try these things and see how the results improve!
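For example, a change along these lines inside seq2seq_model.py would swap in Adagrad; the exact line being replaced depends on your version of the file, so treat this as a sketch rather than a patch.

# Sketch: swapping the optimizer in seq2seq_model.py. The surrounding code
# that computes and applies clipped gradients stays the same; only the
# optimizer construction changes. (Variable names depend on the file version.)
# Before (roughly):
#   opt = tf.train.GradientDescentOptimizer(self.learning_rate)
# After:
opt = tf.train.AdagradOptimizer(self.learning_rate)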

Finally, the model presented above can be used for any sequence transformation task, not only for translation. For example, it can convert a sequence into a tree to generate a parse tree, and the same model gives state-of-the-art results, as shown in Vinyals & Kaiser et al., 2015. So you can build not only your own translator, but also a parser, a chatbot, or any program that comes to your mind. Experiment!
