This article adds an Attention Model to Sequence to Sequence (Seq2Seq), a kind of Encoder-Decoder model, and explains its implementation and the verification results.
Last time (http://qiita.com/kenchin110100/items/b34f5106d5a211f4c004) I implemented the Sequence to Sequence (Seq2Seq) model with Chainer. This time, I added an Attention Model to that model.
Below, I explain the Attention Model, how it is implemented, and the verification results.
Attention Model
By using an RNN such as an LSTM, sequence data such as sentences can be converted into feature vectors.
However, data input early in the sequence is less likely to be reflected in the final output feature vector.
In other words, the sentence "Mom put on make-up and put on a skirt and went out to the city" and the sentence "Dad put on make-up and put on a skirt and went out to the city" end up with almost the same feature vector.
The Attention Model is a mechanism that properly takes into account the data entered at the beginning.
Sequence to Sequence with Attention Model
The figure below shows the calculation flow of the Seq2Seq model implemented last time.
[Figure: Sequence to Sequence]
(The figure differs slightly from the one in the previous article.)
The blue part is the Encoder, which vectorizes the utterance, and the red part is the Decoder, which outputs the response from that vector.
If you add the Attention Model to this, it will look like the figure below.
[Figure: Sequence to Sequence with Attention model]
It is a little more complicated, but the blocks labeled [At] in the figure are the Attention Model.
On the Encoder side, the intermediate vector output at each step is stored in the Attention Model.
On the Decoder side, the previous intermediate vector is input to the Attention Model. Based on that vector, the Attention Model takes a weighted average of the intermediate vectors received from the Encoder and returns it.
By feeding the weighted average of the Encoder's intermediate vectors into the Decoder, the Attention Model lets the Decoder pay attention to any position in the input: earlier words, later words, anywhere.
There are two main types of Attention Model, called Global Attention and Local Attention.
In the following, we will explain Global Attention and Local Attention.
Global Attention
The following paper proposes Global Attention.
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
Originally it was used in machine translation.
For an explanation of Global Attention, https://www.slideshare.net/yutakikuchi927/deep-learning-nlp-attention is easy to understand.
Although it is in English, https://talbaumel.github.io/attention/ is also easy to understand.
The mechanism for taking the weighted average of the intermediate vectors received from the Encoder is shown below.
[Figure: Global Attention]
The figure assumes that the Encoder has supplied three vectors: [Intermediate Vector 1], [Intermediate Vector 2], and [Intermediate Vector 3].
In the figure, [eh] and [hh] are linear layers that output a vector of hidden layer size from a vector of hidden layer size, [+] is vector addition, and [×] is element-wise multiplication of vectors.
[tanh] is the hyperbolic tangent, which maps each element of a vector into the range -1 to 1.
[hw] is a linear layer that outputs a scalar (size 1) from a vector of hidden layer size.
[soft max] is the softmax function, which normalizes the input values so that they sum to 1.
The values calculated by [soft max] are used as the weights of the weighted average, and the weighted average of the intermediate vectors is output.
This is how Global Attention works.
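The flow described above can be sketched in plain Python. This is only an illustrative sketch, not the Chainer implementation shown later: the helper names (`linear`, `softmax`, `global_attention`) and the toy parameter values are assumptions made for this example, and each layer is represented as a hypothetical dict holding a weight matrix `W` and a bias `b`.

```python
import math


def linear(weights, bias, x):
    """Fully connected layer: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(scores):
    """Normalize scores so that they are positive and sum to 1."""
    m = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention(encoder_states, decoder_state, eh, hh, hw):
    """Return (weights, weighted average of the encoder states)."""
    projected_h = linear(hh["W"], hh["b"], decoder_state)   # [hh]
    scores = []
    for e in encoder_states:
        projected_e = linear(eh["W"], eh["b"], e)            # [eh]
        hidden = [math.tanh(a + b)                           # [+] then [tanh]
                  for a, b in zip(projected_e, projected_h)]
        scores.append(linear(hw["W"], hw["b"], hidden)[0])   # [hw] -> scalar
    weights = softmax(scores)                                # [soft max]
    size = len(encoder_states[0])
    context = [sum(w * e[k] for w, e in zip(weights, encoder_states))
               for k in range(size)]                         # weighted average
    return weights, context


# Toy parameters (hypothetical values, hidden layer size 2).
identity = {"W": [[1.0, 0.0], [0.0, 1.0]], "b": [0.0, 0.0]}
to_scalar = {"W": [[1.0, 1.0]], "b": [0.0]}
states = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
weights, context = global_attention(states, [0.5, 0.5],
                                    identity, identity, to_scalar)
print(weights, context)
```

With these toy values the first two encoder states receive equal weights and the third receives a smaller one, so the returned vector leans toward the first two states.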
Local Attention
The following paper proposes Local Attention.
Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based neural machine translation." arXiv preprint arXiv:1508.04025 (2015).
This is also a machine translation paper.
The reference material is https://www.slideshare.net/yutakikuchi927/deep-learning-nlp-attention.
Below is the calculation flow diagram of Local Attention.
[Figure: Local Attention]
Local Attention adds more networks to Global Attention.
The main difference is the network on the right.
[ht] is a linear layer that outputs a vector of hidden layer size from a vector of hidden layer size, and [tanh] maps the elements of the vector into the range -1 to 1 as before.
[tw] is a linear layer that transforms a vector of hidden layer size into a scalar, and [sigmoid] is the sigmoid function, which maps the input value into the range 0 to 1. As a result, the input vector becomes a scalar in the range 0 to 1.
Next, I will explain what [ga] in the figure does. It is calculated as follows.
$$ output = \exp\Bigl(-\frac{(s - input \times Len(S))^2}{\sigma^2}\Bigr) $$
Here $ input $ is the scalar scaled from 0 to 1, $ Len (S) $ is the number of intermediate vectors input by the Encoder, and $ s $ is the position of the intermediate vector (1 for [Intermediate Vector 1], 2 for [Intermediate Vector 2]).
For example, if the value output by the sigmoid function is 0.1, [ga] is large for [Intermediate Vector 1] and small for [Intermediate Vector 3].
By multiplying this output by the weight calculated by Global Attention, it is possible to focus more on a specific intermediate vector.
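The [ga] calculation can be checked with a few lines of plain Python. This is a minimal sketch under stated assumptions: the function names are made up for this example, and `sigma=1.0` and the Global Attention weights `[0.2, 0.3, 0.5]` are arbitrary illustrative values, since the article does not give concrete ones. Whether the product is renormalized afterwards is not stated above, so the sketch leaves it as is.

```python
import math


def gaussian_position_weights(center_ratio, num_states, sigma):
    """Compute [ga] for the positions s = 1 .. num_states.

    center_ratio is the sigmoid output (0 to 1); multiplying it by the
    number of intermediate vectors gives the position to focus on.
    """
    center = center_ratio * num_states
    return [math.exp(-((s - center) ** 2) / sigma ** 2)
            for s in range(1, num_states + 1)]


def local_attention_weights(global_weights, center_ratio, sigma):
    """Multiply each Global Attention weight by the [ga] value of its position."""
    ga = gaussian_position_weights(center_ratio, len(global_weights), sigma)
    return [w * g for w, g in zip(global_weights, ga)]


# Sigmoid output 0.1 with three intermediate vectors (sigma is an assumed value).
ga = gaussian_position_weights(0.1, 3, sigma=1.0)
local = local_attention_weights([0.2, 0.3, 0.5], 0.1, sigma=1.0)
print(ga)
print(local)
```

With a sigmoid output of 0.1 the focus position is 0.3, so the value for position 1 is the largest and the value for position 3 is close to zero, which matches the behaviour described above.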
I implemented it with Chainer as before. The Encoder part is the same as for Seq2Seq.
I referred to oda's code. Thank you very much. https://github.com/odashi/chainer_examples
Attention
I implemented Global Attention. The code is as follows.
attention.py
import numpy as np
from chainer import Chain, Variable, cuda, functions, links


class Attention(Chain):
    def __init__(self, hidden_size, flag_gpu):
        """
        Instantiate Attention
        :param hidden_size: Hidden layer size
        :param flag_gpu: Whether to use the GPU
        """
        super(Attention, self).__init__(
            # Linear layer that transforms a forward Encoder intermediate vector into a vector of hidden layer size
            fh=links.Linear(hidden_size, hidden_size),
            # Linear layer that transforms a reverse Encoder intermediate vector into a vector of hidden layer size
            bh=links.Linear(hidden_size, hidden_size),
            # Linear layer that transforms the Decoder's intermediate vector into a vector of hidden layer size
            hh=links.Linear(hidden_size, hidden_size),
            # Linear layer that converts a vector of hidden layer size to a scalar
            hw=links.Linear(hidden_size, 1),
        )
        # Remember the size of the hidden layer
        self.hidden_size = hidden_size
        # Use cupy when using the GPU, numpy otherwise
        if flag_gpu:
            self.ARR = cuda.cupy
        else:
            self.ARR = np

    def __call__(self, fs, bs, h):
        """
        Attention calculation
        :param fs: List of forward Encoder intermediate vectors
        :param bs: List of reverse Encoder intermediate vectors
        :param h: Intermediate vector output by the Decoder
        :return: Weighted average of the forward Encoder intermediate vectors and weighted average of the reverse Encoder intermediate vectors
        """
        # Remember the size of the mini-batch
        batch_size = h.data.shape[0]
        # Initialize the list that records the weights
        ws = []
        # Initialize the running total of the weights
        sum_w = Variable(self.ARR.zeros((batch_size, 1), dtype='float32'))
        # Calculate the weights from the Encoder intermediate vectors and the Decoder intermediate vector
        for f, b in zip(fs, bs):
            # Combine the forward Encoder, reverse Encoder, and Decoder intermediate vectors
            w = functions.tanh(self.fh(f)+self.bh(b)+self.hh(h))
            # Exponentiate the score (the numerator of softmax; the division by the sum is done below)
            w = functions.exp(self.hw(w))
            # Record the calculated weight
            ws.append(w)
            sum_w += w
        # Initialize the output weighted average vectors
        att_f = Variable(self.ARR.zeros((batch_size, self.hidden_size), dtype='float32'))
        att_b = Variable(self.ARR.zeros((batch_size, self.hidden_size), dtype='float32'))
        for f, b, w in zip(fs, bs, ws):
            # Normalize so that the weights sum to 1
            w /= sum_w
            # Add weight * Encoder intermediate vector to the output vector
            att_f += functions.reshape(functions.batch_matmul(f, w), (batch_size, self.hidden_size))
            att_b += functions.reshape(functions.batch_matmul(b, w), (batch_size, self.hidden_size))
        return att_f, att_b
The explanation above used only one Encoder, but in practice an Attention Model commonly uses two Encoders: a forward Encoder and a reverse Encoder.
Therefore, the Attention calculation receives two lists: the intermediate vectors calculated by the forward Encoder and the intermediate vectors calculated by the reverse Encoder.
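The two lists have to stay aligned by word position, because the Attention code pairs them with `zip(fs, bs)`. The sketch below illustrates this with a toy recurrent step instead of an LSTM: `run_encoder`, `bidirectional_states`, and `toy_step` are hypothetical helpers written for this example, and the "state" is simply the list of words seen so far. Inserting each reverse state at index 0 mirrors the `self.bs.insert(0, h)` call in the `encode` method of the combined model.

```python
def run_encoder(step, inputs, initial_state):
    """Run a recurrent step over the inputs and collect the state after each step."""
    states = []
    state = initial_state
    for x in inputs:
        state = step(x, state)
        states.append(state)
    return states


def bidirectional_states(step_forward, step_backward, inputs, initial_state):
    """Return forward and reverse state lists, both indexed by input position."""
    fs = run_encoder(step_forward, inputs, initial_state)
    bs = []
    state = initial_state
    for x in reversed(inputs):
        state = step_backward(x, state)
        # Insert at the front so that bs[i] is the state computed at inputs[i]
        bs.insert(0, state)
    return fs, bs


def toy_step(token, state):
    """Toy stand-in for an LSTM step: the state is the list of words seen so far."""
    return state + [token]


words = ["mom", "went", "out"]
fs, bs = bidirectional_states(toy_step, toy_step, words, [])
for i, word in enumerate(words):
    print(word, fs[i], bs[i])
```

At each position `i`, `fs[i]` has seen the words up to and including position `i`, while `bs[i]` has seen the words from the end of the sentence back to position `i`, so together they cover the whole sentence.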
Decoder
Unlike plain Seq2Seq, the values input to the Decoder are the word vector, the intermediate vector calculated by the Decoder, and the weighted average of the Encoder's intermediate vectors, so I rewrote the Decoder implementation.
att_decoder.py
from chainer import Chain, functions, links


class Att_LSTM_Decoder(Chain):
    def __init__(self, vocab_size, embed_size, hidden_size):
        """
        Instantiate the Decoder for the Attention Model
        :param vocab_size: Vocabulary size
        :param embed_size: Word vector size
        :param hidden_size: Hidden layer size
        """
        super(Att_LSTM_Decoder, self).__init__(
            # Layer that converts words into word vectors
            ye=links.EmbedID(vocab_size, embed_size, ignore_label=-1),
            # Layer that transforms a word vector into a vector four times the size of the hidden layer
            eh=links.Linear(embed_size, 4 * hidden_size),
            # Layer that transforms the Decoder's intermediate vector into a vector four times the size of the hidden layer
            hh=links.Linear(hidden_size, 4 * hidden_size),
            # Layer that transforms the weighted average of the forward Encoder's intermediate vectors into a vector four times the size of the hidden layer
            fh=links.Linear(hidden_size, 4 * hidden_size),
            # Layer that transforms the weighted average of the reverse Encoder's intermediate vectors into a vector four times the size of the hidden layer
            bh=links.Linear(hidden_size, 4 * hidden_size),
            # Layer that converts a vector of hidden layer size to the size of a word vector
            he=links.Linear(hidden_size, embed_size),
            # Layer that converts a word vector to a vector of vocabulary size
            ey=links.Linear(embed_size, vocab_size)
        )

    def __call__(self, y, c, h, f, b):
        """
        Decoder calculation
        :param y: Word to input to the Decoder
        :param c: Internal memory
        :param h: Decoder intermediate vector
        :param f: Weighted average of the forward Encoder calculated by the Attention Model
        :param b: Weighted average of the reverse Encoder calculated by the Attention Model
        :return: Vocabulary size vector, updated internal memory, updated intermediate vector
        """
        # Convert the word to a word vector
        e = functions.tanh(self.ye(y))
        # LSTM using the word vector, Decoder intermediate vector, forward Encoder Attention, and reverse Encoder Attention
        c, h = functions.lstm(c, self.eh(e) + self.hh(h) + self.fh(f) + self.bh(b))
        # Convert the intermediate vector output by the LSTM to a vocabulary size vector
        t = self.ey(functions.tanh(self.he(h)))
        return t, c, h
The reason for using a vector four times the size of the hidden layer is the same as I explained last time.
I added the layers [fh] and [bh] to use the weighted averages of the Encoder's intermediate vectors calculated by Attention; otherwise the Decoder is the same as before.
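As a rough illustration of the factor of four: an LSTM step needs four pre-activation values per hidden unit (the cell input and the input, forget, and output gates), and the Decoder obtains them by summing the outputs of its four linear layers. The sketch below is plain Python written for this example, not Chainer's implementation. It assumes the four groups are stored as contiguous chunks, which may differ from the memory layout `functions.lstm` uses internally, and `add_vectors` and `lstm_step` are hypothetical helper names.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def add_vectors(*vectors):
    """Element-wise sum, like self.eh(e) + self.hh(h) + self.fh(f) + self.bh(b)."""
    return [sum(values) for values in zip(*vectors)]


def lstm_step(c_prev, pre_activation):
    """One LSTM step; returns (new internal memory, new intermediate vector)."""
    n = len(c_prev)
    assert len(pre_activation) == 4 * n
    # Assumed layout: [cell input | input gate | forget gate | output gate]
    a = pre_activation[0:n]
    i = pre_activation[n:2 * n]
    f = pre_activation[2 * n:3 * n]
    o = pre_activation[3 * n:4 * n]
    c_new = [sigmoid(f[k]) * c_prev[k] + sigmoid(i[k]) * math.tanh(a[k])
             for k in range(n)]
    h_new = [sigmoid(o[k]) * math.tanh(c_new[k]) for k in range(n)]
    return c_new, h_new


# Hidden layer size 2, so each contribution has 4 * 2 = 8 elements (toy values).
from_word = [0.0] * 8
from_decoder = [0.0] * 8
from_forward_attention = [0.0] * 8
from_reverse_attention = [0.0] * 8
pre = add_vectors(from_word, from_decoder,
                  from_forward_attention, from_reverse_attention)
c_new, h_new = lstm_step([1.0, -1.0], pre)
print(c_new, h_new)
```

Because the four contributions are simply added before the LSTM step, adding Attention to the Decoder only requires two more linear layers with the same output size.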
Seq2Seq with Attention
The model that combines Encoder, Decoder, and Attention is as follows.
att_seq2seq.py
import numpy as np
from chainer import Chain, Variable, cuda

# LSTM_Encoder is the Encoder class from the previous article;
# Attention and Att_LSTM_Decoder are the classes defined above.


class Att_Seq2Seq(Chain):
    def __init__(self, vocab_size, embed_size, hidden_size, batch_size, flag_gpu=True):
        """
        Instantiate Seq2Seq + Attention
        :param vocab_size: Vocabulary size
        :param embed_size: Word vector size
        :param hidden_size: Hidden layer size
        :param batch_size: Mini-batch size
        :param flag_gpu: Whether to use the GPU
        """
        super(Att_Seq2Seq, self).__init__(
            # Forward Encoder
            f_encoder = LSTM_Encoder(vocab_size, embed_size, hidden_size),
            # Reverse Encoder
            b_encoder = LSTM_Encoder(vocab_size, embed_size, hidden_size),
            # Attention Model
            attention = Attention(hidden_size, flag_gpu),
            # Decoder
            decoder = Att_LSTM_Decoder(vocab_size, embed_size, hidden_size)
        )
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.hidden_size = hidden_size
        self.batch_size = batch_size
        # Use cupy when using the GPU, numpy otherwise
        if flag_gpu:
            self.ARR = cuda.cupy
        else:
            self.ARR = np
        # Initialize the lists that store the forward and reverse Encoder intermediate vectors
        self.fs = []
        self.bs = []

    def encode(self, words):
        """
        Encoder calculation
        :param words: List of input words
        :return: None (the intermediate vectors are stored in self.fs and self.bs)
        """
        # Initialize the internal memory and the intermediate vector
        c = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        h = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        # First, run the forward Encoder
        for w in words:
            c, h = self.f_encoder(w, c, h)
            # Record the calculated intermediate vector
            self.fs.append(h)
        # Initialize the internal memory and the intermediate vector
        c = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        h = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        # Run the reverse Encoder
        for w in reversed(words):
            c, h = self.b_encoder(w, c, h)
            # Record the calculated intermediate vector at the front so that positions stay aligned with self.fs
            self.bs.insert(0, h)
        # Initialize the internal memory and the intermediate vector for the Decoder
        self.c = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        self.h = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))

    def decode(self, w):
        """
        Decoder calculation
        :param w: Word to input to the Decoder
        :return: Predicted word
        """
        # Calculate the weighted averages of the Encoder intermediate vectors using the Attention Model
        att_f, att_b = self.attention(self.fs, self.bs, self.h)
        # Using the Decoder's intermediate vector, forward Attention, and reverse Attention,
        # calculate the next intermediate vector, internal memory, and predicted word
        t, self.c, self.h = self.decoder(w, self.c, self.h, att_f, att_b)
        return t

    def reset(self):
        """
        Initialize instance variables
        :return: None
        """
        # Initialize the internal memory and the intermediate vector
        self.c = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        self.h = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        # Initialize the lists that record the Encoder intermediate vectors
        self.fs = []
        self.bs = []
        # Initialize the gradients
        self.zerograds()
It uses a total of three LSTMs: forward Encoder, reverse Encoder, and Decoder.
The forward calculation and the training calculation are the same as for Seq2Seq.
The code I wrote is at https://github.com/kenchin110100/machine_learning/blob/master/sampleAttSeq2Seq.py.
As before, I used the dialogue breakdown corpus. https://sites.google.com/site/dialoguebreakdowndetection/chat-dialogue-corpus
I tested the same four utterances as before.
Let's see the response for each Epoch.
Epoch 1
Utterance:Good morning=>response:  ['so', 'is', 'Ne', '</s>']
Utterance:How's it going?=>response:  ['Yes', '、', 'what', 'To', 'You see', 'hand', 'Masu', 'Or', '?', '</s>']
Utterance:I'm hungry=>response:  ['Yes', '</s>']
Utterance:It's hot today=>response:  ['Yes', '、', 'what', 'To', 'You see', 'hand', 'Masu', 'Or', '?', '</s>']
Was it always this bad ...?
Epoch 3
Utterance:Good morning=>response:  ['Hello.', '</s>']
Utterance:How's it going?=>response:  ['so', 'is', '</s>']
Utterance:I'm hungry=>response:  ['heatstroke', 'To', 'Qi', 'To', 'Attach', 'hand', 'Masu', 'Or', '?', '</s>']
Utterance:It's hot today=>response:  ['heatstroke', 'To', 'Qi', 'To', 'Attach', 'Absent', 'Hmm', 'is', 'Or', '?', '</s>']
Epoch 5
Utterance:Good morning=>response:  ['Thank you', '</s>']
Utterance:How's it going?=>response:  ['watermelon', 'Is', 'Like', 'is', 'Ne', '</s>']
Utterance:I'm hungry=>response:  ['heatstroke', 'To', 'Qi', 'To', 'Attach', 'hand', 'Want', 'is', 'Or', '?', '</s>']
Utterance:It's hot today=>response:  ['heatstroke', 'To', 'Qi', 'To', 'Attach', 'hand', 'Want', 'is', 'Or', '?', '</s>']
I already know about heat stroke ...
Epoch 7
Utterance:Good morning=>response:  ['Good evening', '</s>']
Utterance:How's it going?=>response:  ['watermelon', 'Is', 'I love You', 'is', 'Ne', '</s>']
Utterance:I'm hungry=>response:  ['Bai', 'Bai', '</s>']
Utterance:It's hot today=>response:  ['heatstroke', 'To', 'Qi', 'To', 'Attach', 'Absent', 'Hmm', 'is', 'Or', '?', '</s>']
Good morning => Good evening, it's terrible ...
I feel that the accuracy is worse than that of Seq2Seq ...
As with Seq2Seq, the model does not respond well to the utterance "How's it going?". Perhaps the key word of that utterance (translated literally as "tone") did not appear in the corpus.
Since the training data is a dialogue breakdown corpus, returning broken responses may actually mean that the model is learning well ...
The transition of the total loss value and the comparison of the calculation time are summarized at the end.
I implemented Seq2Seq + Attention Model using Chainer.
The calculation time felt far longer than with Seq2Seq alone. A comparison of that will be ...
Next time, I would like to implement CopyNet ...