I studied **Deep Learning from Scratch 2: Natural Language Processing** (Zero Tsuku 2). Chapter 8 covers **Attention**, but only about four pages are devoted to the **Transformer**, which is the basis of **BERT**, the current mainstream.
Actually, this is not unreasonable: Zero Tsuku 2 was published in July 2018, while the Transformer appeared in December 2017 and BERT in October 2018.
This time, starting from the **Attention** of Zero Tsuku 2, I would like to work up to the **Transformer** paper **Attention Is All You Need**, which can be said to be the foundation of today's natural language processing.
In Chapter 8 (Attention) of Zero Tsuku 2, in order to improve the performance of a **Seq2Seq translation model**, we implement **Attention** so that the Decoder **pays attention to the Encoder's "I" when it translates "I"**.
First, in the Attention Weight step, the **dot product** between the Decoder's "I" vector and each word vector of the Encoder is calculated to measure their similarity, and **Softmax** is applied to the result to obtain the **weight a**. For clarity, however, the book computes the dot product as **elementwise multiplication + SUM along the H axis**.
Next, in the Attention Sum step, the **weighted sum** of the Encoder's word vectors is computed with the weight a to obtain the **Context Vector**. This Context Vector strongly reflects the word vector of "I", indicating that the Decoder's "I" should pay attention to the Encoder's "I".
So the Zero Tsuku 2 code prioritizes clarity and deliberately uses **multiplication + axis SUM** instead of the **dot product**. What happens if we compute it properly with dot products?
Both the weight calculation and the weighted-sum calculation become much cleaner like this when written as matrix products. Isn't that more rational?
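As a check, here is a minimal NumPy sketch (my own, not the book's code) that computes the weight a and the Context Vector both ways; the sizes T, H and the random vectors are placeholders.

```python
import numpy as np

T, H = 5, 4                      # sequence length and hidden size (assumed)
hs = np.random.randn(T, H)       # Encoder hidden states
h  = np.random.randn(H)          # one Decoder hidden state ("I")

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Zero Tsuku 2 style: elementwise multiplication + SUM along the H axis
t = hs * h                                 # broadcast multiply, shape (T, H)
s = t.sum(axis=1)                          # similarity scores, shape (T,)
a = softmax(s)                             # weight a
c1 = (hs * a.reshape(T, 1)).sum(axis=0)    # weighted sum -> Context Vector

# Matrix-product style: the same computation as dot products
a2 = softmax(hs @ h)                       # scores and weights in one line
c2 = a2 @ hs                               # weighted sum as a matrix product

print(np.allclose(c1, c2))                 # True: both give the same result
```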
By the way, in Attention terminology, the vector for which attention is computed is called the **Query**, the set of word vectors used for the similarity calculation is called the **Key**, and the set of vectors used for the weighted sum is called the **Value**. Redrawing the figure to reflect these terms,
the target **"I" is the Query**, the set of word vectors **hs used for the similarity calculation is the Key**, and the set of vectors **hs used for the weighted sum is the Value**. In Zero Tsuku 2, Key and Value are identical, but making the two independent improves expressiveness.
One more thing: Attention is generally classified by where its inputs come from.
When the Query and the Key/Value come from different places, it is called **Source-Target Attention**; when the Query, Key, and Value all come from the same place (Self), it is called **Self Attention**. The Attention in Zero Tsuku 2 is Source-Target Attention.
Now, let's change the notation a little, rewrite it in the **Self Attention** form, and add one more element.
What I added is the part that **divides by √dk**. The reason for dividing by √dk is that if the dot products contain an overly large value, **Softmax** pushes the other values toward 0 and the **gradient vanishes**; the scaling prevents this. dk is the dimensionality of the word embeddings, which is 512 in the paper.
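As a small illustration of this point (my own toy example, not from the paper), raw dot products of 512-dimensional vectors produce a nearly one-hot Softmax, while dividing by √dk keeps the distribution smooth:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

dk = 512
q = np.random.randn(dk)                # a query vector
keys = np.random.randn(3, dk)          # three key vectors
scores = keys @ q                      # raw dot products grow with dk
print(softmax(scores))                 # typically close to one-hot
print(softmax(scores / np.sqrt(dk)))   # scaled: a much smoother distribution
```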
Incidentally, the Transformer is built only from Attention, which can be processed in parallel, without any RNN, which requires sequential processing. There is therefore no need to compute each query one by one: **all queries can be computed at once**, so it can be written as follows.
![Screenshot 2020-09-04 23.07.57.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/209705/ebf59620-7432-2f75-6714-4e37375c7af9.png)
In the Attention Is All You Need paper, this is called **Scaled Dot-Product Attention** and is expressed by the following formula. This is the heart of the Transformer.
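Written out, the formula from the paper is:

```math
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
```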
4. Attention Is All You Need
Now, in order to understand the paper **Attention Is All You Need**, published in December 2017, I will explain the figures that appear in it.
First, the Scaled Dot-Product Attention diagram. **MatMul** is the dot product, and **Scale** is the division by √dk. **Mask (opt)** masks the padded positions when the input sentence is shorter than the sequence length. You can see that the diagram is a direct encoding of the Attention(Q, K, V) formula.
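Here is a minimal NumPy sketch of that computation, following the figure step by step (MatMul, Scale, Mask, Softmax, MatMul); the batch size, sequence length, dk = 64, and the padding mask are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # MatMul + Scale
    if mask is not None:
        scores = np.where(mask, scores, -1e9)        # Mask (opt): hide padding
    weights = softmax(scores, axis=-1)               # Softmax -> weight a
    return weights @ V                               # MatMul with Value

# Example: 2 sentences, sequence length 5, d_k = 64 (hypothetical sizes)
Q = K = V = np.random.randn(2, 5, 64)                # Self Attention: same source
pad_mask = np.ones((2, 1, 5), dtype=bool)            # True = valid position
pad_mask[1, :, 3:] = False                           # 2nd sentence has only 3 words
print(scaled_dot_product_attention(Q, K, V, pad_mask).shape)   # (2, 5, 64)
```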
Next is **Multi-Head Attention**, which contains the Scaled Dot-Product Attention described above. The single output of the previous layer is fed in as Q, K, and V through three **Linear (fully connected) layers**. In other words, Q, K, and V are the previous layer's output multiplied by the weights Wq, Wk, and Wv respectively, and these weights are learned.
A set of Query, Key, and Value is called a **Head**. Rather than one large head, performance is better with several smaller heads, computing a latent representation for each and concatenating them at the end; that is why Multi-Head Attention is used. In the paper, the 512-dimensional output of the previous layer is split into 8 heads of 64 dimensions each.
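A rough NumPy sketch of this flow might look like the following; the random weight matrices stand in for the learned Linear layers, and the batch and sequence sizes are arbitrary.

```python
import numpy as np

d_model, n_heads = 512, 8
d_k = d_model // n_heads                  # 64 dimensions per head, as in the paper

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo):
    B, T, _ = x.shape
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                     # three Linear projections
    def split(t):                                        # (B, T, d_model) -> (B, heads, T, d_k)
        return t.reshape(B, T, n_heads, d_k).transpose(0, 2, 1, 3)
    Q, K, V = split(Q), split(K), split(V)
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_k)  # per-head Scaled Dot-Product
    heads = softmax(scores) @ V                          # (B, heads, T, d_k)
    concat = heads.transpose(0, 2, 1, 3).reshape(B, T, d_model)   # concat the heads
    return concat @ Wo                                   # final Linear layer

x = np.random.randn(2, 5, d_model)                       # previous layer's output
Wq, Wk, Wv, Wo = (np.random.randn(d_model, d_model) * 0.02 for _ in range(4))
print(multi_head_attention(x, Wq, Wk, Wv, Wo).shape)     # (2, 5, 512)
```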
The overall architecture is an **Encoder-Decoder translation model**. The aim is to drop the sequentially computed RNN and build the model only from Attention, which can be computed in parallel, to speed up processing. Nx means that the same block is repeated N times; in the paper N = 6.
① Input Embedding: Inputs has shape (batch size, number of word IDs). Each word ID is converted with a pre-trained word embedding, and the output has shape (batch size, number of word IDs, embedding dimension). The embedding dimension in the paper is 512.
② Positional Encoding: Taking a weighted sum loses **word order information** ("I like her" and "she likes me" could no longer be distinguished), so positional information (a pattern of sin and cos functions) is added to the word embeddings to make it easier to learn **relative word positions**. In other words, the same word gets a different vector when it appears in a different position. The formula for the positional information in the paper is as follows.
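Here pos is the position of the word and i is the dimension index:

```math
PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right),\qquad
PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
```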
③ Add & Norm: A **skip connection** is followed by normalization with **Layer Normalization** and regularization with Dropout. Layer Normalization normalizes each example (sentence) in the batch individually, rather than normalizing across the batch.
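A minimal sketch of this step, assuming the usual Transformer convention of normalizing over the feature dimension of each position (Dropout omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):                  # x: (batch, seq_len, d_model)
    mean = x.mean(axis=-1, keepdims=True)     # statistics per position, not per batch
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_out):
    return layer_norm(x + sublayer_out)       # skip connection, then normalization

x = np.random.randn(2, 5, 512)
print(add_and_norm(x, np.random.randn(2, 5, 512)).shape)   # (2, 5, 512)
```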
④ Feed Forward: The output of the Attention layer is transformed into features by two fully connected layers.
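A small sketch of this position-wise feed-forward block with a ReLU between the two fully connected layers; d_model = 512 follows the text, the inner size d_ff = 2048 follows the paper, and the weights are random stand-ins:

```python
import numpy as np

d_model, d_ff = 512, 2048
W1, b1 = np.random.randn(d_model, d_ff) * 0.02, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.02, np.zeros(d_model)

def feed_forward(x):                              # x: (batch, seq_len, d_model)
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # two dense layers, ReLU in between

print(feed_forward(np.random.randn(2, 5, d_model)).shape)   # (2, 5, 512)
```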
⑤ Masked Multi-Head Attention: When computing the Attention for "I", letting it see "am", "a", and "cat" would amount to cheating by looking at the words to be predicted, so a mask is applied to hide the later words in the Key.
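A small sketch of such a look-ahead mask (the sizes and scores are made up for illustration): position i can only attend to positions up to i, so the upper triangle of the score matrix is hidden before Softmax.

```python
import numpy as np

T = 4                                                 # e.g. "I", "am", "a", "cat"
scores = np.random.randn(T, T)                        # QK^T / sqrt(d_k), hypothetical
causal_mask = np.tril(np.ones((T, T), dtype=bool))    # lower triangle = visible
masked = np.where(causal_mask, scores, -1e9)          # hide the later words
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))                           # row i is zero after column i
```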
The Transformer was proposed as a translation model, but as research progressed it became clear that the **ability of Self Attention to extract the meaning of a sentence** is quite strong, and various extensions have been developed, so I will touch on them briefly.
June 2018. **GPT** improved performance with two-stage learning, **pre-training + fine-tuning**, using only the Decoder side of the Transformer. The pre-training task is **guessing the next word**, and since it must not see future information, the later words are masked.
Structurally, however, it has the well-known drawback that it cannot use the context that follows a word; if that backward context could be used, further performance gains would be possible.
February 2018. **ELMo** tried to use the backward context by other means: giving up the parallel processing that Attention allows, it uses a multi-layer bidirectional LSTM instead. It turned out that doing so improves performance.
October 2018. **BERT**, a revolutionary method in natural language processing, was announced. Put simply, by replacing GPT's pre-training with the following two tasks, it achieves parallel processing while using both the preceding and following context. It is, so to speak, the birth of a bidirectional Transformer.
**① Masked Language Model**: Solve a fill-in-the-blank problem. Choose 15% of the words; of these, mask 80%, replace 10% with another word, and leave 10% as they are. **② Next Sentence Prediction**: Determine whether two sentences are consecutive in context.
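As a rough sketch of the Masked Language Model corruption rule (the tiny vocabulary and the helper mask_tokens are my own, purely for illustration):

```python
import random

MASK = "[MASK]"
vocab = ["I", "am", "a", "cat", "she", "likes", "me"]

def mask_tokens(tokens, select_p=0.15):
    out, labels = list(tokens), []
    for i, tok in enumerate(tokens):
        if random.random() < select_p:          # roughly 15% of positions are chosen
            labels.append((i, tok))             # remember the original word to predict
            r = random.random()
            if r < 0.8:
                out[i] = MASK                   # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = random.choice(vocab)   # 10%: replace with a random word
            # remaining 10%: keep the original word as is
    return out, labels

print(mask_tokens(["I", "am", "a", "cat", "she", "likes", "me"]))
```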