[Python] Negative/positive judgment of sentences and visualization of the grounds with a Transformer

1. Introduction

I'm reading **"Developmental Deep Learning with PyTorch"**. This time I studied the Transformer in Chapter 7, so I would like to write up my own summary.

2. What is the Transformer?

In 2017, the landmark paper **"Attention Is All You Need"** was published in the field of natural language processing. The model proposed there was the **Transformer**, which achieved SoTA on translation tasks using **Attention** alone, without any of the RNNs that had been mainstream until then.

Since then, models based on the Transformer, such as **BERT, XLNet, and ALBERT**, have dominated the field of natural language processing, to the point where modern NLP is practically synonymous with the Transformer.

[Figure: Transformer model diagram]

Here is the model diagram of a Transformer performing a translation task. Taking Japanese-to-English translation as an example, the **Encoder** on the left learns the Attention of each word in the Japanese sentence, and the **Decoder** on the right learns the Attention of each word in the English sentence while referring to that information. Below, I explain five key features.

1) Positional Encoding
The Transformer's biggest aim is to exploit the GPU and greatly speed up processing by handling all the words of a sentence **in parallel**, instead of processing them one at a time like an RNN. Since parallel processing would otherwise lose the word-order information, **Positional Encoding** adds word-order information to each word.
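For reference, the sinusoidal encoding defined in the paper is as follows, where $pos$ is the word position, $i$ indexes the embedding dimensions, and $d_{model}$ is the embedding size.

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)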

2) Scaled Dot-Product Attention
This is the heart of the Transformer, so I will explain it a little more carefully. Three players appear in the Attention calculation: the **Query** (the word vector for which Attention is computed), the **Key** (the collection of word vectors used to compute relevance), and the **Value** (the word vectors used for the weighted sum).

[Figure: Scaled dot-product Attention calculation for a single Query]
Using the sentence "I am a cat" (吾輩は猫である), which consists of the five words "吾輩 (I)", "は", "猫", "で", and "ある", I will explain how the Attention of "吾輩" ("I") is calculated.

Since the degree of relevance can be computed as an inner product of vectors, we take the inner product of the "I" vector (the **Query**) and **$Key^T$**, the transpose of the matrix of the five word vectors. Dividing the result by $\sqrt{d_k}$ and applying Softmax gives the weights (**Attention Weight**) that indicate which words are related to "I", and by how much.

The reason for dividing by $\sqrt{d_k}$ is that if some inner products are too large, the other values can be crushed to 0 when Softmax is applied.

Next, taking the **inner product** of the **Attention Weight** and the matrix of the five word vectors (**Value**) yields the **Context Vector**, in which the vector components of the words closely related to "I" are dominant. This completes the Attention calculation for "I".

Incidentally, since the Transformer computes in parallel, the Attention of all Queries can be calculated at once.

[Figure: Attention calculation for all Queries at once]

In this way, the Attention calculation for all Queries is completed in one shot. The paper expresses this calculation with the following formula.

Attention(Q, K, V)=softmax(\frac{QK^T}{\sqrt{d_k}})V
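As a minimal sketch of this formula in PyTorch (not the implementation used later in this article), assuming q, k, and v are tensors of shape (batch, number of words, d_k):

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, number of words, d_k)
    d_k = q.size(-1)
    # Relevance of every Query to every Key, scaled by sqrt(d_k)
    weights = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    attention_weight = F.softmax(weights, dim=-1)  # Attention Weight
    context = torch.matmul(attention_weight, v)    # Context Vector
    return context, attention_weight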

3) Multi-Head Attention

[Figure: Multi-Head Attention]

The **Query, Key, and Value** inputs to Scaled Dot-Product Attention are each produced by passing the output of the previous stage through a fully connected layer; in other words, the previous stage's output is multiplied by the weight matrices $W_q$, $W_k$, and $W_v$. It turns out that, rather than having one large Query/Key/Value set (called a head), performance improves when you use several small Query/Key/Value heads, let each head compute its own latent representation with its own $W_q$, $W_k$, $W_v$, and combine the results into one at the end. This is **Multi-Head Attention**.
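A minimal sketch of the head-splitting idea (an illustration only, not the code used in this article; it assumes d_model is divisible by num_heads):

import torch

def split_heads(x, num_heads):
    # x: (batch, number of words, d_model)
    batch_size, seq_len, d_model = x.size()
    # -> (batch, num_heads, number of words, d_model // num_heads)
    return x.view(batch_size, seq_len, num_heads, d_model // num_heads).transpose(1, 2)

Each head then runs Scaled Dot-Product Attention on its own smaller Query/Key/Value, and the results are concatenated back to d_model dimensions at the end.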

4) Masked Multi-Head Attention

[Figure: Masked Multi-Head Attention]

Attention on the Decoder side is also computed in parallel, but when computing the Attention of "I", including "am", "a", and "cat" in the calculation would be cheating, since those are exactly the words to be predicted. So the subsequent words in the Key are masked so that they cannot be seen. Multi-Head Attention with this feature is called **Masked Multi-Head Attention**.
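As a hedged sketch, such a mask can be made from a lower-triangular matrix (assuming a 5-word sentence); the hidden positions are filled with a large negative number before Softmax.

import torch

seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len))  # 1 = visible, 0 = future word (hidden)
# Applied to the attention weights before Softmax:
# weights = weights.masked_fill(causal_mask == 0, -1e9)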

5) Position-wise Feed-Forward Networks
This is a unit that transforms the features output by the Attention layer with two fully connected layers. The input has shape (number of words, embedding dimension), and the inner products with the weights of the two fully connected layers produce an output of the same shape (number of words, embedding dimension). It is called **Position-wise** because it looks as if an independent neural network were applied to each word position.
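For reference, the paper defines this unit as two linear transformations with a ReLU in between.

FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2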

3. Model to be implemented this time

This time, using only the **Encoder** on the left side of the Transformer translation model, we implement a model that solves a classification task by learning the Attention of each word in a sentence. In addition, prioritizing clarity, it uses Single-Head Attention rather than Multi-Head Attention.

[Figure: Model implemented in this article]

The dataset used is **IMDb** (Internet Movie Database), a collection of English movie reviews labeled as positive or negative.

By training the model, we make it **judge whether a given movie review is positive or negative** and, from the mutual Attention of the words in the review, **clearly indicate the words on which the judgment was based**.

Now, let me implement the components in order, starting from the input.

4. Model code

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class Embedder(nn.Module):
    '''Converts the word indicated by its ID into a vector'''

    def __init__(self, text_embedding_vectors):
        super(Embedder, self).__init__()

        self.embeddings = nn.Embedding.from_pretrained(
            embeddings=text_embedding_vectors, freeze=True)
        # freeze=True prevents the embeddings from being updated by backpropagation

    def forward(self, x):
        x_vec = self.embeddings(x)

        return x_vec

This is the part that converts word IDs into embedding vectors using PyTorch's nn.Embedding.
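As a quick shape check, here is a hedged usage sketch in which random vectors merely stand in for whatever pretrained embedding matrix is passed in:

vocab_size, d_model = 10000, 300                  # hypothetical vocabulary size and embedding size
dummy_vectors = torch.randn(vocab_size, d_model)  # stands in for pretrained word vectors
embedder = Embedder(dummy_vectors)
ids = torch.randint(0, vocab_size, (2, 256))      # (batch, number of words) word IDs
print(embedder(ids).shape)                        # torch.Size([2, 256, 300])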

class PositionalEncoder(nn.Module):
    '''Adds positional information to the input word vectors'''

    def __init__(self, d_model=300, max_seq_len=256):
        super().__init__()

        self.d_model = d_model  #Number of dimensions of word vector

        #Create a table of values as pe that is uniquely determined by the order of words (pos) and the dimensional position (i) of the embedded vector.
        pe = torch.zeros(max_seq_len, d_model)

        #Send to GPU if GPU is available
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        pe = pe.to(device)

        for pos in range(max_seq_len):
            for i in range(0, d_model, 2):
                pe[pos, i] = math.sin(pos / (10000 ** ((2 * i)/d_model)))
                pe[pos, i + 1] = math.cos(pos / (10000 ** ((2 * i)/d_model)))

        #Add the dimension that will be the mini-batch dimension to the beginning of the table pe.
        self.pe = pe.unsqueeze(0)

        #Avoid calculating gradients
        self.pe.requires_grad = False

    def forward(self, x):

        #Add the Positional Encoding to the input x
        #The values of x are smaller than those of pe, so scale x up by sqrt(d_model)
        ret = math.sqrt(self.d_model)*x + self.pe
        return ret

This is the Positional Encoder part.
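A quick shape check (a sketch; .to(pos_encoder.pe.device) just keeps the dummy input on the same device as the pe table):

pos_encoder = PositionalEncoder(d_model=300, max_seq_len=256)
x = torch.randn(2, 256, 300).to(pos_encoder.pe.device)  # dummy embedded input
print(pos_encoder(x).shape)                              # torch.Size([2, 256, 300])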

class Attention(nn.Module):
    '''The real Transformer uses Multi-Head Attention,
    but here we implement a single-head Attention, prioritizing clarity'''

    def __init__(self, d_model=300):
        super().__init__()

        #SAGAN used 1D convolutions, but here the features are transformed with fully connected layers
        self.q_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)

        #Fully connected layer used for output
        self.out = nn.Linear(d_model, d_model)

        #Variable used to scale the Attention
        self.d_k = d_model

    def forward(self, q, k, v, mask):
        #Convert features in fully connected layers
        k = self.k_linear(k)
        q = self.q_linear(q)
        v = self.v_linear(v)

        #Calculate the value of Attention
        #Divide by sqrt(d_k) to keep the values from becoming too large
        weights = torch.matmul(q, k.transpose(1, 2)) / math.sqrt(self.d_k)

        #Calculate mask here
        mask = mask.unsqueeze(1)
        weights = weights.masked_fill(mask == 0, -1e9)

        #Normalize with softmax
        normlized_weights = F.softmax(weights, dim=-1)

        #Multiply Attention by Value
        output = torch.matmul(normlized_weights, v)

        #Convert features in fully connected layers
        output = self.out(output)

        return output, normlized_weights

This is the Attention part. Regarding the mask computation here: the positions where the text is short and <pad> has been inserted should become 0 after Softmax, so those positions are replaced with minus infinity (-1e9) beforehand.
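As a hedged sketch of how such a padding mask can be built (pad_idx is a hypothetical index for the <pad> token):

import torch

pad_idx = 1                                        # hypothetical index of the <pad> token
ids = torch.tensor([[5, 8, 2, pad_idx, pad_idx]])  # (batch, number of words) word IDs
input_mask = (ids != pad_idx)                      # True (1) where a real word exists
# Inside Attention.forward, mask.unsqueeze(1) broadcasts this over the Query axis,
# and the padded positions are filled with -1e9 before Softmax.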

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff=1024, dropout=0.1):
        '''It is a unit that simply converts the features from the Attention layer with two fully connected layers.'''
        super().__init__()

        self.linear_1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        x = self.linear_1(x)
        x = self.dropout(F.relu(x))
        x = self.linear_2(x)
        return x

This is the Feed-Forward part: simply two fully connected layers applied in sequence.

class TransformerBlock(nn.Module):
    def __init__(self, d_model, dropout=0.1):
        super().__init__()

        #Layer Normalization layer
        # https://pytorch.org/docs/stable/nn.html?highlight=layernorm
        self.norm_1 = nn.LayerNorm(d_model)
        self.norm_2 = nn.LayerNorm(d_model)

        #Attention layer
        self.attn = Attention(d_model)

        #Two fully connected layers after Attention
        self.ff = FeedForward(d_model)

        # Dropout
        self.dropout_1 = nn.Dropout(dropout)
        self.dropout_2 = nn.Dropout(dropout)

    def forward(self, x, mask):
        #Normalization and Attention
        x_normlized = self.norm_1(x)
        output, normlized_weights = self.attn(
            x_normlized, x_normlized, x_normlized, mask)

        x2 = x + self.dropout_1(output)

        #Normalization and fully connected layer
        x_normlized2 = self.norm_2(x2)
        output = x2 + self.dropout_2(self.ff(x_normlized2))

        return output, normlized_weights

This is the part that combines Attention and Feed-Forward into a Transformer Block. Both sub-layers are wrapped with **Layer Normalization** and **Dropout**, plus a **residual connection** like in ResNet.

class ClassificationHead(nn.Module):
    '''Uses the TransformerBlock output to perform the final classification'''

    def __init__(self, d_model=300, output_dim=2):
        super().__init__()

        #Fully connected layer
        self.linear = nn.Linear(d_model, output_dim)  # output_dim covers the two classes: negative and positive

        #Weight initialization processing
        nn.init.normal_(self.linear.weight, std=0.02)
        nn.init.normal_(self.linear.bias, 0)

    def forward(self, x):
        x0 = x[:, 0, :]  #Extract the features (300 dimensions) of the first word of each sentence in each mini-batch
        out = self.linear(x0)

        return out

Finally, this is the part that makes the negative/positive judgment. By classifying with the features of the first word of each sentence and backpropagating the loss, the features of the first word naturally come to represent whether the sentence is negative or positive.

class TransformerClassification(nn.Module):
    '''Classify with Transformer'''

    def __init__(self, text_embedding_vectors, d_model=300, max_seq_len=256, output_dim=2):
        super().__init__()

        #Model building
        self.net1 = Embedder(text_embedding_vectors)
        self.net2 = PositionalEncoder(d_model=d_model, max_seq_len=max_seq_len)
        self.net3_1 = TransformerBlock(d_model=d_model)
        self.net3_2 = TransformerBlock(d_model=d_model)
        self.net4 = ClassificationHead(output_dim=output_dim, d_model=d_model)

    def forward(self, x, mask):
        x1 = self.net1(x)  #Convert the words into vectors
        x2 = self.net2(x1)  #Add positional information
        x3_1, normlized_weights_1 = self.net3_1(
            x2, mask)  #Transform the features with Self-Attention
        x3_2, normlized_weights_2 = self.net3_2(
            x3_1, mask)  #Transform the features with Self-Attention
        x4 = self.net4(x3_2)  #Classify using the 0th word of the final output and output the two class scores
        return x4, normlized_weights_1, normlized_weights_2

This is the part that finally builds the entire model using the classes defined so far.
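As a final sanity check, here is a hedged sketch of one forward pass with random embeddings and a dummy all-ones mask. It assumes CPU-only execution, since PositionalEncoder pins its pe table to the GPU whenever one is available.

vocab_size, d_model, max_seq_len = 10000, 300, 256    # hypothetical sizes
dummy_vectors = torch.randn(vocab_size, d_model)      # stands in for pretrained embeddings
model = TransformerClassification(dummy_vectors, d_model=d_model, max_seq_len=max_seq_len)

ids = torch.randint(2, vocab_size, (4, max_seq_len))  # (batch, number of words)
mask = torch.ones(4, max_seq_len, dtype=torch.long)   # 1 = real word at every position here
out, attn_1, attn_2 = model(ids, mask)
print(out.shape, attn_1.shape)  # torch.Size([4, 2]) torch.Size([4, 256, 256])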

5. Whole code and execution

The entire code was created on Google Colab and posted on GitHub, so if you want to try it yourself, open this [**"link"**](https://github.com/cedro3/Transformer/blob/master/transformer_en_run.ipynb) and click the "Colab on Web" button at the top of the displayed notebook to run it.

When you run it, the judgment result and its grounds are displayed as shown below.

[Figure: Example output showing the judgment result and the words it was based on]

6. Try it with a Japanese dataset.

While browsing various websites, I came across an example that uses **chABSA-dataset**, which extracts sentences from the securities reports of Japanese listed companies, to make negative/positive judgments and display the grounds for the judgment, so I also put it together on Google Colab. If you want to try it yourself, open this **"link"** and click the "Colab on Web" button at the top of the displayed notebook to run it.

(References)
・Learn while making! Developmental Deep Learning with PyTorch
・I made a negative/positive analysis app with deep learning (Python) [Part 1]
・Paper commentary: Attention Is All You Need (Transformer)
