[Python] Explanation and implementation of the Decomposable Attention algorithm

An explanation and implementation of the algorithm from the paper *A Decomposable Attention Model for Natural Language Inference* [^1].

Decomposable Attention [^1] is useful when you want to examine the relationship between two sentences, for example whether a premise entails a hypothesis, whether two questions ask the same thing, or whether something has already been mentioned. This post introduces how the algorithm improves performance and describes an implementation in Keras [^4].

What is Decomposable Attention [^1]?

- A network for Natural Language Inference (NLI)
- Used for classification problems that take two documents as input
- Improves performance by incorporating Attention [^1] into the network
- Also used by the winner [^5] of Kaggle's Quora Question Pairs [^3] (a competition to judge whether two questions ask the same thing)

Figure: from *A Decomposable Attention Model for Natural Language Inference* [^1]

Algorithm

To use this algorithm, each word must first be converted into a distributed representation (word embedding) with GloVe [^7] or Word2Vec [^8].
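
As a concrete image, converting words to distributed representations could look like the following minimal sketch. The GloVe file name, the 300-dimensional size, and the zero-vector fallback for unknown words are assumptions for illustration, not details taken from this post.

```python
import numpy as np

def load_glove(path="glove.6B.300d.txt", dim=300):
    # Load pre-trained GloVe vectors into a dict {word: vector}.
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word = " ".join(parts[:-dim])  # token part (robust to tokens containing spaces)
            embeddings[word] = np.asarray(parts[-dim:], dtype=np.float32)
    return embeddings

def embed_sentence(sentence, embeddings, dim=300):
    # Each sentence becomes a matrix of shape (sentence_length, dim);
    # out-of-vocabulary words fall back to a zero vector here.
    return np.stack([embeddings.get(w, np.zeros(dim, dtype=np.float32))
                     for w in sentence.lower().split()])
```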

❏ Flow

- Attend
  - Align the words of the two sentences, giving higher weight to strongly related word pairs.
- Compare
  - Compare the aligned sentences and convert them into two sets of feature vectors.
- Aggregate
  - Combine the feature vectors and compute a score for each class.

❏ Input variables

Let the pair of input sentences be

\bar{a} = (a_1, ..., a_{l_a})^\mathrm{T} \\
\bar{b} = (b_1, ..., b_{l_b})^\mathrm{T} \\

where $l_a$ and $l_b$ denote the lengths of the two sentences.

Each $a_i, b_j \in \mathbb{R}^d$ is a distributed representation (word embedding).

- In the experiments, GloVe [^7] is used to convert each word into a 300-dimensional embedding vector.
- The embedding vectors $\bar{a}, \bar{b}$ are normalized to have an $l_2$ norm of $1$.

❏ Attend

The attention weights are computed as follows. This aligns word pairs in the two sentences that are strongly related.

e_{ij} := F'(\bar{a}_i, \bar{b}_j) := F(\bar{a}_i)^{\mathrm{T}} F(\bar{b}_j)

This attention weight $e_{ij}$ becomes large for word pairs in $\bar{a}$ and $\bar{b}$ that are strongly related.

In the implementation, $F$ is simply a feed-forward (fully connected) neural network.

Computing the pairwise function $F'$ directly would require $l_a \times l_b$ evaluations of a network that takes a pair of word vectors as input, which is expensive. By decomposing it as $F(\bar{a}_i)^{\mathrm{T}} F(\bar{b}_j)$, the network $F$ only needs to be evaluated $l_a + l_b$ times.
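
As a concrete image, $F$ could be written in Keras roughly as follows. This is only a sketch under assumptions: the hidden size and layer count are not taken from this post, and the actual code is linked in the Implementation section.

```python
from tensorflow.keras import layers, models

def build_F(dim=300, hidden=200):
    # A small fully connected (feed-forward) network applied to each word
    # vector of a sentence; Dense acts on the last axis of (batch, length, dim).
    inputs = layers.Input(shape=(None, dim))
    x = layers.Dense(hidden, activation="relu")(inputs)
    x = layers.Dense(hidden, activation="relu")(x)
    return models.Model(inputs, x)
```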

Based on the computed attention weights $e_{ij}$, the soft-aligned (normalized) counterparts $\beta_i$ and $\alpha_j$ of $\bar{a}_i$ and $\bar{b}_j$ are obtained:

\beta_i := \sum_{j=1}^{l_b} \frac{ \exp(e_{ij}) }{ \sum_{k=1}^{l_b} \exp(e_{ik})} \bar{b}_j \quad \forall i \in \left\{1,...,l_a \right\} \\

\alpha_j := \sum_{i=1}^{l_a} \frac{ \exp(e_{ij}) }{ \sum_{k=1}^{l_a} \exp(e_{kj})} \bar{a}_i \quad \forall j \in \left\{1,...,l_b \right\} 
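
The following is a minimal sketch of the Attend step in plain numpy, an assumption for illustration (the post's actual implementation is in Keras). `a_bar` and `b_bar` are assumed to be the word-vector matrices after $F$ has been applied.

```python
import numpy as np

def attend(a_bar, b_bar):
    # a_bar: (l_a, d), b_bar: (l_b, d)
    e = a_bar @ b_bar.T                                       # e_ij, shape (l_a, l_b)

    # beta_i: softmax of e_i* over the words of b, then weighted sum of b_bar
    w_b = np.exp(e - e.max(axis=1, keepdims=True))
    beta = (w_b / w_b.sum(axis=1, keepdims=True)) @ b_bar     # (l_a, d)

    # alpha_j: softmax of e_*j over the words of a, then weighted sum of a_bar
    w_a = np.exp(e - e.max(axis=0, keepdims=True))
    alpha = (w_a / w_a.sum(axis=0, keepdims=True)).T @ a_bar  # (l_b, d)
    return beta, alpha
```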

❏ Compare

v_{1,i} := G([\bar{a}_i, \beta_i]) \quad \forall i \in \left\{1,...,l_a \right\} \\
v_{2,j} := G([\bar{b}_j, \alpha_j]) \quad \forall j \in \left\{1,...,l_b \right\}

Here too, $G$ is a feed-forward neural network in the implementation.
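
A minimal numpy sketch of the Compare step is shown below. $G$ is assumed to be some callable feed-forward network; here it is simply passed in as a function.

```python
import numpy as np

def compare(a_bar, beta, b_bar, alpha, G):
    # Concatenate each word vector with its soft-aligned counterpart and apply G.
    v1 = np.stack([G(np.concatenate([a_bar[i], beta[i]])) for i in range(len(a_bar))])
    v2 = np.stack([G(np.concatenate([b_bar[j], alpha[j]])) for j in range(len(b_bar))])
    return v1, v2  # shapes: (l_a, h) and (l_b, h)
```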

❏ Aggregate

v_1 = \sum_{i=1}^{l_a} v_{1,i} \\
v_2 = \sum_{j=1}^{l_b} v_{2,j} \\
\hat{y} = H([v_1, v_2])
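
A minimal numpy sketch of the Aggregate step, under the same assumption that $H$ is some callable feed-forward network mapping the combined feature vector to class scores:

```python
import numpy as np

def aggregate(v1, v2, H):
    v1_sum = v1.sum(axis=0)                     # sum over the words of sentence a
    v2_sum = v2.sum(axis=0)                     # sum over the words of sentence b
    return H(np.concatenate([v1_sum, v2_sum]))  # \hat{y}: one score per class
```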

❏ Intra-Sentence Attention (optional)

An optional mechanism: instead of simply embedding each word independently, attention within a sentence is computed with a neural network so that each word vector can also capture the meaning of the sentence around it.

f_{ij} := F_{\rm{intra}}(a_i)^{\mathrm{T}}F_{\rm{intra}}(a_j)

Here, $F_{\rm{intra}}$ is a feed-forward network, and $\bar{a}_i$ and $\bar{b}_i$ are redefined as follows.

\bar{a_i} := [a_i, a'_i]\\
\bar{b_i} := [b_i, b'_i]

where $a'_i$ is defined as

a'_i := \sum_{j=1}^{l_a} \frac{\exp(f_{ij} + d_{i-j})}{\sum_{k=1}^{l_a} \exp(f_{ik} + d_{i-k})} a_j

Here, $d_{i-j} \in \mathbb{R}$ is a distance-sensitive bias term. $b'_i$ is defined in the same way.
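
A minimal numpy sketch of intra-sentence attention is shown below. It assumes $F_{\rm{intra}}$ has already been applied to the word vectors and that the bias $d_{i-j}$ is looked up from a table indexed by the clipped absolute distance $|i-j|$; the clipping window of 10 is an assumption for illustration.

```python
import numpy as np

def intra_attention(a, d_bias, max_dist=10):
    # a: (l_a, dim); d_bias: (max_dist + 1,) bias per clipped distance
    l_a = a.shape[0]
    f = a @ a.T                                               # f_ij
    idx = np.arange(l_a)
    dist = np.clip(np.abs(idx[:, None] - idx[None, :]), 0, max_dist)
    scores = f + d_bias[dist]                                 # f_ij + d_{i-j}
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    a_prime = (w / w.sum(axis=1, keepdims=True)) @ a          # a'_i
    return np.concatenate([a, a_prime], axis=1)               # \bar{a}_i = [a_i, a'_i]
```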

Implementation [^4]

The implementation uses Keras [^9] and was written with reference to the code in [^2], [^6], and [^10].

https://gist.github.com/namakemono/f4f273dbc63fc2174940415a9f689a6f
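
The gist above contains the full code. As a rough image only, the three steps could be wired together with `tensorflow.keras` as in the following simplified sketch; it is not the code in the gist, the vocabulary size, sentence length, hidden sizes, and number of classes are assumptions, and intra-sentence attention is omitted.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_decomposable_attention(max_len=30, vocab=20000, dim=300,
                                 hidden=200, n_classes=3):
    def ff():
        # A small feed-forward block reused for F, G, and H.
        return models.Sequential([layers.Dense(hidden, activation="relu"),
                                  layers.Dense(hidden, activation="relu")])

    in_a = layers.Input(shape=(max_len,), dtype="int32")
    in_b = layers.Input(shape=(max_len,), dtype="int32")
    embed = layers.Embedding(vocab, dim)
    a, b = embed(in_a), embed(in_b)

    # Attend: e_ij = F(a_i)^T F(b_j), then soft alignment in both directions
    F = ff()
    e = layers.Dot(axes=-1)([F(a), F(b)])                            # (batch, l_a, l_b)
    beta = layers.Dot(axes=(2, 1))([layers.Softmax(axis=-1)(e), b])  # aligned b for each a_i
    alpha = layers.Dot(axes=(1, 1))([layers.Softmax(axis=1)(e), a])  # aligned a for each b_j

    # Compare: v1_i = G([a_i, beta_i]), v2_j = G([b_j, alpha_j])
    G = ff()
    v1 = G(layers.Concatenate()([a, beta]))
    v2 = G(layers.Concatenate()([b, alpha]))

    # Aggregate: sum over the words of each sentence, then classify with H
    sum_pool = layers.Lambda(lambda x: tf.reduce_sum(x, axis=1))
    H = ff()
    feats = H(layers.Concatenate()([sum_pool(v1), sum_pool(v2)]))
    out = layers.Dense(n_classes, activation="softmax")(feats)
    return models.Model([in_a, in_b], out)
```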

References
