An explanation and implementation of the algorithm from the paper A Decomposable Attention Model for Natural Language Inference [^ 1].
The idea is that the Decomposable Attention [^ 1] algorithm is useful whenever you want to examine the relationship between two sentences, for example a premise and a hypothesis, a question and its answer, or whether something has already been mentioned, and that it performs well. This post introduces the algorithm and shows an implementation in Keras [^ 4].
- A network for Natural Language Inference (NLI)
- Used for classification problems that take two documents as input
- Improves performance by incorporating Attention [^ 1] into the network
- Also used by the winner [^ 5] of Kaggle's Quora Question Pairs [^ 3] (a competition to judge whether two questions are the same)
Figure: From A Decomposable Attention Model for Natural Language Inference [^ 1]
In order to use this algorithm, it is necessary to convert each word into a distributed representation with GloVe [^ 7] or Word2Vec [^ 8].
- Attend: softly align the two documents so that words that are related across them receive higher weight
- Compare: compare the two aligned sentences and convert them into two sets of feature vectors
- Aggregate: combine the feature vectors and calculate a likelihood for each class
Let the pair of input sentences be
\bar{a} = (a_1, ..., a_{l_a})^\mathrm{T} \\
\bar{b} = (b_1, ..., b_{l_b})^\mathrm{T} \\
Here, $l_a$ and $l_b$ represent the lengths of the two documents.
Also, $a_i, b_j \in \mathbb{R}^d$ are distributed representations (word embeddings) of some kind.
- In the experiments, GloVe [^ 7] is used to convert each word into a 300-dimensional vector
- The embedding vectors $\bar{a}, \bar{b}$ can be normalized to an $l_2$ norm of $1$
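To make this input step concrete, here is a minimal sketch in numpy. The GloVe file name `glove.6B.300d.txt` and the toy sentences are assumptions for illustration, not the setup used in the experiments.

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe text file into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove("glove.6B.300d.txt")  # assumed file name and location

def embed(sentence, dim=300, normalize=True):
    """Turn a sentence into an (l, 300) matrix of word vectors."""
    vecs = [glove.get(w, np.zeros(dim, dtype=np.float32)) for w in sentence.lower().split()]
    m = np.stack(vecs)
    if normalize:
        norms = np.linalg.norm(m, axis=1, keepdims=True)
        m = m / np.maximum(norms, 1e-8)  # avoid dividing by zero for unknown words
    return m

a_bar = embed("how do i learn keras")                 # \bar{a}, shape (l_a, 300)
b_bar = embed("what is the best way to study keras")  # \bar{b}, shape (l_b, 300)
```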
❏ Attend
The attention weights are calculated as follows; they softly align the two documents so that strongly related words end up with high weight.
e_{ij} := F'(\bar{a}_i, \bar{b}_j) := F(\bar{a}_i)^{\mathrm{T}}F(\bar{b}_j)
This attention weight $e_{ij}$ becomes large for pairs of words that are strongly related across the documents $a$ and $b$.
In the implementation, $F$ is just a fully connected feed-forward neural network.
If $F'$ were evaluated directly on every word pair, $l_a \times l_b$ network applications would be needed, which is expensive. The decomposition above reduces this to $l_a + l_b$ applications of $F$, one per word.
From the calculated attention weights $e_{ij}$, the soft alignments $\beta_i$ and $\alpha_j$ are obtained: $\beta_i$ is the subphrase of $\bar{b}$ softly aligned to $\bar{a}_i$, and $\alpha_j$ is the subphrase of $\bar{a}$ softly aligned to $\bar{b}_j$.
\beta_i := \sum_{j=1}^{l_b} \frac{ \exp(e_{ij}) }{ \sum_{k=1}^{l_b} \exp(e_{ik})} \bar{b}_j \quad \forall i \in \left\{1,...,l_a \right\} \\
\alpha_j := \sum_{i=1}^{l_a} \frac{ \exp(e_{ij}) }{ \sum_{k=1}^{l_a} \exp(e_{kj})} \bar{a}_i \quad \forall j \in \left\{1,...,l_b \right\}
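To see the shapes involved, here is a small numpy sketch that continues from the embedding example above. The one-layer ReLU network standing in for $F$ uses random toy weights, so this only illustrates the dot-product attention and the softmax normalization, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = a_bar.shape[1], 200
W_f = rng.normal(scale=0.1, size=(d, h))   # toy weights standing in for F

def F(x):
    return np.maximum(x @ W_f, 0.0)        # single ReLU layer as a stand-in

e = F(a_bar) @ F(b_bar).T                  # e_ij, shape (l_a, l_b)

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    ex = np.exp(x)
    return ex / ex.sum(axis=axis, keepdims=True)

beta  = softmax(e, axis=1) @ b_bar         # beta_i:  subphrase of b aligned to a_i, (l_a, d)
alpha = softmax(e, axis=0).T @ a_bar       # alpha_j: subphrase of a aligned to b_j, (l_b, d)
```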
❏ Compare
v_{1,i} := G([\bar{a}_i, \beta_i]) \quad \forall i \in \left\{1,...,l_a \right\} \\
v_{2,j} := G([\bar{b}_j, \alpha_j]) \quad \forall j \in \left\{1,...,l_b \right\}
Here, $G$ is again a feed-forward neural network in the implementation.
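Continuing the numpy sketch, with another toy one-layer network standing in for $G$, the Compare step concatenates each word with its aligned subphrase:

```python
W_g = rng.normal(scale=0.1, size=(2 * d, h))    # toy weights standing in for G

def G(x):
    return np.maximum(x @ W_g, 0.0)

v1 = G(np.concatenate([a_bar, beta],  axis=1))  # v_{1,i}, shape (l_a, h)
v2 = G(np.concatenate([b_bar, alpha], axis=1))  # v_{2,j}, shape (l_b, h)
```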
❏ Aggregate
v_1 = \sum_{i=1}^{l_a} v_{1,i} \\
v_2 = \sum_{j=1}^{l_b} v_{2,j} \\
\hat{y} = H([v_1, v_2])
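In the paper, $H$ is a feed-forward network followed by a linear layer that produces the class scores. The continuing sketch below reduces it to a single toy linear layer over the concatenated sums; the 3 classes are an assumption (the usual NLI labels).

```python
num_classes = 3
W_h = rng.normal(scale=0.1, size=(2 * h, num_classes))  # toy stand-in for H

v1_sum = v1.sum(axis=0)                    # sum over the words of sentence a
v2_sum = v2.sum(axis=0)                    # sum over the words of sentence b
logits = np.concatenate([v1_sum, v2_sum]) @ W_h
y_hat = softmax(logits, axis=0)            # \hat{y}: predicted class probabilities
```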
As an optional extension (intra-sentence attention), instead of simply converting each word in the word-embedding step, a neural network is used so that the meaning of the sentence as a whole can be taken in.
f_{ij} := F_{\rm{intra}}(a_i)^{\mathrm{T}}F_{\rm{intra}}(a_j)
Here, $F_{\rm intra}$ is a feed-forward network, and $\bar{a}_i$ and $\bar{b}_i$ are redefined as follows:
\bar{a_i} := [a_i, a'_i]\\
\bar{b_i} := [b_i, b'_i]
Where $ a'_i $ is
a'_i := \sum_{j=1}^{l_a} \frac{\exp(f_{ij} + d_{i-j})}{\sum_{k=1}^{l_a} \exp(f_{ik} + d_{i-k})} a_j
Here, $d_{i-j} \in \mathbb{R}$ is a bias term that depends on the distance $i-j$ between the two words; $b'_i$ is defined in the same way.
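The following continues the numpy sketch for this intra-sentence attention option. The paper bins all distances larger than 10 together; the random bias values, the window handling, and the toy weights for $F_{\rm intra}$ are illustrative assumptions.

```python
l_a = a_bar.shape[0]
W_i = rng.normal(scale=0.1, size=(d, h))   # toy weights standing in for F_intra

def F_intra(x):
    return np.maximum(x @ W_i, 0.0)

f = F_intra(a_bar) @ F_intra(a_bar).T      # f_ij, shape (l_a, l_a)

max_dist = 10                              # distances beyond this share one bias term
dist_bias = rng.normal(scale=0.1, size=(2 * max_dist + 1,))  # d_{i-j}, learned in practice
idx = np.clip(np.arange(l_a)[:, None] - np.arange(l_a)[None, :], -max_dist, max_dist)
bias = dist_bias[idx + max_dist]           # bias matrix, shape (l_a, l_a)

a_prime = softmax(f + bias, axis=1) @ a_bar        # a'_i, shape (l_a, d)
a_aug = np.concatenate([a_bar, a_prime], axis=1)   # \bar{a}_i := [a_i, a'_i]
# With this option, a_aug (and the analogous b_aug) would replace a_bar and b_bar
# as the inputs to the Attend step.
```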
The implementation was created with Keras [^ 9], referring to the code in [^ 2], [^ 6], and [^ 10].
https://gist.github.com/namakemono/f4f273dbc63fc2174940415a9f689a6f
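For orientation, below is a minimal sketch of the same network structure written directly in tf.keras. The layer sizes, the single Dense layer standing in for $H$, and the lack of masking for padded positions are simplifications and assumptions; see the gist above for the actual implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_decomposable_attention(max_len=30, embed_dim=300, hidden=200, num_classes=3):
    # Inputs are assumed to be already-embedded sentences (e.g. GloVe vectors).
    a = layers.Input(shape=(max_len, embed_dim), name="sentence_a")
    b = layers.Input(shape=(max_len, embed_dim), name="sentence_b")

    # Attend: shared feed-forward net F applied to every word, then pairwise dot products.
    F = tf.keras.Sequential([layers.Dense(hidden, activation="relu"),
                             layers.Dense(hidden, activation="relu")], name="F")
    e = layers.Dot(axes=-1)([F(a), F(b)])                  # e_ij, shape (batch, l_a, l_b)

    beta = layers.Dot(axes=(2, 1))(
        [layers.Softmax(axis=-1)(e), b])                   # subphrase of b aligned to each a_i
    alpha = layers.Dot(axes=(2, 1))(
        [layers.Softmax(axis=-1)(layers.Permute((2, 1))(e)), a])  # subphrase of a aligned to each b_j

    # Compare: feed-forward net G over [word, aligned subphrase].
    G = tf.keras.Sequential([layers.Dense(hidden, activation="relu"),
                             layers.Dense(hidden, activation="relu")], name="G")
    v1 = G(layers.Concatenate()([a, beta]))
    v2 = G(layers.Concatenate()([b, alpha]))

    # Aggregate: sum over words, concatenate, classify (single Dense as a stand-in for H).
    v1 = layers.Lambda(lambda x: tf.reduce_sum(x, axis=1))(v1)
    v2 = layers.Lambda(lambda x: tf.reduce_sum(x, axis=1))(v2)
    y = layers.Dense(num_classes, activation="softmax", name="H")(
        layers.Concatenate()([v1, v2]))
    return Model(inputs=[a, b], outputs=y)

model = build_decomposable_attention()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```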
References