An explanation and implementation of the algorithm from the paper A Decomposable Attention Model for Natural Language Inference [^ 1].
The idea is that the Decomposable Attention [^ 1] algorithm is useful whenever you want to examine the relationship between two sentences, for example a premise and a hypothesis, a question and its answer, or whether something has already been mentioned, and that it performs well. This post introduces the algorithm and shows an implementation in Keras [^ 4].
- A network for Natural Language Inference (NLI)
- Used for classification problems that take two documents as input
- Improves performance by incorporating Attention [^ 1] into the network
- Also used by the winner [^ 5] of Kaggle's Quora Question Pairs [^ 3] (a competition to judge whether two questions are the same)
Figure: From A Decomposable Attention Model for Natural Language Inference [^ 1]
In order to use this algorithm, it is necessary to convert each word into a distributed representation with GloVe [^ 7] or Word2Vec [^ 8].
- Attend: softly align the two documents so that words that are related across them receive higher weight
- Compare: compare the two aligned sentences and convert them into two sets of feature vectors
- Aggregate: combine the feature vectors and calculate a likelihood for each class
Let the pair of input sentences be
\bar{a} = (a_1, ..., a_{l_a})^\mathrm{T} \\
\bar{b} = (b_1, ..., b_{l_b})^\mathrm{T} \\
Here, $l_a$ and $l_b$ represent the lengths of the two documents.
Also, $a_i, b_j \in \mathbb{R}^d$ are distributed representations (word embeddings) of some kind.
- In the experiments, GloVe [^ 7] is used to convert each word into a 300-dimensional vector
- The embedding vectors $\bar{a}, \bar{b}$ can be normalized to an $l_2$ norm of $1$
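To make this input step concrete, here is a minimal sketch in numpy. The GloVe file name `glove.6B.300d.txt` and the toy sentences are assumptions for illustration, not the setup used in the experiments.

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe text file into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove("glove.6B.300d.txt")  # assumed file name and location

def embed(sentence, dim=300, normalize=True):
    """Turn a sentence into an (l, 300) matrix of word vectors."""
    vecs = [glove.get(w, np.zeros(dim, dtype=np.float32)) for w in sentence.lower().split()]
    m = np.stack(vecs)
    if normalize:
        norms = np.linalg.norm(m, axis=1, keepdims=True)
        m = m / np.maximum(norms, 1e-8)  # avoid dividing by zero for unknown words
    return m

a_bar = embed("how do i learn keras")                 # \bar{a}, shape (l_a, 300)
b_bar = embed("what is the best way to study keras")  # \bar{b}, shape (l_b, 300)
```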
❏ Attend
The attention weights are calculated as follows; they softly align the two documents so that strongly related words end up with high weight.
e_{ij} := F'(\bar{a}_i, \bar{b}_j) := F(\bar{a}_i)^{\mathrm{T}}F(\bar{b}_j)
This attention weight $e_{ij}$ becomes large for pairs of words that are strongly related across the documents $a$ and $b$.
In the implementation, $F$ is just a fully connected feed-forward neural network.
If $F'$ were evaluated directly on every word pair, $l_a \times l_b$ network applications would be needed, which is expensive. The decomposition above reduces this to $l_a + l_b$ applications of $F$, one per word.
From the calculated attention weights $e_{ij}$, the soft alignments $\beta_i$ and $\alpha_j$ are obtained: $\beta_i$ is the subphrase of $\bar{b}$ softly aligned to $\bar{a}_i$, and $\alpha_j$ is the subphrase of $\bar{a}$ softly aligned to $\bar{b}_j$.
\beta_i := \sum_{j=1}^{l_b} \frac{ \exp(e_{ij}) }{ \sum_{k=1}^{l_b} \exp(e_{ik})} \bar{b}_j \quad \forall i \in \left\{1,...,l_a \right\} \\
\alpha_j := \sum_{i=1}^{l_a} \frac{ \exp(e_{ij}) }{ \sum_{k=1}^{l_a} \exp(e_{kj})} \bar{a}_i \quad \forall j \in \left\{1,...,l_b \right\}
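To see the shapes involved, here is a small numpy sketch that continues from the embedding example above. The one-layer ReLU network standing in for $F$ uses random toy weights, so this only illustrates the dot-product attention and the softmax normalization, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = a_bar.shape[1], 200
W_f = rng.normal(scale=0.1, size=(d, h))   # toy weights standing in for F

def F(x):
    return np.maximum(x @ W_f, 0.0)        # single ReLU layer as a stand-in

e = F(a_bar) @ F(b_bar).T                  # e_ij, shape (l_a, l_b)

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    ex = np.exp(x)
    return ex / ex.sum(axis=axis, keepdims=True)

beta  = softmax(e, axis=1) @ b_bar         # beta_i:  subphrase of b aligned to a_i, (l_a, d)
alpha = softmax(e, axis=0).T @ a_bar       # alpha_j: subphrase of a aligned to b_j, (l_b, d)
```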
❏ Compare
v_{1,i} := G([\bar{a}_i, \beta_i]) \quad \forall i \in \left\{1,...,l_a \right\} \\
v_{2,j} := G([\bar{b}_j, \alpha_j]) \quad \forall j \in \left\{1,...,l_b \right\}
Here, $G$ is again a feed-forward neural network in the implementation.
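Continuing the numpy sketch, with another toy one-layer network standing in for $G$, the Compare step concatenates each word with its aligned subphrase:

```python
W_g = rng.normal(scale=0.1, size=(2 * d, h))    # toy weights standing in for G

def G(x):
    return np.maximum(x @ W_g, 0.0)

v1 = G(np.concatenate([a_bar, beta],  axis=1))  # v_{1,i}, shape (l_a, h)
v2 = G(np.concatenate([b_bar, alpha], axis=1))  # v_{2,j}, shape (l_b, h)
```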
❏ Aggregate
v_1 = \sum_{i=1}^{l_a} v_{1,i} \\
v_2 = \sum_{j=1}^{l_b} v_{2,j} \\
\hat{y} = H([v_1, v_2])
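In the paper, $H$ is a feed-forward network followed by a linear layer that produces the class scores. The continuing sketch below reduces it to a single toy linear layer over the concatenated sums; the 3 classes are an assumption (the usual NLI labels).

```python
num_classes = 3
W_h = rng.normal(scale=0.1, size=(2 * h, num_classes))  # toy stand-in for H

v1_sum = v1.sum(axis=0)                    # sum over the words of sentence a
v2_sum = v2.sum(axis=0)                    # sum over the words of sentence b
logits = np.concatenate([v1_sum, v2_sum]) @ W_h
y_hat = softmax(logits, axis=0)            # \hat{y}: predicted class probabilities
```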
As an optional extension (intra-sentence attention), instead of simply converting each word in the word-embedding step, a neural network is used so that the meaning of the sentence as a whole can be taken in.
f_{ij} := F_{\rm{intra}}(a_i)^{\mathrm{T}}F_{\rm{intra}}(a_j)
Here, $F_{\rm intra}$ is a feed-forward network, and $\bar{a}_i$ and $\bar{b}_i$ are redefined as follows:
\bar{a_i} := [a_i, a'_i]\\
\bar{b_i} := [b_i, b'_i]
Where $ a'_i $ is
a'_i := \sum_{j=1}^{l_a} \frac{\exp(f_{ij} + d_{i-j})}{\sum_{k=1}^{l_a} \exp(f_{ik} + d_{i-k})} a_j
Here, $d_{i-j} \in \mathbb{R}$ is a bias term that depends on the distance $i-j$ between the two words; $b'_i$ is defined in the same way.
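The following continues the numpy sketch for this intra-sentence attention option. The paper bins all distances larger than 10 together; the random bias values, the window handling, and the toy weights for $F_{\rm intra}$ are illustrative assumptions.

```python
l_a = a_bar.shape[0]
W_i = rng.normal(scale=0.1, size=(d, h))   # toy weights standing in for F_intra

def F_intra(x):
    return np.maximum(x @ W_i, 0.0)

f = F_intra(a_bar) @ F_intra(a_bar).T      # f_ij, shape (l_a, l_a)

max_dist = 10                              # distances beyond this share one bias term
dist_bias = rng.normal(scale=0.1, size=(2 * max_dist + 1,))  # d_{i-j}, learned in practice
idx = np.clip(np.arange(l_a)[:, None] - np.arange(l_a)[None, :], -max_dist, max_dist)
bias = dist_bias[idx + max_dist]           # bias matrix, shape (l_a, l_a)

a_prime = softmax(f + bias, axis=1) @ a_bar        # a'_i, shape (l_a, d)
a_aug = np.concatenate([a_bar, a_prime], axis=1)   # \bar{a}_i := [a_i, a'_i]
# With this option, a_aug (and the analogous b_aug) would replace a_bar and b_bar
# as the inputs to the Attend step.
```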
The implementation was created with Keras [^ 9], referring to the code in [^ 2], [^ 6], and [^ 10].
https://gist.github.com/namakemono/f4f273dbc63fc2174940415a9f689a6f
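For orientation, below is a minimal sketch of the same network structure written directly in tf.keras. The layer sizes, the single Dense layer standing in for $H$, and the lack of masking for padded positions are simplifications and assumptions; see the gist above for the actual implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_decomposable_attention(max_len=30, embed_dim=300, hidden=200, num_classes=3):
    # Inputs are assumed to be already-embedded sentences (e.g. GloVe vectors).
    a = layers.Input(shape=(max_len, embed_dim), name="sentence_a")
    b = layers.Input(shape=(max_len, embed_dim), name="sentence_b")

    # Attend: shared feed-forward net F applied to every word, then pairwise dot products.
    F = tf.keras.Sequential([layers.Dense(hidden, activation="relu"),
                             layers.Dense(hidden, activation="relu")], name="F")
    e = layers.Dot(axes=-1)([F(a), F(b)])                  # e_ij, shape (batch, l_a, l_b)

    beta = layers.Dot(axes=(2, 1))(
        [layers.Softmax(axis=-1)(e), b])                   # subphrase of b aligned to each a_i
    alpha = layers.Dot(axes=(2, 1))(
        [layers.Softmax(axis=-1)(layers.Permute((2, 1))(e)), a])  # subphrase of a aligned to each b_j

    # Compare: feed-forward net G over [word, aligned subphrase].
    G = tf.keras.Sequential([layers.Dense(hidden, activation="relu"),
                             layers.Dense(hidden, activation="relu")], name="G")
    v1 = G(layers.Concatenate()([a, beta]))
    v2 = G(layers.Concatenate()([b, alpha]))

    # Aggregate: sum over words, concatenate, classify (single Dense as a stand-in for H).
    v1 = layers.Lambda(lambda x: tf.reduce_sum(x, axis=1))(v1)
    v2 = layers.Lambda(lambda x: tf.reduce_sum(x, axis=1))(v2)
    y = layers.Dense(num_classes, activation="softmax", name="H")(
        layers.Concatenate()([v1, v2]))
    return Model(inputs=[a, b], outputs=y)

model = build_decomposable_attention()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```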
References