An explanation and implementation of the algorithm from the following paper.
Title: Enhanced LSTM for Natural Language Inference
Author: Chen et al.
Year: 2017
URL: https://arxiv.org/abs/1609.06038
What is this about? Decomposable Attention is an algorithm that finds the relationship between two documents (e.g. a premise and a hypothesis: whether they have the same meaning, etc.). ESIM, an improved version of it, appears to perform well, so I implemented it with Keras.
The article below is about Decomposable Attention, the predecessor of this algorithm.
Explanation and implementation of Decomposable Attention algorithm
- ESIM stands for Enhanced Sequential Inference Model, an improved version of Decomposable Attention [^2].
- It is a classification algorithm that takes two documents as input.
- It was also used by the winner of Kaggle's Quora competition [^5]. (It was the best-performing single model.)
Whole algorithm (figure quoted from the original paper [^1])
- Uses Tree-LSTM [^12] (BiLSTM [^13] is also acceptable).
- Uses the difference and the element-wise product when comparing.
- Uses average pooling in addition to max pooling during aggregation.
The flow itself is the same as Decomposable Attention.
- Attend
  - Align words with similar meanings in similar positions so that the two documents are easy to compare.
- Compare
  - Compare the aligned vectors and convert them into two feature vectors.
- Aggregate
  - Combine the two feature vectors and compute the likelihood of each class.
The inputs are a document $a$ and a document $b$, consisting of $l_a$ and $l_b$ words respectively; each word is converted into a $d$-dimensional vector using a distributed representation (e.g. GloVe [^7] or Word2Vec [^8]).
a = (a_1, ..., a_{l_a})^\mathrm{T} \\
b = (b_1, ..., b_{l_b})^\mathrm{T} \\
The words are converted to distributed representations (the experiments use 300-dimensional GloVe [^7]) and then encoded with a BiLSTM:
\bar{a}_i = \mathrm{BiLSTM}(a, i),\quad \forall i \in \left\{1,...,l_a\right\} \\
\bar{b}_j = \mathrm{BiLSTM}(b, j),\quad \forall j \in \left\{1,...,l_b\right\} \\
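Below is a minimal tf.keras sketch of this input-encoding step. The vocabulary size, sequence length, and hidden size are placeholder values chosen for illustration; only the 300-dimensional embedding follows the paper, and in practice the embedding weights would be initialized from pretrained GloVe vectors.

```python
# A minimal sketch of the input-encoding step (placeholder sizes).
from tensorflow.keras import Input, layers

max_len = 30        # padded sentence length l_a = l_b (assumption)
vocab_size = 20000  # placeholder vocabulary size
embed_dim = 300     # GloVe dimensionality used in the paper's experiments
hidden_dim = 300    # BiLSTM units (assumption)

# The two input documents a and b as sequences of word indices
a_in = Input(shape=(max_len,), name="a")
b_in = Input(shape=(max_len,), name="b")

# Shared embedding layer: word indices -> distributed representations
embed = layers.Embedding(vocab_size, embed_dim)

# Shared BiLSTM encoder producing \bar{a}_i and \bar{b}_j
encode = layers.Bidirectional(layers.LSTM(hidden_dim, return_sequences=True))
a_bar = encode(embed(a_in))  # shape: (batch, l_a, 2 * hidden_dim)
b_bar = encode(embed(b_in))  # shape: (batch, l_b, 2 * hidden_dim)
```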
❏Attend
This is the same as Decomposable Attention.
\begin{align}
e_{ij} &= \bar{a}_i^\mathrm{T}\bar{b}_j \\
\tilde{a}_i &= \sum_{j=1}^{l_b} \frac{\exp(e_{ij})}{\sum_{k=1}^{l_b}\exp(e_{ik})} \bar{b}_j, \quad \forall i \in \left\{1,...,l_a\right\} \\
\tilde{b}_j &= \sum_{i=1}^{l_a} \frac{\exp(e_{ij})}{\sum_{k=1}^{l_a}\exp(e_{kj})} \bar{a}_i, \quad \forall j \in \left\{1,...,l_b\right\} \\
\end{align}
What is this doing? Suppose we have the two sentences:

- A) I baked an apple and ate it
- B) I ate a mandarin orange yesterday
| I | an apple | baked | ate |
|---|---|---|---|
| I | a mandarin orange | - | ate |
Comparison becomes easier when words with similar meanings are lined up so that they form pairs, as above. Attention weights the words of the two sentences that are likely to be related to each other and rearranges the documents accordingly. (In practice this is all done as calculations between vectors.)
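Continuing the sketch above (reusing `a_bar` and `b_bar` from the encoder), the soft alignment could be written with a few `Lambda` layers roughly as follows:

```python
# Soft alignment (Attend), continuing from the encoder sketch above.
# e has shape (batch, l_a, l_b); softmax over the appropriate axis gives
# the attention weights used to mix the other sentence's vectors.
import tensorflow as tf
from tensorflow.keras import layers

# e_{ij} = \bar{a}_i^T \bar{b}_j
e = layers.Lambda(lambda x: tf.matmul(x[0], x[1], transpose_b=True))([a_bar, b_bar])

# \tilde{a}_i: weights normalized over the words of b, then a weighted sum of \bar{b}_j
a_tilde = layers.Lambda(
    lambda x: tf.matmul(tf.nn.softmax(x[0], axis=2), x[1])
)([e, b_bar])

# \tilde{b}_j: weights normalized over the words of a, then a weighted sum of \bar{a}_i
b_tilde = layers.Lambda(
    lambda x: tf.matmul(tf.nn.softmax(x[0], axis=1), x[1], transpose_a=True)
)([e, a_bar])
```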
❏Compare
\begin{align}
m_a &= (\bar{a}, \tilde{a}, \bar{a}-\tilde{a}, \bar{a} \odot \tilde{a})^\mathrm{T} \\
m_b &= (\bar{b}, \tilde{b}, \bar{b}-\tilde{b}, \bar{b} \odot \tilde{b})^\mathrm{T} \\
\end{align}
In Decomposable Attention, only $\bar{a}, \tilde{a}$ and $\bar{b}, \tilde{b}$ were used, but ESIM also includes their difference and element-wise product, which apparently improves performance.
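Continuing the same sketch, $m_a$ and $m_b$ can be assembled by concatenating the encoded vectors, the aligned vectors, their difference, and their element-wise product:

```python
# Compare step, continuing the same sketch: build
# m_a = (a_bar, a_tilde, a_bar - a_tilde, a_bar * a_tilde), and likewise m_b.
from tensorflow.keras import layers

def enhance(x, x_aligned):
    """Concatenate the vectors, their difference, and their element-wise product."""
    return layers.Concatenate()([
        x,
        x_aligned,
        layers.Subtract()([x, x_aligned]),
        layers.Multiply()([x, x_aligned]),
    ])

m_a = enhance(a_bar, a_tilde)  # shape: (batch, l_a, 4 * 2 * hidden_dim)
m_b = enhance(b_bar, b_tilde)  # shape: (batch, l_b, 4 * 2 * hidden_dim)
```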
❏Aggregate
v_{a,i} = \mathrm{BiLSTM}(F(m_a), i), \quad \forall i \in \left\{1,...,l_a\right\} \\
v_{b,j} = \mathrm{BiLSTM}(F(m_b), j), \quad \forall j \in \left\{1,...,l_b\right\}
Here the original paper proposes both Tree-LSTM [^12] and BiLSTM [^13]; this article uses BiLSTM for ease of implementation. The function $F$ is a feedforward neural network (a single ReLU layer in the paper) that reduces the dimensionality of $m_a$ and $m_b$.
v_{a,\mathrm{ave}} = \sum_{i=1}^{l_a}\frac{v_{a,i}}{l_a}, \quad
v_{a,\max} = \max_{i \in \left\{1,...,l_a\right\}} v_{a,i}, \\
v_{b,\mathrm{ave}} = \sum_{j=1}^{l_b}\frac{v_{b,j}}{l_b}, \quad
v_{b,\max} = \max_{j \in \left\{1,...,l_b\right\}} v_{b,j}, \\
v = (v_{a,\mathrm{ave}}, v_{a,\max}, v_{b,\mathrm{ave}}, v_{b,\max})^\mathrm{T}
Decomposable Attention aggregated by simple summation, which is sensitive to sentence length; ESIM therefore uses both $\mathrm{avg}$ and $\max$ pooling.
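Putting the Aggregate step together with the previous sketches: the projection size, composition size, and the 3 output classes (entailment/neutral/contradiction, as in SNLI) below are assumptions for illustration, not values taken verbatim from the paper.

```python
# Aggregate step, continuing the same sketch (placeholder sizes and class count).
from tensorflow.keras import Model, layers

proj_dim = 300  # output size of F (assumption)
comp_dim = 300  # composition BiLSTM units (assumption)

# F: shared feedforward projection with ReLU
F = layers.Dense(proj_dim, activation="relu")

# Shared composition BiLSTM producing v_{a,i} and v_{b,j}
compose = layers.Bidirectional(layers.LSTM(comp_dim, return_sequences=True))
v_a = compose(F(m_a))
v_b = compose(F(m_b))

# Average and max pooling over time, then concatenate into v
v = layers.Concatenate()([
    layers.GlobalAveragePooling1D()(v_a), layers.GlobalMaxPooling1D()(v_a),
    layers.GlobalAveragePooling1D()(v_b), layers.GlobalMaxPooling1D()(v_b),
])

# Final MLP classifier (tanh hidden layer + softmax output)
hidden = layers.Dense(proj_dim, activation="tanh")(v)
out = layers.Dense(3, activation="softmax")(hidden)

model = Model(inputs=[a_in, b_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

Training then reduces to calling `model.fit` on padded index arrays for the two documents and one-hot class labels.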
The implementation is based on the code in [^3].
https://gist.github.com/namakemono/b74547e82ef9307da9c29057c650cdf1
References
[^1]: Chen et al., Enhanced LSTM for Natural Language Inference, 2017. (Original ESIM paper)
[^2]: Parikh et al., A Decomposable Attention Model for Natural Language Inference, 2016. (Original Decomposable Attention paper)
[^3]: Dang, Quora Question Pairs - DL models, 2017. (Includes ESIM and Decomposable Attention code)
[^4]: Kaggle, Quora Question Pairs, 2017. (Competition to determine whether two questions are the same)
[^5]: Maximilien @ DAMI, Quora Question Pairs - 1st place solution, 2017. (Kaggle Quora winner's article)
[^6]: explosion, spaCy, 2017. (Natural language processing library)
[^7]: Pennington et al., GloVe: Global Vectors for Word Representation, 2014. (Original GloVe paper)
[^8]: Mikolov et al., Efficient Estimation of Word Representations in Vector Space, 2013. (Original Word2Vec paper)
[^9]: Chollet, Keras, 2016. (Keras: deep learning library)
[^10]: lystdo, LSTM with word2vec embeddings, 2017. (Keras code for LSTM + Word2Vec)
[^11]: namakemono, Implementation of ESIM, 2017. (ESIM implementation)
[^12]: Tai et al., Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks, 2015. (Tree-structured LSTM)
[^13]: Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate, 2014. (Machine translation paper; referenced for the BiLSTM encoder)