[PYTHON] Unbearable shortness of Attention in natural language processing

"You only hear the last part of my words."

As it turns out, this paper shows that this is true not only of humans but also of neural networks.

Frustratingly Short Attention Spans in Neural Language Modeling

The excuse is, "Because it's enough to predict your next word," but it seems that the point is the same in relationships and research.

In this article, I would like to look at whether the most recent context really is all that is needed, and if so, why, while introducing the paper above and other related work.

The referenced papers are collected in the following GitHub repository. It is updated daily, so if you are interested in research trends, please Star & Watch!

arXivTimes

What is Attention

Attention is a method for focusing on (attending to) the important points in the past when dealing with sequential data. The intuition is that when answering a question, you pay attention to specific keywords in the other person's question. As this example suggests, it is a widely used technique in natural language processing.

The figure below shows that when predicting the next hidden layer $h^*$ (red box), the past five hidden layers ($h_2$ to $h_6$) are referenced. The $a_1$ to $a_5$ written on the arrows from each past hidden layer are the "Attention": weights indicating which points in the past are important.

(From Figure 1 of the paper: Memory-augmented neural language modelling architectures.)
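To make this concrete, here is a minimal NumPy sketch of attention over a window of past hidden states. It uses plain dot-product scoring with no learned parameters, so it illustrates the mechanism rather than the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(h_t, past_hs):
    """Weight past hidden states by their relevance to the current state.

    h_t     : (d,)   current hidden state
    past_hs : (L, d) the L most recent hidden states (e.g. h_2 ... h_6)
    """
    scores = past_hs @ h_t          # one scalar score per past state
    weights = softmax(scores)       # the attention weights a_1 ... a_L
    context = weights @ past_hs     # weighted sum of the past states
    # A simple way to mix context and current state; the paper's actual
    # model uses learned projection matrices at this point.
    h_star = np.tanh(context + h_t)
    return h_star, weights

# Toy example: 100-dimensional states, a window of 5 past steps.
rng = np.random.default_rng(0)
h_t, past = rng.normal(size=100), rng.normal(size=(5, 100))
h_star, a = attend(h_t, past)
print(a.round(3))  # how strongly each of the 5 past positions is attended to
```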

The paper's proposal: share out the roles of the hidden layer

Now, with the advent of Attention, the number of roles the hidden layer has to play in an RNN has grown. In addition to its original role of "predicting the next word," it must also hold "information that will be useful for future predictions." Furthermore, since Attention itself is computed from the hidden layer, it must also carry information about "whether this point should be attended to later."

In other words, in an RNN with Attention, the hidden layer plays the following three roles.

  1. Storing information for predicting the next word
  2. Storing information about whether this point should be attended to later (for computing Attention)
  3. Storing information that will be useful for future predictions

Inside the neural network, this is what you might call a one-person operation. Wouldn't it be better to share the work a little? That is exactly what this paper proposes.

(Figure from the paper: the output vector split into p, k, and v.)

The orange part (p) plays role 1, the green part (k) plays role 2, and the blue part (v) plays role 3. These are simply concatenated vectors: if the original hidden state was 100-dimensional, the implementation uses 3 × 100 = 300 dimensions.
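As a rough sketch of this split (not the paper's exact equations; the real model adds learned projections and combines p with the attended value before the word softmax), the idea is simply to cut the output vector into three equal parts and give each part one of the three roles:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def key_value_predict_step(output_t, past_outputs, dim=100):
    """Split the RNN output into (k, v, p) and use each part for one role.

    output_t     : (3*dim,)   current RNN output (e.g. 300 dims if dim=100)
    past_outputs : (L, 3*dim) outputs of the previous L steps
    """
    k_t, v_t, p_t = np.split(output_t, 3)
    past_k = past_outputs[:, :dim]        # keys of the past steps
    past_v = past_outputs[:, dim:2 * dim] # values of the past steps

    weights = softmax(past_k @ k_t)       # role 2: where to attend
    context = weights @ past_v            # role 3: information for the future
    # role 1: p_t, together with the context, feeds the next-word prediction
    return np.concatenate([p_t, context]), weights

# Toy usage with random vectors standing in for RNN outputs.
rng = np.random.default_rng(0)
out_t, past = rng.normal(size=300), rng.normal(size=(5, 300))
combined, a = key_value_predict_step(out_t, past)
```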

Verifying this on a Wikipedia corpus and on the Children's Book Test, a children's book corpus, the result was that it generally outperformed existing models. However, one fact became clear during the verification.

Does Attention only look at the most recent location?

(Figure: heatmap of Attention weights.)

This figure shows the Attention weights at prediction time, randomly sampled from the Wikipedia corpus used in the experiment. The columns, starting from the right, run from -1 to -15: -1 means one step back, -2 two steps back, and so on, and the darker the color, the more important that position is.

Looking at this, you can see that -1, the most recent position, is by far the most important, and positions further back are hardly referenced at all.

(Figure: distribution of Attention weights by relative position, from the paper.)

This is a more detailed view, and you can see that the positions with high weights are concentrated around -1 to -5. In fact, the optimal Attention window size (how far back it looks) turned out to be 5.

Does that mean ...?

temp4.png

This is an RNN that works like an ordinary N-gram: if only the past five steps are attended to anyway, then the past five hidden layers can simply be used directly for prediction.

```math
h^*_t = \tanh \left(
W^N
\begin{bmatrix}
 h^1_t \\
 \vdots \\
 h^{N-1}_{t-N+1}
\end{bmatrix}
\right)
```

Excerpt from Equation 13 of the paper.
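In code, this amounts to nothing more than concatenating the last few hidden states and passing them through a single tanh layer. A minimal sketch, with a random matrix standing in for the learned weight $W^N$:

```python
import numpy as np

def ngram_rnn_output(recent_hs, W_N):
    """Concatenate the most recent hidden states and mix them with one layer.

    recent_hs : list of (d,) arrays, the last N-1 hidden states
    W_N       : (d, (N-1)*d) weight matrix (a random stand-in here)
    """
    stacked = np.concatenate(recent_hs)   # the column vector in Eq. 13
    return np.tanh(W_N @ stacked)         # h*_t

d, N = 100, 5
rng = np.random.default_rng(0)
W_N = rng.normal(scale=0.1, size=(d, (N - 1) * d))
recent = [rng.normal(size=d) for _ in range(N - 1)]
h_star = ngram_rnn_output(recent, W_N)
```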

The result: this simple model surpasses the elaborately designed RNNs in accuracy, and comes within a hair's breadth of the method proposed in the paper.

(Figure: the values are perplexity, and lower is better. Key-Value-Predict is the method proposed in this paper, and the 4-gram model is the one that simply reuses past hidden layers.)
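For reference, perplexity is just the exponential of the average negative log-likelihood that the model assigns to the correct next words. A minimal computation looks like this:

```python
import numpy as np

def perplexity(probs_of_correct_words):
    """probs_of_correct_words: model probability assigned to each true next word."""
    nll = -np.log(np.asarray(probs_of_correct_words))
    return float(np.exp(nll.mean()))

print(perplexity([0.1, 0.02, 0.3]))  # lower is better; 1.0 would mean a perfect model
```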

What is this!

(Image from Shadow Hearts 2.)

And so the curtain falls.

Two issues that create the unbearable shortness of Attention

To begin with, there are two possible explanations for this result.

The first is a problem with the task setting: it may simply have been a task that did not require long-range dependencies in the first place, which produced this result. The same point was previously raised by Stanford about a dataset created by DeepMind.

From "Toward acquiring the ability to read and understand text: research trends in Machine Comprehension".


DeepMind succeeded in mechanically building a training dataset from CNN news articles, but when this data was examined more closely...


...it turned out that a simple model could overwhelm the neural network. On closer inspection, there were few questions that actually required long-range dependencies or an understanding of context, so even a simple model could achieve sufficient accuracy.


In other words, in this case as well, the task may have been one that even a simple model could answer well enough, which would explain both why a simple model could reach high accuracy and why Attention stayed within such a short range. In response to this point, datasets that require a deeper level of understanding have recently been developed: Stanford's SQuAD, Salesforce's WikiText, and many others were released in the last year alone (is there anything like this in Japanese...?).

The other possibility is that the models simply cannot capture long-range dependencies well. Part of this may be due to the lack of data that requires such dependencies, as described above, but there also seems to be room for improvement in the network architecture itself.

Recently, the trend has been to equip models with external memory.

Attempts are also being made to change the architecture itself so that longer-range dependencies can be captured.

This is a study on audio. In the case of audio, the data density is very high (ordinary music has close to 40,000 samples per second), so the need to capture long-range dependencies is even greater. In that sense, architectures suited to capturing long-range dependencies may well appear first in the audio domain. (The paper opens with a remark to the effect of "WaveNet is fine, but I don't think a CNN can really capture long-range dependencies," which feels rather spirited.)

The proposed network stacks RNNs in a pyramid shape, with the upper layers responsible for longer-range dependencies. The idea is that the layers divide up the work according to the length of the dependency they handle, as sketched below.

(Figure: the pyramid-shaped RNN architecture, from the paper.)
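As a very rough sketch of the pyramid idea (my own toy illustration, not the paper's actual architecture): a lower RNN runs at every timestep, and its hidden states are subsampled every few steps before being fed to an upper RNN, so a single upper-level step spans many lower-level steps and can cover longer dependencies.

```python
import numpy as np

def simple_rnn(xs, d, seed=0):
    """A plain RNN over the sequence xs; returns the hidden state at every step."""
    rng = np.random.default_rng(seed)
    W_x = rng.normal(scale=0.1, size=(d, xs.shape[1]))
    W_h = rng.normal(scale=0.1, size=(d, d))
    h, hs = np.zeros(d), []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
        hs.append(h)
    return np.stack(hs)

# Pyramid sketch: the upper RNN only sees every 4th state of the lower one,
# so one of its steps spans 4 lower-level steps (longer-range dependencies).
x = np.random.default_rng(1).normal(size=(64, 8))   # 64 timesteps, 8 features
lower = simple_rnn(x, d=32, seed=0)
upper = simple_rnn(lower[::4], d=32, seed=1)
print(lower.shape, upper.shape)   # (64, 32) (16, 32)
```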

By the way, speech synthesis using this model has also been proposed.

Attempts have also been made to search for cell architectures that could replace the LSTM commonly used in RNNs, but research shows that the LSTM and its simplified variant, the GRU, are already quite good, and that it is not easy to do better than them.

For that reason, my impression is that it is more promising to rethink the overall network configuration, including moving memory outside the network.

In this way, research is still ongoing from a variety of angles. I will keep updating as the story develops beyond this ending.
