RNN implementation in Python

Introduction

I implemented an RNN in Python, using ["Deep Learning"](http://www.amazon.co.jp/dp/4061529021) (Okatani Takayuki, Machine Learning Professional Series) as a textbook.


RNN

An RNN is a recurrent neural network for handling sequential data. Examples of sequential data include audio, language, and video. Such data is characterized by sequences whose length differs from sample to sample and whose element order is meaningful. RNNs make it possible to handle these characteristics of sequential data well.

Forward propagation calculation

First, let's define the notation. The units of the input layer, intermediate layer, and output layer are indexed by $i$, $j$, and $k$, respectively. The input at time $t$ is $\boldsymbol x^t$, the input and output of the intermediate layer are $u_j^t$ and $z_j^t$, the input and output of the output layer are $v_k^t$ and $y_k^t$, and the teacher signal is $d_k^t$.

Furthermore, the weights between the input and intermediate layers are $w_{ji}^{(in)}$, the weights on the feedback path between intermediate layers are $w_{jj'}$, and the weights between the intermediate and output layers are $w_{kj}^{(out)}$. The corresponding weight matrices are $\boldsymbol W^{(in)}$, $\boldsymbol W$, and $\boldsymbol W^{(out)}$.

The figure below shows the structure of the network including the above definitions.

rnn.png

Now, I will explain the forward propagation calculation. The input $u_j^t$ and output $z_j^t$ of unit $j$ in the intermediate layer can be expressed as follows, using the input $\boldsymbol x^t$ and the intermediate-layer output at the previous time step, $\boldsymbol z^{t-1}$. Here $f$ is the activation function.

\begin{align}
& u_j^t = \sum_i w_{ji}^{(in)}x_i^t + \sum_{j'} w_{jj'}z_{j'}^{t-1} \tag{1} \\
& z_j^t = f(u_j^t) \tag{2}
\end{align}

However, when $t = 1$ there is no previous time step, so $\boldsymbol z^0 = \boldsymbol 0$. Combining equations (1) and (2) in matrix form, the output $\boldsymbol z^t$ of the intermediate layer is as follows.

\boldsymbol z^t = \boldsymbol f(\boldsymbol W^{(in)}\boldsymbol x^t + \boldsymbol W \boldsymbol z^{t-1}) \tag{3}

The input $v_k^t$ and output $y_k^t$ of unit $k$ in the output layer can be expressed as follows, using the output $\boldsymbol z^t$ of the intermediate layer.

\begin{align}
& v_k^t = \sum_j w_{kj}^{(out)}z_j^t \tag{4} \\
& y_k^t = f^{(out)}(v_k^t) \tag{5}
\end{align}

Combining equations (4) and (5) in matrix form, the output $\boldsymbol y^t$ is as follows.

\boldsymbol y^t = \boldsymbol f^{(out)}(\boldsymbol W^{(out)}\boldsymbol z^t) \tag{6}
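
Equations (3) and (6) translate directly into code. Below is a minimal NumPy sketch of the forward pass under some assumptions: tanh as the hidden activation $f$, the identity as the output activation $f^{(out)}$, and variable names of my own choosing rather than the article's actual code.

```python
import numpy as np

def forward(xs, W_in, W, W_out, f=np.tanh, f_out=lambda v: v):
    """Forward propagation over one sequence, following eqs. (3) and (6).

    xs:    inputs, shape (T, input_dim)
    W_in:  input-to-hidden weights,  shape (hidden_dim, input_dim)
    W:     hidden-to-hidden weights, shape (hidden_dim, hidden_dim)
    W_out: hidden-to-output weights, shape (output_dim, hidden_dim)
    """
    z = np.zeros(W.shape[0])          # z^0 = 0
    zs, ys = [], []
    for x in xs:
        u = W_in @ x + W @ z          # eq. (1) in matrix form
        z = f(u)                      # eq. (2) / (3)
        ys.append(f_out(W_out @ z))   # eqs. (4)-(6)
        zs.append(z)
    return np.array(zs), np.array(ys)
```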

Backpropagation calculation

The gradients are computed from the error at each unit of each layer by backpropagation. Here I explain the **BPTT method (backpropagation through time)**. In BPTT, the RNN is unrolled in the time direction as shown in the figure below, and the errors are backpropagated through the unrolled network.

bptt.png

**Gradient of the weights between the intermediate and output layers**

\begin{align}
\frac{\partial E}{\partial w_{kj}^{(out)}} &= \sum_{t=1}^T \frac{\partial E}{\partial v_k^t} \frac{\partial v_k^t}{\partial w_{kj}^{(out)}} \\
&= \sum_{t=1}^T \frac{\partial E}{\partial y_k^t} \frac{\partial y_k^t}{\partial v_k^t} \frac{\partial v_k^t}{\partial w_{kj}^{(out)}} \\
&= \sum_{t=1}^T \frac{\partial E}{\partial y_k^t} f^{(out)'}(v_k^t)z_j^t \\
&= \sum_{t=1}^T \delta_k^{(out), t} z_j^t \tag{7}
\end{align}

**Gradient of the weights on the feedback path between intermediate layers**

\begin{align}
\frac{\partial E}{\partial w_{jj'}} &= \sum_{t=1}^T \frac{\partial E}{\partial u_j^t} \frac{\partial u_j^t}{\partial w_{jj'}} \\
&= \sum_{t=1}^T \biggl(\sum_{k'} \frac{\partial E}{\partial v_{k'}^t} \frac{\partial v_{k'}^t}{\partial z_j^t} \frac{\partial z_j^t}{\partial u_j^t} + \sum_{j''} \frac{\partial E}{\partial u_{j''}^{t+1}} \frac{\partial u_{j''}^{t+1}}{\partial z_j^t} \frac{\partial z_j^t}{\partial u_j^t} \biggr) \frac{\partial u_j^t}{\partial w_{jj'}} \\
&= \sum_{t=1}^T \biggl(\sum_{k'} \frac{\partial E}{\partial v_{k'}^t} \frac{\partial v_{k'}^t}{\partial z_j^t} + \sum_{j''} \frac{\partial E}{\partial u_{j''}^{t+1}} \frac{\partial u_{j''}^{t+1}}{\partial z_j^t} \biggr) \frac{\partial z_j^t}{\partial u_j^t} \frac{\partial u_j^t}{\partial w_{jj'}} \\
&= \sum_{t=1}^T \biggl(\sum_{k'} \delta_{k'}^{(out), t} w_{k'j} + \sum_{j''} \delta_{j''}^{t+1} w_{j''j}  \biggr) f'(u_j^t) z_{j'}^{t-1} \\
&= \sum_{t=1}^T \delta_j^t z_{j'}^{t-1} \tag{8}
\end{align}

**Gradient of the weights between the input and intermediate layers**

\begin{align}
\frac{\partial E}{\partial w_{ji}^{(in)}} &= \sum_{t=1}^T \frac{\partial E}{\partial u_j^t} \frac{\partial u_j^t}{\partial w_{ji}^{(in)}} \\
&= \sum_{t=1}^T \delta_j^t x_i^t \tag{9}
\end{align}

The gradient for the feedback path of the intermediate layer contains the error of the intermediate layer at the next time step, $\boldsymbol \delta^{t+1}$. When $t = T$, we set $\boldsymbol \delta^{T+1} = \boldsymbol 0$. The error $\delta$ is propagated in the order $t = T, T-1, \cdots, 2, 1$, and the gradients at each time step are summed. This summed gradient is used to update the weights.
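
The following is a NumPy sketch of how equations (7), (8), and (9) can be accumulated by propagating $\delta$ backward from $t = T$ to $t = 1$. It assumes tanh hidden units, an identity output layer, and a squared-error loss (so $\delta_k^{(out),t} = y_k^t - d_k^t$ and $f'(u^t) = 1 - (z^t)^2$); the names continue the forward-pass sketch above and are not the article's actual code.

```python
import numpy as np

def bptt(xs, ds, zs, ys, W, W_out):
    """Accumulate the gradients of eqs. (7)-(9) over one sequence.

    xs, ds: input and teacher arrays, shape (T, input_dim), (T, output_dim)
    zs, ys: hidden and output values saved during the forward pass
    Assumes tanh hidden units, an identity output layer, and squared error,
    so delta_out^t = y^t - d^t and f'(u^t) = 1 - (z^t)**2.
    """
    dW_in = np.zeros((W.shape[0], xs.shape[1]))
    dW = np.zeros_like(W)
    dW_out = np.zeros_like(W_out)
    delta_next = np.zeros(W.shape[0])            # delta^{T+1} = 0
    for t in reversed(range(len(xs))):           # t = T, T-1, ..., 1
        delta_out = ys[t] - ds[t]                # output-layer error
        # hidden-layer error: output-layer term plus the term fed back
        # from the hidden layer at time t+1 (eq. (8))
        delta = (W_out.T @ delta_out + W.T @ delta_next) * (1.0 - zs[t] ** 2)
        z_prev = zs[t - 1] if t > 0 else np.zeros_like(zs[0])   # z^0 = 0
        dW_out += np.outer(delta_out, zs[t])     # eq. (7)
        dW += np.outer(delta, z_prev)            # eq. (8)
        dW_in += np.outer(delta, xs[t])          # eq. (9)
        delta_next = delta
    return dW_in, dW, dW_out
```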

Weight update

Equations (7), (8), and (9) give the gradients for the weights of each layer. Using these gradients, the weights are updated by the following equation.

w_{new} = w_{old} - \varepsilon \frac{\partial E}{\partial w_{old}} \tag{10}
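
In code, equation (10) is a single gradient-descent step applied to each weight matrix. Continuing the hypothetical names from the sketches above, with `eps` as the learning rate $\varepsilon$:

```python
# Gradient-descent update of eq. (10) for each weight matrix.
W_in -= eps * dW_in
W -= eps * dW
W_out -= eps * dW_out
```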

Implementation in Python

I implemented sine wave prediction. The code is available here. Given the input $x^t = \sin\bigl(\theta + (t-1)\cdot\Delta\theta\bigr)$, the network predicts the value at the next time step, $y^t = \sin\bigl(\theta + t\cdot\Delta\theta\bigr)$. From the input $x^t$ alone, the direction of the prediction cannot be determined. However, given both $x^t$ and $z^{t-1}$, information from the previous time step is available, so the direction of the prediction can be determined.

Results

For the training data, each sample is one sequence $\boldsymbol x = (x^1, x^2, \cdots, x^T)$. $\theta$ is drawn at random from the range $[0, \pi]$.

\begin{align}
& \{\theta \,|\, 0 \leq \theta \leq \pi\}, \Delta \theta = \frac{\pi}{6} \\
& x^1 = \sin \theta \\
& x^2 = \sin \bigl(\theta + \Delta \theta\bigr) \\
& \cdots \\
& x^t = \sin \bigl(\theta + (t - 1)\cdot \Delta \theta\bigr) \tag{11} \\
& \cdots \\
& x^T = \sin \bigl(\theta + (T - 1)\cdot \Delta \theta\bigr) \\
\end{align}
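
As an illustration, one sample of this training data can be generated as follows. This is a sketch under assumptions: the sequence length `T` is left as a parameter, and the helper name `make_sample` is mine, not the article's.

```python
import numpy as np

def make_sample(T, d_theta=np.pi / 6, rng=np.random.default_rng()):
    """One training sample built from eq. (11): a sequence x and, as the
    teacher signal, the same sine wave shifted one step ahead."""
    theta = rng.uniform(0.0, np.pi)           # theta drawn from [0, pi]
    t = np.arange(T)
    xs = np.sin(theta + t * d_theta)          # x^t = sin(theta + (t-1)*d_theta)
    ds = np.sin(theta + (t + 1) * d_theta)    # teacher: value one step ahead
    return xs.reshape(-1, 1), ds.reshape(-1, 1)
```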

The learning parameters are as follows.

- Training data: $7000$ samples
- Learning rate: $\varepsilon = 0.05$
- Regularization factor: $\lambda = 0.001$
- Epochs: $30$
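
Putting the pieces together, a training loop with these parameters might look like the sketch below. The hidden-layer size, the sequence length, and the way the regularization factor $\lambda$ enters the update (here as an L2 penalty) are all assumptions, since the article does not spell them out; the function names continue the earlier sketches.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, eps, lam, n_epochs = 7000, 0.05, 0.001, 30
T, hidden_dim = 50, 20                            # assumed; not given in the article
samples = [make_sample(T, rng=rng) for _ in range(n_samples)]
W_in = rng.normal(scale=0.1, size=(hidden_dim, 1))
W = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_out = rng.normal(scale=0.1, size=(1, hidden_dim))
for epoch in range(n_epochs):
    for xs, ds in samples:
        zs, ys = forward(xs, W_in, W, W_out)
        dW_in, dW, dW_out = bptt(xs, ds, zs, ys, W, W_out)
        W_in -= eps * (dW_in + lam * W_in)        # eq. (10) plus an assumed L2 penalty
        W -= eps * (dW + lam * W)
        W_out -= eps * (dW_out + lam * W_out)
```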

Loss

The loss during learning is as follows.

loss.png

Predicting sequence data

Next, we check whether the prediction is correct when sequence data is given as input. The input is denoted by $x$, the output by $y$, and the ideal output by $d$.

\begin{align}
& x^1 = \sin\Bigl(\frac{1}{6}\pi\Bigr) \to y^1 = 0.43831960 \ \ (d^1 = 0.86602540) \tag{12} \\
& x^1 = \sin\Bigl(\frac{5}{6}\pi\Bigr) \to y^1 = 0.43831960 \ \ (d^1 = 0.0) \tag{13}
\end{align}

The values of $y$ match because the input values of $x$ are equal. The prediction $y$ is also far from the ideal output $d$, because no sequence information is available. Next, the data from the previous time step is also given as input.

\begin{align}
& \boldsymbol x = (x^1, x^2) = \biggl(\sin\Bigl(\frac{0}{6}\pi\Bigr), \sin\Bigl(\frac{1}{6}\pi\Bigr)\biggr) \to y^2 = 0.84290507 \ \ (d^2 = 0.86602540) \tag{14} \\
& \boldsymbol x = (x^1, x^2) = \biggl(\sin\Bigl(\frac{4}{6}\pi\Bigr), \sin\Bigl(\frac{5}{6}\pi\Bigr)\biggr) \to y^2 = -0.02663726 \ \ (d^2 = 0.0) \tag{15}
\end{align}

When the data is input as a sequence, the correct prediction results are obtained. This is possible because the RNN's recurrent structure retains information from earlier in the sequence.

Prediction of the sine wave

Finally, we predict the sine wave. A simple algorithm is shown below.

  1. Input the short sequence $\boldsymbol x = (x^1, x^2, \cdots, x^{T'})$ and obtain the output $y^{T'}$.
  2. Input $y^{T'}$ as $x^{T'+1}$ and obtain the output $y^{T'+1}$.
  3. Repeat step 2 to obtain the following values in sequence (a sketch of this loop is shown after the list).
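
A minimal sketch of this free-running loop, continuing the hypothetical `forward` function above and assuming the input and output are both one-dimensional as in this sine wave example:

```python
import numpy as np

def generate(xs_short, W_in, W, W_out, n_steps):
    """Feed a short sequence, then feed each prediction back in as the
    next input (steps 1-3 above). Assumes input_dim == output_dim == 1."""
    zs, ys = forward(xs_short, W_in, W, W_out)   # step 1
    z, y = zs[-1], ys[-1]
    preds = [y]
    for _ in range(n_steps - 1):                 # steps 2 and 3
        u = W_in @ y + W @ z                     # previous output as the new input
        z = np.tanh(u)
        y = W_out @ z
        preds.append(y)
    return np.array(preds)
```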

The figure below shows the result. The blue dots are the short input sequence, and the green dots are the values obtained by prediction. The graph shows that the sine wave is predicted correctly.

sin.png

In conclusion

I was able to implement an RNN. Next, I would like to study the LSTM.
