RNN implementation in Python

Introduction

I implemented an RNN in Python, using ["Deep Learning"](http://www.amazon.co.jp/dp/4061529021) (Okatani Takayuki, Machine Learning Professional Series) as a textbook.


RNN

An RNN is a recurrent neural network for handling sequential data. Examples of sequential data include audio, language, and video. Such data is characterized by sequences whose length differs from sample to sample and whose element order is meaningful. RNNs make it possible to handle these characteristics of sequential data well.

Forward propagation calculation

First, let's define the notation. The units of the input layer, intermediate layer, and output layer are indexed by $i$, $j$, and $k$, respectively. The input at time $t$ is $\boldsymbol x^t$, the input and output of the intermediate layer are $u_j^t$ and $z_j^t$, the input and output of the output layer are $v_k^t$ and $y_k^t$, and the teacher signal is $d_k^t$.

Furthermore, the weights between the input and intermediate layers are $w_{ji}^{(in)}$, the weights on the feedback path between intermediate layers are $w_{jj'}$, and the weights between the intermediate and output layers are $w_{kj}^{(out)}$. The corresponding weight matrices are $\boldsymbol W^{(in)}$, $\boldsymbol W$, and $\boldsymbol W^{(out)}$.

The figure below shows the structure of the network including the above definitions.

rnn.png

Now, I will explain the forward propagation calculation. The input $u_j^t$ and output $z_j^t$ of unit $j$ in the intermediate layer can be expressed as follows, using the input $\boldsymbol x^t$ and the intermediate-layer output at the previous time step, $\boldsymbol z^{t-1}$. Here $f$ is the activation function.

\begin{align}
& u_j^t = \sum_i w_{ji}^{(in)}x_i^t + \sum_{j'} w_{jj'}z_{j'}^{t-1} \tag{1} \\
& z_j^t = f(u_j^t) \tag{2}
\end{align}

However, when $t = 1$ there is no previous time step, so $\boldsymbol z^0 = \boldsymbol 0$. Combining equations (1) and (2) in matrix form, the output $\boldsymbol z^t$ of the intermediate layer is as follows.

\boldsymbol z^t = \boldsymbol f(\boldsymbol W^{(in)}\boldsymbol x^t + \boldsymbol W \boldsymbol z^{t-1}) \tag{3}

The input $v_k^t$ and output $y_k^t$ of unit $k$ in the output layer can be expressed as follows, using the output $\boldsymbol z^t$ of the intermediate layer.

\begin{align}
& v_k^t = \sum_j w_{kj}^{(out)}z_j^t \tag{4} \\
& y_k^t = f^{(out)}(v_k^t) \tag{5}
\end{align}

Combining equations (4) and (5) in matrix form, the output $\boldsymbol y^t$ is as follows.

\boldsymbol y^t = \boldsymbol f^{(out)}(\boldsymbol W^{(out)}\boldsymbol z^t) \tag{6}
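
Equations (3) and (6) translate directly into code. Below is a minimal NumPy sketch of the forward pass under some assumptions: tanh as the hidden activation $f$, the identity as the output activation $f^{(out)}$, and variable names of my own choosing rather than the article's actual code.

```python
import numpy as np

def forward(xs, W_in, W, W_out, f=np.tanh, f_out=lambda v: v):
    """Forward propagation over one sequence, following eqs. (3) and (6).

    xs:    inputs, shape (T, input_dim)
    W_in:  input-to-hidden weights,  shape (hidden_dim, input_dim)
    W:     hidden-to-hidden weights, shape (hidden_dim, hidden_dim)
    W_out: hidden-to-output weights, shape (output_dim, hidden_dim)
    """
    z = np.zeros(W.shape[0])          # z^0 = 0
    zs, ys = [], []
    for x in xs:
        u = W_in @ x + W @ z          # eq. (1) in matrix form
        z = f(u)                      # eq. (2) / (3)
        ys.append(f_out(W_out @ z))   # eqs. (4)-(6)
        zs.append(z)
    return np.array(zs), np.array(ys)
```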

Backpropagation calculation

The gradients are computed from the error at each unit of each layer by backpropagation. Here I explain the **BPTT method (backpropagation through time)**. In BPTT, the RNN is unrolled in the time direction as shown in the figure below, and the errors are backpropagated through the unrolled network.

bptt.png

**Gradient of the weights between the intermediate and output layers**

\begin{align}
\frac{\partial E}{\partial w_{kj}^{(out)}} &= \sum_{t=1}^T \frac{\partial E}{\partial v_k^t} \frac{\partial v_k^t}{\partial w_{kj}^{(out)}} \\
&= \sum_{t=1}^T \frac{\partial E}{\partial y_k^t} \frac{\partial y_k^t}{\partial v_k^t} \frac{\partial v_k^t}{\partial w_{kj}^{(out)}} \\
&= \sum_{t=1}^T \frac{\partial E}{\partial y_k^t} f^{(out)'}(v_k^t)z_j^t \\
&= \sum_{t=1}^T \delta_k^{(out), t} z_j^t \tag{7}
\end{align}

**Gradient of the weights on the feedback path between intermediate layers**

\begin{align}
\frac{\partial E}{\partial w_{jj'}} &= \sum_{t=1}^T \frac{\partial E}{\partial u_j^t} \frac{\partial u_j^t}{\partial w_{jj'}} \\
&= \sum_{t=1}^T \biggl(\sum_{k'} \frac{\partial E}{\partial v_{k'}^t} \frac{\partial v_{k'}^t}{\partial z_j^t} \frac{\partial z_j^t}{\partial u_j^t} + \sum_{j''} \frac{\partial E}{\partial u_{j''}^{t+1}} \frac{\partial u_{j''}^{t+1}}{\partial z_j^t} \frac{\partial z_j^t}{\partial u_j^t} \biggr) \frac{\partial u_j^t}{\partial w_{jj'}} \\
&= \sum_{t=1}^T \biggl(\sum_{k'} \frac{\partial E}{\partial v_{k'}^t} \frac{\partial v_{k'}^t}{\partial z_j^t} + \sum_{j''} \frac{\partial E}{\partial u_{j''}^{t+1}} \frac{\partial u_{j''}^{t+1}}{\partial z_j^t} \biggr) \frac{\partial z_j^t}{\partial u_j^t} \frac{\partial u_j^t}{\partial w_{jj'}} \\
&= \sum_{t=1}^T \biggl(\sum_{k'} \delta_{k'}^{(out), t} w_{k'j} + \sum_{j''} \delta_{j''}^{t+1} w_{j''j}  \biggr) f'(u_j^t) z_{j'}^{t-1} \\
&= \sum_{t=1}^T \delta_j^t z_{j'}^{t-1} \tag{8}
\end{align}

**Gradient of the weights between the input and intermediate layers**

\begin{align}
\frac{\partial E}{\partial w_{ji}^{(in)}} &= \sum_{t=1}^T \frac{\partial E}{\partial u_j^t} \frac{\partial u_j^t}{\partial w_{ji}^{(in)}} \\
&= \sum_{t=1}^T \delta_j^t x_i^t \tag{9}
\end{align}

The gradient for the feedback path of the intermediate layer contains the error of the intermediate layer at the next time step, $\boldsymbol \delta^{t+1}$. When $t = T$, we set $\boldsymbol \delta^{T+1} = \boldsymbol 0$. The error $\delta$ is propagated in the order $t = T, T-1, \cdots, 2, 1$, and the gradients at each time step are summed. This summed gradient is used to update the weights.
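
The following is a NumPy sketch of how equations (7), (8), and (9) can be accumulated by propagating $\delta$ backward from $t = T$ to $t = 1$. It assumes tanh hidden units, an identity output layer, and a squared-error loss (so $\delta_k^{(out),t} = y_k^t - d_k^t$ and $f'(u^t) = 1 - (z^t)^2$); the names continue the forward-pass sketch above and are not the article's actual code.

```python
import numpy as np

def bptt(xs, ds, zs, ys, W, W_out):
    """Accumulate the gradients of eqs. (7)-(9) over one sequence.

    xs, ds: input and teacher arrays, shape (T, input_dim), (T, output_dim)
    zs, ys: hidden and output values saved during the forward pass
    Assumes tanh hidden units, an identity output layer, and squared error,
    so delta_out^t = y^t - d^t and f'(u^t) = 1 - (z^t)**2.
    """
    dW_in = np.zeros((W.shape[0], xs.shape[1]))
    dW = np.zeros_like(W)
    dW_out = np.zeros_like(W_out)
    delta_next = np.zeros(W.shape[0])            # delta^{T+1} = 0
    for t in reversed(range(len(xs))):           # t = T, T-1, ..., 1
        delta_out = ys[t] - ds[t]                # output-layer error
        # hidden-layer error: output-layer term plus the term fed back
        # from the hidden layer at time t+1 (eq. (8))
        delta = (W_out.T @ delta_out + W.T @ delta_next) * (1.0 - zs[t] ** 2)
        z_prev = zs[t - 1] if t > 0 else np.zeros_like(zs[0])   # z^0 = 0
        dW_out += np.outer(delta_out, zs[t])     # eq. (7)
        dW += np.outer(delta, z_prev)            # eq. (8)
        dW_in += np.outer(delta, xs[t])          # eq. (9)
        delta_next = delta
    return dW_in, dW, dW_out
```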

Weight update

Equations (7), (8), and (9) give the gradients for the weights of each layer. Using these gradients, the weights are updated by the following equation.

w_{new} = w_{old} - \varepsilon \frac{\partial E}{\partial w_{old}} \tag{10}
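
In code, equation (10) is a single gradient-descent step applied to each weight matrix. Continuing the hypothetical names from the sketches above, with `eps` as the learning rate $\varepsilon$:

```python
# Gradient-descent update of eq. (10) for each weight matrix.
W_in -= eps * dW_in
W -= eps * dW
W_out -= eps * dW_out
```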

Implementation in Python

I implemented sine wave prediction. The code is available here. Given the input $x^t = \sin\bigl(\theta + (t-1)\cdot\Delta\theta\bigr)$, the network predicts the value at the next time step, $y^t = \sin\bigl(\theta + t\cdot\Delta\theta\bigr)$. From the input $x^t$ alone, the direction of the prediction cannot be determined. However, given both $x^t$ and $z^{t-1}$, information from the previous time step is available, so the direction of the prediction can be determined.

Results

For the training data, each sample is one sequence $\boldsymbol x = (x^1, x^2, \cdots, x^T)$. $\theta$ is drawn at random from the range $[0, \pi]$.

\begin{align}
& \{\theta \,|\, 0 \leq \theta \leq \pi\}, \Delta \theta = \frac{\pi}{6} \\
& x^1 = \sin \theta \\
& x^2 = \sin \bigl(\theta + \Delta \theta\bigr) \\
& \cdots \\
& x^t = \sin \bigl(\theta + (t - 1)\cdot \Delta \theta\bigr) \tag{11} \\
& \cdots \\
& x^T = \sin \bigl(\theta + (T - 1)\cdot \Delta \theta\bigr) \\
\end{align}
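
As an illustration, one sample of this training data can be generated as follows. This is a sketch under assumptions: the sequence length `T` is left as a parameter, and the helper name `make_sample` is mine, not the article's.

```python
import numpy as np

def make_sample(T, d_theta=np.pi / 6, rng=np.random.default_rng()):
    """One training sample built from eq. (11): a sequence x and, as the
    teacher signal, the same sine wave shifted one step ahead."""
    theta = rng.uniform(0.0, np.pi)           # theta drawn from [0, pi]
    t = np.arange(T)
    xs = np.sin(theta + t * d_theta)          # x^t = sin(theta + (t-1)*d_theta)
    ds = np.sin(theta + (t + 1) * d_theta)    # teacher: value one step ahead
    return xs.reshape(-1, 1), ds.reshape(-1, 1)
```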

The learning parameters are as follows.

- Training data: $7000$ samples
- Learning rate: $\varepsilon = 0.05$
- Regularization factor: $\lambda = 0.001$
- Epochs: $30$
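
Putting the pieces together, a training loop with these parameters might look like the sketch below. The hidden-layer size, the sequence length, and the way the regularization factor $\lambda$ enters the update (here as an L2 penalty) are all assumptions, since the article does not spell them out; the function names continue the earlier sketches.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, eps, lam, n_epochs = 7000, 0.05, 0.001, 30
T, hidden_dim = 50, 20                            # assumed; not given in the article
samples = [make_sample(T, rng=rng) for _ in range(n_samples)]
W_in = rng.normal(scale=0.1, size=(hidden_dim, 1))
W = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_out = rng.normal(scale=0.1, size=(1, hidden_dim))
for epoch in range(n_epochs):
    for xs, ds in samples:
        zs, ys = forward(xs, W_in, W, W_out)
        dW_in, dW, dW_out = bptt(xs, ds, zs, ys, W, W_out)
        W_in -= eps * (dW_in + lam * W_in)        # eq. (10) plus an assumed L2 penalty
        W -= eps * (dW + lam * W)
        W_out -= eps * (dW_out + lam * W_out)
```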

Loss

The loss during learning is as follows.

loss.png

Predicting sequence data

Next, we check whether the prediction is correct when sequence data is given as input. The input is denoted by $x$, the output by $y$, and the ideal output by $d$.

\begin{align}
& x^1 = \sin\Bigl(\frac{1}{6}\pi\Bigr) \to y^1 = 0.43831960 \ \ (d^1 = 0.86602540) \tag{12} \\
& x^1 = \sin\Bigl(\frac{5}{6}\pi\Bigr) \to y^1 = 0.43831960 \ \ (d^1 = 0.0) \tag{13}
\end{align}

The values of $y$ match because the input values of $x$ are equal. The prediction $y$ is also far from the ideal output $d$, because no sequence information is available. Next, the data from the previous time step is also given as input.

\begin{align}
& \boldsymbol x = (x^1, x^2) = \biggl(\sin\Bigl(\frac{0}{6}\pi\Bigr), \sin\Bigl(\frac{1}{6}\pi\Bigr)\biggr) \to y^2 = 0.84290507 \ \ (d^2 = 0.86602540) \tag{14} \\
& \boldsymbol x = (x^1, x^2) = \biggl(\sin\Bigl(\frac{4}{6}\pi\Bigr), \sin\Bigl(\frac{5}{6}\pi\Bigr)\biggr) \to y^2 = -0.02663726 \ \ (d^2 = 0.0) \tag{15}
\end{align}

When the data is input as a sequence, the correct prediction results are obtained. This is possible because the RNN's recurrent structure retains information from earlier in the sequence.

Prediction of the sine wave

Finally, we predict the sine wave. A simple algorithm is shown below.

  1. Input the short sequence $\boldsymbol x = (x^1, x^2, \cdots, x^{T'})$ and obtain the output $y^{T'}$.
  2. Input $y^{T'}$ as $x^{T'+1}$ and obtain the output $y^{T'+1}$.
  3. Repeat step 2 to obtain the following values in sequence (a sketch of this loop is shown after the list).
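
A minimal sketch of this free-running loop, continuing the hypothetical `forward` function above and assuming the input and output are both one-dimensional as in this sine wave example:

```python
import numpy as np

def generate(xs_short, W_in, W, W_out, n_steps):
    """Feed a short sequence, then feed each prediction back in as the
    next input (steps 1-3 above). Assumes input_dim == output_dim == 1."""
    zs, ys = forward(xs_short, W_in, W, W_out)   # step 1
    z, y = zs[-1], ys[-1]
    preds = [y]
    for _ in range(n_steps - 1):                 # steps 2 and 3
        u = W_in @ y + W @ z                     # previous output as the new input
        z = np.tanh(u)
        y = W_out @ z
        preds.append(y)
    return np.array(preds)
```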

The figure below shows the result. The blue dots are the short input sequence, and the green dots are the values obtained by prediction. The graph shows that the sine wave is predicted correctly.

sin.png

In conclusion

I was able to implement an RNN. Next, I would like to study the LSTM.
