[PYTHON] I tried to implement a basic Recurrent Neural Network model

I'm interested in Recurrent Neural Networks (RNNs), but I had a hard time writing code for them. I suspect many people are in the same situation. There are several possible reasons; in my case, I can think of the following.

  1. The network structure is simply more complicated. In models from the MLP (Multi-Layer Perceptron) to the CNN (Convolutional Neural Network), the signal flows only forward, even when special layers are present (error backpropagation aside).
  2. MLPs and CNNs have an easy-to-understand example, "MNIST" (often called the "Hello World" of deep learning), but there is no comparable standard example for RNNs.

Incidentally, both the Theano Deep Learning tutorial and the TensorFlow tutorial deal with language models. Those who are familiar with language models may get started quickly, but beginners first have to understand what problem the example is even trying to solve.

This time, I picked an example that deals with a simpler kind of sequence, not a language model, and implemented a simple Recurrent Neural Network (RNN).

(The programming environment used is Python 2.7.11 and Theano 0.7.0.)

Simple RNN structure

To get a feel for RNNs, I first tried running the TensorFlow tutorial (ptb_word_lm.py). You can see that the "perplexity" value decreases as the "epoch" count increases, but I could not understand the details of what it was actually solving. Since the model also uses an LSTM (Long Short-Term Memory), it felt like a high bar for an introduction to RNNs.

The Elman network is introduced as a simple RNN in the book "Deep Learning". Also, searching for "Elman RNN" led me to a blog called "Peter's note" (http://peterroelants.github.io/) that introduces simple RNNs, so I used it as a reference for my program.

The figure of RNN is quoted from the above site.

Fig. Simple RNN structure SRNmodel2.png

Data enters at the input unit x, is multiplied by the weight W_in, and goes into the hidden-layer unit s. The output of unit s has a recurrent path: after being multiplied by the weight W_rec, it returns to unit s at the next time step. Normally a weight W_out for the output would also have to be considered, but to simplify the structure it is fixed at W_out = 1.0, so the state of unit s is output as-is.

To apply the BPTT (Backpropagation Through Time) method to the state shown on the left, consider the "unrolled" state on the right. Starting from the initial value s_0, the hidden unit's state moves to the right, being multiplied by the weight W_rec as time advances, while [x_1, x_2, ..., x_n] is fed in at each time step. The state s_n at the final time step is output to the unit y.
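Written as a formula (my own restatement of the figure, with W_out fixed at 1.0), the unrolled network computes

    s_k = x_k * W_in + s_(k-1) * W_rec,    y = s_n

so with W_in = W_rec = 1.0 the final state y is exactly the sum of the inputs.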

The above model can be converted into Python code as follows. (Quoted from "Peter's note".)

import numpy as np

def update_state(xk, sk, wx, wRec):
    # One time step: s_k = x_k * wx + s_{k-1} * wRec
    return xk * wx + sk * wRec

def forward_states(X, wx, wRec):
    # Initialise the matrix that holds all states for all input sequences.
    # The initial state s0 is set to 0.
    S = np.zeros((X.shape[0], X.shape[1]+1))
    # Use the recurrence relation defined by update_state to update the
    # states through time.
    for k in range(0, X.shape[1]):
        # S[k] = S[k-1] * wRec + X[k] * wx
        S[:,k+1] = update_state(X[:,k], S[:,k], wx, wRec)
    return S
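As a quick sanity check (my own toy example, not from the original article), running forward_states with the "correct" weights wx = wRec = 1.0 reproduces the running sum of the inputs, and the last state equals the total:

# One sample sequence of length 4 containing two 1's.
X = np.array([[0., 1., 1., 0.]])
S = forward_states(X, 1.0, 1.0)
print(S)          # [[0. 0. 1. 2. 2.]]  -- running sum, starting from s0 = 0
print(S[:, -1])   # [2.]  -- equals the sum of the inputs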

What problem does the example solve?

As for what kind of problem the above RNN model handles: the input at each step is a binary value x_k = 0. or 1., and the network outputs the sum of those binary values. For example, for X = [0., 1., 0., 0., 0., 0., 0., 0., 0., 1.] the correct output is Y = 2., because the sum of the list X is 2. The point of the example, of course, is to have the RNN (with its two weight coefficients) estimate this sum without using a "counting algorithm".

Since the output is a continuous numerical value, this can be regarded as a kind of "regression" problem rather than a "classification" problem. Therefore, a squared-error (MSE) cost is used as the cost function, and the unit's value is passed through as-is, without an activation function.
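For reference, the loss actually defined later in the code is the sum of squared errors over the training samples (a sum rather than a mean, though the two differ only by a constant factor):

loss = ((y_ - y_hypo) ** 2).sum()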

First, the model is trained on (pre-generated) training data to obtain the two weight coefficients [W_in, W_rec]. As you can easily guess from the figure above, the correct answer is [W_in, W_rec] = [1.0, 1.0].

Preliminary study of model implementation

In the "Peter's note" article that I referred to, I used python (with numpy) to put together an IPython Notebook without using the Deep Learning library. If you copy this as it is, you can get the result as in the blog article, but considering the development, I tried to implement it using the Deep Learning library. I considered the following as options.

  1. Use "TensorFlow".
  2. Use "Theano".
  3. Use higher level (abstracted) libraries such as "Keras" and "Pylearn2".

At first, I tried to port the original Python code "line by line" to a TensorFlow version, but for the following part:

    for k in range(0, X.shape[1]):
        # S[k] = S[k-1] * wRec + X[k] * wx
        S[:,k+1] = update_state(X[:,k], S[:,k], wx, wRec)
    
    return S

I could not work out how to translate this loop into TensorFlow. If you refer to TensorFlow's tutorial code (ptb_word_lm.py, etc.), this simple RNN model should of course be implementable, but the related class library is complicated and hard to follow, so I passed on TensorFlow this time.

Option 3, higher-level libraries such as "Keras" and "Pylearn2", was also not chosen this time, because it strays from the purpose of "understanding how an RNN is implemented".

In the end, I went with option 2 and wrote the code in "Theano".

“Theano scan” for RNN

What the Theano RNN code found on the net has in common is that most of it uses "theano.scan". theano.scan is a function for performing loop processing and iterative (e.g. convergence) computation within the Theano framework. Its specification is complicated, and it is hard to grasp right away even from the original documentation (Theano Documentation). Japanese-language information is also quite limited, so I investigated the behavior of theano.scan by trying small pieces of code in a Jupyter Notebook, referring to Mr. sinhrks' blog articles.

import theano
import theano.tensor as T

a = T.iscalar('a')   # initial value (also reused as the non-sequence argument)
n = T.iscalar('n')   # number of iterations
result, updates = theano.scan(fn=lambda prior, nonseq: prior * 2,
                              sequences=None,
                              outputs_info=a,    # value from the previous loop --> prior
                              non_sequences=a,   # fixed, non-sequence value --> nonseq
                              n_steps=n)

myfun1 = theano.function(inputs=[a, n], outputs=result, updates=updates)
myfun1(5, 3)
# array([10, 20, 40])
# return-1 = 5 * 2
# return-2 = return-1 * 2
# return-3 = return-2 * 2

Execution result:

>>> array([10, 20, 40], dtype=int32)

I can't explain it in full detail here, so I will just cover some usage examples. theano.scan() takes the five kinds of arguments shown above.

| Keyword | Contents | Example of use |
|:--|:--|:--|
| fn | Function applied at each iteration | fn=lambda prior, nonseq: prior * 2 |
| sequences | List / matrix variable whose elements are fed in one at a time as the iteration advances | sequences=T.arange(x) |
| outputs_info | Initial value of the recurrent output | outputs_info=a |
| non_sequences | Fixed value that is not a sequence (unchanged across iterations) | non_sequences=a |
| n_steps | Number of iterations | n_steps=n |

In the above code, theano.scan() is given an initial value of 5 (not as a sequence) and an iteration count of 3, and each iteration multiplies the result of the previous step by 2:

First iteration: 5 x 2 = 10
Second iteration: 10 x 2 = 20
Third iteration: 20 x 2 = 40

As a result, result = [10, 20, 40] is computed.

The following is a test that is a little more RNN-like.

v = T.matrix('v')
s0 = T.vector('s0')
result, updates = theano.scan(fn=lambda seq, prior: seq + prior * 2,
                              sequences=v,
                              outputs_info=s0,
                              non_sequences=None)
myfun2 = theano.function(inputs=[v, s0], outputs=result, updates=updates)

myfun2([[1., 0.], [0., 1.], [1., 1.]], [0.5, 0.5])

Execution result:

>>> array([[ 2.,  1.],
       [ 4.,  3.],
       [ 9.,  7.]], dtype=float32)

The initial value [0.5, 0.5] is fed into the function. Since we defined fn = lambda seq, prior: seq + prior * 2, the computation proceeds as follows:

First iteration: [1., 0.] + [0.5, 0.5] x 2 = [2., 1.]
Second iteration: [0., 1.] + [2., 1.] x 2 = [4., 3.]
Third iteration: [1., 1.] + [4., 3.] x 2 = [9., 7.]

"theano.scan ()" is a function that supports the flow control of processing required by RNN. Similar functionality is not currently supported for TensorFlow,

Our white paper mentions a number of control flow operations that we've experimented with -- I think once we're happy with its API and confident in its implementation we will try to make it available through the public API -- we're just not quite there yet. It's still early days for us :)

(Quoted from the discussion in GitHub TensorFlow issue #208.)

So I would like to wait for future support.

(I don't know how TensorFlow's RNN models are actually implemented, but the fact that RNN computation has already been realized there means that a "theano.scan()"-like function is not strictly essential. I think I need to study the TensorFlow sample code a bit more on this point.)

Simple RNN code details using Theano

Now that we understand theano.scan(), let's look at the simple RNN code. First, define a simpleRNN class.

import numpy as np
import theano
import theano.tensor as T

class simpleRNN(object):
    #   members:  slen  : state (sequence) length
    #             w_x   : weight of input --> hidden layer
    #             w_rec : weight of recurrence
    def __init__(self, slen, nx, nrec):
        self.len = slen
        self.w_x = theano.shared(
            np.asarray(np.random.uniform(-.1, .1, (nx)),
                       dtype=theano.config.floatX)
        )
        self.w_rec = theano.shared(
            np.asarray(np.random.uniform(-.1, .1, (nrec)),
                       dtype=theano.config.floatX)
        )

    def state_update(self, x_t, s0):
        # this is the network updater for simpleRNN
        def inner_fn(xv, s_tm1, wx, wr):
            # s_t = x_t * w_x + s_{t-1} * w_rec ; the output y_t is the state itself
            s_t = xv * wx + s_tm1 * wr
            y_t = s_t
            return [s_t, y_t]

        w_x_vec = T.cast(self.w_x[0], 'float32')
        w_rec_vec = T.cast(self.w_rec[0], 'float32')

        [s_t, y_t], updates = theano.scan(fn=inner_fn,
                                          sequences=x_t,
                                          outputs_info=[s0, None],
                                          non_sequences=[w_x_vec, w_rec_vec]
                                          )
        return y_t

The class is defined by giving it the state length and the weights (w_x, w_rec) as members. The class method state_update() updates the network state given the initial state s0 and the input sequence x_t, and computes y_t (the output sequence). y_t is a vector, but in the main processing only its final value is extracted, as in y = y_t[-1], and used to compute the cost function.

In the main routine, the training data is created first (almost the same as in the original "Peter's note").

    np.random.seed(seed=1)

    # Create Dataset by program
    num_samples = 20
    seq_len = 10
    
    trX = np.zeros((num_samples, seq_len))
    for row_idx in range(num_samples):
        trX[row_idx,:] = np.around(np.random.rand(seq_len)).astype(int)
    trY = np.sum(trX, axis=1)
    trX = trX.astype(np.float32)
    trX = trX.T                    # need 'List of vector' shape dataset
    trY = trY.astype(np.float32)
    # s0 is time-zero state 
    s0np = np.zeros((num_samples), dtype=np.float32)

trX is sequence data of length 10 with 20 samples. The point here is that the matrix is transposed with trX = trX.T. In a typical machine-learning dataset, the features of one sample are laid out horizontally (columns) and the samples are stacked vertically (rows):

  Data Set Shape
                  feature1   feature2   feature3  ...
     sample1:        -          -          -
     sample2:        -          -          -
     sample3:        -          -          -
       .
       .

This time, however, to update the time-series data with theano.scan(), the data has to be grouped by time step and passed in that form.

(Grouped as follows, it matches the way theano.scan() operates.)
  Data Set Shape (updated)
               [  time1[sample1,  time2[sample1,  time3[sample1 ...    ]
                        sample2,        sample2,        sample2,
                        sample3,        sample3,        sample3,
                         ...    ]         ...   ]         ...    ]

To achieve this easily, the matrix is transposed before being handed to theano.scan().
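The following tiny numpy snippet (my own illustration, using the same shapes as above) shows what the transpose does: theano.scan() iterates over the first axis of its sequences argument, so after transposing, each scan step receives one time slice containing that time step's value for all samples.

import numpy as np

num_samples, seq_len = 20, 10
trX = np.zeros((num_samples, seq_len), dtype=np.float32)   # (samples, time)
trX_T = trX.T                                              # (time, samples)
print(trX_T.shape)      # (10, 20)
print(trX_T[0].shape)   # (20,) -- one time step, all 20 samples at once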

After this, the cost (loss) is built in the Theano graph from the model's predicted value y_hypo and the training-data label y_.

    # Tensor Declaration
    x_t = T.matrix('x_t')
    x = T.matrix('x')
    y_ = T.vector('y_')
    s0 = T.vector('s0')
    y_hypo = T.vector('y_hypo')

    net = simpleRNN(seq_len, 1, 1)  
    y_t = net.state_update(x_t, s0)
    y_hypo = y_t[-1]
    loss = ((y_ - y_hypo) ** 2).sum()

Once you reach this point, you can proceed with learning in a familiar way.

    # Train Net Model
    params = [net.w_x, net.w_rec]
    optimizer = GradientDescentOptimizer(params, learning_rate=1.e-5)
    train_op = optimizer.minimize(loss)

    # Compile ... define theano.function 
    train_model = theano.function(
        inputs=[],
        outputs=[loss],
        updates=train_op,
        givens=[(x_t, trX), (y_, trY), (s0, s0np)],
        allow_input_downcast=True
    )
    
    n_epochs = 2001
    epoch = 0
    
    w_x_ini = (net.w_x).get_value()
    w_rec_ini = (net.w_rec).get_value()
    print('Initial weights: wx = %8.4f, wRec = %8.4f' \
                % (w_x_ini, w_rec_ini))
    
    while (epoch < n_epochs):
        epoch += 1
        loss = train_model()
        if epoch % 100 == 0:
            print('epoch[%5d] : cost =%8.4f' % (epoch, loss[0]))
    
    w_x_final = (net.w_x).get_value()
    w_rec_final = (net.w_rec).get_value()
    print('Final weights : wx = %8.4f, wRec = %8.4f' \
                % (w_x_final, w_rec_final))

This time I prepared and used two optimizers: gradient descent (GradientDescentOptimizer) and RMSProp (RMSPropOptimizer). (The optimizer code itself is omitted here; for the RMSProp method, see the site listed in the references below.)
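For completeness, here is a minimal sketch of what the omitted optimizer classes might look like, assuming the interface used above (a list of shared parameters, and minimize(loss) returning a Theano updates list). This is my own guess, not the author's actual code; the RMSProp variant follows the standard formulation.

import theano
import theano.tensor as T

class GradientDescentOptimizer(object):
    def __init__(self, params, learning_rate=0.01):
        self.params = params
        self.lr = learning_rate

    def minimize(self, loss):
        grads = T.grad(loss, self.params)
        # plain gradient descent: p <- p - lr * dL/dp
        return [(p, p - self.lr * g) for p, g in zip(self.params, grads)]

class RMSPropOptimizer(object):
    def __init__(self, params, learning_rate=0.001, decay=0.9, eps=1e-6):
        self.params = params
        self.lr = learning_rate
        self.decay = decay
        self.eps = eps

    def minimize(self, loss):
        grads = T.grad(loss, self.params)
        updates = []
        for p, g in zip(self.params, grads):
            # running average of squared gradients, one accumulator per parameter
            acc = theano.shared(p.get_value() * 0.)
            acc_new = self.decay * acc + (1. - self.decay) * g ** 2
            updates.append((acc, acc_new))
            updates.append((p, p - self.lr * g / T.sqrt(acc_new + self.eps)))
        return updates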

Execution result

Statements that "RNNs are generally difficult to train" can be found in various places, and these results made me appreciate why.

Condition 1. Gradient Descent, Learning Rate = 1.0e-5

Initial weights: wx =   0.0900, wRec =   0.0113
epoch[  100] : cost =529.6915
epoch[  200] : cost =504.5684
epoch[  300] : cost =475.3019
epoch[  400] : cost =435.9507
epoch[  500] : cost =362.6525
epoch[  600] : cost =  0.2677
epoch[  700] : cost =  0.1585
epoch[  800] : cost =  0.1484
epoch[  900] : cost =  0.1389
epoch[ 1000] : cost =  0.1300
epoch[ 1100] : cost =  0.1216
epoch[ 1200] : cost =  0.1138
epoch[ 1300] : cost =  0.1064
epoch[ 1400] : cost =  0.0995
epoch[ 1500] : cost =  0.0930
epoch[ 1600] : cost =  0.0870
epoch[ 1700] : cost =  0.0813
epoch[ 1800] : cost =  0.0760
epoch[ 1900] : cost =  0.0710
epoch[ 2000] : cost =  0.0663
Final weights : wx =   1.0597, wRec =   0.9863

After training, the weights come close to the correct answer [w_x, w_rec] = [1.0, 1.0]. The figure below shows how the cost function decreases.

Fig. Loss curve (GradientDescent) rnn_loss_log1.PNG

Condition 2. RMSProp method, learning rate = 0.001

Initial weights: wx =   0.0900, wRec =   0.0113
epoch[  100] : cost =  5.7880
epoch[  200] : cost =  0.3313
epoch[  300] : cost =  0.0181
epoch[  400] : cost =  0.0072
epoch[  500] : cost =  0.0068
epoch[  600] : cost =  0.0068
epoch[  700] : cost =  0.0068
epoch[  800] : cost =  0.0068
epoch[  900] : cost =  0.0068
epoch[ 1000] : cost =  0.0068
epoch[ 1100] : cost =  0.0068
epoch[ 1200] : cost =  0.0068
epoch[ 1300] : cost =  0.0068
epoch[ 1400] : cost =  0.0068
epoch[ 1500] : cost =  0.0068
epoch[ 1600] : cost =  0.0068
epoch[ 1700] : cost =  0.0068
epoch[ 1800] : cost =  0.0068
epoch[ 1900] : cost =  0.0068
epoch[ 2000] : cost =  0.0068
Final weights : wx =   0.9995, wRec =   0.9993

Fig. Loss curve (RMSProp) rnn_loss_log2.PNG

In this model, the cost function is highly non-linear with respect to the parameters. With gradient descent, the values diverge as soon as the learning rate is raised, so a very small learning rate of 1.0e-5 was needed. On the other hand, with RMSProp, which is said to be well suited to RNNs, learning proceeded without problems even at a learning rate of 0.001.

(Supplement) The referenced "Peter's note" blog has a detailed explanation of the shape of the cost function and of the optimizer (called "Rprop" in the original blog). The non-linearity of the cost function is visualized with a color map, so please have a look if you are interested (see the link below).

References (web site)
