This is a learning record from when I took the Rabbit Challenge with the aim of passing the Japan Deep Learning Association (JDLA) E qualification, held on January 19 and 20, 2021.
Rabbit Challenge is a course that uses teaching materials edited from the recorded videos of the in-person course "Deep Learning Course That Can Be Applied in the Field". There is no support for questions, but it is an inexpensive option (the lowest price as of June 2020) for preparing for the E qualification exam.
Please check the details from the link below.
- Applied Mathematics
- Machine Learning
- Deep Learning (day1)
- Deep Learning (day2)
- Deep Learning (day3)
- Deep Learning (day4)
Section1: RNN (Recurrent Neural Network) A Recurrent Neural Network (RNN) is a neural network that can handle time-series data. Time-series data is a sequence of observations made at regular intervals in chronological order, with statistical dependencies between them; examples include audio data and text data.
Since an RNN handles time-series data, it needs a recursive structure that holds the initial state and the state at the previous time $ t-1 $ and uses them to compute the state at the next time $ t $.
As a parameter adjustment method in RNNs there is BPTT (backpropagation through time), a variant of error backpropagation.
[Parameter update formula]
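Written out in a common notation (hidden-layer activation $ f $, output activation $ g $, learning rate $ \epsilon $), the forward pass and the BPTT weight updates take the following form:
$ u^t = W_{(in)}x^t + Wz^{t-1} + b $
$ z^t = f(u^t) $
$ v^t = W_{(out)}z^t + c $
$ y^t = g(v^t) $
$ W_{(in)}^{t+1} = W_{(in)}^{t} - \epsilon \frac{\partial E}{\partial W_{(in)}} $
$ W_{(out)}^{t+1} = W_{(out)}^{t} - \epsilon \frac{\partial E}{\partial W_{(out)}} $
$ W^{t+1} = W^{t} - \epsilon \frac{\partial E}{\partial W} $
$ b^{t+1} = b^{t} - \epsilon \frac{\partial E}{\partial b} $
$ c^{t+1} = c^{t} - \epsilon \frac{\partial E}{\partial c} $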
simple_RNN (binary addition)
```python
import sys, os
sys.path.append(os.pardir)  # Setting for importing files from the parent directory
import numpy as np
from common import functions
import matplotlib.pyplot as plt


def d_tanh(x):
    return 1 / (np.cosh(x) ** 2)


# Prepare the data
# Number of binary digits
binary_dim = 8
# Maximum value + 1
largest_number = pow(2, binary_dim)
# Prepare the binary representations of 0 .. largest_number - 1
binary = np.unpackbits(np.array([range(largest_number)], dtype=np.uint8).T, axis=1)

input_layer_size = 2
hidden_layer_size = 16
output_layer_size = 1

weight_init_std = 1
learning_rate = 0.1

iters_num = 10000
plot_interval = 100

# Weight initialization (bias is omitted for simplicity)
W_in = weight_init_std * np.random.randn(input_layer_size, hidden_layer_size)
W_out = weight_init_std * np.random.randn(hidden_layer_size, output_layer_size)
W = weight_init_std * np.random.randn(hidden_layer_size, hidden_layer_size)

# Xavier
# W_in = np.random.randn(input_layer_size, hidden_layer_size) / (np.sqrt(input_layer_size))
# W_out = np.random.randn(hidden_layer_size, output_layer_size) / (np.sqrt(hidden_layer_size))
# W = np.random.randn(hidden_layer_size, hidden_layer_size) / (np.sqrt(hidden_layer_size))

# He
# W_in = np.random.randn(input_layer_size, hidden_layer_size) / (np.sqrt(input_layer_size)) * np.sqrt(2)
# W_out = np.random.randn(hidden_layer_size, output_layer_size) / (np.sqrt(hidden_layer_size)) * np.sqrt(2)
# W = np.random.randn(hidden_layer_size, hidden_layer_size) / (np.sqrt(hidden_layer_size)) * np.sqrt(2)

# Gradients
W_in_grad = np.zeros_like(W_in)
W_out_grad = np.zeros_like(W_out)
W_grad = np.zeros_like(W)

u = np.zeros((hidden_layer_size, binary_dim + 1))
z = np.zeros((hidden_layer_size, binary_dim + 1))
y = np.zeros((output_layer_size, binary_dim))

delta_out = np.zeros((output_layer_size, binary_dim))
delta = np.zeros((hidden_layer_size, binary_dim + 1))

all_losses = []

for i in range(iters_num):
    # Initialize A and B (a + b = d)
    a_int = np.random.randint(largest_number // 2)
    a_bin = binary[a_int]  # binary encoding
    b_int = np.random.randint(largest_number // 2)
    b_bin = binary[b_int]  # binary encoding

    # Correct answer data
    d_int = a_int + b_int
    d_bin = binary[d_int]

    # Output binary
    out_bin = np.zeros_like(d_bin)

    # Error over the entire time series
    all_loss = 0

    # Time series loop
    for t in range(binary_dim):
        # Input value
        X = np.array([a_bin[-t - 1], b_bin[-t - 1]]).reshape(1, -1)
        # Correct answer data at time t
        dd = np.array([d_bin[binary_dim - t - 1]])

        u[:, t + 1] = np.dot(X, W_in) + np.dot(z[:, t].reshape(1, -1), W)
        z[:, t + 1] = functions.sigmoid(u[:, t + 1])
        # z[:, t + 1] = functions.relu(u[:, t + 1])
        # z[:, t + 1] = np.tanh(u[:, t + 1])
        y[:, t] = functions.sigmoid(np.dot(z[:, t + 1].reshape(1, -1), W_out))

        # Error
        loss = functions.mean_squared_error(dd, y[:, t])
        delta_out[:, t] = functions.d_mean_squared_error(dd, y[:, t]) * functions.d_sigmoid(y[:, t])
        all_loss += loss

        out_bin[binary_dim - t - 1] = np.round(y[:, t])

    # Backpropagation through time
    for t in range(binary_dim)[::-1]:
        X = np.array([a_bin[-t - 1], b_bin[-t - 1]]).reshape(1, -1)

        delta[:, t] = (np.dot(delta[:, t + 1].T, W.T) + np.dot(delta_out[:, t].T, W_out.T)) * functions.d_sigmoid(u[:, t + 1])
        # delta[:, t] = (np.dot(delta[:, t + 1].T, W.T) + np.dot(delta_out[:, t].T, W_out.T)) * functions.d_relu(u[:, t + 1])
        # delta[:, t] = (np.dot(delta[:, t + 1].T, W.T) + np.dot(delta_out[:, t].T, W_out.T)) * d_tanh(u[:, t + 1])

        # Gradient accumulation
        W_out_grad += np.dot(z[:, t + 1].reshape(-1, 1), delta_out[:, t].reshape(-1, 1))
        W_grad += np.dot(z[:, t].reshape(-1, 1), delta[:, t].reshape(1, -1))
        W_in_grad += np.dot(X.T, delta[:, t].reshape(1, -1))

    # Gradient application
    W_in -= learning_rate * W_in_grad
    W_out -= learning_rate * W_out_grad
    W -= learning_rate * W_grad

    W_in_grad *= 0
    W_out_grad *= 0
    W_grad *= 0

    if i % plot_interval == 0:
        all_losses.append(all_loss)
        print("iters:" + str(i))
        print("Loss:" + str(all_loss))
        print("Pred:" + str(out_bin))
        print("True:" + str(d_bin))
        out_int = 0
        for index, x in enumerate(reversed(out_bin)):
            out_int += x * pow(2, index)
        print(str(a_int) + " + " + str(b_int) + " = " + str(out_int))
        print("------------")

lists = range(0, iters_num, plot_interval)
plt.plot(lists, all_losses, label="loss")
plt.xlabel("iters_num", fontsize=14)
plt.ylabel("loss", fontsize=14)
plt.show()
```
- Changed weight_init_std → Both decreasing weight_init_std from 1.0 to 0.5 and increasing it to 2.0 slowed convergence.
- Changed learning_rate → Decreasing learning_rate from 0.1 to 0.01 slowed learning; increasing it to 1.0 sped learning up; at 3.0 learning did not progress.
- Changed hidden_layer_size → Decreasing hidden_layer_size from 16 to 8 slowed learning; increasing it to 32 made learning faster; increasing it further to 128 slowed it down again.
- Changed the weight initialization method (to Xavier, to He) → Learning was slower with both initialization methods.
- Changed the activation function of the hidden layer (to ReLU, to tanh).
Section2: LSTM (Long Short-Term Memory) With a simple RNN, long-range dependencies cannot be learned well because the gradient vanishes during error backpropagation. LSTM is used as a method to solve this problem.
- Input gate: $ i_t = \sigma(W^{(i)}x_t + U^{(i)}h_{t-1} + b^{(i)}) $
- Forget gate: $ f_t = \sigma(W^{(f)}x_t + U^{(f)}h_{t-1} + b^{(f)}) $
- Output gate: $ o_t = \sigma(W^{(o)}x_t + U^{(o)}h_{t-1} + b^{(o)}) $
- Memory cell: $ \tilde{c}_t = \tanh(W^{(\tilde{c})}x_t + U^{(\tilde{c})}h_{t-1} + b^{(\tilde{c})}), \quad c_t = i_t \circ \tilde{c}_t + f_t \circ c_{t-1} $
- State update: $ h_t = o_t \circ \tanh(c_t) $
CEC (Constant Error Carousel) Vanishing and exploding gradients can both be avoided if the gradient is kept at exactly 1. Problem: the weight applied to the input data becomes uniform regardless of time dependence → the network loses the learning characteristics of a neural network.
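Concretely, the error propagated back through the memory cell over $ k $ time steps is scaled by $ \left( \frac{\partial c_t}{\partial c_{t-1}} \right)^{k} $; if this per-step factor is exactly 1, the product stays $ 1^k = 1 $, so the gradient neither vanishes nor explodes however long the sequence is.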
- Input gate and output gate: The problem of the CEC is addressed by adding an input gate and an output gate and making the weight applied to the input of each gate variable through the weight matrices W and U.
- Forget gate: The CEC stores all past information, so when some of that information is no longer needed it cannot be deleted and remains stored. The forget gate was therefore introduced so that information can be forgotten at the point when it is no longer needed.
- Peephole connection: We want to be able to propagate, or forget, the past information stored in the CEC to other nodes at arbitrary times, but the value of the CEC itself does not affect the gate control. → A peephole connection is a structure that lets the value of the CEC itself be propagated to the gates via a weight matrix. A minimal code sketch of one LSTM step follows this list.
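As a minimal NumPy sketch of one LSTM step following the gate equations above (peephole connections omitted; the weight names W_i, U_i, b_i, etc. are chosen here just for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step following the gate equations above (no peephole).
    p is a dict of weights, e.g. p['W_i'] has shape (hidden, input)."""
    i = sigmoid(p['W_i'] @ x + p['U_i'] @ h_prev + p['b_i'])        # input gate
    f = sigmoid(p['W_f'] @ x + p['U_f'] @ h_prev + p['b_f'])        # forget gate
    o = sigmoid(p['W_o'] @ x + p['U_o'] @ h_prev + p['b_o'])        # output gate
    c_tilde = np.tanh(p['W_c'] @ x + p['U_c'] @ h_prev + p['b_c'])  # memory cell candidate
    c = i * c_tilde + f * c_prev                                    # cell state (CEC) update
    h = o * np.tanh(c)                                              # state update
    return h, c
```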
Section3: GRU (Gated Recurrent Unit) A conventional LSTM has many parameters, so its computational cost is large. The GRU has a structure with far fewer parameters, while its accuracy can be expected to be equal to or better than that of the LSTM.
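Written in the same notation as the LSTM gates above, a standard formulation of the GRU replaces the three gates and the memory cell with an update gate $ z_t $ and a reset gate $ r_t $ (shown here for reference):

- Update gate: $ z_t = \sigma(W^{(z)}x_t + U^{(z)}h_{t-1} + b^{(z)}) $
- Reset gate: $ r_t = \sigma(W^{(r)}x_t + U^{(r)}h_{t-1} + b^{(r)}) $
- Candidate state: $ \tilde{h}_t = \tanh(Wx_t + U(r_t \circ h_{t-1}) + b) $
- State update: $ h_t = (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t $

Because there is no separate memory cell and only two gates, the number of weight matrices drops from eight in the LSTM (four W, four U) to six.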
Section4: Bidirectional RNN A model that improves accuracy by using not only past information but also future information. Used for text proofreading and machine translation.
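A minimal sketch of the idea, assuming a simple tanh RNN (the helper names rnn_states, W_in_f, etc. are made up for illustration): run one RNN forward over the sequence, another backward, and concatenate the two hidden states at each position.

```python
import numpy as np

def rnn_states(xs, W_in, W):
    """Return the hidden state at every time step of a simple tanh RNN."""
    h = np.zeros(W.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_in @ x + W @ h)
        states.append(h)
    return states

def bidirectional_states(xs, W_in_f, W_f, W_in_b, W_b):
    """Concatenate the forward pass over xs with the backward pass over reversed xs,
    so each position sees both past and future context."""
    forward = rnn_states(xs, W_in_f, W_f)
    backward = rnn_states(xs[::-1], W_in_b, W_b)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]
```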
Section5: Seq2Seq An ordinary RNN requires the input and output to have the same length and order. Seq2Seq, in contrast, is a kind of Encoder-Decoder model that uses separate RNNs on the input side and the output side. It is used for machine dialogue and machine translation.
Encoder RNN A structure that takes the text data entered by the user, split into tokens such as words. vec1 is fed into the RNN and a hidden state is output; this hidden state and the next input vec2 are then fed into the RNN again, and so on. The hidden state obtained when the last vec is fed in is kept as the final state. This final state is called the thought vector, and it is a vector that represents the meaning of the input sentence.
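A minimal sketch of this flow, assuming a simple tanh RNN (embed, W_in, and W are illustrative names):

```python
import numpy as np

def encode(token_ids, embed, W_in, W):
    """Feed each token vector (vec1, vec2, ...) together with the previous
    hidden state; the hidden state after the last token is the thought vector."""
    h = np.zeros(W.shape[0])
    for tid in token_ids:
        x = embed[tid]                  # token -> vector
        h = np.tanh(W_in @ x + W @ h)   # hidden state update
    return h                            # final state = thought vector
```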
Decoder RNN A structure in which the system generates output data token by token (for example, word by word).
[Decoder RNN processing procedure]
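A rough sketch of this procedure, assuming the same simple RNN as the Encoder sketch above (greedy selection is used here in place of sampling; bos_id and eos_id mark the beginning and end of the sequence):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def decode(thought_vector, embed, W_in, W, W_out, bos_id, eos_id, max_len=20):
    """Start from the Encoder's final state, repeatedly predict the next token,
    feed its embedding back in, and stop at the end-of-sequence token."""
    h = thought_vector
    token = bos_id
    output = []
    for _ in range(max_len):
        h = np.tanh(W_in @ embed[token] + W @ h)   # update hidden state
        probs = softmax(W_out @ h)                 # generation probability of each token
        token = int(np.argmax(probs))              # greedy choice (sampling is also possible)
        if token == eos_id:
            break
        output.append(token)
    return output                                  # detokenize afterwards
```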
HRED Generates the next utterance from the past n-1 utterances. Whereas Seq2seq produces responses that ignore the context of the conversation, HRED responds in keeping with the flow of the preceding utterances, so it generates more human-like sentences. By combining Seq2seq with a Context RNN (a structure that takes the series of sentences summarized by the Encoder and converts it into a vector representing the entire conversation context so far), it generates responses that take the history of past utterances into account. Problem 1: There is only probabilistic diversity at the surface level, and no diversity in the "flow" of the conversation → given the same context (list of utterances), it can only produce the same answer every time. Problem 2: It tends to give short, information-poor answers → it tends to learn short, common answers.
VHRED A structure that solves the problems of HRED by adding the VAE concept of latent variables to HRED.
VAE A model that assumes a probability distribution for the latent variable z.
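In a VAE the latent variable is assumed to follow a standard normal distribution, $ z \sim \mathcal{N}(0, I) $; the encoder outputs a mean $ \mu $ and a variance $ \sigma^2 $, and sampling is written as $ z = \mu + \sigma \odot \epsilon $ with $ \epsilon \sim \mathcal{N}(0, I) $ (the reparameterization trick), which keeps the sampling step differentiable.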
Section6: Word2vec A neural network cannot take variable-length text such as words as input directly. In word2vec, a vocabulary is built from the training data, and a weight matrix of size "vocabulary size x dimension of an arbitrary word vector" makes it possible to learn distributed representations of large-scale data at a realistic computation speed and memory cost. The resulting numeric vectors can be used to turn raw text into a numerical representation suitable for data visualization, machine learning, and deep learning.
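A minimal sketch of the idea that a word's distributed representation is simply one row of the "vocabulary size x embedding dimension" weight matrix (the vocabulary and dimension below are illustrative):

```python
import numpy as np

vocab = {"deep": 0, "learning": 1, "rnn": 2}
embedding_dim = 100
W_embed = 0.01 * np.random.randn(len(vocab), embedding_dim)  # learned during training

def word_vector(word):
    """Look up the distributed representation of a word (one row of W_embed)."""
    return W_embed[vocab[word]]

def cosine_similarity(a, b):
    """Similarity between two learned word vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(word_vector("deep"), word_vector("learning")))
```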
Section7: Attention Mechanism A mechanism for learning the degree of relevance between input and output, i.e. "which input word is related to which output word". By using a weighted average of the hidden states of each word in the Encoder as additional input when the Decoder outputs each word, the context of the source sentence can be captured in more detail. The introduction of the attention mechanism greatly improved the accuracy of neural machine translation, surpassing the performance of conventional statistical machine translation models.
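A minimal sketch of the weighted average described above, using a dot-product score as the relevance measure (one common choice; the function names are illustrative):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """Weighted average of the Encoder hidden states, weighted by their
    relevance (here: dot-product score) to the current Decoder state."""
    scores = np.array([decoder_state @ h for h in encoder_states])  # relevance of each input word
    weights = softmax(scores)                                       # attention weights, sum to 1
    context = sum(w * h for w, h in zip(weights, encoder_states))   # context vector
    return context, weights
```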