[PYTHON] [Rabbit Challenge (E qualification)] Deep learning (day3)

Introduction

This is a learning record from when I took the Rabbit Challenge with the aim of passing the Japan Deep Learning Association (JDLA) E qualification exam held on January 19th and 20th, 2021.

Rabbit Challenge is a course that uses teaching materials edited from recorded videos of the in-person course "Deep learning course that can be applied in the field". There is no support for questions, but it is an inexpensive course (the lowest price as of June 2020) for preparing for the E qualification exam.

Please check the details from the link below.

List of subjects

- Applied Mathematics
- Machine learning
- Deep learning (day1)
- Deep learning (day2)
- Deep learning (day3)
- Deep learning (day4)

Section1: Concept of recurrent neural network

A Recurrent Neural Network (RNN) is a neural network that can handle time-series data. Time-series data is a sequence of data observed at regular intervals in chronological order, where the observations are statistically dependent on each other; examples include audio data and text data.

Since an RNN handles time-series data, it needs a recursive structure that holds the initial state and the state at the previous time $ t-1 $ and computes the state at the next time $ t $.


$ u^t = W_{(in)}x^t + Wz^{t-1} + b $
$ z^t = f(W_{(in)}x^t + Wz^{t-1} + b) $
$ v^t = W_{(out)}z^t + c $
$ y^t = g(W_{(out)}z^t + c) $
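
As a minimal sketch of the recurrence above (my own illustration, not the course code), the forward pass can be written in NumPy as follows; the shapes and the choice of tanh for f and sigmoid for g are assumptions.

import numpy as np

def rnn_forward(x_seq, W_in, W, W_out, b, c):
    # x_seq: (T, input_dim), W_in: (input_dim, hidden_dim),
    # W: (hidden_dim, hidden_dim), W_out: (hidden_dim, output_dim)
    z = np.zeros(W.shape[0])                  # initial state z^0
    ys = []
    for x in x_seq:
        u = x @ W_in + z @ W + b              # u^t = W_in x^t + W z^{t-1} + b
        z = np.tanh(u)                        # z^t = f(u^t), f assumed to be tanh
        v = z @ W_out + c                     # v^t = W_out z^t + c
        ys.append(1.0 / (1.0 + np.exp(-v)))   # y^t = g(v^t), g assumed to be sigmoid
    return np.array(ys)                       # outputs for every time step

# Usage with random weights of hypothetical sizes
rng = np.random.default_rng(0)
W_in, W, W_out = rng.normal(size=(3, 4)), rng.normal(size=(4, 4)), rng.normal(size=(4, 2))
print(rnn_forward(rng.normal(size=(5, 3)), W_in, W, W_out, np.zeros(4), np.zeros(2)).shape)  # (5, 2)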

As a parameter adjustment method for RNNs, there is BPTT (Backpropagation Through Time), a variant of error backpropagation applied along the time axis.

[Parameter update formulas]

$ W_{(in)}^{t+1} = W_{(in)}^{t}-\epsilon\frac{\partial E}{\partial W_{(in)}} = W_{(in)}^{t}-\epsilon \sum_{z=0}^{T_t}\delta^{t-z}[x^{t-z}]^T $

$ W_{(out)}^{t+1} = W_{(out)}^{t}-\epsilon\frac{\partial E}{\partial W_{(out)}} = W_{(out)}^{t}-\epsilon \delta^{out,t}[z^{t}]^T $

$ W^{t+1} = W^{t}-\epsilon\frac{\partial E}{\partial W} = W^{t}-\epsilon \sum_{z=0}^{T_t}\delta^{t-z}[z^{t-z-1}]^T $

$ b^{t+1} = b^t-\epsilon\frac{\partial E}{\partial b} = b^t-\epsilon \sum_{z=0}^{T_t}\delta^{t-z} $

$ c^{t+1} = c^t-\epsilon\frac{\partial E}{\partial c} = c^t-\epsilon \delta^{out,t} $

simple_RNN (binary addition)


import sys, os
sys.path.append(os.pardir)  #Settings for importing files in the parent directory
import numpy as np
from common import functions
import matplotlib.pyplot as plt


def d_tanh(x):
    return 1/(np.cosh(x) ** 2)

# Prepare the data
# Number of binary digits
binary_dim = 8
# Maximum value + 1
largest_number = pow(2, binary_dim)
# Prepare binary representations of the numbers up to largest_number
binary = np.unpackbits(np.array([range(largest_number)],dtype=np.uint8).T,axis=1)

input_layer_size = 2
hidden_layer_size = 16
output_layer_size = 1

weight_init_std = 1
learning_rate = 0.1

iters_num = 10000
plot_interval = 100

# Weight initialization (bias is omitted for simplicity)
W_in = weight_init_std * np.random.randn(input_layer_size, hidden_layer_size)
W_out = weight_init_std * np.random.randn(hidden_layer_size, output_layer_size)
W = weight_init_std * np.random.randn(hidden_layer_size, hidden_layer_size)
# Xavier
# W_in = np.random.randn(input_layer_size, hidden_layer_size) / (np.sqrt(input_layer_size))
# W_out = np.random.randn(hidden_layer_size, output_layer_size) / (np.sqrt(hidden_layer_size))
# W = np.random.randn(hidden_layer_size, hidden_layer_size) / (np.sqrt(hidden_layer_size))
# He
# W_in = np.random.randn(input_layer_size, hidden_layer_size) / (np.sqrt(input_layer_size)) * np.sqrt(2)
# W_out = np.random.randn(hidden_layer_size, output_layer_size) / (np.sqrt(hidden_layer_size)) * np.sqrt(2)
# W = np.random.randn(hidden_layer_size, hidden_layer_size) / (np.sqrt(hidden_layer_size)) * np.sqrt(2)


# Gradients
W_in_grad = np.zeros_like(W_in)
W_out_grad = np.zeros_like(W_out)
W_grad = np.zeros_like(W)

u = np.zeros((hidden_layer_size, binary_dim + 1))
z = np.zeros((hidden_layer_size, binary_dim + 1))
y = np.zeros((output_layer_size, binary_dim))

delta_out = np.zeros((output_layer_size, binary_dim))
delta = np.zeros((hidden_layer_size, binary_dim + 1))

all_losses = []

for i in range(iters_num):
    
    # Initialize a and b (correct answer d = a + b)
    a_int = np.random.randint(largest_number/2)
    a_bin = binary[a_int] # binary encoding
    b_int = np.random.randint(largest_number/2)
    b_bin = binary[b_int] # binary encoding
    
    #Correct answer data
    d_int = a_int + b_int
    d_bin = binary[d_int]
    
    #Output binary
    out_bin = np.zeros_like(d_bin)
    
    #Error in the entire time series
    all_loss = 0    
    
    #Time series loop
    for t in range(binary_dim):
        #Input value
        X = np.array([a_bin[ - t - 1], b_bin[ - t - 1]]).reshape(1, -1)
        #Correct answer data at time t
        dd = np.array([d_bin[binary_dim - t - 1]])
        
        u[:,t+1] = np.dot(X, W_in) + np.dot(z[:,t].reshape(1, -1), W)
        z[:,t+1] = functions.sigmoid(u[:,t+1])
#         z[:,t+1] = functions.relu(u[:,t+1])
#         z[:,t+1] = np.tanh(u[:,t+1])    
        y[:,t] = functions.sigmoid(np.dot(z[:,t+1].reshape(1, -1), W_out))


        #error
        loss = functions.mean_squared_error(dd, y[:,t])
        
        delta_out[:,t] = functions.d_mean_squared_error(dd, y[:,t]) * functions.d_sigmoid(y[:,t])        
        
        all_loss += loss

        out_bin[binary_dim - t - 1] = np.round(y[:,t])
    
    
    for t in range(binary_dim)[::-1]:
        X = np.array([a_bin[-t-1],b_bin[-t-1]]).reshape(1, -1)        

        delta[:,t] = (np.dot(delta[:,t+1].T, W.T) + np.dot(delta_out[:,t].T, W_out.T)) * functions.d_sigmoid(u[:,t+1])
#         delta[:,t] = (np.dot(delta[:,t+1].T, W.T) + np.dot(delta_out[:,t].T, W_out.T)) * functions.d_relu(u[:,t+1])
#         delta[:,t] = (np.dot(delta[:,t+1].T, W.T) + np.dot(delta_out[:,t].T, W_out.T)) * d_tanh(u[:,t+1])    

        #Gradient update
        W_out_grad += np.dot(z[:,t+1].reshape(-1,1), delta_out[:,t].reshape(-1,1))
        W_grad += np.dot(z[:,t].reshape(-1,1), delta[:,t].reshape(1,-1))
        W_in_grad += np.dot(X.T, delta[:,t].reshape(1,-1))
    
    #Gradient application
    W_in -= learning_rate * W_in_grad
    W_out -= learning_rate * W_out_grad
    W -= learning_rate * W_grad
    
    W_in_grad *= 0
    W_out_grad *= 0
    W_grad *= 0
    

    if(i % plot_interval == 0):
        all_losses.append(all_loss)        
        print("iters:" + str(i))
        print("Loss:" + str(all_loss))
        print("Pred:" + str(out_bin))
        print("True:" + str(d_bin))
        out_int = 0
        for index,x in enumerate(reversed(out_bin)):
            out_int += x * pow(2, index)
        print(str(a_int) + " + " + str(b_int) + " = " + str(out_int))
        print("------------")

lists = range(0, iters_num, plot_interval)
plt.plot(lists, all_losses, label="loss")
plt.xlabel("iters_num", fontsize=14)
plt.ylabel("loss", fontsize=14)

plt.show()


- Changed weight_init_std
  → When weight_init_std was decreased from 1.0 to 0.5 or increased to 2.0, convergence slowed in both cases.

- Changed learning_rate
  → Decreasing learning_rate from 0.1 to 0.01 slowed learning; increasing it to 1.0 sped learning up, while increasing it to 3.0 prevented learning from progressing.

- Changed hidden_layer_size
  → Decreasing hidden_layer_size from 16 to 8 slowed learning; increasing it to 32 made learning faster, while increasing it further to 128 slowed it down again.

- Changed the weight initialization method (Xavier, He)
  → Learning was slow with both initialization methods.

- Changed the activation function of the middle layer (ReLU, tanh; the commented-out lines in the code above)

Section2: LSTM (Long Short-Term Memory)

With a simple RNN, long-range dependencies cannot be learned well because the gradient vanishes during error backpropagation. LSTM is used as a method to solve this problem; its main components are summarized below, with a small code sketch after the list.

- Input gate and output gate: The problems of the CEC (Constant Error Carousel) are addressed by adding an input gate and an output gate, and by making the weights applied to the inputs of each gate learnable through the weight matrices W and U.

- Forget gate: The CEC stores all past information, but information that is no longer needed cannot be deleted and remains stored. The forget gate was therefore introduced so that such information can be forgotten at the appropriate timing.

- Peephole connection: We want to be able to propagate or forget the past information stored in the CEC at arbitrary times, but the value of the CEC itself does not affect the gate control. A peephole connection is a structure that lets the gates see the value of the CEC itself through a weight matrix.
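
To make the role of these gates concrete, here is a minimal NumPy sketch of a single LSTM step under my own naming and shape assumptions (peephole connections omitted); it is an illustration, not the course implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # W: (4*hidden, input_dim), U: (4*hidden, hidden), b: (4*hidden,)
    H = h_prev.shape[0]
    a = W @ x + U @ h_prev + b         # pre-activations of all four parts at once
    f = sigmoid(a[0:H])                # forget gate: how much of the CEC to keep
    i = sigmoid(a[H:2*H])              # input gate: how much new content to write
    o = sigmoid(a[2*H:3*H])            # output gate: how much of the CEC to expose
    g = np.tanh(a[3*H:4*H])            # candidate cell content
    c = f * c_prev + i * g             # CEC (cell state) update
    h = o * np.tanh(c)                 # hidden state passed to the next time step
    return h, c

# Usage with random weights of hypothetical sizes
H, D = 4, 3
rng = np.random.default_rng(0)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H),
                 rng.normal(size=(4 * H, D)), rng.normal(size=(4 * H, H)), np.zeros(4 * H))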

Section3: GRU (Gated Recurrent Unit)

A conventional LSTM has many parameters, so its computational cost is high. The GRU has a structure in which the number of parameters is greatly reduced, while accuracy equal to or better than that of the LSTM can still be expected.

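As a rough sketch of where the parameter reduction comes from (an illustration under my own shape assumptions, not the course code): the GRU keeps only a reset gate and an update gate and has no separate cell state or output gate, so it needs three blocks of weights instead of the LSTM's four.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W, U, b):
    # W: (3*hidden, input_dim), U: (3*hidden, hidden), b: (3*hidden,)
    H = h_prev.shape[0]
    r = sigmoid(W[0:H] @ x + U[0:H] @ h_prev + b[0:H])                # reset gate
    z = sigmoid(W[H:2*H] @ x + U[H:2*H] @ h_prev + b[H:2*H])          # update gate
    h_cand = np.tanh(W[2*H:] @ x + U[2*H:] @ (r * h_prev) + b[2*H:])  # candidate state
    return (1.0 - z) * h_prev + z * h_cand    # hidden state doubles as the memory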

Section4: Bidirectional RNN

A model that improves accuracy by using not only past information but also future information. It is used for tasks such as sentence proofreading and machine translation.

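A minimal sketch of the idea under the same simple-RNN assumptions as before: one RNN reads the sequence from the past and another from the future, and their hidden states are concatenated at each time step (the parameter layout is my own choice for illustration).

import numpy as np

def rnn_states(x_seq, W_in, W, b):
    # Hidden state z^t at every step for one direction.
    z = np.zeros(W.shape[0])
    states = []
    for x in x_seq:
        z = np.tanh(x @ W_in + z @ W + b)
        states.append(z)
    return np.array(states)                                 # (T, hidden_dim)

def bidirectional_states(x_seq, fwd_params, bwd_params):
    h_fwd = rnn_states(x_seq, *fwd_params)                  # past -> future
    h_bwd = rnn_states(x_seq[::-1], *bwd_params)[::-1]      # future -> past, re-reversed
    return np.concatenate([h_fwd, h_bwd], axis=1)           # (T, 2*hidden_dim)

Each position t then carries information from both directions, which is what allows the model to use future context as well as past context.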

Section5: Seq2Seq

An ordinary RNN requires the input and output to have the same length and order. Seq2Seq, on the other hand, is a kind of Encoder-Decoder model that uses different RNNs on the input side and the output side. It is used for machine dialogue and machine translation.


[Decoder RNN processing procedure] (a rough code sketch follows the list)

  1. Decoder RNN: Outputs the generation probability of each token, starting from the final state (thought vector) of the Encoder RNN. The final state is set as the initial state of the Decoder RNN, and an Embedding is fed in as input.
  2. Sampling: Randomly select a token according to the generation probabilities.
  3. Embedding: Embed the token selected in 2 and use it as the next input to the Decoder RNN.
  4. Detokenize: Repeat 1-3 and convert the tokens obtained in 2 into a character string.
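
Putting steps 1-4 together, a decoding loop might look like the sketch below; embed, decoder_step, output_probs, bos_id, and eos_id are hypothetical placeholders standing in for the trained components, not an actual library API, and a greedy argmax is used in place of random sampling.

import numpy as np

def decode(thought_vector, embed, decoder_step, output_probs, bos_id, eos_id, max_len=50):
    h = thought_vector                   # 1. Encoder final state -> Decoder initial state
    token = bos_id
    output_ids = []
    for _ in range(max_len):
        x = embed(token)                 # 3. Embedding of the previously chosen token
        h = decoder_step(x, h)           # one Decoder RNN step
        probs = output_probs(h)          # 1. generation probability of each token
        token = int(np.argmax(probs))    # 2. token selection (greedy here, sampling in general)
        if token == eos_id:
            break
        output_ids.append(token)
    return output_ids                    # 4. detokenize these IDs into a string afterwards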

Section6: Word2vec

Variable-length strings such as words cannot be given to a neural network as-is. In word2vec, a vocabulary is created from the training data, and a weight matrix of size "vocabulary size × dimension of an arbitrary word vector" makes it possible to learn distributed representations from large-scale data with realistic computation time and memory usage. The resulting numeric vectors can be used to convert raw text into a numerical representation suitable for data visualization, machine learning, and deep learning.
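
A sketch of just the "vocabulary size × word-vector dimension" weight matrix idea; the toy vocabulary and sizes below are made up for illustration.

import numpy as np

vocab = ["i", "like", "deep", "learning"]          # vocabulary built from the training data
word_to_id = {w: i for i, w in enumerate(vocab)}
embedding_dim = 3                                  # arbitrary word-vector dimension

# Weight matrix of shape (vocabulary size, embedding dimension);
# its rows are the distributed representations that word2vec training learns.
W_embed = np.random.default_rng(0).normal(size=(len(vocab), embedding_dim))

# Looking up a word's vector is just selecting a row, which is equivalent to
# multiplying a one-hot vector by W_embed without ever forming the one-hot.
vec = W_embed[word_to_id["deep"]]
print(vec)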

Section7: Attention Mechanism

A mechanism for learning the degree of relevance between words of the input and words of the output. When the Decoder outputs each word, it uses a weighted average of the hidden states of the Encoder's words as additional input, which allows the context of the source sentence to be captured in more detail. With the introduction of the Attention mechanism, the accuracy of neural machine translation improved greatly, surpassing the performance of conventional statistical machine translation models.
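
A minimal sketch of the weighted average described above, using dot-product scores as one common (assumed) choice of relevance function.

import numpy as np

def attention_context(decoder_state, encoder_states):
    # encoder_states: (T, hidden_dim), one hidden state per source word
    scores = encoder_states @ decoder_state     # relevance of each source word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over the source positions
    return weights @ encoder_states             # weighted average = context vector

rng = np.random.default_rng(0)
print(attention_context(rng.normal(size=4), rng.normal(size=(6, 4))).shape)  # (4,)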

