Table of contents
[Deep Learning: Day1 NN](https://qiita.com/matsukura04583/items/6317c57bc21de646da8e)
[Deep Learning: Day2 CNN](https://qiita.com/matsukura04583/items/29f0dcc3ddeca4bf69a2)
[Deep Learning: Day3 RNN](https://qiita.com/matsukura04583/items/9b77a238da4441e0f973)
[Deep Learning: Day4 Reinforcement Learning / TensorFlow](https://qiita.com/matsukura04583/items/50806b750c8d77f2305d)
 
•AlexNet
AlexNet is the model that won the 2012 image recognition competition (ILSVRC), beating the second-place entry by a large margin. With the advent of AlexNet, deep learning began to attract a great deal of attention.
Model structure
Consists of 5 convolution layers (with pooling layers) followed by 3 fully connected layers.

1-1 RNN overview
1-1-1 What is RNN?
What is RNN? $\Rightarrow$ A neural network that can handle time series data.
1-1-2 Time series data
What is time series data? $\Rightarrow$ A series of data observed at regular intervals in chronological order, in which the observations are statistically dependent on one another.
What are concrete examples of time series data? $\Rightarrow$ Voice data, text data, etc.
1-1-3 About RNN
What are the features of RNN? $\Rightarrow$ To handle a time series model, an RNN needs a recursive structure that holds the initial state and the state at the past time t-1, and uses them to recursively compute the state at the next time t.
 
Overview of RNN
 
Formula
 
Code
python
# Hidden-layer pre-activation: current input plus previous hidden state
u[:,t+1] = np.dot(X, W_in) + np.dot(z[:,t].reshape(1, -1), W)
# Hidden-layer output
z[:,t+1] = functions.sigmoid(u[:,t+1])
# Output layer: sigmoid applied to np.dot(z[:,t+1].reshape(1, -1), W_out)
y[:,t] = functions.sigmoid(np.dot(z[:,t+1].reshape(1, -1), W_out))
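As a minimal, self-contained sketch of the forward pass above: the shapes, the random weights, and the local `sigmoid` helper (standing in for `functions.sigmoid`, whose definition is not shown here) are all assumptions for illustration. Time runs along the first axis here, rather than the second axis used in the snippet above.

```python
import numpy as np

def sigmoid(x):
    # Stand-in for functions.sigmoid (assumed to be the logistic function)
    return 1.0 / (1.0 + np.exp(-x))

def rnn_forward(xs, W_in, W, W_out):
    """Run a bias-free simple RNN over xs with shape (T, input_size)."""
    T = xs.shape[0]
    z = np.zeros((T + 1, W.shape[0]))     # z[0] is the initial hidden state
    y = np.zeros((T, W_out.shape[1]))
    for t in range(T):
        u = xs[t] @ W_in + z[t] @ W       # current input plus previous hidden state
        z[t + 1] = sigmoid(u)             # new hidden state
        y[t] = sigmoid(z[t + 1] @ W_out)  # output at time t
    return z, y

rng = np.random.default_rng(0)
xs = rng.integers(0, 2, size=(8, 2)).astype(float)  # 8 time steps, 2 inputs each
W_in = rng.standard_normal((2, 16))
W = rng.standard_normal((16, 16))
W_out = rng.standard_normal((16, 1))
z, y = rnn_forward(xs, W_in, W, W_out)
```

Each hidden state `z[t + 1]` depends on `z[t]`, which is the recursive structure described in 1-1-3.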
1-2 BPTT
1-2-1 What is BPTT? A review of the error backpropagation method
 
1-2-2 Mathematical description of BPTT
 
Linking the formulas to the code, from the top:
1) np.dot(X.T, delta[:,t].reshape(1,-1)) (gradient with respect to W_in)
2) np.dot(z[:,t+1].reshape(-1,1), delta_out[:,t].reshape(-1,1)) (gradient with respect to W_out)
3) np.dot(z[:,t].reshape(-1,1), delta[:,t].reshape(1,-1)) (gradient with respect to W)
(The biases b and c are omitted in the simple RNN code for brevity.)
(Code) y[:,t] = functions.sigmoid(np.dot(z[:,t+1].reshape(1, -1), W_out))
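The gradient terms for W_in, W_out, and W can be sketched end to end as follows. This is an illustrative sketch, not the course code: it assumes a squared-error loss, sigmoid activations in both layers, no biases, and time on the first axis. The helper names (`forward`, `loss_fn`, `bptt`) are hypothetical. A central-difference check on one entry of W is included as a sanity test.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(xs, W_in, W, W_out):
    T = xs.shape[0]
    z = np.zeros((T + 1, W.shape[0]))
    y = np.zeros((T, W_out.shape[1]))
    for t in range(T):
        z[t + 1] = sigmoid(xs[t] @ W_in + z[t] @ W)
        y[t] = sigmoid(z[t + 1] @ W_out)
    return z, y

def loss_fn(xs, ds, W_in, W, W_out):
    _, y = forward(xs, W_in, W, W_out)
    return 0.5 * np.sum((y - ds) ** 2)

def bptt(xs, ds, W_in, W, W_out):
    z, y = forward(xs, W_in, W, W_out)
    W_in_grad = np.zeros_like(W_in)
    W_grad = np.zeros_like(W)
    W_out_grad = np.zeros_like(W_out)
    delta_next = np.zeros(W.shape[0])  # no error flows in from beyond the last step
    for t in reversed(range(xs.shape[0])):
        delta_out = (y[t] - ds[t]) * y[t] * (1.0 - y[t])
        # Error reaching z[t+1]: from the output at t and from the next time step
        dz = delta_out @ W_out.T + delta_next @ W.T
        delta = dz * z[t + 1] * (1.0 - z[t + 1])
        W_in_grad += np.outer(xs[t], delta)
        W_out_grad += np.outer(z[t + 1], delta_out)
        W_grad += np.outer(z[t], delta)
        delta_next = delta
    return W_in_grad, W_grad, W_out_grad

rng = np.random.default_rng(0)
xs = rng.standard_normal((5, 2))
ds = rng.random((5, 1))
W_in = rng.standard_normal((2, 4)) * 0.5
W = rng.standard_normal((4, 4)) * 0.5
W_out = rng.standard_normal((4, 1)) * 0.5
W_in_grad, W_grad, W_out_grad = bptt(xs, ds, W_in, W, W_out)

# Central-difference check on one entry of the recurrent weight W
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric = (loss_fn(xs, ds, W_in, W_plus, W_out)
           - loss_fn(xs, ds, W_in, W_minus, W_out)) / (2 * eps)
```

The term `delta_next @ W.T` is where the recurrent weight is multiplied in once per step back in time.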
 
Parameter update formula
 
 
1-2-3 Overall view of BPTT
 
Underlined $\Longrightarrow$
Section2) LSTM overall picture (the flow so far and an overview of the issues)
 
2-1 CEC
The CEC (Constant Error Carousel) addresses vanishing and exploding gradients: both can be avoided if the gradient is kept at 1. Issue: the weight on the input data is uniform regardless of time dependence. $\Rightarrow$ The network loses its ability to learn. Input layer → hidden layer weights: input weight conflict. Hidden layer → output layer weights: output weight conflict.
2-2 Input gate and output gate
What is the role of the input and output gates? $\Rightarrow$ By adding input and output gates, the weight given to the input values at each gate can be varied through the weight matrices W and U. $\Rightarrow$ This solves the problem of the CEC.
2-3 Forget gate
The forget gate was introduced to address an issue with the LSTM block.
Issue: When past information is no longer needed, it cannot be deleted and remains stored.
Current state of the LSTM: The CEC stores all past information.
Solution: A function is needed that forgets information at the moment it is no longer needed. $\Rightarrow$ Birth of the forget gate
2-4 Peephole connection
 
Issue: We want to propagate the past information stored in the CEC to other nodes at an arbitrary time, or forget it at an arbitrary time, but the value of the CEC itself has no effect on gate control. What is a peephole connection? $\Rightarrow$ A structure that lets the value of the CEC itself be propagated to the gates via a weight matrix.
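To make the roles of the CEC and the three gates concrete, here is a minimal sketch of one LSTM step. The parameter layout (the four blocks stacked in `W`, `U`, and `b`), the function name `lstm_step`, and the demo values are assumptions for illustration. The peephole connection is left out for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,)."""
    H = h_prev.shape[0]
    gates = W @ x + U @ h_prev + b
    a = np.tanh(gates[0:H])                    # candidate input to the cell
    input_gate = sigmoid(gates[H:2 * H])
    forget_gate = sigmoid(gates[2 * H:3 * H])
    output_gate = sigmoid(gates[3 * H:4 * H])
    c = input_gate * a + forget_gate * c_prev  # CEC update
    h = output_gate * np.tanh(c)
    return h, c

# Demo: input gate closed and forget gate fully open, so the cell keeps its value
D, H = 3, 2
W = np.zeros((4 * H, D))
U = np.zeros((4 * H, H))
b = np.zeros(4 * H)
b[H:2 * H] = -50.0      # input gate close to 0
b[2 * H:3 * H] = 50.0   # forget gate close to 1
c_prev = np.array([0.3, -0.7])
h, c = lstm_step(np.ones(D), np.zeros(H), c_prev, W, U, b)
```

Driving the forget gate toward 0 instead would erase `c_prev`, which is the "forget at the right moment" behavior described in 2-3.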
Section3) GRU
 
(Background) LSTM had the problem that the number of parameters was large and the calculation load was high. $\Rightarrow$ Birth of GRU
What is GRU? $\Rightarrow$ A structure that greatly reduces the number of parameters compared with the conventional LSTM, while its accuracy can be expected to be equal to or better than that of LSTM.
Merit $\Rightarrow$ Computational load is low.
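The parameter saving can be seen in a small sketch of one GRU step. The gate convention used here (update gate, reset gate, tanh candidate) is one common formulation and is an assumption, as are the stacked parameter layout and the names `gru_step` and `param_count`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W, U, b):
    """One GRU step. W: (3H, D), U: (3H, H), b: (3H,), stacked as update, reset, candidate."""
    H = h_prev.shape[0]
    Wx = W @ x + b
    update_gate = sigmoid(Wx[0:H] + U[0:H] @ h_prev)
    reset_gate = sigmoid(Wx[H:2 * H] + U[H:2 * H] @ h_prev)
    candidate = np.tanh(Wx[2 * H:3 * H] + U[2 * H:3 * H] @ (reset_gate * h_prev))
    return (1.0 - update_gate) * h_prev + update_gate * candidate

def param_count(num_blocks, D, H):
    # Each block has an input matrix (H x D), a recurrent matrix (H x H), and a bias (H)
    return num_blocks * (H * D + H * H + H)

D, H = 10, 20
lstm_params = param_count(4, D, H)  # input, forget, output gates plus cell candidate
gru_params = param_count(3, D, H)   # update, reset gates plus candidate

rng = np.random.default_rng(0)
W = rng.standard_normal((3 * H, D)) * 0.1
U = rng.standard_normal((3 * H, H)) * 0.1
b = np.zeros(3 * H)
h = gru_step(rng.standard_normal(D), np.zeros(H), W, U, b)
```

Under these assumptions the GRU has three weight blocks where the LSTM has four, and it carries no separate cell state.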
 
Section5) Seq2Seq
 What is Seq2seq used for specifically? $\Rightarrow$ It is used for machine dialogue and machine translation.
 What is Seq2seq? $\Rightarrow$ An Encoder-Decoder model.
5-1 Encoder RNN
A structure that takes the text data entered by the user, split into tokens such as words, and passes it on.
Tokenizing: Split the sentence into tokens such as words and assign an ID to each token.
 Embedding: Convert each ID into a distributed representation vector representing the token.
 Encoder RNN: Input the vectors to the RNN in order.
Encoder RNN processing procedure: Input vec1 to the RNN and output the hidden state. This hidden state and the next input vec2 are then input to the RNN, which outputs the next hidden state; this is repeated.
The hidden state after the last vec has been input is saved as the final state. This final state is called the thought vector, and it is a vector that expresses the meaning of the input sentence.
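The three stages above (tokenizing, embedding, encoder RNN) can be sketched as follows. The toy vocabulary, whitespace tokenization, tanh activation, and random weights are all assumptions for illustration; `tokenize` and `encode` are hypothetical helper names.

```python
import numpy as np

def tokenize(sentence, vocab):
    # Whitespace tokenization, then map each token to its ID
    return [vocab[token] for token in sentence.lower().split()]

def encode(token_ids, E, W_in, W):
    h = np.zeros(W.shape[0])             # initial hidden state
    for token_id in token_ids:
        vec = E[token_id]                # embedding lookup: ID -> vector
        h = np.tanh(vec @ W_in + h @ W)  # feed the vector and previous hidden state
    return h                             # final state: the thought vector

vocab = {"i": 0, "like": 1, "apples": 2}
rng = np.random.default_rng(0)
E = rng.standard_normal((len(vocab), 4))   # one 4-dimensional vector per token
W_in = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 8))

ids = tokenize("I like apples", vocab)
thought = encode(ids, E, W_in, W)
thought_reversed = encode(ids[::-1], E, W_in, W)
```

Because the hidden state is updated token by token, feeding the same tokens in a different order produces a different thought vector.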
 
5-2 Decoder RNN
A structure in which the system generates output data for each token such as a word.
Decoder RNN processing
 
5-3 HRED
Issue with Seq2seq: It can only answer one question at a time. There is no context across questions; it just keeps responding. HRED was created to solve this.
What is HRED? It generates the next utterance from the past n-1 utterances. System: Parakeets are cute, aren't they? User: Yeah System: I know, parakeets are cute. In Seq2seq, the response ignored the context of the conversation, but in HRED the response follows the flow of the previous utterances, so more human-like sentences are generated.
5-4 VHRED
What is VHRED? $\Rightarrow$ HRED with the concept of latent variables from VAE added.
A structure that solves the problems of HRED by adding the VAE concept of latent variables.
5-5 VAE
5-5-1 Autoencoder
What is an autoencoder? $\Rightarrow$ A form of unsupervised learning. The input at training time is therefore only the training data; no teacher (label) data is used.
Concrete example of an autoencoder: In the case of MNIST, it is a neural network that takes a 28x28 image of a digit as input and outputs the same image.
Autoencoder structure
Structural drawing 
 <img width="25%" alt="Screenshot 2020-01-03 22.40.52.png" src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/357717/9a74cfde-b8f7-f0d4-7bca-49d192d811e4.png">
Autoencoder structure explanation $\Rightarrow$ The Encoder is a neural network that converts input data into the latent variable z. Conversely, the Decoder is a neural network that restores the original image from the latent variable z.
**Benefits $\Rightarrow$ Dimensionality reduction**
 
Section6) Word2vec
Create a vocabulary from the training data. Ex) I want to eat apples. I like apples. {apples, eat, I, like, to, want} * This small vocabulary is for illustration only; in practice the vocabulary has as many entries as there are words in the dictionary.
Merit
Learning distributed representations from large-scale data has become feasible with realistic computation speed and memory usage.
✗: A weight matrix of size vocabulary × vocabulary is created.
○: A weight matrix of size vocabulary × arbitrary word-vector dimension is created.
Section7) Attention Mechanism
Challenge: The problem with seq2seq is that it has difficulty with long sentences. Whether the input is 2 words or 100 words, seq2seq has to pack it into a fixed-dimensional vector.
Solution: We need a mechanism in which the dimension of the sequence's internal representation grows as the sentence gets longer.
Attention mechanism: a mechanism that learns the degree of relevance, that is, which input word is related to which output word.
Concrete example of the attention mechanism $\Rightarrow$ * "a" has a low degree of relevance to begin with, while "I" has a high degree of relevance to "私" (the Japanese word for "I").
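The idea of a learned "degree of relevance" can be illustrated with a small sketch. The text does not specify a scoring function, so dot-product scoring followed by a softmax is an assumption here. The hand-picked states and the function name `attend` are also hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def attend(decoder_state, encoder_states):
    scores = encoder_states @ decoder_state  # one relevance score per input token
    weights = softmax(scores)                # relevance as a probability distribution
    context = weights @ encoder_states       # weighted sum of encoder states
    return weights, context

# Three input tokens, each with a 4-dimensional encoder state
encoder_states = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
decoder_state = np.array([0.0, 5.0, 0.0, 0.0])  # aligned with the second token
weights, context = attend(decoder_state, encoder_states)
```

The context vector has a fixed size, but it is recomputed at every output step from all encoder states, so a long input is no longer squeezed into one final vector.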
 
[P11] Confirmation test Answer the size of the output image when an input image of size 5x5 is convolved with a filter of size 3x3. The stride is 2 and the padding is 1. ⇒ [Discussion] Answer: 3 ✖️ 3. Symbols: input height (H), input width (W), output height (OH), output width (OW), filter height (FH), filter width (FW), stride (S), padding (P)
[P12] Find dz/dx using the chain rule.
$$z = t^2, \quad t = x + y$$
⇒ [Discussion] It can be calculated as follows.
$$\frac{dz}{dx}=\frac{dz}{dt}\frac{dt}{dx}$$
Since $z = t^2$, differentiating with respect to $t$ gives $\frac{dz}{dt}=2t$.
Since $t = x + y$, differentiating with respect to $x$ gives $\frac{dt}{dx}=1$.
$$\frac{dz}{dx}=2t \cdot 1=2t=2(x+y)$$
With the symbols defined in [P11], the output size is given by:
$$OH =\frac{H+2P-FH}{S}+1 =\frac{5+2 \cdot 1-3}{2}+1=3$$
$$OW =\frac{W+2P-FW}{S}+1 =\frac{5+2 \cdot 1-3}{2}+1=3$$
This is a fixed calculation, so it is worth remembering as a formula.
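The output-size formula from [P11] can be wrapped in a small helper. The function name `conv_output_size` is hypothetical, and the use of floor division for sizes that do not divide evenly is an assumption.

```python
def conv_output_size(input_size, filter_size, stride, padding):
    # (H + 2P - FH) / S + 1, with floor division for the non-divisible case
    return (input_size + 2 * padding - filter_size) // stride + 1

# [P11]: 5x5 input, 3x3 filter, stride 2, padding 1
OH = conv_output_size(5, 3, stride=2, padding=1)
OW = conv_output_size(5, 3, stride=2, padding=1)
```

The same function applies to height and width independently, so non-square inputs or filters only need two calls with different arguments.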
[P23] Confirmation test
RNNs have three main weights. One is the weight applied when computing the current middle layer from the input, and another is the weight applied when computing the output from the middle layer. Explain the remaining weight.
⇒ [Discussion]
The answer is the weight applied from the middle layer at the previous time step to the middle layer at the current time step.

[P37] Find dz/dx using the chain rule.
$$z = t^2, \quad t = x + y$$
⇒ [Discussion] It can be calculated as follows.
$$\frac{dz}{dx}=\frac{dz}{dt}\frac{dt}{dx}$$
Since $z = t^2$, differentiating with respect to $t$ gives $\frac{dz}{dt}=2t$.
Since $t = x + y$, differentiating with respect to $x$ gives $\frac{dt}{dx}=1$.
$$\frac{dz}{dx}=2t \cdot 1=2t=2(x+y)$$
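The chain-rule result dz/dx = 2(x + y) can be checked numerically with a central difference. The function names and the sample point (x, y) = (1.5, -0.5) are arbitrary choices for illustration.

```python
def z_of(x, y):
    t = x + y
    return t ** 2

def dz_dx_analytic(x, y):
    # Chain rule: dz/dt * dt/dx = 2t * 1 = 2(x + y)
    return 2 * (x + y)

def dz_dx_numeric(x, y, eps=1e-6):
    # Central difference approximation of the partial derivative in x
    return (z_of(x + eps, y) - z_of(x - eps, y)) / (2 * eps)

analytic = dz_dx_analytic(1.5, -0.5)
numeric = dz_dx_numeric(1.5, -0.5)
```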
[P46] Confirmation test
Express y1 in the figure below as a mathematical formula using x, s0, s1, win, w, and wout. * Define the bias with any symbol. * Also, apply the sigmoid function g(x) to the output of the intermediate layer.
$$Z_1=\mathrm{sigmoid}(S_0W+x_1W_{(in)}+b)$$
The output layer also uses the sigmoid:
$$y_1=\mathrm{sigmoid}(Z_1W_{(out)}+c)$$
Notation differs from book to book, so focus on understanding the essence.
[P54] Code exercises
 ⇒ [Discussion] The answer is (2)
[Explanation] In an RNN, the intermediate layer output h_{t} depends on the past intermediate layer outputs h_{t-1}, ..., h_{1}. When we partially differentiate the loss function with respect to the weights W and U of the RNN, we need to take that into account. Since dh_{t}/dh_{t-1} = U, U is multiplied in each time we go back one step in time. That is, delta_t = delta_t.dot(U).
[P63] When the sigmoid function is differentiated, the derivative takes its maximum value when the input value is 0. Select the correct value from the options. (1) 0.15 (2) 0.25 (3) 0.35 (4) 0.45
⇒ [Discussion] The answer is (2). Derivative of the sigmoid:
$$(\mathrm{sigmoid})'=(1-\mathrm{sigmoid})(\mathrm{sigmoid})$$
The sigmoid function takes the value 0.5 at input 0, where the derivative is at its maximum, so
$$(\mathrm{sigmoid})'=(1-0.5)(0.5)=0.25$$
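A quick numerical check of this result, assuming the logistic form of the sigmoid. The grid range and resolution are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return (1.0 - s) * s

xs = np.linspace(-5.0, 5.0, 1001)  # grid that includes a point at (approximately) 0
values = d_sigmoid(xs)
x_at_max = xs[np.argmax(values)]
max_value = values.max()
```

Because the derivative never exceeds 0.25, repeated multiplication by it across many layers or time steps shrinks the gradient, which is one source of the vanishing gradient problem.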
[P65] Exercise Challenge

⇒ [Discussion] Correct answer: 1 [Explanation] When the norm of the gradient is larger than the threshold, the gradient is rescaled so that its norm equals the threshold, so the clipped gradient is calculated as gradient × (threshold / norm of gradient). That is, gradient * rate. This is easy to follow: the gradient is simply multiplied by the ratio of the threshold to the norm.
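The explanation above can be sketched as a small function. The name `gradient_clipping` and the example values are assumptions, since the exercise code itself is not reproduced here.

```python
import numpy as np

def gradient_clipping(grad, threshold):
    norm = np.linalg.norm(grad)
    if norm > threshold:
        rate = threshold / norm
        return grad * rate  # same direction, norm rescaled to the threshold
    return grad             # small gradients pass through unchanged

large = gradient_clipping(np.array([3.0, 4.0]), threshold=1.0)  # original norm is 5
small = gradient_clipping(np.array([0.3, 0.4]), threshold=1.0)  # original norm is 0.5
```

Only the length of the gradient changes; its direction is preserved.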
[P79] Confirmation test Suppose you want to enter the following sentence into an LSTM and predict the word that fits in the blank. The word "very" in the sentence is considered to have no effect on the blank prediction even if it is dropped. Which gate is considered to work in such a case? "The movie was interesting. By the way, I'm very hungry, so something ____." ⇒ [Discussion] Correct answer: Forget gate. The forget gate determines how much of the stored past information continues to be taken into account.
[P80] Exercise Challenge

⇒ [Discussion] Correct answer: 3 [Explanation] The new cell state is the sum of the computed input to the cell multiplied by the input gate and the cell state from one step before multiplied by the forget gate. That is, input_gate * a + forget_gate * c.
[P89] Confirmation test Briefly describe the challenges facing LSTMs and CECs.
⇒ [Discussion] LSTM has the problem that the number of parameters is large and the calculation load is high. The CEC has no notion of learning and uses no weights, so it cannot meet the need to propagate the stored past information to other nodes at an arbitrary time, or to forget it at an arbitrary time.
[P91] Exercise Challenge

[P93] Confirmation test Briefly describe the difference between LSTM and GRU. ⇒ [Discussion] LSTM had the problem that the number of parameters was large and the calculation load was high, whereas GRU reduces the parameters and processes faster. However, GRU is not superior in every case, so it is better to compare the two and choose.
[P96] Exercise Challenge
 ⇒ [Discussion]
 Correct answer: 4 [Explanation] In a bidirectional RNN, the feature is the combination of the intermediate layer representations obtained when propagating in the forward and reverse directions, so np.concatenate([h_f, h_b[::-1]], axis=1).
 (Reference) [Learn the np.concatenate syntax here](https://www.sejuku.net/blog/67869)
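A small shape example may help with the `np.concatenate` expression. The values are made up, and it is assumed that `h_b` is stored in the order the backward RNN processed the sequence (last time step first), which is why it is reversed before concatenation.

```python
import numpy as np

# Forward hidden states for t = 0, 1, 2 (3 time steps, hidden size 2)
h_f = np.arange(6).reshape(3, 2)
# Backward hidden states in processing order, i.e. for t = 2, 1, 0
h_b = np.arange(6, 12).reshape(3, 2)

# Reverse h_b along time so that row t lines up with row t of h_f,
# then join the two along the feature axis
h = np.concatenate([h_f, h_b[::-1]], axis=1)
```

With `axis=1` the result keeps one row per time step and doubles the feature size. `axis=0` would instead stack the two sequences end to end.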
[P111] Exercise Challenge
 ⇒ [Discussion]
Correct answer: 1 [Explanation] The word w is a one-hot vector, which is converted into another feature by embedding the word. This can be written as E.dot(w) using the embedding matrix E.
w is a one-hot vector.
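The expression E.dot(w) can be tried on a toy example. The vocabulary size, embedding dimension, and random matrix are assumptions for illustration. E is taken to have shape (embedding dimension, vocabulary size) so that E.dot(w) is well defined.

```python
import numpy as np

vocab_size, embed_dim = 6, 3
rng = np.random.default_rng(0)
E = rng.standard_normal((embed_dim, vocab_size))  # one column per word

word_id = 2
w = np.zeros(vocab_size)
w[word_id] = 1.0                 # one-hot vector for the word

embedded = E.dot(w)              # matrix product with a one-hot vector
looked_up = E[:, word_id]        # direct column lookup
```

Multiplying by a one-hot vector simply selects one column of E, so in practice the product is usually replaced by an index lookup.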
(Reference) Learn the relationship between natural language processing and one-hot here. When the document is large, the one-hot data also becomes large, and processing may not finish in time.
[P120] Confirmation test Briefly describe the difference between seq2seq and HRED, and between HRED and VHRED. ⇒ [Discussion] seq2seq could only answer one question at a time, and HRED was created to solve that problem. As for HRED and VHRED, HRED has the problem that it can only give the same kind of answer every time; VHRED solves this and can answer while varying its expression.
[P129] Confirmation test Fill in the blank in the description of VAE below. Introducing ____ to the latent variable of the autoencoder ⇒ [Discussion] The answer is "random variables": VAE introduces random variables into the latent variable.
[P138] Confirmation test Briefly describe the difference between RNN and word2vec, and between seq2seq and Attention. ⇒ [Discussion] RNN needed to generate a weight matrix of size vocabulary ✖️ vocabulary, but word2vec only needs a weight matrix of size vocabulary ✖️ arbitrary word-vector dimension. As for seq2seq and Attention, seq2seq can only give the same answer to the same question, but Attention can make use of importance and relevance and so can return answers with variation. Through iterative learning, it becomes able to give more accurate answers.
[Video DN60] Exercise Challenge
 ⇒ [Discussion] The answer is (2).
The result is expressed as a representation vector, computed by applying weights to the left and right representation vectors.
simple RNN
Binary addition
Execution result of binary addition

[try] Let's change weight_init_std, learning_rate, hidden_layer_size
weight_init_std 1→10
learning_rate 0.1→0.01
hidden_layer_size 16→32
 Learning got worse.
[try] Let's change the weight initialization method. Try both Xavier and He. (Source change)
python
###########changes##############
#Weight initialization(Bias is omitted for simplicity)
#W_in = weight_init_std * np.random.randn(input_layer_size, hidden_layer_size)
#W_out = weight_init_std * np.random.randn(hidden_layer_size, output_layer_size)
#W = weight_init_std * np.random.randn(hidden_layer_size, hidden_layer_size)
#Weight initialization using Xavier
W_in = np.random.randn(input_layer_size, hidden_layer_size) / (np.sqrt(input_layer_size))
W_out = np.random.randn(hidden_layer_size, output_layer_size) / (np.sqrt(hidden_layer_size))
W = np.random.randn(hidden_layer_size, hidden_layer_size) / (np.sqrt(hidden_layer_size))
#Weight initialization using He
# W_in = np.random.randn(input_layer_size, hidden_layer_size) / (np.sqrt(input_layer_size)) * np.sqrt(2)
# W_out = np.random.randn(hidden_layer_size, output_layer_size) / (np.sqrt(hidden_layer_size)) * np.sqrt(2)
# W = np.random.randn(hidden_layer_size, hidden_layer_size) / (np.sqrt(hidden_layer_size)) * np.sqrt(2)
#####################################
Results using Xavier
Results using He
The two sets of results were nearly the same.
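To see what the two initializations in the source change actually do, the following sketch compares the standard deviation of the sampled weights. The layer sizes and seed are arbitrary assumptions. The scaling follows the code above: Xavier divides by sqrt(n_in), and He multiplies that by sqrt(2).

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 400, 300

# Xavier: standard deviation 1 / sqrt(n_in)
W_xavier = rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)
# He: standard deviation sqrt(2 / n_in)
W_he = rng.standard_normal((n_in, n_out)) / np.sqrt(n_in) * np.sqrt(2)

xavier_std = W_xavier.std()
he_std = W_he.std()
```

The two differ only by a constant factor of sqrt(2), which is consistent with the two runs behaving similarly on a small network like this one.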
[try] Let's change the activation function of the middle layer to ReLU (let's check for gradient explosion)
python changes
     #  z[:,t+1] = functions.sigmoid(u[:,t+1])
        z[:,t+1] = functions.relu(u[:,t+1])
     #  z[:,t+1] = np.tanh(u[:,t+1])
 
tanh (tanh is provided in numpy; let's define its derivative as d_tanh)
python changes Added definition of derivative
def d_tanh(x):
     # Derivative of tanh: 1 - tanh(x)^2
     return 1 - np.tanh(x) ** 2
python changes
     #  z[:,t+1] = functions.sigmoid(u[:,t+1])
     #  z[:,t+1] = functions.relu(u[:,t+1])
     # The forward pass uses tanh itself; d_tanh belongs in the backward pass
        z[:,t+1] = np.tanh(u[:,t+1])
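A quick numerical check of the tanh derivative, using the identity d/dx tanh(x) = 1 - tanh(x)^2 and a central difference. The sample points are arbitrary.

```python
import numpy as np

def d_tanh(x):
    # Derivative of tanh: 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

xs = np.linspace(-3.0, 3.0, 13)
eps = 1e-6
numeric = (np.tanh(xs + eps) - np.tanh(xs - eps)) / (2 * eps)
analytic = d_tanh(xs)
```

In the backward pass of the RNN, this derivative would be evaluated at the pre-activation `u[:,t+1]` wherever the sigmoid derivative was used before.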
 