[PYTHON] DeepRunning ~ Level7 ~

Level7. Deep learning DAY3

(Course: "A 3-month deep learning course with skills that hold up in the field")

7-1. Review of the whole picture of deep learning

●AlexNet

☆ Confirmation test ☆ ・ 7-1-1 Answer the size of the output image when a 5x5 input image is convolved with a 3x3 filter. The stride is 2 and the padding is 1.

[My answer] 3x3 output image     OH = (5 + 2×1 − 3)/2 + 1 = 3     OW = (5 + 2×1 − 3)/2 + 1 = 3

【answer】 3x3 output image
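As a quick self-check, a minimal sketch of this output-size formula (the helper name is mine, not from the course; the values are the ones from the test):

```python
def conv_output_size(in_size, filter_size, stride=1, padding=0):
    """OH = (H + 2*P - FH) / S + 1 (same formula for the width)."""
    return (in_size + 2 * padding - filter_size) // stride + 1

# 5x5 input, 3x3 filter, stride 2, padding 1 -> 3x3 output
oh = conv_output_size(5, 3, stride=2, padding=1)
ow = conv_output_size(5, 3, stride=2, padding=1)
print(oh, ow)  # 3 3
```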

7-2 Section1 Concept of recurrent neural network

7-2-1 RNN overview

● What is RNN? A neural network that can handle time-series data. Day3_0001RNN.png Day3_0002RNN.png

● What is time-series data? A series of data observed at regular intervals in chronological order, in which statistical dependencies between the observations are recognized. ・ Voice data ・ Text data (monthly number of visitors, etc.), etc.

● About RNN: the key point is that the middle (hidden) layer is very important.

【RNN】

\begin{align}
&u^t = W_{(in)}x^t + Wz^{t-1} + b\\
&z^t = f(W_{(in)}x^t + Wz^{t-1} + b)\\
&v^t = W_{(out)}z^t + c\\
&y^t = g(W_{(out)}z^t + c)
\end{align}

・ u[:,t+1] = np.dot(X, W_in) + np.dot(z[:,t].reshape(1, -1), W)
・ z[:,t+1] = functions.sigmoid(u[:,t+1])
・ np.dot(z[:,t+1].reshape(1, -1), W_out)
・ y[:,t] = functions.sigmoid(np.dot(z[:,t+1].reshape(1, -1), W_out))
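To make the correspondence between the formulas and the code fragments above concrete, here is a minimal NumPy sketch of the forward pass over a short sequence. This is not the course code; shapes, random initial values, and the plain sigmoid are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

T, in_dim, hidden_dim, out_dim = 4, 2, 8, 1
x = np.random.randn(T, in_dim)                        # x^t for t = 0..T-1

W_in  = np.random.randn(in_dim, hidden_dim) * 0.1     # input -> hidden
W     = np.random.randn(hidden_dim, hidden_dim) * 0.1 # hidden -> hidden
W_out = np.random.randn(hidden_dim, out_dim) * 0.1    # hidden -> output
b, c  = np.zeros(hidden_dim), np.zeros(out_dim)

z = np.zeros(hidden_dim)                # z^0: initial hidden state
for t in range(T):
    u = x[t] @ W_in + z @ W + b         # u^t = W_(in) x^t + W z^{t-1} + b
    z = sigmoid(u)                      # z^t = f(u^t)
    v = z @ W_out + c                   # v^t = W_(out) z^t + c
    y = sigmoid(v)                      # y^t = g(v^t)
    print(t, y)
```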

☆ Confirmation test ☆ ・ 7-2-1-1 An RNN network has roughly three kinds of weights. One is the weight applied when computing the current middle layer from the input, and another is the weight applied when computing the output from the middle layer. Explain the remaining weight.

[My answer] Weight from middle layer to middle layer.

【answer】 Weight from the middle layer to the middle layer (W). Weight from the input layer to the middle layer (W_(in)). Weight from the middle layer to the output layer (W_(out)).

● What are the characteristics of RNNs? To work with time-series models, a recursive structure is required that holds the initial state and the past state at time t−1, and from them recursively obtains the state at the next time t.

● Source exercise: binary addition (predicting binary values) Day3_0003simpleRNN.png Day3_0004simpleRNN.png Day3_0005simpleRNN.png Day3_0006simpleRNN.png Day3_0007simpleRNN.png

① Let's change weight_init_std, learning_rate, and hidden_layer_size. Day3_0008simpleRNN_initial.png Day3_0009simpleRNN_initial.png

② Let's change the weight initialization method (see the sketch after this list).
・ Xavier: training seemed to go faster. Day3_0010simpleRNN_Xavier.png Day3_0011simpleRNN_Xavier.png
・ He: I made a coding mistake... glad I got it working. (sweat) Day3_0012simpleRNN_He.png Day3_0013simpleRNN_He.png

③ Let's change the activation function of the middle layer. I couldn't work it out right away, so I referred to the answer.
・ ReLU Day3_0014simpleRNN_ReLU.png Day3_0015simpleRNN_ReLU.png
・ Hyperbolic tangent (tanh) Day3_0016simpleRNN_tanh.png Day3_0017simpleRNN_tanh.png Day3_0018simpleRNN_tanh.png
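For ②, a minimal sketch of how the Xavier and He initializations I tried differ from the plain weight_init_std version. Layer sizes here are illustrative, not the exercise's actual values.

```python
import numpy as np

input_layer_size, hidden_layer_size = 2, 16
weight_init_std = 1.0

# Plain Gaussian initialization (scale set directly by weight_init_std)
W_in_plain  = weight_init_std * np.random.randn(input_layer_size, hidden_layer_size)

# Xavier: scale by 1/sqrt(fan_in), suited to sigmoid/tanh activations
W_in_xavier = np.random.randn(input_layer_size, hidden_layer_size) / np.sqrt(input_layer_size)

# He: scale by sqrt(2/fan_in), suited to ReLU activations
W_in_he     = np.random.randn(input_layer_size, hidden_layer_size) * np.sqrt(2.0 / input_layer_size)
```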

7-2-2 BPTT

● What is BPTT? A method for adjusting the parameters of an RNN ⇒ a form of error backpropagation applied through time.

● Review of the error backpropagation method ・ Let's explain the error backpropagation method to someone else. Day3_0019BPTT.png

☆ Confirmation test ☆ ・ 7-2-2-1 Use the chain rule to find dz/dx.        z = t^2        t = x + y

[My answer]

\begin{align}
\frac{dz}{dx} &= \frac{dz}{dt}\frac{dt}{dx}\\
&=\frac{d}{dt}t^2 \cdot \frac{d}{dx}(x + y)\\
&=2t \cdot 1\\
&=2(x + y)
\end{align}

【answer】 Same as my answer

7-2-3 Mathematical description of BPTT

[Mathematical description 1]

\begin{align}
\frac{\partial E}{\partial W_{(in)}} &= \frac{\partial E}{\partial u^t}\left [ \frac{\partial u^t}{\partial W_{(in)}} \right ]^T = \delta^t[x^t]^T\\
\frac{\partial E}{\partial W_{(out)}} &= \frac{\partial E}{\partial v^t}\left [ \frac{\partial v^t}{\partial W_{(out)}} \right ]^T = \delta^{out,t}[z^t]^T\\
\frac{\partial E}{\partial W} &= \frac{\partial E}{\partial u^t}\left [ \frac{\partial u^t}{\partial W} \right ]^T = \delta^t[z^{t-1}]^T\\
\frac{\partial E}{\partial b} &= \frac{\partial E}{\partial u^t}\frac{\partial u^t}{\partial b}  = \delta^t\\
\frac{\partial E}{\partial c} &= \frac{\partial E}{\partial v^t}\frac{\partial v^t}{\partial c}  = \delta^{out,t}
\end{align}

・ np.dot(X.T, delta[:,t].reshape(1,-1))
・ np.dot(z[:,t+1].reshape(-1,1), delta_out[:,t].reshape(-1,1))
・ np.dot(z[:,t].reshape(-1,1), delta[:,t].reshape(1,-1))

[Mathematical description 2]

\begin{align}
u^t &= W_{(in)}x^t + Wz^{t-1} + b\\
z^t &= f(W_{(in)}x^t + Wz^{t-1} + b)\\
v^t &= W_{(out)}z^t + c\\
y^t &= g(W_{(out)}z^t + c)
\end{align}

・ u[:,t+1] = np.dot(X, W_in) + np.dot(z[:,t].reshape(1, -1), W)
・ z[:,t+1] = functions.sigmoid(u[:,t+1])
・ np.dot(z[:,t+1].reshape(1, -1), W_out)
・ y[:,t] = functions.sigmoid(np.dot(z[:,t+1].reshape(1, -1), W_out))

☆ Confirmation test ☆ ・ 7-2-3-1 Express $y_1$ in the figure below using $x$, $S_0$, $S_1$, $W_{(in)}$, $W$, and $W_{(out)}$.

【answer】    z_1 = sigmoid(S_0W + x_1W_{(in)} + b)    y_1 = sigmoid(z_1W_{(out)} + c)

● Business case: a mirror in which a character appears and gives makeup advice for each part of the face.

[Mathematical description 3]

\frac{\partial E}{\partial u^t} = \frac{\partial E}{\partial v^t}\frac{\partial v^t}{\partial u^t}=\frac{\partial E}{\partial v^t}\frac{\partial \{W_{(out)}f(u^t)+c\}}{\partial u^t}= f'(u^t)W^T_{(out)}\delta^{out,t}=\delta^t

・ delta[:,t] = (np.dot(delta[:,t+1].T, W.T) + np.dot(delta_out[:,t].T, W_out.T)) * functions.d_sigmoid(u[:,t+1])

[Parameter update formula]

\begin{align}
W^{t+1}_{(in)}&=W^t_{(in)}-\varepsilon\frac{\partial E}{\partial W_{(in)}}= W^t_{(in)}-\varepsilon\sum_{z=0}^{T_t}\delta^{t-z}[x^{t-z}]^T\\
W^{t+1}_{(out)}&=W^t_{(out)}-\varepsilon\frac{\partial E}{\partial W_{(out)}}= W^t_{(out)}-\varepsilon\delta^{out,t}[z^t]^T\\
W^{t+1}&=W^t-\varepsilon\frac{\partial E}{\partial W}= W^t-\varepsilon\sum_{z=0}^{T_t}\delta^{t-z}[z^{t-z-1}]^T\\
b^{t+1}&=b^t - \varepsilon\frac{\partial E}{\partial b} = b^t-\varepsilon\sum_{z=0}^{T_t}\delta^{t-z}\\
c^{t+1}&=c^t - \varepsilon\frac{\partial E}{\partial c}= c^t - \varepsilon\delta^{out,t}
\end{align}

・ W_in -= learning_rate * W_in_grad
・ W_out -= learning_rate * W_out_grad
・ W -= learning_rate * W_grad
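Putting the gradient fragments and the update rules together, here is a simplified, self-contained BPTT sketch. It is not the course's exercise code: biases are omitted, a squared-error loss with a sigmoid output is assumed for the delta_out term, and the functions module is replaced by inline helpers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

T, in_dim, hid, out_dim = 4, 2, 8, 1
learning_rate = 0.1
x = np.random.randn(T, in_dim)
d = np.random.rand(T, out_dim)                      # targets (illustrative)

W_in  = np.random.randn(in_dim, hid) * 0.1
W     = np.random.randn(hid, hid) * 0.1
W_out = np.random.randn(hid, out_dim) * 0.1

# ---- forward pass, keeping u, z, y for the backward pass ----
u = np.zeros((T, hid))
z = np.zeros((T + 1, hid))                          # z[0] is the initial state
y = np.zeros((T, out_dim))
for t in range(T):
    u[t] = x[t] @ W_in + z[t] @ W                   # u^t
    z[t + 1] = sigmoid(u[t])                        # z^t
    y[t] = sigmoid(z[t + 1] @ W_out)                # y^t

# ---- backward pass: accumulate gradients through time ----
W_in_grad, W_grad, W_out_grad = np.zeros_like(W_in), np.zeros_like(W), np.zeros_like(W_out)
delta = np.zeros((T + 1, hid))                      # delta[T] stays zero (no future step)
for t in reversed(range(T)):
    delta_out = (y[t] - d[t]) * y[t] * (1 - y[t])   # dE/dv^t (squared error + sigmoid assumed)
    delta[t] = (delta[t + 1] @ W.T + delta_out @ W_out.T) * d_sigmoid(u[t])
    W_out_grad += np.outer(z[t + 1], delta_out)
    W_in_grad  += np.outer(x[t], delta[t])
    W_grad     += np.outer(z[t], delta[t])

# ---- parameter update (gradient descent) ----
W_in  -= learning_rate * W_in_grad
W_out -= learning_rate * W_out_grad
W     -= learning_rate * W_grad
```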

7-2-4 Overview of BPTT

● BPTT can be written compactly as follows; each term can then be expanded.

\begin{align}
E^t&= loss(y^t,d^t)\\
&=loss(g(W_{(out)}z^t+c),d^t)\\
&=loss(g(W_{(out)}f(W_{(in)}x^t+Wz^{t-1}+b)+c),d^t)
\end{align}

Expanding the argument of $f(\cdot)$ recursively gives ...

\begin{align}
&W_{(in)}x^t+Wz^{t-1} + b\\
&= W_{(in)}x^t + Wf(u^{t-1})+ b\\
&= W_{(in)}x^t+Wf(W_{(in)}x^{t-1}+Wz^{t-2}+b)+b
\end{align}

● Code exercise Day3_0021BPTT_code.png

[My answer]    (3) delta_t.dot(V)

【answer】    (2) delta_t.dot(U) When partially differentiating the loss function with respect to the weights W and U of the RNN, you need to take into account that $dh_{t}/dh_{t-1} = U$, i.e. U is multiplied each time we go one step back in time. That is, delta_t = delta_t.dot(U). Day3_0022BPTT_code.png

7-3 Section2 LSTM

● RNN issues: the further you go back in time, the more the gradient vanishes. ⇒ It is difficult to learn long time series.

● Solution: LSTM solves this by changing the network structure itself.

● Review of the vanishing gradient problem: as error backpropagation proceeds toward the lower layers, the gradient becomes gentler and gentler. In the gradient-descent update, the lower-layer parameters hardly change, and training does not converge to the optimum values.

☆ Confirmation test ☆ ・ 7-3-1 The derivative of the sigmoid function takes its maximum value when the input is 0. Select the correct value from the options.        (1) 0.15 (2) 0.25 (3) 0.35 (4) 0.45

[My answer]     (2)0.25((1-0.5)×0.5)

【answer】     (2)0.25

● Gradient explosion: the gradient increases exponentially as it propagates back through the layers.
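A common countermeasure, and the subject of the exercise below, is gradient clipping: when the gradient norm exceeds a threshold, rescale the gradient. A minimal sketch (the threshold value is illustrative):

```python
import numpy as np

def gradient_clipping(grad, threshold):
    """Rescale grad so that its L2 norm does not exceed threshold."""
    norm = np.linalg.norm(grad)
    rate = threshold / norm
    if rate < 1:
        return grad * rate          # the exercise's answer: gradient * rate
    return grad

g = np.array([3.0, 4.0])            # norm = 5
print(gradient_clipping(g, 1.0))    # [0.6 0.8], norm clipped to 1
```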

● Exercise Challenge Day3_0023LSTM_challenge.png [My answer]    (3)gradient/threshold

【answer】    (1) gradient * rate Just multiply the gradient by the rate. Day3_0023LSTM_challenge_ans.png

7-3-1 CEC Day3_0024LSTM_whole.png

● CEC As a solution to gradient vanishing and gradient explosion: both can be solved if the gradient is kept at exactly 1.

● CEC issues: the weight on the input data is uniform regardless of time dependence. ⇒ The learning characteristic of a neural network is lost. (Learning is lost in the first place.)

・ Input layer ⇒ hidden-layer weights ・・・ input weight collision
・ Hidden layer ⇒ output-layer weights ・・・ output weight collision

7-3-2 Input gate and output gate

● Input gate Day3_0025LSTM_gate.png
● Output gate Day3_0026LSTM_gate.png
● Role of the input/output gates: by adding input and output gates, the weight of the input value to each gate can be changed via the weight matrices W and U.

⇒ By using W and U, differences in weighting can be made, which solves the CEC issue.

7-3-3 Forget gate

● Current status of LSTM: the CEC stores all past information. Day3_0027LSTM_gate.png
● LSTM issue: even when past information is no longer needed, it cannot be deleted and continues to be stored.

● Solution: when past information is no longer needed, a function to forget it at that timing is required. ⇒ The forget gate was born.

☆ Confirmation test ☆ ・ 7-3-3-1 Suppose you input the following sentence into an LSTM and predict the word that fills the blank. The word "very" is considered to have no effect on the blank prediction even if it disappears. Which gate works in such a case? "The movie was interesting. By the way, I'm very hungry, so I want to ____ something."

[My answer] Forget gate

【answer】 A word that does not affect the prediction ⇒ forget gate

● Exercise Challenge Day3_0028LSTM_challenge.png [My answer]    (3) input_gate * a + forget_gate * c

【answer】    (3) input_gate * a + forget_gate * c The new cell state is expressed as the sum of the computed input to the cell multiplied by the input gate, and the cell state one step earlier multiplied by the forget gate. Day3_0029LSTM_challenge_ans.png
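Putting the gates together, a minimal NumPy sketch of one LSTM step consistent with input_gate * a + forget_gate * c above. Names and shapes are illustrative (not the course code), the four gate weights are packed into one matrix as a design choice, and the peephole connection described in 7-3-4 is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,)."""
    H = h_prev.shape[0]
    gates = W @ x + U @ h_prev + b
    a           = np.tanh(gates[0:H])          # candidate cell input
    input_gate  = sigmoid(gates[H:2 * H])
    forget_gate = sigmoid(gates[2 * H:3 * H])
    output_gate = sigmoid(gates[3 * H:4 * H])
    c = input_gate * a + forget_gate * c_prev  # new cell state (CEC)
    h = output_gate * np.tanh(c)               # new hidden state
    return h, c

D, H = 3, 5
h, c = np.zeros(H), np.zeros(H)
W, U, b = np.random.randn(4 * H, D) * 0.1, np.random.randn(4 * H, H) * 0.1, np.zeros(4 * H)
h, c = lstm_step(np.random.randn(D), h, c, W, U, b)
```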

7-3-4 Peephole connection

● Challenge: we want to propagate the past information stored in the CEC to the other nodes at arbitrary times, or to forget it at arbitrary times.

The value of CEC itself does not affect gate control.

⇒ Peephole connection: a structure that allows the CEC's own value to be propagated to the gates via a weight matrix.

It affects the input gate, forget gate, and output gate. Day3_0030LSTM_nozoki.png

7-4 Section3 GRU (Gated Recurrent Unit)

● LSTM issues: the number of parameters is large and the computational load is high. ⇒ GRU as a solution.

● What is GRU? The conventional LSTM had many parameters, so the computational load was heavy. GRU is a structure with significantly fewer parameters that can still be expected to achieve the same or better accuracy.

⇒ Its merit is the low computational load.

● Overview of GRU Day3_0031LSTM_GRU.png

☆ Confirmation test ☆ ・ 7-4-1 Briefly describe the issues facing LSTM and CEC.

[My answer] LSTM ・・・ The number of parameters is large and the computational load is high. CEC ・・・ All past information is stored and cannot be forgotten.

【answer】 LSTM ・・・ The number of parameters is large and the computational load is high. CEC ・・・ The weights are uniform (no time-dependent weighting), so the optimum parameters cannot be learned.

● Exercise Challenge Day3_0032LSTM_GRU_challenge.png [My answer]    (3)z * h * h_bar

【answer】    (4)(1-z) * h + z * h_bar I was at a loss ...

The new intermediate state is represented as a linear sum of the intermediate representation one step earlier and the newly computed intermediate representation. That is, using the update gate z, it can be written as (1-z) * h + z * h_bar.

⇒ Since I got it wrong, I will reconfirm the whole picture ↓↓↓ Day3_0033LSTM_GRU_challenge_ans.png
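A minimal NumPy sketch of one GRU step, consistent with (1 - z) * h + z * h_bar above. Names and shapes are illustrative, the reset gate follows the standard formulation, and the sign convention matches the exercise's answer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    z = sigmoid(W_z @ x + U_z @ h_prev)            # update gate
    r = sigmoid(W_r @ x + U_r @ h_prev)            # reset gate
    h_bar = np.tanh(W_h @ x + U_h @ (r * h_prev))  # candidate state
    return (1 - z) * h_prev + z * h_bar            # new intermediate state

D, H = 3, 5
params = [np.random.randn(H, D) * 0.1 if i % 2 == 0 else np.random.randn(H, H) * 0.1
          for i in range(6)]                       # W_z, U_z, W_r, U_r, W_h, U_h
h = gru_step(np.random.randn(D), np.zeros(H), *params)
```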

☆ Confirmation test ☆ ・ 7-4-2 Briefly describe the difference between LSTM and GRU.

[My answer] Since the number of parameters is different, the amount of calculation will be different. The accuracy does not change much.

【answer】 In relative terms, LSTM has many parameters and GRU has few. ⇒ It is necessary to actually run both and verify, checking where there are many parameters and where there are few.

7-5 Section4 Bidirectional RNN

● Bidirectional RNN: a model that improves accuracy by adding future information in addition to past information. Day3_0034Bidirectional_RNN.png ・ Practical examples: polishing sentences, machine translation, etc.

● Exercise Challenge Day3_0035Bidirectional_RNN_challenge.png [My answer]    (4)np.concatenate([h_f, h_b[::-1]], axis=1)

【answer】    (4) np.concatenate([h_f, h_b[::-1]], axis=1) In a bidirectional RNN, the feature is the concatenation of the hidden-layer representations obtained when propagating in the forward and reverse directions, hence np.concatenate([h_f, h_b[::-1]], axis=1). Day3_0036Bidirectional_RNN_challenge_ans.png
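A minimal sketch of this feature construction: run a simple cell forward over the sequence and over the reversed sequence, then concatenate. The tanh cell is an assumption, and the weights are shared between directions only for brevity (in practice the two directions have separate weights).

```python
import numpy as np

def rnn_hidden_states(xs, W_in, W):
    """Return the hidden state at each time step of a simple tanh RNN."""
    h = np.zeros(W.shape[0])
    hs = []
    for x in xs:
        h = np.tanh(W_in @ x + W @ h)
        hs.append(h)
    return np.array(hs)                          # shape (T, H)

T, D, H = 6, 3, 4
xs = np.random.randn(T, D)
W_in, W = np.random.randn(H, D) * 0.1, np.random.randn(H, H) * 0.1

h_f = rnn_hidden_states(xs, W_in, W)             # forward pass
h_b = rnn_hidden_states(xs[::-1], W_in, W)       # backward pass (reversed input)

# Re-reverse the backward states so the time steps align, then concatenate
hs = np.concatenate([h_f, h_b[::-1]], axis=1)    # shape (T, 2H)
print(hs.shape)  # (6, 8)
```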

7-6 Section5 Seq2Seq

7-6-1 Seq2Seq

● What is Seq2Seq? A type of Encoder-Decoder model. Day3_0037Seq2Seq.png
● Specific uses: very characteristically, machine dialogue and machine translation.

7-6-2 Encoder RNN

● Encoder RNN: a structure that splits the text data input by the user into tokens such as words and passes them in. Day3_0038EncoderRNN.png
● Tokenizing: divide the sentence into tokens such as words, and convert each token into an ID.

● Embedding: converts each ID into a distributed-representation vector for that token. ⇒ Turns it into numbers.

● Encoder RNN: Input vectors to RNN in order.

● Encoder RNN processing procedure ・ Input vec1 into the RNN and output a hidden state. Then input this hidden state together with the next input vec2 into the RNN and output the next hidden state; repeat this flow.

・ The hidden state obtained when the last vec is input is saved as the final state. This final state is called the thought vector; it is a vector that represents the meaning of the input sentence.

vec (input) → hidden state (output) → last vec (input) → final state (output) ... called thought vector.
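A minimal sketch of this Encoder RNN loop: feed each token vector (vec) together with the previous hidden state, and keep the last hidden state as the thought vector. The tanh cell and shapes are placeholders, not the course code.

```python
import numpy as np

def encoder_rnn(token_vecs, W_in, W):
    """Run token vectors through a simple RNN; return the final state (thought vector)."""
    h = np.zeros(W.shape[0])
    for vec in token_vecs:                 # vec1, vec2, ... in order
        h = np.tanh(W_in @ vec + W @ h)    # hidden state from vec + previous hidden state
    return h                               # final state = thought vector

D, H = 4, 6
token_vecs = np.random.randn(5, D)         # embedded tokens of the input sentence
W_in, W = np.random.randn(H, D) * 0.1, np.random.randn(H, H) * 0.1
thought_vector = encoder_rnn(token_vecs, W_in, W)
```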

7-6-3 Decoder RNN

● Decoder RNN: a structure that generates the output data token by token (e.g. word by word). Day3_0039DecoderRNN.png
● Decoder RNN processing 1. Decoder RNN: from the final state (thought vector) of the Encoder RNN, output the **generation probability** of each token. Set the final state as the initial state of the Decoder RNN and input the Embedding. ⇒ Predict the next word from the vectorized representation.

2. Sampling: randomly select a token based on the generation probability.

3. Embedding: embed the token selected in 2 and use it as the next input to the Decoder RNN.

4. Detokenize: repeat 1–3 and convert the tokens obtained in 2 into a character string.

⇒ For example, predict the token that comes after "belly (お腹)" and adopt the one with high generation probability each time: "ga (が)" → "hurts (痛い)" → "desu (です)" → "。". Day3_0040Seq2Seq-whole.png
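A minimal sketch of the Decoder side of steps 1–4: starting from the thought vector, repeatedly output generation probabilities over the vocabulary, sample a token, embed it, and feed it back. The vocabulary, random weights, and stopping rule are illustrative only.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

vocab = ["belly", "ga", "hurts", "desu", "。", "<eos>"]
D, H, V = 4, 6, len(vocab)
E = np.random.randn(D, V) * 0.1            # embedding matrix (dim x vocab)
W_in, W, W_out = (np.random.randn(H, D) * 0.1,
                  np.random.randn(H, H) * 0.1,
                  np.random.randn(V, H) * 0.1)

h = np.random.randn(H)                     # 1. initialize with the encoder's thought vector
token_id = 0                               # start from "belly"
output = []
for _ in range(10):
    vec = E[:, token_id]                   # 3. Embedding of the previous token
    h = np.tanh(W_in @ vec + W @ h)
    probs = softmax(W_out @ h)             # 1. generation probability of each token
    token_id = int(np.random.choice(V, p=probs))   # 2. Sampling
    if vocab[token_id] == "<eos>":
        break
    output.append(vocab[token_id])
print(" ".join(output))                    # 4. Detokenize
```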

☆ Confirmation test ☆ ・ 7-6-3-1 Select the one that explains seq2seq from the following options.

(1) It configures RNNs in the forward and reverse directions with respect to time and uses these two intermediate-layer representations as features.
(2) A type of Encoder-Decoder model using RNNs, used for models such as machine translation.
(3) A neural network that recursively applies (with shared weights) an operation that builds a representation vector (phrase) from adjacent words over a tree structure such as a syntax tree, and thereby obtains a representation vector for the entire sentence.
(4) A type of RNN that solves the vanishing gradient problem, which is an issue in simple RNNs, by introducing the concepts of CEC and gates.

[My answer]    (2)

【answer】    (2)

● Exercise Challenge Day3_0041Seq2Seq-challenge.png [My answer]    (3) w.dot(E.T)

【answer】    (1) E.dot(w) The word w is a one-hot vector; word embedding converts it into another feature. This can be written as E.dot(w) using the embedding matrix E.
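A minimal sketch of this E.dot(w) conversion: the one-hot vector w simply picks out one column of the embedding matrix E. Sizes are illustrative.

```python
import numpy as np

vocab_size, embed_dim = 6, 3
E = np.random.randn(embed_dim, vocab_size)   # embedding matrix

w = np.zeros(vocab_size)
w[2] = 1.0                                   # one-hot vector for word id 2

e = E.dot(w)                                 # embedded (distributed) representation
print(np.allclose(e, E[:, 2]))               # True: equivalent to selecting column 2
```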

⇒ Take the inner product; converting the word into a feature quantity is what gives it meaning. Day3_0042Seq2Seq-challenge-ans.png

7-6-4 HRED

● Issues of Seq2Seq: it can only answer one question at a time. ⇒ There is no context for the question, just a one-off response.

⇒ HRED gives a human touch.

● What is HRED? Generate the next utterance from the past n-1 utterances.

⇒ While Seq2Seq responds while ignoring the context of the conversation, HRED responds following the flow of the preceding utterances, so more human-like sentences are generated.

【Seq2Seq + Context RNN】 Context RNN: a structure that takes the sequence of sentence vectors summarized by the Encoder and converts it into a vector representing the entire conversation context so far. It vectorizes the context.

⇒ You can reply with the history of past utterances taken into consideration.

● HRED issues: HRED has only stochastic diversity at the surface level of the text; it has no diversity in the "flow" of the conversation. ⇒ Even if the same context (utterance list) is given, it can only produce the same conversational flow each time. HRED also tends to give short, information-poor answers ⇒ it tends to choose short, common answers.

7-6-5 VHRED

● What is VHRED? HRED with the concept of VAE latent variables added. (???)

⇒ A structure that solves HRED's issues by adding the concept of VAE latent variables.

In particular, it improves on the part of HRED's issues where similar answers keep being given.

☆ Confirmation test ☆ ・ 7-6-5-1 Briefly describe the difference between Seq2Seq and HRED, and HRED and VHRED.

[My answer] Seq2Seq ignores the context, but HRED produces the next utterance, More human sentences can be generated.

HRED tends to choose common answers; VHRED solves HRED's problems.

【answer】 Seq2Seq can only answer one question at a time; HRED solves that problem. HRED answers within the context of the conversation but gives a similar answer every time; VHRED improves on this and gives varied answers.

7-6-6 AE (Autoencoder)

● What is AE? A type of unsupervised learning. The data used during learning is training data only, with no teacher data ⇒ one of its characteristic features.

In the case of MNIST, it is a neural network that takes a 28 × 28 (= 784-pixel) digit image as input and outputs the same image.

● Structural explanation: an Encoder neural network converts the input data into a latent variable $z$, and conversely a Decoder neural network restores the original image using the latent variable $z$ as input.
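A minimal sketch of this Encoder/Decoder structure for a 784-dimensional input: one layer each way, untrained, with sizes and activations chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

in_dim, z_dim = 784, 32                      # 28x28 input, small latent variable z
W_enc = np.random.randn(z_dim, in_dim) * 0.01
W_dec = np.random.randn(in_dim, z_dim) * 0.01

x = np.random.rand(in_dim)                   # stand-in for a flattened MNIST image
z = sigmoid(W_enc @ x)                       # Encoder: input -> latent variable z
x_hat = sigmoid(W_dec @ z)                   # Decoder: z -> reconstructed image

loss = np.mean((x - x_hat) ** 2)             # training would make x_hat close to x
```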

● Merit: dimensionality reduction is possible. If the dimension of $z$ is smaller than that of the input data, this can be regarded as dimensionality reduction. Day3_0043AE.png

● Grasp the whole picture from various angles (leading companies) Cloud ... AWS (easy to use), GCP (AWS is said to be easier to use in terms of price).

TensorFlow, made by Google, is easy to use. Hardware ... NVIDIA is good. Solutions: TOYOTA's autonomous-driving technology, etc.

7-6-7 VAE

● What is VAE? With a normal autoencoder, the data is pushed into the latent variable $z$, but we do not know what structure that latent space has.

⇒ VAE assumes the probability distribution $z \sim N(0,1)$ for the latent variable $z$. (The answer words then come out according to the context.)

⇒ VAE makes it possible to push the data into the latent variable $z$ while imposing the structure of a probability distribution on it.

☆ Confirmation test ☆ ・ 7-6-7-1 Answer the words that apply to the blanks in the following explanation about VAE.

Introducing ____ into the latent variable of the autoencoder.

[My answer] Structure of probability distribution

【answer】 Probability distribution

7-7 Section6 Word2vec

● RNN issues: variable-length strings such as words cannot be given to a NN as they are. ⇒ A fixed-length representation is required.

● Word2vec creates a vocabulary from the training data; only the necessary vocabulary needs to be held.

(Example) I want to eat apples. I like apples.      ⇒{apples,eat,I,like,to,want}

● one-hot vector: when "apples" is input, a vector such as (1, 0, 0, 0, 0, 0) over the vocabulary above is fed to the input layer.
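A minimal sketch of going from the example sentences to the vocabulary and a one-hot vector for "apples". The naive whitespace tokenization and lower-casing are assumptions for illustration.

```python
import numpy as np

sentences = ["I want to eat apples.", "I like apples."]
words = sorted({w.strip(".").lower() for s in sentences for w in s.split()})
# ['apples', 'eat', 'i', 'like', 'to', 'want']

word_to_id = {w: i for i, w in enumerate(words)}

one_hot = np.zeros(len(words))
one_hot[word_to_id["apples"]] = 1.0
print(one_hot)   # [1. 0. 0. 0. 0. 0.]
```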

● Advantages of Word2vec: learning distributed representations of large-scale data becomes feasible with realistic computation speed and memory usage. ⇒ Conventionally, a weight matrix of size (vocabulary × vocabulary) had to be created; now a weight matrix of size (vocabulary × arbitrary word-vector dimension) is enough. (The meaning of a word can still be captured even if the dimension is reduced a lot.) Day3_0044AE.png

7-8 Section7 Attention Mechanism

● Issues of Seq2Seq: it has difficulty dealing with long sentences. Whether the input has 2 words or 100 words, it must be packed into a fixed-dimension vector.

● Solution: a mechanism is needed in which the dimension of the internal representation of the sequence grows as the sentence gets longer.    ⇒ Attention Mechanism

⇒ "Which word of input and output is related" It is a mechanism to learn the degree of relevance of.

By raising the accuracy of this relevance, the model becomes able to respond appropriately. Day3_0045Attenton Mechanism.png
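A minimal sketch of the relevance computation: score each encoder hidden state against the current decoder state, turn the scores into weights with a softmax, and take the weighted sum as the context. Dot-product scoring is one common choice and is assumed here; it is not necessarily the course's formulation.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

T, H = 5, 8
encoder_states = np.random.randn(T, H)   # one hidden state per input word
decoder_state  = np.random.randn(H)      # current output-side state

scores  = encoder_states @ decoder_state # relevance of each input word
weights = softmax(scores)                # attention weights (sum to 1)
context = weights @ encoder_states       # weighted sum; adapts to any sentence length T
print(weights, context.shape)
```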

☆ Confirmation test ☆ ・ 7-8-1 Briefly describe the difference between RNN and word2vec, and seq2seq and Attention.

[My answer] The difference between RNN and Word2vec is the amount of computation. The difference between Seq2Seq and Attention is how they handle long sentences.

【answer】 The difference with Word2vec is that the weights for words can be computed with realistic resource consumption: (number of vocabulary × number of vocabulary) ⇒ (number of vocabulary × arbitrary word-vector dimension). The difference between Seq2Seq and Attention is that Attention can capture long sentences and remain valid as a translation.

● Exercise Challenge Day3_0046Attenton Mechanism_challenge.png [My answer] This is not listed in the PowerPoint document ...    (4)W.dot(np.maximum(left,right))

【answer】    (2) W.dot(np.concatenate([left, right])) The left and right representations are concatenated into a single vector, which is then weighted by W to form the feature.

● Gaining practical experience: it is better to practice where existing data is available so that the skills stick. Studying the structure of AI is also worthwhile, and providing what you have analyzed and modeled as a service is valuable experience too.
