[PYTHON] Report_Deep Learning (Part 1)

Confirmation test (3-1) Answer the size of the output image when an input image of size 5x5 is convolved with a filter of size 3x3. The stride is 2 and the padding is 1.

answer Output image (height) = (input height + 2 × padding − filter height) / stride + 1 = (5 + 2×1 − 3) / 2 + 1 = 3

Output image (width) = (input width + 2 × padding − filter width) / stride + 1 = (5 + 2×1 − 3) / 2 + 1 = 3

Therefore, the size of the output image is 3 × 3.
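As a quick check, here is a minimal Python sketch of the formula above (the function name `conv_output_size` is just for illustration):

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Output size of a convolution along one dimension."""
    return (input_size + 2 * padding - filter_size) // stride + 1

# 5x5 input, 3x3 filter, stride 2, padding 1 -> 3x3 output
print(conv_output_size(5, 3, stride=2, padding=1))  # 3
```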

Recurrent Neural Network (RNN)

RNN is a neural network that can handle time-series data. Time-series data is data observed at regular intervals in chronological order, with statistical dependencies between the observations. Examples include audio data, text data, and stock price data.

Confirmation test (3-2) The weights in an RNN can be broadly divided into three. One is the weight applied when computing the middle layer from the input, and another is the weight applied when computing the output from the middle layer. Explain the remaining weight.

answer The weight applied from the middle layer at the previous time step to the middle layer at the current time step.

Features of RNN: To handle a time-series model, the network keeps an initial state and the state at the previous time t-1, and computes the state at time t recursively from them. This recursive structure is what an RNN requires.
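A minimal NumPy sketch of this recursive structure (the names `rnn_step`, `W_in`, `W` and the toy sizes are my own illustrations, not the course code):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_in, W, b):
    """One RNN step: the new hidden state from the current input and the previous hidden state."""
    return np.tanh(x_t @ W_in + h_prev @ W + b)

# toy dimensions: input size 3, hidden size 4
rng = np.random.default_rng(0)
W_in, W, b = rng.normal(size=(3, 4)), rng.normal(size=(4, 4)), np.zeros(4)
h = np.zeros(4)                       # initial state
for x_t in rng.normal(size=(5, 3)):   # 5 time steps
    h = rnn_step(x_t, h, W_in, W, b)  # recursive update from the state at t-1
```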

Exercise (3-3, 4)

Running the code, the error decreases gradually and converges over roughly 2000 iterations. As an experiment, I changed hidden_layer_size to 100. The result showed that making the hidden layer excessively large requires considerably more computation before the results converge.

Next, I changed weight_init_std (the standard deviation of the initial weights) to 0.1. This confirmed that the error does not converge when the initial weights are made too small.

Finally, I changed learning_rate to 0.5. This confirmed that the error converges faster when the learning rate is increased.

Next, use Xavier for the weight initialization, change the initial values, and execute. Then initialize the weights with He and execute again.
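For reference, a hedged sketch of how Xavier and He initialization are typically written in NumPy (variable names and sizes are assumptions, not the notebook's code):

```python
import numpy as np

n_in, n_out = 100, 100

# Xavier initialization: scale by 1/sqrt(n_in), suited to sigmoid/tanh activations
W_xavier = np.random.randn(n_in, n_out) / np.sqrt(n_in)

# He initialization: scale by sqrt(2/n_in), suited to ReLU activations
W_he = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
```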

Change the activation function to ReLU and check for gradient explosion. (Gradient explosion: the opposite of the vanishing gradient; the gradient grows exponentially with each backpropagation step and does not converge.) The result was a gradient explosion, and the error did not converge.

Executed using tanh, after defining the derivative of tanh. The result shows that the gradient explodes just as with the ReLU function.
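The derivative of tanh used in this experiment can be written from the standard identity d tanh(x)/dx = 1 − tanh²(x); a minimal sketch:

```python
import numpy as np

def d_tanh(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2
```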

BPTT Abbreviation for Backpropagation Through Time, one form of the backpropagation method, applied to an RNN by unrolling it over time.

review The error backpropagation method computes derivatives backward from the calculation result (output), avoiding unnecessary recursive calculation.

Confirmation test (3-5) Find dz/dx using the chain rule, where z = t^2 and t = x + y.

dz/dx = dz/dt × dt/dx

dz/dt = 2t, dt/dx = 1

Therefore dz/dx = 2t = 2(x + y)

Confirmation test (3-6) Express y1 in the figure below as a mathematical formula using x, s0, s1, Win, W, Wout.
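The figure is not reproduced here; assuming the standard one-step RNN diagram used in this course, with activation functions f, g and bias terms b, c added for completeness, the answer would take the form:

```math
s_1 = f(W_{in}\,x + W\,s_0 + b), \qquad y_1 = g(W_{out}\,s_1 + c)
```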

Challenge of RNNs: It is difficult to learn long time series because the gradient vanishes as you go back in time.

review The vanishing gradient problem is that the gradient becomes smaller and smaller as it propagates toward the lower layers through backpropagation. As a result, parameter updates in the lower layers barely happen, and training does not converge to the optimal values.

Confirmation test (3-9) The derivative of the sigmoid function takes its maximum value when the input is 0. Select the correct value.

answer 0.25 Differentiating the sigmoid function gives sigmoid(x) × (1 − sigmoid(x)); substituting x = 0 gives 0.5 × 0.5 = 0.25.
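A quick numerical check of this value (a minimal sketch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # sigmoid(x) * (1 - sigmoid(x))

print(d_sigmoid(0.0))      # 0.25, the maximum value
```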

Gradient explosion: As confirmed in the exercise above, the gradient increases exponentially with each backpropagation step and does not converge.

LSTM model

CEC (Constant Error Carousel): As a solution to the vanishing and exploding gradient problems, they can be avoided if the gradient is kept at exactly 1. The problem is that the weight of the input data becomes uniform, regardless of its time dependence.

To address the uniform weights, input and output gates are introduced so that the weights can change, which solves the problem.

Challenges of the LSTM model: All past information is stored in the CEC, so even information that is no longer needed cannot be deleted and is kept forever.

As a solution, when past information is no longer needed, a function is provided to delete it at that timing → the forget gate.

Confirmation test (3-10) Suppose you enter the following sentence into an LSTM and predict the word that fits in the blank. The word "very" in the text is considered to have no effect on the blank prediction even if it is dropped. Which gate is likely to act in such a case? "The movie was interesting. By the way, I was so very hungry that something ____."

answer The forget gate, because dropping "very" has no effect on the prediction.

Peephole connection: We want to be able to propagate the past information stored in the CEC to other nodes at any time, or to forget it at any time. → A structure that allows the value of the CEC itself to be propagated via a weight matrix is called a peephole connection.
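A minimal NumPy sketch of one LSTM step with input, forget, and output gates, including peephole terms that read the CEC value (all names, shapes, and the omission of bias terms are my assumptions, not the course implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; p holds the weight matrices and peephole vectors."""
    z = np.concatenate([x, h_prev])
    f = sigmoid(p["Wf"] @ z + p["pf"] * c_prev)   # forget gate: decides what to erase from the CEC
    i = sigmoid(p["Wi"] @ z + p["pi"] * c_prev)   # input gate
    g = np.tanh(p["Wg"] @ z)                      # candidate value to write into the CEC
    c = f * c_prev + i * g                        # CEC update
    o = sigmoid(p["Wo"] @ z + p["po"] * c)        # output gate; peephole sees the updated CEC
    h = o * np.tanh(c)
    return h, c

# toy usage: input size 3, hidden size 2
rng = np.random.default_rng(0)
p = {k: rng.normal(size=(2, 5)) for k in ["Wf", "Wi", "Wg", "Wo"]}
p.update({k: rng.normal(size=2) for k in ["pf", "pi", "po"]})
h, c = lstm_step(rng.normal(size=3), np.zeros(2), np.zeros(2), p)
```

The line `c = f * c_prev + i * g` is the CEC path, and the forget gate `f` is exactly the mechanism that erases past information when it is no longer needed.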

GRU

The LSTM model has the problem that the number of parameters is large and the computational load is high. GRU was devised to solve this: the number of parameters is significantly reduced, and equal or better accuracy can be expected. (In practice it depends on the case, so GRU is not necessarily strictly superior.)

Confirmation test (3-12) Briefly describe the problems that LSTM and CEC have. LSTM: The number of parameters is large and the computational load is heavy, so it takes time to obtain results. CEC: Since all past information is saved, past information that is no longer needed cannot be deleted.

Confirmation test (3-13) Briefly describe the difference between LSTMs and GRUs. Compared with LSTM, GRU has fewer parameters and a lighter computational load. (In practice, both models are run and the more suitable one is selected.)
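For comparison, a minimal sketch of one GRU step (again an illustration, not the course code): there is no separate CEC and only two gates, which is where the reduction in parameters comes from.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU step: no separate cell state, only a hidden state and two gates."""
    z_in = np.concatenate([x, h_prev])
    r = sigmoid(p["Wr"] @ z_in)                                   # reset gate
    z = sigmoid(p["Wz"] @ z_in)                                   # update gate
    h_tilde = np.tanh(p["Wh"] @ np.concatenate([x, r * h_prev]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde                         # mix old and new state

# toy usage: input size 3, hidden size 2
rng = np.random.default_rng(0)
p = {k: rng.normal(size=(2, 5)) for k in ["Wr", "Wz", "Wh"]}
h = gru_step(rng.normal(size=3), np.zeros(2), p)
```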

Bidirectional RNN

A model that improves accuracy by adding future information in addition to past information. Example: Used in sentence prediction and machine translation.

seq2seq A type of Encoder-Decoder model, used for machine dialogue and machine translation.

Encoder-RNN A structure that takes the text data input by the user, splits it into tokens such as words, and passes them in one by one; each token is assigned an ID. Embedding: converts an ID into a distributed-representation vector representing that token.

Process flow Input vector 1 into the RNN and output a hidden state. Input this hidden state together with the next vector 2 into the RNN to output the next hidden state, and repeat this flow. The hidden state obtained when the last vector is input is called the final state; this final state is the thought vector, a vector that expresses the meaning of the input sentence.
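A minimal sketch of this encoder loop (the embedding matrix `E`, weight names, and toy sizes are assumptions for illustration):

```python
import numpy as np

def encode(token_ids, E, W_in, W, b):
    """Encoder-RNN: run over the token IDs and return the final state (thought vector)."""
    h = np.zeros(W.shape[0])
    for tid in token_ids:
        v = E[tid]                         # Embedding: ID -> distributed representation
        h = np.tanh(v @ W_in + h @ W + b)  # feed the vector and previous hidden state to the RNN
    return h                               # final state = thought vector

# toy usage: vocabulary of 10 tokens, embedding size 4, hidden size 6
rng = np.random.default_rng(0)
E, W_in, W, b = rng.normal(size=(10, 4)), rng.normal(size=(4, 6)), rng.normal(size=(6, 6)), np.zeros(6)
thought_vector = encode([3, 1, 4, 1, 5], E, W_in, W, b)
```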

Decoder-RNN A structure in which the system generates output data for each token such as a word.

Process flow From the final state of the Encoder-RNN, output the generation probability of each token (the final state is set as the initial state of the Decoder-RNN and fed through the Embedding). Then randomly select tokens based on the generation probabilities (Sampling). The token selected by sampling is embedded and becomes the next input to the Decoder-RNN. Repeating the above converts the tokens into a string.
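And a matching sketch of the decoder's sampling loop (`bos_id`, `eos_id`, and all parameter names are illustrative assumptions):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def decode(thought_vector, E, W_in, W, b, W_out, bos_id, eos_id, max_len=20):
    """Decoder-RNN: sample one token at a time, feeding each sampled token back in."""
    h, tid, out = thought_vector, bos_id, []         # the final state becomes the initial state
    for _ in range(max_len):
        h = np.tanh(E[tid] @ W_in + h @ W + b)
        probs = softmax(W_out @ h)                   # generation probability of each token
        tid = np.random.choice(len(probs), p=probs)  # Sampling
        if tid == eos_id:
            break
        out.append(tid)
    return out                                       # token IDs, later detokenized into a string

# toy usage: vocabulary of 10 tokens, embedding size 4, hidden size 6
rng = np.random.default_rng(0)
E, W_in, W, b = rng.normal(size=(10, 4)), rng.normal(size=(4, 6)), rng.normal(size=(6, 6)), np.zeros(6)
W_out = rng.normal(size=(10, 6))
token_ids = decode(np.zeros(6), E, W_in, W, b, W_out, bos_id=0, eos_id=1)
```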

Confirmation test (3-15) Choose from the options below that describe seq2seq.

answer 2: A type of Encoder-Decoder model that uses RNN, and is used for models such as machine translation.

HRED Consists of seq2seq + Context-RNN. The Context-RNN is a structure that takes the series of sentences summarized by the Encoder and converts it into a vector representing the entire conversation context so far. This makes it possible to respond with the history of past conversations taken into account.

Challenges of seq2seq It can only answer one question at a time; there is no context to the question, just a one-off response. HRED was devised to produce more human-like utterances.

HRED generates the next utterance from the previous n-1 utterances. Because it responds according to the flow of the preceding utterances, more human-like sentences are generated.

Challenges of HRED ・ There is only surface-level (wording) diversity, not the "flow-like" diversity of real conversation: given the same dialogue history, it can only give the same answer every time. ・ Responses are short and lack information: it tends to learn short, common answers.

VHRED A structure that adds the concept of the VAE's latent variable to HRED, solving HRED's problems by adding the VAE.

Confirmation test (3-16) Briefly describe the difference between seq2seq and HRED, and between HRED and VHRED. Difference between seq2seq and HRED: seq2seq is a kind of Encoder-Decoder model, while HRED is a structure that combines seq2seq with a Context-RNN. seq2seq is a one-question-one-answer model, whereas HRED generates the next utterance from the past dialogue.

Difference between HRED and VHRED: VHRED is a structure in which the VAE's latent variable is added to HRED. It was devised to eliminate HRED's problems, such as the lack of diversity and information.

AE (Auto-Encoder) A type of unsupervised learning. Specific example: for MNIST, a neural network that outputs the same image when a 28 x 28 digit image is input.

Auto-Encoder structure The neural network that converts the input data into a latent variable z is called the Encoder, and the neural network that restores the original image with the latent variable z as input is called the Decoder. As a merit, dimensionality reduction can be performed. Process flow: processing is performed in the order input data → Encoder → latent variable z → Decoder → output data.
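A minimal sketch of this input → Encoder → latent variable z → Decoder → output flow for flattened 28 x 28 MNIST images (the latent size 32 and all names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W_enc = rng.normal(scale=0.1, size=(784, 32))   # Encoder: 784 -> 32 (dimension reduction)
W_dec = rng.normal(scale=0.1, size=(32, 784))   # Decoder: 32 -> 784 (reconstruction)

x = rng.random(784)             # a flattened 28x28 input image
z = sigmoid(x @ W_enc)          # latent variable z
x_hat = sigmoid(z @ W_dec)      # output: reconstruction of the original image
```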

VAE In a normal Auto-Encoder, some data is packed into the latent variable z, but the structure of that state is unknown. Therefore, VAE assumes the probability distribution N(0, 1) for this latent variable z.
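In practice this assumption is usually implemented with the reparameterization trick, where the encoder outputs a mean and log-variance for z; a minimal illustrative sketch (names are assumptions):

```python
import numpy as np

def sample_z(mu, log_var, rng=np.random.default_rng()):
    """Reparameterization trick: z = mu + sigma * epsilon, with epsilon ~ N(0, 1)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# with mu = 0 and log_var = 0 this is exactly a sample from N(0, 1)
z = sample_z(np.zeros(2), np.zeros(2))
```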

Confirmation test (3-19) Answer the blanks in the description below about VAE. "Introducing ____ into the latent variable of the self-encoder."

answer Probability distribution

Word2Vec A variable-length string such as a word cannot be given to an NN as-is, so a vocabulary is created from the training data. For example, a vocabulary of 7 words might be created; in reality, the vocabulary would contain as many words as there are in a dictionary.

merit Learning distributed representations from large-scale data becomes feasible with realistic computation speed and memory usage. Conventional: a weight matrix could only be created with size vocabulary × vocabulary. Word2Vec: a weight matrix of size vocabulary × arbitrary word-vector dimension can be created.
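A small sketch of the difference in weight-matrix shape (the sizes are arbitrary illustrations):

```python
import numpy as np

vocab_size, embed_dim = 7, 3    # e.g. a 7-word vocabulary, 3-dimensional word vectors

# Conventional (one-hot style): a vocabulary x vocabulary weight matrix
W_onehot = np.eye(vocab_size)

# Word2Vec: a vocabulary x arbitrary word-vector-dimension weight matrix
W_embed = np.random.randn(vocab_size, embed_dim)

word_vector = W_embed[2]        # distributed representation of word ID 2
```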

Attention Mechanism seq2seq has difficulty handling long sentences, because whether the input is 2 words or 100 words, it must be packed into a fixed-dimensional vector. As a solution, the Attention Mechanism was devised: a mechanism in which the internal representation of the sequence grows with the length of the sentence, and a structure that learns the degree of relevance between the words of the input and the output.
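A minimal sketch of computing the degree of relevance (attention weights) between the current output step and each input word via dot products and a softmax (shapes and names are assumptions):

```python
import numpy as np

def attention(decoder_state, encoder_states):
    """Degree of relevance between the current output step and each input word."""
    scores = encoder_states @ decoder_state   # dot-product relevance scores, one per input word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax -> attention weights
    context = weights @ encoder_states        # weighted sum of the encoder states
    return weights, context

# toy usage: 5 input words, hidden size 4
rng = np.random.default_rng(0)
weights, context = attention(rng.normal(size=4), rng.normal(size=(5, 4)))
```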

Confirmation test (3-20) Describe the differences between RNN and Word2Vec, and between seq2seq and Attention. Difference between RNN and Word2Vec: an RNN generates a weight matrix only of size vocabulary × vocabulary, whereas Word2Vec generates a weight matrix of size vocabulary × arbitrary word-vector dimension. This makes computation possible with realistic speed and memory usage when using Word2Vec.

Difference between seq2seq and Attention: seq2seq has difficulty handling long sentences and can only handle fixed-length representations, whereas Attention can handle variable lengths by learning the degree of relevance between inputs and outputs.

