[PYTHON] DeepRunning ~ Level 6 ~

Level6. Deep learning DAY2

(Course banner: "A deep learning course whose skills can be put to practical use in the field in 3 months")

6-1. Review of error back propagation method

☆ Confirmation test ☆ 6-1-1 Use the chain rule to find $dz/dx$, where z = t^2 and t = x + y.

[My answer]

\begin{align}
   \frac{dz}{dx} &= \frac{dz}{dt}\frac{dt}{dx}\\
   &= \frac{d}{dt}t^2 \frac{d}{dx}(x + y)\\
   &= 2t(1)\\
   &= 2t\\
   &= 2(x + y)
\end{align}

【answer】 It was the same as my answer.

Day2_0001backpropagation.png ● Vanishing gradient problem As backpropagation proceeds toward the lower layers (output layer ⇒ input layer), the gradient becomes smaller and smaller, so the parameters of the lower layers are barely updated and training does not converge to the optimal values.

● Sigmoid function (activation function) A function that changes gradually between 0 and 1, so it can express the strength of a signal rather than just on/off.

☆ Confirmation test ☆ 6-1-2 When the sigmoid function is differentiated, its derivative takes its maximum value when the input is 0. Select the correct value from the options.     (1) 0.15 (2) 0.25 (3) 0.35 (4) 0.45

[My answer] (1 - 0.5) × 0.5 = 0.25 ... (2)

【answer】 $f'(u) = (1 - \mathrm{sigmoid}(u)) \cdot \mathrm{sigmoid}(u)$, so $(1 - 0.5) \times 0.5 = 0.25$.
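As a quick numerical check, here is a minimal sketch (the helper names are my own, not from the course):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def sigmoid_grad(u):
    # derivative of the sigmoid: f'(u) = (1 - f(u)) * f(u)
    s = sigmoid(u)
    return (1.0 - s) * s

u = np.linspace(-5, 5, 1001)
print(sigmoid_grad(0.0))       # 0.25, the maximum
print(sigmoid_grad(u).max())   # also ~0.25, reached around u = 0
```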

6-2.Section1 Vanishing gradient problem

● How to solve the vanishing gradient problem ・ Choice of activation function ・ Initial weight setting ・ Batch normalization

6-2-1 Selection of activation function

● ReLU function (activation function) Contributes to avoiding the vanishing gradient problem and to sparsity.

6-2-2 Initial value setting of weight

● Activation functions used with the Xavier initial value ・ ReLU function ・ Sigmoid (logistic) function ・ Hyperbolic tangent function ⇒ Each weight element is divided by the square root of the number of nodes in the previous layer.

● Activation function used with the He initial value ・ ReLU function ⇒ Each weight element is divided by the square root of the number of nodes in the previous layer and then multiplied by √2. This was devised to make the vanishing gradient less likely to occur.
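A minimal NumPy sketch of the two initializations described above (the layer sizes are made-up example values):

```python
import numpy as np

n_prev, n_cur = 784, 100   # nodes in the previous and current layer (example values)

# Xavier: standard normal scaled by 1/sqrt(n_prev); suited to sigmoid/tanh-like activations
W_xavier = np.random.randn(n_prev, n_cur) / np.sqrt(n_prev)

# He: the Xavier scaling multiplied by sqrt(2); suited to ReLU
W_he = np.random.randn(n_prev, n_cur) * np.sqrt(2.0 / n_prev)
```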

☆ Confirmation test ☆ 6-2-2-1 What kind of problem will occur if 0 is set as the initial value of the weight? Explain briefly.

[My answer] Because the total input is then just the bias value, appropriate predicted values are not calculated, and it takes time to update the parameters afterwards.

【answer】 All weights propagate the same value (0), so the parameters are no longer tuned. (The video at the top of the answer was cut off and could not be confirmed.)

6-2-3 Batch normalization

● What is batch normalization? A method that suppresses the bias of the input data in mini-batch units.

● How to use batch normalization Add a layer that performs batch normalization before or after the value is passed to the activation function.

⇒ It is applied to $u^{(l)} = w^{(l)} z^{(l-1)} + b^{(l)}$ or to $z$.

☆ Confirmation test ☆ 6-2-3-1 List two commonly considered effects of batch normalization.

[My answer] ・ Generalization performance improves. ・ By normalizing the input distribution (bringing it closer to a Gaussian) and constraining the input data, the inputs to each layer no longer change drastically at every learning step, so learning in the middle layers stabilizes.

【answer】 ・ Fast (compact) computation ・ Vanishing gradients become less likely to occur

● Mathematical description of batch normalization

\begin{align}
&1.μ_t = \frac{1}{N_t}\sum^{N_t}_{i=1} x_{ni}\\
&2.σ_t^2 = \frac{1}{N_t}\sum^{N_t}_{i=1}(x_{ni} - μ_t)^2\\
&3.\hat x_{ni} = \frac{x_{ni} - μ_t}{\sqrt{σ_t^2 + \theta}}\\
&4.y_{ni} = γ\hat x_{ni} + β
\end{align}

・ Explanation of processing and symbols

\begin{align}
μ_t&: Mean over mini-batch\ t\\
σ_t^2&: Variance over mini-batch\ t\\
N_t&: Number\ of\ samples\ in\ mini-batch\ t\\
\hat x_{ni}&: Normalized\ value\ (centered\ around\ 0)\\
γ&: Scaling\ parameter\\
β&: Shift\ parameter\\
y_{ni}&: Normalized\ value\ scaled\ by\ γ\ and\ shifted\ by\ β\\
&(output\ of\ the\ batch\ normalization\ operation)\\
\end{align}
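A minimal NumPy sketch of the forward computation above (training-time statistics only; the small constant and the parameter values are placeholders):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-7):
    # x: (N_t, features) -- one mini-batch
    mu = x.mean(axis=0)                      # 1. mini-batch mean
    var = x.var(axis=0)                      # 2. mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # 3. normalize (zero mean, unit variance)
    return gamma * x_hat + beta              # 4. scale and shift

x = np.random.randn(32, 10) * 3.0 + 5.0      # example mini-batch
y = batch_norm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # roughly 0 and 1
```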

● Calculation graph Learn forward and back propagation of batch normalization. Day2_0002batchnomalization.png

6-2-4 Implementation to solve the vanishing gradient problem

● Confirmation of implementations (screenshots Day2_0003 – Day2_0031):
・ Sigmoid
・ ReLU
・ Sigmoid + Xavier
・ ReLU + He
・ Sigmoid + Xavier with hidden_layer_size changed
・ ReLU + He with hidden_layer_size changed
・ Sigmoid + He with hidden_layer_size changed
・ ReLU + Xavier with hidden_layer_size changed

6-2-5 About data

● Grasp the whole picture from multiple angles. The amount of available data is very small in Japan; how to collect data is the key point (corporate needs), and data must be collected with the product in mind.

6-3.Section2 Learning rate optimization method

● Review of gradient descent ・ Build a network that minimizes the error through learning ⇒ Find the parameters $w$ that minimize the error $E(w)$

● Review of the learning rate ・ When the learning rate is too large, the parameters diverge and never reach the optimal value.

・ When the learning rate is too small, learning does not diverge, but it takes a long time to converge, and it becomes difficult to reach the global optimum (it tends to settle in a local optimum).

● How to determine the learning rate Guidelines for setting the initial learning rate ・ Start with a large initial learning rate and gradually decrease it. ・ Use a variable learning rate for each parameter. ⇒ **Use a learning rate optimization method** to optimize the learning rate.

● Learning rate optimization methods ・ Momentum ・ AdaGrad ・ RMSProp ・ Adam

6-3-1. Momentum

[Momentum] Inertia: $μ$ ・ Subtract the product of the gradient of the error and the learning rate, then add the product of the inertia term and the previous update (velocity $V$). ・ $μ$ is a hyperparameter.

\begin{align}
&V_t = μV_{t-1} - \epsilon∇E\\
&W^{(t+1)} = W^{(t)} + V_t\\
\end{align}

・ self.v[key] = self.momentum * self.v[key] - self.learning_rate * grad[key] ・ params[key] += self.v[key]
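Putting those two update lines together, a minimal Momentum optimizer might look like the sketch below (the class and attribute names follow the common textbook pattern and are not necessarily the course code):

```python
import numpy as np

class Momentum:
    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum      # inertia term mu
        self.v = None                 # velocity V_t, one array per parameter

    def update(self, params, grad):
        if self.v is None:
            self.v = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params.keys():
            # V_t = mu * V_{t-1} - epsilon * dE/dW
            self.v[key] = self.momentum * self.v[key] - self.learning_rate * grad[key]
            # W_{t+1} = W_t + V_t
            params[key] += self.v[key]
```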

[Gradient descent method] ・ Subtract the product of the gradient of the error and the learning rate.

W^{(t+1)} = W^{(t)} - \epsilon∇E\\

● Benefits of Momentum ・ It tends to reach the global optimum rather than getting stuck in a local optimum. ・ Once it enters a valley, it reaches the lowest point (the optimal value) quickly (plain gradient descent slows down where the gradient becomes gentle). ・ Because the previous update is carried over, learning reflects the recent changes in the parameters.

☆ Confirmation test ☆ ・ 6-3-1 Briefly explain the features of Momentum, AdaGrad, and RMSProp.

[My answer] ・ Momentum ... A global optimum solution can be obtained. Convergence is fast. ・ AdaGrad ... I don't know. ・ RMSProp ... I don't know.

【answer】 ・ Momentum ... Convergence is fast. ・ AdaGrad ... It easily approaches the optimal value on gentle slopes. ・ RMSProp ... Little hyperparameter adjustment is needed.

6-3-2.AdaGrad   【AdaGrad】 ・ Subtract from the weight the product of the gradient of the error and an adaptively re-defined learning rate. ・ θ is added after the √ so that the denominator does not become 0; it is written this way to unify the notation with RMSProp etc. ・ In the code, "1e-7" is added.

\begin{align}
&h_0 = \theta\\
&h_t = h_{t-1} + (∇E)^2\\
&W^{(t+1)} = W^{(t)} - \epsilon\frac{1}{\sqrt{h_t} + \theta}∇E
\end{align}

・ self.h[key] = np.zeros_like(val) ・ self.h[key] += grad[key] * grad[key] ・ params[key] -= self.learning_rate * grad[key] / (np.sqrt(self.h[key]) + 1e-7)

● Advantages of AdaGrad ・ It approaches the optimal value well on error surfaces with gentle slopes.

● AdaGrad challenges ・ Since the learning rate keeps decreasing, it can cause the **saddle point problem**. (Saddle point: a point that is a minimum along some directions but a maximum along others, where learning can stall.)

6-3-3.RMSProp   【RMSProp】 ・ Subtract from the weight the product of the gradient of the error and a re-defined learning rate based on an exponential moving average of the squared gradient.

\begin{align}
&h_t = αh_{t-1} + (1 - α)(∇E)^2\\
&W^{(t+1)} = W^{(t)} - \epsilon\frac{1}{\sqrt{h_t}+\theta} ∇E\\
\end{align}

・ self.h[key] *= self.decay_rate ・ self.h[key] += (1 - self.decay_rate) * grad[key] * grad[key] ・ params[key] -= self.learning_rate * grad[key] / (np.sqrt(self.h[key]) + 1e-7)

● Advantages of RMSProp ・ It tends to reach the global optimum rather than a local one. ・ There are few cases where the hyperparameters need to be adjusted.

6-3-4.Adam ● What is Adam? An optimization algorithm that combines ・ the exponential decay average of past gradients (Momentum) and ・ the exponential decay average of past squared gradients (RMSProp). ⇒ The current update is influenced by the previous ones.

● Advantages of Adam ・ An algorithm that combines the merits of Momentum and RMSProp.

● Source code ・ m[key] += (1 - beta1) * (grad[key] - m[key]) ・ v[key] += (1 - beta2) * (grad[key] ** 2 - v[key]) ・ network.params[key] -= learning_rate_t * m[key] / (np.sqrt(v[key]) + 1e-7)
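A minimal Adam optimizer sketch built around those three lines (the bias-corrected learning rate follows the common reference pattern; the class and attribute names are assumptions, not necessarily the course code):

```python
import numpy as np

class Adam:
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999):
        self.learning_rate = learning_rate
        self.beta1 = beta1   # decay for the gradient average (Momentum part)
        self.beta2 = beta2   # decay for the squared-gradient average (RMSProp part)
        self.iter = 0
        self.m = None
        self.v = None

    def update(self, params, grad):
        if self.m is None:
            self.m = {k: np.zeros_like(v) for k, v in params.items()}
            self.v = {k: np.zeros_like(v) for k, v in params.items()}
        self.iter += 1
        # bias-corrected learning rate (learning_rate_t in the course snippet)
        lr_t = self.learning_rate * np.sqrt(1.0 - self.beta2 ** self.iter) / (1.0 - self.beta1 ** self.iter)
        for key in params.keys():
            self.m[key] += (1 - self.beta1) * (grad[key] - self.m[key])
            self.v[key] += (1 - self.beta2) * (grad[key] ** 2 - self.v[key])
            params[key] -= lr_t * self.m[key] / (np.sqrt(self.v[key]) + 1e-7)
```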

6-3-5 Implementation of learning rate optimization method

●SGD (Day2_0032–0034SGD.png)
●Momentum: the inertia setting defaults to 0.9. (Day2_0035–0037momentum.png)
●Momentum ⇒ AdaGrad (Day2_0038–0040momentum_to_AdaGrad.png)
●RMSProp: with learning_rate = 0.01 and decay_rate = 0.99, try changing the values and check how the result changes. (Day2_0041–0043RMSProp.png)
●Adam (Day2_0044–0046Adam.png)

● Grasp the whole picture from multiple angles ・ Singularity: modules corresponding to the frontal lobe, hippocampus, and midbrain are being built, and how to define "greed" is being considered. Is an AI like the Terminator's Skynet just a matter of time? ⇒ It is good to follow technological innovation every day and think about what you want to make.

6-4.Section3 About overfitting

● Review of overfitting ・ The learning curves for the test error and the training error diverge. ⇒ Learning becomes specialized to particular training samples. ・ Causes: too many parameters, inappropriate parameter values, too many nodes, etc. ⇒ The degrees of freedom of the network (number of layers, number of nodes, parameter values, etc.) are too high. Day2_0047overtraining.png

6-4-1 Regularization

● What is regularization? Constraining the degrees of freedom of the network (number of layers, number of nodes, parameter values, etc.). ⇒ Overfitting is suppressed by using a regularization method.

● Regularization methods ・ L1 regularization, L2 regularization ・ Dropout

☆ Confirmation test ☆ ・ 6-4-1-1. Day2_0048overtraining.png [My answer]    (b)

【answer】 (a) is the correct answer. If the hyperparameter is set to a very large value, all the weights approach 0.

(b) describes linear regression. (c) The bias is not regularized. (d) It applies to the error function, not the hidden layers.

6-4-2 Weight decay

● Causes of overfitting Overfitting occurs when some weights take on large values.

● Overfitting solution Weights are suppressed by adding a regularization term to the error.

6-4-3 L1, L2 regularization

[Formula] Add the $p$-norm of the weights to the error function.

\begin{align}
&E_n(W) + \frac{1}{p}λ|| x ||_p\\
&|| x ||_p = (|x_1|^p + \cdots + |x_n|^p)^{\frac{1}{p}}
\end{align}

If $ p = 1 $, it is called L1 regularization. If $ p = 2 $, it is called L2 regularization.

L3, L4, L5, ... also exist, but their contributions are very small and they are costly to compute, so L1 and L2 are the mainstream.

・ np.sum(np.abs(network.params['W' + str(idx)])) ・ weight_decay += weight_decay_lambda * np.sum(np.abs(network.params['W' + str(idx)])) ・ loss = network.loss(x_batch, d_batch) + weight_decay ⇒ np.abs is a function that takes the absolute value.
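A minimal sketch of adding an L1 or L2 penalty to a loss and to the weight gradients (the function names, lambda value, and weight shapes are placeholders for illustration):

```python
import numpy as np

def regularization_penalty(weights, weight_decay_lambda, p=2):
    """Penalty added to the loss: lambda * sum|W| for L1, (lambda/2) * sum W^2 for L2."""
    if p == 1:
        return weight_decay_lambda * sum(np.sum(np.abs(W)) for W in weights)
    return 0.5 * weight_decay_lambda * sum(np.sum(W ** 2) for W in weights)

def regularization_grad(W, weight_decay_lambda, p=2):
    """Gradient of the penalty with respect to one weight matrix."""
    if p == 1:
        return weight_decay_lambda * np.sign(W)   # d/dW of lambda * |W|
    return weight_decay_lambda * W                # d/dW of (lambda/2) * W^2

# usage sketch: add the penalty to the data loss and its gradient to each weight gradient
weights = [np.random.randn(4, 3), np.random.randn(3, 2)]
data_loss = 0.42                                   # placeholder value
loss = data_loss + regularization_penalty(weights, weight_decay_lambda=0.1, p=1)
grads = [regularization_grad(W, 0.1, p=1) for W in weights]
```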

☆ Confirmation test ☆ ・ 6-4-3-1 One of the two graphs below shows L1 regularization; answer which one. Day2_0049overtraining.png [My answer] The figure on the left (Ridge estimator)

【answer】 The figure on the right (Lasso estimator). The concentric circles represent the contour lines of the error. Understand the essential difference between L1 and L2.

【reference】
・ Explanation of the L1 regularization (Lasso) formula and a scratch implementation
   https://qiita.com/torahirod/items/a79e255171709c777c3a
・ Blog of a data scientist working in front of Shibuya station
   https://tjo.hatenablog.com/entry/2015/03/03/190000

☆ Example challenge ☆ ・ 6-4-3-2 Day2_0050overtraining.png [My answer]    (3)param**2

【answer】    (4)param The L2 norm is ||param||^2, so its gradient is added to the error. The gradient is 2 * param, but since there is a factor of 1/2, it becomes param.

☆ Example challenge ☆ ・ 6-4-3-3 Day2_0051overtraining.png [My answer]    (4)np.abs(param)

【answer】    (3)np.sign(param) The L1 norm is |param|, so its gradient is added to the error. In sign(param), sign is the sign function.

Sign function: returns 1 for values greater than 0, 0 for 0, and -1 for negative values.

☆ Example challenge ☆ ・ 6-4-3-4 Day2_0052overtraining.png [My answer]    (4)image[top:bottom,left:right,:]

【answer】    (4)image[top:bottom,left:right,:] The image format is (height, width, channels).

6-4-4 Dropout

● Overfitting issue addressed here: a large number of nodes.

● What is dropout? Learning proceeds while nodes are randomly deleted (deactivated).

● Benefits of dropout It can be interpreted as training different models without changing the amount of data.
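A minimal sketch of a dropout layer (the class shape follows the common textbook pattern; the ratio value is an example, not necessarily the course setting):

```python
import numpy as np

class Dropout:
    def __init__(self, dropout_ratio=0.5):
        self.dropout_ratio = dropout_ratio
        self.mask = None

    def forward(self, x, train_flg=True):
        if train_flg:
            # randomly deactivate nodes: each unit is kept with probability 1 - ratio
            self.mask = np.random.rand(*x.shape) > self.dropout_ratio
            return x * self.mask
        # at inference time, scale the outputs instead of dropping units
        return x * (1.0 - self.dropout_ratio)

    def backward(self, dout):
        # gradients flow only through the units that were kept
        return dout * self.mask
```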

6-4-4-1 Source code exercise

●Overfitting (Day2_0053–0055overfiting.png)
●Weight decay (L2) (Day2_0056–0058overfiting.png)
●L1 (Day2_0059–0061overfitingL1.png)
●Weight decay (L2) with weight_decay_lambda changed (Day2_0062–0063overfitingL2_weigh_decay_lambda.png)

6-4-4-2 Dropout Source Code Exercise 2

After understanding the characteristics of each method, it is necessary to consider which combinations to run.
●Dropout (Day2_0064–0066overfiting_Dropout.png)
●Dropout + L1 (Day2_0067–0069overfiting_DropoutL1.png)

6-5.Section4 Concept of convolutional neural network

● CNN structure diagram ・ There are various CNN structures. ・ Convolutional neural networks provide solutions for images ・ and can also be applied in the audio domain. ・ They consist of an input layer, convolution layers, pooling layers, fully connected layers, and an output layer.

● Overview of the convolution layer ・ Filter: corresponds to the weights in a fully connected layer.

6-5-1. Convolution layer "bias"

● Convolution layer In the case of an image, the convolution layer can learn the 3D data (height, width, channels) as it is and pass it on.

Conclusion: the convolution layer is a layer that can also learn 3D spatial information.

● Concept of the convolution operation (with bias) Compute the sum of the products of the input image and the filter values, then add the bias.

6-5-2. Convolution layer "padding"

● Padding Zero padding is common: fill the area around the input image with fixed 0 values. This makes it possible to output an image of the same size as the input data.

6-5-3. Convolution layer "Stride"

● Stride The amount by which the filter is moved is called the stride; it can move by skipping two or three positions at a time.

6-5-4. Convolution layer "channel"

● Channel The number of decomposed layers is called the number of channels. Learning can be done in 3D: depth is added to height and width.

6-5-5. Challenges when learning images with fully connected layers

Disadvantages of the fully connected layer ・ In the case of images, the 3D data (height, width, channels) is processed as one-dimensional data.

⇒ The relationships between the RGB channels are not reflected in learning.

6-5-6 Source code exercise

●im2col (image to column): converts image data into a two-dimensional array. (Day2_0070–0071CNN.png)
●Let's check the processing of im2col (Day2_0072–0073CNN_commentout.png)
●col2im (column to image): converts a two-dimensional array back into image data. (Day2_0074CNN_col2im.png)
●Convolution class (Day2_0075–0076CNN_convolution_class.png)
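A minimal, loop-based im2col sketch to illustrate the conversion described above (real implementations are vectorized; the function name and shapes here are assumptions):

```python
import numpy as np

def im2col_simple(x, filter_h, filter_w, stride=1, pad=0):
    """x: (N, C, H, W) -> (N * out_h * out_w, C * filter_h * filter_w)."""
    N, C, H, W = x.shape
    out_h = (H + 2 * pad - filter_h) // stride + 1
    out_w = (W + 2 * pad - filter_w) // stride + 1
    img = np.pad(x, [(0, 0), (0, 0), (pad, pad), (pad, pad)], mode='constant')
    col = np.zeros((N, out_h * out_w, C * filter_h * filter_w))
    for n in range(N):
        row = 0
        for i in range(out_h):
            for j in range(out_w):
                top, left = i * stride, j * stride
                patch = img[n, :, top:top + filter_h, left:left + filter_w]
                col[n, row] = patch.reshape(-1)   # one receptive field per row
                row += 1
    return col.reshape(N * out_h * out_w, -1)

x = np.arange(16).reshape(1, 1, 4, 4).astype(float)
print(im2col_simple(x, 2, 2, stride=1, pad=0).shape)   # (9, 4): 3x3 positions, 2x2 patches
```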

6-5-7 Pooling layer

● There are two main types ・ Max pooling ・ Average pooling

● Computational concept of the pooling layer For each target region of the input image, take the MAX value or the average value and output it.
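A tiny sketch of max and average pooling over non-overlapping 2x2 regions (stride 2 here for simplicity; the function name is illustrative):

```python
import numpy as np

def pool2x2(x, mode="max"):
    """x: (H, W) with even H, W; pools non-overlapping 2x2 regions."""
    H, W = x.shape
    blocks = x.reshape(H // 2, 2, W // 2, 2)   # split into 2x2 blocks
    if mode == "max":
        return blocks.max(axis=(1, 3))         # max pooling
    return blocks.mean(axis=(1, 3))            # average pooling

x = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12],
              [13, 14, 15, 16]], dtype=float)
print(pool2x2(x, "max"))   # [[ 6.  8.] [14. 16.]]
print(pool2x2(x, "avg"))   # [[ 3.5  5.5] [11.5 13.5]]
```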

☆ Confirmation test ☆ ・ 6-5-7-1 Answer the size of the output image when an input image of size 6x6 is convolved with a 2x2 filter. The stride and padding are both set to 1.

[My answer] 7x7 output image

【answer】 7x7 output image

OH = (H + 2P - FH)/S + 1 = (6 + 2*1 - 2)/1 + 1 = 7

OW = (W + 2P - FW)/S + 1 = (6 + 2*1 - 2)/1 + 1 = 7
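The same calculation as a small helper function (a sketch; the parameter names mirror the formula above):

```python
def conv_output_size(input_size, filter_size, stride=1, pad=0):
    # OH = (H + 2P - FH) / S + 1
    return (input_size + 2 * pad - filter_size) // stride + 1

print(conv_output_size(6, 2, stride=1, pad=1))   # 7
```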

【reference】 AI from "0" ~ Please take a break ~    https://www.zerokaraai.com/entry/2018/10/03/210211

6-5-8 Source Code Exercise

● Data convolution (Day2_0077–0079CNN.png)
[Discussion] It took a long time to process, probably because I wasn't using the GPU. Also, if you do not specify the directory, common will not be found, so the following lines are added:

import sys, os
sys.path.append(os.pardir)  # setting for importing files in the parent directory

However, I found image processing interesting, and I am learning it properly.

● Syllabus It is necessary to be willing to learn based on the keywords. In applied mathematics, it is important to do the calculations by hand. Review regularization in particular. CNN is the most important item. RNNs require the ability to look up the confusing parts. Autoencoders and the like are an advanced field ⇒ study them once you are done. (Day2_0080–0087story.png)

6-6.Section5 Latest CNN

● Description of the AlexNet model The model that won the 2012 image recognition competition by a large margin over second place.

● Measures to prevent overfitting Dropout is applied to the outputs of the fully connected layers of size 4096. The paper also reports the difference with and without dropout. Day2_0088AlexNet.png

6-7. Set your own assignment

● Purpose and required data I don't have data at hand. ⇒ Set the task "Try to make a crawler."

● Crawler How to build it depends on whether or not you have knowledge of databases. The sample source finally converts the results to CSV. Insert time.sleep so as not to put a load on the site being crawled.
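A minimal, polite crawler sketch along those lines (the target URLs and output file name are placeholders; assumes the requests and beautifulsoup4 packages are installed):

```python
import csv
import time

import requests
from bs4 import BeautifulSoup   # assumes beautifulsoup4 is installed

URLS = ["https://example.com/page1", "https://example.com/page2"]  # placeholder URLs

rows = []
for url in URLS:
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    rows.append([url, title])
    time.sleep(1.0)   # wait between requests so as not to put a load on the site

with open("result.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "title"])
    writer.writerows(rows)
```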
