[PYTHON] DeepRunning ~ Level5 ~

Level5. Deep learning DAY1

What is a "deep learning course that can be put to practical use in the field in 3 months"?

5-1. Overview of neural network

● From the client's point of view, all that matters is what goes in as input and what comes out as output. Engineers, however, need to pay particular attention to the intermediate layers.

☆ Confirmation test ☆ 5-1-1 Describe, in no more than two lines, what deep learning is ultimately trying to do. Also, which of the following values is the ultimate target of optimization?

① Input value [X] ② Output value [Y] ③ Weight [W] ④ Bias [b] ⑤ Total input [u] ⑥ Intermediate layer input [z] ⑦ Learning rate [ρ]

(I wonder whether the confirmation test is meant as a check after studying, or before ...)

[My answer] To update the parameters based on the predicted values and the correct answer values so that the network learns in a way analogous to the human brain. The values to be optimized are ③ Weight [W] and ④ Bias [b].

[Answer] Discover the parameters that minimize the error. ③ Weight [W], ④ Bias [b]

☆ Confirmation test ☆ 5-1-2 Draw the following network on paper. Input layer: 2 nodes, 1 layer; Middle layer: 3 nodes, 2 layers; Output layer: 1 node, 1 layer

I wrote it in my notebook by hand. DSC_0001_poyopon.JPG

5-2. What you can do with a neural network (NN)

[Regression] ● Result forecast ... Sales forecast, stock price forecast ● Ranking ... Horse racing ranking forecast, popularity ranking forecast

[Category] ● Identification of cat photos ● Handwritten character recognition ● Flower type classification

5-3. Regression

● Approximation of a function that outputs continuous real values ・ Linear regression ・ Regression tree ・ Random forest ・ Neural network (NN)

5-4. Classification

● Analysis to predict discrete results such as gender (male and female) and animal type ・ Bayes classification ・ Logistic regression ・ Decision tree ・ Random forest ・ Neural network (NN)

5-5. Practical examples of deep learning

● Automated trading Even when someone has a good model, it is rarely made public; because the information itself is the advantage, each player converges on their own algorithm.

● Chatbots Meet current needs such as call center automation; FAQ systems also reduce labor costs.

● Translation Google Translate and the like. To improve accuracy, the importance of each part of the input is weighted, using an attention mechanism.

● Speech recognition Smart speakers such as Amazon Echo and Google Home; Dragon Speech (developed in the speech recognition field). Some parts use AI and some do not. If you try to build your own product, you need the audio data required to improve accuracy. You can collect a huge amount of data, but there is no guarantee it will reach sufficient accuracy. Trial-and-error efforts, such as audio compression, are also needed to get the system up and running.

● Go / Shogi AI AlphaGo (Go), Ponanza (Shogi), etc. ⇒ Deep learning technology is deeply involved: hybrids of reinforcement learning and CNNs, etc.

5-6. Section 1 "Input layer - intermediate layer"

☆ Confirmation test ☆ 5-6-1 Let's put an example of animal classification in this diagram. DSC_0002_poyopon.JPG

☆ Confirmation test ☆ 5-6-2 Write this formula in Python (using the numpy library).     u = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + b = Wx + b

[My answer] $ u = np.dot(x, W) + b $ [Answer] $ u_1 = np.dot(x, W_1) + b_1 $
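A minimal NumPy sketch of this total-input calculation. The shapes and values below are made up purely for illustration and are not taken from the course notebook.

```python
import numpy as np

# Hypothetical example: 4 input values feeding one node
x = np.array([1.0, 2.0, 3.0, 4.0])   # inputs x_1 .. x_4
W = np.array([0.1, 0.2, 0.3, 0.4])   # weights w_1 .. w_4
b = 0.5                              # bias

u = np.dot(x, W) + b                 # total input u = Wx + b
print(u)                             # 3.5
```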

5-7. Source code explanation

☆ Confirmation test ☆ 5-7-1 Extract the source that defines the output of the middle layer from the 1-1 file.     (= 1_1_forward_propagation)

[My answer] As shown in the image. Day1_0001.png [Answer] z = functions.relu(u)

5-8. Jupyter Exercise

☆ Confirmation test ☆ 5-8-1 In the forward propagation (single layer / single unit) part of "1_1_forward_propagation.ipynb", check the behavior using the weight / bias from the "Let's try" exercise.

Day1_0002.png Day1_0003.png ⇒ I decided to set both the weight and the bias to random values of my choosing (see the sketch below). Day1_0004.png Day1_0005.png
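As a rough sketch of that "Let's try" exercise (not the notebook's exact code), the weight and bias can be replaced with random values like this; the input values and the ReLU activation are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)       # seed chosen arbitrarily for reproducibility

x = np.array([2.0, 3.0])             # example input values
W = rng.uniform(-1.0, 1.0, size=2)   # random weights instead of fixed ones
b = rng.uniform(-1.0, 1.0)           # random bias

u = np.dot(x, W) + b                 # total input
z = np.maximum(0.0, u)               # middle-layer output with a ReLU activation
print("u:", u, "z:", z)
```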

5-9. Section 2 "Activation function"

● What is an activation function? A non-linear function that determines the magnitude of the output passed to the next layer in a neural network. Depending on the input value, it decides whether the signal to the next layer is ON or OFF and how strong it is.

⇒ By adjusting the strength of the total signal, judgments closer to those of a human become possible.   ☆ Confirmation test ☆ 5-9-1 Explain the difference between linear and non-linear with a diagram. DSC_0003_poyopon.JPG

● Activation function for the middle layer ・ ReLU function ・ Sigmoid function (logistic function) ・ Step function (function that was the basis of deep learning)

● Activation functions for the output layer ・ Softmax function ・ Identity map ・ Sigmoid function (logistic function)

● Step function It had the following issues. ・ Intermediate values between 0 and 1 cannot be expressed (only ON and OFF can be represented). ・ Only linearly separable problems could be learned.

[Formula]    f(x) = \begin{cases} 1 & (x \geq 0) \\ 0 & (x < 0) \end{cases}

● Sigmoid function (logistic function) ・ A function that varies smoothly between 0 and 1. ・ Because it can convey the strength of a signal, it triggered the spread of neural networks for prediction.   However, it has the following issue. ・ For large input values the change in output is tiny, which causes the **vanishing gradient problem**.

[Formula]    f(u) = \frac{1}{1+e^{-u}}

● ReLU function ・ Currently the most widely used activation function. ・ By avoiding the **vanishing gradient problem** and contributing to **sparsification**, it has produced good results. ・ Don't stick to ReLU only; use different functions depending on the architecture. Sometimes the sigmoid function gives better results.

[Formula]    f(x) = \begin{cases} x & (x > 0) \\ 0 & (x \leq 0) \end{cases}
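A minimal NumPy sketch of the three middle-layer activation functions above. These are my own throwaway implementations, not the course's functions module.

```python
import numpy as np

def step(x):
    # 1 where x >= 0, otherwise 0
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):
    # smooth output between 0 and 1
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # x where x > 0, otherwise 0
    return np.maximum(0.0, x)

u = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(step(u))     # [0. 0. 1. 1. 1.]
print(sigmoid(u))  # values strictly between 0 and 1
print(relu(u))     # [0.  0.  0.  0.5 2. ]
```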

● Fully connected NN, single layer / multiple nodes ・ The middle layer has multiple nodes. ・ Because the numbers of weights and biases increase, the amount of computation and the load also increase. ・ An extension of the single layer / single node case.

☆ Confirmation test ☆ Extract the relevant part from the distributed source code.   z = f(u)

[My answer]   z = functions.sigmoid(u)

【answer】   z = functions.sigmoid(u)

Check the behavior of each activation function using the distributed source code.

Day1_0006.png Day1_0007.png

Both the sigmoid function and the step function worked properly.

5-10. Section 3 "Output layer"

● Role of the output layer ・ Convert the result into a form humans can interpret, using a function. ・ This becomes important when presenting results to the client.

● Error function ・ As data prepared in advance, "input data" and "training data (correct answer values)" are required. ・ The values output as the result of a classification take the form "probability of XX%" for each class. ・ If a prediction reaches 100%, you should be a little suspicious. ・ Quantify the error between the computed values and the training data using a function. ・ Do not use the "squared error" for classification.

☆ Confirmation test ☆ 5-10-1 Describe why the error is squared rather than simply subtracted. [My answer] Because the error can be positive or negative, something like an absolute value is needed; since that makes the calculation awkward, the error is squared instead.

【answer】 To make the value positive.

5-10-2 Describe what the 1/2 in the formula below means. [My answer] It makes differentiating the sum of squared errors easier. 【answer】 To simplify the differentiation.

● Output layer activation function ・ Difference between the output layer and the intermediate layer Middle layer: adjusts the signal strength before and after passing it through thresholds. Output layer: converts the signal into magnitudes (ratios) as-is.

・ Probability output For classification problems, the outputs of the output layer are limited to the range 0 to 1, and their sum must be 1.

⇒The activation functions used in the output layer and the intermediate layer are different.

[Regression] Activation function ... Identity map Error function ... Squared error

[Binary classification] Activation function ... Sigmoid function Error function ... Cross entropy

[Multi-class classification] Activation function ... Softmax function Error function ... Cross entropy

[Sigmoid function (mathematical formula)]    f(u) = \frac{1}{1+e^{-u}}

☆ Confirmation test ☆ [Softmax function (mathematical formula)]

   f(i,u) = \frac{e^{u_{i}}}{\sum^{K}_{k=1} e^{u_{k}}}

Show the source code corresponding to parts ① to ③ of the formula, and explain the process line by line.

[My answer] ① y.T ... the result of the softmax calculation ② np.exp(x) ... apply the exponential function to x ③ np.sum(np.exp(x), axis = 0) ... the sum of the exponentials of x.

【answer】 As a premise, the formula cannot be used directly on multidimensional input. The code transposes x and steadily subtracts the maximum value from it before taking the exponentials (see the sketch below).

[Source: Axis specification in NumPy](https://qiita.com/shuetsu@github/items/2bf8bba233c5ecc7a0ad) ・ Using the numerical calculation library NumPy, specify an axis for a matrix and aggregate along it.
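A sketch of a softmax that behaves as described in the answer above: 2-D input is transposed, the per-sample maximum is subtracted (a standard trick to prevent overflow in the exponentials), and the sum is taken along axis 0. This is my reconstruction and may not match the distributed source exactly.

```python
import numpy as np

def softmax(x):
    if x.ndim == 2:
        x = x.T                        # one sample per column
        x = x - np.max(x, axis=0)      # subtract each sample's max to avoid overflow
        y = np.exp(x) / np.sum(np.exp(x), axis=0)
        return y.T                     # back to one sample per row
    x = x - np.max(x)                  # 1-D case
    return np.exp(x) / np.sum(np.exp(x))

print(softmax(np.array([1.0, 2.0, 3.0])))           # sums to 1
print(softmax(np.array([[1.0, 2.0], [3.0, 4.0]])))  # each row sums to 1
```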

● Mean squared error     [Formula]

   E_n(W) = \frac{1}{2} \sum^I_{i=1}(y_i - d_i)^2

```python
import numpy as np

def mean_squared_error(d, y):
    # mean squared error, halved so that the derivative stays simple
    return np.mean(np.square(d - y)) / 2
```

☆ Confirmation test ☆ [Cross entropy (mathematical formula)]

   E_n(W) = -\sum^I_{i=1} d_i \log y_i

Show the source code corresponding to the formulas ① to ②, Explain the process line by line.

[My answer]   ① return -np.sum(np.log(y[np.arange(batch_size), d] + 1e-7)) / batch_size   ② return -np.sum(np.log(y[np.arange(batch_size), d] + 1e-7)) / batch_size ⇒ Because y may contain values of 0, the small constant 1e-7 is added so that the argument of the log never becomes 0.

```python
import numpy as np

def cross_entropy_error(d, y):
    if y.ndim == 1:
        d = d.reshape(1, d.size)
        y = y.reshape(1, y.size)

    # If the teacher data is a one-hot vector, convert it to the index of the correct label
    if d.size == y.size:
        d = d.argmax(axis=1)

    batch_size = y.shape[0]
    return -np.sum(np.log(y[np.arange(batch_size), d] + 1e-7)) / batch_size
```

5-11. Section 4 "Gradient descent method"

● Gradient descent method ・ Purpose of deep learning: through learning, build a network that minimizes the error. ⇒ Find the parameter $ w $ that minimizes the error $ E(w) $. ⇒ **Use the gradient descent method** to optimize the parameters.

[Gradient descent method]

w^{(t+1)}=w^{(t)} -  ε∇E\\
∇E = \frac{\partial E}{\partial w} = \biggl \{\frac{\partial E}{\partial w_1} \cdots \frac{\partial E}{\partial w_M} \biggr \}

☆ Confirmation test ☆ Let's find the corresponding source code.

[My answer] Parameters ... network[key] -= learning_rate * grad[key] $ ∇E $ ... grad = backward(x, d, z1, y)

【answer】 Parameters ... network[key] -= learning_rate * grad[key] $ ∇E $ ... grad = backward(x, d, z1, y)

  • $ ∇E $ is the error differentiated with respect to the parameters.

[$ ε $: Learning rate] ・ If the learning rate is too large: the search cannot converge well and diverges. ・ If it is too small: it does not diverge, but convergence takes time. ・ Global minimum ... the point that is the minimum over the whole error surface.
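A minimal sketch of the update rule $ w^{(t+1)} = w^{(t)} - ε∇E $. The quadratic error function here is invented just to make the loop self-contained; in the course code the gradient instead comes from grad = backward(x, d, z1, y).

```python
import numpy as np

learning_rate = 0.1            # ε: too large diverges, too small converges slowly

# Toy error E(w) = (w - 3)^2, so the gradient is dE/dw = 2(w - 3)
w = np.array([0.0])
for t in range(50):
    grad = 2.0 * (w - 3.0)
    w -= learning_rate * grad  # same form as network[key] -= learning_rate * grad[key]

print(w)                       # approaches the minimum at w = 3
```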

● Gradient descent algorithms Algorithms for determining the learning rate and improving convergence; several papers have been published and they are widely used.   ・ Momentum ・ AdaGrad ・ Adadelta ・ Adam

● Stochastic Gradient Descent (SGD) ・ Advantages of stochastic gradient descent ・ Reduces the computation cost when the data is redundant ・ Reduces the risk of converging on an undesirable local minimum ・ Allows online learning [Stochastic gradient descent]

w^{(t+1)}=w^{(t)} -  ε∇E_n

[Gradient descent method]

w^{(t+1)}=w^{(t)} -  ε∇E

☆ Confirmation test ☆ What is online learning? Summarize in two lines.

[My answer] Weights and biases can be updated from the error as new data arrives, even while real-time prediction is running.

【answer】 For example, on Facebook, the model can keep learning using only the data of newly registered users.

● Mini-batch gradient descent method ・ The average error over the samples belonging to a randomly split set of data (mini-batch) $ D_t $. ・ Without losing the merits of stochastic gradient descent, the computing resources of the machine can be used effectively. ⇒ Thread parallelization on the CPU and SIMD parallelization on the GPU

[Mini batch gradient descent method]

w^{(t+1)}=w^{(t)} -  ε∇E_t\\
E_t = \frac{1}{N_t}\sum_{n \in{D_t}}E_n\\
N_t = |D_t|

[Stochastic gradient descent]

w^{(t+1)}=w^{(t)} -  ε∇E_n

☆ Confirmation test ☆

[Mini batch gradient descent method] w^{(t+1)}=w^{(t)} -  ε∇E_t

Explain the meaning of this formula in a diagram.

[My answer] Since it is a set of randomly split data (mini-batch) $ D_t $, group random samples together, compute the error for each group, and finally compute the overall (average) error. (I couldn't express it as a diagram ...)

【answer】 DSC_0004_poyopon.JPG
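A sketch of how mini-batch gradient descent groups the samples: the data is shuffled, split into mini-batches $ D_t $, and the parameter is updated with the average gradient of each batch. The linear-regression setup below is invented purely so the example runs end to end.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: d = 2x + noise; fit a single weight w
x = rng.uniform(-1, 1, size=100)
d = 2.0 * x + rng.normal(0, 0.1, size=100)

w, learning_rate, batch_size = 0.0, 0.1, 10
for epoch in range(100):
    idx = rng.permutation(len(x))              # random split into mini-batches
    for start in range(0, len(x), batch_size):
        batch = idx[start:start + batch_size]  # indices of D_t
        y = w * x[batch]
        # average over the batch of the gradient of E_n = 0.5 * (y - d)^2
        grad = np.mean((y - d[batch]) * x[batch])
        w -= learning_rate * grad              # w^(t+1) = w^(t) - ε∇E_t

print(w)                                       # close to 2
```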

● Calculation of the error gradient How to compute ∇E

∇E = \frac{\partial E}{\partial w} = \biggl \{\frac{\partial E}{\partial w_1} \cdots \frac{\partial E}{\partial w_M} \biggr \}

[Numerical differentiation] A generic method that adds tiny perturbations in the program and computes the derivative approximately.

\frac{\partial E}{\partial w_m} \approx \frac{E(w_m + h)- E(w_m - h)}{2h}

However, it has a major disadvantage. ・ To compute $ E(w_m + h) $ and $ E(w_m - h) $ for every parameter $ w_m $, the forward propagation calculation must be repeated, which makes the load heavy.
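A sketch of this central-difference numerical gradient; it makes the drawback visible, since each parameter needs two extra evaluations of E (i.e. two extra forward passes). The quadratic error function is only a placeholder.

```python
import numpy as np

def numerical_gradient(E, w, h=1e-4):
    grad = np.zeros_like(w)
    for m in range(w.size):
        tmp = w[m]
        w[m] = tmp + h
        E_plus = E(w)            # evaluation with w_m + h
        w[m] = tmp - h
        E_minus = E(w)           # evaluation with w_m - h
        grad[m] = (E_plus - E_minus) / (2 * h)
        w[m] = tmp               # restore the original value
    return grad

# Placeholder error function: E(w) = sum(w^2), whose true gradient is 2w
E = lambda w: np.sum(w ** 2)
print(numerical_gradient(E, np.array([1.0, -2.0, 3.0])))  # ≈ [ 2. -4.  6.]
```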

⇒ ** Use the error back propagation method **

● Deep learning development environment ・ Local: CPU, GPU ・ Cloud: AWS, GCP

  • Depending on the versions in the server environment, things may not work; care must be taken with version combinations.

5-12. Section 5 "Error back propagation method"

● Calculation of the error gradient How to compute ∇E

∇E = \frac{\partial E}{\partial w} = \biggl \{\frac{\partial E}{\partial w_1} \cdots \frac{\partial E}{\partial w_M} \biggr \}

[Error back propagation method] The computed error is differentiated starting from the output layer side and propagated backward, layer by layer, toward the input. It is a method of computing the derivative with respect to each parameter **analytically** with a minimum of computation.

By working backward from the computed result (= the error) to the derivatives, they can be obtained while avoiding unnecessary recursive computation. ⇒ The chain rule is used.

☆ Confirmation test ☆ The error back propagation method can avoid unnecessary recursive processing. Extract the source code that holds on to calculation results that have already been computed.

[My answer]

```python
# Error backpropagation
def backward(x, d, z1, y):
    print("\n#####Error back propagation start#####")

    grad = {}

    W1, W2 = network['W1'], network['W2']
    b1, b2 = network['b1'], network['b2']
    # Delta at the output layer
    delta2 = functions.d_sigmoid_with_loss(d, y)
    # Gradient of b2
    grad['b2'] = np.sum(delta2, axis=0)
    # Gradient of W2
    grad['W2'] = np.dot(z1.T, delta2)
    # Delta in the middle layer
    delta1 = np.dot(delta2, W2.T) * functions.d_relu(z1)
    # Gradient of b1
    grad['b1'] = np.sum(delta1, axis=0)
    # Gradient of W1
    grad['W1'] = np.dot(x.T, delta1)

    print_vec("Partial differential_dE/du2", delta2)
    print_vec("Partial differential_dE/du1", delta1)

    print_vec("Partial differential_Weight 1", grad["W1"])
    print_vec("Partial differential_Weight 2", grad["W2"])
    print_vec("Partial differential_Bias 1", grad["b1"])
    print_vec("Partial differential_Bias 2", grad["b2"])

    return grad
```

【answer】 Is it the same ...?

● Error back propagation method

\begin{align}
&E(y) = \frac{1}{2}\sum^J_{j=1}(y_j - d_j)^2 = \frac{1}{2}||y - d||^2 \quad\text{⇒ let the error function be the squared error function}\\
&y = u^{(L)} \quad\text{⇒ let the output layer activation function be the identity map}\\
&u^{(l)} = w^{(l)}z^{(l-1)} + b^{(l)} \quad\text{⇒ calculation of the total input}\\\\
&\frac{\partial E}{\partial w^{(2)}_{ji}}=\frac{\partial E}{\partial y}\frac{\partial y}{\partial u}\frac{\partial u}{\partial w^{(2)}_{ji}}\\\\
&\frac{\partial E(y)}{\partial y} = \frac{\partial}{\partial y}\frac{1}{2}||y - d||^2 = y - d\\\\
&\frac{\partial y(u)}{\partial u} =  \frac{\partial u}{\partial u} = 1\\
\end{align}
\frac{\partial u(w)}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}}\left(w^{(l)}z^{(l-1)}+b^{(l)}\right)
=\frac{\partial}{\partial w_{ji}} \left(
\begin{bmatrix} w_{11}z_1 + \cdots + w_{1i}z_i + \cdots + w_{1I}z_I \\
\vdots\\
w_{j1}z_1 + \cdots + w_{ji}z_i + \cdots + w_{jI}z_I\\
\vdots\\
w_{J1}z_1 + \cdots + w_{Ji}z_i + \cdots + w_{JI}z_I \end{bmatrix}
+\begin{bmatrix} b_1\\
\vdots\\
b_j\\
\vdots\\
b_J \end{bmatrix}
\right)
=\begin{bmatrix} 0\\
\vdots\\
z_i\\
\vdots\\
0 \end{bmatrix}

5-13. Explanation in Jupyter (1_3_stochastic_gradient_descent)

● Try exercises using the source code. ① Set the activation function to the ReLU function. Day1_0008_Jupyter2.png Day1_0009_Jupyter2.png Day1_0010_Jupyter2.png Day1_0011_Jupyter2.png Day1_0012_Jupyter2.png

② Set the activation functions to sigmoid × ReLU. Day1_0013_Jupyter(sigmoid-ReLU).png Day1_0014_Jupyter(sigmoid-ReLU).png

③ Set the activation functions to sigmoid × sigmoid. Day1_0015_Jupyter(sigmoid-sigmoid).png Day1_0016_Jupyter(sigmoid-sigmoid).png

④ Set the input values to random values from -5 to 5. Day1_0017_Jupyter(random-input).png Day1_0018_Jupyter(random-input).png

☆ Confirmation test ☆ Find the source code that corresponds to the two blanks below.

\frac{\partial E}{\partial y}

⇒delta2 = functions.d_mean_squared_error(d, y)

\frac{\partial E}{\partial y}\frac{\partial y}{\partial u}

⇒?

\frac{\partial E}{\partial y}\frac{\partial y}{\partial u}\frac{\partial u}{\partial w^{(2)}_{ji}}

⇒?

[My answer] I don't quite understand this. It's hard to follow just by watching the video.

【answer】 ・ Since the output layer uses the identity map, y = u.     \frac{\partial E}{\partial y}\frac{\partial y}{\partial u} = \frac{\partial E}{\partial y}\frac{\partial u}{\partial u} So delta2 = functions.d_mean_squared_error(d, y)

・ Take the dot product: the dot product of the transposed middle-layer output z1 and the value stored in delta2.     grad['W2'] = np.dot(z1.T, delta2)

5-14. Paper commentary

For implementation, obtain the knowledge needed for design from papers. Check the graphs of experimental results as early as possible and understand them visually.
