A 3-month deep learning course for skills that hold up in the field
● From the client's point of view, all that matters is what goes in as input and what comes out as output. Engineers, on the other hand, need to be especially aware of the middle (hidden) layers.
☆ Confirmation test ☆ 5-1-1 Describe, in no more than two lines, what deep learning is ultimately trying to do. Also, which of the following values is the ultimate target of optimization?
① Input value [X] ② Output value [Y] ③ Weight [W] ④ Bias [b] ⑤ Total input [u] ⑥ Intermediate layer input [z] ⑦ Learning rate [ρ]
(I'm wondering whether the confirmation test is meant as a check after studying ...)
[My answer] To update the parameters from the predicted values and the correct values so that the network learns in a way that resembles the human brain. The values to be optimized are ③ weight [W] and ④ bias [b].
[Answer] Discover the parameters that minimize the error. ③ Weight [W], ④ Bias [b]
☆ Confirmation test ☆ 5-1-2 Draw the following network on paper. Input layer: 2 nodes, 1 layer. Middle layer: 3 nodes, 2 layers. Output layer: 1 node, 1 layer.
I wrote it in my notebook by hand.
[Regression] ● Result forecast ... Sales forecast, stock price forecast ● Ranking ... Horse racing ranking forecast, popularity ranking forecast
[Category] ● Identification of cat photos ● Handwritten character recognition ● Flower type classification
● Approximation of a function that takes continuous real values: linear regression, regression tree, random forest, neural network (NN)
● Analysis that predicts discrete results such as gender (male/female) or animal type: Bayes classification, logistic regression, decision tree, random forest, neural network (NN)
● Automatic trading: even when a good model exists, it is rarely made public; those who hold the information converge on their own proprietary algorithms.
● Chatbots: meet current needs such as call-center automation. FAQ systems also reduce labor costs.
● Translation: Google Translate and the like. Weighting by importance is used to improve accuracy; this is the attention mechanism.
● Speech recognition: smart speakers such as Amazon Echo and Google Home, and Dragon Speech (developed in the speech recognition field). Some parts use AI and some do not. To build your own product, you need the audio data required to improve accuracy. You end up collecting a huge amount of data, yet it is unclear whether the accuracy will be sufficient. Trial-and-error efforts such as audio compression are also needed to get the system off the ground.
● Go / Shogi AI: AlphaGo (Go), Ponanza (Shogi), etc. ⇒ closely tied to deep learning technology, e.g. hybrids of reinforcement learning and CNNs.
☆ Confirmation test ☆
5-6-1 Let's put an example of animal classification in this diagram.
☆ Confirmation test ☆
5-6-2 Write this formula in Python (numpy library).
[My answer] $ u = np.dot (x, W) + b $ [Answer] $ u_1 = np.dot (x, W_1) + b_1 $
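As a check only (shapes and values here are made up for illustration, not taken from the distributed notebook), the total input can be computed with numpy like this:

```python
import numpy as np

# Hypothetical example: 2 input nodes feeding 3 nodes in the first middle layer
x = np.array([1.0, 2.0])      # input values
W1 = np.random.rand(2, 3)     # weights of layer 1
b1 = np.random.rand(3)        # biases of layer 1

u1 = np.dot(x, W1) + b1       # total input: u_1 = x W_1 + b_1
print(u1)                     # shape (3,)
```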
☆ Confirmation test ☆ 5-7-1 Extract the source that defines the output of the middle layer from the 1-1 file (= 1_1_forward_propagation).
[My answer] As shown in the image.
[Answer] z = functions.relu (u)
☆ Confirmation test ☆ 5-8-1 In "1_1_forward_propagation.ipynb", check the operation of forward propagation (single layer / single unit) using the "Let's try" weight / bias.
⇒ I decided to enter random values of my choice for both the weight and the bias.
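A minimal sketch of that "Let's try" (the values, seed, and the ReLU activation here are my assumptions, not the exact distributed cell):

```python
import numpy as np

# Single layer / single unit: replace the fixed weight and bias with random values
np.random.seed(0)             # hypothetical seed, only for reproducibility
x = np.array([2.0, 3.0])      # input
W = np.random.rand(2)         # random weight
b = np.random.rand()          # random bias

u = np.dot(x, W) + b          # total input
z = np.maximum(0.0, u)        # ReLU as the activation, as in the notebook
print("u =", u, "z =", z)
```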
● What is the activation function? A non-linear function in a neural network that determines the magnitude of the output to the next layer. Depending on the input value, it determines whether the signal to the next layer is ON or OFF and how strong it is.
⇒ By adjusting the strength of the total signal (like the electric signal in a neuron), judgments closer to a human's become possible.
☆ Confirmation test ☆
5-9-1 Explain the difference between linear and non-linear with a diagram.
● Activation function for the middle layer ・ ReLU function ・ Sigmoid function (logistic function) ・ Step function (function that was the basis of deep learning)
● Activation function for output layer ・ Softmax function ・ Identity map ・ Sigmoid function (logistic function)
● Step function: had the following issues. ・ It cannot express anything between 0 and 1 (only ON and OFF can be expressed). ・ It could only learn linearly separable problems.
[Formula]
f(x) = \begin{cases} 1 & (x \geq 0)\\ 0 & (x < 0)\end{cases}
● Sigmoid function (logistic function): a function that changes slowly between 0 and 1. ・ It can convey the strength of a signal, which helped trigger the spread of neural networks for prediction. However, it has the following issue: for large input values the change in output is small, which causes the **vanishing gradient problem**.
[Formula]
f(u) = \frac{1}{1 + e^{-u}}
● ReLU function: currently the most used activation function. ・ It avoids the **vanishing gradient problem** and achieves good results by contributing to **sparsification**. ・ Don't stick to ReLU only; use various functions depending on the configuration. Sometimes the sigmoid function produces better results.
[Formula]
f(x) = \begin{cases} x & (x > 0)\\ 0 & (x \leq 0)\end{cases}
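As a reference sketch (my own minimal numpy versions, not the distributed functions module), the three middle-layer activation functions above can be written as follows:

```python
import numpy as np

def step_function(x):
    # only ON (1) / OFF (0); cannot express anything in between
    return np.where(x > 0, 1.0, 0.0)

def sigmoid(x):
    # changes slowly between 0 and 1; the gradient becomes small for large |x|
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # 0 for negative inputs, identity for positive inputs
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(step_function(x))
print(sigmoid(x))
print(relu(x))
```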
● Fully connected NN, single layer / multiple nodes: the middle layer has multiple nodes. ・ Because the number of weights and biases increases, the amount of calculation also increases and the load grows. ・ An extension of the single layer / single node case.
☆ Confirmation test ☆
Extract the relevant part from the distributed source code.
[My answer]
【answer】
Check the operation of each activation function using the distributed source.
Both the sigmoid function and the step function worked properly.
● Role of the output layer: convert the result into a form that humans can understand by applying a function. ・ This becomes important when showing results to the client.
● Error function ・ As data prepared in advance, "input data" and "training data (correct values)" are required. ・ The values output as the result of classification take the form "probability of XX%". ・ If some prediction reaches 100%, it should be viewed with a little suspicion. ・ The error between the calculated values and the training data is taken quantitatively using a function. ・ Squared error is not used for classification.
☆ Confirmation test ☆ 5-10-1 Describe why we square rather than simply subtract. [My answer] Since the error can be ±, we want to take an absolute value, but that makes the calculation harder, so we square instead.
【answer】 To make the value positive.
5-10-2 Describe what the 1/2 in the formula below means. [My answer] It makes things easier when differentiating the sum of squared errors. 【answer】 To simplify the differentiation.
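A one-line check of how the 1/2 simplifies the differentiation (using the squared error defined below):

\frac{\partial}{\partial y_i}\, \frac{1}{2}\sum_{i=1}^{I}(y_i - d_i)^2 = \frac{1}{2}\cdot 2\,(y_i - d_i) = y_i - d_i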
● Output layer activation function ・ Difference between output layer and intermediate layer: Middle layer: adjusts the signal strength before and after a threshold. Output layer: converts the signal magnitude (ratio) as it is.
・ Probability output: for classification problems, the output of the output layer is limited to the range 0 to 1, and the sum must be 1.
⇒The activation functions used in the output layer and the intermediate layer are different.
[Regression] Activation function: identity map / Error function: squared error
[Binary classification] Activation function: sigmoid function / Error function: cross entropy
[Multi-class classification] Activation function: softmax function / Error function: cross entropy
[Sigmoid function (mathematical formula)]
f(u) = \frac{1}{1 + e^{-u}}
☆ Confirmation test ☆ [Softmax function (mathematical formula)]
f(i, u) = \frac{e^{u_i}}{\sum_{k=1}^{K} e^{u_k}}
Show the source code corresponding to parts ① to ③ of the formula, and explain the process line by line.
[My answer] ① y.T ... the result computed by the softmax ② np.exp(x) ... the exponential of x ③ np.sum(np.exp(x), axis=0) ... the sum of the exponentials of x.
【answer】 As a premise, the formula as written cannot handle multidimensional (batched) input directly; in the code, a 2-D x is transposed and the max value is subtracted from each sample to prevent overflow.
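A minimal sketch of a numerically stable softmax in the same spirit as the distributed functions.softmax (the exact distributed source may differ slightly):

```python
import numpy as np

def softmax(x):
    if x.ndim == 2:
        x = x.T                                    # work column-wise on a batch
        x = x - np.max(x, axis=0)                  # subtract the max to prevent overflow
        y = np.exp(x) / np.sum(np.exp(x), axis=0)  # ② exp / ③ sum of exp
        return y.T                                 # ① transpose back to the original shape
    x = x - np.max(x)                              # overflow countermeasure for one sample
    return np.exp(x) / np.sum(np.exp(x))

print(softmax(np.array([1.0, 2.0, 3.0])))          # the outputs sum to 1
```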
● Mean squared error [Formula]
E_n(W) = \frac{1}{2} \sum_{i=1}^{I}(y_i - d_i)^2
```python
def mean_squared_error(d, y):
    return np.mean(np.square(d - y)) / 2
```
☆ Confirmation test ☆ [Cross entropy (mathematical formula)]
E_n(W) = -\sum_{i=1}^{I} d_i \log y_i
Show the source code corresponding to parts ① to ② of the formula, and explain the process line by line.
[My answer] ① return -np.sum(np.log(y[np.arange(batch_size), d] + 1e-7)) / batch_size ② return -np.sum(np.log(y[np.arange(batch_size), d] + 1e-7)) / batch_size ⇒ Because the value inside the log may become 0, a small value 1e-7 is added so that the argument of the log never becomes 0.
```python
def cross_entropy_error(d, y):
    if y.ndim == 1:
        d = d.reshape(1, d.size)
        y = y.reshape(1, y.size)

    # If the teacher data is a one-hot vector, convert it to the index of the correct label
    if d.size == y.size:
        d = d.argmax(axis=1)

    batch_size = y.shape[0]
    return -np.sum(np.log(y[np.arange(batch_size), d] + 1e-7)) / batch_size
```
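A hypothetical usage check (values made up here), assuming the cross_entropy_error above and numpy imported as np; a one-hot vector and the equivalent class index should give the same result:

```python
y = np.array([[0.1, 0.8, 0.1]])          # softmax output for one sample
d_onehot = np.array([[0, 1, 0]])         # teacher data as a one-hot vector
d_index = np.array([1])                  # the same label as a class index

print(cross_entropy_error(d_onehot, y))  # ≈ 0.223 ( = -log(0.8 + 1e-7) )
print(cross_entropy_error(d_index, y))   # same value
```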
● Gradient descent method ・ Purpose of deep learning: to build, through learning, a network that minimizes the error ⇒ find the parameter $w$ that minimizes the error $E(w)$ ⇒ **use the gradient descent method** to optimize the parameters.
[Gradient descent method]
w^{(t+1)}=w^{(t)} - ε∇E\\
∇E = \frac{\partial E}{\partial w} = \biggl\{\frac{\partial E}{\partial w_1} \cdots \frac{\partial E}{\partial w_M} \biggr\}
☆ Confirmation test ☆ Let's find the corresponding source code.
[My answer] Parameters ... network[key] -= learning_rate * grad[key]  $∇E$ ... grad = backward(x, d, z1, y)
【answer】 Parameters ... network[key] -= learning_rate * grad[key]  $∇E$ ... grad = backward(x, d, z1, y)
[$ε$: Learning rate] ・ Large learning rate: it cannot converge well and diverges. ・ Small learning rate: it does not diverge, but convergence takes time. ・ Global minimum solution: the point that is the minimum over the whole.
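A hypothetical one-dimensional illustration (E(w) = w², so ∇E = 2w, chosen only for this sketch) of how a large ε diverges while a small ε converges only slowly:

```python
def gradient_descent(epsilon, steps=20):
    w = 1.0
    for _ in range(steps):
        grad = 2 * w              # dE/dw for E(w) = w**2
        w = w - epsilon * grad    # w^(t+1) = w^(t) - eps * grad
    return w

print(gradient_descent(1.1))      # large learning rate: |w| keeps growing (diverges)
print(gradient_descent(0.01))     # small learning rate: heads toward 0, but slowly
```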
● Gradient descent algorithms: several algorithms for setting the learning rate and improving convergence have been published and are often used: ・ Momentum ・ AdaGrad ・ Adadelta ・ Adam
● Stochastic gradient descent (SGD) ・ Advantages of stochastic gradient descent: ・ reduces calculation cost when the data is redundant ・ reduces the risk of converging on an unwanted local minimum ・ allows online learning. [Stochastic gradient descent]
w^{(t+1)}=w^{(t)} - ε∇E_n
[Gradient descent method]
w^{(t+1)}=w^{(t)} - ε∇E
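A minimal sketch of the difference: SGD updates with the gradient of the error $E_n$ of one randomly chosen sample per step (the model and data below are hypothetical, only to make the sketch runnable):

```python
import numpy as np

# hypothetical 1-parameter linear model y = w * x with per-sample squared error
X = np.array([1.0, 2.0, 3.0, 4.0])
D = 2.0 * X                              # teacher data (true w = 2)
w, epsilon = 0.0, 0.05

for t in range(100):
    n = np.random.randint(len(X))        # pick one sample at random
    grad_n = (w * X[n] - D[n]) * X[n]    # dE_n/dw for E_n = 1/2 (w x_n - d_n)^2
    w = w - epsilon * grad_n             # w^(t+1) = w^(t) - eps * grad E_n
print(w)                                 # approaches 2
```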
☆ Confirmation test ☆ What is online learning? Summarize in two lines.
[My answer] Even during real-time prediction processing, the weights and biases can be updated from the error.
【answer】 For example, on Facebook and similar services, you can keep learning using only the information of newly registered users.
● Mini-batch gradient descent method ・ Uses the average error over the samples belonging to a randomly split subset of the data (a mini-batch) $D_t$. ・ Without losing the merits of stochastic gradient descent, the computer's computing resources can be used effectively ⇒ thread parallelization on the CPU and SIMD parallelization on the GPU.
[Mini batch gradient descent method]
w^{(t+1)}=w^{(t)} - ε∇E_t\\
E_t = \frac{1}{N_t}\sum_{n \in{D_t}}E_n\\
N_t = |D_t|
[Stochastic gradient descent]
w^{(t+1)}=w^{(t)} - ε∇E_n
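A minimal sketch of cutting the data into random mini-batches $D_t$ (the names and shapes are hypothetical); each batch would then be used for one update with the average error $E_t$ above:

```python
import numpy as np

def iterate_minibatches(X, D, batch_size):
    # randomly shuffle the sample indices, then slice them into mini-batches D_t
    indices = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        yield X[batch], D[batch]

X = np.random.rand(10, 2)   # hypothetical inputs
D = np.random.rand(10, 1)   # hypothetical teacher data
for x_batch, d_batch in iterate_minibatches(X, D, batch_size=4):
    print(x_batch.shape, d_batch.shape)   # batches of size 4, 4, 2
```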
☆ Confirmation test ☆
[Mini batch gradient descent method] w^{(t+1)}=w^{(t)} - ε∇E_t
Explain the meaning of this formula in a diagram.
[My answer] Since $D_t$ is a set of randomly divided data (a mini-batch), group suitable samples, calculate the error for each group, and finally calculate the overall error. (I couldn't show it as a diagram ...)
【answer】
● Calculation of error gradient How to calculate
∇E = \frac{\partial E}{\partial w} = \biggl\{\frac{\partial E}{\partial w_1} \cdots \frac{\partial E}{\partial w_M} \biggr\}
[Numerical differentiation] A general method of generating tiny perturbations in a program and computing the derivative approximately.
\frac{\partial E}{\partial w_m} \approx \frac{E(w_m + h)- E(w_m - h)}{2h}
However, it has a major disadvantage: to compute $E(w_m + h)$ and $E(w_m - h)$ for every parameter $w_m$, the forward propagation calculation has to be repeated, which increases the load.
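A minimal sketch of this numerical differentiation (the error function here is a made-up stand-in for a real forward propagation); note the two evaluations of f per parameter, which is exactly the load mentioned above:

```python
import numpy as np

def numerical_gradient(f, w, h=1e-4):
    grad = np.zeros_like(w)
    for m in range(w.size):
        tmp = w[m]
        w[m] = tmp + h
        e_plus = f(w)                       # E(w_m + h): one full forward propagation
        w[m] = tmp - h
        e_minus = f(w)                      # E(w_m - h): another full forward propagation
        grad[m] = (e_plus - e_minus) / (2 * h)
        w[m] = tmp                          # restore the parameter
    return grad

E = lambda w: np.sum(w ** 2)                # hypothetical error function
print(numerical_gradient(E, np.array([1.0, -2.0, 3.0])))   # ≈ [2, -4, 6]
```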
⇒ ** Use the error back propagation method **
● Deep learning development environment ・ Local: CPU, GPU ・ Cloud: AWS, GCP
● Calculation of error gradient How to calculate
∇E = \frac{\partial E}{\partial w} = \biggl\{\frac{\partial E}{\partial w_1} \cdots \frac{\partial E}{\partial w_M} \biggr\}
[Error back propagation method] The calculated error is differentiated in order from the output layer side and propagated to the layer before it, and the one before that. A method that computes the differential value of each parameter **analytically** with a minimum of calculation.
By back-calculating the derivative from the calculation result (= the error), unnecessary recursive calculation is avoided and the derivative can be computed. ⇒ It uses the chain rule.
☆ Confirmation test ☆ In the error back propagation method, unnecessary recursive processing can be avoided. Extract the source code that holds the calculation results that have already been performed.
[My answer]
```python
# Error back propagation
def backward(x, d, z1, y):
    print("\n##### Error back propagation start #####")

    grad = {}

    W1, W2 = network['W1'], network['W2']
    b1, b2 = network['b1'], network['b2']

    # Delta at the output layer
    delta2 = functions.d_sigmoid_with_loss(d, y)
    # Gradient of b2
    grad['b2'] = np.sum(delta2, axis=0)
    # Gradient of W2: reuses z1, the middle-layer output already computed in forward propagation
    grad['W2'] = np.dot(z1.T, delta2)
    # Delta in the middle layer: reuses delta2 computed above instead of recomputing it
    delta1 = np.dot(delta2, W2.T) * functions.d_relu(z1)
    # Gradient of b1
    grad['b1'] = np.sum(delta1, axis=0)
    # Gradient of W1
    grad['W1'] = np.dot(x.T, delta1)

    print_vec("Partial differential_dE/du2", delta2)
    print_vec("Partial differential_dE/du1", delta1)
    print_vec("Partial differential_Weight 1", grad["W1"])
    print_vec("Partial differential_Weight 2", grad["W2"])
    print_vec("Partial differential_Bias 1", grad["b1"])
    print_vec("Partial differential_Bias 2", grad["b2"])

    return grad
```
【answer】 Is it the same ...?
● Error back propagation method
\begin{align}
&E(y) = \frac{1}{2}\sum^J_{j=1}(y_j - d_j)^2 = \frac{1}{2}||y - d||^2 \quad \text{⇒ error function: squared error}\\
&y = u^{(L)} \quad \text{⇒ output layer activation function: identity map}\\
&u^{(l)} = w^{(l)}z^{(l-1)} + b^{(l)} \quad \text{⇒ calculation of total input}\\
&\frac{\partial E}{\partial w^{(2)}_{ji}}=\frac{\partial E}{\partial y}\frac{\partial y}{\partial u}\frac{\partial u}{\partial w^{(2)}_{ji}}\\
&\frac{\partial E(y)}{\partial y} = \frac{\partial}{\partial y}\frac{1}{2}||y - d||^2 = y - d\\
&\frac{\partial y(u)}{\partial u} = \frac{\partial u}{\partial u} = 1
\end{align}
\frac{\partial u(w)}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}}\bigl(w^{(l)}z^{(l-1)}+b^{(l)}\bigr)
=\frac{\partial}{\partial w_{ji}} \left(
\begin{bmatrix}
w_{11}z_1 + \cdots + w_{1i}z_i + \cdots + w_{1I}z_I \\
\vdots\\
w_{j1}z_1 + \cdots + w_{ji}z_i + \cdots + w_{jI}z_I\\
\vdots\\
w_{J1}z_1 + \cdots + w_{Ji}z_i + \cdots + w_{JI}z_I
\end{bmatrix}
+
\begin{bmatrix} b_1\\ \vdots\\ b_j\\ \vdots\\ b_J \end{bmatrix}
\right)
=\begin{bmatrix} 0\\ \vdots\\ z_i\\ \vdots\\ 0 \end{bmatrix}
● Try to practice using the source code.
① Set the activation function to the ReLU function.
② Set the activation function to the sigmoid function × ReLU function.
③ Set the activation function to the sigmoid function × sigmoid function.
④ Set the input values to random values from -5 to 5.
☆ Confirmation test ☆ Find the source code that corresponds to the two blanks.
\frac{\partial E}{\partial y}
⇒delta2 = functions.d_mean_squared_error(d, y)
\frac{\partial E}{\partial y}\frac{\partial y}{\partial u}
⇒?
\frac{\partial E}{\partial y}\frac{\partial y}{\partial u}\frac{\partial u}{\partial w^{(2)}_{ji}}
⇒?
[My answer] I don't quite understand this one. It's hard just from watching the video normally.
【answer】
・ Since the output layer uses an identity map, y = u, so ∂y/∂u = 1.
・ Take the inner product: the inner product of the transposed middle-layer activation output (z1) and the value passed to delta2. grad['W2'] = np.dot(z1.T, delta2)
For implementation, obtain the knowledge needed for the design from the papers. Check the graphs of the experimental results as early as possible and understand them visually.