Overview of neural network
The flow is input layer → intermediate layer → output layer.
Example: a neural network that determines which animal the given data describes.
Input layer: the layer that receives arbitrary data. Example: body length, ear size, etc.
Intermediate layer: the layer that applies weights and biases to the data from the input layer.
Output layer: the layer that outputs the values calculated in the intermediate layer as the result. Example: from the entered data, this animal has a probability of 0.1 of being a dog, 0.05 of being a cat, and 0.85 of being a mouse.
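As a minimal sketch of this flow (the layer sizes, weight values, and variable names here are made up for illustration and are not taken from the distributed code):

```python
import numpy as np

np.random.seed(0)

# input layer: arbitrary animal data, e.g. [body length, ear size]
x = np.array([60.0, 4.0])

# intermediate layer: weight and bias the input data
W1 = np.random.rand(2, 3)                 # 2 inputs -> 3 intermediate nodes
b1 = np.random.rand(3)
z = np.maximum(0, np.dot(x, W1) + b1)     # ReLU activation

# output layer: convert to probabilities for (dog, cat, mouse)
W2 = np.random.rand(3, 3)
b2 = np.random.rand(3)
u = np.dot(z, W2) + b2
y = np.exp(u - np.max(u)) / np.sum(np.exp(u - np.max(u)))   # softmax
print(y)    # three probabilities that sum to 1
```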
Confirmation test (1-2) Describe in less than two lines what deep learning is ultimately trying to do. Also, which of the following values is the ultimate goal of optimization? Choose all that apply. ① Input value [X] ② Output value [Y] ③ Weight [W] ④ Bias [b] ⑤ Total input [u] ⑥ Intermediate layer input [z] ⑦ Learning rate [ρ]
Estimate the weights and biases, the parameters for estimating the optimum output value, so that the error is minimized. The values that are ultimately optimized are therefore ③ Weight [W] and ④ Bias [b].
Draw the following network on paper. Input layer: 2 nodes, 1 layer; intermediate layer: 3 nodes, 2 layers; output layer: 1 node, 1 layer.
(Screenshot of the hand-drawn network diagram omitted.)
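The same structure can also be expressed roughly in code as layer shapes (an illustrative sketch only; the weight values are placeholders):

```python
import numpy as np

# input layer: 2 nodes (1 layer), intermediate layers: 3 nodes x 2 layers, output layer: 1 node
layer_sizes = [2, 3, 3, 1]

# one weight matrix and one bias vector for each connection between consecutive layers
network = {}
for i in range(len(layer_sizes) - 1):
    network['W' + str(i + 1)] = np.random.rand(layer_sizes[i], layer_sizes[i + 1])
    network['b' + str(i + 1)] = np.zeros(layer_sizes[i + 1])

for key in sorted(network):
    print(key, network[key].shape)   # W1 (2, 3), W2 (3, 3), W3 (3, 1), b1 (3,), ...
```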
Confirmation test (1-4) Let's fill in an example of animal classification in the diagram below.
Write this formula in Python.
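The formula referred to here is the total input u = Wx + b; a minimal NumPy version with made-up values (the distributed file uses its own variable names):

```python
import numpy as np

x = np.array([1.0, 2.0])             # input values
W = np.array([[0.1, 0.3, 0.5],
              [0.2, 0.4, 0.6]])      # weights (2 inputs -> 3 nodes)
b = np.array([0.1, 0.2, 0.3])        # biases

# the formula u = Wx + b written in Python
u = np.dot(x, W) + b
print(u)                             # [0.6 1.3 2. ]
```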
Confirmation test (1-5) Extract the source code that defines the output of the intermediate layer from the 1-1 file.
Definition of the intermediate layer output:
z = functions.relu(u)
print_vec("intermediate layer output", z)
The u in z = functions.relu(u) represents the total input expression (u = Wx + b).
Exercise test
In the initial state, the weights and the bias are defined explicitly: the input values are multiplied by the weights and the bias is added to obtain the total input. Next, when the weight array is changed to np.zeros(2), the weights become [0, 0] and the total input changes accordingly. With np.ones(2) below it, the weights become [1, 1]. np.random.rand(2) generates the weights randomly in the range 0 to 1, so the total input is determined by random numbers.
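A small sketch of these cases (the input and initial weight values are made up for illustration):

```python
import numpy as np

x = np.array([2.0, 3.0])        # input values

# initial state: weights and bias defined explicitly
W = np.array([0.5, 1.5])
b = 0.1
print(np.dot(x, W) + b)         # total input 5.6

W = np.zeros(2)                 # weights become [0. 0.]
print(np.dot(x, W) + b)         # total input 0.1

W = np.ones(2)                  # weights become [1. 1.]
print(np.dot(x, W) + b)         # total input 5.1

W = np.random.rand(2)           # weights drawn randomly from the range 0 to 1
print(np.dot(x, W) + b)         # total input changes on every run
```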
Activation function
In a neural network, a non-linear function that determines the output passed to the next layer: depending on the input value, it controls the strength (or on/off) of the signal sent to the next layer.
Activation functions can be classified into those for the intermediate layer and those for the output layer.
For the intermediate layer: ReLU function, sigmoid function, step function (not used in deep learning).
For the output layer: softmax function, identity mapping, sigmoid function.
Step function
A function that outputs 1 when the threshold is exceeded and 0 otherwise; the output is only 0 or 1. It was used in the perceptron. Its problem is that it cannot express gradual changes between 0 and 1, so it can only be used when linear classification (simple on/off) is possible.
Sigmoid function
A function that changes smoothly between 0 and 1, so it can convey the strength of a signal in addition to on/off. Its problem is that for large input values the change in output is small, which causes the vanishing gradient problem.
ReLU function
The most used activation function today. It contributes to solving the vanishing gradient problem and to making the network sparse (both are described later). Although it is the most used, depending on the application the sigmoid function may produce more accurate results, so the engineer needs to determine the most appropriate activation function each time.
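For reference, minimal NumPy definitions of the three intermediate-layer activation functions described above (illustrative sketches, not the distributed functions module):

```python
import numpy as np

def step_function(x):
    # 1 when the threshold (0) is exceeded, otherwise 0
    return np.where(x > 0, 1, 0)

def sigmoid(x):
    # changes smoothly between 0 and 1, so signal strength is conveyed
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # 0 for negative inputs, identity for positive inputs
    return np.maximum(0, x)

u = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(step_function(u))   # [0 0 0 1 1]
print(sigmoid(u))         # values between 0 and 1
print(relu(u))            # [0.  0.  0.  0.5 2. ]
```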
Confirmation test (1-8)
Extract the relevant part from the distributed source code.
z = functions.sigmoid(u)
Error function
The squared error is used as the error function. Squared error: calculated by squaring the difference between the output data and the correct (training) data and summing it.
Confirmation test (1-9) Describe why the differences are squared instead of simply subtracted. With a plain subtraction the sign (±) of each difference matters, so positive and negative errors cancel out and the total error cannot be evaluated correctly. Squaring removes the influence of the ± sign.
Describe what the 1/2 means. When minimizing with respect to the parameters, the error function is differentiated to obtain the gradient. Multiplying by 1/2 cancels the factor 2 that appears when the square is differentiated, which simplifies the calculation.
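Written out for the squared error, the 1/2 cancels the 2 produced by differentiating the square:

E = \frac{1}{2} \sum_{n} (y_n - d_n)^2
\quad\Longrightarrow\quad
\frac{\partial E}{\partial y_n} = \frac{1}{2} \cdot 2 (y_n - d_n) = y_n - d_n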
Activation function for the output layer
Difference from the intermediate layer: the intermediate layer adjusts the signal strength around the threshold, whereas the output layer converts the signal while keeping the ratio of its magnitudes as it is (e.g. into probabilities).
Regression task: use the identity mapping; the error function is the squared error.
Binary classification: use the sigmoid function; the error function is cross entropy.
Multi-class classification: use the softmax function; the error function is cross entropy.
Confirmation test (1-9)
Show the source code corresponding to ① to ③ of the softmax function, and explain the processing line by line.

Source code of the softmax function:
def softmax(x):
    if x.ndim == 2:                                    # if x is two-dimensional, process it batch-wise
        x = x.T                                        # store the transpose of the argument x in x
        x = x - np.max(x, axis=0)                      # subtract the column-wise maximum of the transposed values (overflow countermeasure)
        y = np.exp(x) / np.sum(np.exp(x), axis=0)      # probability of each value relative to the whole
        return y.T                                     # return the transpose of y

The place corresponding to ① is def softmax(x):
The place corresponding to ② is np.exp(x)
The place corresponding to ③ is np.sum(np.exp(x), axis=0)
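A quick check of the behavior using the softmax function shown above (the input values are made up):

```python
import numpy as np

u = np.array([[0.3, 2.9, 4.0]])   # one row of total inputs to the output layer
y = softmax(u)                    # softmax as defined above
print(y)                          # roughly [[0.018 0.245 0.737]]
print(np.sum(y, axis=1))          # [1.], so the outputs can be read as probabilities
```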
Confirmation test (1-10)
Show the source code corresponding to (1) to (3) of the cross entropy function, and explain the processing line by line.

Source code for the cross entropy function:
def cross_entropy_error(d, y):
    if y.ndim == 1:                    # if y is one-dimensional, perform the following processing
        d = d.reshape(1, d.size)       # reshape the argument d into a matrix with one row
        y = y.reshape(1, y.size)       # reshape the argument y into a matrix with one row
    if d.size == y.size:               # if the data sizes of d and y are the same (d is one-hot),
        d = d.argmax(axis=1)           # convert d to the index of the correct class
    batch_size = y.shape[0]
    # a small value (1e-7) is added so that the argument of the logarithm does not become 0
    # (an implementation trick)
    return -np.sum(np.log(y[np.arange(batch_size), d] + 1e-7)) / batch_size

The location corresponding to ① is def cross_entropy_error(d, y):
The place corresponding to ② is return -np.sum(np.log(y[np.arange(batch_size), d] + 1e-7))
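Similarly, a small check of cross_entropy_error using the function defined above (the label and output values are made up):

```python
import numpy as np

d = np.array([[0, 0, 1]])              # one-hot correct label: the third class
y = np.array([[0.10, 0.05, 0.85]])     # network output probabilities
print(cross_entropy_error(d, y))       # -log(0.85 + 1e-7) ≈ 0.1625
```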
Gradient descent
There are three variants: gradient descent, stochastic gradient descent, and mini-batch gradient descent.
The purpose of deep learning is to estimate the weights and biases, the parameters for estimating the optimum output value, so that the error is minimized (confirmation test 1-2). The gradient descent method is used as the method of estimation.
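The gradient descent update with learning rate ε can be written as:

\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \varepsilon \nabla E,
\qquad
\nabla E = \frac{\partial E}{\partial \mathbf{w}}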
Confirmation test (1-11)
Let's find the corresponding source code.
network[key] -= learning_rate * grad[key]
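In context, that line is applied to every parameter of the network. A minimal, self-contained sketch of the update step (the parameter and gradient values here are dummies):

```python
import numpy as np

# dummy parameters and gradients, just to illustrate the update step
network = {'W1': np.ones((2, 3)), 'b1': np.zeros(3)}
grad = {'W1': np.full((2, 3), 0.5), 'b1': np.full(3, 0.2)}

learning_rate = 0.01

# gradient descent update: w <- w - ε * ∂E/∂w for every parameter
for key in network:
    network[key] -= learning_rate * grad[key]

print(network['W1'])   # each weight decreased by 0.005
print(network['b1'])   # each bias decreased by 0.002
```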
The learning rate ε affects the efficiency of learning. If the learning rate is too large, the search overshoots the minimum and diverges. If the learning rate is too small, it takes a very long time to reach the minimum. The number of training iterations is counted in epochs.
Stochastic Gradient Descent (SGD)
Stochastic gradient descent uses the error of a randomly sampled example, whereas gradient descent uses the average error over all samples; remember this difference. Its advantages are a reduced amount of computation, a reduced risk of converging to a local minimum, and the ability to perform online learning.
Confirmation test (1-11) Summarize in two lines what online learning is. Each time new training data is input, learning is performed using only that input data.
Mini-batch gradient descent
Mini-batch gradient descent uses the mean error of the samples belonging to a randomly divided subset of the data (a mini-batch).
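A sketch of how the random division into mini-batches could be done (toy data with made-up sizes):

```python
import numpy as np

# toy training data: 10 samples with 2 features each, and their targets
X = np.random.rand(10, 2)
D = np.random.rand(10, 1)

batch_size = 4
indices = np.random.permutation(len(X))   # randomly divide the data

for start in range(0, len(X), batch_size):
    batch_idx = indices[start:start + batch_size]
    x_batch, d_batch = X[batch_idx], D[batch_idx]
    # the gradient step would use the mean error over this mini-batch
    print(batch_idx, x_batch.shape)
```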
Error backpropagation
A method of differentiating the calculated error and propagating it backward from the output layer to the preceding layers. This avoids unnecessary recursive calculations.
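The quantity that is propagated backward comes from the chain rule; for example, for a weight w in the output layer:

\frac{\partial E}{\partial w}
= \frac{\partial E}{\partial y}\,
  \frac{\partial y}{\partial u}\,
  \frac{\partial u}{\partial w}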
Confirmation test (1-14)
Error backpropagation can avoid unnecessary recursive processing. Extract the source code that holds the results of calculations that have already been performed.
Source code for error backpropagation:
def backward(x, d, z1, y):
    print("\n##### Error backpropagation start #####")
    grad = {}
    W1, W2 = network['W1'], network['W2']
    b1, b2 = network['b1'], network['b2']
    # delta at the output layer
    delta2 = functions.d_sigmoid_with_loss(d, y)
    # gradient of b2
    grad['b2'] = np.sum(delta2, axis=0)
    # gradient of W2
    grad['W2'] = np.dot(z1.T, delta2)
    # delta at the intermediate layer
    delta1 = np.dot(delta2, W2.T) * functions.d_relu(z1)
    # gradient of b1
    grad['b1'] = np.sum(delta1, axis=0)
    # gradient of W1
    grad['W1'] = np.dot(x.T, delta1)
    return grad
① delta2 = functions.d_sigmoid_with_loss(d, y)
Holds the result of combining the derivative of the cross entropy with that of the sigmoid function.
② grad['b2'] = np.sum(delta2, axis=0)
Holds the value obtained by summing ① over the batch dimension, i.e. the gradient with respect to b2.
③ grad['W2'] = np.dot(z1.T, delta2)
Holds the product of the transposed intermediate layer output z1 and ①, i.e. the gradient with respect to W2.
Exercise
In the result above, the ReLU function is used for forward propagation and error backpropagation, and the result plot converges neatly. Next, the result plot below shows the case where the activation function used in forward propagation and error backpropagation is changed to the sigmoid function.
The visualization shows that convergence becomes uneven after changing to the sigmoid function. Because of how the code is implemented, the forward-propagation activation function and the corresponding error backpropagation (derivative) function must be changed as a set.
The result of changing the range of x values in the training data is shown in the result plot below. It was confirmed that convergence also varies when the range of x values is changed.
Confirmation test (1-15)
Find the source code that corresponds to the two blanks.
Upper blank: delta2 = functions.d_mean_squared_error(d, y)
Lower blank: grad['W2'] = np.dot(z1.T, delta2)