Start studying: Saturday, December 7th
Teaching materials, etc.:
・ Miyuki Oshige, "Details! Python3 Introductory Note" (Sotec, 2017): read 12/7 (Sat) - 12/19 (Thu)
・ Progate Python course (5 courses in total): finished 12/19 (Thu) - 12/21 (Sat)
・ Andreas C. Müller, Sarah Guido, "(Japanese title) Machine Learning Starting with Python" (O'Reilly Japan, 2017): read 12/21 (Sat) - 12/23 (Mon)
・ Kaggle: Real or Not? NLP with Disaster Tweets: submitted and adjusted 12/28 (Sat) - 1/3 (Fri)
・ Wes McKinney, "(Japanese title) Introduction to Data Analysis with Python" (O'Reilly Japan, 2018): read 1/4 (Sat) - 1/13 (Mon)
・ **Yasuki Saito, "Deep Learning from Zero" (O'Reilly Japan, 2016): 1/15 (Wed) ~**
Finished reading up to p. 239, Chapter 7 (convolutional neural networks).
-Optimization: finding the optimal parameters, i.e. the parameters that make the value of the loss function as small as possible. Because the parameter space is complex, this is a very difficult problem. Several optimizers exist for it.
-**Stochastic gradient descent (SGD)**: the method learned up to Chapter 5. Using the gradient of the loss with respect to the parameters, the parameters are updated repeatedly in the direction of the gradient, gradually approaching the optimum.
W ← W - η\frac{\partial L}{\partial W}
η is the learning rate; the left-hand side is updated with the value of the right-hand side. The disadvantage of SGD is that the search path tends to be inefficient when the loss function is elongated, that is, not isotropic.
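As a minimal sketch of this update rule (the `SGD` class name and the `params`/`grads` dictionary interface are illustrative assumptions, close in spirit to the book's code but not quoted from it):

```python
class SGD:
    """Plain stochastic gradient descent: W <- W - lr * dL/dW."""
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        # params and grads are dicts of NumPy arrays keyed by name, e.g. 'W1', 'b1'
        for key in params:
            params[key] -= self.lr * grads[key]
```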
-**Momentum**: uses the physical concept of momentum. A new variable v (velocity) is introduced; the αv term acts like friction or air resistance, so the update gradually slows down when no force (gradient) is applied.
v ← αv - η\frac{\partial L}{\partial W}
W ← W + v
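A minimal sketch under the same assumed `params`/`grads` interface (the velocity dictionary `self.v` is an illustrative implementation detail):

```python
import numpy as np

class Momentum:
    """Momentum SGD: v <- alpha*v - lr*dL/dW, then W <- W + v."""
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None

    def update(self, params, grads):
        if self.v is None:
            # one velocity array per parameter, initialized to zero
            self.v = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params:
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]
```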
-**AdaGrad**: a method that adapts the learning rate according to how far learning has progressed: it is large at first and then becomes smaller.
h ← h + \frac{\partial L}{\partial W} ⊙ \frac{\partial L}{\partial W}
W ← W - η\frac{1}{\sqrt{h}}\frac{\partial L}{\partial W}
⊙ denotes the Hadamard product, i.e. element-wise multiplication of matrices. The larger h becomes (the more a parameter has moved), the smaller that parameter's effective learning rate. In other words, the scale of learning is adjusted per parameter as the parameters are updated.
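A minimal sketch of the update above; the small constant 1e-7 added before the division is a standard guard against dividing by zero, not part of the formula itself:

```python
import numpy as np

class AdaGrad:
    """AdaGrad: accumulate squared gradients in h and scale the step by 1/sqrt(h)."""
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params:
            self.h[key] += grads[key] * grads[key]  # element-wise square (Hadamard product)
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
```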
-**Adam**: a method that combines Momentum and AdaGrad.
-As mentioned above, there are various optimizers, each with its own strengths and weaknesses, so no single one can be called superior. (That said, many studies still use SGD.)
・**Weight decay**: a method that steers learning so that the weight parameters stay small. Keeping the weights small makes overfitting less likely, which improves generalization performance. (However, setting the weights to 0 is no good: the symmetry between the weights can then never be broken, and they all end up with the same values.)
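A minimal sketch of how weight decay is commonly applied: an L2 penalty is added to the loss, and each weight gradient gains a corresponding λW term (the value of λ and the function names are illustrative):

```python
import numpy as np

def loss_with_weight_decay(loss, weights, lam=0.1):
    """Add an L2 penalty 0.5 * lam * ||W||^2 for every weight matrix."""
    penalty = sum(0.5 * lam * np.sum(W ** 2) for W in weights)
    return loss + penalty

def grad_with_weight_decay(grad_W, W, lam=0.1):
    """The gradient of each weight correspondingly gains a lam * W term."""
    return grad_W + lam * W
```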
-**Xavier initialization**: for a layer with n input nodes, use a Gaussian distribution with standard deviation 1/√n for the initial weights. It is used as a standard in deep learning frameworks. Suitable for sigmoid and tanh.
-**He initialization**: for a layer with n input nodes, use a Gaussian distribution with standard deviation √(2/n) for the initial weights. Suitable for ReLU. Since ReLU sets the negative region to 0, the doubled coefficient can be interpreted as compensating so that the activations keep a sufficiently wide spread.
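A minimal sketch of the two initializations (the function names are illustrative):

```python
import numpy as np

def xavier_init(n_in, n_out):
    """Xavier initialization: std = sqrt(1 / n_in). Works well with sigmoid / tanh."""
    return np.random.randn(n_in, n_out) / np.sqrt(n_in)

def he_init(n_in, n_out):
    """He initialization: std = sqrt(2 / n_in). Works well with ReLU."""
    return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
```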
・**Batch Normalization**: a method proposed in 2015 and often used in practice and in competitions. Its advantages include faster learning, less dependence on the initial weight values, and suppression of overfitting. The distribution of the data is normalized by inserting a Batch Norm layer between the Affine layer and the activation (ReLU). **Normalization is performed per mini-batch, using the mini-batch as the unit during learning**.
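A minimal sketch of the training-time normalization only (γ and β are the learnable scale and shift; the running statistics needed at inference time are omitted, and the names are illustrative):

```python
import numpy as np

def batch_norm_forward(x, gamma=1.0, beta=0.0, eps=1e-7):
    """Normalize each feature over the mini-batch, then scale and shift.
    x has shape (batch_size, num_features)."""
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta            # learnable scale and shift
```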
-**Dropout**: like weight decay, a method used to suppress overfitting. During training, neurons in the hidden layers are randomly selected and the selected neurons are deleted (their signals are not propagated). During testing, all neurons transmit their signals, but the outputs are scaled by the proportion of neurons kept during training. This is close to a kind of ensemble method, because randomly deleting neurons at every iteration can be interpreted as training a different model each time.
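A minimal sketch of a Dropout layer (the mask-based implementation and test-time scaling follow the common pattern; the names are illustrative):

```python
import numpy as np

class Dropout:
    """Randomly drop neurons during training; scale the outputs at test time."""
    def __init__(self, dropout_ratio=0.5):
        self.dropout_ratio = dropout_ratio
        self.mask = None

    def forward(self, x, train_flg=True):
        if train_flg:
            # keep each neuron with probability (1 - dropout_ratio)
            self.mask = np.random.rand(*x.shape) > self.dropout_ratio
            return x * self.mask
        # at test time every neuron fires, scaled by the keep ratio
        return x * (1.0 - self.dropout_ratio)
```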
-**Hyperparameters**: the number of neurons in each layer, the batch size, the learning rate, weight decay, and so on. Tuning hyperparameters against the test data leads to overfitting, so a dedicated split called validation data is used. You make this yourself (e.g. with np.random.shuffle or sklearn's train_test_split, as used in Kaggle).
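A minimal sketch of carving out validation data with sklearn's train_test_split (the arrays here are dummy data purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# dummy training data for illustration: 1000 samples of 784 features, 10 classes
x_train = np.random.rand(1000, 784)
t_train = np.random.randint(0, 10, 1000)

# hold out 20% of the training data as validation data
x_tr, x_val, t_tr, t_val = train_test_split(
    x_train, t_train, test_size=0.2, shuffle=True, random_state=0)
```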
・For hyperparameter optimization, first set a rough range, observe the resulting recognition accuracy, and gradually narrow it down to the range where good values exist. **For neural networks, random sampling has been reported to give better results than a regular search such as grid search.** "Rough range" means a log scale in powers of 10, roughly 10^(-3) to 10^(3). It is also effective to keep the number of training epochs small, because candidates that look bad should be abandoned early. An epoch is the unit in which all of the training data has been used once: if 10,000 samples are trained with mini-batches of 100, then 100 iterations = 1 epoch (as shown in learning record 22).
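A minimal sketch of random search on a log scale; the ranges for the learning rate and weight decay are illustrative choices, and the training/evaluation step is only indicated by a comment:

```python
import numpy as np

def sample_hyperparameters():
    """Sample hyperparameters uniformly on a log (power-of-10) scale."""
    weight_decay = 10 ** np.random.uniform(-6, -2)
    lr = 10 ** np.random.uniform(-4, -1)
    return lr, weight_decay

for trial in range(10):
    lr, weight_decay = sample_hyperparameters()
    # train for only a few epochs here, record validation accuracy,
    # then narrow the ranges around the best-performing samples
    print(f"trial {trial}: lr={lr:.2e}, weight_decay={weight_decay:.2e}")
```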
-Bayesian optimization is also effective. I saw it used many times on Kaggle.
・**Convolutional neural network (CNN)**: in addition to the layers of an ordinary neural network, the concepts of the **"Convolution layer" and "Pooling layer"** are added. Two typical examples are **LeNet** and **AlexNet**.
-The layer connection "Affine - ReLU (sigmoid)" is replaced with the connection "Convolution - ReLU (sigmoid) - (Pooling)". (However, the part near the output layer stays as usual.)
-The Affine layer is a fully connected layer that connects all neurons. Its problem is that all input data is treated as equivalent neurons (flattened to one dimension), so information about the shape cannot be used. The Convolution layer, on the other hand, passes the input data to the next layer with its dimensions preserved, so the data can (potentially) be understood more correctly.
-Convolution operation: apply the filter window to the input data while sliding it at a fixed interval. The variable that controls this interval, i.e. how far the filter shifts each time it is applied, is called the stride.
-Padding: filling the area around the input data with fixed values (such as 0).
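The output size that results from a given filter size, stride, and padding can be computed with the standard formula (H + 2P - FH) / S + 1; the formula is not spelled out above, so the following is a small illustrative helper:

```python
def conv_output_size(input_size, filter_size, stride=1, pad=0):
    """Output height/width of a convolution: (H + 2P - FH) / S + 1."""
    return (input_size + 2 * pad - filter_size) // stride + 1

# e.g. a 28x28 input with a 5x5 filter, stride 1, no padding -> 24x24 output
print(conv_output_size(28, 5, stride=1, pad=0))  # 24
```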
-Pooling: an operation that shrinks the data in the height and width directions. For example, a 4x4 matrix is viewed in 2x2 regions, and in the case of Max pooling the maximum value of each region is extracted and output.
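A minimal sketch of 2x2 Max pooling on a single feature map (the reshape trick assumes the height and width are even):

```python
import numpy as np

def max_pool_2x2(x):
    """Max pooling with a 2x2 window and stride 2 on one 2-D feature map."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
print(max_pool_2x2(x))
# [[ 5  7]
#  [13 15]]
```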
-These convolution operations are implemented with a function called **im2col**. im2col expands the input data into a form convenient for the filter: each region to which the filter is applied is expanded into one row (column) of a matrix. The expanded result has more elements than the original data (because the filter regions overlap) and consumes more memory, **but matrix multiplication itself is highly optimized, so reducing the computation to this matrix form has many advantages.**
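A minimal sketch of the im2col idea, written from the description above; the book's actual im2col is similar in spirit, but this is not claimed to be its exact code:

```python
import numpy as np

def im2col_sketch(x, filter_h, filter_w, stride=1, pad=0):
    """Expand (N, C, H, W) input so that each filter position becomes one row."""
    N, C, H, W = x.shape
    out_h = (H + 2 * pad - filter_h) // stride + 1
    out_w = (W + 2 * pad - filter_w) // stride + 1
    xp = np.pad(x, [(0, 0), (0, 0), (pad, pad), (pad, pad)], mode="constant")
    col = np.zeros((N, C, filter_h, filter_w, out_h, out_w))
    for i in range(filter_h):
        for j in range(filter_w):
            col[:, :, i, j, :, :] = xp[:, :, i:i + stride * out_h:stride,
                                              j:j + stride * out_w:stride]
    # one row per filter application position, one column per filter element
    return col.transpose(0, 4, 5, 1, 2, 3).reshape(N * out_h * out_w, -1)

# The convolution then reduces to a single matrix product, roughly:
#   out = im2col_sketch(x, FH, FW) @ W.reshape(FN, -1).T   # plus bias, then reshape
```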
-The filters of the convolution layer can extract primitive information such as blobs (locally clustered regions) and edges (boundaries where the color changes) and pass it on to the next layer.