[PYTHON] Keras starting from nothing 2nd

Click here for the table of contents of Keras starting from nothing: http://qiita.com/Ishotihadus/items/6ecf5684c2cbaaa6a5ef

Review of last time

Last time, I created a rough dataset and trained a model on it in a rough way.

The input is a five-dimensional vector whose elements are at least 0 and less than 1. The output is 0 if the sum of the elements is 2.5 or less, and 1 if it is greater than 2.5. This is so-called "two-class classification".

The data set was created as follows. The output is a one-hot two-dimensional vector so that the neural network can handle it easily.

import numpy as np
from keras.utils import np_utils

data = np.random.rand(250,5)
labels = np_utils.to_categorical((np.sum(data, axis=1) > 2.5) * 1)

I built the model as follows. Two layers are stacked (not counting the input). The input is 5-dimensional, the hidden layer in between is 20-dimensional, and the output is 2-dimensional.

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(20, input_dim=5, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile('rmsprop', 'categorical_crossentropy', metrics=['accuracy'])

Activation function

In a neural network, each unit takes a weighted sum of the values coming from the previous layer, adds a bias constant to it, and finally applies a function. That function is the activation function. You can also apply an activation function with a separate layer called `Activation`. Doing so lets you specify parameters, but you don't have to use it.

When the bias is negative, it effectively acts as a threshold (the unit will not fire unless the weighted sum exceeds this value). Incidentally, the bias can be fixed at 0 (specify `bias=False`; Keras 2 and later call this argument `use_bias`).
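As a rough illustration of the computation just described, here is a minimal NumPy sketch of what one fully connected layer does: weighted sum, plus bias, then activation. The function name `dense_forward` and all the numbers are made up for illustration; this is not how Keras itself is implemented.

```python
import numpy as np

def dense_forward(x, weights, bias, activation):
    """Weighted sum of the inputs, plus the bias, then the activation function."""
    return activation(x @ weights + bias)

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical values: 3 inputs feeding 2 units.
x = np.array([1.0, 2.0, 3.0])
weights = np.array([[1.0, -1.0],
                    [0.5,  0.0],
                    [0.0, -1.0]])
bias = np.array([-1.0, 2.0])  # the negative bias of the first unit acts like a threshold

out = dense_forward(x, weights, bias, relu)
print(out)
```

The first unit fires because its weighted sum (2.0) exceeds the threshold set by its bias; the second unit's sum stays negative even after adding the bias, so relu outputs 0 for it.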

Types of activation functions

I tried to summarize to some extent. It's difficult to say what to choose. I feel that tanh and relu are popular overall.

The following two take values between -1 and 1. tanh approaches its limits more quickly. When the input is 0 the output is also 0, and around 0 both behave almost linearly, but as the absolute value of the input grows the output saturates at -1 or 1.

- Softsign
- Hyperbolic tangent (tanh)
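To make the comparison concrete, here is a small NumPy sketch of the two functions. The softsign formula x / (1 + |x|) is the standard definition; the sample inputs are arbitrary.

```python
import numpy as np

def softsign(x):
    # Stays strictly between -1 and 1, approaching the limits slowly.
    return x / (1.0 + np.abs(x))

xs = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
tanh_out = np.tanh(xs)
softsign_out = softsign(xs)

print(tanh_out)
print(softsign_out)
```

Both outputs are 0 at input 0, and for every other input tanh is closer to -1 or 1 than softsign is.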

The following two take values between 0 and 1 only. The hard sigmoid is a piecewise-linear (broken-line) function, while the sigmoid is smooth. When the input is 0, the output is 0.5. As above, they behave almost linearly around 0 but saturate at 0 or 1 as the absolute value of the input increases.

- Sigmoid function (sigmoid)
- Hard sigmoid (hard_sigmoid)
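A short NumPy sketch of these two functions follows. The slope 0.2 and offset 0.5 in `hard_sigmoid` are the constants I believe Keras uses; treat them as an assumption rather than a guaranteed match.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hard_sigmoid(x):
    # Piecewise-linear approximation, clipped to [0, 1].
    # Slope 0.2 and offset 0.5 are assumed constants.
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)

xs = np.array([-3.0, 0.0, 3.0])
sig_out = sigmoid(xs)
hard_out = hard_sigmoid(xs)

print(sig_out)
print(hard_out)
```

With these constants the hard sigmoid reaches exactly 0 and 1 once the input leaves the range from -2.5 to 2.5, whereas the smooth sigmoid only approaches those values.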

The following two take values of 0 or more. Again, the ramp function is the piecewise-linear one with a sharp corner, while softplus is (quite) smooth. For inputs well above zero they become linear, and for inputs well below zero they approach zero. relu can also be given a slope coefficient for inputs below 0.

- Softplus
- Ramp function (relu)
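Here is a minimal NumPy sketch of softplus and the ramp function. The parameter name `alpha` for the slope below 0 is chosen for illustration; check the Keras version you use for the exact argument name.

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)): smooth, always positive.
    return np.log1p(np.exp(x))

def relu(x, alpha=0.0):
    # x for positive inputs, alpha * x for negative inputs.
    # alpha = 0 gives the plain ramp function.
    return np.maximum(x, 0.0) + alpha * np.minimum(x, 0.0)

xs = np.array([-2.0, 0.0, 3.0])
print(softplus(xs))
print(relu(xs))
print(relu(xs, alpha=0.1))
```

For large inputs softplus is almost identical to the input itself, which is the "becomes linear" behaviour described above.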

All that remains is the linear function and softmax.

The linear one simply passes the weighted sum plus bias through unchanged (in Keras, `linear` is the identity function, so it has no coefficient or bias of its own to learn). A linear activation adds no nonlinearity, so I probably won't use it much.

Softmax applies the exponential function to every value output from that layer (i.e., makes them all positive) and normalizes them so that they sum to 1. After this function is applied, the sum is 1 and all values are positive, so the output can be interpreted as probabilities. That's why I used softmax at the end in the previous example.
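The softmax computation can be sketched in a few lines of NumPy. Subtracting the maximum before exponentiating is a common numerical-stability trick that I have added; it does not change the result. The input scores are arbitrary.

```python
import numpy as np

def softmax(z):
    # Shift by the maximum to avoid overflow in exp; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

scores = np.array([2.0, -1.0, 0.5])
probs = softmax(scores)
print(probs, probs.sum())
```

Every entry of `probs` is positive, the entries sum to 1, and the largest score receives the largest probability.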

Optimizer

The first argument to compile. There are various methods, but they all address the same question: how to find parameters that reduce the error. The best parameters cannot be computed directly in closed form (as they can with, say, the least squares method), so the error is reduced step by step through iteration.

See this.

You can select sgd, rmsprop, adagrad, adadelta, adam, adamax, or nadam. The common wisdom seems to be that you should choose adam (it works reasonably well without much tuning of parameters). RNNs (recurrent neural networks) are slow to optimize, so rmsprop seems to be the better choice for them.

In the end, every method uses the derivative (gradient) of the error to move the parameters a little at a time in the direction that lowers it. SGD is simply short for "stochastic gradient descent". The gradient matters because the closer you are to an extremum (maximum or minimum), the smaller the gradient (the derivative) becomes. If you keep stepping downhill along the gradient of the error function, its value approaches a minimum on its own (of course, this depends on the shape of the function). However, how efficiently a method gets there differs, and sometimes a step goes too far. A method can also get stuck in a local solution, that is, a point that "may be a local minimum, but is not good at all overall". The optimizers above are ways of reducing the error while taking such issues into account.
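As a toy illustration of the idea, here is plain gradient descent on a one-variable error function. This is only a sketch of the basic principle, not of any of the Keras optimizers; the function `grad_descent`, the error function, and the learning rates are all made up for the example.

```python
def grad_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

# Toy error function f(x) = (x - 3)^2, whose derivative is 2 * (x - 3).
def grad_f(x):
    return 2.0 * (x - 3.0)

x_min = grad_descent(grad_f, x0=0.0)
print(x_min)  # very close to the minimum at x = 3

# With a step size that is too large, every update overshoots the minimum
# and the iterate moves further away instead of converging.
x_bad = grad_descent(grad_f, x0=0.0, learning_rate=1.1, steps=20)
print(x_bad)
```

The second call shows the "sometimes it goes too far" problem: the same procedure diverges when the step size is too large for the shape of the function.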

Objective function

The second argument to compile. This is the function you want to make smaller; it means almost the same thing as the error. The optimizer was about "which optimization method to use to reduce the error", whereas this is about "by what criterion the error is measured". The "error" here means the difference between the value estimated by the current neural network and the correct value.

You can choose from the ones below; whichever you choose, a larger value means a worse prediction.

- Mean squared error (mse: mean of the squared differences)
- Mean absolute error (mae: mean of the absolute differences)
- Mean absolute percentage error (mape: mean of the absolute differences divided by the correct value, i.e. the error rate)
- Mean squared logarithmic error (msle: mean of the squared differences of the "logarithm of the value plus 1")
- Mean hinge loss (hinge)
- Mean squared hinge loss (squared_hinge)
- Cross entropy for two-class classification (binary_crossentropy)
- Cross entropy for N-class classification (categorical_crossentropy)
- Sparse cross entropy for N-class classification (sparse_categorical_crossentropy)
- KL divergence (kld)
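The four regression-style losses at the top of the list can be sketched in NumPy as follows. These are simplified versions written from the definitions; the Keras implementations add details such as clipping, so small numerical differences are possible. The sample values are arbitrary.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_pred - y_true) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_pred - y_true))

def mape(y_true, y_pred):
    # Expressed in percent; undefined when y_true contains 0.
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def msle(y_true, y_pred):
    # Squared difference of log(value + 1).
    return np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)

y_true = np.array([1.0, 2.0, 4.0])
y_pred = np.array([1.0, 3.0, 2.0])

print(mse(y_true, y_pred), mae(y_true, y_pred))
print(mape(y_true, y_pred), msle(y_true, y_pred))
```

All four are 0 for a perfect prediction and grow as the prediction moves away from the correct values; they differ in how strongly large or relative errors are penalized.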

Hinge loss is computed from the product $x$ of the estimated value and the correct value: it is 0 when $x$ is 1 or more and $1-x$ when $x$ is less than 1 (so the loss is small if the signs agree).
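Written out in NumPy, the hinge loss looks roughly like this. It is a sketch from the definition, assuming the correct values are -1 or 1 and that the per-sample losses are averaged; the sample values are arbitrary.

```python
import numpy as np

def hinge(y_true, y_pred):
    # y_true is expected to be -1 or 1.
    # Per sample: 0 if y_true * y_pred >= 1, otherwise 1 - y_true * y_pred.
    return np.mean(np.maximum(1.0 - y_true * y_pred, 0.0))

y_true = np.array([1.0, -1.0, 1.0])
y_pred = np.array([2.0, -0.5, -1.0])

loss = hinge(y_true, y_pred)
print(loss)
```

The first sample contributes nothing (correct sign, product above 1), the second contributes a little (correct sign but product below 1), and the third contributes the most (wrong sign).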

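Cross entropy measures how far the predicted probability distribution is from the correct one. For one-hot labels it reduces to the negative logarithm of the probability assigned to the correct class, so the smaller it is, the better the classification.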
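A minimal NumPy sketch of categorical cross entropy for one-hot labels is shown below. The clipping constant `eps` is my own addition to avoid taking the logarithm of 0, and the example predictions are made up.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # y_true: one-hot rows; y_pred: rows of predicted probabilities.
    y_pred = np.clip(y_pred, eps, 1.0)
    return np.mean(-np.sum(y_true * np.log(y_pred), axis=1))

y_true = np.array([[0.0, 1.0], [1.0, 0.0]])
good = np.array([[0.1, 0.9], [0.8, 0.2]])  # most probability on the correct class
bad = np.array([[0.7, 0.3], [0.4, 0.6]])   # most probability on the wrong class

loss_good = categorical_crossentropy(y_true, good)
loss_bad = categorical_crossentropy(y_true, bad)
print(loss_good, loss_bad)
```

The prediction that puts more probability on the correct class gets the smaller loss.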

KL divergence is a general measure of how far apart two probability distributions are. Using KL divergence as the criterion suits problems where the network is estimating a probability distribution.
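For discrete distributions, KL divergence can be sketched as below. The clipping constant `eps` is an assumption of mine to keep the logarithm finite, and the two distributions are arbitrary examples.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-7):
    # Sum over outcomes of p * log(p / q).
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])

print(kl_divergence(p, p))
print(kl_divergence(p, q))
print(kl_divergence(q, p))
```

The value is 0 when the two distributions are identical and positive otherwise. Swapping the arguments gives a different number, so it is not a distance in the strict mathematical sense.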

Cosine similarity (negated; `cosine_proximity` in Keras) measures how much the directions of two vectors differ. Training brings the direction of the predicted vector close to that of the correct vector, so the magnitude is ignored.
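The following NumPy sketch shows the negated cosine similarity for a single pair of vectors. The function name is made up for the example, and it assumes neither vector is all zeros.

```python
import numpy as np

def negative_cosine_similarity(y_true, y_pred):
    # Cosine of the angle between the two vectors, with the sign flipped
    # so that "same direction" gives the smallest value (-1).
    cos = np.dot(y_true, y_pred) / (np.linalg.norm(y_true) * np.linalg.norm(y_pred))
    return -cos

a = np.array([1.0, 2.0])

print(negative_cosine_similarity(a, 10.0 * a))  # same direction, different size
print(negative_cosine_similarity(a, -a))        # opposite direction
```

Scaling a vector does not change the result, which is what "the magnitude is ignored" means in practice.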

Trying a different objective function

Let's change some of the options, using the same dataset and the same setup.

This time, the output is one-dimensional, either -1 or 1 (-1 if the sum is 2.5 or less, 1 if it is greater than 2.5). Since the sign is what matters, the loss is the hinge loss and the output activation function is tanh. I set the optimizer to adam.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Inputs: 250 random 5-dimensional vectors; labels: 1 if the sum exceeds 2.5, else -1
data = np.random.rand(250,5)
labels = (np.sum(data, axis=1) > 2.5) * 2 - 1
model = Sequential([Dense(20, input_dim=5, activation='tanh'), Dense(1, activation='tanh')])
model.compile('adam', 'hinge', metrics=['accuracy'])
# nb_epoch is the Keras 1 argument name; Keras 2 and later call it epochs
model.fit(data, labels, nb_epoch=150, validation_split=0.2)

# Evaluate on fresh data: the sign of the tanh output is the predicted class
test = np.random.rand(200, 5)
predict = np.sign(model.predict(test).flatten())
real = (np.sum(test, axis=1) > 2.5) * 2 - 1
print(np.mean(predict == real))

Training takes a little while, but the accuracy seems reasonable. Well, that's because the dataset is what it is.
