Hello. The other day I read the article "Japanese translation of '10 Deep Learning Trends at NIPS 2015' | Memorandum", which includes the advice "If you aren't using batch normalization you should" (in other words, you are missing out if you don't use it). So I implemented Batch Normalization in Theano and tested whether that holds. This post partly refers to that article.
Batch Normalization
Batch Normalization normalizes each mini-batch so that its mean is 0 and its variance is 1. Let $B$ be the set of inputs in a mini-batch and $m$ the batch size.
B = \{x_{1...m}\}\\
Here, $\epsilon$ is a small constant added for numerical stability.
\epsilon = 10^{-5}\\
\mu_{B} \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i\\
\sigma^2_{B} \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{B})^2\\
\hat{x_i} \leftarrow \frac{x_i - \mu_{B}}{\sqrt{\sigma^2_{B} + \epsilon}}\\
y_i \leftarrow \gamma \hat{x_i} + \beta
In the formulas above, $\gamma$ and $\beta$ are parameters that scale and shift the normalized values, respectively. They are learned by backpropagation together with the other parameters, but the detailed derivation is omitted here.
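As a quick check of the forward computation, here is a minimal NumPy sketch of the formulas above (NumPy is used only for illustration; the actual layer below is written in Theano, and the batch shape is chosen arbitrarily):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Apply the formulas above to a mini-batch x of shape (m, d)."""
    mu = x.mean(axis=0)                     # mu_B
    var = x.var(axis=0)                     # sigma^2_B
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized to mean 0, variance 1
    return gamma * x_hat + beta             # y_i = gamma * x_hat_i + beta

x = np.random.randn(32, 4) * 3.0 + 5.0      # mini-batch with mean ~5, std ~3
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.var(axis=0))        # approximately 0 and 1 per dimension
```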
In an ordinary Fully-Connected Layer, the mean and variance are computed per input dimension.
In other words, if the input shape is (BatchSize, 784), 784 means and variances are computed.
In a Convolutional Layer, on the other hand, the mean and variance are computed per channel.
In other words, if the input shape is (BatchSize, 64 (number of channels), 32, 32), 64 means and variances are computed, as the small check below illustrates.
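To make the difference in reduced axes concrete, here is a small NumPy check (the batch size of 128 is arbitrary):

```python
import numpy as np

# Fully-connected input (BatchSize, 784): reduce over axis 0 -> 784 statistics
x_fc = np.random.randn(128, 784)
print(x_fc.mean(axis=0).shape, x_fc.var(axis=0).shape)                       # (784,) (784,)

# Convolutional input (BatchSize, 64, 32, 32): reduce over axes (0, 2, 3) -> 64 statistics
x_conv = np.random.randn(128, 64, 32, 32)
print(x_conv.mean(axis=(0, 2, 3)).shape, x_conv.var(axis=(0, 2, 3)).shape)   # (64,) (64,)
```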
A reported benefit of Batch Normalization is that a larger learning rate can be used, which speeds up training.
```python
import numpy as np
import theano
import theano.tensor as T


class BatchNormalizationLayer(object):
    def __init__(self, input, shape=None):
        # `shape` is the input shape: (batchsize, dim) for a fully-connected layer,
        # or (batchsize, channels, height, width) for a convolutional layer.
        self.shape = shape
        if len(shape) == 2:  # for fully connected
            gamma = theano.shared(value=np.ones(shape[1], dtype=theano.config.floatX), name="gamma", borrow=True)
            beta = theano.shared(value=np.zeros(shape[1], dtype=theano.config.floatX), name="beta", borrow=True)
            mean = input.mean((0,), keepdims=True)
            var = input.var((0,), keepdims=True)
        elif len(shape) == 4:  # for cnn
            gamma = theano.shared(value=np.ones(shape[1:], dtype=theano.config.floatX), name="gamma", borrow=True)
            beta = theano.shared(value=np.zeros(shape[1:], dtype=theano.config.floatX), name="beta", borrow=True)
            # per-channel statistics, computed over the batch and spatial axes
            mean = input.mean((0, 2, 3), keepdims=True)
            var = input.var((0, 2, 3), keepdims=True)
            mean = self.change_shape(mean)
            var = self.change_shape(var)
        self.params = [gamma, beta]
        self.output = gamma * (input - mean) / T.sqrt(var + 1e-5) + beta

    def change_shape(self, vec):
        # Tile the per-channel statistics of shape (1, C, 1, 1) into (C, H, W)
        # so they line up with gamma and beta.
        ret = T.repeat(vec, self.shape[2] * self.shape[3])
        ret = ret.reshape(self.shape[1:])
        return ret
```
An example of how to use it (mostly pseudo-code):

```python
...
input = previous_layer.output  # symbolic variable: output of the previous layer, shape=(batchsize, 784)
h = BatchNormalizationLayer(input, shape=(batchsize, 784))
# when applying an activation
h.output = activation(h.output)  # activation = some activation function
...
params = ... + h.params + ...  # used when updating the network parameters
```
The experiment uses a simple multi-layer neural network on MNIST.

- Number of hidden layers: 10
- Number of units per hidden layer: 784
- Optimization method: plain SGD (learning rate: 0.01)
- Activation function: tanh
- Dropout ratio: 0.1 for the first hidden layer, 0.5 for all layers except the input/output layers

So the network is: Input layer → (Fully-Connected Layer → Batch Normalization Layer → Activation) × 10 → Output layer, as sketched below.
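A rough sketch of how such a network could be assembled from the layer above (the weight initialization and variable names here are illustrative, and dropout and the output layer are omitted):

```python
import numpy as np
import theano
import theano.tensor as T

rng = np.random.RandomState(0)
batchsize = 100                          # assumed mini-batch size
x = T.matrix("x")                        # MNIST input, shape=(batchsize, 784)

layer_input = x
params = []
for i in range(10):
    # fully-connected layer: 784 -> 784 (weights initialized arbitrarily here)
    W = theano.shared(rng.randn(784, 784).astype(theano.config.floatX) * 0.01, borrow=True)
    b = theano.shared(np.zeros(784, dtype=theano.config.floatX), borrow=True)
    fc_output = T.dot(layer_input, W) + b
    # batch normalization followed by the tanh activation
    bn = BatchNormalizationLayer(fc_output, shape=(batchsize, 784))
    layer_input = T.tanh(bn.output)
    params += [W, b] + bn.params
# ...followed by the output layer, dropout, and plain SGD updates (learning rate 0.01)
```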
- Error function value
  <img src="https://qiita-image-store.s3.amazonaws.com/0/31899/2cb66cbb-0581-fe69-a043-7794a2103393.png" width=640>
- Classification accuracy
  <img src="https://qiita-image-store.s3.amazonaws.com/0/31899/bc14eede-2acf-591e-80b3-632269b0d19d.png" width=640>
The experimental setup may have been a bit contrived, but it should be clear that you lose out if you don't use Batch Normalization.