Hello. The other day I read the article "Japanese translation of '10 Deep Learning Trends at NIPS 2015' | Memorandum", which includes the advice "If you aren't using batch normalization you should" (in other words, you are missing out if you don't use it). So I implemented Batch Normalization in Theano and tested whether that holds. This post partly refers to that article.
Batch Normalization
Batch Normalization normalizes each mini-batch so that its mean is 0 and its variance is 1. Let $B$ be the set of inputs in a mini-batch and $m$ the batch size.
B = \{x_{1...m}\}\\
Here, $\epsilon$ is a small constant added for numerical stability.
\epsilon = 10^{-5}\\
\mu_{B} \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i\\
\sigma^2_{B} \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{B})^2\\
\hat{x_i} \leftarrow \frac{x_i - \mu_{B}}{\sqrt{\sigma^2_{B} + \epsilon}}\\
y_i \leftarrow \gamma \hat{x_i} + \beta
In the formulas above, $\gamma$ and $\beta$ are parameters that scale and shift the normalized values, respectively. They are learned by backpropagation together with the other parameters, but the detailed derivation is omitted here.
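As a quick check of the forward computation, here is a minimal NumPy sketch of the formulas above (NumPy is used only for illustration; the actual layer below is written in Theano, and the batch shape is chosen arbitrarily):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Apply the formulas above to a mini-batch x of shape (m, d)."""
    mu = x.mean(axis=0)                     # mu_B
    var = x.var(axis=0)                     # sigma^2_B
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized to mean 0, variance 1
    return gamma * x_hat + beta             # y_i = gamma * x_hat_i + beta

x = np.random.randn(32, 4) * 3.0 + 5.0      # mini-batch with mean ~5, std ~3
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.var(axis=0))        # approximately 0 and 1 per dimension
```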
In an ordinary Fully-Connected Layer, the mean and variance are computed per input dimension.
In other words, if the input shape is (BatchSize, 784), 784 means and variances are computed.
In a Convolutional Layer, on the other hand, the mean and variance are computed per channel.
In other words, if the input shape is (BatchSize, 64 (number of channels), 32, 32), 64 means and variances are computed, as the small check below illustrates.
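To make the difference in reduced axes concrete, here is a small NumPy check (the batch size of 128 is arbitrary):

```python
import numpy as np

# Fully-connected input (BatchSize, 784): reduce over axis 0 -> 784 statistics
x_fc = np.random.randn(128, 784)
print(x_fc.mean(axis=0).shape, x_fc.var(axis=0).shape)                       # (784,) (784,)

# Convolutional input (BatchSize, 64, 32, 32): reduce over axes (0, 2, 3) -> 64 statistics
x_conv = np.random.randn(128, 64, 32, 32)
print(x_conv.mean(axis=(0, 2, 3)).shape, x_conv.var(axis=(0, 2, 3)).shape)   # (64,) (64,)
```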
A reported benefit of Batch Normalization is that a larger learning rate can be used, which speeds up training.
```python
import numpy as np
import theano
import theano.tensor as T


class BatchNormalizationLayer(object):
    def __init__(self, input, shape=None):
        # `shape` is the input shape: (batchsize, dim) for a fully-connected layer,
        # or (batchsize, channels, height, width) for a convolutional layer.
        self.shape = shape
        if len(shape) == 2:  # for fully connected
            gamma = theano.shared(value=np.ones(shape[1], dtype=theano.config.floatX), name="gamma", borrow=True)
            beta = theano.shared(value=np.zeros(shape[1], dtype=theano.config.floatX), name="beta", borrow=True)
            mean = input.mean((0,), keepdims=True)
            var = input.var((0,), keepdims=True)
        elif len(shape) == 4:  # for cnn
            gamma = theano.shared(value=np.ones(shape[1:], dtype=theano.config.floatX), name="gamma", borrow=True)
            beta = theano.shared(value=np.zeros(shape[1:], dtype=theano.config.floatX), name="beta", borrow=True)
            # per-channel statistics, computed over the batch and spatial axes
            mean = input.mean((0, 2, 3), keepdims=True)
            var = input.var((0, 2, 3), keepdims=True)
            mean = self.change_shape(mean)
            var = self.change_shape(var)
        self.params = [gamma, beta]
        self.output = gamma * (input - mean) / T.sqrt(var + 1e-5) + beta

    def change_shape(self, vec):
        # Tile the per-channel statistics of shape (1, C, 1, 1) into (C, H, W)
        # so they line up with gamma and beta.
        ret = T.repeat(vec, self.shape[2] * self.shape[3])
        ret = ret.reshape(self.shape[1:])
        return ret
```
An example of how to use it (mostly pseudo-code):

```python
...
input = previous_layer.output  # symbolic variable: output of the previous layer, shape=(batchsize, 784)
h = BatchNormalizationLayer(input, shape=(batchsize, 784))
# when applying an activation
h.output = activation(h.output)  # activation = some activation function
...
params = ... + h.params + ...  # used when updating the network parameters
```
The experiment uses a simple multi-layer neural network on MNIST.

- Number of hidden layers: 10
- Number of units per hidden layer: 784
- Optimization method: plain SGD (learning rate: 0.01)
- Activation function: tanh
- Dropout ratio: 0.1 for the first hidden layer, 0.5 for all layers except the input/output layers

So the network is: Input layer → (Fully-Connected Layer → Batch Normalization Layer → Activation) × 10 → Output layer, as sketched below.
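A rough sketch of how such a network could be assembled from the layer above (the weight initialization and variable names here are illustrative, and dropout and the output layer are omitted):

```python
import numpy as np
import theano
import theano.tensor as T

rng = np.random.RandomState(0)
batchsize = 100                          # assumed mini-batch size
x = T.matrix("x")                        # MNIST input, shape=(batchsize, 784)

layer_input = x
params = []
for i in range(10):
    # fully-connected layer: 784 -> 784 (weights initialized arbitrarily here)
    W = theano.shared(rng.randn(784, 784).astype(theano.config.floatX) * 0.01, borrow=True)
    b = theano.shared(np.zeros(784, dtype=theano.config.floatX), borrow=True)
    fc_output = T.dot(layer_input, W) + b
    # batch normalization followed by the tanh activation
    bn = BatchNormalizationLayer(fc_output, shape=(batchsize, 784))
    layer_input = T.tanh(bn.output)
    params += [W, b] + bn.params
# ...followed by the output layer, dropout, and plain SGD updates (learning rate 0.01)
```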
- Error function value
  <img src="https://qiita-image-store.s3.amazonaws.com/0/31899/2cb66cbb-0581-fe69-a043-7794a2103393.png" width=640>
- Classification accuracy
  <img src="https://qiita-image-store.s3.amazonaws.com/0/31899/bc14eede-2acf-591e-80b3-632269b0d19d.png" width=640>
The experimental setup may have been a bit contrived, but it should be clear that you lose out if you don't use Batch Normalization.