I implemented a CNN in Python, using only NumPy and no deep learning library. I used ["Deep Learning"](http://www.amazon.co.jp/dp/4061529021) (Machine Learning Professional Series, Takayuki Okatani) as a textbook.
CNN
A CNN is a feedforward network that uses convolution operations and is mainly applied to image recognition. In a general neural network, the units of adjacent layers are fully connected; a CNN has special layers in which only specific units of adjacent layers are connected. These special layers perform the operations **convolution** and **pooling**, described below.
Convolution is an operation that multiplies the filter with the corresponding pixels of the image and sums the products. It detects shade patterns in the image that resemble the shade pattern of the filter. Let the image size be $W \times W$, its index be $(i, j)$, and its pixel values be $x_{ij}$. Let the filter size be $H \times H$, its index be $(p, q)$, and its pixel values be $h_{pq}$. The convolution is then expressed by the following formula.
u_{ij} = \sum^{H-1}_{p=0} \sum^{H-1}_{q=0} x_{i+p, j+q} \, h_{pq}
When the filter is moved only within the range that fits inside the image, the image size of the convolution result is as follows, where $\lfloor \cdot \rfloor$ is the floor operator that truncates to an integer.
(W - 2 \lfloor H / 2 \rfloor) \times (W - 2 \lfloor H / 2 \rfloor)
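As a concrete sketch, the formula above can be written directly in NumPy. The `convolve2d` helper below is hypothetical and not part of the repository code; for a $6 \times 6$ image and a $3 \times 3$ filter it gives a $4 \times 4$ result, matching $W - 2 \lfloor H / 2 \rfloor = 4$.

```py
import numpy as np

def convolve2d(x, h):
    """Slide the H x H filter h over the W x W image x and sum the
    elementwise products, as in the formula above."""
    W, H = x.shape[0], h.shape[0]
    out = W - 2 * (H // 2)                       # output size for an odd-sized filter
    u = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            u[i, j] = np.sum(x[i:i+H, j:j+H] * h)
    return u

x = np.arange(36, dtype=float).reshape(6, 6)     # 6 x 6 image
h = np.ones((3, 3)) / 9.0                        # 3 x 3 averaging filter
print(convolve2d(x, h).shape)                    # -> (4, 4)
```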
In the convolutional layer, the convolution operation is performed as shown in the figure below. Let the size of the input image be $W \times W \times K$ and the size of the convolution filters be $H \times H \times K \times M$, where $K$ is the number of image channels and $M$ is the number of filter types. The result of convolving the $k$-th channel of the image in layer $l-1$ with the $m$-th filter is as follows, where $b_{ijm}$ is the bias and $f$ is the activation function.
\begin{align}
u_{ijm} &= \sum^{K-1}_{k=0} \sum^{H-1}_{p=0} \sum^{H-1}_{q=0} z^{(l-1)}_{i+p, j+q, k} \, h_{pqkm} + b_{ijm} \\
\\
z_{ijm} &= f(u_{ijm})
\end{align}
The array $z_{ijm}$ is regarded as an image with $M$ channels and becomes the input $z^{(l)}_{ijm}$ to the next layer. Also, because the filters are applied while being shifted over the image, the same weights are used repeatedly. This is called **weight sharing**.
In the figure above, the number of channels of the input image is $K = 3$, the number of filter types is $M = 2$, and an image with $2$ channels is output.
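To make the formula concrete, here is a minimal sketch of the convolutional layer's forward pass written with explicit loops for readability. `conv_layer_forward` and its per-filter bias are illustrative assumptions, not the repository code (which uses the vectorized form shown later).

```py
import numpy as np

def conv_layer_forward(z_prev, h, b, f):
    """u_{ijm} = sum_k sum_p sum_q z_{i+p, j+q, k} h_{pqkm} + b_m, then f(u).
    Shapes follow the notation in the text:
    z_prev: (W, W, K), h: (H, H, K, M), b: (M,) (bias simplified to one value per filter)."""
    W, _, K = z_prev.shape
    H, _, _, M = h.shape
    out = W - 2 * (H // 2)
    u = np.zeros((out, out, M))
    for i in range(out):
        for j in range(out):
            patch = z_prev[i:i+H, j:j+H, :]                          # (H, H, K)
            u[i, j, :] = np.tensordot(patch, h, ((0, 1, 2), (0, 1, 2))) + b
    return f(u)

z = np.random.rand(12, 12, 3)                                        # K = 3 channels
h = np.random.rand(5, 5, 3, 2)                                       # M = 2 filters
b = np.zeros(2)
print(conv_layer_forward(z, h, b, lambda u: np.maximum(u, 0)).shape)  # -> (8, 8, 2)
```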
Pooling is an operation that combines a local area of an image into a single value. It reduces the positional sensitivity of the features extracted in the convolutional layer, so that the output of the pooling layer does not change under small misalignments. On an image of size $W \times W \times K$, take an $H \times H$ square region centered on pixel $(i, j)$, and let $P_{ij}$ be the set of pixels in that region. The value $u_{ijk}$ obtained by pooling can be expressed as follows.
u_{ijk} = \biggl(\frac{1}{H^2} \sum_{(p, q) \in P_{ij}} z^{P}_{pqk} \biggr)^{\frac{1}{P}}
When $P = 1$, the pixels in the region are averaged, so this is called **average pooling**. When $P = \infty$, the maximum value of the pixels in the region is taken, so this is called **max pooling**. The figure below shows max pooling: an input image of $4 \times 4$ is pooled with a region size of $2 \times 2$ and a stride of $2$.
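Below is a minimal NumPy sketch of max pooling for the $2 \times 2$, stride-$2$ case just described; `max_pool` is a hypothetical helper, and average pooling would simply replace `np.max` with `np.mean`.

```py
import numpy as np

def max_pool(x, H, stride):
    """Take the maximum over each H x H region of the W x W image x,
    moving the window by `stride` pixels."""
    W = x.shape[0]
    out = (W - H) // stride + 1
    u = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            u[i, j] = np.max(x[i*stride:i*stride+H, j*stride:j*stride+H])
    return u

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(x, 2, 2))   # 2 x 2 output: [[ 5.  7.] [13. 15.]]
```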
Next, I will explain how a CNN is trained. The chapter on error backpropagation in "Implementing neural networks with python" will be helpful.
To bring the output computed from the training data closer to the teacher labels, we minimize the error function $E$. We partially differentiate the error function $E$ with respect to a weight $w$ and update the weight so that this gradient approaches $0$.
w_{new} = w_{old} - \varepsilon \frac{\partial E}{\partial w_{old}}
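In code, one update step is just the formula above applied elementwise; the arrays here are placeholders for weights and a gradient computed by backpropagation.

```py
import numpy as np

epsilon = 0.005                  # learning rate
w = np.random.rand(3, 3)         # placeholder weight
grad_E = np.random.rand(3, 3)    # placeholder dE/dw from backpropagation
w = w - epsilon * grad_E         # gradient-descent update
```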
The calculation of error backpropagation is the same as for a general neural network: the error $\delta^{(l)}$ of layer $l$ is obtained from the product of the error $\delta^{(l+1)}$ of layer $l+1$, the weights $w^{(l+1)}$, and the derivative of the activation with respect to the input from layer $l$. However, the following two points differ between a CNN and a fully connected neural network.
This time, I implemented the network used in "Implementation of convolutional neural network by Chainer". The implemented code is available [here](https://github.com/shota-takayama/cnn). I will describe the implementation separately for the convolutional layer and the pooling layer.
In the convolutional layer, local areas of the image are first cut out and arranged in order, and this reshaped array is used as the input to forward propagation.
For example, when an input image of $20 \times 12 \times 12$ is convolved with a filter of $50 \times 20 \times 5 \times 5$, the input image is reshaped as shown in the figure below.
The size of the reshaped input is $64 \times 20 \times 5 \times 5$.
The number $64$ is the number of local regions: $(12 - 2\lfloor 5/2 \rfloor) \times (12 - 2\lfloor 5/2 \rfloor) = 8 \times 8 = 64$.
The convolution of the reshaped input image and the filter is then computed. Again, the input and filter sizes are $64 \times 20 \times 5 \times 5$ and $50 \times 20 \times 5 \times 5$, respectively.
With `np.tensordot(X, weight, ((1, 2, 3), (1, 2, 3)))`, an output of size $64 \times 50$ is obtained.
<img width="1000" alt="tensordot.png" src="https://qiita-image-store.s3.amazonaws.com/0/82527/1f8aaf87-bc78-a0f2-e798-888dec58990b.png">
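The shape bookkeeping of this `tensordot` call can be checked with random data of the sizes given above (hypothetical arrays, not the repository code):

```py
import numpy as np

X = np.random.rand(64, 20, 5, 5)        # reshaped input patches
weight = np.random.rand(50, 20, 5, 5)   # 50 filters of size 20 x 5 x 5
# contracting the (20, 5, 5) axes leaves the 64 patch positions and 50 output channels
out = np.tensordot(X, weight, ((1, 2, 3), (1, 2, 3)))
print(out.shape)                        # -> (64, 50)
```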
The following code is the key part of forward propagation in the convolutional layer.
Since one dimension has been added to the input so that training can be done in mini-batches, the axes argument of `tensordot` is shifted by one, giving `axes=((2, 3, 4), (1, 2, 3))`.
```py
def __forward(self, X):
    # X: (batch, channels, height, width)
    s_batch, k, xh, xw = X.shape
    m = self.weight.shape[0]
    # output size when the filter is moved only within the image
    oh, ow = xh - self.kh // 2 * 2, xw - self.kw // 2 * 2
    # cut the input into local regions of the filter size
    self.__patch = self.__im2patch(X, s_batch, k, oh, ow)
    # contract the (channel, kh, kw) axes of the patches with those of the filters,
    # then rearrange to (batch, filters, oh, ow)
    return np.tensordot(self.__patch, self.weight, ((2, 3, 4), (1, 2, 3))).swapaxes(1, 2).reshape(s_batch, m, oh, ow)

def __im2patch(self, X, s_batch, k, oh, ow):
    # arrange the oh * ow local regions in order: (batch, oh * ow, channels, kh, kw)
    patch = np.zeros((s_batch, oh * ow, k, self.kh, self.kw))
    for j in range(oh):
        for i in range(ow):
            patch[:, j * ow + i, :, :, :] = X[:, :, j:j+self.kh, i:i+self.kw]
    return patch
```
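As a shape trace of this mini-batch version (hypothetical sizes: a batch of $5$, the $20 \times 12 \times 12$ input, and the $50 \times 20 \times 5 \times 5$ filter):

```py
import numpy as np

batch = 5
patch = np.random.rand(batch, 8 * 8, 20, 5, 5)             # what __im2patch produces
weight = np.random.rand(50, 20, 5, 5)
out = np.tensordot(patch, weight, ((2, 3, 4), (1, 2, 3)))
print(out.shape)                                           # -> (5, 64, 50)
print(out.swapaxes(1, 2).reshape(batch, 50, 8, 8).shape)   # -> (5, 50, 8, 8)
```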
In backpropagation through the convolutional layer, the product of the $\delta$ from the layer above and the filter coefficients becomes the backpropagated error. I implemented it as follows. The error obtained for each local region is put back into the shape of the input image, which is the reverse of cutting the input image into local regions during forward propagation.
```py
def backward(self, delta, shape):
    # delta: (batch, filters, height, width) -- the error from the layer above
    s_batch, k, h, w = delta.shape
    # multiply the error by the filter coefficients: contract the filter axis
    delta_patch = np.tensordot(delta.reshape(s_batch, k, h * w), self.weight, (1, 0))
    # scatter the per-patch errors back into the shape of the input image
    return self.__patch2im(delta_patch, h, w, shape)

def __patch2im(self, patch, h, w, shape):
    # accumulate each patch's error at the position it was cut from
    im = np.zeros(shape)
    for j in range(h):
        for i in range(w):
            im[:, :, j:j+self.kh, i:i+self.kw] += patch[:, j * w + i]
    return im
```
Forward propagation in the pooling layer also reshapes the input image and arranges the local regions in order. It differs from the convolutional layer in that the index of the pixel that gave the maximum value is stored. This is because, during backpropagation, the error must be sent back to the pixel the value came from. The following code is the key part of forward propagation in the pooling layer.
```py
def forward(self, X):
    s_batch, k, h, w = X.shape
    # output size with window (kh, kw) and stride s
    oh, ow = (h - self.kh) // self.s + 1, (w - self.kw) // self.s + 1
    # keep the index of the maximum in each window for backpropagation
    val, self.__ind = self.__max(X, s_batch, k, oh, ow)
    return val

def __max(self, X, s_batch, k, oh, ow):
    patch = self.__im2patch(X, s_batch, k, oh, ow)
    # maximum value of each window and the position within the window where it occurred
    return (np.max(patch, axis=3).reshape(s_batch, k, oh, ow),
            np.argmax(patch, axis=3).reshape(s_batch, k, oh, ow))

def __im2patch(self, X, s_batch, k, oh, ow):
    # arrange the local regions in order and flatten each window
    patch = np.zeros((s_batch, oh * ow, k, self.kh, self.kw))
    for j in range(oh):
        for i in range(ow):
            _j, _i = j * self.s, i * self.s
            patch[:, j * ow + i, :, :, :] = X[:, :, _j:_j+self.kh, _i:_i+self.kw]
    return patch.swapaxes(1, 2).reshape(s_batch, k, oh * ow, -1)
```
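To illustrate the stored indices, here is a small standalone check (not the class above) of the patch-then-argmax idea for a $4 \times 4$ single-channel image with a $2 \times 2$ window and stride $2$:

```py
import numpy as np

X = np.arange(16, dtype=float).reshape(1, 1, 4, 4)
patch = np.zeros((1, 4, 1, 2, 2))
for j in range(2):
    for i in range(2):
        patch[:, j * 2 + i] = X[:, :, j*2:j*2+2, i*2:i*2+2]
patch = patch.swapaxes(1, 2).reshape(1, 1, 4, -1)
print(np.max(patch, axis=3).reshape(1, 1, 2, 2))     # pooled values [[5, 7], [13, 15]]
print(np.argmax(patch, axis=3).reshape(1, 1, 2, 2))  # index of the max within each window
```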
As mentioned above, backpropagation through the pooling layer propagates the error unchanged to the pixel that had the maximum value. I implemented it as follows.
```py
def backward(self, X, delta, act):
    s_batch, k, h, w = X.shape
    oh, ow = delta.shape[2:]
    # ratio of input size to output size (the window size for non-overlapping pooling)
    rh, rw = h // oh, w // ow
    # flat index of each window's maximum: window offset plus argmax within the window
    ind = np.arange(s_batch * k * oh * ow) * rh * rw + self.__ind.flatten()
    return self.__backward(delta, ind, s_batch, k, h, w, oh, ow) * act.derivate(X)

def __backward(self, delta, ind, s_batch, k, h, w, oh, ow):
    # write each error at the position of its window's maximum, zeros elsewhere,
    # then reassemble the windows into the shape of the input image
    _delta = np.zeros(s_batch * k * h * w)
    _delta[ind] = delta.flatten()
    return _delta.reshape(s_batch, k, oh, ow, self.kh, self.kw).swapaxes(3, 4).reshape(s_batch, k, h, w)
```
I trained on handwritten digits using the MNIST dataset. The layer structure is the same as in "Implementation of convolutional neural network by Chainer".
The parameters are as follows.

- Input data: $10000$ grayscale images of size $28 \times 28$
- Learning rate: $\varepsilon = 0.005$
- Regularization coefficient: $\lambda = 0.0001$
- Learning rate decay coefficient: $\gamma = 0.9$
- Batch size: $5$
- Epochs: $50$
- Test data: $100$ grayscale images of the same size as the input data
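For reference, here is a minimal sketch of the mini-batch training loop these settings imply; `DummyModel` and its `train` method are placeholders standing in for the layer classes in the repository, not the actual training script.

```py
import numpy as np

class DummyModel:
    def train(self, X, y, eps, lam):
        return 0.0   # placeholder loss; the real model updates its weights here

X_train = np.random.rand(10000, 1, 28, 28)       # stand-in for the MNIST images
y_train = np.random.randint(0, 10, 10000)        # stand-in for the teacher labels
model = DummyModel()

epsilon, lam, gamma = 0.005, 0.0001, 0.9         # learning rate, L2 coefficient, decay
batch_size, epochs = 5, 50
for epoch in range(epochs):
    perm = np.random.permutation(len(X_train))
    for s in range(0, len(X_train), batch_size):
        batch = perm[s:s + batch_size]
        loss = model.train(X_train[batch], y_train[batch], epsilon, lam)
    epsilon *= gamma                             # decay the learning rate each epoch
```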
Below is a graph of the loss at each epoch. The loss eventually dropped to $0.104299490259$, and the classification accuracy on the $100$ test images was $\boldsymbol{0.96}$.
I was able to implement a CNN. I haven't implemented padding or batch normalization this time, but I'm tired, so I'll stop here. It made me appreciate how amazing the people who build these libraries are.