CNN implementation with just numpy

Introduction

I implemented a CNN in Python, using only numpy and no deep learning library. I used ["Deep Learning" (Machine Learning Professional Series, Takayuki Okatani)](http://www.amazon.co.jp/dp/4061529021) as a textbook.


CNN

A CNN is a feedforward network that uses convolution operations and is mainly applied to image recognition. In a general neural network, adjacent layers are fully connected; a CNN has special layers in which only specific units between adjacent layers are connected. These special layers perform the operations **convolution** and **pooling**, described below.

Convolution layer

Convolution is an operation that takes the products of corresponding pixels of a filter laid over the image and sums them. It detects shade patterns in the image that are similar to the shade pattern of the filter. Let the image size be $W \times W$, the pixel index $(i, j)$, and the pixel value $x_{ij}$. Let the filter size be $H \times H$, the filter index $(p, q)$, and the filter's pixel value $h_{pq}$. The convolution is expressed by the following formula.

```math
u_{ij} = \sum^{H-1}_{p=0} \sum^{H-1}_{q=0} x_{i+p,\, j+q} \, h_{pq}
```

When the filter is moved only within the range where it fits inside the image, the size of the convolution result is as follows, where $\lfloor \cdot \rfloor$ denotes truncation to an integer.

```math
(W - 2 \lfloor H / 2 \rfloor) \times (W - 2 \lfloor H / 2 \rfloor)
```
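As a quick check of the formula, here is a minimal numpy sketch of a single-channel convolution (stride $1$, no padding). The sizes $W = 6$, $H = 3$ and the box filter are illustrative assumptions, not values from this article.

```py
import numpy as np

W, H = 6, 3
x = np.arange(W * W, dtype=float).reshape(W, W)  # image x_ij
h = np.ones((H, H)) / (H * H)                    # filter h_pq (a box filter)

out = W - 2 * (H // 2)                           # W - 2 * floor(H / 2)
u = np.zeros((out, out))
for i in range(out):
    for j in range(out):
        # sum_p sum_q x_{i+p, j+q} * h_{pq}
        u[i, j] = np.sum(x[i:i+H, j:j+H] * h)

print(u.shape)  # (4, 4), matching (W - 2*floor(H/2)) x (W - 2*floor(H/2))
```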

In the convolution layer, the convolution operation is performed as shown in the figure below. Let the size of the input image be $W \times W \times K$ and the size of the convolution filter be $H \times H \times K \times M$, where $K$ is the number of image channels and $M$ is the number of filters. The result of convolving the $k$-th channel of layer $l-1$ with the $m$-th filter is as follows, where $b_{ijm}$ is the bias and $f$ is the activation function.

```math
\begin{align}
u_{ijm} &= \sum^{K-1}_{k=0} \sum^{H-1}_{p=0} \sum^{H-1}_{q=0} z^{(l-1)}_{i+p,\, j+q,\, k} \, h_{pqkm} + b_{ijm} \\
z_{ijm} &= f(u_{ijm})
\end{align}
```

The array $z_{ijm}$ is regarded as an $M$-channel image, and the input to the next layer is $z^{(l)}_{ijm}$. Since the filters are applied while being shifted across the image, the same weights are used repeatedly; this is called **weight sharing**.

*Figure: convolution layer (conv_layer.png)*

In the figure above, the number of channels of the input image is $K = 3$, the number of filters is $M = 2$, and a $2$-channel image is output.
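A naive loop version of the multi-channel formula may make the index bookkeeping clearer. The shapes ($W = 6$, $K = 3$, $H = 3$, $M = 2$), the ReLU activation, and the per-filter bias are illustrative assumptions; the formula in the text allows a per-position bias $b_{ijm}$.

```py
import numpy as np

W, K, H, M = 6, 3, 3, 2
z_prev = np.random.rand(W, W, K)   # z^(l-1)_{ijk}: K-channel input image
h = np.random.rand(H, H, K, M)     # h_{pqkm}: M filters of size H x H x K
b = np.zeros(M)                    # simplified to one bias per filter

out = W - 2 * (H // 2)
u = np.zeros((out, out, M))
for m in range(M):
    for i in range(out):
        for j in range(out):
            # sum over channels k and filter indices p, q
            u[i, j, m] = np.sum(z_prev[i:i+H, j:j+H, :] * h[:, :, :, m]) + b[m]

z = np.maximum(u, 0)               # z_{ijm} = f(u_{ijm}) with f = ReLU
print(z.shape)                     # (4, 4, 2): an M-channel image
```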

Pooling layer

Pooling is an operation that combines a local area of an image into a single value. It reduces the positional sensitivity of the features extracted in the convolution layer, so that the output of the pooling layer does not change under small misalignments. On an image of size $W \times W \times K$, take an $H \times H$ square region centered on pixel $(i, j)$, and let $P_{ij}$ be the set of pixels in that region. The value $u_{ijk}$ obtained by pooling can then be expressed as follows.

```math
u_{ijk} = \biggl( \frac{1}{H^2} \sum_{(p, q) \in P_{ij}} z^{P}_{pqk} \biggr)^{\frac{1}{P}}
```

When $P = 1$, this averages the pixels in the region and is called **average pooling**. When $P = \infty$, it takes the maximum value of the pixels in the region and is called **max pooling**. The figure below shows max pooling: a $4 \times 4$ input image is pooled with region size $2 \times 2$ and stride $2$.

*Figure: max pooling example (pool.png)*
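The example in the figure can be reproduced in a few lines of numpy; the concrete $4 \times 4$ input below is an illustrative assumption.

```py
import numpy as np

X = np.arange(16, dtype=float).reshape(4, 4)  # 4 x 4 input image
H, s = 2, 2                                   # region size 2 x 2, stride 2
oh = ow = (4 - H) // s + 1
u = np.zeros((oh, ow))
for i in range(oh):
    for j in range(ow):
        # max pooling (P = infinity); np.mean here would give average pooling
        u[i, j] = np.max(X[i*s:i*s+H, j*s:j*s+H])

print(u)  # [[ 5.  7.], [13. 15.]]: the maximum of each 2x2 region
```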

Learning

I will explain how the CNN is trained. The chapter on error backpropagation in "Implementing neural networks with python" is a helpful reference.

Weight update

To bring the output computed from the training data closer to the teacher labels, we minimize the error function $E$. We partially differentiate $E$ with respect to the weight $w$ and update the weight so that the gradient approaches $0$.

```math
w_{new} = w_{old} - \varepsilon \frac{\partial E}{\partial w_{old}}
```
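In code the update is a single line; the learning rate and the random placeholder gradient below are purely illustrative.

```py
import numpy as np

eps = 0.005                   # learning rate epsilon
w_old = np.random.rand(10)    # current weights
grad_E = np.random.rand(10)   # dE/dw, assumed to come from backpropagation
w_new = w_old - eps * grad_E  # gradient descent step
```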

Backpropagation of error

The calculation of error backpropagation is the same as for a general neural network: the error $\delta^{(l)}$ in layer $l$ is obtained from the product of the error $\delta^{(l+1)}$ in layer $l+1$, the weights $w^{(l+1)}$, and the derivative of the activation with respect to the input from layer $l$. However, a CNN differs from a fully connected network in two respects: the convolution layer shares its weights, and the pooling layer routes the error only through selected pixels. Both are handled in the implementation below.
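Written out for a fully connected layer (the standard form of the relation described above):

```math
\delta^{(l)}_j = f'(u^{(l)}_j) \sum_i w^{(l+1)}_{ij} \, \delta^{(l+1)}_i
```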

Implementation in Python

This time, I implemented the network used in "Implementation of convolutional neural network by Chainer". The code is available [here](https://github.com/shota-takayama/cnn). The key implementation points are introduced separately for the convolution layer and the pooling layer.

Implementation of convolution layer

In the convolution layer, local areas of the image are first cut out and arranged in order, and this array becomes the input for forward propagation. For example, when an input image of size $20 \times 12 \times 12$ is convolved with a filter of size $50 \times 20 \times 5 \times 5$, the input image is reshaped as shown in the figure below. The reshaped input has size $64 \times 20 \times 5 \times 5$, where the number $64$ is calculated as follows.

```math
(12 - \lfloor 5 / 2 \rfloor \times 2) \times (12 - \lfloor 5 / 2 \rfloor \times 2) = 64
```

*Figure: reshaping the input image into patches (im2patch.png)*

Next, compute the convolution of the reshaped input image and the filter. Again, the input and filter sizes are $64 \times 20 \times 5 \times 5$ and $50 \times 20 \times 5 \times 5$, respectively.

Calling `np.tensordot(X, weight, ((1, 2, 3), (1, 2, 3)))` gives an output of size $64 \times 50$.



![tensordot.png](https://qiita-image-store.s3.amazonaws.com/0/82527/1f8aaf87-bc78-a0f2-e798-888dec58990b.png)
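The shape bookkeeping of this `tensordot` call can be checked with random arrays (a sketch; the arrays stand in for the reshaped patches and the filter):

```py
import numpy as np

X = np.random.rand(64, 20, 5, 5)       # reshaped input: 64 patches of 20 x 5 x 5
weight = np.random.rand(50, 20, 5, 5)  # 50 filters of size 20 x 5 x 5

# contract the (channel, height, width) axes of both arrays
out = np.tensordot(X, weight, ((1, 2, 3), (1, 2, 3)))
print(out.shape)  # (64, 50)
```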

The following code is the key part of forward propagation in the convolution layer. Since the input has one extra dimension so that training can be done in mini-batches, the axes passed to `tensordot` are shifted by one: `axes=((2, 3, 4), (1, 2, 3))`.

```py
    def __forward(self, X):
        s_batch, k, xh, xw = X.shape
        m = self.weight.shape[0]
        # output size: input size minus 2 * floor(kernel size / 2)
        oh, ow = xh - self.kh // 2 * 2, xw - self.kw // 2 * 2
        self.__patch = self.__im2patch(X, s_batch, k, oh, ow)
        return np.tensordot(self.__patch, self.weight, ((2, 3, 4), (1, 2, 3))).swapaxes(1, 2).reshape(s_batch, m, oh, ow)

    def __im2patch(self, X, s_batch, k, oh, ow):
        # cut out each kh x kw local area and arrange the areas in order
        patch = np.zeros((s_batch, oh * ow, k, self.kh, self.kw))
        for j in range(oh):
            for i in range(ow):
                patch[:, j * ow + i, :, :, :] = X[:, :, j:j+self.kh, i:i+self.kw]
        return patch
```

In backpropagation through the convolution layer, the error is the product of the $\delta$ propagated from the next layer and the filter coefficients. I implemented it as follows. The error obtained for each local area is accumulated back into the shape of the input image; this is the reverse of cutting the input image into local areas during forward propagation.

```py
    def backward(self, delta, shape):
        s_batch, k, h, w = delta.shape
        # multiply delta by the filter weights (contracting over the m filters)
        delta_patch = np.tensordot(delta.reshape(s_batch, k, h * w), self.weight, (1, 0))
        return self.__patch2im(delta_patch, h, w, shape)

    def __patch2im(self, patch, h, w, shape):
        # accumulate the per-patch errors back into the input image shape
        im = np.zeros(shape)
        for j in range(h):
            for i in range(w):
                im[:, :, j:j+self.kh, i:i+self.kw] += patch[:, j * w + i]
        return im
```

Implementation of pooling layer

Forward propagation in the pooling layer also reshapes the input image so that its local regions are arranged in order. It differs from the convolution layer in that it stores the index of the pixel that gave the maximum value, because the error must be backpropagated to the pixel the propagated value came from. The following code is the key part of forward propagation in the pooling layer.

```py
    def forward(self, X):
        s_batch, k, h, w = X.shape
        # output size with region (kh, kw) and stride s
        oh, ow = (h - self.kh) // self.s + 1, (w - self.kw) // self.s + 1
        val, self.__ind = self.__max(X, s_batch, k, oh, ow)
        return val

    def __max(self, X, s_batch, k, oh, ow):
        patch = self.__im2patch(X, s_batch, k, oh, ow)
        # keep both each region's maximum and its index for backpropagation
        return [_f(patch, axis=3).reshape(s_batch, k, oh, ow) for _f in (np.max, np.argmax)]

    def __im2patch(self, X, s_batch, k, oh, ow):
        # cut out each kh x kw region at stride s and flatten it along the last axis
        patch = np.zeros((s_batch, oh * ow, k, self.kh, self.kw))
        for j in range(oh):
            for i in range(ow):
                _j, _i = j * self.s, i * self.s
                patch[:, j * ow + i, :, :, :] = X[:, :, _j:_j+self.kh, _i:_i+self.kw]
        return patch.swapaxes(1, 2).reshape(s_batch, k, oh * ow, -1)
```

As mentioned above, backpropagation in the pooling layer passes the error unchanged to the pixel that had the maximum value. I implemented it as follows.

```py
    def backward(self, X, delta, act):
        s_batch, k, h, w = X.shape
        oh, ow = delta.shape[2:]
        rh, rw = h // oh, w // ow
        # flat index of each region's max pixel within the flattened input
        ind = np.arange(s_batch * k * oh * ow) * rh * rw + self.__ind.flatten()
        return self.__backward(delta, ind, s_batch, k, h, w, oh, ow) * act.derivate(X)

    def __backward(self, delta, ind, s_batch, k, h, w, oh, ow):
        # scatter each delta to the position of its region's max pixel;
        # the final reshape assumes the regions tile the image (stride = region size)
        _delta = np.zeros(s_batch * k * h * w)
        _delta[ind] = delta.flatten()
        return _delta.reshape(s_batch, k, oh, ow, self.kh, self.kw).swapaxes(3, 4).reshape(s_batch, k, h, w)
```
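The routing idea is easier to see in a stripped-down, self-contained sketch (single image, single channel, $2 \times 2$ regions, stride $2$; this mirrors the behavior of `__backward`, not its exact code path):

```py
import numpy as np

X = np.random.rand(4, 4)      # input to the pooling layer
delta = np.random.rand(2, 2)  # error arriving from the next layer

grad = np.zeros_like(X)
for i in range(2):
    for j in range(2):
        region = X[i*2:i*2+2, j*2:j*2+2]
        p, q = np.unravel_index(np.argmax(region), region.shape)
        grad[i*2+p, j*2+q] = delta[i, j]  # only the max pixel receives the error

print(grad)  # nonzero exactly at each region's argmax
```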

Experiment with the MNIST dataset

I trained on handwritten digits using the MNIST dataset. The layer structure is the same as in "Implementation of convolutional neural network by Chainer".

Learning

The parameters are as follows.

- Input data: $10000$ grayscale images of size $28 \times 28$
- Learning rate: $\varepsilon = 0.005$
- Regularization coefficient: $\lambda = 0.0001$
- Learning rate decay coefficient: $\gamma = 0.9$
- Batch size: $5$
- Epochs: $50$
- Test data: $100$ grayscale images of the same size as the input data

Result

Below is a graph of the loss at each epoch. The loss eventually dropped to $0.104299490259$. The classification accuracy on the $100$ test images was $\boldsymbol{0.96}$.

*Figure: loss per epoch (loss.png)*

In conclusion

I was able to implement a CNN. I haven't implemented padding or batch normalization this time, but I'm tired, so I'll stop here. It made me feel that the people who build these libraries are amazing.
