On a whim I started studying "Deep Learning from Scratch: The theory and implementation of deep learning learned with Python", and these are my notes on where I stumbled in Chapter 7.

The execution environment is macOS Mojave + Anaconda 2019.10, and the Python version is 3.7.4. For details, refer to Chapter 1 of this memo.

(To other chapters of this memo: Chapter 1 / Chapter 2 / Chapter 3 / Chapter 4 / [Chapter 5](https://qiita.com/segavvy/items/8707e4e65aa7fa357d8a) / Chapter 6 / Chapter 7 / Chapter 8 / Summary)

This chapter describes Convolutional Neural Networks (CNN).

In addition to the existing Affine layer, Softmax layer, and ReLU layer, the Convolution layer and Pooling layer will appear.

The explanation of the convolution layer is easier to follow if you know a little about image processing.

The book says, "Images are usually three-dimensional shapes in the vertical, horizontal, and channel directions." Some readers may wonder: an image is two-dimensional data with height and width, so wouldn't a third dimension mean depth?

"Channel" here refers to the information for each color component, such as R, G, and B. In grayscale data such as MNIST, the intensity of one pixel can be expressed by a single value, so one channel is enough. In a color image, one pixel is expressed by the intensities of three values, red, green, and blue (RGB), so three channels are required. Color channels are not limited to RGB; there are also CMYK, HSV, and an alpha channel for transparency. For details, search for "RGB CMYK" and you will find many explanations (although many of them lean toward printing).

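To make the channel dimension concrete, here is a minimal sketch of how grayscale and color images might be laid out as arrays in the (C, H, W) order this chapter uses. The arrays are random placeholders I made up for illustration, not real image data.

```python
import numpy as np

# Hypothetical 28x28 images; the pixel values are random placeholders.
gray = np.random.rand(1, 28, 28)  # 1 channel (e.g. MNIST)
rgb = np.random.rand(3, 28, 28)   # 3 channels: R, G, B

# A mini-batch adds a leading batch dimension: (N, C, H, W).
batch = np.stack([rgb, rgb])
print(gray.shape, rgb.shape, batch.shape)
```

So the "third dimension" is the number of color components per pixel, not spatial depth.
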
The word "filter" is also used in a specialized sense. In image processing, it refers to an operation that extracts only the necessary parts of an image (for example, contours) or removes unnecessary information. If you are not familiar with it, getting an overview of convolution filters in image processing first will make this chapter easier to follow. @t-tkd3a's article showing the result images of 3x3 convolution filters is recommended because it makes them easy to picture.
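
As a rough illustration of what such a filter does, the following sketch slides a 3x3 kernel over a tiny image with plain loops. The helper `filter2d`, the 5x5 image, and the kernel values are all my own assumptions for illustration; this is not the book's implementation.

```python
import numpy as np


def filter2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# 5x5 image: dark on the left (0), bright on the right (1).
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)
# Kernel that responds to left-to-right brightness changes (vertical edges).
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)
edges = filter2d(image, kernel)
print(edges)
```

The output is large where a window straddles the dark-to-bright boundary and zero where the window is uniformly bright, which is the "extract only the contours" behavior described above.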

As an aside, this book seems to follow the rule of omitting the trailing long vowel mark on katakana words of three or more sounds, as in "Layer". However, "filter" is written with the long vowel mark, so the notation may simply not have been unified. By the way, when Microsoft changed its katakana long vowel notation in 2008 [^1], I was developing packaged applications for Windows and was in charge of correcting the wording in programs, manuals, and so on. It was hard work. Before that, I was involved in removing half-width kana from the GUI for Windows 98 ... In this industry, Japanese really is inconvenient :sweat:

Let's get back to the story and move on.

As for the pooling layer, I didn't have any particular stumbling blocks.

The implementations of the Convolution layer and Pooling layer are short, but they are complicated because the shape of the target data changes again and again through `im2col`, `numpy.ndarray.reshape`, and `numpy.ndarray.transpose`. It was confusing at first, but I came to understand it by referring to @daizutabi's article on the Convolution / Pooling layer implementation in "Deep Learning from scratch".
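
To see the shape change concretely, here is a simplified stand-in for `im2col`. The function name `im2col_sketch` is hypothetical and the code is my own sketch of the idea (collect every filter-sized patch into one row), assuming the (N, C, H, W) layout; the book's `common.util.im2col` may differ in detail.

```python
import numpy as np


def im2col_sketch(x, fh, fw, stride=1, pad=0):
    """Hypothetical simplified im2col: (N, C, H, W) -> (N*OH*OW, C*FH*FW)."""
    n, c, h, w = x.shape
    oh = (h + 2 * pad - fh) // stride + 1
    ow = (w + 2 * pad - fw) // stride + 1
    img = np.pad(x, [(0, 0), (0, 0), (pad, pad), (pad, pad)], mode='constant')
    col = np.zeros((n, c, fh, fw, oh, ow))
    for y in range(fh):
        for xx in range(fw):
            # For filter offset (y, xx), gather the pixel seen at every output position.
            col[:, :, y, xx, :, :] = img[:, :, y:y + stride * oh:stride,
                                         xx:xx + stride * ow:stride]
    # (N, C, FH, FW, OH, OW) -> (N, OH, OW, C, FH, FW) -> (N*OH*OW, C*FH*FW)
    return col.transpose(0, 4, 5, 1, 2, 3).reshape(n * oh * ow, -1)


x = np.arange(2 * 3 * 4 * 4, dtype=float).reshape(2, 3, 4, 4)  # (N, C, H, W)
col = im2col_sketch(x, 3, 3)
print(col.shape)  # one row per output position, one column per filter element
```

Each row of `col` is one flattened patch, so the convolution itself reduces to a single matrix product with the flattened filters.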

First is the implementation of the Convolution layer. I wrote a lot of comments because I lose track unless I write down the shapes.

`convolution.py`

```
# coding: utf-8
import os
import sys

import numpy as np

sys.path.append(os.pardir)  # Add the parent directory to the path
from common.util import im2col, col2im


class Convolution:
    def __init__(self, W, b, stride=1, pad=0):
        """Convolution layer

        Args:
            W (numpy.ndarray): Filter (weight), shape (FN, C, FH, FW).
            b (numpy.ndarray): Bias, shape (FN).
            stride (int, optional): Stride, default is 1.
            pad (int, optional): Padding, default is 0.
        """
        self.W = W
        self.b = b
        self.stride = stride
        self.pad = pad

        self.dW = None      # Derivative of the weight
        self.db = None      # Derivative of the bias

        self.x = None       # Forward-propagation input, needed for backpropagation
        self.col_x = None   # col expansion of the input at forward propagation
        self.col_W = None   # col expansion of the filter at forward propagation

    def forward(self, x):
        """Forward propagation

        Args:
            x (numpy.ndarray): Input, shape (N, C, H, W).

        Returns:
            numpy.ndarray: Output, shape (N, FN, OH, OW).
        """
        FN, C, FH, FW = self.W.shape  # FN: number of filters, C: channels, FH: filter height, FW: filter width
        N, x_C, H, W = x.shape        # N: batch size, x_C: channels, H: input height, W: input width
        assert C == x_C, f'Mismatch in the number of channels![C]{C}, [x_C]{x_C}'

        # Output size calculation
        assert (H + 2 * self.pad - FH) % self.stride == 0, 'OH is not divisible!'
        assert (W + 2 * self.pad - FW) % self.stride == 0, 'OW is not divisible!'
        OH = int((H + 2 * self.pad - FH) / self.stride + 1)
        OW = int((W + 2 * self.pad - FW) / self.stride + 1)

        # Expand the input data
        # (N, C, H, W) → (N * OH * OW, C * FH * FW)
        col_x = im2col(x, FH, FW, self.stride, self.pad)

        # Expand the filter
        # (FN, C, FH, FW) → (C * FH * FW, FN)
        col_W = self.W.reshape(FN, -1).T

        # Calculate the output (the calculation with col_x, col_W, b is exactly the same as the Affine layer)
        # (N * OH * OW, C * FH * FW)・(C * FH * FW, FN) → (N * OH * OW, FN)
        out = np.dot(col_x, col_W) + self.b

        # Shape the result
        # (N * OH * OW, FN) → (N, OH, OW, FN) → (N, FN, OH, OW)
        out = out.reshape(N, OH, OW, FN).transpose(0, 3, 1, 2)

        # Save for backpropagation
        self.x = x
        self.col_x = col_x
        self.col_W = col_W

        return out

    def backward(self, dout):
        """Backpropagation

        Args:
            dout (numpy.ndarray): Derivative passed from the layer on the right, shape (N, FN, OH, OW).

        Returns:
            numpy.ndarray: Derivative (gradient), shape (N, C, H, W).
        """
        FN, C, FH, FW = self.W.shape  # The derivative has the same shape as W: (FN, C, FH, FW)

        # Expand the derivative from the layer on the right
        # (N, FN, OH, OW) → (N, OH, OW, FN) → (N * OH * OW, FN)
        dout = dout.transpose(0, 2, 3, 1).reshape(-1, FN)

        # Derivative calculation (the calculation with col_x, col_W, b is exactly the same as the Affine layer)
        dcol_x = np.dot(dout, self.col_W.T)   # → (N * OH * OW, C * FH * FW)
        self.dW = np.dot(self.col_x.T, dout)  # → (C * FH * FW, FN)
        self.db = np.sum(dout, axis=0)        # → (FN)

        # Shape the derivative of the filter (weight)
        # (C * FH * FW, FN) → (FN, C * FH * FW) → (FN, C, FH, FW)
        self.dW = self.dW.transpose(1, 0).reshape(FN, C, FH, FW)

        # Shape the result (gradient)
        # (N * OH * OW, C * FH * FW) → (N, C, H, W)
        dx = col2im(dcol_x, self.x.shape, FH, FW, self.stride, self.pad)

        return dx
```
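
The result-shaping step, (N * OH * OW, FN) → (N, OH, OW, FN) → (N, FN, OH, OW), is where I kept getting lost, so here is a small sketch with made-up sizes. It only illustrates why a `reshape` followed by a `transpose` is needed instead of reshaping directly to the final shape.

```python
import numpy as np

N, FN, OH, OW = 2, 3, 2, 2
# Row r of `out` holds the FN filter responses for one (n, oh, ow) position.
out = np.arange(N * OH * OW * FN).reshape(N * OH * OW, FN)

good = out.reshape(N, OH, OW, FN).transpose(0, 3, 1, 2)  # as in forward()
bad = out.reshape(N, FN, OH, OW)                          # same shape, wrong layout

# The response of filter 1 for sample 0 at (oh=1, ow=0) lives in row 2, column 1.
print(good[0, 1, 1, 0], out[2, 1], bad[0, 1, 1, 0])
```

Both arrays have the shape (N, FN, OH, OW), but only the reshape-then-transpose version keeps each value attached to the right filter and position.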

Next is the implementation of the Pooling layer. This is also full of comments.

`pooling.py`

```
# coding: utf-8
import os
import sys

import numpy as np

sys.path.append(os.pardir)  # Add the parent directory to the path
from common.util import im2col, col2im


class Pooling:
    def __init__(self, pool_h, pool_w, stride=1, pad=0):
        """Pooling layer

        Args:
            pool_h (int): Pooling area height
            pool_w (int): Pooling area width
            stride (int, optional): Stride, default is 1.
            pad (int, optional): Padding, default is 0.
        """
        self.pool_h = pool_h
        self.pool_w = pool_w
        self.stride = stride
        self.pad = pad

        self.x = None        # Forward-propagation input, needed for backpropagation
        self.arg_max = None  # Position of the maximum in each row of col_x used at forward propagation

    def forward(self, x):
        """Forward propagation

        Args:
            x (numpy.ndarray): Input, shape (N, C, H, W).

        Returns:
            numpy.ndarray: Output, shape (N, C, OH, OW).
        """
        N, C, H, W = x.shape  # N: number of data, C: channels, H: height, W: width

        # Output size calculation
        assert (H - self.pool_h) % self.stride == 0, 'OH is not divisible!'
        assert (W - self.pool_w) % self.stride == 0, 'OW is not divisible!'
        OH = int((H - self.pool_h) / self.stride + 1)
        OW = int((W - self.pool_w) / self.stride + 1)

        # Expand and shape the input data
        # (N, C, H, W) → (N * OH * OW, C * PH * PW)
        col_x = im2col(x, self.pool_h, self.pool_w, self.stride, self.pad)
        # (N * OH * OW, C * PH * PW) → (N * OH * OW * C, PH * PW)
        col_x = col_x.reshape(-1, self.pool_h * self.pool_w)

        # Calculate the output
        # (N * OH * OW * C, PH * PW) → (N * OH * OW * C)
        out = np.max(col_x, axis=1)

        # Shape the result
        # (N * OH * OW * C) → (N, OH, OW, C) → (N, C, OH, OW)
        out = out.reshape(N, OH, OW, C).transpose(0, 3, 1, 2)

        # Save for backpropagation
        self.x = x
        self.arg_max = np.argmax(col_x, axis=1)  # Position (index) of the maximum in each row of col_x

        return out

    def backward(self, dout):
        """Backpropagation

        Args:
            dout (numpy.ndarray): Derivative passed from the layer on the right, shape (N, C, OH, OW).

        Returns:
            numpy.ndarray: Derivative (gradient), shape (N, C, H, W).
        """
        # Shape the derivative from the layer on the right
        # (N, C, OH, OW) → (N, OH, OW, C)
        dout = dout.transpose(0, 2, 3, 1)

        # Initialize the col for the resulting derivative with 0
        # (N * OH * OW * C, PH * PW)
        pool_size = self.pool_h * self.pool_w
        dcol_x = np.zeros((dout.size, pool_size))

        # Set the derivative from dout (dout as-is) only at the positions adopted as the maximum
        # during forward propagation. Positions that were not adopted stay at the initial 0.
        # (Same idea as ReLU's handling of x > 0 versus x <= 0.)
        assert dout.size == self.arg_max.size, \
            'Does not match the number of rows of col_x during forward propagation'
        dcol_x[np.arange(self.arg_max.size), self.arg_max.flatten()] = \
            dout.flatten()

        # Shape the resulting derivative, step 1
        # (N * OH * OW * C, PH * PW) → (N, OH, OW, C, PH * PW)
        dcol_x = dcol_x.reshape(dout.shape + (pool_size,))  # The trailing ',' makes a one-element tuple

        # Shape the resulting derivative, step 2
        # (N, OH, OW, C, PH * PW) → (N * OH * OW, C * PH * PW)
        dcol_x = dcol_x.reshape(
            dcol_x.shape[0] * dcol_x.shape[1] * dcol_x.shape[2], -1
        )

        # Shape the resulting derivative, step 3
        # (N * OH * OW, C * PH * PW) → (N, C, H, W)
        dx = col2im(
            dcol_x, self.x.shape, self.pool_h, self.pool_w, self.stride, self.pad
        )

        return dx
```
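
The core of the pooling backward pass is the fancy-indexing assignment that routes each incoming gradient to the position that held the maximum. Here is a minimal sketch with two made-up 2x2 windows (the numbers are arbitrary, chosen only so the maxima are easy to spot).

```python
import numpy as np

# Two pooling windows of size 2x2, flattened to rows of 4 values.
col_x = np.array([[1., 5., 2., 3.],
                  [4., 0., 9., 8.]])
arg_max = np.argmax(col_x, axis=1)  # position of the maximum in each row
out = np.max(col_x, axis=1)         # forward result

dout = np.array([10., 20.])         # gradients arriving from the next layer
dcol_x = np.zeros_like(col_x)
dcol_x[np.arange(arg_max.size), arg_max] = dout  # route each gradient to its maximum
print(dcol_x)
```

Every other position keeps the initial 0, which mirrors how ReLU passes gradients only where the input was positive.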

Implement CNN by combining the previous implementations.

First, I will organize the input and output in this network.

| Layer | Input/output shape | Shape in this implementation |
|---|---|---|
| | (batch size $N$, number of channels $CH$, image height $H$, width $W$) | $(100, 1, 28, 28)$ |
| :one: Convolution | ↓ | |
| | (batch size $N$, number of filters $FN$, output height $OH$, width $OW$) | $(100, 30, 24, 24)$ |
| :two: ReLU | ↓ | |
| | (batch size $N$, number of filters $FN$, output height $OH$, width $OW$) | $(100, 30, 24, 24)$ |
| :three: Pooling | ↓ | |
| | (batch size $N$, number of filters $FN$, output height $OH$, width $OW$) | $(100, 30, 12, 12)$ |
| :four: Affine | ↓ | |
| | (batch size $N$, hidden layer size) | $(100, 100)$ |
| :five: ReLU | ↓ | |
| | (batch size $N$, hidden layer size) | $(100, 100)$ |
| :six: Affine | ↓ | |
| | (batch size $N$, final output size) | $(100, 10)$ |
| :seven: Softmax | ↓ | |
| | (batch size $N$, final output size) | $(100, 10)$ |
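
The sizes 24 and 12 in the table come from the usual output-size formula, (input + 2 × pad − filter) / stride + 1. The following sketch just redoes that arithmetic; the helper name `conv_output_size` is my own, and the hyperparameters are the ones used in this memo (5x5 filter, stride 1, no padding; 2x2 pooling with stride 2).

```python
def conv_output_size(input_size, filter_size, stride=1, pad=0):
    """Output height/width of a convolution or pooling window."""
    return (input_size + 2 * pad - filter_size) // stride + 1


conv_out = conv_output_size(28, 5, stride=1, pad=0)        # Convolution: 5x5 filter
pool_out = conv_output_size(conv_out, 2, stride=2, pad=0)  # Pooling: 2x2, stride 2
affine_in = 30 * pool_out * pool_out                       # flattened input size of the first Affine layer
print(conv_out, pool_out, affine_in)
```

The last value is the number of rows the first Affine layer's weight matrix needs once the 30 pooled feature maps are flattened.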

The implementation of the Convolution layer and the Pooling layer is as described above.

The Affine layer requires some modifications to the previous implementation. When I implemented it in [5.6.2 Batch Affine Layer](https://qiita.com/segavvy/items/8707e4e65aa7fa357d8a#562-%E3%83%90%E3%83%83%E3%83%81%E7%89%88affine%E3%83%AC%E3%82%A4%E3%83%A4), the input was two-dimensional (batch size $N$, image size). This time, the input of the fourth layer, Affine, is four-dimensional (batch size $N$, number of filters $FN$, pooling result $OH$, $OW$), so the layer has to handle that. On page 152 of the book there is a note that "The implementation of Affine in common/layers.py is an implementation that considers the case where the input data is a tensor (4D data)". I had left it aside without understanding what it was for, but it turns out it was meant for this case.

The following is an implementation of the Affine layer that supports 3D or higher input.

`affine.py`

```
# coding: utf-8
import numpy as np


class Affine:
    def __init__(self, W, b):
        """Affine layer

        Args:
            W (numpy.ndarray): Weight
            b (numpy.ndarray): Bias
        """
        self.W = W      # Weight
        self.b = b      # Bias
        self.x = None   # Input (after conversion to 2D)
        self.dW = None  # Derivative of the weight
        self.db = None  # Derivative of the bias
        self.original_x_shape = None  # Original input shape (for input of 3D or more)

    def forward(self, x):
        """Forward propagation

        Args:
            x (numpy.ndarray): Input

        Returns:
            numpy.ndarray: Output
        """
        # Convert input of 3D or more (tensor) to 2D
        self.original_x_shape = x.shape  # Save the shape, because backpropagation has to restore it
        x = x.reshape(x.shape[0], -1)
        self.x = x

        # Calculate the output
        out = np.dot(x, self.W) + self.b

        return out

    def backward(self, dout):
        """Backpropagation

        Args:
            dout (numpy.ndarray): Derivative passed from the layer on the right

        Returns:
            numpy.ndarray: Derivative
        """
        # Derivative calculation
        dx = np.dot(dout, self.W.T)
        self.dW = np.dot(self.x.T, dout)
        self.db = np.sum(dout, axis=0)

        # Restore the original shape
        dx = dx.reshape(*self.original_x_shape)
        return dx
```
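
The only new part compared with the earlier Affine layer is the flatten-and-restore of the input shape. As a quick sanity check, this sketch traces the shapes with random placeholder data, assuming the sizes used in this memo (batch of 100, pooling output of 30 x 12 x 12, hidden size 100).

```python
import numpy as np

x = np.random.rand(100, 30, 12, 12)  # pooling output: (N, FN, OH, OW)
original_shape = x.shape
x2d = x.reshape(x.shape[0], -1)      # flatten everything except the batch axis

W = np.random.randn(x2d.shape[1], 100) * 0.01
b = np.zeros(100)
out = np.dot(x2d, W) + b             # forward

dout = np.ones_like(out)             # placeholder gradient from the next layer
dx = np.dot(dout, W.T).reshape(*original_shape)  # backward, restored to 4D
print(x2d.shape, out.shape, dx.shape)
```

The gradient has to go back to the 4D shape because the Pooling layer before it expects (N, C, OH, OW) in its `backward`.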

The ReLU and Softmax layers are the same as in the previous implementation, but I reprint them here.

`relu.py`

```
# coding: utf-8


class ReLU:
    def __init__(self):
        """ReLU layer
        """
        self.mask = None

    def forward(self, x):
        """Forward propagation

        Args:
            x (numpy.ndarray): Input

        Returns:
            numpy.ndarray: Output
        """
        self.mask = (x <= 0)
        out = x.copy()
        out[self.mask] = 0

        return out

    def backward(self, dout):
        """Backpropagation

        Args:
            dout (numpy.ndarray): Derivative passed from the layer on the right

        Returns:
            numpy.ndarray: Derivative
        """
        dout[self.mask] = 0
        dx = dout

        return dx
```

`softmax_with_loss.py`

```
# coding: utf-8
from functions import softmax, cross_entropy_error


class SoftmaxWithLoss:
    def __init__(self):
        """Softmax-with-Loss layer
        """
        self.loss = None  # Loss
        self.y = None     # Output of softmax
        self.t = None     # Teacher data (one-hot vector)

    def forward(self, x, t):
        """Forward propagation

        Args:
            x (numpy.ndarray): Input
            t (numpy.ndarray): Teacher data

        Returns:
            float: Cross entropy error
        """
        self.t = t
        self.y = softmax(x)
        self.loss = cross_entropy_error(self.y, self.t)

        return self.loss

    def backward(self, dout=1):
        """Backpropagation

        Args:
            dout (float, optional): Derivative passed from the layer on the right. The default is 1.

        Returns:
            numpy.ndarray: Derivative
        """
        batch_size = self.t.shape[0]  # Batch size
        dx = (self.y - self.t) * (dout / batch_size)

        return dx
```

The functions required to implement the softmax layer are also reprinted as before. The functions not used this time are deleted.

`functions.py`

```
# coding: utf-8
import numpy as np


def softmax(x):
    """Softmax function

    Args:
        x (numpy.ndarray): Input

    Returns:
        numpy.ndarray: Output
    """
    # For batch processing, x is a two-dimensional array of shape (batch size, 10).
    # In that case the calculation must be done per image, using broadcasting.
    # To share the code between 1D and 2D input, np.max() and np.sum() are computed
    # with axis=-1, and keepdims=True keeps the dimension so the result broadcasts as-is.
    c = np.max(x, axis=-1, keepdims=True)
    exp_a = np.exp(x - c)  # Overflow countermeasure
    sum_exp_a = np.sum(exp_a, axis=-1, keepdims=True)
    y = exp_a / sum_exp_a
    return y


def cross_entropy_error(y, t):
    """Calculation of cross entropy error

    Args:
        y (numpy.ndarray): Neural network output
        t (numpy.ndarray): Correct label

    Returns:
        float: Cross entropy error
    """
    # If there is only one data item, reshape it into a single row
    if y.ndim == 1:
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)

    # Calculate the error and normalize by the batch size
    batch_size = y.shape[0]
    return -np.sum(t * np.log(y + 1e-7)) / batch_size
```
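
The "overflow countermeasure" comment in `softmax` is easy to skim past, so here is a small usage sketch. It repeats the same function body and feeds it made-up scores large enough that a naive `np.exp(x)` would overflow; the inputs are arbitrary values I chose for illustration.

```python
import numpy as np


def softmax(x):
    """Same computation as above: subtract the row maximum before exponentiating."""
    c = np.max(x, axis=-1, keepdims=True)
    exp_a = np.exp(x - c)
    return exp_a / np.sum(exp_a, axis=-1, keepdims=True)


# Scores around 1000 would overflow np.exp() without the subtraction.
scores = np.array([[1000.0, 1001.0, 1002.0],
                   [0.0, 0.0, 0.0]])
y = softmax(scores)
print(y.sum(axis=-1))  # each row sums to 1

single = softmax(np.array([0.0, 0.0, 0.0]))  # 1D input works with the same code
print(single.shape)
```

Because of `axis=-1` and `keepdims=True`, the same function handles a single sample and a batch without any branching.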

As for the parameter optimizer, I had only read [6.1 Parameter Updates](https://qiita.com/segavvy/items/ca4ac4c9ee1a126bff41#61-%E3%83%91%E3%83%A9%E3%83%A1%E3%83%BC%E3%82%BF%E3%81%AE%E6%9B%B4%E6%96%B0) and skipped the implementation, so this time I tried implementing AdaGrad. It's almost the same as the code in the book.

`ada_grad.py`

```
# coding: utf-8
import numpy as np


class AdaGrad:
    def __init__(self, lr=0.01):
        """Parameter optimization with AdaGrad

        Args:
            lr (float, optional): Learning rate, default is 0.01.
        """
        self.lr = lr
        self.h = None  # Sum of squares of the gradients so far

    def update(self, params, grads):
        """Parameter update

        Args:
            params (dict): Dictionary of parameters to update, with keys such as 'W1', 'b1'.
            grads (dict): Dictionary of gradients corresponding to params
        """
        # Initialize h
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)

        # Update
        for key in params.keys():
            # Update h
            self.h[key] += grads[key] ** 2

            # Update the parameter; the trailing 1e-7 avoids division by zero
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
```
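
To get a feel for what dividing by the accumulated `h` does, here is a small sketch that applies the same update rule by hand. The quadratic loss, the starting point, and `lr = 0.5` are hypothetical choices of mine, picked so that the two gradients differ by a factor of 100.

```python
import numpy as np

lr = 0.5
w = np.array([1.0, 1.0])
h = np.zeros_like(w)

# Hypothetical loss f(w) = 10 * w[0]**2 + 0.1 * w[1]**2 (gradients differ 100x in scale).
steps = []
for _ in range(3):
    grad = np.array([20.0 * w[0], 0.2 * w[1]])
    h += grad ** 2                          # accumulate squared gradients
    step = lr * grad / (np.sqrt(h) + 1e-7)  # same rule as AdaGrad.update
    w -= step
    steps.append(step.copy())

print(steps)
```

In this toy setup both coordinates move by about the same amount on the first step despite the 100x difference in gradient, and the step size shrinks on later iterations as `h` grows.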

I implemented the CNN based on the `TwoLayerNet` I made earlier in [5.7.2 Implementation of Neural Network for Error Backpropagation](https://qiita.com/segavvy/items/8707e4e65aa7fa357d8a#572-%E8%AA%A4%E5%B7%AE%E9%80%86%E4%BC%9D%E6%92%AD%E6%B3%95%E3%81%AB%E5%AF%BE%E5%BF%9C%E3%81%97%E3%81%9F%E3%83%8B%E3%83%A5%E3%83%BC%E3%83%A9%E3%83%AB%E3%83%8D%E3%83%83%E3%83%88%E3%83%AF%E3%83%BC%E3%82%AF%E3%81%AE%E5%AE%9F%E8%A3%85), following the explanation in the book.

The code in the book uses `OrderedDict`, but as before I use the ordinary `dict` here. This is because, starting with Python 3.7, the insertion order of `dict` objects is preserved [^2]. Also, I stumbled on the implementation of `accuracy`, so I will explain it later.
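
Here is a minimal sketch of the property the network relies on: a plain `dict` iterates in insertion order, so the layers run forward in the order they were registered and backward in the reverse order. The string values are placeholders standing in for layer objects.

```python
layers = {}
layers['Conv1'] = 'convolution'  # placeholder values instead of real layer objects
layers['Relu1'] = 'relu'
layers['Pool1'] = 'pooling'

forward_order = list(layers.keys())
# Wrapping in list() first keeps this working on Python 3.7, where reversed()
# cannot be applied to a dict or its views directly (that arrived in 3.8).
backward_order = list(reversed(list(layers.keys())))
print(forward_order, backward_order)
```

This is why `gradient` in the network code calls `reversed(list(self.layers.values()))` rather than `reversed(self.layers.values())`.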

Below is the implementation of CNN.

`simple_conv_net.py`

```
# coding: utf-8
import numpy as np

from affine import Affine
from convolution import Convolution
from pooling import Pooling
from relu import ReLU
from softmax_with_loss import SoftmaxWithLoss


class SimpleConvNet:
    def __init__(
        self, input_dim=(1, 28, 28),
        conv_param={'filter_num': 30, 'filter_size': 5, 'pad': 0, 'stride': 1},
        hidden_size=100, output_size=10, weight_init_std=0.01
    ):
        """Simple convolutional neural network

        Args:
            input_dim (tuple, optional): Input data shape, default is (1, 28, 28).
            conv_param (dict, optional): Hyperparameters of the convolution layer,
                default is {'filter_num':30, 'filter_size':5, 'pad':0, 'stride':1}.
            hidden_size (int, optional): Number of neurons in the hidden layer, default is 100.
            output_size (int, optional): Number of neurons in the output layer, default is 10.
            weight_init_std (float, optional): Scale of the initial weights, default is 0.01.
        """
        # Extract the hyperparameters of the convolution layer
        filter_num = conv_param['filter_num']    # Number of filters
        filter_size = conv_param['filter_size']  # Filter size (same height and width)
        filter_stride = conv_param['stride']     # Stride
        filter_pad = conv_param['pad']           # Padding

        # The hyperparameters of the pooling layer are fixed
        pool_size = 2    # Size (same height and width)
        pool_stride = 2  # Stride
        pool_pad = 0     # Padding

        # Input data size calculation
        input_ch = input_dim[0]  # Number of input data channels
        assert input_dim[1] == input_dim[2], 'Input data is assumed to have the same height and width!'
        input_size = input_dim[1]  # Input data size

        # Calculation of the output size of the convolution layer
        assert (input_size + 2 * filter_pad - filter_size) \
            % filter_stride == 0, 'The output size of the convolution layer is not divisible!'
        conv_output_size = int(
            (input_size + 2 * filter_pad - filter_size) / filter_stride + 1
        )

        # Calculation of the output size of the pooling layer
        assert (conv_output_size - pool_size) % pool_stride == 0, \
            'The output size of the pooling layer is not divisible!'
        pool_output_size_one = int(
            (conv_output_size - pool_size) / pool_stride + 1  # Height / width size
        )
        pool_output_size = filter_num * \
            pool_output_size_one * pool_output_size_one  # Total size over all filters

        # Weight initialization
        self.params = {}
        # Convolution layer
        self.params['W1'] = weight_init_std * \
            np.random.randn(filter_num, input_ch, filter_size, filter_size)
        self.params['b1'] = np.zeros(filter_num)
        # Affine layer 1
        self.params['W2'] = weight_init_std * \
            np.random.randn(pool_output_size, hidden_size)
        self.params['b2'] = np.zeros(hidden_size)
        # Affine layer 2
        self.params['W3'] = weight_init_std * \
            np.random.randn(hidden_size, output_size)
        self.params['b3'] = np.zeros(output_size)

        # Layer generation
        self.layers = {}  # OrderedDict is unnecessary: dicts preserve insertion order from Python 3.7
        # Convolution layer
        self.layers['Conv1'] = Convolution(
            self.params['W1'], self.params['b1'], filter_stride, filter_pad
        )
        self.layers['Relu1'] = ReLU()
        self.layers['Pool1'] = Pooling(
            pool_size, pool_size, pool_stride, pool_pad
        )
        # Affine layer 1
        self.layers['Affine1'] = \
            Affine(self.params['W2'], self.params['b2'])
        self.layers['Relu2'] = ReLU()
        # Affine layer 2
        self.layers['Affine2'] = \
            Affine(self.params['W3'], self.params['b3'])

        self.lastLayer = SoftmaxWithLoss()

    def predict(self, x):
        """Inference by the neural network

        Args:
            x (numpy.ndarray): Input to the neural network

        Returns:
            numpy.ndarray: Neural network output
        """
        # Propagate forward through the layers
        for layer in self.layers.values():
            x = layer.forward(x)

        return x

    def loss(self, x, t):
        """Loss function value calculation

        Args:
            x (numpy.ndarray): Input to the neural network
            t (numpy.ndarray): Correct label

        Returns:
            float: Loss function value
        """
        # Inference
        y = self.predict(x)

        # Calculate by forward propagation of the Softmax-with-Loss layer
        loss = self.lastLayer.forward(y, t)

        return loss

    def accuracy(self, x, t, batch_size=100):
        """Recognition accuracy calculation

        batch_size is the batch size used during the calculation. Processing a
        large amount of data at once makes im2col consume so much memory that
        thrashing occurs and nothing works, so the data is split to avoid that.

        Args:
            x (numpy.ndarray): Input to the neural network
            t (numpy.ndarray): Correct label (one-hot)
            batch_size (int, optional): Batch size during the calculation, default is 100.

        Returns:
            float: Recognition accuracy
        """
        # Calculate the number of splits
        batch_num = max(int(x.shape[0] / batch_size), 1)

        # Split
        x_list = np.array_split(x, batch_num, 0)
        t_list = np.array_split(t, batch_num, 0)

        # Process each split
        correct_num = 0  # Total number of correct answers
        for (sub_x, sub_t) in zip(x_list, t_list):
            assert sub_x.shape[0] == sub_t.shape[0], 'Did the division boundary shift?'
            y = self.predict(sub_x)
            y = np.argmax(y, axis=1)
            t = np.argmax(sub_t, axis=1)
            correct_num += np.sum(y == t)

        # Calculate the recognition accuracy
        return correct_num / x.shape[0]

    def gradient(self, x, t):
        """Calculate the gradients for the weight parameters by error backpropagation

        Args:
            x (numpy.ndarray): Input to the neural network
            t (numpy.ndarray): Correct label

        Returns:
            dict: A dictionary containing the gradients
        """
        # Forward propagation
        self.loss(x, t)  # Propagate forward to calculate the loss value

        # Backpropagation
        dout = self.lastLayer.backward()
        for layer in reversed(list(self.layers.values())):
            dout = layer.backward(dout)

        # Extract the derivatives of each layer
        grads = {}
        grads['W1'] = self.layers['Conv1'].dW
        grads['b1'] = self.layers['Conv1'].db
        grads['W2'] = self.layers['Affine1'].dW
        grads['b2'] = self.layers['Affine1'].db
        grads['W3'] = self.layers['Affine2'].dW
        grads['b3'] = self.layers['Affine2'].db

        return grads
```

The stumbling block in this implementation is `accuracy`, which is omitted in the book.

During learning, the recognition accuracy is calculated once per epoch. In the code written up to Chapter 4, all 60,000 pieces of training data were passed in at once to obtain the recognition accuracy. However, if I do the same thing this time, the expansion by `im2col` seems to consume a lot of memory, and my VM with 4GB of memory grinds to a halt from thrashing [^3] :sweat:

However, the book's source consumes less memory and runs normally in my environment. That seemed strange, so I followed the source and found that it splits the data and processes it in pieces internally. So I imitated that and split the data internally as well, using `numpy.array_split` to implement the splitting.
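
As a back-of-the-envelope check on why the full data set is too much, this sketch estimates the size of the `im2col` result for the convolution layer alone, and then shows how `np.array_split` divides the data. The 8-bytes-per-element figure assumes float64, which is my assumption and may not match the actual dtype; the 250-sample array is a made-up example.

```python
import numpy as np

# Rough size of the im2col result, (N * OH * OW, C * FH * FW) elements,
# assuming float64 (8 bytes per element; an assumption, the real dtype may differ).
N, C, FH, FW, OH, OW = 60000, 1, 5, 5, 24, 24
bytes_all = N * OH * OW * C * FH * FW * 8
bytes_batch = 100 * OH * OW * C * FH * FW * 8
print(f'{bytes_all / 1e9:.1f} GB at once, {bytes_batch / 1e6:.1f} MB per batch of 100')

# np.array_split splits along axis 0 and tolerates sizes that do not divide evenly.
x = np.zeros((250, 1, 28, 28))
batch_num = max(int(x.shape[0] / 100), 1)
chunks = np.array_split(x, batch_num, 0)
print([c.shape[0] for c in chunks])
```

Under that assumption, the convolution layer's expansion alone would need several gigabytes for 60,000 images, which is consistent with a 4GB VM thrashing, while a batch of 100 stays in the megabyte range.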

The training code is based on the earlier [5.7.4 Learning using the error back propagation method](https://qiita.com/segavvy/items/8707e4e65aa7fa357d8a#574-%E8%AA%A4%E5%B7%AE%E9%80%86%E4%BC%9D%E6%92%AD%E6%B3%95%E3%82%92%E4%BD%BF%E3%81%A3%E3%81%9F%E5%AD%A6%E7%BF%92). Below are some points.

- Unlike last time, the input images have the shape (1, 28, 28), so you need to specify `flatten=False` when reading the MNIST data with `load_mnist`.
- The hyperparameter `learning_rate` was lowered for AdaGrad; after trying several values I settled on `0.06`.
- The number of updates is set to `6000` (10 epochs) because the recognition accuracy on the test data stabilizes relatively quickly.
- In the previous source, the displayed number of updates was off by one, and the first recognition accuracy shown was measured after one update rather than before any update, so I corrected both.
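
The relation between the number of updates and the number of epochs in the points above can be checked with a few lines. This sketch uses the same values as this memo (60,000 training images, batch size 100, 6,000 updates) and the same `(i + 1) % iter_per_epoch == 0` condition that decides when accuracy is reported.

```python
train_size, batch_size, iters_num = 60000, 100, 6000
iter_per_epoch = max(train_size // batch_size, 1)

# Update counts (1-based) at which the per-epoch accuracy would be reported.
report_at = [i + 1 for i in range(iters_num) if (i + 1) % iter_per_epoch == 0]
print(iter_per_epoch, len(report_at), report_at[:3])
```

Counting with `i + 1` is what keeps the displayed update number from being off by one.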

Below is the implementation of learning.

`mnist.py`

```
# coding: utf-8
import os
import sys

import matplotlib.pylab as plt
import numpy as np

from ada_grad import AdaGrad
from simple_conv_net import SimpleConvNet

sys.path.append(os.pardir)  # Add the parent directory to the path
from dataset.mnist import load_mnist

# Read the MNIST training data and test data
(x_train, t_train), (x_test, t_test) = \
    load_mnist(normalize=True, flatten=False, one_hot_label=True)

# Hyperparameter settings
iters_num = 6000       # Number of updates
batch_size = 100       # Batch size
learning_rate = 0.06   # Learning rate (assuming AdaGrad)

train_size = x_train.shape[0]  # Training data size
iter_per_epoch = max(int(train_size / batch_size), 1)  # Number of iterations per epoch

# Generate the simple convolutional neural network
network = SimpleConvNet(
    input_dim=(1, 28, 28),
    conv_param={'filter_num': 30, 'filter_size': 5, 'pad': 0, 'stride': 1},
    hidden_size=100, output_size=10, weight_init_std=0.01
)

# Generate the optimizer
optimizer = AdaGrad(learning_rate)  # AdaGrad

# Check the recognition accuracy before learning
train_acc = network.accuracy(x_train, t_train)
test_acc = network.accuracy(x_test, t_test)
train_loss_list = []            # History of the loss function value
train_acc_list = [train_acc]    # History of the recognition accuracy on the training data
test_acc_list = [test_acc]      # History of the recognition accuracy on the test data
print(f'Before learning[Training data recognition accuracy]{train_acc:.4f} [Test data recognition accuracy]{test_acc:.4f}')

# Start learning
for i in range(iters_num):

    # Mini-batch generation
    batch_mask = np.random.choice(train_size, batch_size, replace=False)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    # Gradient calculation
    grads = network.gradient(x_batch, t_batch)

    # Weight parameter update
    optimizer.update(network.params, grads)

    # Loss function value calculation
    loss = network.loss(x_batch, t_batch)
    train_loss_list.append(loss)

    # Recognition accuracy calculation for each epoch
    if (i + 1) % iter_per_epoch == 0:
        train_acc = network.accuracy(x_train, t_train)
        test_acc = network.accuracy(x_test, t_test)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)

        # Progress display
        print(
            f'[epoch]{(i + 1) // iter_per_epoch:>2} '
            f'[Number of updates]{i + 1:>5} [Loss function value]{loss:.4f} '
            f'[Training data recognition accuracy]{train_acc:.4f} [Test data recognition accuracy]{test_acc:.4f}'
        )

# Draw the history of the loss function value
x = np.arange(len(train_loss_list))
plt.plot(x, train_loss_list, label='loss')
plt.xlabel('iteration')
plt.ylabel('loss')
plt.xlim(left=0)
plt.ylim(0, 2.5)
plt.show()

# Draw the history of the recognition accuracy on the training data and test data
x2 = np.arange(len(train_acc_list))
plt.plot(x2, train_acc_list, label='train acc')
plt.plot(x2, test_acc_list, label='test acc', linestyle='--')
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.xlim(left=0)
plt.ylim(0, 1.0)
plt.legend(loc='lower right')
plt.show()
```

Below are the execution results. It took about an hour in my environment.

```
Before learning[Training data recognition accuracy]0.0909 [Test data recognition accuracy]0.0909
[epoch] 1 [Number of updates] 600 [Loss function value]0.0699 [Training data recognition accuracy]0.9784 [Test data recognition accuracy]0.9780
[epoch] 2 [Number of updates] 1200 [Loss function value]0.0400 [Training data recognition accuracy]0.9844 [Test data recognition accuracy]0.9810
[epoch] 3 [Number of updates] 1800 [Loss function value]0.0362 [Training data recognition accuracy]0.9885 [Test data recognition accuracy]0.9853
[epoch] 4 [Number of updates] 2400 [Loss function value]0.0088 [Training data recognition accuracy]0.9907 [Test data recognition accuracy]0.9844
[epoch] 5 [Number of updates] 3000 [Loss function value]0.0052 [Training data recognition accuracy]0.9926 [Test data recognition accuracy]0.9851
[epoch] 6 [Number of updates] 3600 [Loss function value]0.0089 [Training data recognition accuracy]0.9932 [Test data recognition accuracy]0.9850
[epoch] 7 [Number of updates] 4200 [Loss function value]0.0029 [Training data recognition accuracy]0.9944 [Test data recognition accuracy]0.9865
[epoch] 8 [Number of updates] 4800 [Loss function value]0.0023 [Training data recognition accuracy]0.9954 [Test data recognition accuracy]0.9873
[epoch] 9 [Number of updates] 5400 [Loss function value]0.0051 [Training data recognition accuracy]0.9959 [Test data recognition accuracy]0.9860
[epoch]10 [Number of updates] 6000 [Loss function value]0.0037 [Training data recognition accuracy]0.9972 [Test data recognition accuracy]0.9860
```

As a result, the recognition accuracy on the training data was 99.72%, and the recognition accuracy on the test data was 98.60%. After just one epoch, it had already exceeded the previous recognition accuracy. Since the recognition accuracy on the test data has barely changed since around epoch 7, the later epochs may just have been overfitting. Even so, an accuracy of 98.60% with a simple CNN is amazing.

I also tried running the book's source, and for some reason the calculation of the recognition accuracy for each epoch was very fast. Wondering why, I found that the `Trainer` class can sample via its `evaluate_sample_num_per_epoch` parameter, and that the accuracy was being calculated on only the first 1,000 training images and test images. No fair! :unamused:

It's amazing that the necessary filters, such as edge and blob extraction, are created automatically. It is very interesting that the level of abstraction increases as the layers get deeper.

It is said that big data and GPUs are making a big contribution to the development of deep learning, but I think that the spread of the cloud, which makes it possible to use huge machine resources at low cost, is also a big point.

Also, as a complete digression, the book says that LeNet was proposed in 1998, 20 years ago. What struck me more than that was that 1998 still feels fairly recent to me. I don't want to get old :sweat:

It was a bit difficult to implement, but it helped me understand CNN. That's all for this chapter. If you have any mistakes, I would be grateful if you could point them out.


[^1]: [Changes in the long vowel notation at the end of foreign words and katakana terms in Microsoft products and services](https://web.archive.org/web/20130228002415/http://www.microsoft.com/japan/presspass/detail.aspx?newsid=3491) (The original page no longer exists, so, like the source cited in the Wikipedia article on the Choonpu, this links to the Internet Archive's Wayback Machine.)

[^2]: See "Python data model improvements" in What's New In Python 3.7.

[^3]: Thrashing is a phenomenon that occurs when memory is insufficient, and it is troublesome because the whole OS may become unresponsive. If you are interested in OS memory management, please check out my earlier post, Introduction to Memory Management for Everyone: 01! :grin:
