Chapter 7, on convolutional neural networks, looks quite different from what we did up to Chapter 6. It seems to do a lot of different things, but in the end it still computes and stores the gradients of the weights and biases. In other words, the basic principle has not changed at all; what has changed is the input data.
P207: When the input data is an image, it normally has a 3D shape: height, width, and channels. To feed it to a fully connected layer, however, the 3D data has to be flattened into 1D data. In fact, in the earlier examples using the MNIST dataset, the input images had the shape (1, 28, 28), that is, 1 channel, 28 pixels high, 28 pixels wide, but were arranged in a single row, so 784 values were fed to the first Affine layer. ... The convolution layer, on the other hand, preserves the shape. For an image, it receives the input as 3D data and passes 3D data on to the next layer as well. As a result, a CNN can (potentially) make proper sense of data that has a shape, such as images.
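To make the difference concrete, here is a small sketch of my own (not from the book) using a dummy array in place of an MNIST image. The array is filled with np.arange so that each value equals its position after flattening.

```python
import numpy as np

# Dummy "image" batch shaped like MNIST: (N, C, H, W) = (1, 1, 28, 28).
# Each pixel value equals its index in the flattened row.
img = np.arange(28 * 28).reshape(1, 1, 28, 28)

# What a fully connected (Affine) layer needs: one row of 784 values per image.
flat = img.reshape(img.shape[0], -1)
print(flat.shape)  # (1, 784)

# Two vertically adjacent pixels, (row 0, col 0) and (row 1, col 0),
# end up 28 positions apart once the image is flattened.
print(img[0, 0, 0, 0], img[0, 0, 1, 0])  # 0 28

# A convolution layer receives the 4D array unchanged.
print(img.shape)  # (1, 1, 28, 28)
```

Flattening keeps every value, but the fact that two pixels were neighbours is no longer visible from the layout, which is the point the quoted passage makes.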
In fact, in Self-study memo #6-2, when I processed Kaggle's cat and dog dataset, I converted the 3D image data to one dimension before using it. If it can be processed in three dimensions as it is, the recognition rate may improve.
These explanations are not difficult at all and I can follow them as they are, but then a formula suddenly appears on P212. What is it? Is it really true? So I thought it through.
To start with, let's consider the case with no S (stride).
Let's check a few input sizes and filter sizes.
With an input size of (n, n) and a filter size of (m, m), the output size appears to be (n-m+1, n-m+1). If you place the filter in the upper-left corner, you can shift it to the right (n-m) times, and likewise down (n-m) times. Adding 1 for the starting position in the upper-left corner gives n-m+1.
So what happens with a stride? With a stride of 2, the number of shifts to the right, (n-m), is halved to (n-m)/2. With a stride of 3 it becomes one third.
In other words, the number of possible shifts is (n-m)/s, so the output size is (n-m)/s + 1.
Let the input data size be (H, W), the padding P, and the filter size (FH, FW).
Vertically, n = H + 2×P and m = FH. Horizontally, n = W + 2×P and m = FW.
So the output size is
OH = (H + 2×P - FH)/s + 1
OW = (W + 2×P - FW)/s + 1
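As a check on this reasoning, the following sketch compares the formula with a count obtained by actually sliding the filter. Both function names are my own and are not from the book; floor division is used so that a stride that does not divide evenly is truncated.

```python
def conv_output_size(input_size, filter_size, stride=1, pad=0):
    """Output size from the formula (H + 2P - FH) / S + 1, with floor division."""
    return (input_size + 2 * pad - filter_size) // stride + 1


def count_positions(input_size, filter_size, stride=1, pad=0):
    """Count the filter positions by listing every start offset."""
    padded = input_size + 2 * pad
    # Valid start offsets are 0, stride, 2*stride, ... up to padded - filter_size.
    return len(range(0, padded - filter_size + 1, stride))


# (input size, filter size, stride, pad)
cases = [(28, 5, 1, 0), (8, 4, 2, 0), (7, 3, 2, 0), (4, 3, 1, 1)]
for n, m, s, p in cases:
    print(n, m, s, p, conv_output_size(n, m, s, p), count_positions(n, m, s, p))
# 28 5 1 0 24 24
# 8 4 2 0 3 3
# 7 3 2 0 3 3
# 4 3 1 1 4 4
```

The first case is the MNIST setting used later in this memo (28x28 input, 5x5 filter), and the second is the 8x8 test array with a 4x4 filter and stride 2.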
From P230 onward, the book describes the SimpleConvNet class as an example for training on MNIST data. Let's train with this class.
import sys, os
sys.path.append(os.pardir)  # Settings for importing files in the parent directory
import numpy as np
from dataset.mnist import load_mnist
from common.simple_convnet import SimpleConvNet
from common.trainer import Trainer

# Data reading (flatten=False keeps the (1, 28, 28) shape)
(x_train, t_train), (x_test, t_test) = load_mnist(flatten=False)

max_epochs = 20

network = SimpleConvNet(input_dim=(1,28,28),
                        conv_param = {'filter_num': 30, 'filter_size': 5, 'pad': 0, 'stride': 1},
                        hidden_size=100, output_size=10, weight_init_std=0.01)

trainer = Trainer(network, x_train, t_train, x_test, t_test,
                  epochs=max_epochs, mini_batch_size=100,
                  optimizer='Adam', optimizer_param={'lr': 0.001},
                  evaluate_sample_num_per_epoch=1000, verbose=False)
# Run the training loop
trainer.train()

# Save the learned parameters so they can be reloaded later
network.save_params("params.pkl")
Next, I checked the predictions on the test data.
import numpy as np
from common.simple_convnet import SimpleConvNet
from dataset.mnist import load_mnist
import pickle
import matplotlib.pyplot as plt

def showImg(x):
    example = x.reshape((28, 28))
    plt.imshow(example, cmap=plt.cm.binary)

# Evaluate with test data
(x_train, t_train), (x_test, t_test) = load_mnist(flatten=False)
x = x_test
t = t_test

network = SimpleConvNet(input_dim=(1,28,28),
                        conv_param = {'filter_num': 30, 'filter_size': 5, 'pad': 0, 'stride': 1},
                        hidden_size=100, output_size=10, weight_init_std=0.01)
# Load the trained parameters (assumes they were saved to params.pkl
# with network.save_params after training)
network.load_params("params.pkl")

y = network.predict(x)

accuracy_cnt = 0
for i in range(len(x)):
    p = np.argmax(y[i])
    #print(str(x[i]) + " : " + str(p))
    if p == t[i]:
        accuracy_cnt += 1
    print("Correct answer:" + str(t[i]) + " Inference result:" + str(p))
print("Accuracy:" + str(float(accuracy_cnt) / len(x)))
As a result, the correct answer rate is
What was wrong is like this
However, it took hours to process the 60,000 training samples. On top of that, when I tried to process the test data after training, I ran into an out-of-memory problem and could not get through it easily. Is 4 GB of memory too little for deep learning?
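One way around the memory problem is to run the prediction on small slices of the data instead of all 10,000 test images at once, much as the accuracy method of SimpleConvNet does. The sketch below is my own: predict_in_batches is a hypothetical helper, and dummy_predict is only a stand-in for network.predict so that the example runs without the book's code.

```python
import numpy as np


def predict_in_batches(predict_fn, x, batch_size=100):
    """Apply predict_fn to consecutive slices of x and stack the results."""
    outputs = []
    for start in range(0, len(x), batch_size):
        outputs.append(predict_fn(x[start:start + batch_size]))
    return np.concatenate(outputs, axis=0)


def dummy_predict(batch):
    """Stand-in for network.predict: returns 10 'scores' per sample."""
    return batch.reshape(len(batch), -1)[:, :10]


x = np.random.rand(250, 1, 28, 28)
y = predict_in_batches(dummy_predict, x, batch_size=100)
print(y.shape)  # (250, 10)
```

With the real network the call would presumably be predict_in_batches(network.predict, x_test), so that only 100 images are expanded at a time.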
For now, at least, I was able to confirm that the CNN learns with high accuracy.
As usual, I would like to follow the contents of the program.
# coding: utf-8
import sys, os
sys.path.append(os.pardir) #Settings for importing files in the parent directory
import pickle
import numpy as np
from collections import OrderedDict
from common.layers import *
from common.gradient import numerical_gradient
class SimpleConvNet:
    def __init__(self, input_dim=(1, 28, 28),
                 conv_param={'filter_num':30, 'filter_size':5, 'pad':0, 'stride':1},
                 hidden_size=100, output_size=10, weight_init_std=0.01):
        filter_num = conv_param['filter_num']
        filter_size = conv_param['filter_size']
        filter_pad = conv_param['pad']
        filter_stride = conv_param['stride']
        input_size = input_dim[1]
        conv_output_size = (input_size - filter_size + 2*filter_pad) / filter_stride + 1
        pool_output_size = int(filter_num * (conv_output_size/2) * (conv_output_size/2))

        # Weight initialization
        self.params = {}
        self.params['W1'] = weight_init_std * \
                            np.random.randn(filter_num, input_dim[0], filter_size, filter_size)
        self.params['b1'] = np.zeros(filter_num)
        self.params['W2'] = weight_init_std * \
                            np.random.randn(pool_output_size, hidden_size)
        self.params['b2'] = np.zeros(hidden_size)
        self.params['W3'] = weight_init_std * \
                            np.random.randn(hidden_size, output_size)
        self.params['b3'] = np.zeros(output_size)

        # Layer generation
        self.layers = OrderedDict()
        self.layers['Conv1'] = Convolution(self.params['W1'], self.params['b1'],
                                           conv_param['stride'], conv_param['pad'])
        self.layers['Relu1'] = Relu()
        self.layers['Pool1'] = Pooling(pool_h=2, pool_w=2, stride=2)
        self.layers['Affine1'] = Affine(self.params['W2'], self.params['b2'])
        self.layers['Relu2'] = Relu()
        self.layers['Affine2'] = Affine(self.params['W3'], self.params['b3'])

        self.last_layer = SoftmaxWithLoss()

    def predict(self, x):
        for layer in self.layers.values():
            x = layer.forward(x)

        return x

    def loss(self, x, t):
        y = self.predict(x)
        return self.last_layer.forward(y, t)

    def accuracy(self, x, t, batch_size=100):
        if t.ndim != 1 : t = np.argmax(t, axis=1)

        acc = 0.0

        for i in range(int(x.shape[0] / batch_size)):
            tx = x[i*batch_size:(i+1)*batch_size]
            tt = t[i*batch_size:(i+1)*batch_size]
            y = self.predict(tx)
            y = np.argmax(y, axis=1)
            acc += np.sum(y == tt)

        return acc / x.shape[0]
    def gradient(self, x, t):
        # forward
        self.loss(x, t)

        # backward
        dout = 1
        dout = self.last_layer.backward(dout)

        # Backpropagate through the layers in reverse order
        layers = list(self.layers.values())
        layers.reverse()
        for layer in layers:
            dout = layer.backward(dout)

        # Collect the gradients stored in each layer
        grads = {}
        grads['W1'], grads['b1'] = self.layers['Conv1'].dW, self.layers['Conv1'].db
        grads['W2'], grads['b2'] = self.layers['Affine1'].dW, self.layers['Affine1'].db
        grads['W3'], grads['b3'] = self.layers['Affine2'].dW, self.layers['Affine2'].db

        return grads
    def save_params(self, file_name="params.pkl"):
        params = {}
        for key, val in self.params.items():
            params[key] = val
        with open(file_name, 'wb') as f:
            pickle.dump(params, f)

    def load_params(self, file_name="params.pkl"):
        with open(file_name, 'rb') as f:
            params = pickle.load(f)
        for key, val in params.items():
            self.params[key] = val

        for i, key in enumerate(['Conv1', 'Affine1', 'Affine2']):
            self.layers[key].W = self.params['W' + str(i+1)]
            self.layers[key].b = self.params['b' + str(i+1)]
The only real difference is which layers are stacked; otherwise it is much the same as the MultiLayerNet class.
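To see what the stacked layers do to the data shape, here is a bookkeeping sketch of my own that only does the size arithmetic from __init__, using the default parameters and the 2x2 pooling with stride 2. It does not use any of the book's layer classes.

```python
# Default settings of SimpleConvNet as listed above
input_dim = (1, 28, 28)            # (channels, height, width)
filter_num, filter_size, pad, stride = 30, 5, 0, 1
hidden_size, output_size = 100, 10

# Conv1: 28x28 input, 5x5 filter, no padding, stride 1
conv_out = (input_dim[1] + 2 * pad - filter_size) // stride + 1
# Pool1: 2x2 window with stride 2 halves height and width
pool_out = conv_out // 2
# Affine1 input: every filter's pooled map, flattened
flat = filter_num * pool_out * pool_out

shapes = {
    'W1': (filter_num, input_dim[0], filter_size, filter_size),
    'W2': (flat, hidden_size),
    'W3': (hidden_size, output_size),
}
print(conv_out, pool_out, flat)  # 24 12 4320
for name, shape in shapes.items():
    print(name, shape)
# W1 (30, 1, 5, 5)
# W2 (4320, 100)
# W3 (100, 10)
```

So the data keeps its image-like shape through Conv1, Relu1 and Pool1, and is flattened only when it reaches Affine1.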
self.layers['Conv1'] = Convolution(self.params['W1'], self.params['b1'],
conv_param['stride'], conv_param['pad'])
The Convolution class is also defined in layers.py
class Convolution:
    def __init__(self, W, b, stride=1, pad=0):
        self.W = W
        self.b = b
        self.stride = stride
        self.pad = pad

        # Intermediate data (used during backward)
        self.x = None
        self.col = None
        self.col_W = None

        # Gradient of weight / bias parameters
        self.dW = None
        self.db = None

    def forward(self, x):
        FN, C, FH, FW = self.W.shape
        N, C, H, W = x.shape
        out_h = 1 + int((H + 2*self.pad - FH) / self.stride)
        out_w = 1 + int((W + 2*self.pad - FW) / self.stride)

        col = im2col(x, FH, FW, self.stride, self.pad)
        col_W = self.W.reshape(FN, -1).T

        out = np.dot(col, col_W) + self.b
        out = out.reshape(N, out_h, out_w, -1).transpose(0, 3, 1, 2)

        self.x = x
        self.col = col
        self.col_W = col_W

        return out

    def backward(self, dout):
        FN, C, FH, FW = self.W.shape
        dout = dout.transpose(0,2,3,1).reshape(-1, FN)

        self.db = np.sum(dout, axis=0)
        self.dW = np.dot(self.col.T, dout)
        self.dW = self.dW.transpose(1, 0).reshape(FN, C, FH, FW)

        dcol = np.dot(dout, self.col_W.T)
        dx = col2im(dcol, self.x.shape, FH, FW, self.stride, self.pad)

        return dx
im2col
The heart of this is the im2col function, which is defined in util.py.
def im2col(input_data, filter_h, filter_w, stride=1, pad=0):
    N, C, H, W = input_data.shape
    out_h = (H + 2*pad - filter_h)//stride + 1
    out_w = (W + 2*pad - filter_w)//stride + 1

    img = np.pad(input_data, [(0,0), (0,0), (pad, pad), (pad, pad)], 'constant')
    col = np.zeros((N, C, filter_h, filter_w, out_h, out_w))

    for y in range(filter_h):
        y_max = y + stride*out_h
        for x in range(filter_w):
            x_max = x + stride*out_w
            col[:, :, y, x, :, :] = img[:, :, y:y_max:stride, x:x_max:stride]

    col = col.transpose(0, 4, 5, 1, 2, 3).reshape(N*out_h*out_w, -1)
    return col
And this seems to be the cause of running out of memory. If you process too many rows of data at once, you get a MemoryError here.
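A rough estimate supports this. The sketch below is my own arithmetic, assuming float64 (the default dtype of np.zeros, 8 bytes per element), and counts only the bytes held by one copy of col; the copy made by the final reshape and the result of the dot product come on top of that.

```python
def im2col_bytes(N, C, H, W, FH, FW, stride=1, pad=0, itemsize=8):
    """Bytes needed for one copy of the col array built by im2col."""
    out_h = (H + 2 * pad - FH) // stride + 1
    out_w = (W + 2 * pad - FW) // stride + 1
    return N * out_h * out_w * C * FH * FW * itemsize


# All 10,000 test images at once versus a batch of 100
full = im2col_bytes(10000, 1, 28, 28, 5, 5)
batch = im2col_bytes(100, 1, 28, 28, 5, 5)
print(f"{full / 1e9:.2f} GB")   # 1.15 GB
print(f"{batch / 1e6:.2f} MB")  # 11.52 MB
```

Each 28x28 image (784 values) is expanded into 24 x 24 x 25 = 14,400 values, so a single copy of col for the whole test set already exceeds 1 GB, while a batch of 100 stays small.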
The first three lines read the size of the input data and calculate the output size from the input size and the filter size. The // in the division by stride appears to be there to truncate the fractional part when the division is not exact.
len(x_test) #The number of data
len(x_test[0]) #channel
len(x_test[0][0]) #height
len(x_test[0][0][0]) #Width
len(network.params['W1']) #Number of filters
len(network.params['W1'][0]) #Number of channels
len(network.params['W1'][0][0]) #Filter height
len(network.params['W1'][0][0][0]) #Filter width
conv_param = {'filter_num': 30, 'filter_size': 5, 'pad': 0, 'stride': 1},
Padding 0 and stride 1 are specified when the network object is created.
len(network.layers['Conv1'].forward(x_test)) #The number of data
len(network.layers['Conv1'].forward(x_test)[0]) #Number of filters
len(network.layers['Conv1'].forward(x_test)[0][0]) #Output height
len(network.layers['Conv1'].forward(x_test)[0][0][0]) #Output width
self.layers['Conv1'] = Convolution(self.params['W1'], self.params['b1'],
conv_param['stride'], conv_param['pad'])
class Convolution:
    def forward(self, x):
        FN, C, FH, FW = self.W.shape  # 30, 1, 5, 5
        N, C, H, W = x.shape  # 10000, 1, 28, 28
        out_h = 1 + int((H + 2*self.pad - FH) / self.stride)  # 24
        out_w = 1 + int((W + 2*self.pad - FW) / self.stride)  # 24
        col = im2col(x, FH, FW, self.stride, self.pad)

def im2col(input_data, filter_h, filter_w, stride=1, pad=0):
    N, C, H, W = input_data.shape  # 10000, 1, 28, 28
    out_h = (H + 2*pad - filter_h)//stride + 1  # 24
    out_w = (W + 2*pad - filter_w)//stride + 1  # 24
    img = np.pad(input_data, [(0,0), (0,0), (pad, pad), (pad, pad)], 'constant')
input_data has 4 dimensions (10000 data rows, 1 channel, height 28, width 28).
When pad = 0, the pad widths are [(0,0), (0,0), (0,0), (0,0)], so nothing is padded.
When pad = 1, they are [(0,0), (0,0), (1,1), (1,1)], which pads one element each on the top, bottom, left, and right of the height and width axes.
When pad = 2, they are [(0,0), (0,0), (2,2), (2,2)], which pads two elements each on the top, bottom, left, and right of the height and width axes.
In this example pad = 0, so img is the same as input_data.
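The behaviour of np.pad is easier to see on a tiny array. This sketch of my own uses a 2x2 "image" instead of MNIST data and the same pad-width list as im2col.

```python
import numpy as np

# A tiny batch: (N, C, H, W) = (1, 1, 2, 2)
x = np.arange(4).reshape(1, 1, 2, 2)

# Only the last two axes (height and width) grow; N and C are left alone.
for pad in (0, 1, 2):
    img = np.pad(x, [(0, 0), (0, 0), (pad, pad), (pad, pad)], 'constant')
    print(pad, img.shape)
# 0 (1, 1, 2, 2)
# 1 (1, 1, 4, 4)
# 2 (1, 1, 6, 6)

# With pad = 1 the original values sit in the middle of a border of zeros.
padded = np.pad(x, [(0, 0), (0, 0), (1, 1), (1, 1)], 'constant')
print(padded[0, 0])
# [[0 0 0 0]
#  [0 0 1 0]
#  [0 2 3 0]
#  [0 0 0 0]]
```

With 'constant' mode and no explicit value, the padding is filled with zeros.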
col = np.zeros((N, C, filter_h, filter_w, out_h, out_w)) #10000, 1, 5, 5, 24, 24
The input data (the images) is expanded into the array col. As a container for the expanded data, an array of shape (number of data, channels, filter height, filter width, output height, output width) is created first.
for y in range(filter_h):
    y_max = y + stride*out_h
    for x in range(filter_w):
        x_max = x + stride*out_w
        col[:, :, y, x, :, :] = img[:, :, y:y_max:stride, x:x_max:stride]
col = col.transpose(0, 4, 5, 1, 2, 3).reshape(N*out_h*out_w, -1)
return col
I could not picture at all what happens here, so I tested it with the following simplified array.
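import numpy as np

# One 8x8 single-channel image, 4x4 filter, stride 2
N, C = 1, 1
filter_h, filter_w = 4, 4
stride = 2
out_h = (8 - filter_h)//stride + 1  # 3
out_w = (8 - filter_w)//stride + 1  # 3

img = np.arange(64).reshape(N, C, 8, 8)
col = np.zeros((N, C, filter_h, filter_w, out_h, out_w))
for y in range(filter_h):
    y_max = y + stride*out_h
    for x in range(filter_w):
        x_max = x + stride*out_w
        col[:, :, y, x, :, :] = img[:, :, y:y_max:stride, x:x_max:stride]
col = col.transpose(0, 4, 5, 1, 2, 3).reshape(N*out_h*out_w, -1)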
array([[ 0.,  1.,  2.,  3.,  8.,  9., 10., 11., 16., 17., 18., 19., 24., 25., 26., 27.],
       [ 2.,  3.,  4.,  5., 10., 11., 12., 13., 18., 19., 20., 21., 26., 27., 28., 29.],
       [ 4.,  5.,  6.,  7., 12., 13., 14., 15., 20., 21., 22., 23., 28., 29., 30., 31.],
       [16., 17., 18., 19., 24., 25., 26., 27., 32., 33., 34., 35., 40., 41., 42., 43.],
       [18., 19., 20., 21., 26., 27., 28., 29., 34., 35., 36., 37., 42., 43., 44., 45.],
       [20., 21., 22., 23., 28., 29., 30., 31., 36., 37., 38., 39., 44., 45., 46., 47.],
       [32., 33., 34., 35., 40., 41., 42., 43., 48., 49., 50., 51., 56., 57., 58., 59.],
       [34., 35., 36., 37., 42., 43., 44., 45., 50., 51., 52., 53., 58., 59., 60., 61.],
       [36., 37., 38., 39., 44., 45., 46., 47., 52., 53., 54., 55., 60., 61., 62., 63.]])
col[0] holds the part of the input data that the filter is applied to first. col[1] is the part the filter covers after shifting two to the right, because the stride is 2. In the same way, the array lines up the 9 parts that the filter is applied to, one per row.
I am not sure exactly how the loop achieves this, but I can understand the result.
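One way to read the loop, as far as I can tell, is that each pass handles one position inside the filter, (y, x), and collects the pixel at that position for all 9 patches at once using a strided slice. The sketch below is my own and prints two of the 16 slices for the 8x8 test array.

```python
import numpy as np

# Same toy setting as above: one 8x8 image, 4x4 filter, stride 2, 3x3 output
img = np.arange(64).reshape(1, 1, 8, 8)
stride = 2
out_h = out_w = 3

# Filter position (0, 0): the top-left pixel of each of the 9 patches
y, x = 0, 0
top_left = img[0, 0, y:y + stride*out_h:stride, x:x + stride*out_w:stride]
print(top_left)
# [[ 0  2  4]
#  [16 18 20]
#  [32 34 36]]

# Filter position (1, 3): row 1, column 3 inside each patch
y, x = 1, 3
pos_1_3 = img[0, 0, y:y + stride*out_h:stride, x:x + stride*out_w:stride]
print(pos_1_3)
# [[11 13 15]
#  [27 29 31]
#  [43 45 47]]
```

These 3x3 grids are exactly column 0 and column 7 (= 1*4 + 3) of the result above, read down the 9 rows. The final transpose moves the output position to the front, so that each row ends up holding one complete patch.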
If you reshape the 4x4 filter into a single column and take the dot product with col, you get the result of applying the filter 9 times in one operation.
col = im2col(x, FH, FW, self.stride, self.pad)
col_W = self.W.reshape(FN, -1).T
out = np.dot(col, col_W) + self.b
out = out.reshape(N, out_h, out_w, -1).transpose(0, 3, 1, 2)
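To convince myself that this really is a convolution, the sketch below compares the im2col route with a deliberately slow loop that applies each filter patch by patch. The im2col function repeats the logic of the util.py listing above; the random test data, the sizes, and the naive loop are my own.

```python
import numpy as np


def im2col(input_data, filter_h, filter_w, stride=1, pad=0):
    # Same logic as the util.py listing above
    N, C, H, W = input_data.shape
    out_h = (H + 2*pad - filter_h)//stride + 1
    out_w = (W + 2*pad - filter_w)//stride + 1
    img = np.pad(input_data, [(0, 0), (0, 0), (pad, pad), (pad, pad)], 'constant')
    col = np.zeros((N, C, filter_h, filter_w, out_h, out_w))
    for y in range(filter_h):
        y_max = y + stride*out_h
        for x in range(filter_w):
            x_max = x + stride*out_w
            col[:, :, y, x, :, :] = img[:, :, y:y_max:stride, x:x_max:stride]
    return col.transpose(0, 4, 5, 1, 2, 3).reshape(N*out_h*out_w, -1)


# Arbitrary small test case: 2 images, 3 channels, 8x8; 5 filters of 4x4; stride 2
rng = np.random.default_rng(0)
N, C, H, W_in = 2, 3, 8, 8
FN, FH, FW, stride = 5, 4, 4, 2
x = rng.standard_normal((N, C, H, W_in))
W = rng.standard_normal((FN, C, FH, FW))
b = rng.standard_normal(FN)
out_h = (H - FH)//stride + 1
out_w = (W_in - FW)//stride + 1

# im2col route, as in Convolution.forward
col = im2col(x, FH, FW, stride, 0)
col_W = W.reshape(FN, -1).T
out = (np.dot(col, col_W) + b).reshape(N, out_h, out_w, -1).transpose(0, 3, 1, 2)

# Naive route: multiply each patch by each filter and sum
naive = np.zeros((N, FN, out_h, out_w))
for n in range(N):
    for f in range(FN):
        for i in range(out_h):
            for j in range(out_w):
                patch = x[n, :, i*stride:i*stride + FH, j*stride:j*stride + FW]
                naive[n, f, i, j] = np.sum(patch * W[f]) + b[f]

print(out.shape)                # (2, 5, 3, 3)
print(np.allclose(out, naive))  # True
```

The four nested loops are replaced by a single matrix product, at the cost of the much larger col array.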
