Consider the 2D convolution operation: input $(Batch, H, W, C_{in})$, output $(Batch, H, W, C_{out})$, kernel size $(3,3)$, convolution weight $W = (3, 3, C_{in}, C_{out})$.
In effect, the Conv2D operation does the following. For an input $(Batch, H, W, C_{in})$,
```python
# x gathers the 3x3 neighborhood of every output position into a second axis,
# so x has shape (Batch, 9, H-2, W-2, C_in) (channels-last, stride 1, no padding)
x[:,0] = input[:, 0:H-2, 0:W-2, :]
x[:,1] = input[:, 0:H-2, 1:W-1, :]
x[:,2] = input[:, 0:H-2, 2:W-0, :]
x[:,3] = input[:, 1:H-1, 0:W-2, :]
x[:,4] = input[:, 1:H-1, 1:W-1, :]
x[:,5] = input[:, 1:H-1, 2:W-0, :]
x[:,6] = input[:, 2:H-0, 0:W-2, :]
x[:,7] = input[:, 2:H-0, 1:W-1, :]
x[:,8] = input[:, 2:H-0, 2:W-0, :]
```
Slices of size $(H-2, W-2)$ are extracted from the $(H, W)$ image as above, and the result is rearranged into a matrix of shape $(Batch, (H-2)(W-2), 9C_{in})$ before the matrix multiplication. This transformation is called **im2col**. It can be thought of as multiplying the number of input channels by the number of kernel elements (here $3 \times 3 = 9$). Note that the im2col step itself has no weights.
```python
import numpy as np

def im2col(input_data, filter_h, filter_w, stride=1, pad=0):
    # input_data: (N, C, H, W) -> col: (N*out_h*out_w, C*filter_h*filter_w)
    N, C, H, W = input_data.shape
    out_h = (H + 2*pad - filter_h)//stride + 1
    out_w = (W + 2*pad - filter_w)//stride + 1

    img = np.pad(input_data, [(0,0), (0,0), (pad, pad), (pad, pad)], 'constant')
    col = np.zeros((N, C, filter_h, filter_w, out_h, out_w))

    for y in range(filter_h):
        y_max = y + stride*out_h
        for x in range(filter_w):
            x_max = x + stride*out_w
            col[:, :, y, x, :, :] = img[:, :, y:y_max:stride, x:x_max:stride]

    col = col.transpose(0, 4, 5, 1, 2, 3).reshape(N*out_h*out_w, -1)
    return col
```
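As a check, the im2col output can be multiplied by the reshaped kernel weights to complete a convolution. Below is a minimal NumPy sketch (my addition): it assumes the im2col function above, channels-first layout, stride 1, no padding, and purely illustrative shapes.

```python
import numpy as np

N, C_in, H, W = 2, 3, 8, 8          # illustrative shapes
C_out, kh, kw = 4, 3, 3

x = np.random.rand(N, C_in, H, W)
w = np.random.rand(C_out, C_in, kh, kw)

col = im2col(x, kh, kw)             # (N*out_h*out_w, C_in*kh*kw) = (72, 27)
w_mat = w.reshape(C_out, -1)        # (C_out, C_in*kh*kw) = (4, 27)
out = col @ w_mat.T                 # (72, 4)

out_h, out_w = H - kh + 1, W - kw + 1
out = out.reshape(N, out_h, out_w, C_out).transpose(0, 3, 1, 2)
print(out.shape)                    # (2, 4, 6, 6)
```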
In PyTorch, the im2col operation is provided as the Unfold function. Therefore, it should hold that **Conv2D = (im2col + matmul) = (Unfold + matmul)**. I checked whether this, the main subject of this post, really holds.
PyTorch uses a channels-first layout: input $(Batch, C_{in}, H, W) = (25, 3, 32, 32)$, output $(Batch, C_{out}, H, W) = (25, 16, 30, 30)$, kernel size $(3,3)$, and weight $W = (C_{out}, 3 \times 3 \times C_{in}) = (16, 27)$.
```python
import numpy as np
import torch
input = torch.tensor(np.random.rand(25,3,32,32)).float()
weight = torch.tensor(np.random.rand(16,3,3,3)).float()
weight2 = weight.reshape((16,27))
print('input.shape= ', input.shape)
print('weight.shape= ', weight.shape)
print('weight2.shape=', weight2.shape)
# Unfold (im2col): x has shape (25, 3*3*3, 30*30) = (25, 27, 900)
x = torch.nn.Unfold(kernel_size=(3,3), stride=(1,1), padding=(0,0), dilation=(1,1))(input)
# (16, 27) @ (25, 27, 900) -> (25, 16, 900), then reshape to (25, 16, 30, 30)
output1 = torch.matmul(weight2, x).reshape((25,16,30,30))
print('x.shape= ', x.shape)
print('output1.shape=', output1.shape)
-----------------------------------------------------------
input.shape= torch.Size([25, 3, 32, 32])
weight.shape= torch.Size([16, 3, 3, 3])
weight2.shape= torch.Size([16, 27])
x.shape= torch.Size([25, 27, 900])
output1.shape= torch.Size([25, 16, 30, 30])
```
Here, applying the Unfold function to the input gives $x = (25, 3 \times 3 \times 3, 30 \times 30) = (25, 27, 900)$, and with $W = (16, 27)$ we get $\mathrm{matmul}(W, x) = (25, 16, 30 \times 30)$.
On the other hand, the code below produces the output when the input is $(Batch, C_{in}, H, W) = (25, 3, 32, 32)$ and the weight of the Conv2d function is $W = (16, 3, 3, 3)$.
```python
conv1 = torch.nn.Conv2d(3, 16, kernel_size=3, bias=False)
conv1.weight.data = weight  # use the same weights as in the Unfold + matmul version
output2 = conv1(input)
print('conv1.weight.shape=', conv1.weight.shape)
print('output2.shape= ', output2.shape)
-----------------------------------------------------------
conv1.weight.shape= torch.Size([16, 3, 3, 3])
output2.shape= torch.Size([25, 16, 30, 30])
```
Comparing **output1** obtained with (Unfold + matmul) and **output2** obtained with Conv2D, the values agree (up to float32 rounding in the last printed digit). This confirms that **Conv2D = (Unfold + matmul)** computationally.
```python
output1:
tensor([[[[7.4075, 7.1269, 6.2595, ..., 6.9860, 6.5256, 7.3597],
[6.4978, 7.3303, 6.7621, ..., 7.2054, 6.9357, 7.3798],
[5.9309, 5.5016, 6.3321, ..., 5.7143, 7.0358, 6.8819],
...,
[6.0168, 6.9415, 7.5508, ..., 5.4547, 4.7888, 6.0636],
[5.0191, 7.0944, 7.0875, ..., 3.9413, 4.1925, 5.5689],
[6.2448, 6.4813, 5.5424, ..., 4.2610, 5.8013, 5.3431]],
......
output2:
tensor([[[[7.4075, 7.1269, 6.2595, ..., 6.9860, 6.5256, 7.3597],
[6.4979, 7.3303, 6.7621, ..., 7.2054, 6.9357, 7.3798],
[5.9309, 5.5016, 6.3321, ..., 5.7143, 7.0358, 6.8819],
...,
[6.0168, 6.9415, 7.5508, ..., 5.4547, 4.7888, 6.0636],
[5.0191, 7.0944, 7.0874, ..., 3.9413, 4.1925, 5.5689],
[6.2448, 6.4813, 5.5424, ..., 4.2610, 5.8013, 5.3431]],
......
```
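As an extra numerical check (my addition), the two tensors can be compared directly; exact bit equality is not guaranteed because matmul and Conv2d may accumulate the float32 sums in a different order.

```python
# Small float32 rounding differences between the two paths are expected.
print('max abs diff:', (output1 - output2).abs().max().item())
print('allclose    :', torch.allclose(output1, output2, atol=1e-4))
```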
When kernel_size and stride are equal, Unfold corresponds to the patch splitting of Vision Transformer. That said, patch splitting can also be done with reshape and transpose, without using Unfold (a sketch follows after the next block)...
```python
input = torch.tensor(np.random.rand(25,3,224,224)).float()
x = torch.nn.Unfold(kernel_size=(14,14), stride=(14,14), padding=(0,0), dilation=(1,1))(input)
-----------------------------------------------------------
input.shape= torch.Size([25, 3, 224, 224])
x.shape= torch.Size([25, 588, 256]) #(25,3*14*14,16*16)
```
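Here is a minimal sketch (my addition) of the reshape + transpose alternative mentioned above, arranged so that the element ordering matches Unfold's channel-first layout; the intermediate shapes are annotated as assumptions for this 224x224, 14x14-patch example.

```python
# Patch splitting via reshape + permute, matching the Unfold output above
# (kernel_size == stride == 14, no padding).
N, C, H, W = input.shape
p = 14
x2 = input.reshape(N, C, H//p, p, W//p, p)   # (25, 3, 16, 14, 16, 14)
x2 = x2.permute(0, 1, 3, 5, 2, 4)            # (25, 3, 14, 14, 16, 16)
x2 = x2.reshape(N, C*p*p, (H//p)*(W//p))     # (25, 588, 256)
print(torch.equal(x, x2))                    # expected: True
```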
As for the claim that Vision Transformer does not use Conv2D at all: since matmul also appears in the computation of the attention weights and the values, I entertained the unfounded notion that Unfold + matmul might amount to Conv2D inside ViT as well.
The Unfold function is PyTorch's im2col, and **Conv2D = (Unfold + matmul)**. In TensorFlow, the corresponding operation is the extract_image_patches function.
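For reference, a hedged TensorFlow sketch (in TF2 this operation is exposed as tf.image.extract_patches), using a channels-last input; note that each patch is flattened with channels innermost, which differs from Unfold's channel-first ordering.

```python
import numpy as np
import tensorflow as tf

images = tf.constant(np.random.rand(25, 224, 224, 3), dtype=tf.float32)
patches = tf.image.extract_patches(images,
                                   sizes=[1, 14, 14, 1],
                                   strides=[1, 14, 14, 1],
                                   rates=[1, 1, 1, 1],
                                   padding='VALID')
print(patches.shape)  # (25, 16, 16, 588) = (25, 16, 16, 14*14*3)
```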