tensorflow == 2.2.0 keras == 2.3.1 (default versions on Google Colab as of 2020.6.10)

All the code is available on GitHub: https://github.com/milky1210/Segnet The code in this article is an excerpt, so download the repository if you want to actually run it.

Semantic segmentation is the problem of using deep learning to infer what each pixel of an image depicts. For this problem, the SegNet paper proposes a model that restores feature maps whose resolution was lowered by pooling and similar operations to their original dimensions while mapping them accurately onto object boundaries.

Like an ordinary FCN, SegNet lowers the resolution with convolution and pooling layers and then performs upsampling. When raising the resolution, however, it uses a technique called pooling indices to keep boundaries from blurring. Here, the encoder and decoder follow the layout of the VGG16 model (a model well known for image classification).

Pooling indices work as shown in the figure: the position of each maximum is remembered when max pooling is performed, and during upsampling each feature-map value is placed back at that position.
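
The idea of pooling indices can be sketched without any framework. The following is a minimal pure-Python illustration on a single 4x4 feature map, not the layer used later in this article; the function names `max_pool_with_indices` and `max_unpool` are made up for this example, and it assumes the height and width are even.

```python
def max_pool_with_indices(fmap):
    """2x2 max pooling (stride 2) on a 2-D list; also returns the (row, col) of each maximum."""
    pooled, indices = [], []
    for r in range(0, len(fmap), 2):
        pooled_row, index_row = [], []
        for c in range(0, len(fmap[0]), 2):
            # Candidate positions inside the current 2x2 window.
            window = [(r + dr, c + dc) for dr in (0, 1) for dc in (0, 1)]
            best = max(window, key=lambda pos: fmap[pos[0]][pos[1]])
            pooled_row.append(fmap[best[0]][best[1]])
            index_row.append(best)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool(pooled, indices, height, width):
    """Put each pooled value back at its remembered position; every other cell stays 0."""
    out = [[0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 1],
]
pooled, indices = max_pool_with_indices(feature_map)
restored = max_unpool(pooled, indices, 4, 4)
print(pooled)    # [[4, 5], [6, 7]]
print(restored)  # [[0, 0, 0, 0], [4, 0, 0, 5], [0, 0, 7, 0], [6, 0, 0, 0]]
```

Plain upsampling would copy each pooled value into all four cells of its window, whereas unpooling with indices puts it back only where the maximum originally was, which is what keeps boundary locations sharp.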

The dataset used here is PASCAL VOC 2012. It supports tasks such as image recognition, object detection, and segmentation, and it is also used for performance evaluation in the SegNet paper. You can download it from here.

Once downloaded, VOCdevkit/VOC2012/ contains JPEGImages/ and SegmentationObject/. Training and validation use the images in JPEGImages/ as inputs and the images in SegmentationObject/ as target outputs.

Each JPEGImages/~.jpg corresponds to the SegmentationObject/~.png with the same file name. Pixels are classified into 22 classes, including background and boundary.
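
As a rough sketch of how the two directories can be matched up, the snippet below pairs each mask with the image that shares its file stem. The helper name `pair_voc_files` is hypothetical and is not taken from the repository; it assumes `voc_root` points at VOCdevkit/VOC2012/.

```python
from pathlib import Path


def pair_voc_files(voc_root):
    """Return sorted (image_path, mask_path) pairs that share a file stem."""
    image_dir = Path(voc_root) / "JPEGImages"
    mask_dir = Path(voc_root) / "SegmentationObject"
    pairs = []
    # Iterate over the masks, because not every JPEG has a segmentation mask.
    for mask_path in sorted(mask_dir.glob("*.png")):
        image_path = image_dir / (mask_path.stem + ".jpg")
        if image_path.exists():
            pairs.append((image_path, mask_path))
    return pairs
```

Calling `pair_voc_files("VOCdevkit/VOC2012")` would then give the list of input and target files to load and resize.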

This article covers only the model definition, the loss function definition, and training. Training and validation are performed at a resolution of 64x64.

First, as a baseline for comparison, we build an encoder-decoder without pooling indices, modeled on VGG16, as follows.

```
# Assumed imports (the excerpt omits them); tensorflow.keras offers the same API.
from keras import layers, models

def build_FCN():
    ffc = 32  # number of filters in the first block
    inputs = layers.Input(shape=(64,64,3))
    x = inputs  # start the chain from the input so both convolutions are applied
    # Encoder: four conv blocks, each followed by 2x2 max pooling (64 -> 4)
    for i in range(2):
        x = layers.Conv2D(ffc,kernel_size=3,padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    x = layers.MaxPooling2D((2,2))(x)
    for i in range(2):
        x = layers.Conv2D(ffc*2,kernel_size=3,padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    x = layers.MaxPooling2D((2,2))(x)
    for i in range(3):
        x = layers.Conv2D(ffc*4,kernel_size=3,padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    x = layers.MaxPooling2D((2,2))(x)
    for i in range(3):
        x = layers.Conv2D(ffc*8,kernel_size=3,padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    x = layers.MaxPooling2D((2,2))(x)
    # Decoder: conv blocks, each followed by 2x2 upsampling (4 -> 64)
    for i in range(3):
        x = layers.Conv2D(ffc*8,kernel_size=3,padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    x = layers.UpSampling2D((2,2))(x)
    for i in range(3):
        x = layers.Conv2D(ffc*4,kernel_size=3,padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    x = layers.UpSampling2D((2,2))(x)
    for i in range(3):
        x = layers.Conv2D(ffc*2,kernel_size=3,padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    x = layers.UpSampling2D((2,2))(x)
    for i in range(2):
        x = layers.Conv2D(ffc*2,kernel_size=3,padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    x = layers.UpSampling2D((2,2))(x)
    for i in range(2):
        x = layers.Conv2D(ffc,kernel_size=3,padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    # Per-pixel softmax over the 22 classes
    x = layers.Conv2D(22,kernel_size=3,padding="same",activation="softmax")(x)
    return models.Model(inputs,x)
```

Modeled on VGG16, the network has this structure, with 24 convolution layers in total. Note that MaxPooling2D is used to shrink the feature maps and UpSampling2D to enlarge them.

Next, let's look at how SegNet differs from this model. Before each max pooling step, SegNet keeps the argmax information for that layer, that is, the position of each maximum. Keras has no layer for this, so we use TensorFlow's tf.nn.max_pool_with_argmax and wrap it in a custom Keras layer. Defining it as below gives a layer that runs in Keras.

```
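# Assumed imports (the excerpt omits them).
import tensorflow as tf
from keras import backend as K
from keras.layers import Layer

class MaxPoolingWithArgmax2D(Layer):
    """2x2 max pooling that also returns the flattened argmax indices."""
    def __init__(self):
        super(MaxPoolingWithArgmax2D,self).__init__()
    def call(self,inputs):
        output,argmax = tf.nn.max_pool_with_argmax(inputs,ksize=[1,2,2,1],strides=[1,2,2,1],padding='SAME')
        # Cast the integer indices to the Keras float type
        argmax = K.cast(argmax,K.floatx())
        return [output,argmax]
    def compute_output_shape(self,input_shape):
        # Height and width are halved; batch and channel dimensions are unchanged
        ratio = (1,2,2,1)
        output_shape = [dim//ratio[idx] if dim is not None else None for idx, dim in enumerate(input_shape)]
        output_shape = tuple(output_shape)
        return [output_shape,output_shape]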
```

Next, define a layer that, during upsampling, puts each value back at the location recorded by the argmax (this one is fairly long).

```
class MaxUnpooling2D(Layer):
    def __init__(self):
        super(MaxUnpooling2D,self).__init__()
    def call(self,inputs,output_shape = None):
        # inputs = [pooled feature map, argmax indices from the matching pooling layer]
        updates, mask = inputs[0],inputs[1]
        # tf.variable_scope is not available at the top level in TF 2.x, so use the compat.v1 alias
        with tf.compat.v1.variable_scope(self.name):
            mask = K.cast(mask, 'int32')
            input_shape = tf.shape(updates, out_type='int32')
            # calculate the new shape (height and width doubled)
            if output_shape is None:
                output_shape = (input_shape[0],input_shape[1]*2,input_shape[2]*2,input_shape[3])
            self.output_shape1 = output_shape
            # calculate indices for batch, height, width and feature maps
            one_like_mask = K.ones_like(mask, dtype='int32')
            batch_shape = K.concatenate([[input_shape[0]], [1], [1], [1]],axis=0)
            batch_range = K.reshape(tf.range(output_shape[0], dtype='int32'),shape=batch_shape)
            b = one_like_mask * batch_range
            y = mask // (output_shape[2] * output_shape[3])
            x = (mask // output_shape[3]) % output_shape[2]
            feature_range = tf.range(output_shape[3], dtype='int32')
            f = one_like_mask * feature_range
            # transpose indices & reshape update values to one dimension
            updates_size = tf.size(updates)
            indices = K.transpose(K.reshape(
                K.stack([b, y, x, f]),
                [4, updates_size]))
            values = K.reshape(updates, [updates_size])
            # scatter each value to its (b, y, x, f) position; all other cells are 0
            ret = tf.scatter_nd(indices, values, output_shape)
            return ret
    def compute_output_shape(self,input_shape):
        shape = input_shape[1]
        return (shape[0],shape[1]*2,shape[2]*2,shape[3])
```
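
The index arithmetic in the unpooling layer relies on how `tf.nn.max_pool_with_argmax` flattens positions: with its default settings, a maximum at row `y`, column `x`, channel `c` gets the index `(y * width + x) * channels + c`. The following pure-Python sketch inverts that formula for a single index; `decode_flat_index` is a hypothetical helper for illustration and is not part of the article's code.

```python
def decode_flat_index(flat_index, width, channels):
    """Invert flat_index = (y * width + x) * channels + c."""
    y = flat_index // (width * channels)
    x = (flat_index // channels) % width
    c = flat_index % channels
    return y, x, c


# A 64-wide feature map with 32 channels, as fed to the first pooling layer.
width, channels = 64, 32
flat = (5 * width + 9) * channels + 3
print(decode_flat_index(flat, width, channels))  # (5, 9, 3)
```

The `y` and `x` expressions match the ones in the layer. For the channel, the layer uses a `tf.range` over the channels instead of the modulo, which gives the same result because max pooling never moves a value to a different channel.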

Defining SegNet with these layers gives the following.

```
def build_Segnet():
    ffc = 32  # number of filters in the first block
    inputs = layers.Input(shape=(64,64,3))
    x = inputs  # start the chain from the input so both convolutions are applied
    # Encoder: each pooling step also returns its argmax indices (x1..x4)
    for i in range(2):
        x = layers.Conv2D(ffc,kernel_size=3,padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    x,x1 = MaxPoolingWithArgmax2D()(x)
    for i in range(2):
        x = layers.Conv2D(ffc*2,kernel_size=3,padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    x,x2 = MaxPoolingWithArgmax2D()(x)
    for i in range(3):
        x = layers.Conv2D(ffc*4,kernel_size=3,padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    x,x3 = MaxPoolingWithArgmax2D()(x)
    for i in range(3):
        x = layers.Conv2D(ffc*8,kernel_size=3,padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    x,x4 = MaxPoolingWithArgmax2D()(x)
    # Decoder: unpool with the indices of the matching encoder stage (x4..x1)
    for i in range(3):
        x = layers.Conv2D(ffc*8,kernel_size=3,padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    x = layers.Dropout(rate = 0.5)(x)
    x = MaxUnpooling2D()([x,x4])
    for i in range(3):
        x = layers.Conv2D(ffc*4,kernel_size=3,padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    x = MaxUnpooling2D()([x,x3])
    for i in range(3):
        x = layers.Conv2D(ffc*2,kernel_size=3,padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    x = MaxUnpooling2D()([x,x2])
    for i in range(2):
        x = layers.Conv2D(ffc,kernel_size=3,padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    x = MaxUnpooling2D()([x,x1])
    for i in range(2):
        x = layers.Conv2D(ffc,kernel_size=3,padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    # Per-pixel softmax over the 22 classes
    x = layers.Conv2D(22,kernel_size=3,padding="same",activation="softmax")(x)
    return models.Model(inputs,x)
```

The loss function is the per-pixel cross entropy. Adam (lr = 0.001, beta_1 = 0.9, beta_2 = 0.999) was used for optimization.
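
To make the per-pixel cross entropy concrete, here is a minimal pure-Python sketch that averages the negative log-probability of the true class over all pixels. It is an illustration under assumptions, not the loss code from the repository: the function name `pixelwise_cross_entropy`, the integer label format, and the `1e-7` clipping floor are all choices made for this example.

```python
import math


def pixelwise_cross_entropy(probs, labels):
    """Mean of -log p[true class] over all pixels.

    probs:  H x W x C nested lists of softmax outputs
    labels: H x W nested lists of integer class ids
    """
    total, count = 0.0, 0
    for prob_row, label_row in zip(probs, labels):
        for pixel_probs, label in zip(prob_row, label_row):
            # Clip to avoid log(0); the floor value is an illustrative choice.
            total -= math.log(max(pixel_probs[label], 1e-7))
            count += 1
    return total / count


# A 1x2 "image" with 3 classes.
probs = [[[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]]]
labels = [[0, 1]]
loss = pixelwise_cross_entropy(probs, labels)  # (-ln 0.5 - ln 0.8) / 2
```

In Keras, a softmax output of shape (64, 64, 22) combined with a categorical cross entropy loss computes the same quantity per pixel and then averages it.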

We checked how much the results change with and without pooling indices. The training loss and the mean per-pixel accuracy were plotted. First, the results for the model without pooling indices:

On the validation data, the accuracy was about 78%. Next, the results for SegNet:

The accuracy stabilized at about 82%, and the behavior matched what the paper describes.
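
The accuracy figures above are the fraction of pixels whose predicted class matches the label. A minimal sketch of that metric, assuming the predictions have already been reduced to one class id per pixel (the name `pixel_accuracy` is hypothetical):

```python
def pixel_accuracy(pred_labels, true_labels):
    """Fraction of pixels whose predicted class id equals the ground-truth id."""
    correct = total = 0
    for pred_row, true_row in zip(pred_labels, true_labels):
        for pred, true in zip(pred_row, true_row):
            correct += int(pred == true)
            total += 1
    return correct / total


pred = [[0, 1], [2, 2]]
true = [[0, 1], [1, 2]]
print(pixel_accuracy(pred, true))  # 0.75
```

Because background pixels usually dominate, this metric can look high even when small objects are missed, so it is best read alongside the qualitative results.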

From left to right: input, model without pooling indices, SegNet, and ground truth (GT). All examples are from the test data.

In this experiment, keeping the pooling indices improved accuracy noticeably, from about 78% to about 82%.
