This series summarizes semantic segmentation using the Microsoft Cognitive Toolkit (CNTK).
In Part 1, we pre-train the CNN that will serve as the backbone for semantic segmentation, using the 1,000 categories of ImageNet images.
The topics are covered in the following order.
ImageNet [1] is a large-scale image database with more than 14 million registered images. Until 2017, it was used in the image recognition competition ILSVRC.
This time, we collected training data for the 1,000 categories by downloading the images from the URLs managed by ImageNet. However, since none of the images for the 850th category, teddy / teddy bear, could be downloaded, the images prepared in Computer Vision : Image Classification Part1 - Understanding COCO dataset were used as a substitute.
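For the download step itself, here is a minimal sketch of the kind of per-image helper that could be used with the requests library listed in the environment below; the function name and error handling are illustrative, not the actual script:

```python
import requests

# Hypothetical helper: fetch one image from its ImageNet-managed URL and save
# it under the synset's directory (e.g. RTSS/ImageNet/n01440764/).
def download_image(url, save_path, timeout=10):
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.RequestException:
        return False  # dead links are common, so failures are simply skipped
    with open(save_path, "wb") as f:
        f.write(response.content)
    return True
```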
The downloaded images also contained quite a few corrupted JPEG files and images unrelated to their categories, so they were cleaned both automatically and manually. The final collection came to 775,983 images.
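The automatic part of that cleaning can be as simple as attempting to decode every file. A minimal sketch, assuming OpenCV is used for the check (images unrelated to their category still have to be removed by hand):

```python
import os
import cv2

def remove_undecodable(directory):
    # Delete files that OpenCV cannot decode as images (e.g. corrupted JPEGs).
    removed = 0
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if cv2.imread(path) is None:  # decode failed: corrupted or not an image
            os.remove(path)
            removed += 1
    return removed
```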
The directory structure for this article is as follows.

```
COCO
MNIST
NICS
RTSS
 |―ImageNet
 |  |―n01440764
 |  |  |―n01440764_0.jpg
 |  |  |―…
 rtss_imagenet.py
 rtss_vovnet57.py
SSMD
```
VoVNet : One-Shot Aggregation module

This time, we adopted VoVNet [2] (Variety of View Network) as the convolutional neural network model. VoVNet is a CNN model that uses less memory and has a lower computational cost than DenseNet [3].
One-Shot Aggregation module

VoVNet uses the One-Shot Aggregation (OSA) module, which is boxed in the figure below.
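As a concrete illustration, here is a minimal CNTK sketch of an OSA module. The helper names conv_bn_mish and osa_module are hypothetical, and the block already bakes in choices described later in this article (Batch Normalization without bias before the activation, Mish, He initialization). This sketch also aggregates the module input together with the convolution outputs:

```python
import cntk as C

def mish(x):
    # Mish(x) = x * tanh(softplus(x)); see the activation section below
    return x * C.tanh(C.softplus(x))

def conv_bn_mish(x, num_filters, filter_size, strides=1):
    # convolution without bias, then Batch Normalization, then Mish
    h = C.layers.Convolution2D((filter_size, filter_size), num_filters,
                               strides=strides, pad=True, bias=False,
                               init=C.initializer.he_normal())(x)
    h = C.layers.BatchNormalization(map_rank=1)(h)
    return mish(h)

def osa_module(x, num_conv_filters, num_out_filters, num_convs=5):
    # One-Shot Aggregation: run the 3x3 convolutions sequentially, keep every
    # intermediate output, concatenate everything once along the channel axis,
    # then fuse the concatenation with a 1x1 convolution.
    features = [x]
    h = x
    for _ in range(num_convs):
        h = conv_bn_mish(h, num_conv_filters, 3)
        features.append(h)
    h = C.splice(*features, axis=0)  # axis 0 is the channel axis in CNTK
    return conv_bn_mish(h, num_out_filters, 1)
```

Because the intermediate features are concatenated only once at the end, the number of input channels per convolution stays constant, which is where the memory and compute savings over DenseNet's dense connections come from.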
VoVNet57

The network configuration of VoVNet57 is as follows.
Layer | Filters | Size/Stride | Input | Output |
---|---|---|---|---|
Convolution2D | 64 | 3x3/2 | 3x224x224 | 64x112x112 |
Convolution2D | 64 | 3x3/1 | 64x112x112 | 64x112x112 |
Convolution2D | 128 | 3x3/1 | 64x112x112 | 128x112x112 |
MaxPooling2D | | 3x3/2 | 128x112x112 | 128x56x56 |
OSA module | 128, 256 | 3x3/1, 1x1/1 | 128x56x56 | 256x56x56 |
MaxPooling2D | | 3x3/2 | 256x56x56 | 256x28x28 |
OSA module | 160, 512 | 3x3/1, 1x1/1 | 256x28x28 | 512x28x28 |
MaxPooling2D | | 3x3/2 | 512x28x28 | 512x14x14 |
OSA module | 192, 768 | 3x3/1, 1x1/1 | 512x14x14 | 768x14x14 |
OSA module | 192, 768 | 3x3/1, 1x1/1 | 768x14x14 | 768x14x14 |
OSA module | 192, 768 | 3x3/1, 1x1/1 | 768x14x14 | 768x14x14 |
OSA module | 192, 768 | 3x3/1, 1x1/1 | 768x14x14 | 768x14x14 |
MaxPooling2D | | 3x3/2 | 768x14x14 | 768x7x7 |
OSA module | 224, 1024 | 3x3/1, 1x1/1 | 768x7x7 | 1024x7x7 |
OSA module | 224, 1024 | 3x3/1, 1x1/1 | 1024x7x7 | 1024x7x7 |
OSA module | 224, 1024 | 3x3/1, 1x1/1 | 1024x7x7 | 1024x7x7 |
GlobalAveragePooling | | global | 1024x7x7 | 1024x1x1 |
Dense | 1000 | | 1024x1x1 | 1000x1x1 |
Softmax | | | 1000 | 1000 |
The network consists of 57 convolution layers in total and performs 32x downsampling. The total number of parameters is 31,429,159.
Each convolution layer applies Batch Normalization [4], without a bias term, before the input to the activation function.
The final fully connected layer uses a bias term instead of Batch Normalization.
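Putting the table together, here is a minimal sketch of how VoVNet57 might be assembled in CNTK, reusing the hypothetical conv_bn_mish and osa_module helpers from the OSA sketch above:

```python
import cntk as C

def vovnet57(x):
    # Stem: three 3x3 convolutions, the first with stride 2
    h = conv_bn_mish(x, 64, 3, strides=2)
    h = conv_bn_mish(h, 64, 3)
    h = conv_bn_mish(h, 128, 3)

    # Stage 2: one OSA module
    h = C.layers.MaxPooling((3, 3), strides=2, pad=True)(h)
    h = osa_module(h, 128, 256)

    # Stage 3: one OSA module
    h = C.layers.MaxPooling((3, 3), strides=2, pad=True)(h)
    h = osa_module(h, 160, 512)

    # Stage 4: four OSA modules
    h = C.layers.MaxPooling((3, 3), strides=2, pad=True)(h)
    for _ in range(4):
        h = osa_module(h, 192, 768)

    # Stage 5: three OSA modules
    h = C.layers.MaxPooling((3, 3), strides=2, pad=True)(h)
    for _ in range(3):
        h = osa_module(h, 224, 1024)

    h = C.layers.GlobalAveragePooling()(h)
    # bias term, no Batch Normalization; softmax is folded into the loss later
    return C.layers.Dense(1000, init=C.initializer.he_normal())(h)
```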
We adopted Mish [5] as the activation function. Mish has been reported to outperform both ReLU and Swish [6]. It can be implemented easily by combining the softplus and tanh functions, as the following formula shows.
$$\text{Mish}(x) = x \cdot \tanh\left(\log(1 + e^{x})\right)$$
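As a quick sanity check of the formula, here is a throwaway NumPy snippet (not part of the training code):

```python
import numpy as np

def mish(x):
    # Mish(x) = x * tanh(log(1 + exp(x))) = x * tanh(softplus(x))
    return x * np.tanh(np.log1p(np.exp(x)))

print(mish(np.array([-2.0, 0.0, 2.0])))  # approx. [-0.2525  0.      1.944 ]
```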
The shape of Mish is shown in the figure below.
Mish avoids ReLU's dying-neuron problem, and while the derivative of ReLU is discontinuous, Mish remains continuous no matter how many times it is differentiated, which makes the loss surface smoother and easier to optimize.
The input images are divided by 255, the maximum brightness value.
The parameters of each layer are initialized with He's normal distribution [7].
The loss function is cross-entropy error, and the optimization algorithm is Stochastic Gradient Descent (SGD) with momentum. The momentum was fixed at 0.9.
The Cyclical Learning Rate (CLR) [8] schedule is used for the learning rate, with the maximum learning rate set to 0.1, the base learning rate to 1e-4, the step size to 10 times the number of epochs, and the policy to triangular2.
As a countermeasure against overfitting, the L2 regularization weight was set to 0.0005.
Model training ran for 100 epochs with a mini-batch size of 64.
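A minimal sketch of how this setup might be wired up in CNTK. Here vovnet57 is the hypothetical builder sketched above, the step_size value is a placeholder assumption, and triangular2 is implemented by hand since CLR is not built into CNTK:

```python
import numpy as np
import cntk as C

input_var = C.input_variable((3, 224, 224))
label_var = C.input_variable(1000)

# divide by the maximum brightness value, then run the backbone
model = vovnet57(input_var / 255.0)

loss = C.cross_entropy_with_softmax(model, label_var)
error = C.classification_error(model, label_var)

def clr_triangular2(iteration, base_lr=1e-4, max_lr=0.1, step_size=1000):
    # triangular2 policy: the triangle's amplitude is halved every cycle
    # (step_size=1000 is an assumed placeholder, not the article's value)
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x) / (2.0 ** (cycle - 1))

learner = C.momentum_sgd(model.parameters,
                         lr=C.learning_parameter_schedule(clr_triangular2(0)),
                         momentum=C.momentum_schedule(0.9),
                         l2_regularization_weight=0.0005)
trainer = C.Trainer(model, (loss, error), [learner])

# inside the training loop the learning rate is refreshed every iteration:
# learner.reset_learning_rate(C.learning_parameter_schedule(clr_triangular2(i)))
```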
・CPU: Intel(R) Core(TM) i7-5820K 3.30GHz
・GPU: NVIDIA Quadro RTX 5000 16GB

・Windows 10 Pro 1909
・CUDA 10.0
・cuDNN 7.6
・Python 3.6.6
・cntk-gpu 2.7
・cntkx 0.1.50
・numpy 1.17.3
・opencv-contrib-python 4.1.1.26
・pandas 0.25.0
・requests 2.22.0
The program that downloads the images from ImageNet and the training program are available on GitHub.
rtss_imagenet.py
rtss_vovnet57.py
The figure below visualizes the logs of the loss function and the error rate during training. The graph on the left shows the loss function and the graph on the right shows the error rate; the horizontal axis is the number of epochs, and the vertical axes are the value of the loss function and the error rate, respectively.
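Such a figure can be reproduced from saved logs with a few lines of pandas and matplotlib (matplotlib is not in the environment list above, and the CSV name and columns are assumptions):

```python
import matplotlib.pyplot as plt
import pandas as pd

logs = pd.read_csv("vovnet57_logs.csv")  # assumed columns: epoch, loss, error

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(logs["epoch"], logs["loss"])
ax1.set_xlabel("epoch")
ax1.set_ylabel("loss")
ax2.plot(logs["epoch"], logs["error"])
ax2.set_xlabel("epoch")
ax2.set_ylabel("error rate")
plt.show()
```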
Now that we have a pre-trained backbone CNN, Part 2 will add the mechanisms needed to achieve semantic segmentation.
ImageNet
Microsoft COCO Common Objects in Context
Computer Vision : Image Classification Part1 - Understanding COCO dataset