Target

This is a continuation of object detection using the Microsoft Cognitive Toolkit (CNTK).

In Part2, we perform object detection with CNTK using the training data prepared in Part1. It is assumed that CNTK and an NVIDIA GPU with CUDA are installed.

Introduction

In Computer Vision: Object Detection Part1 - Bounding Box preprocessing, we prepared the bounding boxes, category labels, and anchor boxes from Microsoft Common Objects in Context (COCO) [1].

In Part2, we will create and train a 1-stage object detection model using a neural network.

Neural network structure

This time, I made a model that combines the multi-scale feature maps of SSD [2] and the direct location prediction of YOLOv2 [3]. The outline of the implemented neural network is shown in the figure below.

Convolution layers are added on top of the feature maps from the underlying pre-trained convolutional neural network (CNN). The added convolution layers have no bias term, use Exponential Linear Units (ELUs) [4] as the activation function, and apply Batch Normalization [5].

The final output convolution layer has a bias term and uses neither a nonlinear activation function nor Batch Normalization. It predicts the bounding box, the objectness, and the category.

The idea is to detect small objects on the 26x26 feature map, medium objects on the 13x13 feature map, and large objects on the 7x7 feature map. The anchor boxes used are (0.06, 0.08) for 26x26; (0.19, 0.28), (0.31, 0.67), and (0.66, 0.35) for 13x13; and (0.31, 0.67), (0.66, 0.35), and (0.83, 0.83) for 7x7.
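As a rough check of the output size, the number of predicted boxes per feature map can be counted as grid cells times anchor boxes. This is only a sketch: it assumes the network predicts exactly one box per anchor in every grid cell, and the names `ANCHORS`, `boxes_per_scale`, and `total_boxes` are made up for illustration.

```python
# Anchor boxes (width, height) per feature map size, as listed in the text.
ANCHORS = {
    26: [(0.06, 0.08)],
    13: [(0.19, 0.28), (0.31, 0.67), (0.66, 0.35)],
    7: [(0.31, 0.67), (0.66, 0.35), (0.83, 0.83)],
}


def boxes_per_scale(anchors=ANCHORS):
    """Predicted boxes per feature map: (grid cells) x (anchors per cell).

    Assumption: one predicted box per anchor in every grid cell.
    """
    return {size: size * size * len(boxes) for size, boxes in anchors.items()}


def total_boxes(anchors=ANCHORS):
    """Total number of predicted boxes over all feature maps."""
    return sum(boxes_per_scale(anchors).values())
```

Under that assumption, the 26x26, 13x13, and 7x7 maps give 676, 507, and 147 boxes, 1330 in total.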

YOLO's algorithm is used to predict the bounding box.

x = \sigma(t_x) + c_x \\
y = \sigma(t_y) + c_y \\
w = p_w \log(1 + e^{t_w}) \\
h = p_h \log(1 + e^{t_h}) \\
objectness = \sigma(t_o)

Here, the sigmoid function is applied to the network outputs $ t_x $ and $ t_y $, and the upper left coordinates $ c_x, c_y $ of the grid cell are added to obtain the center coordinates of the bounding box. To predict the width and height, the softplus function is applied to the network outputs $ t_w $ and $ t_h $, and the result is multiplied by the anchor box size $ p_w, p_h $. The sigmoid function is applied to the output $ t_o $ to obtain the objectness.
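The decoding step can be sketched in plain Python as follows. This is a minimal illustration of the equations above, not the code in `ssmd_training.py`. The function name `decode_box` and the argument layout are assumptions, and the center is returned in grid-cell units as in the equations.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def softplus(t):
    # log(1 + e^t), written so that large positive t does not overflow.
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def decode_box(t, cell_xy, anchor_wh):
    """Decode raw outputs t = (t_x, t_y, t_w, t_h, t_o) for one anchor.

    cell_xy   : upper left coordinates (c_x, c_y) of the grid cell
    anchor_wh : anchor box size (p_w, p_h)
    Returns (x, y, w, h, objectness).
    """
    t_x, t_y, t_w, t_h, t_o = t
    c_x, c_y = cell_xy
    p_w, p_h = anchor_wh
    x = sigmoid(t_x) + c_x        # center stays inside the grid cell
    y = sigmoid(t_y) + c_y
    w = p_w * softplus(t_w)       # always positive, scaled by the anchor
    h = p_h * softplus(t_h)
    return x, y, w, h, sigmoid(t_o)
```

For example, with all raw outputs equal to 0 in grid cell (3, 4), the center is (3.5, 4.5), the width and height are the anchor size multiplied by log 2, and the objectness is 0.5.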

Settings in training

The parameters of the added convolution layers were initialized from He's normal distribution [6].

This time we use a multi-task loss function: Generalized IoU Loss [7] for bounding box regression, Binary Cross Entropy for objectness prediction, and Cross Entropy Error for categorization. The details of the loss function are explained later.

Loss = Generalized IoU Loss + Binary Cross Entropy + Cross Entropy Error

Adam [8] was used as the optimization algorithm. Adam's hyperparameter $ β_1 $ is set to 0.9, and $ β_2 $ is left at the CNTK default value.

For the learning rate, the Cyclical Learning Rate (CLR) [9] is used, with a maximum learning rate of 1e-3, a base learning rate of 1e-5, a step size of 10 times the number of epochs, and the triangular2 policy.
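For reference, the triangular2 policy from the CLR paper [9] can be written as a small function of the iteration count. This is a sketch of the schedule itself, not the scheduler used in the training program. The function name `triangular2_lr` and the step size of 100 in the usage example are assumptions for illustration.

```python
import math


def triangular2_lr(iteration, step_size, base_lr=1e-5, max_lr=1e-3):
    """Learning rate at a given iteration under the triangular2 policy.

    The rate rises linearly from base_lr to the peak over step_size
    iterations, falls back over the next step_size iterations, and the
    peak amplitude is halved after every full cycle.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    scale = 1.0 / (2 ** (cycle - 1))
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x) * scale
```

With `step_size=100`, the rate starts at 1e-5, peaks at 1e-3 at iteration 100, returns to 1e-5 at iteration 200, and peaks at about 5.05e-4 at iteration 300.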

The model was trained for 100 epochs with mini-batch learning.

Implementation

Execution environment

Hardware

・ CPU Intel (R) Core (TM) i7-5820K 3.30GHz
・ GPU NVIDIA Quadro RTX 5000 16GB

Software

・ Windows 10 Pro 1909
・ CUDA 10.0
・ cuDNN 7.6
・ Python 3.6.6
・ cntk-gpu 2.7
・ cntkx 0.1.13
・ h5py 2.9.0
・ numpy 1.17.3
・ pandas 0.25.0
・ scikit-learn 0.21.3

Program to run

The training program is available on GitHub.

`ssmd_training.py`

Commentary

The following supplements the main points of this implementation.

Generalized IoU Loss

Squared error [10] and smooth L1 loss [2] [11] are commonly used as the loss function for bounding box regression. Alternatively, Intersection over Union (IoU), which indicates the degree of overlap between the predicted bounding box and the correct bounding box, may be adopted.

However, IoU is 0 whenever the two bounding boxes do not overlap at all, which introduces more saddle points into the optimization. Generalized IoU (GIoU) was proposed to address this.

Assuming that the predictive bounding box is $ A $ and the correct bounding box is $ B $, GIoU looks like this:

IoU = \frac{A \cap B}{A \cup B} \\
GIoU = IoU  - \frac{C - (A \cup B)}{C} \\
GIoU Loss = 1 - GIoU

Here, $ C $ represents the area of the smallest rectangle that encloses the two bounding boxes. GIoU takes values in the range [-1, 1].
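The GIoU definition can be checked with a short plain-Python sketch. It assumes boxes given as corner coordinates `(x_min, y_min, x_max, y_max)` with positive area, which differs from the center and width / height representation used by the network. The function names are chosen for illustration.

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection is empty (area 0) when the boxes do not overlap.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    iou = inter / union

    # C: area of the smallest rectangle enclosing both boxes.
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (c_area - union) / c_area


def giou_loss(box_a, box_b):
    return 1.0 - giou(box_a, box_b)
```

Two identical boxes give a GIoU of 1 and a loss of 0. For the disjoint boxes `(0, 0, 1, 1)` and `(2, 0, 3, 1)`, IoU is 0 but GIoU is -1/3, so the loss still changes as the boxes move closer together.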

Multi-Task Loss

When training a neural network that performs multiple tasks, a loss function is defined for each task. As mentioned above, the loss function used here consists of the following terms.

Loss = Generalized IoU Loss + Binary Cross Entropy + Cross Entropy Error

The loss is computed with Generalized IoU Loss for the center coordinates and width / height of the bounding box, Binary Cross Entropy for the objectness that determines whether an object exists, and Cross Entropy Error for object categorization.

Therefore, the formula for the loss function is:

Loss = \lambda^{coord}_{obj} \sum^N \sum^B \left\{1 - \left(IoU - \frac{C - (A \cup B)}{C} \right) \right\} +
\lambda^{coord}_{noobj} \sum^N \sum^B \left\{1 - \left(IoU - \frac{C - (A \cup B')}{C} \right) \right\} \\
+ \lambda^{conf}_{obj} \sum^N \sum^B -t \log(\sigma(t_o)) + \lambda^{conf}_{noobj} \sum^N \sum^B -(1 - t) \log(1 - \sigma(t_o)) \\
+ \lambda^{prob}_{obj} \sum^N \sum^B -t \log(p_c) + \lambda^{prob}_{noobj} \sum^N \sum^B -t \log(p_c) \\

\lambda^{coord}_{obj} = 1.0, \lambda^{coord}_{noobj} = 0.1, \lambda^{conf}_{obj} = 1.0, \lambda^{conf}_{noobj} = 0.1, \lambda^{prob}_{obj} = 1.0, \lambda^{prob}_{noobj} = 0.0

Here, $ A $, $ B $, and $ C $ represent the predicted bounding box, the correct bounding box, and the smallest rectangular area that surrounds the two bounding boxes, respectively, and $ B' $ represents the default bounding box. The default bounding box is a bounding box centered on each grid cell whose width and height are the same as those of the anchor box.

The contribution of each loss function is adjusted by the coefficient $ \lambda $, which is set to 1.0 if an object is present and to 0.1 or 0.0 if no object is present.
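The weighting by $ \lambda $ can be sketched as follows, assuming the per-box values of the three losses have already been computed and that a boolean flag marks the boxes assigned to an object. The names `LAMBDAS`, `weighted_term`, and `multi_task_loss` are hypothetical, and the sketch only illustrates how the coefficients combine the terms.

```python
# (lambda_obj, lambda_noobj) for each task, as given in the text.
LAMBDAS = {
    "coord": (1.0, 0.1),  # Generalized IoU Loss
    "conf": (1.0, 0.1),   # Binary Cross Entropy on objectness
    "prob": (1.0, 0.0),   # Cross Entropy Error on the category
}


def weighted_term(per_box_loss, has_object, lam_obj, lam_noobj):
    """Sum per-box losses, weighting object and no-object boxes differently."""
    return sum((lam_obj if obj else lam_noobj) * loss
               for loss, obj in zip(per_box_loss, has_object))


def multi_task_loss(giou_losses, bce_losses, ce_losses, has_object):
    """Total loss over all predicted boxes (assumed inputs: per-box losses)."""
    terms = {"coord": giou_losses, "conf": bce_losses, "prob": ce_losses}
    return sum(weighted_term(losses, has_object, *LAMBDAS[name])
               for name, losses in terms.items())
```

For example, with one object box and one no-object box, per-box losses of `[0.5, 1.0]`, `[0.2, 0.3]`, and `[1.0, 2.0]` give 0.6 + 0.23 + 1.0 = 1.83. The categorization loss of the no-object box is dropped because its coefficient is 0.0.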

Dynamic Target Assignment

In network training, not every predicted bounding box corresponds to correct data. Therefore, the correct bounding box and category label are assigned dynamically.

For example, when the upper left figure in the figure below is the input image, the bounding boxes output by the network where an object exists are the red bounding boxes in the upper right figure, whereas the correct bounding box is the green bounding box in the lower left figure. The IoU between each output bounding box and the correct bounding box is calculated, and the correct bounding box and the correct category label are assigned to the predicted bounding box with the largest IoU. The lower right figure shows, in blue, the predicted bounding box that was assigned the correct bounding box.

However, some of the predicted bounding boxes that were not assigned a correct bounding box still have a high IoU, so the correct bounding box and the correct category label are assigned to them as well. The predicted bounding boxes assigned in this step are shown in light blue in the lower right figure.

A predicted bounding box to which no correct bounding box can be assigned is treated as containing no object, and the default bounding box is assigned to it.
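The assignment rule can be sketched in plain Python as below. The text does not state the IoU value that counts as "high", so the threshold `high_iou=0.5` is an assumption, as are the function name and the convention of returning `None` for boxes that receive the default bounding box.

```python
def assign_targets(iou, high_iou=0.5):
    """Assign correct boxes to predicted boxes.

    iou[i][j] is the IoU of predicted box i and correct box j.
    Returns one entry per predicted box: the index of the assigned correct
    box, or None when the box is treated as containing no object.
    high_iou is an assumed threshold, not a value from the article.
    """
    num_pred = len(iou)
    num_gt = len(iou[0]) if num_pred else 0
    assigned = [None] * num_pred
    if num_gt == 0:
        return assigned

    # Predictions whose best IoU is high also receive that correct box.
    for i, row in enumerate(iou):
        j = max(range(num_gt), key=lambda k: row[k])
        if row[j] >= high_iou:
            assigned[i] = j

    # Each correct box goes to the prediction with the largest IoU.
    # Done last so that it takes priority over the threshold rule.
    for j in range(num_gt):
        i = max(range(num_pred), key=lambda k: iou[k][j])
        assigned[i] = j

    return assigned
```

With five predictions and two correct boxes, an IoU matrix of `[[0.3, 0.0], [0.6, 0.1], [0.7, 0.0], [0.0, 0.2], [0.05, 0.0]]` yields `[None, 0, 0, 1, None]`. Prediction 2 has the largest IoU with box 0, prediction 1 is added by the threshold, and prediction 3 is the best match for box 1 even though its IoU is low.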

Result

Training loss and error

The figure below is a visualization of each loss function during training. From the left, it shows GIoU Loss for bounding box regression, Binary Cross Entropy for objectness, and Cross Entropy Error for categorization. The horizontal axis represents the number of epochs, and the vertical axis represents the value of the loss function.

Validation mAP score

Now that we have trained the 1-stage object detection model, we evaluated its performance using the validation data.

For this performance evaluation, we calculated the mean Average Precision (mAP). I used sklearn to calculate the mAP, with the IoU threshold set to 0.5. Using val2014 as the validation data gave the following result:

mAP50 Score 10.3
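How the evaluation script builds its inputs is not shown here, so the following is only a plain-Python sketch of the quantity involved. It swaps in a hand-written step-wise average precision (the sum of precision at each true positive, divided by the number of true positives) in place of sklearn. It assumes that each detection has already been marked as a true or false positive at IoU 0.5, that scores are distinct, and that undetected objects are not counted.

```python
def average_precision(is_true_positive, scores):
    """Step-wise AP over detections sorted by descending confidence.

    is_true_positive : 1 if the detection matched a correct box, else 0
    scores           : confidence of each detection (assumed distinct)
    """
    total_pos = sum(is_true_positive)
    if total_pos == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_positives = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if is_true_positive[i]:
            true_positives += 1
            # Recall rises by 1 / total_pos here; weight it by the precision.
            ap += (true_positives / rank) / total_pos
    return ap


def mean_average_precision(per_class):
    """per_class maps a category name to (is_true_positive, scores)."""
    aps = [average_precision(tp, s) for tp, s in per_class.values()]
    return sum(aps) / len(aps)
```

For a category with detections marked `[1, 0, 1, 0]` at scores `[0.9, 0.8, 0.7, 0.6]`, the AP is (1 + 2/3) / 2 = 5/6. The mAP is the mean of such values over all categories.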

FPS and demo

I also measured frames per second (FPS) as an index of execution speed. The measurement used the time module from the Python standard library, and the hardware was an NVIDIA GeForce GTX 1060 6GB GPU.

39.9 FPS
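A measurement of this kind can be sketched with the time module as follows. The `predict` callable stands in for one forward pass of the trained model and is a placeholder, as is the function name `measure_fps`. The actual measurement code may differ.

```python
import time


def measure_fps(predict, inputs):
    """Average frames per second of predict over the given inputs.

    predict is a placeholder for a single-image inference call.
    """
    start = time.perf_counter()
    for frame in inputs:
        predict(frame)
    elapsed = time.perf_counter() - start
    return len(inputs) / elapsed
```

In practice it helps to run a few warm-up predictions before timing, since the first calls on a GPU often include one-time initialization.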

Below is a video of an object detection attempt with a trained model.

The result is not good. I would like to try object detection again.

Reference

Microsoft COCO Common Objects in Context

Computer Vision : Image Classification Part2 - Training CNN model
Computer Vision : Object Detection Part1 - Bounding Box preprocessing

1. Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. "Microsoft COCO: Common Objects in Context", European Conference on Computer Vision. 2014, pp 740-755.
2. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. "SSD: Single Shot MultiBox Detector", arXiv preprint arXiv:1512.02325 (2016). European Conference on Computer Vision. 2016, pp 21-37.
3. Joseph Redmon and Ali Farhadi. "YOLO9000: better, faster, stronger", The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, pp 7263-7271.
4. Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. "Fast and accurate deep network learning by exponential linear units (ELUs)", arXiv preprint arXiv:1511.07289 (2015).
5. Sergey Ioffe and Christian Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", arXiv preprint arXiv:1502.03167 (2015).
6. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", The IEEE International Conference on Computer Vision (ICCV). 2015, pp 1026-1034.
7. Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. "Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression", The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2019, pp 658-666.
8. Diederik P. Kingma and Jimmy Lei Ba. "Adam: A method for stochastic optimization", arXiv preprint arXiv:1412.6980 (2014).
9. Leslie N. Smith. "Cyclical Learning Rates for Training Neural Networks", 2017 IEEE Winter Conference on Applications of Computer Vision. 2017, pp 464-472.
10. Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. "You Only Look Once: Unified, Real-Time Object Detection", The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp 779-788.
11. Ross Girshick. "Fast R-CNN", The IEEE International Conference on Computer Vision (ICCV). 2015, pp 1440-1448.

[PYTHON] Computer Vision: Object Detection Part2-Single Shot Multi Detector