[PYTHON] Deep learning for object detection: 7 papers to read, with key-point summaries [Road to EfficientDet]

Long time no see. Ever since AlexNet, from Hinton's group, won ILSVRC by an overwhelming margin in 2012, deep learning has been in the limelight in the world of image recognition.

In object detection, too, models based on deep learning are now the mainstream. Take a look at the leaderboard at https://paperswithcode.com/sota/object-detection-on-coco:

[Screenshot of the Papers with Code leaderboard, 2020-08-28]

On COCO test-dev, the state-of-the-art (SoTA) model appears to be EfficientDet-D7x. The selection is admittedly subjective and biased, but I have collected seven papers to read in order to understand EfficientDet.

I focus on object detection since the arrival of deep learning, and I try to keep the summaries as concise and easy to read as possible.

What is object detection?

If you don't know about object detection, please watch the video below. It's a YOLOv2 video, but it's insanely cool.

https://youtu.be/VOC3huqHrss

Two-stage detector and one-stage detector

Object detectors are roughly classified into two types: two-stage and one-stage. Since this is prerequisite knowledge, I briefly introduce both here.

--Two-stage type
  --Faster R-CNN, Mask R-CNN, etc., where the region proposal step is a separate stage.
--One-stage type
  --YOLO, SSD, etc., where region proposal is not separated and detection is done in a single pass.

Now let's get into the seven papers themselves. I introduce them in order, starting with the most basic ones.

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks（2016）

Overview

Known as Faster R-CNN. A pioneering paper on object detection by deep learning. The originator of two-stage detectors.

What is amazing compared to previous research?

Inference is significantly faster than with region-based CNN (R-CNN) and Fast R-CNN. The model can also be trained end-to-end.

What is the key to technology?

--RPN (Region Proposal Network)
  --Generates anchor boxes on the feature map produced by the CNN
  --Performs foreground / background classification and box regression for each anchor box
--RoI Pooling
  --Outputs features of a fixed size for rectangles of any size
--Detection is divided into 2 stages
  --stage 1: Extract rectangles that look like objects with the RPN
  --stage 2: Classification and box regression using the features inside the rectangles extracted by the RPN
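To make the anchor box idea a little more concrete, here is a minimal sketch in plain Python of how anchors could be laid out over a feature map. The stride of 16, the three scales (128, 256, 512) and the three aspect ratios (1:2, 1:1, 2:1) are the defaults described in the Faster R-CNN paper. The function name `generate_anchors` and the exact box parametrisation are my own assumptions for illustration, not the authors' implementation.

```python
import math


def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2) tuples in input-image pixels.

    One anchor per (scale, ratio) pair is centred on every feature-map cell.
    ratio is height / width, and each anchor keeps an area of scale ** 2.
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Centre of this feature-map cell, mapped back to the input image.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors


anchors = generate_anchors(feat_h=2, feat_w=3)
print(len(anchors))  # 2 * 3 cells * 9 anchors per cell = 54
```

The RPN then scores each of these anchors as foreground or background and regresses an offset for it; stage 2 only looks at the best-scoring ones.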

How did you verify that it was valid?

--Achieved the SoTA at that time on PASCAL VOC and MS COCO
--Reduced inference time per image to 200 ms

Is there a discussion?

--How can it be made even faster?

You Only Look Once: Unified, Real-Time Object Detection（2016）

Overview

Known as YOLO. A model from the same period as Faster R-CNN. The originator of one-stage detectors. The video introduced at the beginning is from YOLOv2. Versions up to v5 have been released since, although the authors have changed along the way.

What is amazing compared to previous research?

Above all, it is fast (although it does not reach SoTA accuracy). Faster R-CNN is also described as real-time, but I think YOLO is what truly deserves to be called real-time detection.

What is the key to technology?

--The simple idea of dividing the image into an S × S grid and making predictions for each grid cell.
--A network structure that performs "classification" and "regression" at the same time using the final feature map.
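The grid idea can be sketched in a few lines of plain Python. In the YOLO paper the cell that contains the centre of an object is responsible for detecting it, and with S = 7, B = 2 boxes per cell and C = 20 classes the output tensor is 7 × 7 × 30. The function names below are hypothetical and only there for illustration.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains the box centre (cx, cy)."""
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


def output_shape(S=7, B=2, C=20):
    """Each cell predicts B boxes (x, y, w, h, confidence) plus C class scores."""
    return (S, S, B * 5 + C)


print(responsible_cell(cx=224, cy=100, img_w=448, img_h=448))  # (1, 3)
print(output_shape())  # (7, 7, 30)
```

Because everything comes out of one tensor in one forward pass, there is no separate proposal stage to wait for, which is where the speed comes from.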

How did you verify that it was valid?

--Achieves 45 FPS, which is much faster than Faster R-CNN. (Faster R-CNN is 5 FPS)
--Achieves 155 FPS with the even faster Fast YOLO.

Is there a discussion?

--How can accuracy be improved while maintaining inference speed?

SSD: Single Shot MultiBox Detector（2016）

Overview

Known as SSD. Like YOLO, it is one of the original one-stage detectors.

What is amazing compared to previous research?

Achieves 59 FPS while maintaining accuracy comparable to two-stage detectors.

What is the key to technology?

--Uses feature maps from several layers of the network (a feature hierarchy) and defines anchors (called default boxes in the paper) on each feature map.
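As a rough sketch of what "default boxes on each feature map" means in numbers: the SSD paper spaces the box scales linearly between s_min = 0.2 and s_max = 0.9 across the feature maps, and the SSD300 model uses six feature maps of size 38, 19, 10, 5, 3 and 1 with 4 or 6 default boxes per location. The function names are my own, and real implementations adjust some of these values, so treat this as an illustration rather than a reference.

```python
def default_box_scales(m, s_min=0.2, s_max=0.9):
    """Scale of the default boxes on each of m feature maps (relative to image size)."""
    return [s_min + (s_max - s_min) * k / (m - 1) for k in range(m)]


def count_default_boxes(feature_map_sizes, boxes_per_cell):
    """Total number of default boxes over all (square) feature maps."""
    return sum(size * size * n
               for size, n in zip(feature_map_sizes, boxes_per_cell))


print([round(s, 2) for s in default_box_scales(6)])
# [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
print(count_default_boxes((38, 19, 10, 5, 3, 1), (4, 6, 6, 6, 4, 4)))  # 8732
```

Large feature maps carry the small boxes and small feature maps carry the large ones, so one forward pass covers objects of many sizes.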

How did you verify that it was valid?

--Achieved 74.3% mAP at 59 FPS on VOC2007. (YOLO is 63.4% mAP at 45 FPS)

Is there a discussion?

--How can smaller objects be detected better?

Feature Pyramid Networks for Object Detection（2017）

Overview

Known as FPN. The basis of the BiFPN used in EfficientDet. A new feature extraction method.

What is amazing compared to previous research?

By using pyramid-shaped features, it becomes easier to detect objects at different scales.

What is the key to technology?

--By combining the lower (high-resolution) feature maps with the upper (semantically strong) feature maps, the lower-level features, which were semantically weak in conventional SSD, are strengthened.
--Improves the detection of small objects, which was a weak point of SSD.
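The combination step can be sketched as follows. In the FPN paper the upper map is upsampled by a factor of 2 (nearest neighbour) and added element-wise to the lower map. This toy version works on single-channel maps stored as nested lists and leaves out the 1 × 1 and 3 × 3 convolutions the paper applies around the addition; the function names are my own.

```python
def upsample_nearest_2x(fmap):
    """Double the height and width of a 2-D map by repeating each value."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def top_down_merge(coarse, fine):
    """Add the upsampled coarse (upper) map to the fine (lower) map."""
    up = upsample_nearest_2x(coarse)
    return [[u + f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine)]


coarse = [[10]]          # 1x1 upper-level map (semantically strong)
fine = [[1, 2], [3, 4]]  # 2x2 lower-level map (high resolution)
print(top_down_merge(coarse, fine))  # [[11, 12], [13, 14]]
```

Repeating this from the top of the pyramid downwards gives every level both fine spatial detail and strong semantics.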

How did you verify that it was valid?

--By incorporating FPN into Faster R-CNN, the authors achieved the SoTA at that time on COCO.

Is there a discussion?

--How can the image be processed even better without losing its features?

Focal Loss for Dense Object Detection（2018）

Overview

Known as RetinaNet. Introduced a new concept called Focal Loss.

What is amazing compared to previous research?

When anchor boxes and the like are used, background examples inevitably dominate. In the past this was handled with hard negative mining and similar techniques, but this paper improves on it with a different approach: changing the loss function.

What is the key to technology?

--The loss is a modified version of cross entropy.
--Well-classified examples are down-weighted so that they have little effect on the loss. (That is, easy examples get lower weights and training focuses on hard negatives.)
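In the paper the loss is written as $FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$, where $p_t$ is the probability the model assigns to the correct class. Below is a minimal per-example sketch in plain Python, using the defaults $\gamma = 2$ and $\alpha = 0.25$ reported in the paper; the function names and the two example probabilities are my own choices for illustration.

```python
import math


def cross_entropy(p, y):
    """Binary cross entropy for one example (p = predicted foreground prob)."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example; y is the label (1 or 0)."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Two background anchors (y = 0): one easy (p_t = 0.99), one hard (p_t = 0.1).
for name, p in (("easy", 0.01), ("hard", 0.9)):
    print(name, cross_entropy(p, 0), focal_loss(p, 0))
```

The easy example is scaled down by several orders of magnitude while the hard one keeps most of its loss, so the huge number of easy background anchors no longer swamps training.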

How did you verify that it was valid?

--Achieved the SoTA at that time on COCO.

Is there a discussion?

--None.

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks（2019）

Overview

Known as EfficientNet. Of the seven papers in this article, this is the only one about "classification" rather than "detection". Published by Google Brain. I include it because it is used as the backbone of EfficientDet.

What is amazing compared to previous research?

It achieves high accuracy with considerably fewer parameters than previous models.

What is the key to technology?

--Jointly balances width (the size of each layer), depth (the number of layers), and resolution (the input image size) when scaling up the model
--The scale of the model is determined by a single coefficient, the compound coefficient $\phi$.
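As a small worked example of compound scaling: the paper scales depth by $\alpha^{\phi}$, width by $\beta^{\phi}$ and resolution by $\gamma^{\phi}$, with $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$ found by grid search under the constraint $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$. The sketch below just evaluates those formulas. The function names are mine, and the released B1 to B7 models do not follow these multipliers exactly, so this is only meant to show the idea.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases from the paper


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_multiplier(phi):
    """FLOPs grow roughly with depth * width^2 * resolution^2."""
    d, w, r = compound_scale(phi)
    return d * w ** 2 * r ** 2


for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, "
          f"FLOPs x{flops_multiplier(phi):.2f}")
```

Because of the constraint, raising $\phi$ by 1 roughly doubles the FLOPs, which makes it easy to pick a model for a given compute budget.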

How did you verify that it was valid?

--Achieved SoTA on ImageNet and on 5 transfer learning datasets.

Is there a discussion?

--None.

EfficientDet: Scalable and Efficient Object Detection（2020）

Overview

Known as EfficientDet. The latest object detection model, announced in 2020. In short, it is RetinaNet with the FPN replaced by BiFPN and the backbone replaced by EfficientNet.

What is amazing compared to previous research?

Compared with existing models of similar accuracy, EfficientDet has considerably fewer parameters. It also needs fewer operations (FLOPs).

What is the key to technology?

--Proposes BiFPN, a method for building a feature pyramid by mixing feature maps of multiple resolutions.
--Introduces a parameter that scales the capacity of the network, as in EfficientNet, to balance FLOPs and accuracy.
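One detail of how BiFPN "mixes" feature maps is worth a small sketch. The paper fuses inputs with learned weights using what it calls fast normalized fusion: $O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} I_i$, with each $w_i$ kept non-negative by a ReLU and $\epsilon = 0.0001$. The version below works on flat lists of numbers instead of real feature maps, and the names and example values are hypothetical.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-sized feature vectors with non-negative, normalised weights."""
    w = [max(0.0, x) for x in weights]  # ReLU keeps every weight >= 0
    total = eps + sum(w)
    fused = [0.0] * len(features[0])
    for feat, wi in zip(features, w):
        for i, value in enumerate(feat):
            fused[i] += wi / total * value
    return fused


p_in = [1.0, 2.0]  # hypothetical input feature at one pyramid level
p_up = [3.0, 4.0]  # hypothetical feature resized from a neighbouring level

# Equal weights give (almost exactly) the average of the two inputs.
print(fast_normalized_fusion([p_in, p_up], weights=[1.0, 1.0]))
# A negative weight is clipped to 0, so that input is ignored.
print(fast_normalized_fusion([p_in, p_up], weights=[1.0, -5.0]))
```

Compared with a softmax over the weights, this normalisation avoids the exponentials, which the paper reports as faster on GPU with similar accuracy.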

How did you verify that it was valid?

--According to the experimental results in the paper, it reaches a SoTA of 55.1% AP on COCO.

Is there a discussion?

--None.

Extra edition

These have nothing to do with EfficientDet, but I list them here as extras. If you are interested, please read them.

Libra R-CNN: Towards Balanced Learning for Object Detection
--Analyzes the problems in the training stage of object detection and proposes an improvement for each problem.
End-to-End Object Detection with Transformers
--A paper that applies the Transformer, which has proven effective in natural language processing (BERT, the much-talked-about GPT-3, etc.), to object detection.

In conclusion

This time I only briefly summarized the key points of each paper, but I think your understanding will deepen if you read the original papers with these summaries in mind.

Here are implementations of EfficientDet.
--PyTorch: https://github.com/rwightman/efficientdet-pytorch
--TensorFlow: https://github.com/google/automl/tree/master/efficientdet (described in the paper)

Thank you for reading until the end!