Long time no see. Ever since AlexNet, from Hinton's group, won ILSVRC 2012 by an overwhelming margin, deep learning has been in the limelight in the world of image recognition.
Object detection is no exception: models based on deep learning are now the mainstream. According to https://paperswithcode.com/sota/object-detection-on-coco,
the state-of-the-art (SoTA) model on COCO test-dev appears to be EfficientDet-D7x. The selection is admittedly subjective and biased, but I have collected seven papers to read in order to understand EfficientDet.
I will focus on object detection since the rise of deep learning and keep things as concise and readable as possible.
If you are not familiar with object detection, please watch the video below. It is a YOLOv2 demo, and it is insanely cool.
https://youtu.be/VOC3huqHrss
Object detectors are roughly classified into two-stage and one-stage types. As prerequisite knowledge, I will briefly introduce both here.
--Two-stage type: detectors such as Faster R-CNN and Mask R-CNN, where the region proposal stage is separate.
--One-stage type: detectors such as YOLO and SSD, where the region proposal stage is not separate and detection is done in a single pass.
Now for the main topic, the seven papers. I will introduce them in order, starting from the basics.
Known as Faster R-CNN. A pioneering paper on object detection by deep learning. The originator of two-stage detectors.
Inference is significantly faster than with region-based CNN (R-CNN) and Fast R-CNN. The network can be trained end-to-end.
--Achieved SoTA at the time on PASCAL VOC and MS COCO.
--Reduced inference time per image to about 200 ms.
--What can be done to make it even faster?
Known as YOLO. A model from the same period as Faster R-CNN, and the originator of one-stage detectors. The video introduced at the beginning is from YOLOv2. Versions up to v5 have since been released, although by different authors.
Above all, it is fast (although it did not achieve SoTA in accuracy). Faster R-CNN is also described as capable of real-time detection, but I think YOLO is what real-time detection truly looks like.
--The simple idea of dividing the image into an S × S grid and making predictions for each grid cell.
--A network structure that performs "classification" and "regression" at the same time using the final-stage feature map.
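The grid idea can be sketched in a few lines. This is only a minimal illustration, not the authors' code: `responsible_cell` and `output_shape` are hypothetical helper names, and S = 7, B = 2, C = 20 are the PASCAL VOC settings reported in the YOLO paper.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains the box center (cx, cy)."""
    col = min(int(cx / img_w * S), S - 1)  # clamp so a center on the right edge stays in range
    row = min(int(cy / img_h * S), S - 1)  # clamp so a center on the bottom edge stays in range
    return row, col


def output_shape(S=7, B=2, C=20):
    """Each cell predicts B boxes (x, y, w, h, confidence) plus C class scores."""
    return (S, S, B * 5 + C)


print(responsible_cell(224, 100, 448, 448))  # (1, 3)
print(output_shape())                        # (7, 7, 30)
```

The cell that contains the center of an object is the one responsible for predicting it, and a single tensor of shape S × S × (B × 5 + C) carries both the box regression and the class scores.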
--Achieves 45 FPS, much faster than Faster R-CNN (5 FPS).
--The faster Fast YOLO variant achieves 155 FPS.
--How can the accuracy be improved while maintaining the inference speed?
Known as SSD. Like YOLO, it is one of the original one-stage detectors.
Achieves 59 FPS while maintaining accuracy comparable to two-stage detectors.
--Uses feature maps from multiple levels of the network hierarchy. Anchors (called default boxes in the paper) are defined for each feature map.
--Achieved 74.3% mAP at 59 FPS on VOC2007. (YOLO: 63.4% mAP at 45 FPS)
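To get a feel for what "default boxes for each feature map" means, here is a small sketch. It assumes the SSD300 configuration from the paper (feature maps of size 38, 19, 10, 5, 3, 1 with 4, 6, 6, 6, 4, 4 boxes per cell) and the paper's linear scale rule. The function names are my own, and the released implementation may use slightly different scale values.

```python
def default_box_centers(fmap_size):
    """Centers of the default boxes, normalized to [0, 1], one per feature-map cell."""
    return [((j + 0.5) / fmap_size, (i + 0.5) / fmap_size)
            for i in range(fmap_size) for j in range(fmap_size)]


def scale(k, m, s_min=0.2, s_max=0.9):
    """Box scale for the k-th feature map (k = 1..m), linearly spaced as in the paper."""
    return s_min + (s_max - s_min) * (k - 1) / (m - 1)


FMAP_SIZES = [38, 19, 10, 5, 3, 1]   # SSD300 feature-map resolutions
BOXES_PER_CELL = [4, 6, 6, 6, 4, 4]  # default boxes (aspect ratios) per cell

total = sum(len(default_box_centers(f)) * b
            for f, b in zip(FMAP_SIZES, BOXES_PER_CELL))
print(total)  # 8732
```

Large feature maps get small boxes and small feature maps get large boxes, so the detector covers many object sizes in one forward pass.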
--How can smaller objects be detected better?
Known as FPN. The basis of the BiFPN used in EfficientDet. A new feature extraction method.
By using pyramid-shaped features, objects of different scales become easier to recognize.
--By combining the lower-level (high-resolution) feature maps with the upper-level (low-resolution) ones, the lower-level features, which were semantically weak in the conventional SSD, are strengthened.
--Overcomes SSD's weakness in detecting small objects.
--By incorporating FPN into Faster R-CNN, SoTA at the time was achieved on COCO.
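The merge of an upper (coarse) and a lower (fine) feature map can be sketched as follows. This is a simplified single-channel version in plain Python: the 1×1 lateral convolution and the 3×3 smoothing convolution of the actual FPN are omitted, and `upsample2x` and `merge` are names I made up for illustration.

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat each row vertically
    return out


def merge(top, lateral):
    """Top-down merge: upsample the coarser map and add the finer lateral map."""
    up = upsample2x(top)
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]


coarse = [[1.0, 2.0],
          [3.0, 4.0]]                 # low resolution, semantically strong
fine = [[0.1] * 4 for _ in range(4)]  # high resolution, semantically weak
merged = merge(coarse, fine)
print(merged[0])
```

The merged map keeps the resolution of the fine map while inheriting the stronger semantics of the coarse one, which is why small objects become easier to detect.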
--How can you make the image better without losing its features?
Known as RetinaNet. Introduced a new concept called Focal Loss.
With anchor boxes, the background class inevitably dominates the training examples. In the past this was handled by techniques such as Hard Negative Mining, but here it is addressed with a different approach: changing the loss itself.
--Adds a modulating term to the standard cross-entropy loss.
--Well-classified examples are modified so that they contribute little to the loss. (That is, easy examples get lower weights, and training focuses on hard negatives.)
--Achieved SoTA at the time on COCO.
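As a rough sketch of the idea, the focal loss in the paper is written as FL(p_t) = -α_t (1 - p_t)^γ log(p_t). The code below is a minimal binary version in plain Python, using the paper's default values γ = 2 and α = 0.25. The function names are my own and this is not the authors' implementation.

```python
import math


def cross_entropy(p, y):
    """Binary cross entropy for predicted foreground probability p and label y (0 or 1)."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Cross entropy scaled by (1 - p_t)^gamma, so well-classified examples count less."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)  # confident and correct: strongly down-weighted
hard = focal_loss(0.1, 1)  # confident and wrong: keeps most of its loss
print(easy, hard)
```

With γ = 0 and α_t = 1 the expression reduces to the ordinary cross entropy, so the modulating factor (1 - p_t)^γ is the only thing that changes.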
--None.
Known as EfficientNet. Of the seven papers this time, this is the only one about "classification" rather than "detection". Announced by Google Brain. Included because it is used as the backbone of EfficientDet.
It achieves high accuracy with considerably fewer parameters than previous models.
--Jointly optimizes width (number of channels per layer), depth (number of layers), and resolution (input image size) when scaling up the model.
--The scale of the model is determined by a single value, the compound coefficient $\phi$.
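The compound scaling rule can be sketched like this. In the paper, depth, width, and resolution are scaled by α^φ, β^φ, and γ^φ, with α = 1.2, β = 1.1, γ = 1.15 found by a small grid search on the base network under the constraint α·β²·γ² ≈ 2. The helper names below are hypothetical, and the released models round the resulting values.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases from the paper


def compound_scale(phi):
    """Multipliers applied to the base network for compound coefficient phi."""
    return {
        "depth": ALPHA ** phi,       # number of layers
        "width": BETA ** phi,        # number of channels
        "resolution": GAMMA ** phi,  # input image size
    }


def flops_multiplier(phi):
    """Approximate FLOPs growth: depth scales linearly, width and resolution quadratically."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


for phi in range(4):
    print(phi, compound_scale(phi), round(flops_multiplier(phi), 2))
```

Because α·β²·γ² is close to 2, increasing φ by one roughly doubles the FLOPs, which makes the single coefficient an easy knob for the compute budget.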
--Achieved SoTA on ImageNet and on 5 of 8 transfer-learning datasets.
--None.
Known as EfficientDet. The latest object detection model, announced in 2020. It replaces the FPN of RetinaNet with BiFPN and the backbone with EfficientNet.
Compared with existing models of similar accuracy, EfficientDet has considerably fewer parameters and requires fewer operations (FLOPs).
--Proposes BiFPN, a method that builds a feature pyramid by fusing feature maps of multiple resolutions.
--Introduces a parameter that scales the capacity of the network, as in EfficientNet, to balance FLOPs and accuracy.
--According to the experimental results in the paper, it reaches a SoTA of 55.1% AP on COCO.
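The "mixing" in BiFPN is a weighted sum with learnable weights. The paper calls it fast normalized fusion: O = Σ_i w_i / (ε + Σ_j w_j) · I_i, where each w_i is kept non-negative with a ReLU and ε = 0.0001. Below is a minimal sketch on flattened feature maps in plain Python. The function name is my own, the inputs are assumed to be already resized to the same resolution, and the convolution applied after fusion is omitted.

```python
def fast_normalized_fusion(inputs, weights, eps=1e-4):
    """Fuse same-sized feature maps using non-negative, normalized weights."""
    w = [max(0.0, x) for x in weights]  # ReLU keeps every weight >= 0
    total = eps + sum(w)                # eps avoids division by zero
    n = len(inputs[0])
    return [sum(wi * inp[k] for wi, inp in zip(w, inputs)) / total
            for k in range(n)]


p_fine = [1.0, 2.0]    # feature from one resolution (flattened)
p_coarse = [3.0, 4.0]  # feature from another resolution, already resized
print(fast_normalized_fusion([p_fine, p_coarse], [1.0, 1.0]))
```

With equal weights the result is almost a plain average, while a weight that is negative before the ReLU removes that input from the fusion. This lets the network learn how much each resolution should contribute.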
--None.
It has nothing to do with EfficientDet, but I will also leave some extras here. If you are interested, please read them.
This time I only briefly summarized the key points of each paper, but I think your understanding will deepen if you read the original papers with these points in mind.
Here are implementations of EfficientDet.
PyTorch: https://github.com/rwightman/efficientdet-pytorch
TensorFlow: https://github.com/google/automl/tree/master/efficientdet (described in the paper)
Thank you for reading until the end!