[PYTHON] Deep Understanding of Object Detection by Deep Learning with Keras

For whom

For those who already understand deep learning and want to master the mechanism from image classification to object detection

Since there are many formulas, if you just want to look at the code, jump to the following section:

[Specific implementation example](http://qiita.com/GushiSnow/private/8c946208de0d6a4e31e7#%E5%85%B7%E4%BD%93%E7%9A%84%E3%81%AA%E5%AE%9F%E8%A3%85%E4%BE%8B)

bonus

I translated a book about Keras. It covers a wide range of topics such as image classification, image generation, natural language processing, time series prediction, and reinforcement learning. [Intuitive Deep Learning - Recipes for shaping ideas with Python x Keras](https://www.amazon.co.jp/Deep-Learning-%E2%80%95Python%C3%97Keras%E3%81%A7%E3%82%A2%E3%82%A4%E3%83%87%E3%82%A2%E3%82%92%E5%BD%A2%E3%81%AB%E3%81%99%E3%82%8B%E3%83%AC%E3%82%B7%E3%83%94-Antonio-Gulli/dp/4873118263/ref=sr_1_1?s=books&ie=UTF8&qid=1530227887&sr=1-1&keywords=%E7%9B%B4%E6%84%9F+Deep+Learning)

Purpose

I wrote this because I wanted to systematically summarize the techniques related to object detection and understand them down to the code level.

This article was created by referring to the chapter on object recognition in Image Recognition, which is an excellent book.

[Image recognition](https://www.amazon.co.jp/%E7%94%BB%E5%83%8F%E8%AA%8D%E8%AD%98-%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%83%97%E3%83%AD%E3%83%95%E3%82%A7%E3%83%83%E3%82%B7%E3%83%A7%E3%83%8A%E3%83%AB%E3%82%B7%E3%83%AA%E3%83%BC%E3%82%BA-%E5%8E%9F%E7%94%B0-%E9%81%94%E4%B9%9F/dp/4061529129 "Image recognition")

Screen Shot 2017-06-22 at 9.37.01.png

Overall picture

Screen Shot 2017-06-22 at 9.59.59.png

Object detection is roughly divided into three phases.

1: Extraction of object area candidates

This phase extracts candidate object regions from the image, and it is the part that most affects accuracy and speed. As shown in the figure, one method prepares a small window (bounding box) and extracts region candidates while shifting it by a certain number of pixels. If the window is shifted one pixel at a time, roughly W * H windows have to be evaluated for an image of that size. Therefore, to reduce this computational cost, it is common to narrow down the candidates with a measure of objectness (how object-like a region looks).

2: Object recognition of object area candidates

You need to recognize what appears in each candidate region. This can be solved as a general supervised classification problem. The important point here is the selection of teacher data. Unless you choose negative examples (non-object data) that are difficult to classify, the classifier will only learn to solve easy problems and its performance will not be practical. It is therefore important to select hard-to-classify regions as negative examples. In this figure, the region where the whole cow appears is used as such a negative example.

3: Narrow down the detection area

Even when there is only one target object, multiple candidate regions will be detected. The appropriate detection region is determined by keeping only the region with the maximum detection score among them.

Each method

| Extraction of object area candidates | Object recognition of object area candidates | Narrowing down the detection area |
| --- | --- | --- |
| Sliding window method (inefficient but simple) | HOG features + linear SVM | NMS (uses IoU) |
| Selective search method (efficient) | DPM (considers deformation of the object) | |
| Branch and bound method (efficient) | HOG features + Latent SVM (considers filter positions) | |
| Attentional cascade (fast object detection but requires a classifier) | Exemplar-SVM (classifies individual objects) | |
| | Rectangular features + AdaBoost (high classification performance at low cost) | |

Screen Shot 2017-06-22 at 10.26.32.png

To collect good negative examples for recognizing the region candidates, you can cache the negatives that the classifier misclassified and drop the candidate negatives it classified correctly; this lets you collect hard negatives and train on them efficiently.

Object detection by deep learning

Object detection by deep learning has the advantage of using CNN features, which extract higher-quality features than the methods above. Let's look at the actual flow of object detection by deep learning.

From now on, the small window is referred to as the bounding box.

R-CNN (Regions with CNN features)

Screen Shot 2017-06-22 at 10.55.35.png

R-CNN regresses the proposed bounding box toward the true one. Denote the proposed bounding box by

\vec{r} = (r_x, r_y, r_w, r_h)^T

and the true bounding box by

\vec{g} = (g_x, g_y, g_w, g_h)^T

The model parameter W used to reach the true bounding box is obtained by solving:

\vec{W} = argmin_{\vec{W}}\sum^{N}_{n=1}({\vec{t}_n - \vec{W}^Tf(\vec{r}_n)})^2 + \lambda\|\vec{W}\|^2_{F}

The regression target here is

\vec{t} = (t_x, t_y, t_w, t_h)^T

where t is defined from the true bounding box g and the proposed bounding box r as

t_x = (g_x - r_x) / r_w,
t_y = (g_y - r_y) / r_h,
t_w = log(g_w / r_w),
t_h = log(g_h / r_h),

The reason for this parameterization is, as far as I can guess, as follows (the book does not explain it and I have not investigated further, so please look into it if you are interested).

Since bounding boxes come in many sizes, the center offset is expressed as the ratio of the difference to the box width and height. The width and height are expressed as ratios to the proposal. Because these ratios can be extremely small or large, the logarithm is taken to dampen their effect.

In short, the raw coordinates are not used directly; they are transformed as above.
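As a small illustration of this parameterization (a minimal sketch, not code from the book), the regression targets can be computed from a proposal and a ground-truth box given as center coordinates plus width and height:

```python
import numpy as np

def bbox_regression_targets(r, g):
    """Compute (t_x, t_y, t_w, t_h) from a proposal r and ground truth g,
    both given as (x_center, y_center, width, height)."""
    r_x, r_y, r_w, r_h = r
    g_x, g_y, g_w, g_h = g
    t_x = (g_x - r_x) / r_w      # center offset, normalized by proposal width
    t_y = (g_y - r_y) / r_h      # center offset, normalized by proposal height
    t_w = np.log(g_w / r_w)      # log ratio of widths
    t_h = np.log(g_h / r_h)      # log ratio of heights
    return np.array([t_x, t_y, t_w, t_h])

# a proposal slightly offset from the true box
print(bbox_regression_targets(r=(50, 50, 100, 80), g=(55, 48, 110, 90)))
```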

Fast R-CNN

R-CNN had to run the CNN on every object region. Fast R-CNN differs in that the CNN features are computed once for the entire image. Since the cropped regions of that feature map then have different sizes, they must be converted to fixed-length features by RoI pooling.

Screen Shot 2017-06-22 at 11.24.35.png

What is RoI pooling? See page 10 below

Paper introduction: Fast R-CNN & Faster R-CNN
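The following is a minimal NumPy sketch of the idea behind RoI max pooling (an illustration only, not the actual Fast R-CNN implementation): the cropped region of the feature map is divided into a fixed grid and max-pooled per cell, so a region of any size becomes a fixed-length feature.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=(7, 7)):
    """feature_map: (H, W, C); roi: (x_min, y_min, x_max, y_max) in feature-map coordinates."""
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1, :]
    h, w, _ = region.shape
    h_out, w_out = out_size
    # roughly equal bins over the region
    ys = np.linspace(0, h, h_out + 1).astype(int)
    xs = np.linspace(0, w, w_out + 1).astype(int)
    pooled = np.zeros((h_out, w_out, feature_map.shape[2]))
    for i in range(h_out):
        for j in range(w_out):
            cell = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1), :]
            pooled[i, j] = cell.max(axis=(0, 1))
    return pooled

pooled = roi_max_pool(np.random.rand(38, 50, 512), roi=(5, 10, 30, 25))
print(pooled.shape)  # (7, 7, 512) regardless of the RoI size
```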

Learning method

Fast R-CNN optimizes a multitask loss so that class recognition and bounding box regression are learned simultaneously.

Each correct bounding box is assumed to be given its true position t (defined as in R-CNN) and its class label u. The multitask loss is expressed by the following formulas.

Posterior probability of class

\vec{p} = (p^0, p^1, \dots, p^{N_c})^T

Relative position and size of bounding box

\vec{v} = (v_x, v_y, v_w, v_h)^T

Based on the above

J(\vec{p}, u, \vec{v}, \vec{t}) = J_{cls}(\vec{p}, u) + \lambda[u >= 1]J_{loc}(\vec{v}, \vec{t})

J_cls is the loss of class recognition and J_loc is the loss of bounding box regression.

J_cls is calculated as the negative logarithm of the posterior probability p^u for the true class u.

J_{cls}(\vec{p}, u) = -\log{p^u} 

J_loc is

J_{loc} = \sum_{i \in { \{x,y,w,h} \}}smooth_{L1}(t_i - v_i)
smooth_{L1}(x) = \left\{
\begin{array}{ll}
0.5x^2 & if (|x| < 1) \\
|x| - 0.5 &otherwise
\end{array}
\right.

The smooth L1 function behaves quadratically while the relative position difference is smaller than 1, and switches to |x| - 0.5 otherwise, so the loss does not become extremely large for outliers (the 0.5 offset keeps the two pieces continuous).
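A minimal NumPy sketch of this multitask loss for a single RoI (an illustration built from the definitions above, not the actual Fast R-CNN code):

```python
import numpy as np

def smooth_l1(x):
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def multitask_loss(p, u, t, v, lam=1.0):
    """p: class posteriors, u: true class id (0 = background),
    t: predicted box target (4,), v: true box target (4,)."""
    j_cls = -np.log(p[u])                   # classification loss
    j_loc = np.sum(smooth_l1(t - v))        # bounding-box regression loss
    return j_cls + lam * (u >= 1) * j_loc   # regression counted only for non-background

p = np.array([0.1, 0.7, 0.2])               # softmax output over 3 classes
print(multitask_loss(p, u=1,
                     t=np.array([0.1, -0.2, 0.05, 0.3]),
                     v=np.zeros(4)))
```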

Computational efficiency

Because all region candidates from one image share a single forward pass, a mini-batch is built from few images so the feature map can be reused efficiently. Specifically, N (the number of images per mini-batch) is kept small and R (the number of bounding boxes sampled per image) is made large.

Faster R-CNN

In Fast R-CNN, the object region candidates had to be computed by a separate module (selective search). Faster R-CNN instead builds a Region Proposal Network (RPN) that estimates object regions from the feature map, and integrates it with Fast R-CNN.

Screen Shot 2017-06-22 at 12.10.38.png

The RPN proposes bounding boxes with scores. It consists of a part that learns the bounding box parameters and a separate branch that predicts the presence or absence of an object; combining these realizes the RPN.

Prepare k anchor boxes whose shapes are decided in advance: standard bounding boxes centered on each local region of the input. These anchor shapes are hyperparameters. The bounding box prediction outputs a 4k-dimensional vector containing the relative position and size with respect to each anchor box: (x, y, w, h) * k.

[Aspect ratio](https://ja.wikipedia.org/wiki/%E3%82%A2%E3%82%B9%E3%83%9A%E3%82%AF%E3%83%88%E6%AF%94)

Since the classification network judges the presence or absence of an object in two classes, it outputs a 2k-dimensional vector. (Yes, no) * k
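The following sketch illustrates how the k anchor boxes at one location could be generated. The scales (128, 256, 512) and aspect ratios (0.5, 1, 2), giving k = 9, are the values commonly used in the Faster R-CNN paper; the function itself is only an illustration, not the actual implementation.

```python
import numpy as np

def make_anchors(center, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (k, 4) anchors as (x_center, y_center, w, h) around one location."""
    cx, cy = center
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)       # width/height chosen so that w/h = r
            h = s / np.sqrt(r)       # and the area stays roughly s * s
            anchors.append((cx, cy, w, h))
    return np.array(anchors)         # k = len(scales) * len(ratios) = 9

print(make_anchors(center=(120, 80)).shape)  # (9, 4): 4k offsets and 2k scores are predicted
```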

The optimal bounding boxes are derived by minimizing a multitask loss similar to Fast R-CNN. The whole Faster R-CNN network is trained by learning the RPN and Fast R-CNN alternately: first train only the RPN so that it can propose good bounding boxes, then train Fast R-CNN with those proposals, and alternate.

YOLO (You only look once)

So far, the theme has been to find a good bounding box, but there are attempts to detect objects directly. That is YOLO.

Screen Shot 2017-06-22 at 12.43.34.png

YOLO procedure

1: Divide the input image into S * S regions
2: Derive the class probabilities of the object in each region
3: Compute the parameters (x, y, h, w) and confidence of B bounding boxes (B is a hyperparameter)

Confidence

q = P_r(Obj) \times IoU^{truth}_{pred}

Here IoU^{truth}_{pred} is the degree of agreement between the predicted and the correct bounding box. The product of the object class probability and the confidence of each bounding box is used for object detection.


P_r(C_i|Obj) \times P_r(Obj) \times IoU^{truth}_{pred}

The YOLO network is as follows.

Screen Shot 2017-06-22 at 12.57.23.png

The output contains, for each of the S * S grid cells, the (x, y, h, w) and confidence of the B bounding boxes together with the class probabilities.
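As a small worked example, with the values from the original YOLO paper (S = 7, B = 2, C = 20) the output dimensionality is:

```python
S, B, C = 7, 2, 20                 # grid size, boxes per cell, classes (YOLO paper values)
output_dim = S * S * (B * 5 + C)   # each box carries (x, y, w, h, confidence)
print(output_dim)                  # 7 * 7 * 30 = 1470
```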

The degree of match between the bounding boxes is measured by the IoU, expressed by the following formula.

IoU = \frac{area(R_p \bigcap R_g)}{area(R_p \bigcup R_g)}

Screen Shot 2017-06-22 at 13.02.20.png
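A minimal sketch of the IoU calculation above for two boxes given as (x_min, y_min, x_max, y_max):

```python
def iou(box_a, box_b):
    # intersection rectangle
    x0 = max(box_a[0], box_b[0])
    y0 = max(box_a[1], box_b[1])
    x1 = min(box_a[2], box_b[2])
    y1 = min(box_a[3], box_b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 = 0.142...
```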

SSD

I will also touch on SSD, which was not in the book but is useful as a method.

・ Speed comparison

Screen Shot 2017-06-30 at 10.53.05.png

・ Accuracy comparison

The difference between SSD512 and SSD300 is the size of the input image

Screen Shot 2017-06-30 at 10.54.04.png

advantage

--Simple network configuration similar to YOLO

Reason for advantage

--By preparing output layers according to aspect ratio and training them, the model is not affected by the scale of the image.
--The simple end-to-end model eliminates the need for extra processing, so it is faster.

model comparison

Screen Shot 2017-06-26 at 10.19.02.png

The figure above compares the end-to-end models YOLO and SSD. SSD prepares multiple feature maps of different resolutions (with default boxes of several aspect ratios) and feeds them into the final layer, so objects at different scales can be handled.

About the output layer

Screen Shot 2017-06-26 at 10.17.24.png

The 8732 in the figure is the number of default bounding boxes. A larger number increases accuracy but decreases speed, so there is a trade-off.

The output layer carries, for each of the k bounding boxes, the number of classes c and the offsets (x, y, h, w). Since these must be prepared for every location of an m * n feature map, the size of the output for that feature map is:


(c+4)kmn
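As a worked example of where the 8732 boxes mentioned above come from: summing k * m * n over the output feature maps gives the total number of default boxes. The feature map sizes and k values below are the SSD300 configuration from the paper, stated here as an assumption rather than derived in this article.

```python
# (feature map size m = n, boxes per location k) for the six SSD300 output layers
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]
total_boxes = sum(k * m * m for m, k in feature_maps)
print(total_boxes)  # 8732; the output size is then (c + 4) values per default box
```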

The loss function measures two things: the deviation of the object's position and the error in the class prediction. N is the number of matched default bounding boxes (when N = 0 the loss is set to 0, because it would otherwise diverge). α is a hyperparameter that controls the relative importance of class identification versus offset regression. x^p_{ij} is 1 if the default box i is matched to the true box j of class p, and 0 otherwise.

x^p_{ij} \in \{1, 0\}

L(x,c,l,g) = \frac{1}{N}(L_{conf}(x, c) + \alpha L_{loc}(x,l,g))

Loss function with respect to position

l is the predicted position

L_{loc}(x,l,g) = \sum^N_{i \in Pos}\sum_{m \in {cx, cy, w, h}} x^k_{ij} {\rm smooth_{L1}}(l^m_i-\hat{g}^m_j)
smooth_{L1}(x) = \left\{
\begin{array}{ll}
0.5x^2 & if (|x| < 1) \\
|x| - 0.5 &otherwise
\end{array}
\right.

The default bounding box is denoted d and the true bounding box g; the true values are normalized to the default box scale as follows.


\hat{g}^{cx}_j = (g^{cx}_j - d^{cx}_i) / d^{w}_i,
\hat{g}^{cy}_j = (g^{cy}_j - d^{cy}_i) / d^{h}_i,
\hat{g}^{w}_j = \log(g^{w}_j / d^{w}_i),
\hat{g}^{h}_j = \log(g^{h}_j / d^{h}_i),

Loss function for class

The first term represents the prediction for each class and the second term represents the background prediction.

L_{conf}(x, c) = -\sum^N_{i \in Pos}x^p_{ij}\log(\hat{c}^p_i) -\sum_{i \in Neg}\log(\hat{c}^0_i)

The class scores are converted to probabilities by the softmax function.

\hat{c}^p_i = \frac{\exp(c^p_i)}{\sum_p{\exp(c^p_i)}}
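A tiny NumPy illustration of this softmax, turning the raw class scores of one default box into probabilities:

```python
import numpy as np

c = np.array([1.2, 0.3, -0.5, 2.0])     # raw scores for 4 classes (incl. background)
c_hat = np.exp(c) / np.exp(c).sum()     # \hat{c}^p_i
print(c_hat, c_hat.sum())               # probabilities that sum to 1
```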

Choosing scales and aspect ratios for default boxes

Since the feature maps are multi-scale, each feature map is given the role of detecting objects of a certain size. The scale s_k below increases with k, so the earlier, higher-resolution feature maps are responsible for small objects and the deeper, coarser feature maps for large ones.

s_k = s_{min} + \frac{s_{max} - s_{min}}{m-1}(k-1)

Set the aspect ratio of the bounding box prepared by default as follows.


a_r = \{1, 2, 3, 1/2, 1/3\}

Calculate the width and height of each and prepare a bounding box.

width


w^a_k = s_k \sqrt{a_r}

height


h^a_k = s_k / \sqrt{a_r}

When the aspect ratio is 1, an additional bounding box with the following scale is also prepared.


s_k' = \sqrt{s_k s_{k+1}}
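A small sketch of these scale and default box calculations, assuming s_min = 0.2, s_max = 0.9 and m = 6 feature maps (the values used in the SSD paper):

```python
import numpy as np

s_min, s_max, m = 0.2, 0.9, 6
scales = [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

aspect_ratios = [1, 2, 3, 1 / 2, 1 / 3]
k = 1                                      # first feature map
boxes = [(scales[k - 1] * np.sqrt(a),      # width  w_k^a = s_k * sqrt(a_r)
          scales[k - 1] / np.sqrt(a))      # height h_k^a = s_k / sqrt(a_r)
         for a in aspect_ratios]
extra = np.sqrt(scales[k - 1] * scales[k]) # extra box for aspect ratio 1
boxes.append((extra, extra))

print(np.round(scales, 2))                 # [0.2  0.34 0.48 0.62 0.76 0.9 ]
print(len(boxes))                          # 6 default boxes for this layer
```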

Hard negative mining

Since many negative bounding boxes appear, they are sorted by confidence loss, picked from the top (the hardest ones), and trimmed so that the ratio of negatives to positives is at most 3:1.
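A minimal NumPy sketch of this selection (an illustration only; the negatives are ranked here by their confidence loss, as in the SSD paper):

```python
import numpy as np

def hard_negative_mining(conf_loss, is_positive, neg_pos_ratio=3):
    """conf_loss: per-box confidence loss, is_positive: boolean mask of matched boxes."""
    num_pos = int(is_positive.sum())
    num_neg = min(neg_pos_ratio * num_pos, int((~is_positive).sum()))
    neg_idx = np.where(~is_positive)[0]
    # keep the negatives with the largest loss (the hardest examples)
    hardest = neg_idx[np.argsort(conf_loss[neg_idx])[::-1][:num_neg]]
    keep = is_positive.copy()
    keep[hardest] = True
    return keep

keep = hard_negative_mining(np.random.rand(100), np.arange(100) < 5)
print(keep.sum())  # 5 positives + 15 hardest negatives = 20
```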

Data augmentation

--Use the whole image
--Sample a patch so that the overlap (Jaccard index) with the ground truth is 0.1, 0.3, 0.5, 0.7, or 0.9
--Randomly sample a patch

Specific implementation example

Even if you understand the abstract concepts and methods, you may still wonder how to actually implement them. Here we implement SSD with Keras v2.0 by referring to the code below.

A port of SSD: Single Shot MultiBox Detector to Keras framework.

Since the original code is not compatible with the Keras 2.0 series, I refer to a version that has been fixed via pull requests. The repository below also describes how to set up the environment with Docker.

https://github.com/SnowMasaya/ssd_keras

Understanding Model

TensorFlow has a visualization tool called TensorBoard, which is used for visualization here.

--Model visualization

Viewing the model graph in TensorBoard gives the big picture.

CNN layer

Screen Shot 2017-06-26 at 14.33.10.png

The part where the feature maps are merged

--Offset (position): mbox_loc
--Confidence: mbox_conf
--Each default bounding box: mbox_priorbox

Screen Shot 2017-06-26 at 14.33.44.png

Final layer

Predicted using combined feature maps

Screen Shot 2017-06-26 at 14.37.28.png

Understanding the specific code

Once you understand the conceptual diagram, you will understand the specific processing.

Model description

ssd_v2.py

Screen Shot 2017-06-26 at 10.19.02.png

In SSD, feature maps from different layers are concatenated and fed to the output.

The following is the process of concatenating the offset layers and the class identification layers respectively. Since dimension 0 is the data (batch) dimension, the concatenation grows the feature dimension along axis 1.

        mbox_loc = concatenate([conv4_3_norm_mbox_loc_flat,
                                fc7_mbox_loc_flat,
                                conv6_2_mbox_loc_flat,
                                conv7_2_mbox_loc_flat,
                                conv8_2_mbox_loc_flat,
                                pool6_mbox_loc_flat],
                               axis=1, name='mbox_loc')
        mbox_conf = concatenate([conv4_3_norm_mbox_conf_flat,
                                 fc7_mbox_conf_flat,
                                 conv6_2_mbox_conf_flat,
                                 conv7_2_mbox_conf_flat,
                                 conv8_2_mbox_conf_flat,
                                 pool6_mbox_conf_flat],
                                axis=1, name='mbox_conf')
        num_boxes = mbox_loc._keras_shape[-1] // 4

The number of boxes is obtained by dividing the (fully concatenated) position feature dimension by 4, so that value is used.

Below, the concatenated dimensions are reshaped into (number of boxes, offsets) and (number of boxes, classes).

--Dimension of num_boxes: 7308
--Dimension of mbox_loc: 29232 (7308 * 4)
--Dimension of mbox_conf: 153468 (7308 * number of classes (21))

        mbox_loc = Reshape((num_boxes, 4),
                           name='mbox_loc_final')(mbox_loc)
        mbox_conf = Reshape((num_boxes, num_classes),
                            name='mbox_conf_logits')(mbox_conf)

The output layer concatenates the offsets, the class identification, and the prior bounding boxes (4 coordinates plus the variances of x, y, h, w). Since dimension 0 is the batch dimension and dimension 1 is the number of bounding boxes, the concatenation grows the feature dimension along axis 2.

    predictions = concatenate([mbox_loc,
                               mbox_conf,
                               mbox_priorbox],
                              axis=2,
                              name='predictions')

Description of processing required for SSD

ssd_utils.py sets the bounding box.

Method list

--decode_boxes: Converting position predictions to matching bounding box values.

It takes the four predicted position offsets, the prior (default) bounding boxes, and the bounding box variances as arguments. The variance is used because the prediction is not uniquely determined; it allows the prediction to have a certain range.

1: Find the center position, width, and height from the prior bounding box coordinates.
2: Compute the center position, width, and height of the decoded bounding box from the predicted offsets, the prior values, and the variances. Because the predicted width/height offsets are small, they are converted to values of sufficient size by exp. Note that the variance is taken into account: the predicted center, width, and height are treated probabilistically to allow for deviations.
3: Convert from the center representation to the minimum and maximum coordinates.
4: Combine the obtained values into one vector.
5: Clip the converted values to the range 0 to 1 and return them.


    def decode_boxes(self, mbox_loc, mbox_priorbox, variances):
        prior_width = mbox_priorbox[:, 2] - mbox_priorbox[:, 0]
        prior_height = mbox_priorbox[:, 3] - mbox_priorbox[:, 1]
        prior_center_x = 0.5 * (mbox_priorbox[:, 2] + mbox_priorbox[:, 0])
        prior_center_y = 0.5 * (mbox_priorbox[:, 3] + mbox_priorbox[:, 1])

        decode_bbox_center_x = mbox_loc[:, 0] * prior_width * variances[:, 0]
        decode_bbox_center_x += prior_center_x
        decode_bbox_center_y = mbox_loc[:, 1] * prior_width * variances[:, 1]
        decode_bbox_center_y += prior_center_y
        decode_bbox_width = np.exp(mbox_loc[:, 2] * variances[:, 2])
        decode_bbox_width *= prior_width
        decode_bbox_height = np.exp(mbox_loc[:, 3] * variances[:, 3])
        decode_bbox_height *= prior_height

        decode_bbox_xmin = decode_bbox_center_x - 0.5 * decode_bbox_width
        decode_bbox_ymin = decode_bbox_center_y - 0.5 * decode_bbox_height
        decode_bbox_xmax = decode_bbox_center_x + 0.5 * decode_bbox_width
        decode_bbox_ymax = decode_bbox_center_y + 0.5 * decode_bbox_height

        decode_bbox = np.concatenate((decode_bbox_xmin[:, None],
                                      decode_bbox_ymin[:, None],
                                      decode_bbox_xmax[:, None],
                                      decode_bbox_ymax[:, None]), axis=-1)
       
        decode_bbox = np.minimum(np.maximum(decode_bbox, 0.0), 1.0)

        return decode_bbox

detection_out: Returns the predicted result

1: Get the positions, variances, prior boxes, and confidences from the predicted values.
2: Convert the position values to bounding boxes (decode_boxes).
3: For each class whose confidence is above a certain level, obtain the bounding box values (with non-maximum suppression).
4: Return the top 200 results.


    def detection_out(self, predictions, background_label_id=0, keep_top_k=200,
                      confidence_threshold=0.01):

        mbox_loc = predictions[:, :, :4]
        variances = predictions[:, :, -4:]
        mbox_priorbox = predictions[:, :, -8:-4]
        mbox_conf = predictions[:, :, 4:-8]
        results = []
        for i in range(len(mbox_loc)):
            results.append([])
            
            decode_bbox = self.decode_boxes(mbox_loc[i],
                                            mbox_priorbox[i], variances[i])
            
            for c in range(self.num_classes):
                if c == background_label_id:
                    continue
                c_confs = mbox_conf[i, :, c]
                c_confs_m = c_confs > confidence_threshold
                if len(c_confs[c_confs_m]) > 0:
                    boxes_to_process = decode_bbox[c_confs_m]
                    confs_to_process = c_confs[c_confs_m]
                    feed_dict = {self.boxes: boxes_to_process,
                                 self.scores: confs_to_process}
                    idx = self.sess.run(self.nms, feed_dict=feed_dict)
                    good_boxes = boxes_to_process[idx]
                    confs = confs_to_process[idx][:, None]
                    labels = c * np.ones((len(idx), 1))
                    c_pred = np.concatenate((labels, confs, good_boxes),
                                            axis=1)
                    results[-1].extend(c_pred)
            
            if len(results[-1]) > 0:
                results[-1] = np.array(results[-1])
                argsort = np.argsort(results[-1][:, 1])[::-1]
                results[-1] = results[-1][argsort]
                results[-1] = results[-1][:keep_top_k]
        return results

ssd_layers.py defines the PriorBox class, which determines the sizes of the default bounding boxes (the black and red boxes in the figure).

Screen Shot 2017-06-26 at 17.00.25.png

1: Get the width and height of the feature map.
2: Get the width and height of the input image.
3: Add bounding box sizes to fit each aspect ratio.
4: The processing differs depending on whether the aspect ratio is 1 or not.
5: Define the center positions of the boxes.
6: Set the minimum and maximum coordinates of each bounding box.
7: Set the variances.
8: Combine the bounding boxes and variances and return them as a TensorFlow tensor.


class PriorBox(Layer):
    
    #abridgement

    def call(self, x, mask=None):
        if hasattr(x, '_keras_shape'):
            input_shape = x._keras_shape
        elif hasattr(K, 'int_shape'):
            input_shape = K.int_shape(x)

        layer_width = input_shape[self.waxis]
        layer_height = input_shape[self.haxis]
         
        img_width = self.img_size[0]
        img_height = self.img_size[1]
        # define prior boxes shapes
        box_widths = []
        box_heights = []

        for ar in self.aspect_ratios:
            if ar == 1 and len(box_widths) == 0:
                box_widths.append(self.min_size)
                box_heights.append(self.min_size)
            elif ar == 1 and len(box_widths) > 0:
                box_widths.append(np.sqrt(self.min_size * self.max_size))
                box_heights.append(np.sqrt(self.min_size * self.max_size))
            elif ar != 1:
                box_widths.append(self.min_size * np.sqrt(ar))
                box_heights.append(self.min_size / np.sqrt(ar))
        box_widths = 0.5 * np.array(box_widths)
        box_heights = 0.5 * np.array(box_heights)

        #Get the step width by dividing the image size by the feature size
        step_x = img_width / layer_width
        step_y = img_height / layer_height
        #linspace processing
        #     https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html
        # np.linspace(2.0, 3.0, num=5)
        #   -> array([ 2.  ,  2.25,  2.5 ,  2.75,  3.  ])
        #Get vertical and horizontal arrays for the number of features for each step width
        linx = np.linspace(0.5 * step_x, img_width - 0.5 * step_x,
                           layer_width)
        liny = np.linspace(0.5 * step_y, img_height - 0.5 * step_y,
                           layer_height)
        #processing of meshgrid
        #     https://docs.scipy.org/doc/numpy/reference/generated/numpy.meshgrid.html
        # xv, yv = np.meshgrid(x, y)
        # xv
        # -> array([[ 0. ,  0.5,  1. ],
        #          [ 0. ,  0.5,  1. ]])
        # yv
        # -> array([[ 0.,  0.,  0.],
        #          [ 1.,  1.,  1.]])
        #Match the feature array created earlier
        centers_x, centers_y = np.meshgrid(linx, liny)
        centers_x = centers_x.reshape(-1, 1)
        centers_y = centers_y.reshape(-1, 1)

        num_priors_ = len(self.aspect_ratios)
        prior_boxes = np.concatenate((centers_x, centers_y), axis=1)
        prior_boxes = np.tile(prior_boxes, (1, 2 * num_priors_))
        prior_boxes[:, ::4] -= box_widths
        prior_boxes[:, 1::4] -= box_heights
        prior_boxes[:, 2::4] += box_widths
        prior_boxes[:, 3::4] += box_heights
        prior_boxes[:, ::2] /= img_width
        prior_boxes[:, 1::2] /= img_height
        prior_boxes = prior_boxes.reshape(-1, 4)
        if self.clip:
            prior_boxes = np.minimum(np.maximum(prior_boxes, 0.0), 1.0)
        num_boxes = len(prior_boxes)
        if len(self.variances) == 1:
            variances = np.ones((num_boxes, 4)) * self.variances[0]
        elif len(self.variances) == 4:
            variances = np.tile(self.variances, (num_boxes, 1))
        else:
            raise Exception('Must provide one or four variances.')
        prior_boxes = np.concatenate((prior_boxes, variances), axis=1)
        prior_boxes_tensor = K.expand_dims(K.variable(prior_boxes), 0)
        if K.backend() == 'tensorflow':
            pattern = [tf.shape(x)[0], 1, 1]
            prior_boxes_tensor = tf.tile(prior_boxes_tensor, pattern)
        return prior_boxes_tensor

Learning

Learning processing is done with SSD_training.ipynb.

Since training is performed with model.fit_generator, the generator processing, including data augmentation, is done by the Generator class. The teacher data (labels, offsets, bounding boxes) are set up in `assign_boxes` of `ssd_utils.py`.

ssd_utils.py sets the bounding box.

The bounding box has an offset value and a variance value for each offset

priors[i] = [xmin, ymin, xmax, ymax, varxc, varyc, varw, varh].

Method list

--assign_boxes: assigns only the prior boxes to be used during training
--encode_box: called by `assign_boxes` to encode a bounding box into the space the network learns
--iou: called by `encode_box` to compute the overlap (intersection over union) between bounding boxes

iou

Screen Shot 2017-06-27 at 13.42.35.png

1: Get the upper-left and lower-right coordinates of the intersection of the true box and the predicted (prior) box.
2: Calculate the area of the overlap between the true box and the predicted box from those coordinates.
3: Calculate the area of the predicted box.
4: Calculate the area of the true box.
5: Subtract the overlap area from the sum of the two areas (the union area).
6: Divide the overlap area by the value from 5 (the union area).

The meaning of step 6 is that the larger the overlap relative to the union, the closer the predicted box is to the true box; this ratio (IoU) is the index of how well the two boxes match.

    def iou(self, box):
        # compute IoU between one true box and all prior boxes (self.priors)
        inter_upleft = np.maximum(self.priors[:, :2], box[:2])
        inter_botright = np.minimum(self.priors[:, 2:4], box[2:])
        inter_wh = inter_botright - inter_upleft
        inter_wh = np.maximum(inter_wh, 0)
        inter = inter_wh[:, 0] * inter_wh[:, 1]
        # compute union
        area_pred = (box[2] - box[0]) * (box[3] - box[1])
        area_gt = (self.priors[:, 2] - self.priors[:, 0])
        area_gt *= (self.priors[:, 3] - self.priors[:, 1])
        union = area_pred + area_gt - inter
        # compute iou
        iou = inter / union
        return iou

encode_box

1: Among the boxes scored by `iou`, discard the bounding boxes whose overlap ratio is 0.5 or less.
2: Get the center position and width of the true box.
3: Get the center positions and widths of the predicted (prior) boxes that satisfy the condition in 1.
4: Prepare an encoded box to be used for training.
5: Subtract the center position of the predicted box from the center position of the true box (so the offset between them is known).
6: Divide the value from 5 by the width of the predicted box (to get a ratio).
7: Divide the value from 6 by the variance of the predicted box.
8: Take the logarithm of the true box width divided by the predicted box width.
9: Divide the value from 8 by the variance of the predicted box.

Steps 8 and 9 correspond to the transformation used by the loss function with respect to position.

    def encode_box(self, box, return_iou=True):
        iou = self.iou(box)
        encoded_box = np.zeros((self.num_priors, 4 + return_iou))
        assign_mask = iou > self.overlap_threshold
        if not assign_mask.any():
            assign_mask[iou.argmax()] = True
        if return_iou:
            encoded_box[:, -1][assign_mask] = iou[assign_mask]
        assigned_priors = self.priors[assign_mask]
        box_center = 0.5 * (box[:2] + box[2:])
        box_wh = box[2:] - box[:2]
        assigned_priors_center = 0.5 * (assigned_priors[:, :2] +
                                        assigned_priors[:, 2:4])
        assigned_priors_wh = (assigned_priors[:, 2:4] -
                              assigned_priors[:, :2])
        # we encode variance
        encoded_box[:, :2][assign_mask] = box_center - assigned_priors_center
        encoded_box[:, :2][assign_mask] /= assigned_priors_wh
        encoded_box[:, :2][assign_mask] /= assigned_priors[:, -4:-2]
        encoded_box[:, 2:4][assign_mask] = np.log(box_wh /
                                                  assigned_priors_wh)
        encoded_box[:, 2:4][assign_mask] /= assigned_priors[:, -2:]
        return encoded_box.ravel()

assign_boxes

1: Prepare an assignment array initialized for the offsets, the number of classes, and the extra slots.
2: Encode the true boxes.
3: Get the maximum IoU between each prior box and the true boxes.
4: Get the index of the true box with the maximum IoU for each prior box.
5: Select only the priors whose best IoU is greater than 0.
6: Assign the encoded offsets to the priors that satisfy the above condition.
7: Assign the class labels.
8: Mark the positive samples in advance so training can use both positive and negative samples.

    def assign_boxes(self, boxes):
        # assignment layout: 4 encoded offsets + num_classes + 8 extra slots
        assignment = np.zeros((self.num_priors, 4 + self.num_classes + 8))
        assignment[:, 4] = 1.0
        if len(boxes) == 0:
            return assignment
        encoded_boxes = np.apply_along_axis(self.encode_box, 1, boxes[:, :4])
        encoded_boxes = encoded_boxes.reshape(-1, self.num_priors, 5)
        best_iou = encoded_boxes[:, :, -1].max(axis=0)
        best_iou_idx = encoded_boxes[:, :, -1].argmax(axis=0)
        best_iou_mask = best_iou > 0
        best_iou_idx = best_iou_idx[best_iou_mask]
        assign_num = len(best_iou_idx)
        encoded_boxes = encoded_boxes[:, best_iou_mask, :]
        #Encoded coordinate assignment
        assignment[:, :4][best_iou_mask] = encoded_boxes[best_iou_idx,
                                                         np.arange(assign_num),
                                                         :4]
        assignment[:, 4][best_iou_mask] = 0
        #Class assignment
        assignment[:, 5:-8][best_iou_mask] = boxes[best_iou_idx, 4:]
        #Allocation of positive samples for learning
        assignment[:, -8][best_iou_mask] = 1
        return assignment

ssd_training.py

The loss functions for position and class identification are defined in ssd_training.py. The constructor sets the number of classes, the ratio (alpha) between the class loss and the position loss, and the negative-to-positive ratio.

class MultiboxLoss(object):

    def __init__(self, num_classes, alpha=1.0, neg_pos_ratio=3.0,
                 background_label_id=0, negatives_for_hard=100.0):
        self.num_classes = num_classes
        self.alpha = alpha
        self.neg_pos_ratio = neg_pos_ratio
        if background_label_id != 0:
            raise Exception('Only 0 as background label id is supported')
        self.background_label_id = background_label_id
        self.negatives_for_hard = negatives_for_hard

Below is the smooth L1 function used in the position loss.


    def _l1_smooth_loss(self, y_true, y_pred):
        abs_loss = tf.abs(y_true - y_pred)
        sq_loss = 0.5 * (y_true - y_pred)**2
        l1_loss = tf.where(tf.less(abs_loss, 1.0), sq_loss, abs_loss - 0.5)
        return tf.reduce_sum(l1_loss, -1)

Below is the softmax loss function used in the class loss.


    def _softmax_loss(self, y_true, y_pred):
        y_pred = tf.maximum(tf.minimum(y_pred, 1 - 1e-15), 1e-15)
        softmax_loss = -tf.reduce_sum(y_true * tf.log(y_pred),
                                      axis=-1)
        return softmax_loss

The multibox loss, the sum of the position loss and the class identification loss, is calculated below.

1: Calculate the classification and position losses for all boxes.
2: Calculate the loss for the positive examples.
3: Calculate the loss for the negative examples and keep only the ones with the highest confidence.
4: Sum the positive and negative losses.


    def compute_loss(self, y_true, y_pred):
        batch_size = tf.shape(y_true)[0]
        num_boxes = tf.to_float(tf.shape(y_true)[1])

        #Calculate the loss of all boxes
        conf_loss = self._softmax_loss(y_true[:, :, 4:-8],
                                       y_pred[:, :, 4:-8])
        loc_loss = self._l1_smooth_loss(y_true[:, :, :4],
                                        y_pred[:, :, :4])

        #Calculate positive loss
        num_pos = tf.reduce_sum(y_true[:, :, -8], axis=-1)
        pos_loc_loss = tf.reduce_sum(loc_loss * y_true[:, :, -8],
                                     axis=1)
        pos_conf_loss = tf.reduce_sum(conf_loss * y_true[:, :, -8],
                                      axis=1)

        #Calculate negative losses and get only those with high confidence
        #Get the number of negative cases
        num_neg = tf.minimum(self.neg_pos_ratio * num_pos,
                             num_boxes - num_pos)
        # if no sample in the batch has positives, fall back to negatives_for_hard
        pos_num_neg_mask = tf.greater(num_neg, 0)
        has_min = tf.to_float(tf.reduce_any(pos_num_neg_mask))
        num_neg = tf.concat(axis=0, values=[num_neg,
                                [(1 - has_min) * self.negatives_for_hard]])
        num_neg_batch = tf.reduce_min(tf.boolean_mask(num_neg,
                                                      tf.greater(num_neg, 0)))
        num_neg_batch = tf.to_int32(num_neg_batch)
        confs_start = 4 + self.background_label_id + 1
        confs_end = confs_start + self.num_classes - 1
        max_confs = tf.reduce_max(y_pred[:, :, confs_start:confs_end],
                                  axis=2)
        _, indices = tf.nn.top_k(max_confs * (1 - y_true[:, :, -8]),
                                 k=num_neg_batch)
        batch_idx = tf.expand_dims(tf.range(0, batch_size), 1)
        batch_idx = tf.tile(batch_idx, (1, num_neg_batch))
        full_indices = (tf.reshape(batch_idx, [-1]) * tf.to_int32(num_boxes) +
                        tf.reshape(indices, [-1]))
        neg_conf_loss = tf.gather(tf.reshape(conf_loss, [-1]),
                                  full_indices)
        neg_conf_loss = tf.reshape(neg_conf_loss,
                                   [batch_size, num_neg_batch])
        neg_conf_loss = tf.reduce_sum(neg_conf_loss, axis=1)

        # loss is sum of positives and negatives
        total_loss = pos_conf_loss + neg_conf_loss
        total_loss /= (num_pos + tf.to_float(num_neg_batch))
        num_pos = tf.where(tf.not_equal(num_pos, 0), num_pos,
                            tf.ones_like(num_pos))
        total_loss += (self.alpha * pos_loc_loss) / num_pos
        return total_loss

Training data

--Image data
--Label data: offsets and class labels

Label data is described in xml in the following format. You can see the class label and offset.

<annotation>
        <folder>VOC2007</folder>
        <filename>000032.jpg</filename>
        <source>
                <database>The VOC2007 Database</database>
                <annotation>PASCAL VOC2007</annotation>
                <image>flickr</image>
                <flickrid>311023000</flickrid>
        </source>
        <owner>
                <flickrid>-hi-no-to-ri-mo-rt-al-</flickrid>
                <name>?</name>
        </owner>
        <size>
                <width>500</width>
                <height>281</height>
                <depth>3</depth>
        </size>
        <segmented>1</segmented>
        <object>
                <name>aeroplane</name>
                <pose>Frontal</pose>
                <truncated>0</truncated>
                <difficult>0</difficult>
                <bndbox>
                        <xmin>104</xmin>
                        <ymin>78</ymin>
                        <xmax>375</xmax>
                        <ymax>183</ymax>
                </bndbox>
        </object>
        <object>
                <name>aeroplane</name>
                <pose>Left</pose>
                <truncated>0</truncated>
                <difficult>0</difficult>
                <bndbox>
                        <xmin>133</xmin>
                        <ymin>88</ymin>
                        <xmax>197</xmax>
                        <ymax>123</ymax>
                </bndbox>
        </object>
        <object>
                <name>person</name>
                <pose>Rear</pose>
                <truncated>0</truncated>
                <difficult>0</difficult>
                <bndbox>
                        <xmin>195</xmin>
                        <ymin>180</ymin>
                        <xmax>213</xmax>
                        <ymax>229</ymax>
                </bndbox>
        </object>
        <object>
                <name>person</name>
                <pose>Rear</pose>
                <truncated>0</truncated>
                <difficult>0</difficult>
                <bndbox>
                        <xmin>26</xmin>
                        <ymin>189</ymin>
                        <xmax>44</xmax>
                        <ymax>238</ymax>
                </bndbox>
        </object>
</annotation>

Since a single image can contain multiple objects, there are as many rows in the label data as there are bounding boxes. The default bounding boxes are defined in prior_boxes_ssd300.pkl; prior_box_variance represents the variance of each default bounding box.

[xmin, ymin, xmax, ymax, binary_class_label[Depends on the number of classes],  prior_box_xmin, prior_box_ymin, prior_box_xmax, prior_box_ymax, prior_box_variance_xmin, prior_box_variance_ymin, prior_box_variance_xmax, prior_box_variance_ymax,]
[xmin, ymin, xmax, ymax, binary_class_label[Depends on the number of classes],  prior_box_xmin, prior_box_ymin, prior_box_xmax, prior_box_ymax, prior_box_variance_xmin, prior_box_variance_ymin, prior_box_variance_xmax, prior_box_variance_ymax,]

:

How to prepare your own learning data

You may want to prepare and annotate your own training data. The following tool is recommended because it lets you create annotation data in the same XML format as used here.

https://github.com/tzutalin/labelImg

However, the installation has some pitfalls, so I will share the problems I ran into in my environment.

OS: macOS Sierra 10.12.5 (16F73)

Set up a Python virtual environment first! There are various ways to do it, but if you skip this step, things will get messy when you run into trouble.

Download and install SIP

https://riverbankcomputing.com/software/sip/download

cd  {download folder}/SIP

python configure.py
make
make install

Download and install PyQt5

I installed PyQt5 because I was working in a Python 3 environment. The latest version at the time (5.8) would not start due to a bug, so explicitly specify the previous version when installing.

pip install PyQt5==5.7.1

installation of libxml

Since libxml is used to process the XML, install it as below (for Mac).

brew install libxml2

Saving a file with a Japanese name produced garbled characters, so I fixed this. Until the pull request is merged, check and apply the following fix.

https://github.com/SnowMasaya/labelImg/commit/066eb78704fb0bc551dbb5aebccd8804dae3ed9e

Trained model

Caffe offers a number of trained models. If you want to use the trained model in Keras, you need a converter.

Screenshot from 2017-07-18 08:47:54.png

You can get the trained model of Caffe below.

https://github.com/weiliu89/caffe/tree/ssd

Use the following for the conversion.

deploy.prototxt
*.caffemodel

deploy.prototxt needs its input layer converted as below (the * parts depend on the model).

Before conversion

input: "data"
input_shape {
  dim: *
  dim: *
  dim: *
  dim: *
}

After conversion

layer {
  name: "input_1"
  type: "Input"
  top: "data"
  input_param {
    # These dimensions are purely for sake of example;
    # see infer.py for how to reshape the net to the given input size.
    shape { dim: * dim: * dim: * dim: * }
  }
}

reference

Creating object detection network using SSD

SSD: Single Shot MultiBox Detector (ECCV2016)

SSD: Single Shot MultiBox Detector Try high-speed real-time object detection demo with Keras

Irasutoya

A port of SSD: Single Shot MultiBox Detector to Keras framework.

Liu, Wei, et al. "SSD: Single Shot MultiBox Detector." European Conference on Computer Vision. Springer, Cham, 2016.


Let's learn the object detection algorithm (SSD: Single Shot MultiBox Detector)

SSD: Single Shot MultiBox Detector
