[Python] [Paper reading] Self-supervised learning paper commentary, part 1: the history of dense tracking

Introduction

In this article I will explain papers on self-supervised learning, a topic that comes up often these days. There are many self-supervised papers, so this time I will focus on the task called __dense tracking__ and follow how it has developed up to the present. The following five papers will be covered.

- [1] Tracking Emerges by Colorizing Videos (abbreviation: Vid. Color)
- [2] Learning Correspondence from the Cycle-consistency of Time (abbreviation: CycleTime)
- [3] Self-supervised Learning for Video Correspondence Flow (abbreviation: CorrFlow)
- [4] Joint-task Self-supervised Learning for Temporal Correspondence (abbreviation: UVC)
- [5] MAST: A Memory-Augmented Self-Supervised Tracker (abbreviation: MAST)

The fifth paper is the most recent, accepted at CVPR 2020, and will be explained in part 2 (coming soon). The first four papers appear in the fifth as self-supervised comparison methods, so by reading parts 1 and 2 together I intend to trace the whole development of the dense tracking task.

Fig. 1: Dense tracking (self-supervised methods) performance comparison [5]

Unless otherwise specified, the figures are quoted from the original papers.

Tracking Emerges by Colorizing Videos [1]

This is a representative paper on video self-supervised learning. Please watch the demo first.

The basic idea of what this demo does is the same as optical flow (for background, see an easy-to-understand explanation of optical flow): it predicts which pixel in the frame at time $T$ each pixel in the frame at time $T-1$ corresponds to. If you specify the pixels of the target you want to track with a mask in the first frame, you can track the target by following the movement of those pixels through the subsequent frames.

In addition to masks, keypoints such as joint positions can be tracked in the same way.

Figure 2: Tracking label data

Such tasks are sometimes called __dense tracking__ in the sense that they require dense pixel-to-pixel association between frames.

Colorization

So how is the model trained? Conventional approaches such as optical flow prediction are given the correct flow as annotation. This paper, however, proposes a training scheme that requires no annotation at all: video colorization.

Figure 3: Color restoration using video

__The two frames, Reference and Target, are first converted to grayscale and passed through a CNN, which predicts the pixel motion and yields a correspondence (pointer) between pixel positions. The color of the Target frame is then predicted by copying colors from the Reference frame through this pointer.__ The predicted colors can be compared with the actual colors of the Target frame to compute a loss, so training is possible without any optical flow annotation.

Let me explain the specific calculation. The goal is a pointer that copies the colors of the Reference frame onto the Target frame, which can be written as:

$y_{j}=\sum_{i} A_{i j} c_{i}$

$y_{j}$ is the predicted color of the $j$-th pixel of the Target frame, and $c_{i}$ is the color of the $i$-th pixel of the Reference frame (if the formula is hard to parse, please see the reference article). $A_{ij}$ is the transformation matrix from the $i$-th pixel to the $j$-th pixel, in other words the pointer. Because the elements of $A_{ij}$ are normalized to values between 0 and 1 by a softmax, multiple pixels can be referenced at once.

Figure 4: Network structure

As shown in the figure above, both the Reference frame and the Target frame are first converted to grayscale, and a feature vector $f$ is obtained for each through the CNN. To build the pointer we only need the similarity between the $i$-th pixel of the Reference frame and the $j$-th pixel of the Target frame, so we take the inner product:

$A_{i j}=f_{i}^{T} f_{j}$

However, we want to normalize the similarities into probabilities between 0 and 1, so a softmax is applied:

$A_{i j}=\frac{\exp \left(f_{i}^{T} f_{j}\right)}{\sum_{k} \exp \left(f_{k}^{T} f_{j}\right)}$
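As a minimal sketch of this pointer computation, assuming PyTorch and feature maps already flattened to shape `(num_pixels, channels)` (the function names and shapes are mine, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def soft_pointer(f_ref, f_tgt):
    """A_ij = softmax_i(f_i^T f_j): for each target pixel j, a distribution over reference pixels i.

    f_ref: (N_ref, C) reference-frame features, f_tgt: (N_tgt, C) target-frame features.
    """
    logits = f_ref @ f_tgt.t()        # (N_ref, N_tgt), inner products f_i^T f_j
    return F.softmax(logits, dim=0)   # normalize over the reference index i

def copy_colors(A, c_ref):
    """y_j = sum_i A_ij c_i: copy reference colors through the pointer.

    c_ref: (N_ref, D) reference colors (e.g. quantized color one-hots).
    """
    return A.t() @ c_ref              # (N_tgt, D) predicted target colors
```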

Now that the pointer is created, the error is calculated using the color of the Target frame as shown below.

$$\min _{\theta} \sum_{j} \mathcal{L}\left(y_{j}, c_{j}\right)$$

As a supplementary note, to simplify the problem this paper does not predict raw RGB values. Instead, colors are converted to Lab space and clustered into 16 clusters with k-means over the dataset, and the objective is formulated as classification over these quantized colors (cross-entropy loss).
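A rough sketch of this quantized-color objective might look as follows. The use of scikit-learn's `KMeans`, the dummy sample data, and the variable names are my assumptions; the actual paper quantizes the ab channels of Lab frames from the training set.

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

# Fit 16 color clusters on (a sample of) ab values taken from the dataset.
# `sample_ab` here is random dummy data purely for illustration.
sample_ab = np.random.rand(10000, 2).astype(np.float32)
kmeans = KMeans(n_clusters=16, n_init=10).fit(sample_ab)

def quantize(ab_pixels):
    """Map continuous ab colors, a (N, 2) numpy array, to one of 16 cluster ids."""
    return torch.from_numpy(kmeans.predict(ab_pixels)).long()

def colorization_loss(A, ref_ab, tgt_ab):
    """Cross-entropy between colors copied through the pointer A and the
    quantized ground-truth colors of the target frame.

    A: (N_ref, N_tgt) pointer; ref_ab, tgt_ab: (N, 2) numpy arrays of ab values.
    """
    ref_onehot = F.one_hot(quantize(ref_ab), num_classes=16).float()  # (N_ref, 16)
    probs = A.t() @ ref_onehot                                        # (N_tgt, 16)
    return F.nll_loss(torch.log(probs + 1e-8), quantize(tgt_ab))
```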

That is the self-supervised framework proposed in this paper. Deliberately converting to grayscale so that the original colors can serve as labels is a remarkably clever idea. Labels are created automatically by exploiting the assumption that the same part of the same object keeps the same color across (at least nearby) frames. From here, I will follow the history of self-supervised development starting from this paper.

Learning Correspondence from the Cycle-consistency of Time [2]

Next, I will introduce a paper from CVPR 2019, abbreviated CycleTime. Unlike colorization, this method generates supervision automatically using the idea of cycle consistency.

Figure 5: Cycle consistency

Simply put, __cycle consistency__ is the idea that __"if you go somewhere and come back, you should end up where you started"__. As shown in the figure above, the video is first played in reverse: the position at time $T-1$ is predicted from the object position at time $T$, and the position at time $T-2$ is predicted from the prediction at $T-1$. Then, playing forward again, the positions at times $T-1$ and $T$ are predicted starting from the prediction at time $T-2$. The object position originally specified at time $T$ and the position predicted after returning via forward playback should match; comparing the two and computing a loss from the difference is cycle consistency.

Let's take a closer look at the content of the paper.

Proposed method

Figure 6: Network diagram

Prediction is done patch by patch. That is, consider predicting where in the image at time $T-i+2$ a patch cut from the image at time $T-i+1$ corresponds. This is done by the network $\mathcal{T}$, so let's look inside it. $\mathcal{T}$ takes the image at time $T-i+2$, __$I_{T-i+2}$__, and the patch __$p_{T-i+1}$__ cut from the image at time $T-i+1$, passes both through a ResNet-based encoder, and extracts their feature maps __$x^{I}, x^{p}$__. Then, just as before, taking the inner product gives the similarity matrix $A(j, i)$:

$$A(j, i)=\frac{\exp \left(x^{I}(j)^{\top} x^{p}(i)\right)}{\sum_{j} \exp \left(x^{I}(j)^{\top} x^{p}(i)\right)}$$

However, this time we want not only the correspondence of colors but also the correspondence of position coordinates, so the coordinates need to be transformed.

Therefore, the matrix $A(j, i)$ is passed through a shallow network that outputs geometric transformation parameters $\theta$. Transforming the coordinates of $I_{T-i+2}$ according to $\theta$ then yields the predicted patch in $T-i+2$.

In the same way, $T-i+3$ is predicted from $T-i+2$, and repeating this process gives the forward-playback prediction over $i$ consecutive frames ($t-i$ to $t-1$), which can be written as:

$$\mathcal{T}^{(i)}\left(x_{t-i}^{I}, x^{p}\right)=\mathcal{T}\left(x_{t-1}^{I}, \mathcal{T}\left(x_{t-2}^{I}, \ldots \mathcal{T}\left(x_{t-i}^{I}, x^{p}\right)\right)\right)$$

Reverse playback works the same way:

$$\mathcal{T}^{(-i)}\left(x_{t-1}^{I}, x^{p}\right)=\mathcal{T}\left(x_{t-i}^{I}, \mathcal{T}\left(x_{t-i+1}^{I}, \ldots \mathcal{T}\left(x_{t-1}^{I}, x^{p}\right)\right)\right)$$

Combining these two expressions gives the cycle consistency loss:

$$\mathcal{L}_{long}^{i}=l_{\theta}\left(x_{t}^{p}, \mathcal{T}^{(i)}\left(x_{t-i+1}^{I}, \mathcal{T}^{(-i)}\left(x_{t-1}^{I}, x_{t}^{p}\right)\right)\right)$$

$l_{\theta}$ is a function that measures the coordinate deviation of the patch with MSE.

In addition to the position, the discrepancy between the patches' feature maps is also penalized:

$$\mathcal{L}_{sim}^{i}=-\left\langle x_{t}^{p}, \mathcal{T}\left(x_{t-i}^{I}, x_{t}^{p}\right)\right\rangle$$

That is the basic idea of cycle consistency.

However, as it stands this cannot handle cases like the following, where the object is hidden for a while and then becomes visible again.

Fig. 7: Cases where cycle consistency is difficult (left: the front of the face is not visible; right: the object temporarily leaves the frame)

As the figure shows, even when adjacent frames are hard to associate, distant frames can still be matched. Therefore, this paper includes not only chronologically adjacent frames but also predictions that skip $i$ frames:

$$\mathcal{L}_{skip}^{i}=l_{\theta}\left(x_{t}^{p}, \mathcal{T}\left(x_{t}^{I}, \mathcal{T}\left(x_{t-i}^{I}, x_{t}^{p}\right)\right)\right)$$

It has been a long derivation, but the final loss is the sum of the three losses above:

$$\mathcal{L}=\sum_{i=1}^{k} \mathcal{L}_{sim}^{i}+\lambda \mathcal{L}_{skip}^{i}+\lambda \mathcal{L}_{long}^{i}$$

The formulas have piled up, but I hope you now have a feel for cycle consistency.
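To summarize the structure, here is a conceptual sketch in code. It is not the authors' implementation: `tracker(frame_feat, patch_feat) -> (patch_feat, coords)` is a hypothetical stand-in for the network $\mathcal{T}$ (encoder, affinity, and geometric head), and the indexing and loss weight are illustrative.

```python
import torch
import torch.nn.functional as F

def cycle_time_loss(tracker, frame_feats, patch_t, coords_t, i, lam=0.1):
    """frame_feats: list of frame feature maps [..., x_{t-i}, ..., x_{t-1}, x_t].
    patch_t / coords_t: the query patch at time t and its known coordinates.
    """
    # backward in time: t -> t-1 -> ... -> t-i, then forward again back to t
    patch, coords = patch_t, coords_t
    for f in frame_feats[-2:-i - 2:-1]:        # x_{t-1}, x_{t-2}, ..., x_{t-i}
        patch, coords = tracker(f, patch)
    for f in frame_feats[-i:]:                 # x_{t-i+1}, ..., x_t
        patch, coords = tracker(f, patch)
    loss_long = F.mse_loss(coords, coords_t)   # the cycle should return to the start

    # direct jump t -> t-i for the similarity and skip-cycle losses
    p_back, _ = tracker(frame_feats[-i - 1], patch_t)
    loss_sim = -(p_back * patch_t).sum()       # feature agreement at t-i
    _, c_skip = tracker(frame_feats[-1], p_back)
    loss_skip = F.mse_loss(c_skip, coords_t)   # and straight back to t

    return loss_sim + lam * loss_skip + lam * loss_long
```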

Self-supervised Learning for Video Correspondence Flow [3]

In the previous two papers we saw two self-supervised methods: colorization and cycle consistency. Next is a paper that combines both, abbreviated CorrFlow.

Figure 8: Conceptual image of the algorithm

This paper is structured as follows: it lists the problems of the video colorization paper introduced at the beginning, then proposes methods to solve them. The problems raised are the following two points.

__Problem 1__: Because matching is performed after converting the color information to grayscale, valuable color information is thrown away.
__Problem 2__: As the prediction horizon grows longer, wrong predictions accumulate and the predictions drift.

With this in mind, let's look at the proposed method.

Proposed method

Problem 1

Converting a color image to grayscale before feeding it to the CNN is quite wasteful, because the RGB color information cannot be used for matching. On the other hand, some information bottleneck is necessary for self-supervised training to work at all.

Therefore, instead of simply converting to grayscale, this method creates the bottleneck by randomly zeroing out RGB channels (see the figure below). Perturbations to brightness and contrast are also added.

Figure 9: Improved bottleneck

This makes training possible while retaining some of the color information. Moreover, randomly zeroing a channel acts like dropout, and the brightness perturbation effectively performs data augmentation for free, so a considerable improvement in robustness can be expected compared to plain grayscale conversion. At test time things are simple: just feed in the RGB image as is.
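A minimal sketch of this input bottleneck (the zeroing probability and jitter strength are illustrative values, not the paper's):

```python
import torch

def color_bottleneck(frames, zero_prob=0.8, jitter=0.1):
    """frames: (T, 3, H, W) RGB clip in [0, 1]. Instead of full grayscale
    conversion, randomly zero one color channel and jitter brightness/contrast,
    so some color information survives for matching."""
    out = frames.clone()
    if torch.rand(1).item() < zero_prob:
        ch = torch.randint(0, 3, (1,)).item()
        out[:, ch] = 0.0                                               # drop one RGB channel for the whole clip
    out = out * (1.0 + jitter * (2 * torch.rand(1).item() - 1))        # brightness jitter
    mean = out.mean(dim=(2, 3), keepdim=True)
    out = (out - mean) * (1.0 + jitter * (2 * torch.rand(1).item() - 1)) + mean  # contrast jitter
    return out.clamp(0.0, 1.0)
```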

Problem 2

Naturally, prediction becomes difficult when the frames being compared are far apart in time, especially when there are occlusions or shape changes. Moreover, once a prediction is wrong, the next frame is predicted from it, so errors accumulate and the prediction steadily drifts. To address this long-term problem, the paper proposes that __the model's own predicted colors are sometimes used in place of the ground-truth colors__.

$$\hat{I}_{n}=\left\{\begin{array}{ll}\psi\left(A_{(n-1, n)}, I_{n-1}\right) & (1) \\ \psi\left(A_{(n-1, n)}, \hat{I}_{n-1}\right) & (2)\end{array}\right.$$

Equation (1) above predicts the color $\hat{I}_{n}$ at time $n$ from the image $I_{n-1}$ at time $n-1$.

Occasionally, as in (2), the previous prediction $\hat{I}_{n-1}$ is used for prediction instead of $I_{n-1}$. This teaches the model to recover the track from an imperfect state. The fraction of predicted colors used increases as training progresses. This idea is known as scheduled sampling and is widely used in seq2seq models.
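A small sketch of this scheduled sampling step (the linear schedule and the 0.5 cap are my own illustrative choices, not taken from the paper):

```python
import random

def pick_reference(true_prev, predicted_prev, step, total_steps):
    """Occasionally feed the model's own previous color prediction
    instead of the ground-truth previous frame."""
    p_use_prediction = 0.5 * min(1.0, step / total_steps)  # grows as training proceeds
    if random.random() < p_use_prediction:
        return predicted_prev    # case (2): bootstrap from the model's own prediction
    return true_prev             # case (1): use the real previous frame
```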

Finally, a cycle consistency constraint is also applied to the long-term prediction to further improve robustness. The final loss looks like this:

$$L=\alpha_{1} \cdot \sum_{i=1}^{n} \mathcal{L}_{1}\left(I_{i}, \hat{I}_{i}\right)+\alpha_{2} \cdot \sum_{j=n}^{1} \mathcal{L}_{2}\left(I_{j}, \hat{I}_{j}\right)$$

$\mathcal{L}_{1}$ and $\mathcal{L}_{2}$ are the color prediction errors in the forward and backward passes of the cycle, respectively. Note that this paper does not use the cycle consistency loss itself as the second paper did: colorization is the main objective, and cycle consistency is used as a regularizer to support long-term prediction.

Joint-task Self-supervised Learning for Temporal Correspondence [4]

The next paper is from NeurIPS 2019, abbreviated UVC. (I could not find out what the name stands for, so please let me know if you do.) The distinctive feature of this paper is that a bounding box prediction is inserted before the pixel-level matching is performed. See the figure below.

Figure 10: Comparison of matching at the pixel level (bottom) and at the box level (top)

Figure (b) shows matching done only at the pixel level, as in the previous papers. Taking the yellow line as an example, there are two people wearing red clothes, so the correspondence goes wrong. As this shows, pixel-level matching is effective for capturing small changes in an object, but it is poorly suited for matching on the __semantic aspects of an object (invariance to rotation and viewpoint)__. Bounding box detection has the opposite characteristics, so the two are considered complementary. The idea of the paper is therefore: why not first detect the box region, as in figure (a), and then perform pixel-level matching within that region?

Proposed method

Figure 11: Network diagram

The first half of the figure detects the box region (region-level localization), and the second half performs pixel-level matching within the detected region (fine-grained matching).

Region-level localization

The goal is to find the bbox in the Target frame that corresponds to a patch cut from the Reference frame. As before, both are converted to grayscale and passed through a CNN, and the now-familiar similarity matrix is computed:

$A_{i j}=\frac{\exp \left(f_{1 i}^{\top} f_{2 j}\right)}{\sum_{k} \exp \left(f_{1 k}^{\top} f_{2 j}\right)}, \quad \forall i \in\left[1, N_{1}\right], j \in\left[1, N_{2}\right]$

Up to this point the procedure is similar to that of the second paper (CycleTime). CycleTime passed this matrix through an additional network to regress the position coordinates. In this paper, however, $A_{ij}$ is assumed to be nearly sparse (each column has a 1 at the single corresponding pixel and 0 elsewhere), and the position coordinates are transformed directly by the following formula:

$l_{j}^{12}=\sum_{k=1}^{N_{1}} l_{k}^{11} A_{k j}, \quad \forall j \in\left[1, N_{2}\right]$

Here $l_{j}^{mn}$ is the coordinate in image $m$ that moves to the $j$-th pixel of image $n$ (see supplementary article 2).

With this formula you can find out which pixel coordinates of $p_{1}$ move to each pixel of $f_{2}$. Conversely, computing $l_{j}^{21}$ tells you where each pixel of $p_{1}$ ends up in $f_{2}$. Averaging $l_{j}^{21}$ then gives the bounding box center $C^{21}$:

$$C^{21}=\frac{1}{N_{1}} \sum_{i=1}^{N_{1}} l_{i}^{21}$$

Having estimated the center of the bounding box, let's define its size as well. The width $w$ and height $h$ are simply defined by the average deviation of each coordinate of $l_{j}^{21}$ from the center $C^{21}$:

$$\hat{w}=\frac{2}{N_{1}} \sum_{i=1}^{N_{1}}\left|x_{i}-C^{21}(x)\right|_{1}$$

With the above, the bbox in $f_{2}$ has been estimated. The feature map $p_{2}$, obtained by cropping $f_{2}$ with this bbox, is used in the next step, fine-grained matching.
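Putting the region-level localization step together, a rough sketch might look like this (the shapes, names, and the direction of the softmax normalization are my assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

def localize_bbox(p1, f2, coords2):
    """p1: (N1, C) features of the patch from the reference frame.
    f2: (N2, C) features of the full target frame.
    coords2: (N2, 2) (x, y) coordinates of the target feature map.
    Returns the estimated bbox center C^{21}, its size (w_hat, h_hat),
    and the tracked coordinates l^{21}."""
    A = F.softmax(p1 @ f2.t(), dim=1)                # each patch pixel -> distribution over target pixels
    l_21 = A @ coords2                               # expected landing coordinate of each patch pixel
    center = l_21.mean(dim=0)                        # C^{21}
    size = 2.0 * (l_21 - center).abs().mean(dim=0)   # (w_hat, h_hat), mean absolute deviation
    return center, size, l_21
```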

Fine-grained matching

Figure 12: Network (fine-grained matching part)

From here, as in the first paper, the similarity matrix $A_{pp}$ between $p_{1}$ and $p_{2}$ is computed and the color is restored. In this paper, however, the color is predicted with an encoder-decoder instead of copying colors directly through the pointer (see the figure). The claimed advantage is that using the CNN embedding rather than $A_{pp}$ directly allows more global contextual information to be exploited.

Loss

There are three main losses. The first is a loss on whether the color restoration is correct (the formula is not given in the paper, but judging from the implementation it is an L1 loss). The second is a constraint on the bbox prediction called concentration regularization. It encourages the tracked pixels to stay close together, on the assumption that pixels inside the bbox remain close to each other as they move (see the left side of the figure below).

Figure 13: Illustration of the two regularizations used in the loss

$$L_{c}=\left\{\begin{array}{ll}0, & \left\|l_{j}^{21}(x)-C^{21}(x)\right\|_{1} \leq w \text { and }\left\|l_{j}^{21}(y)-C^{21}(y)\right\|_{1} \leq h \\ \frac{1}{N_{1}} \sum_{j=1}^{N_{1}}\left\|l_{j}^{21}-C^{21}\right\|_{2}, & \text { otherwise }\end{array}\right.$$

By penalizing corresponding points that fall outside the bbox, it prevents isolated pixels from being matched to a completely different place.
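One possible reading of this regularizer in code, reusing `l_21`, `center`, and `size` from the localization sketch above (the exact case handling may differ from the paper's implementation):

```python
import torch

def concentration_loss(l_21, center, w, h):
    """Penalize tracked pixels that land outside the estimated bbox.
    l_21: (N1, 2) tracked (x, y) coords, center: (2,), w/h: bbox extent."""
    dx = (l_21[:, 0] - center[0]).abs()
    dy = (l_21[:, 1] - center[1]).abs()
    outside = (dx > w) | (dy > h)                    # pixels that drifted out of the box
    if not outside.any():
        return l_21.new_zeros(())
    return torch.norm(l_21[outside] - center, dim=1).mean()
```

It would be called with the outputs of `localize_bbox`, e.g. `concentration_loss(l_21, center, size[0], size[1])`.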

The third is a constraint called orthogonal regularization. This is essentially cycle consistency between two frames: the idea is that mapping from frame 1 to frame 2 and then from frame 2 back to frame 1 should recover the original. As explained in the bbox section, the inter-frame coordinates $l$ and feature maps $f$ satisfy the following relations:

$$\hat{l}^{12}=l^{11} A_{1 \rightarrow 2}, \quad \hat{l}^{11}=\hat{l}^{12} A_{2 \rightarrow 1}$$
$$\hat{f}_{2}=f_{1} A_{1 \rightarrow 2}, \quad \hat{f}_{1}=\hat{f}_{2} A_{2 \rightarrow 1}$$

From these we can see that $A_{1 \rightarrow 2}^{-1}=A_{2 \rightarrow 1}$ must hold for cycle consistency to be satisfied.

Now, if the pixel correspondence is one-to-one, then $f_{1} f_{1}^{\top}=f_{2} f_{2}^{\top}$, i.e. we can assume that the total amount of color (the color energy) is unchanged (see supplementary article 3).

Using this, if $A$ is orthogonal, that is $A_{2 \rightarrow 1}=A_{1 \rightarrow 2}^{-1}=A_{1 \rightarrow 2}^{\top}$, then cycle consistency holds.

Derivation
$$\hat{f}_{2}=f_{1} A_{1 \rightarrow 2}$$

Transposing both sides gives $f_{2}^{\top}=\left(f_{1} A_{1 \rightarrow 2}\right)^{\top}=A_{1 \rightarrow 2}^{\top} f_{1}^{\top}$, and therefore $f_{2} f_{2}^{\top}=f_{1} A_{1 \rightarrow 2} A_{1 \rightarrow 2}^{\top} f_{1}^{\top}$. Since this must satisfy $f_{1} f_{1}^{\top}=f_{2} f_{2}^{\top}$, it follows that $A_{1 \rightarrow 2}^{-1}=A_{1 \rightarrow 2}^{\top}$.

Therefore, the cycle consistency loss can be computed simply as the MSE between $f_{1}$ and its round trip $f_{1} A_{1 \rightarrow 2} A_{1 \rightarrow 2}^{\top}$. This is orthogonal regularization. The same computation is of course applied to the coordinates $l$ as well.
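Finally, a sketch of orthogonal regularization (pixels are stored as rows here, which is my own convention; adjust it to match the actual feature layout):

```python
import torch
import torch.nn.functional as F

def orthogonal_loss(A_12, f1, l1):
    """If A_{2->1} = A_{1->2}^T, mapping 1 -> 2 -> 1 should be the identity.
    A_12: (N1, N2) affinity, f1: (N1, C) features, l1: (N1, 2) coordinates."""
    f_cycle = A_12 @ (A_12.t() @ f1)   # frame 1 -> frame 2 -> back to frame 1
    l_cycle = A_12 @ (A_12.t() @ l1)
    return F.mse_loss(f_cycle, f1) + F.mse_loss(l_cycle, l1)
```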

Recognition examples (from the authors' GitHub)

Summary

So far, I have introduced four papers on self-supervised learning for dense tracking.

These four papers, being closely related studies, have much in common, but I hope you can see how the dense tracking methods improve little by little.

As a personal reflection, I am afraid that the formula-heavy explanations may have made this the most difficult article I have written so far. (Please comment if you have any suggestions.)

Still, enabling label-free learning through ideas like colorizing grayscale frames or replaying video forward and backward is genuinely interesting and appealing, and I hope readers feel the same.

Part 2 is coming soon, so please look out for it.


  1. Vondrick, C., Shrivastava, A., Fathi, A., Guadarrama, S., & Murphy, K. (2018). Tracking emerges by colorizing videos. Lecture Notes in Computer Science, 11217 LNCS, 402–419. https://doi.org/10.1007/978-3-030-01261-8_24

  2. Wang, X., Jabri, A., & Efros, A. A. (2019). Learning Correspondence from the Cycle-consistency of Time. Retrieved from https://arxiv.org/abs/1903.07593

  3. Lai, Z. (2019). Self-supervised Learning for Video Correspondence Flow. Retrieved from https://arxiv.org/abs/1905.00875

  4. Li, X., Liu, S., De Mello, S., Wang, X., Kautz, J., & Yang, M.-H. (2019). Joint-task Self-supervised Learning for Temporal Correspondence. Retrieved from https://arxiv.org/abs/1909.11895
