motivation

Wouldn't it be nice if you could make a video (2D image) 3D? You can also use it for robotics and autonomous driving. You can also enjoy the muffled videos in 3D. Dreams and so on swell.

Introduction. What is the content of this article?

This article introduces ** Unsupervised Monocular Depth Learning in Dynamic Scenes ** (https://arxiv.org/abs/2010.16404), a self-supervised monocular depth and motion estimation method proposed in 2020.

To put it plainly, it is a deep learning method that estimates 3D (Depth) and the motion of an object from an ordinary (2D) video without the need for humans to prepare correct answer data **.

Summary

You can learn depth and motion with just ordinary 2D video. It's amazing.
You don't need to know the characteristics of the camera (viewing angle, etc.). It will be estimated collectively. It's amazing.
The shooting target may be moving as well as the camera itself. Global motion (camera movement) and motion of the shooting target can be estimated separately. It's amazing.
Difficult to learn. It requires learning techniques to avoid inappropriate local optimization and score goals.

What is self-supervised learning in the first place?

** It is an attractive method that generates correct answer data from the data itself and does not label or annotate it by humans **. There is already a great article, so please refer to another article for details. https://qiita.com/omiita/items/a7429ec42e4eef4b6a4d

The method introduced this time solves the problem of estimating the image of the second frame (correct answer) from the image of the first frame by using ** images of consecutive frames, and in the process, the target image An approach ** that attempts to predict 3D coordinates (xyz) and motion.

Model overview (during inference)

Again, this method solves the problem of estimating the image of the second frame (correct answer) from the image of the first frame by using the images of consecutive frames, and in the process, the target image The purpose is to predict 3D coordinates (xyz) and motion.

The learning phase is a bit complicated, so let's first look at an easy-to-understand inference model. The overall configuration is as shown in the figure below.

As shown in the figure, it is formed by two major networks, Depth Network and Motion Network. ** Depth Network estimates the depth from the image, and Motion Network estimates the motion of the camera or object, and camera parameters using the generated Depth image and 2 frames of RGB image. ** If you have information on camera parameters (viewing angle, etc.) and a Depth image, you can express the target three-dimensionally (in xyz).

The input / output of each network is roughly summarized below.

(1) Depth Network -** Input: ** RGB image (3ch)

-** Output: ** Depth image (1ch)

Overview: U-Net like architecture. The treatise is based on ResNet18. Use Random Normalization instead of Batch Normalization. (2) Motion Network -** Input: ** Reference & next frame RGBD. 8ch in total. RGB image of reference frame & Depth image of reference frame generated by DepthNet & RGB image of the next frame & Depth image of the next frame generated by DepthNet

-** Output @ Bottleneck part: ** Camera motion (XYZ translation and XYZ Euler angles, 6 parameters in total) Camera matrix (focal length and shooting center in each direction of image height & width. 4 parameters in total)

-** Output @ Decoder part: ** Motion motion (xyz vector 3ch motion for each pixel in the image)

Overview: Also U-Net like architecture. In the paper, the Conv of the Decoder part is branched and increased based on FlowNet. The features estimated from the entire image are estimated from the bottleneck part, and the pixel-level motion estimation (residual motion) is estimated while increasing the resolution with the decoder.

Model overview (during learning)

At the time of training, based on the information obtained from the above model, ** create a warp image that projects the image of the second frame onto the positional relationship of the original image **. In other words, it logically derives something like "Given the inferred camera motion and the movement of the object, it would look like this in the original frame." We will train this warp image to match the original image.

In addition to that, multiple constraints are given for learning. The configuration is as follows. It's a little complicated. I will briefly explain Loss.

photometric consistency loss ** Loss trying to match the warp image with the next frame image reconstructed to the original frame with the original image. ** ** The sum of L1Loss based on the RGB difference between the warp image and the original frame image, and SSIM-based Loss with image similarity.
motion consistency loss ** Loss that forces consistency so that the predicted motion (rotation and translation) from the original frame ① to the next frame ② is opposite to the predicted motion (rotation and translation) from the next frame ② to the original frame ①. ** **
depth smoothness regularization Based on the assumption that there is no steep depth change in the part where the color change is small (non-edge), ** Loss that suppresses the large depth change in the part where the RGB change is small. ** **
motion regularization L1 smoothness loss that suppresses sudden changes in the ** motion prediction value of adjacent pixels, and L1/2 sparsity loss ** that makes the motion prediction value sparse.

Motion regularization is a bit tricky, but you'll prefer flat peaks while gently suppressing changes in the motion prediction heatmaps. I am aware that the speed of an object does not change from place to place.

Prediction results (excerpt from the treatise)

The figure below is an example of inference results for each data set. From left to right, the original image, the inference depth image (disparity, 1/depth), and the inference motion image. Sometimes it's weird, but it looks pretty neat and clear. You can also predict camera parameters, so you can learn using Youtube videos.

What's amazing after all?

Subjective, I like the following:

You can estimate the depth of one video without a teacher!

Image ⇒ I think the easiest way to learn depth is to acquire data as a set of two and perform image to image supervised learning. The fashionable point of this method is that you do not need correct answer data.

No camera parameters required!

If you approach it normally, you will want to make the depth data three-dimensional using information such as the viewing angle of the camera, but even that is a bold guess.

You can estimate XYZ motion, not pixel movement!

It is not the movement of pixels on two axes, up, down, left, and right. You can estimate the movement of the three axes of XYZ. (Different from Optical Flow)

Camera motion and object motion can be estimated separately!

You can estimate the motion caused by the camera and the motion caused by the moving object separately. When you want to know the movement of an object, you can eliminate the effect of the camera making it appear to move.

Compared to past methods ...

One year before this method, I published Thesis Depth from videos in the wild. In the previous paper, by masking the moving object with a pre-learned object detection model, it was possible to separate the moving object from the camera motion. (Excerpt from the figure below and the treatise)

On the other hand, the method introduced this time eliminates the need for object detection of moving objects by giving motion-related regularization and assumptions. It's simpler as a model.

For the genealogy of monocular depth, DeNA Miyazawa's materials are insanely rich, so please refer to it. https://www.slideshare.net/KazuyukiMiyazawa/depth-from-videos-in-the-wild-unsupervised-monocular-depth-learning-from-unknown-cameras-167891145

I actually tried it.

While referring to Github author implementation, I tried to ** create and learn a model by myself **. ** The following is a little maniac, so if you are not interested, please skip it. ** **

result

I will attach some inference results of the model after 80 epoch training with about 10,000 KITTI tracking data. ** The top three are relatively good examples, and the bottom three are slightly disappointing examples. ** Well, after all, it's a monocular estimation, so I can't read the depth completely, but I think it's a good line.

Even if you misread the depth, you may be able to match the tsuji with motion, so strange inference results will come out. The road sign on the lower left is a typical example, and the depth is strange by all means. Since I misread the depth, the road sign will be distorted in the next frame if only the movement of the camera is considered, but by thinking that the road sign is moving, I forcibly adjust the tsuji. The loss at this point is 0.02 for motion regularization, 0.0007 for depth smooth, and 0.52 for cycle loss of rgb & motion. The ssim loss of image similarity increases by far.

Implementation points

To be honest, I struggled quite a bit until I was able to get the results shown above.

Model building itself shouldn't be too difficult if you're used to processing depth images. However, I got the impression that it is quite difficult to balance learning ** because there are so many estimation targets that they influence each other to produce the final warp image. Personally, learning was more difficult than modeling.

** There are local optimizations everywhere during learning, so if you don't think about it, you'll quickly get stuck in a hole. ** ** I don't think it's fun to read, but there are some addictive points, for example:

** 1. Addicted by excessive camera movement (depth: small, motion: large) ** Since the target is close to the camera and the camera has moved so much, there is no place to lose between frames and it is stable. As a countermeasure, forcibly give motion restrictions or devise to reduce the initial value of learning. It is better to have a proper gradient limit (clip) at the time of learning.

** 2. I'm addicted to matching Tsuji with all motions (Motion: Large) ** Even if a distant object moves a lot or a near object moves a little, the change in appearance from the camera is the same, so there is a tsuji and it is stable. As a countermeasure, freeze the estimation layer of moving object motion halfway. At the beginning, I'll do my best with only depth and camera motion, and add moving motion at the end.

** 3. It's better if it doesn't move, and I'm addicted to it (motion: under) ** A neat stability point that it is better to eliminate motion if it moves poorly and the shape of the image collapses. If the initial value is too small or the gradient limit at the time of learning is too strong, the model will be lazy and learning will not progress, so it is necessary to give moderate randomness.

I feel like I'm addicted to many other things. It may be better to give appropriate constraints such as maximum motion depending on the application and video. I really respect the author.

at the end

Last year, shortly after I first started using Python, I saw a commentary on Depth form videos in the wild at Nikkei Robotics, and was impressed by the deep learning ghosts. At that time, the implementation ability was low and I couldn't do anything about it, but I was able to implement the same series of treatises in about a year after that, so I could feel a little growth.

However, I haven't studied enough yet, so please point out any mistakes. Thank you for reading.

[PYTHON] Delusion of 3D from 2D video [Unsupervised Monocular Depth Learning in Dynamic Scenes]