Overview

How is everyone doing? The new coronavirus epidemic seems to have subsided a little, and many people are gradually returning to work or school. This time, I would like to focus once again on a reconstruction task that uses a generative model. In particular, **I will try to reconstruct video.** (**What you can take away from this article is the experimental results and discussion of a reconstruction method based on encoder-decoder 3d-convolution, which can be extended to anomaly detection in video.** I will not go into the theoretical background, such as mathematical formulas.) All of the implementation is available here. **It is implemented in PyTorch.**

In the previous article, "Verification and implementation of video reconstruction method using GRU and Autoencoder", I considered the following model.

[Figure: GRU-AE model from the previous article (図4.png)]

The reason I considered this model is that it lets the latent variables be expressed as series data. In other words, 3d-conv encodes an entire video into a single latent variable, and isn't that asking too much? That was the motivation. In fact, some papers mention that it is difficult to "reconstruct" video with 3d-conv. (On the other hand, recent CVPR papers on video recognition expect 3d-conv to do well when a large amount of data is available, but those are discriminative models, so that is a different field from this generation task.)

Still, I wanted to confirm experimentally whether reconstruction with 3d-conv works. **It works for video recognition, but does it really work for reconstruction?** Although I more or less understood it in theory, I had always wondered. That is why I wrote this article. Let's start right away with an explanation of the model.

To allow a comparison with the GRU-based method, some of this overlaps with the previous article. Please bear with me.

Video reconstruction model

The model to be implemented this time is shown below.

$\boldsymbol{x_1, x_2, ..., x_T}$ denotes a video of length T, and $\boldsymbol{x_t}$ is each frame. The 3D-CNN encoder receives the video and maps it to a single point $\boldsymbol{z_T}$ in the latent space. The decoder then maps this point back to the observation space. As a reconstruction task, the parameters are optimized to minimize the difference between input and output.
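Written out, the procedure above could look like the following. The symbols $E_\phi$ (3D-CNN encoder), $D_\theta$ (decoder) and $\hat{\boldsymbol{x}}_t$ (reconstructed frame) are my own notation, and the squared-error form is assumed because MSE is the loss used later in this article (it matches a mean over all elements up to a constant factor).

$$
\boldsymbol{z_T} = E_\phi(\boldsymbol{x_1, x_2, ..., x_T}), \qquad
(\hat{\boldsymbol{x}}_1, ..., \hat{\boldsymbol{x}}_T) = D_\theta(\boldsymbol{z_T}), \qquad
\mathcal{L}(\phi, \theta) = \frac{1}{T} \sum_{t=1}^{T} \lVert \boldsymbol{x_t} - \hat{\boldsymbol{x}}_t \rVert_2^2
$$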

I think the 3D-CNN approach is very simple and easy to understand. **3D convolution can extract temporal and spatial features at once**, without going through a time-series model. There are other articles that explain how 3D-CNNs work, so I will leave the details to them lol
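To make "time and space at once" concrete, here is a minimal sketch of a single `nn.Conv3d` layer applied to a batch of clips. The kernel size, stride and padding match the encoder layers in `network.py`; the clip length of 16 frames and the 96x96 frame size are illustrative values I picked, not settings confirmed by the article.

```python
import torch
import torch.nn as nn

# A batch of 2 clips laid out as (batch, channels, frames T, height, width).
# T=16 and 96x96 frames are illustrative values (assumptions).
video = torch.randn(2, 3, 16, 96, 96)

# Kernel 4, stride 2, padding 1 applied to the time axis and both spatial axes.
conv = nn.Conv3d(in_channels=3, out_channels=64,
                 kernel_size=4, stride=2, padding=1, bias=False)

features = conv(video)
print(tuple(features.shape))  # (2, 64, 8, 48, 48): T, H and W are all halved
```

Because the kernel slides over the frame axis as well, one layer mixes neighbouring frames and neighbouring pixels in the same operation, which is why no separate time-series model is needed.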

Model training / verification

**The flow of reconstruction is as follows.**

1. Prepare a human action dataset
2. Train the 3D-CNN Autoencoder
3. Reconstruct the video using the model trained in step 2

1. Human action dataset

I use the familiar human action dataset. This data was used for verification in a video generation model called MoCoGAN, and as the name suggests, it contains footage of people walking and waving.

You can download it from here. (The above image is also quoted from the data in this link.)
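Before clips like these can be fed to a 3D convolution, the frames have to be arranged as a (channels, frames, height, width) tensor. The following is a minimal sketch, assuming the frames arrive as a uint8 tensor of shape (T, H, W, C) and are scaled to [-1, 1] to match the Tanh output of the decoder. The helper name `frames_to_clip` and this preprocessing are my assumptions and may differ from what the repository actually does.

```python
import torch


def frames_to_clip(frames):
    """Convert uint8 frames (T, H, W, C) in [0, 255] to a float clip (C, T, H, W) in [-1, 1]."""
    clip = frames.permute(3, 0, 1, 2).float()  # channels first, then time
    return clip / 127.5 - 1.0                  # 0 -> -1.0, 255 -> 1.0


# Random frames stand in for a real clip (16 frames of 96x96 RGB, assumed sizes).
frames = torch.randint(0, 256, (16, 96, 96, 3), dtype=torch.uint8)
clip = frames_to_clip(frames)
print(tuple(clip.shape))  # (3, 16, 96, 96)
```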

2. Training the 3D-CNN Autoencoder

Next, we train the model using the above data. The loss function is MSE, which directly minimizes the error between input and output. For more information on the model, please see here. Below is the implementation of the model.

`network.py`



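import torch.nn as nn


class ThreeD_conv(nn.Module):
    """3D-CNN autoencoder: Conv3d encoder -> fc1 -> latent z -> fc2 -> ConvTranspose3d decoder."""

    def __init__(self, opt, ndf=64, ngpu=1):
        super(ThreeD_conv, self).__init__()
        self.ngpu = ngpu
        self.ndf = ndf                      # base number of feature maps
        self.z_dim = opt.z_dim              # size of the latent vector
        self.T = opt.T                      # number of frames per clip
        self.image_size = opt.image_size
        self.n_channels = opt.n_channels
        # Four stride-2 convolutions shrink each spatial side by a factor of 16.
        self.conv_size = opt.image_size // 16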

        self.encoder = nn.Sequential(
            nn.Conv3d(opt.n_channels, ndf, 4, 2, 1, bias=False),
            nn.BatchNorm3d(ndf),
            nn.ReLU(inplace=True),
            nn.Conv3d(ndf, ndf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm3d(ndf * 2),
            nn.ReLU(inplace=True),
            nn.Conv3d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm3d(ndf * 4),
            nn.ReLU(inplace=True),
            nn.Conv3d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
            nn.BatchNorm3d(ndf * 8),
            nn.ReLU(inplace=True),
        )
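        # Flattened encoder output: channels * (T // 16) * conv_size * conv_size.
        # The original note "6*6" corresponds to conv_size = 6 (e.g. 96x96 frames).
        # Integer division keeps the size correct when T is not a multiple of 16.
        self.fc1 = nn.Sequential(
            nn.Linear((ndf * 8) * (self.T // 16) * self.conv_size * self.conv_size, self.z_dim),
            nn.ReLU(inplace=True),
        )
        self.fc2 = nn.Sequential(
            nn.Linear(self.z_dim, (ndf * 8) * (self.T // 16) * self.conv_size * self.conv_size),
            nn.ReLU(inplace=True),
        )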
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d((ndf*8), ndf*4, 4, 2, 1, bias=False),
            nn.BatchNorm3d(ndf * 4),
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(ndf*4, ndf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm3d(ndf * 2),
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(ndf * 2, ndf , 4, 2, 1, bias=False),
            nn.BatchNorm3d(ndf),
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(ndf , opt.n_channels, 4, 2, 1, bias=False),
            nn.Tanh(),
        )
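The listing above shows only the constructor, so the way the pieces are chained is not visible here. The following is a minimal sketch of a forward pass for this kind of architecture, not the repository's actual code. `TinyConv3dAE` is a hypothetical, slimmed-down stand-in (smaller `ndf`, default sizes chosen by me) that keeps the same layer pattern: four stride-2 `Conv3d` blocks, two fully connected layers around the latent vector, and four `ConvTranspose3d` blocks ending in Tanh.

```python
import torch
import torch.nn as nn


class TinyConv3dAE(nn.Module):
    """Hypothetical slimmed-down 3D-CNN autoencoder with an explicit forward pass."""

    def __init__(self, n_channels=3, T=16, image_size=32, z_dim=64, ndf=8):
        super().__init__()
        widths = [n_channels, ndf, ndf * 2, ndf * 4, ndf * 8]

        enc = []
        for c_in, c_out in zip(widths[:-1], widths[1:]):
            enc += [nn.Conv3d(c_in, c_out, 4, 2, 1, bias=False),
                    nn.BatchNorm3d(c_out), nn.ReLU(inplace=True)]
        self.encoder = nn.Sequential(*enc)

        # Four stride-2 convolutions shrink T, H and W by a factor of 16.
        self.feat_shape = (ndf * 8, T // 16, image_size // 16, image_size // 16)
        n_feat = ndf * 8 * (T // 16) * (image_size // 16) ** 2
        self.fc1 = nn.Sequential(nn.Linear(n_feat, z_dim), nn.ReLU(inplace=True))
        self.fc2 = nn.Sequential(nn.Linear(z_dim, n_feat), nn.ReLU(inplace=True))

        rev = widths[::-1]  # [ndf*8, ndf*4, ndf*2, ndf, n_channels]
        dec = []
        for i, (c_in, c_out) in enumerate(zip(rev[:-1], rev[1:])):
            dec.append(nn.ConvTranspose3d(c_in, c_out, 4, 2, 1, bias=False))
            if i < len(rev) - 2:  # no BatchNorm/ReLU after the last layer
                dec += [nn.BatchNorm3d(c_out), nn.ReLU(inplace=True)]
        dec.append(nn.Tanh())
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        b = x.size(0)
        h = self.encoder(x)                           # (B, ndf*8, T/16, H/16, W/16)
        z = self.fc1(h.reshape(b, -1))                # (B, z_dim): one point per clip
        h = self.fc2(z).reshape(b, *self.feat_shape)  # back to a feature volume
        return self.decoder(h), z                     # reconstruction in [-1, 1]


model = TinyConv3dAE()
x = torch.randn(2, 3, 16, 32, 32)
with torch.no_grad():
    recon, z = model(x)
print(tuple(recon.shape), tuple(z.shape))  # (2, 3, 16, 32, 32) (2, 64)
```

The point to notice is the flatten and reshape around `fc1` and `fc2`: the whole clip passes through a single `z_dim`-sized vector, which is exactly the bottleneck this article is questioning.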

Training ran for 5,000 iterations, and the loss changed as shown below. It fell to almost 0 in the second half and changed little after that, so my impression is that training converged without trouble.
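As a rough picture of the training loop described here, the following is a minimal sketch. Only the MSE loss and the use of the input as its own target come from this article. The tiny two-layer model, the Adam optimizer, the learning rate, the random data and the 20 iterations are stand-ins I chose so that the example runs quickly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the autoencoder: any module that maps
# (B, C, T, H, W) -> (B, C, T, H, W) would fit here.
model = nn.Sequential(
    nn.Conv3d(3, 8, 4, 2, 1), nn.ReLU(inplace=True),
    nn.ConvTranspose3d(8, 3, 4, 2, 1), nn.Tanh(),
)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # optimizer and lr are assumptions

# Random clips in [-1, 1] stand in for the real dataset.
videos = torch.rand(4, 3, 8, 16, 16) * 2 - 1

losses = []
for itr in range(20):                 # the article trains for 5,000 iterations
    optimizer.zero_grad()
    recon = model(videos)
    loss = criterion(recon, videos)   # the input is also the target
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```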

3. Reconstruction of video by model

Run inference using the model. In the above implementation, the reconstruction results are saved to generated_videos in the logs folder at each checkpoint specified by the argument. As training progressed, the model behaved as follows. For each iteration count, the upper row is the input and the lower row is the output.
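The "input on top, output below" layout can be built with a couple of tensor operations. This is a minimal sketch under my own assumptions, not the repository's saving code: the helper name `make_comparison_grid` is hypothetical, and clips are assumed to be (C, T, H, W) tensors in the Tanh range [-1, 1].

```python
import torch


def make_comparison_grid(inputs, outputs):
    """Lay out frames side by side, input clip on the top row and output clip below.

    inputs, outputs: (C, T, H, W) tensors in [-1, 1].
    Returns a (C, 2*H, T*W) image tensor scaled to [0, 1].
    """
    c, t, h, w = inputs.shape

    def to_row(clip):
        # (C, T, H, W) -> (C, H, T, W) -> (C, H, T*W): frames side by side
        return clip.permute(0, 2, 1, 3).reshape(c, h, t * w)

    grid = torch.cat([to_row(inputs), to_row(outputs)], dim=1)  # stack rows along height
    return ((grid + 1) / 2).clamp(0, 1)


# Random tensors stand in for a real clip and its reconstruction.
inputs = torch.rand(3, 16, 96, 96) * 2 - 1
outputs = torch.rand(3, 16, 96, 96) * 2 - 1
grid = make_comparison_grid(inputs, outputs)
print(tuple(grid.shape))  # (3, 192, 1536)
```

A tensor in this form can be written to disk as a single image, for example with `torchvision.utils.save_image`, if torchvision is available.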

- Iteration 0: Of course, nothing is reconstructed at all.

- Iteration 1,000: It is blurry, but a human shape is visible.

- Iteration 4,000: It is a little clearer, but blurring persists, and fine details such as the person's hands are not reproduced.

Next, let's compare these results with GRU-AE. The following are reconstructions by GRU-AE, the method from the previous article, under the same conditions as above. Iteration 0 is omitted.

- Iteration 500: My impression is that it is not too bad. Is it going well?

- Iteration 1,500: Oh, this is looking good.

- Iteration 4,000: For a moment I could not tell which one was real. If you look closely it may be blurry, though lol
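The comparison above is purely visual. One way to put a number on it would be the mean squared error of each frame, which is also the quantity typically thresholded in reconstruction-based anomaly detection. The following is a minimal sketch with a hypothetical helper `per_frame_mse`. No such measurement was made in this article, and the tensors below are synthetic.

```python
import torch


def per_frame_mse(inputs, outputs):
    """Mean squared error per frame for clips shaped (B, C, T, H, W); returns (B, T)."""
    return ((inputs - outputs) ** 2).mean(dim=(1, 3, 4))


# Synthetic example: a perfect reconstruction except for one frame.
inputs = torch.zeros(1, 3, 4, 8, 8)
outputs = inputs.clone()
outputs[:, :, 2] = 0.5                 # frame index 2 is off by 0.5 everywhere
errors = per_frame_mse(inputs, outputs)
print(errors)                          # only frame 2 has a non-zero error (0.25)
```

Running the same function on the outputs of two models for the same input clips would give a per-frame score to compare, with lower values meaning a closer reconstruction.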

Summary

This time, I tried reconstructing video with 3DCNN-AE. As expected, the generated video is not good. It is not that the motion cannot be reproduced, but each frame is less sharp than with GRU-AE. Many anomaly detection papers point out problems with 3D-CNN, and this time I was able to confirm that experimentally. On the other hand, **3D-CNN is a rising star in recognition tasks.** In settings where a large amount of data can be collected, it seems to be favored for video recognition over 2D-based approaches such as GRU. The generation task is different, though. There is little data, and it seems that the "strong features" obtained with supervised learning cannot be acquired. The day when 3D-conv becomes the favorite for video anomaly detection still seems a way off... Thank you for reading to the end.

[PYTHON] Reconstruction of moving images by Autoencoder using 3D-CNN