[Python] Verification and implementation of a video reconstruction method using a GRU and an autoencoder

Overview

How is everyone doing? With the spread of coronavirus infections, many of you are probably working or doing research from home. For the past few months I've been staring at my PC at home every day. I'm honestly exhausted lol

This time, I'd like to focus on reconstruction tasks that use generative models. In particular, I'll try to reconstruct **videos**. (What this article covers: experimental results and discussion of a reconstruction method that inserts a time-series model into an encoder-decoder pipeline, which can be extended to video anomaly detection. I won't go into the theoretical background such as the mathematics.)

What is often applied in the field of **anomaly detection** is the reconstruction of **images**. This method computes an anomaly score by feeding an image into an encoder-decoder model, reconstructing it, and taking the difference between input and output. The VAE is particularly well known as a generative model usable for image reconstruction, and recently GAN-based anomaly detection methods (AnoGAN, EfficientGAN, etc.) have also appeared. The following is an example of AnoGAN.
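The reconstruction-based anomaly score described above can be sketched as follows. This is a minimal illustration, not the author's code: `model` stands in for any trained encoder-decoder network, and averaging the squared pixel-wise error is one common choice of score.

```python
import torch

def anomaly_score(model, x):
    """Per-sample anomaly score: mean squared reconstruction error.

    `model` is a placeholder for any trained encoder-decoder network;
    x has shape (batch, channels, height, width).
    """
    with torch.no_grad():
        x_hat = model(x)  # reconstruct the input
    # Average the squared pixel-wise error over all but the batch dimension
    return ((x - x_hat) ** 2).flatten(1).mean(dim=1)
```

A high score means the model reconstructed the input poorly, which for a model trained only on normal data suggests an anomaly.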

20190420005824.png

The above image is taken from here

However, for some reason, when it comes to reconstructing "videos," I could hardly find any published work. (For video *generation*, there are several methods that use GANs.)

(**Addendum (2020/05/26)**: There are examples that use 3D convolutions, such as Abnormal Event Detection in Videos using Spatiotemporal Autoencoder, and the use of Spatio-Temporal Networks (STNs) is being explored for tasks that extract spatiotemporal features from videos, according to [Deep Learning for Anomaly Detection: A Survey](https://arxiv.org/abs/1901.03407). I don't see much literature in Japanese, though lol. I'd like to report on the research status of video anomaly detection in a future article.)

Image anomaly detection technology has become extremely widespread for industrial applications. I think there is a certain amount of demand for video as well, but it doesn't seem nearly as developed.

Of course, if you set anomaly detection aside, video "reconstruction" itself has been studied for a long time. (For example, classification with an SVM based on classic ST-Patch features.)

However, deep learning is now everywhere, so let's use it to reconstruct videos! This time, we implement a video reconstruction model by combining a time-series model, the GRU, with an encoder-decoder model. **All the code is available here.**

Video reconstruction model

The model to be implemented this time is shown below.

図4.png

$\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_T$ denotes a video of length $T$, where $\boldsymbol{x}_t$ is each frame. The encoder takes the image of each frame and maps it to $\boldsymbol{z}_t$. Repeating this $T$ times yields the sequence of latent variables corresponding to the input video, $\boldsymbol{z}_1, \boldsymbol{z}_2, \ldots, \boldsymbol{z}_T$. This latent sequence is modeled with a GRU, whose output at each step $t$ corresponds to $\boldsymbol{\hat{z}}_t$. Finally, the decoder maps each $\boldsymbol{\hat{z}}_t$ back to the observation space.
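The architecture above can be sketched in PyTorch as follows. This is a minimal sketch, not the author's implementation: the layer sizes are arbitrary, and a small MLP stands in for the convolutional encoder and decoder to keep it short.

```python
import torch
import torch.nn as nn

class GRUAE(nn.Module):
    """Sketch of the GRU-AE: frame-wise encoder -> GRU over z_1..z_T -> frame-wise decoder."""

    def __init__(self, frame_dim=64 * 64, z_dim=32, hidden_dim=64):
        super().__init__()
        # Frame-wise encoder: x_t -> z_t (an MLP stands in for a conv encoder)
        self.encoder = nn.Sequential(
            nn.Linear(frame_dim, 256), nn.ReLU(), nn.Linear(256, z_dim)
        )
        # GRU models the latent sequence and emits a hidden state at each step t
        self.gru = nn.GRU(z_dim, hidden_dim, batch_first=True)
        self.to_z = nn.Linear(hidden_dim, z_dim)  # hidden state -> \hat{z}_t
        # Frame-wise decoder: \hat{z}_t -> \hat{x}_t
        self.decoder = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, frame_dim)
        )

    def forward(self, x):  # x: (batch, T, frame_dim)
        b, t, d = x.shape
        z = self.encoder(x.reshape(b * t, d)).reshape(b, t, -1)  # z_1..z_T
        h, _ = self.gru(z)                                       # (batch, T, hidden_dim)
        z_hat = self.to_z(h)                                     # \hat{z}_1..\hat{z}_T
        x_hat = self.decoder(z_hat.reshape(b * t, -1)).reshape(b, t, d)
        return x_hat
```

Note that the GRU consumes the whole latent sequence rather than a single vector, which is exactly the design choice discussed below.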

Since I don't know the standard approach for video reconstruction tasks, I can't guarantee that the above model is strictly mathematically sound, nor claim that it is particularly excellent, but it **can reconstruct videos**.

A supplementary point: why isn't the video encoded into a single point in the latent space? In an ordinary encoder-decoder model, the input image is encoded into a single latent variable. I initially planned to follow this when extending to video, but while surveying GAN-based papers for my research, I found discussions arguing that **mapping an entire video to a single point in the latent space demands too much of that one point**.

I agree with this from past experience. I previously tried reconstructing a video with Keras using the single-point approach, and it did not produce the desired results.

So this time, learning from that experience, I decided to handle a sequence of latent variables and arrived at the model above.

Model learning / verification

**The reconstruction workflow is as follows.**

  1. Prepare a human action dataset
  2. Learn with GRU-AE
  3. Reconstruct the video using the trained model

1. The human action dataset

This dataset was used for evaluation in a video generation model called MocoGAN and, as the name suggests (I don't know its original source), it contains clips of people walking and waving.

epoch_real_60.png epoch_real_30.png

You can download it from here. (The above image is also quoted from the data in this link.)

2. Learning with GRU-AE

Next, we train the model on the above data. The loss function is MSE, which simply minimizes the error between input and output. For details of the model, see the implementation code (written in PyTorch) here.

Training ran for 10,000 iterations, and the loss changed as follows. It is nearly 0 in the second half and shows little further change, so my impression is that it converged safely.
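The training procedure can be sketched as a plain reconstruction loop. This is a minimal sketch, not the author's training code: the optimizer choice (Adam), learning rate, and the shape of `loader` (an iterable of video batches) are my own assumptions.

```python
import torch

def train(model, loader, iters=10_000, lr=1e-3, device="cpu"):
    """Minimal training loop: minimize the MSE between input and reconstruction."""
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    step = 0
    while step < iters:
        for x in loader:          # x: one batch of videos
            x = x.to(device)
            loss = loss_fn(model(x), x)  # input vs. output reconstruction error
            opt.zero_grad()
            loss.backward()
            opt.step()
            step += 1
            if step >= iters:
                break
    return model
```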

ダウンロード.png

3. Reconstruction of video by model

We run inference with the trained model. In the implementation above, the reconstruction results are saved to generated_videos in the logs folder at each checkpoint specified by an argument. As training progressed, the model behaved as follows. For each iteration count, the upper row is the input and the lower row is the output.

- At 0 iterations: Of course, nothing can be reconstructed yet. real_itr0_no0.png recon_itr0_no0.png

- At 1,000 iterations: Although blurry, a human shape appears. real_itr1000_no0.png recon_itr1000_no0.png

- At 5,000 iterations: By 5,000 iterations, the motion is reproduced well. real_itr5000_no0.png recon_itr5000_no0.png

- At 9,000 iterations: This sample is a subtle one, but my impression is that reconstruction is now nearly perfect. real_itr9000_no0.png recon_itr9000_no0.png

Summary

This time, I tried reconstructing videos with a GRU-AE. It's a simple model, but the key point is that it handles latent variables as a time series. With this method, extension to anomaly detection looks promising. Honestly, though, one could ask what novelty it offers these days compared with GANs, and it is undeniable that the reconstructions are inferior (some parts are blurred) compared with images reconstructed by GANs. (I previously introduced VAE-GAN in another article.) Still, it's a simple, easy-to-use method, so I'd be glad if you gave it a try.
