[Python] Verification and implementation of a video reconstruction method using a GRU and an autoencoder

Overview

How is everyone doing? With the spread of coronavirus infections, many of you are probably working or doing research from home. For the past few months I've been staring at my PC at home every day. I'm honestly exhausted lol

This time, I'd like to focus on reconstruction tasks that use generative models. In particular, I'll try to reconstruct **videos**. (What this article covers: experimental results and discussion of a reconstruction method that inserts a time-series model into an encoder-decoder pipeline, which can be extended to video anomaly detection. I won't go into the theoretical background such as the mathematics.)

What is often applied in the field of **anomaly detection** is the reconstruction of **images**. This method computes an anomaly score by feeding an image into an encoder-decoder model, reconstructing it, and taking the difference between input and output. The VAE is particularly well known as a generative model usable for image reconstruction, and recently GAN-based anomaly detection methods (AnoGAN, EfficientGAN, etc.) have also appeared. The following is an example of AnoGAN.
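The reconstruction-based anomaly score described above can be sketched as follows. This is a minimal illustration, not the author's code: `model` stands in for any trained encoder-decoder network, and averaging the squared pixel-wise error is one common choice of score.

```python
import torch

def anomaly_score(model, x):
    """Per-sample anomaly score: mean squared reconstruction error.

    `model` is a placeholder for any trained encoder-decoder network;
    x has shape (batch, channels, height, width).
    """
    with torch.no_grad():
        x_hat = model(x)  # reconstruct the input
    # Average the squared pixel-wise error over all but the batch dimension
    return ((x - x_hat) ** 2).flatten(1).mean(dim=1)
```

A high score means the model reconstructed the input poorly, which for a model trained only on normal data suggests an anomaly.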

20190420005824.png

The above image is taken from here

However, for some reason, when it comes to reconstructing "videos," I could hardly find any published work. (For video *generation*, there are several methods that use GANs.)

(**Addendum (2020/05/26)**: There are examples that use 3D convolutions, such as Abnormal Event Detection in Videos using Spatiotemporal Autoencoder, and the use of Spatio-Temporal Networks (STNs) is being explored for tasks that extract spatiotemporal features from videos, according to [Deep Learning for Anomaly Detection: A Survey](https://arxiv.org/abs/1901.03407). I don't see much literature in Japanese, though lol. I'd like to report on the research status of video anomaly detection in a future article.)

Image anomaly detection technology has become extremely widespread for industrial applications. I think there is a certain amount of demand for video as well, but it doesn't seem nearly as developed.

Of course, if you set anomaly detection aside, video "reconstruction" itself has been studied for a long time. (For example, classification with an SVM based on classic ST-Patch features.)

However, deep learning is now everywhere, so let's use it to reconstruct videos! This time, we implement a video reconstruction model by combining a time-series model, the GRU, with an encoder-decoder model. **All the code is available here.**

Video reconstruction model

The model to be implemented this time is shown below.

図4.png

$\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_T$ denotes a video of length $T$, where $\boldsymbol{x}_t$ is each frame. The encoder takes the image of each frame and maps it to $\boldsymbol{z}_t$. Repeating this $T$ times yields the sequence of latent variables corresponding to the input video, $\boldsymbol{z}_1, \boldsymbol{z}_2, \ldots, \boldsymbol{z}_T$. This latent sequence is modeled with a GRU, whose output at each step $t$ corresponds to $\boldsymbol{\hat{z}}_t$. Finally, the decoder maps each $\boldsymbol{\hat{z}}_t$ back to the observation space.
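The architecture above can be sketched in PyTorch as follows. This is a minimal sketch, not the author's implementation: the layer sizes are arbitrary, and a small MLP stands in for the convolutional encoder and decoder to keep it short.

```python
import torch
import torch.nn as nn

class GRUAE(nn.Module):
    """Sketch of the GRU-AE: frame-wise encoder -> GRU over z_1..z_T -> frame-wise decoder."""

    def __init__(self, frame_dim=64 * 64, z_dim=32, hidden_dim=64):
        super().__init__()
        # Frame-wise encoder: x_t -> z_t (an MLP stands in for a conv encoder)
        self.encoder = nn.Sequential(
            nn.Linear(frame_dim, 256), nn.ReLU(), nn.Linear(256, z_dim)
        )
        # GRU models the latent sequence and emits a hidden state at each step t
        self.gru = nn.GRU(z_dim, hidden_dim, batch_first=True)
        self.to_z = nn.Linear(hidden_dim, z_dim)  # hidden state -> \hat{z}_t
        # Frame-wise decoder: \hat{z}_t -> \hat{x}_t
        self.decoder = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, frame_dim)
        )

    def forward(self, x):  # x: (batch, T, frame_dim)
        b, t, d = x.shape
        z = self.encoder(x.reshape(b * t, d)).reshape(b, t, -1)  # z_1..z_T
        h, _ = self.gru(z)                                       # (batch, T, hidden_dim)
        z_hat = self.to_z(h)                                     # \hat{z}_1..\hat{z}_T
        x_hat = self.decoder(z_hat.reshape(b * t, -1)).reshape(b, t, d)
        return x_hat
```

Note that the GRU consumes the whole latent sequence rather than a single vector, which is exactly the design choice discussed below.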

Since I don't know the standard approach for video reconstruction tasks, I can't guarantee that the above model is strictly mathematically sound, nor claim that it is particularly excellent, but it **can reconstruct videos**.

A supplementary point: why isn't the video encoded into a single point in the latent space? In an ordinary encoder-decoder model, the input image is encoded into a single latent variable. I initially planned to follow this when extending to video, but while surveying GAN-based papers for my research, I found discussions arguing that **mapping an entire video to a single point in the latent space demands too much of that one point**.

I agree with this from past experience. I previously tried reconstructing a video with Keras using the single-point approach, and it did not produce the desired results.

So this time, learning from that experience, I decided to handle a sequence of latent variables and arrived at the model above.

Model learning / verification

**The reconstruction workflow is as follows.**

  1. Prepare a human action dataset
  2. Learn with GRU-AE
  3. Reconstruct the video using the trained model

1. The human action dataset

This dataset was used for evaluation in a video generation model called MocoGAN and, as the name suggests (I don't know its original source), it contains clips of people walking and waving.

epoch_real_60.png epoch_real_30.png

You can download it from here. (The above image is also quoted from the data in this link.)

2. Learning with GRU-AE

Next, we train the model on the above data. The loss function is MSE, which simply minimizes the error between input and output. For details of the model, see the implementation code (written in PyTorch) here.

Training ran for 10,000 iterations, and the loss changed as follows. It is nearly 0 in the second half and shows little further change, so my impression is that it converged safely.
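The training procedure can be sketched as a plain reconstruction loop. This is a minimal sketch, not the author's training code: the optimizer choice (Adam), learning rate, and the shape of `loader` (an iterable of video batches) are my own assumptions.

```python
import torch

def train(model, loader, iters=10_000, lr=1e-3, device="cpu"):
    """Minimal training loop: minimize the MSE between input and reconstruction."""
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    step = 0
    while step < iters:
        for x in loader:          # x: one batch of videos
            x = x.to(device)
            loss = loss_fn(model(x), x)  # input vs. output reconstruction error
            opt.zero_grad()
            loss.backward()
            opt.step()
            step += 1
            if step >= iters:
                break
    return model
```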

ダウンロード.png

3. Reconstruction of video by model

We run inference with the trained model. In the implementation above, the reconstruction results are saved to generated_videos in the logs folder at each checkpoint specified by an argument. As training progressed, the model behaved as follows. For each iteration count, the upper row is the input and the lower row is the output.

- At 0 iterations: Of course, nothing can be reconstructed yet. real_itr0_no0.png recon_itr0_no0.png

- At 1,000 iterations: Although blurry, a human shape appears. real_itr1000_no0.png recon_itr1000_no0.png

- At 5,000 iterations: By 5,000 iterations, the motion is reproduced well. real_itr5000_no0.png recon_itr5000_no0.png

- At 9,000 iterations: This sample is a subtle one, but my impression is that reconstruction is now nearly perfect. real_itr9000_no0.png recon_itr9000_no0.png

Summary

This time, I tried reconstructing videos with a GRU-AE. It's a simple model, but the key point is that it handles latent variables as a time series. With this method, extension to anomaly detection looks promising. Honestly, though, one could ask what novelty it offers these days compared with GANs, and it is undeniable that the reconstructions are inferior (some parts are blurred) compared with images reconstructed by GANs. (I previously introduced VAE-GAN in another article.) Still, it's a simple, easy-to-use method, so I'd be glad if you gave it a try.
