[Python] Painting-style neural network algorithm (Magenta translation ②: artificial intelligence that creates art and music)

Introduction

This article is part of the documentation for Google's Magenta project: a translation of "A Neural Algorithm of Artistic Style" (by Cinjon Resnick), a review of the paper of the same name by Gatys et al. It is distributed under the Apache License, Version 2.0.

Google Brain has launched Magenta, a project that generates art and music with deep learning. One of Magenta's goals is to present the project's research and to publish review articles of several papers.

This review article introduces the paper "A Neural Algorithm of Artistic Style". It describes research on style transfer with deep learning and became a much-discussed topic. Follow-up work such as video style transfer has since been announced, giving a sense of the new possibilities of neural networks.

Painting-style neural network algorithm

In August 2015, Gatys and colleagues at the University of Tübingen published "A Neural Algorithm of Artistic Style". The paper explained how to render one work of art in the style of another, and it spread across Facebook walls (posts) around the world. It attracted public attention, and the technology was quickly recognized as something that could be built into image apps as a tool for creating art.

It can produce results like this.

image

This paper proposes a technique that combines the style of an input image S with the content of another input image C (the image being transformed). The image above is "Starry Night Tübingen": S is Van Gogh's The Starry Night and C is a photograph of the University of Tübingen. The technique is assembled as an [energy minimization problem](https://ja.wikipedia.org/wiki/%E6%9C%80%E9%81%A9%E5%8C%96%E5%95%8F%E9%A1%8C) built from a style loss Ls and a content loss Lc. The key idea is to use a deep convolutional network (VGG-19), which provides a hierarchical understanding of images. As the representation of painting style, correlations of features across multiple layers are extracted from VGG; the content, on the other hand, corresponds to the activations of a particular layer.

The content loss is defined as the raw L2 error at a particular layer. More precisely, the conv4_2 layer is used for Lc, and the loss is half the squared error between that layer's activations for X and its activations for C.
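As a minimal sketch of that definition (using NumPy arrays as stand-ins for the VGG feature maps, which would be computed elsewhere), the content loss looks like this:

```python
import numpy as np

def content_loss(features_x, features_c):
    """Half the squared error between the conv4_2 feature maps of X and C.

    features_x, features_c: arrays of shape (channels, height, width)
    taken from the same VGG layer (conv4_2 in the paper).
    """
    return 0.5 * np.sum((features_x - features_c) ** 2)
```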

The style loss, on the other hand, is built on the Gram matrix. This matrix consists of the dot products between the vectorized feature maps of a given layer. Empirically, these are a very good proxy for feature correlations, and the L2 error between the Gram matrix of one image and that of another works very well for comparing how close their styles are. More intuitively, if you think of algorithms such as texture modeling, the Gram matrix can be viewed as a summary statistic taken over the spatial extent of the feature maps; using it is a good proxy for judging whether styles are similar.
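The Gram matrix itself is straightforward to compute once a layer's feature maps are flattened; a small NumPy sketch (the feature extraction from VGG is assumed to happen elsewhere):

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of one layer's feature maps.

    features: array of shape (channels, height, width).
    Each feature map is flattened to a vector; the Gram matrix is the
    matrix of dot products between every pair of these vectors.
    """
    channels = features.shape[0]
    flat = features.reshape(channels, -1)   # (channels, height * width)
    return flat @ flat.T                    # (channels, channels)
```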

In the end, Ls is computed as the mean squared error between Gram matrices. For each of the layers conv1_1, conv2_1, conv3_1, conv4_1 and conv5_1, the mean squared error between the Gram matrices of X and S is calculated, and the sum of these errors is the style loss Ls.
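Combining this with the gram_matrix helper from the sketch above, the style loss might look like the following (the per-layer normalization constants used in the paper are omitted here for brevity):

```python
import numpy as np

def style_loss(features_x, features_s):
    """Sum of mean squared errors between the Gram matrices of X and S,
    taken over the style layers used in the paper.

    features_x, features_s: dicts mapping layer names to feature arrays
    of shape (channels, height, width) for the generated image X and the
    style image S.
    """
    layers = ["conv1_1", "conv2_1", "conv3_1", "conv4_1", "conv5_1"]
    total = 0.0
    for name in layers:
        g_x = gram_matrix(features_x[name])   # helper from the sketch above
        g_s = gram_matrix(features_s[name])
        total += np.mean((g_x - g_s) ** 2)
    return total
```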

X is initialized as a white noise image, and these losses are combined and minimized with the [L-BFGS method](https://ja.wikipedia.org/wiki/%E6%BA%96%E3%83%8B%E3%83%A5%E3%83%BC%E3%83%88%E3%83%B3%E6%B3%95) to produce the style transfer effect. Of course, some tuning may be needed: the weight parameters on Lc and Ls depend somewhat on C and S. Initializing X with the S or C image will probably also work, but gives deterministic results. In practice, the network first fits the low-level style features and then gradually adjusts toward the content of the image. Each image takes 3-5 minutes to complete on a GPU. It should also be mentioned that the effect on other kinds of images depends on which convolutional network is used; for example, a network trained for face recognition should work well for transferring facial styles.
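To make the overall procedure concrete, here is a minimal PyTorch sketch of that optimization loop. The `extract_features` helper, which would run an image through VGG-19 and return a dict of layer activations, is assumed rather than shown, and the weights and step count are illustrative rather than the paper's values:

```python
import torch

STYLE_LAYERS = ["conv1_1", "conv2_1", "conv3_1", "conv4_1", "conv5_1"]
CONTENT_LAYER = "conv4_2"

def gram(feat):
    # feat: (channels, height, width) -> (channels, channels)
    c = feat.shape[0]
    flat = feat.reshape(c, -1)
    return flat @ flat.t()

def run_style_transfer(extract_features, content_img, style_img,
                       alpha=1.0, beta=1e3, steps=100):
    """Optimize the pixels of X with L-BFGS so that its VGG features match
    the content of C and the Gram matrices of S.

    extract_features: assumed helper returning {layer_name: feature map}
    for a given image; alpha and beta are illustrative loss weights.
    """
    with torch.no_grad():
        target_content = extract_features(content_img)[CONTENT_LAYER]
        style_feats = extract_features(style_img)
        target_grams = {k: gram(style_feats[k]) for k in STYLE_LAYERS}

    # Start from white noise; the pixels themselves are the variables.
    x = torch.randn_like(content_img).requires_grad_(True)
    optimizer = torch.optim.LBFGS([x])

    def closure():
        optimizer.zero_grad()
        feats = extract_features(x)
        content_loss = 0.5 * ((feats[CONTENT_LAYER] - target_content) ** 2).sum()
        style_loss = sum(((gram(feats[k]) - target_grams[k]) ** 2).mean()
                         for k in STYLE_LAYERS)
        loss = alpha * content_loss + beta * style_loss
        loss.backward()
        return loss

    for _ in range(steps):
        optimizer.step(closure)
    return x.detach()
```

The closure form is how PyTorch's `LBFGS` re-evaluates the loss several times per step, which suits this kind of pixel-space optimization.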

The contribution of this research has spread beyond machine learning. It became well known to the general public and drew a diverse set of new practitioners to the field. Since its debut opened the way, a great deal of follow-up work has appeared, both improving the effect and adapting it to new domains. Here I will briefly describe three of them: color-preserving style transfer, video style transfer, and instantaneous style transfer.

Color-preserving style transfer

Let's start with the most recent innovation in this area. This paper by Gatys et al. revises the original style transfer method so that the color of the content image is preserved. Two techniques are described. The first converts the color scheme of the style image to match the color scheme of the content image; this new S' is then used as the style input instead of the previous S. To achieve this, the paper describes two different linear transformations.
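As an illustration of that first idea, here is a sketch of one plausible linear transform: shifting and scaling the style image's RGB distribution so that its mean and covariance match those of the content image. This is one possible choice; the paper discusses two specific variants that may differ in detail.

```python
import numpy as np

def match_color(style_img, content_img, eps=1e-5):
    """Recolor the style image so its color statistics (mean and covariance
    of RGB values) match those of the content image, producing S'.

    style_img, content_img: float arrays of shape (height, width, 3),
    assumed to hold pixel values in [0, 1].
    """
    s = style_img.reshape(-1, 3)
    c = content_img.reshape(-1, 3)
    mu_s, mu_c = s.mean(axis=0), c.mean(axis=0)
    cov_s = np.cov(s, rowvar=False) + eps * np.eye(3)
    cov_c = np.cov(c, rowvar=False) + eps * np.eye(3)

    def sqrt_m(m):
        # Matrix square root via eigendecomposition (covariances are symmetric).
        vals, vecs = np.linalg.eigh(m)
        return vecs @ np.diag(np.sqrt(np.maximum(vals, 0))) @ vecs.T

    # A maps the style covariance onto the content covariance.
    a = sqrt_m(cov_c) @ np.linalg.inv(sqrt_m(cov_s))
    s_new = (s - mu_s) @ a.T + mu_c
    return np.clip(s_new, 0.0, 1.0).reshape(style_img.shape)
```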

The other technique described is transfer in luminance space only. The luminance channel is first extracted from S and C, style transfer is performed in this luminance space, and the color channels are then added back to the output. There is also a brief discussion comparing the advantages and disadvantages of these techniques. The output can be seen in the image below: using Picasso's "Seated Nude", an image of New York at night is transferred into this style while the color scheme of the original image is preserved.
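A sketch of the luminance-only variant, using approximate RGB-to-YIQ conversion matrices and treating the style transfer itself as an assumed helper `transfer_fn`:

```python
import numpy as np

# Approximate NTSC RGB <-> YIQ conversion matrices.
RGB_TO_YIQ = np.array([[0.299,  0.587,  0.114],
                       [0.596, -0.274, -0.322],
                       [0.211, -0.523,  0.312]])
YIQ_TO_RGB = np.linalg.inv(RGB_TO_YIQ)

def luminance_only_transfer(style_img, content_img, transfer_fn):
    """Run style transfer on the luminance channel only, then put the
    content image's color channels back.

    style_img, content_img: float RGB arrays of shape (height, width, 3)
    with values in [0, 1]. transfer_fn is the style transfer itself
    (assumed, not shown); it takes single-channel style and content
    images and returns the stylized single-channel result.
    """
    s_yiq = style_img @ RGB_TO_YIQ.T
    c_yiq = content_img @ RGB_TO_YIQ.T

    # Style transfer only sees the Y (luminance) channels.
    out_y = transfer_fn(s_yiq[..., 0], c_yiq[..., 0])

    # Recombine with the content image's I and Q (color) channels.
    out_yiq = np.stack([out_y, c_yiq[..., 1], c_yiq[..., 2]], axis=-1)
    return np.clip(out_yiq @ YIQ_TO_RGB.T, 0.0, 1.0)
```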

image

Video style transfer

The paper by Ruder et al. looks at what happens when you try to apply style transfer to video. If Gatys's algorithm is simply applied independently to each frame of the sequence, the result is not stable: it flickers and shows spurious discontinuities. The paper therefore explains how to regularize the transformation with a technique called optical flow, using state-of-the-art flow estimation algorithms such as DeepFlow and EpicFlow.

In addition, several techniques are used to further strengthen consistency across frames. These include detecting the boundaries of regions and of motion by running the optical flow in both directions, and enforcing long-term consistency by penalizing deviations from frames that are distant in time.
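As a rough illustration of the short-term consistency idea, a per-frame temporal penalty might look like this, assuming the warped previous frame and the per-pixel confidence mask have already been produced by an optical flow method such as DeepFlow (the names and normalization here are illustrative):

```python
import numpy as np

def temporal_loss(current_frame, warped_previous, mask):
    """Penalty for deviating from the previous stylized frame warped
    forward to the current frame by the estimated optical flow.

    current_frame:   stylized frame t, shape (height, width, 3)
    warped_previous: stylized frame t-1 warped to t by the flow
    mask:            per-pixel weights in [0, 1]; 0 at occlusions and
                     motion boundaries found by the forward/backward
                     flow consistency check, 1 elsewhere
    """
    diff = (current_frame - warped_previous) ** 2
    return np.sum(mask[..., None] * diff) / current_frame.size
```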

In the end, the results are very impressive. They are not seamless, but the frames are consistent and striking. Example footage can be seen on YouTube.

Instantaneous style transfer

The paper by Johnson et al. asks and answers the question of speed. The work of Gatys and Ruder involves a long optimization step that takes 3-5 minutes per frame. This paper modifies the setup by adding another deep network, the "Image Transformation Network" (ITN), in front of VGG. As a result, an image that would otherwise require Gatys's optimization steps is produced with a single forward pass.

In this method, the style image S is fixed in advance, and VGG is treated as a black box that returns the combined style and content loss given S. The input to the ITN is the content image C to be transformed. The network is trained, by optimizing the style and content losses, to convert C into C'. Because S is fixed for all C, style transfer can then be achieved without the long forward-and-backward-propagation optimization used in Gatys's original research.
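A minimal sketch of this training setup, with a toy stand-in for the ITN and the fixed VGG-based loss treated as an assumed helper `perceptual_loss` that already has the style image S baked in (the real network in Johnson et al. is a much deeper residual encoder/decoder):

```python
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """Toy stand-in for the image transformation network."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 9, padding=4), nn.ReLU(),
            nn.Conv2d(32, 3, 9, padding=4), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

def train_step(itn, optimizer, content_batch, perceptual_loss):
    """One training step: the ITN stylizes the batch in a single forward
    pass, and the fixed VGG-based loss (assumed helper) provides the
    gradient that trains the ITN."""
    optimizer.zero_grad()
    stylized = itn(content_batch)
    loss = perceptual_loss(stylized, content_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```

At test time, stylizing a new image is then just a single forward pass through the trained ITN.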

There is some debate about whether the quality degrades or improves. What is clear, however, is that this is currently the only model that can perform style transfer at 15 frames per second.

Future outlook

This is a really interesting area, because you can imagine and pursue all sorts of directions. How about starting from what we already know and improving it, for example real-time optical flow? How about developing a new art form that seamlessly transforms the characters in video scenes? What about new domains like music? I would personally love to hear Dylan in the style of Disney.
