"A Neural Algorithm of Artistic Style" (hereinafter Neural style) is famous as a style conversion method using Convolutional Neural Network (CNN), and the following implementation there is.
Another method of style conversion is "Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis" (hereinafter "style conversion using MRF"). I will introduce this.
I implemented this technique in Chainer; the source code can be found at https://github.com/dsanno/chainer-neural-style. The authors' reference implementation is available at https://github.com/chuanli11/CNNMRF.
| | Generated image | Content image | Style image |
|---|---|---|---|
| Oil painting | | | |
| Watercolor | | | Chihiro Iwasaki, "Hanaguruma" |
| Pen drawing | | | |
Image source:
Style transfer using MRF generates images like those in the table above.
The difference is that Neural style matches the overall style of the image to the style image, while style transfer using MRF matches local styles to the style image.
The paper's title mentions Markov Random Fields (MRF), but since MRF is not the heart of the algorithm, I will omit its explanation.
The inputs are a content image and a style image. The content image is denoted $x_c$, the style image $x_s$, and the generated output image $x$.
CNN
Like Neural style, this method uses a CNN trained for image recognition, such as VGG.
The output of a particular layer when the image $x$ is fed to the CNN is denoted $\Phi(x)$. The layer output for the content image is $\Phi(x_c)$, and the layer output for the style image is $\Phi(x_s)$.
Patches are generated from the CNN layer output. A patch gathers a $k \times k$ region of the layer output into a single vector of length $k \times k \times C$ (where $C$ is the number of channels of the layer output). Multiple patches can be extracted from one layer, and the $i$-th patch is denoted $\Psi_i(\Phi(x))$.
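As a concrete illustration, here is a minimal NumPy sketch of patch extraction (actual implementations do this on the GPU, e.g. with an im2col-style operation; the function name here is my own):

```python
import numpy as np

def extract_patches(feature, k=3, stride=1):
    """Extract k x k patches from one layer's output.

    feature: array of shape (C, H, W).
    Returns an array of shape (num_patches, k * k * C),
    one flattened patch per row.
    """
    c, h, w = feature.shape
    patches = []
    for i in range(0, h - k + 1, stride):
        for j in range(0, w - k + 1, stride):
            patches.append(feature[:, i:i + k, j:j + k].ravel())
    return np.stack(patches)
```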
This algorithm defines an energy as a function of $x$ and then computes the $x$ that minimizes it. The energy function is defined as follows; each term is explained below.
$$E(x) = E_s(\Phi(x), \Phi(x_s)) + \alpha_1 E_c(\Phi(x), \Phi(x_c)) + \alpha_2 \Upsilon(x)$$
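In code, the energy is just a weighted sum of the three terms. A skeleton follows; the component functions are sketched in the corresponding sections below, and the weight values here are illustrative, not the paper's:

```python
def total_energy(x, patches_x, patches_s, phi_x, phi_xc,
                 alpha1=1.0, alpha2=0.001):
    # E(x) = E_s + alpha_1 * E_c + alpha_2 * Upsilon(x)
    return (mrf_loss(patches_x, patches_s)
            + alpha1 * content_loss(phi_x, phi_xc)
            + alpha2 * tv_loss(x))
```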
MRFs loss function
The first term, $E_s$, is called the MRFs loss function and is defined as follows.
$$E_s(\Phi(x), \Phi(x_s)) = \sum^{m}_{i=1}||\Psi_i(\Phi(x)) - \Psi_{NN(i)}(\Phi(x_s))||^2$$
Here, $NN(i)$ is defined by the following formula.
$$NN(i) := \mathop{\rm arg\,max}\limits_{j=1,...,m_s} \frac{\Psi_i(\Phi(x)) \cdot \Psi_j(\Phi(x_s))}{|\Psi_i(\Phi(x))| \cdot |\Psi_j(\Phi(x_s))|}$$
The paper writes argmin, but judging from the implementation, argmax appears to be correct. This formula means that $NN(i)$ selects, for each patch $\Psi_i(\Phi(x))$ of the generated image, the style patch with the highest normalized cross-correlation, i.e. the most similar style patch. $E_s$ therefore becomes smaller as each local patch of the generated image approaches its most similar style patch.
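A direct NumPy sketch of the matching and the loss follows (real implementations compute the correlations as a convolution with normalized style patches for speed, and hold the match indices fixed while gradients flow through the patch values):

```python
def nearest_style_patches(patches_x, patches_s):
    """NN(i): index of the style patch with the highest normalized
    cross-correlation for each patch of Phi(x)."""
    nx = patches_x / np.linalg.norm(patches_x, axis=1, keepdims=True)
    ns = patches_s / np.linalg.norm(patches_s, axis=1, keepdims=True)
    return np.argmax(nx @ ns.T, axis=1)

def mrf_loss(patches_x, patches_s):
    """E_s: squared distance between each patch and its matched style patch."""
    nn = nearest_style_patches(patches_x, patches_s)
    return np.sum((patches_x - patches_s[nn]) ** 2)
```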
Content loss function
The second term, $E_c$, is called the content loss function and is defined as follows.
$$E_c(\Phi(x), \Phi(x_c)) = ||\Phi(x) - \Phi(x_c)||^2$$
This means that the closer the CNN layer output for $x$ is to the layer output for $x_c$, the smaller the energy.
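As a sketch, this term is a plain sum of squared differences over the chosen layer's output:

```python
def content_loss(phi_x, phi_xc):
    """E_c: squared error between the layer outputs for x and x_c."""
    return np.sum((phi_x - phi_xc) ** 2)
```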
Regularizer
The third term, $\Upsilon$, is a regularization term that smooths the image. It is defined as follows, where $x_{i,j}$ is the value of the pixel at x coordinate $i$ and y coordinate $j$. The smaller the differences between adjacent pixels, the smaller the energy.
$$\Upsilon(x) = \sum_{i,j}((x_{i,j+1} - x_{i,j})^2 + (x_{i+1,j} - x_{i,j})^2)$$
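This is the usual squared total-variation penalty; a NumPy sketch for an image of shape (C, H, W):

```python
def tv_loss(x):
    """Upsilon(x): squared differences between adjacent pixels."""
    dh = x[:, 1:, :] - x[:, :-1, :]   # vertical neighbor differences
    dw = x[:, :, 1:] - x[:, :, :-1]   # horizontal neighbor differences
    return np.sum(dh ** 2) + np.sum(dw ** 2)
```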
I implemented this with Chainer, using the 16-layer VGG model that is also familiar from chainer-gogh; the source code is in the repository linked above.
Execution time is shorter than Neural style's: Neural style requires thousands of iterations, while style transfer using MRF requires only hundreds.
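The overall flow is ordinary gradient-based minimization of $E(x)$ with respect to the image. A minimal sketch, where `compute_energy_and_grad` is a hypothetical helper that evaluates $E(x)$ and $dE/dx$ through the CNN's backward pass (Chainer's autograd provides this in the actual code):

```python
import numpy as np

# Start from a random image and descend the energy gradient.
x = np.random.uniform(-20, 20, size=(3, 256, 256)).astype(np.float32)
learning_rate = 1.0
for step in range(500):  # hundreds of iterations are enough for this method
    energy, grad = compute_energy_and_grad(x)  # hypothetical helper
    x -= learning_rate * grad  # plain gradient descent; Adam etc. in practice
```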
With Neural style, the colors change drastically, but style transfer using MRF does not change the colors as much; my impression is that it changes the touch of the picture instead.
neural-doodle is an application of style transfer using MRF that lets you specify which style to apply to which region of the image. As the linked images show, a face photograph is converted into the style of a Van Gogh portrait, and specifying a style for each region produces a more natural result.
In neural-doodle, the style specification is realized by concatenating to each patch vector extracted from the CNN layer output a one-hot vector representing the style number (indicating which style the patch corresponds to).
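A sketch of that idea (function and argument names are my own): appending the one-hot style channels lowers the correlation between patches of different styles, so nearest-neighbor matching effectively stays within each region's style.

```python
def append_style_channels(patches, style_ids, num_styles):
    """Concatenate a one-hot style vector to each patch vector."""
    n = len(style_ids)
    onehot = np.zeros((n, num_styles), dtype=patches.dtype)
    onehot[np.arange(n), style_ids] = 1.0
    return np.concatenate([patches, onehot], axis=1)
```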