"A Neural Algorithm of Artistic Style" (hereinafter Neural style) is famous as a style conversion method using Convolutional Neural Network (CNN), and the following implementation there is.
Another method of style conversion is "Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis" (hereinafter "style conversion using MRF"). I will introduce this.
I implemented this technique in Chainer; the source code can be found at https://github.com/dsanno/chainer-neural-style. The authors' reference implementation is available at https://github.com/chuanli11/CNNMRF.
| | Generated image | Content image | Style image |
|---|---|---|---|
| Oil painting | | | |
| Watercolor | | | Chihiro Iwasaki, "Hanaguruma" |
| Pen drawing | | | |
Image source:
Style transfer using MRF generates images like those in the table above.
The difference is that Neural style matches the overall style of the image to the style image, while style transfer using MRF matches local styles to the style image.
The paper's title mentions Markov Random Fields (MRF), but since MRF is not the heart of the algorithm, I will omit its explanation.
The inputs are a content image and a style image. The content image is denoted $x_c$, the style image $x_s$, and the generated output image $x$.
CNN
Like Neural style, this method uses a CNN trained for image recognition, such as VGG.
The output of a particular layer when the image $x$ is fed to the CNN is denoted $\Phi(x)$. The layer output for the content image is $\Phi(x_c)$, and the layer output for the style image is $\Phi(x_s)$.
Patches are generated from the CNN layer output. A patch gathers a $k \times k$ region of the layer output into a single vector of length $k \times k \times C$ (where $C$ is the number of channels of the layer output). Multiple patches can be extracted from one layer, and the $i$-th patch is denoted $\Psi_i(\Phi(x))$.
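As a concrete illustration, here is a minimal NumPy sketch of patch extraction (actual implementations do this on the GPU, e.g. with an im2col-style operation; the function name here is my own):

```python
import numpy as np

def extract_patches(feature, k=3, stride=1):
    """Extract k x k patches from one layer's output.

    feature: array of shape (C, H, W).
    Returns an array of shape (num_patches, k * k * C),
    one flattened patch per row.
    """
    c, h, w = feature.shape
    patches = []
    for i in range(0, h - k + 1, stride):
        for j in range(0, w - k + 1, stride):
            patches.append(feature[:, i:i + k, j:j + k].ravel())
    return np.stack(patches)
```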
This algorithm defines an energy as a function of $x$ and then computes the $x$ that minimizes it. The energy function is defined as follows; each term is explained below.
$$E(x) = E_s(\Phi(x), \Phi(x_s)) + \alpha_1 E_c(\Phi(x), \Phi(x_c)) + \alpha_2 \Upsilon(x)$$
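In code, the energy is just a weighted sum of the three terms. A skeleton follows; the component functions are sketched in the corresponding sections below, and the weight values here are illustrative, not the paper's:

```python
def total_energy(x, patches_x, patches_s, phi_x, phi_xc,
                 alpha1=1.0, alpha2=0.001):
    # E(x) = E_s + alpha_1 * E_c + alpha_2 * Upsilon(x)
    return (mrf_loss(patches_x, patches_s)
            + alpha1 * content_loss(phi_x, phi_xc)
            + alpha2 * tv_loss(x))
```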
MRFs loss function
The first term, $E_s$, is called the MRFs loss function and is defined as follows.
$$E_s(\Phi(x), \Phi(x_s)) = \sum^{m}_{i=1}||\Psi_i(\Phi(x)) - \Psi_{NN(i)}(\Phi(x_s))||^2$$
Here, $NN(i)$ is defined by the following formula.
$$NN(i) := \mathop{\rm arg\,max}\limits_{j=1,...,m_s} \frac{\Psi_i(\Phi(x)) \cdot \Psi_j(\Phi(x_s))}{|\Psi_i(\Phi(x))| \cdot |\Psi_j(\Phi(x_s))|}$$
The paper writes argmin, but judging from the implementation, argmax appears to be correct. This formula means that $NN(i)$ selects, for each patch $\Psi_i(\Phi(x))$ of the generated image, the style patch with the highest normalized cross-correlation, i.e. the most similar style patch. $E_s$ therefore becomes smaller as each local patch of the generated image approaches its most similar style patch.
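A direct NumPy sketch of the matching and the loss follows (real implementations compute the correlations as a convolution with normalized style patches for speed, and hold the match indices fixed while gradients flow through the patch values):

```python
def nearest_style_patches(patches_x, patches_s):
    """NN(i): index of the style patch with the highest normalized
    cross-correlation for each patch of Phi(x)."""
    nx = patches_x / np.linalg.norm(patches_x, axis=1, keepdims=True)
    ns = patches_s / np.linalg.norm(patches_s, axis=1, keepdims=True)
    return np.argmax(nx @ ns.T, axis=1)

def mrf_loss(patches_x, patches_s):
    """E_s: squared distance between each patch and its matched style patch."""
    nn = nearest_style_patches(patches_x, patches_s)
    return np.sum((patches_x - patches_s[nn]) ** 2)
```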
Content loss function
The second term, $E_c$, is called the content loss function and is defined as follows.
$$E_c(\Phi(x), \Phi(x_c)) = ||\Phi(x) - \Phi(x_c)||^2$$
This means that the closer the CNN layer output for $x$ is to the layer output for $x_c$, the smaller the energy.
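As a sketch, this term is a plain sum of squared differences over the chosen layer's output:

```python
def content_loss(phi_x, phi_xc):
    """E_c: squared error between the layer outputs for x and x_c."""
    return np.sum((phi_x - phi_xc) ** 2)
```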
Regularizer
The third term, $\Upsilon$, is a regularization term that smooths the image. It is defined as follows, where $x_{i,j}$ is the value of the pixel at x coordinate $i$ and y coordinate $j$. The smaller the differences between adjacent pixels, the smaller the energy.
$$\Upsilon(x) = \sum_{i,j}((x_{i,j+1} - x_{i,j})^2 + (x_{i+1,j} - x_{i,j})^2)$$
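This is the usual squared total-variation penalty; a NumPy sketch for an image of shape (C, H, W):

```python
def tv_loss(x):
    """Upsilon(x): squared differences between adjacent pixels."""
    dh = x[:, 1:, :] - x[:, :-1, :]   # vertical neighbor differences
    dw = x[:, :, 1:] - x[:, :, :-1]   # horizontal neighbor differences
    return np.sum(dh ** 2) + np.sum(dw ** 2)
```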
I implemented this with Chainer, using the 16-layer VGG model that is also familiar from chainer-gogh; the source code is in the repository linked above.
Execution time is shorter than Neural style's: Neural style requires thousands of iterations, while style transfer using MRF requires only hundreds.
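The overall flow is ordinary gradient-based minimization of $E(x)$ with respect to the image. A minimal sketch, where `compute_energy_and_grad` is a hypothetical helper that evaluates $E(x)$ and $dE/dx$ through the CNN's backward pass (Chainer's autograd provides this in the actual code):

```python
import numpy as np

# Start from a random image and descend the energy gradient.
x = np.random.uniform(-20, 20, size=(3, 256, 256)).astype(np.float32)
learning_rate = 1.0
for step in range(500):  # hundreds of iterations are enough for this method
    energy, grad = compute_energy_and_grad(x)  # hypothetical helper
    x -= learning_rate * grad  # plain gradient descent; Adam etc. in practice
```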
With Neural style, the colors change drastically, but style transfer using MRF does not change the colors as much; my impression is that it changes the touch of the picture instead.
neural-doodle is an application of style transfer using MRF that lets you specify which style to apply to which region of the image. As the linked images show, a face photograph is converted into the style of a Van Gogh portrait, and specifying a style for each region produces a more natural result.
In neural-doodle, the style specification is realized by concatenating to each patch vector extracted from the CNN layer output a one-hot vector representing the style number (indicating which style the patch corresponds to).
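A sketch of that idea (function and argument names are my own): appending the one-hot style channels lowers the correlation between patches of different styles, so nearest-neighbor matching effectively stays within each region's style.

```python
def append_style_channels(patches, style_ids, num_styles):
    """Concatenate a one-hot style vector to each patch vector."""
    n = len(style_ids)
    onehot = np.zeros((n, num_styles), dtype=patches.dtype)
    onehot[np.arange(n), style_ids] = 1.0
    return np.concatenate([patches, onehot], axis=1)
```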