[PYTHON] Make people smile with Deep Learning

Deep Feature Interpolation for Image Content Changes

Deep Feature Interpolation (DFI) is a technique for giving an image a specific attribute (for example, "smiling", "elderly", or "bearded"). Methods based on Generative Adversarial Networks (GANs), such as "Autoencoding beyond pixels using a learned similarity metric", are also known for this task, but DFI takes a different approach from GANs. The paper can be found at https://arxiv.org/abs/1611.05507.

Overview

As shown by "A Neural Algorithm of Artistic Style" and related work, an image can be restored from the Feature Map (intermediate layer output) obtained by feeding it into a CNN. This is explained in articles on the style transfer algorithm, so reading one of those will deepen your understanding. In DFI, the Feature Map of "an image with a specific attribute" is obtained by adding an attribute vector to the Feature Map of the original image, and the "image with the specific attribute" is then restored from that Feature Map.

Image conversion procedure

Follow the procedure below to convert the image.

  1. Prepare a CNN used for image recognition, such as the VGG 19-layer model.
  2. Prepare the conversion source image (called the original image).
  3. Decide the attribute you want to give to the original image (called the desired attribute).
  4. Collect two sets of images: a target set of images that have the desired attribute, and a source set of images that do not.
  5. Feed the images in the target set into the CNN to get their Feature Maps, and compute the average $\bar{\phi}^{t}$.
  6. Similarly, compute the average $\bar{\phi}^{s}$ of the Feature Maps of the images in the source set.
  7. Compute the attribute vector $w = \bar{\phi}^{t} - \bar{\phi}^{s}$.
  8. Feed the original image $x$ into the CNN to get its Feature Map $\phi(x)$.
  9. Compute the weighted sum $\phi(x) + \alpha w$ of the original image's Feature Map and the attribute vector.
  10. Optimize the converted image $z$ so that the Feature Map $\phi(z)$ obtained by feeding $z$ into the CNN approaches $\phi(x) + \alpha w$. A minimal code sketch of these steps follows the list.
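As a concrete illustration of steps 5 to 10, here is a minimal sketch. It is not the chainer-dfi implementation: it uses PyTorch and torchvision's VGG-19, treats the output of the whole convolutional stack as the Feature Map instead of conv3_1/conv4_1/conv5_1, and the value of $\alpha$, the learning rate, and the iteration count are arbitrary choices for illustration.

```python
# Minimal DFI sketch (illustration only, not the chainer-dfi implementation).
# Assumes three inputs prepared elsewhere: target_imgs (images with the desired
# attribute), source_imgs (images without it), and the original image x,
# each a float tensor of shape (3, H, W) preprocessed for VGG.
import torch
import torch.nn.functional as F
import torchvision.models as models

vgg = models.vgg19(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)  # only the image z is optimized, not the network

def phi(img):
    """Flattened Feature Map of one image (here: output of the full conv stack)."""
    with torch.no_grad():
        return vgg(img.unsqueeze(0)).flatten()

# Steps 5-7: average Feature Maps of both sets and take the difference.
phi_t = torch.stack([phi(im) for im in target_imgs]).mean(dim=0)
phi_s = torch.stack([phi(im) for im in source_imgs]).mean(dim=0)
w = phi_t - phi_s                      # attribute vector

# Steps 8-9: Feature Map of the original image, shifted along w.
alpha = 0.4
target_feature = phi(x) + alpha * w

# Step 10: optimize z so that phi(z) approaches the shifted Feature Map.
z = x.clone().requires_grad_(True)
optimizer = torch.optim.Adam([z], lr=0.05)
for _ in range(200):
    optimizer.zero_grad()
    loss = F.mse_loss(vgg(z.unsqueeze(0)).flatten(), target_feature)
    loss.backward()
    optimizer.step()
# z now approximates the original image with the desired attribute added.
```

The paper additionally adds a total variation term to the loss to keep $z$ smooth, which this sketch omits.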

Here is a diagram of the algorithm from the paper. The step numbers in the diagram are the paper's and differ from the numbering in this article.

dfi.jpg

Implementation

I implemented it with Chainer. https://github.com/dsanno/chainer-dfi

Make people smile with DFI

Let's use DFI to give the face image a smile attribute.

Use Labeled Faces in the Wild dataset

In the paper, the Feature Maps were calculated using images from the Labeled Faces in the Wild (LFW) dataset. LFW contains more than 13,000 face images together with vectors that quantify attributes of each face, such as "Male" and "Smiling". In the paper, the source / target sets were built from images that share many attributes with the original image. Following the same approach, I tried to make images included in LFW smile; a rough sketch of the selection step is shown below, followed by the results.
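As a rough sketch of that selection step (not the actual chainer-dfi code), the snippet below builds the source and target sets from attribute scores. Here `attributes` is a hypothetical dictionary mapping an image name to a NumPy vector of attribute scores and `attr_index` is a hypothetical mapping from attribute name to vector index; the real LFW attribute file has a different format.

```python
import numpy as np

def select_sets(original_name, attributes, attr_index, desired="Smiling", k=100):
    """Pick k target images (desired attribute present) and k source images
    (desired attribute absent) whose other attributes best match the original."""
    orig = attributes[original_name]
    col = attr_index[desired]

    def shared_attributes(name):
        # Count attributes whose sign (present / absent) matches the original image.
        return int(np.sum(np.sign(attributes[name]) == np.sign(orig)))

    ranked = sorted(attributes, key=shared_attributes, reverse=True)
    target = [n for n in ranked if attributes[n][col] > 0][:k]
    source = [n for n in ranked if attributes[n][col] <= 0][:k]
    return source, target
```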

Original image | Smile | Smile + open mouth
dfi01_org.jpg | out9_04.jpg | out20_04.jpg
dfi02_org.jpg | out10_04.jpg | out22_04.jpg
dfi03_org.jpg | out7_04.jpg | out21_04.jpg

The parameters etc. are as follows.

Use your own image

I also tried converting an image distributed on Pakutaso.

sample.png

Let's turn this slightly tense-looking face into a smile. The weight parameter $\alpha$ was varied in the range 0.1 to 0.5. If the weight is too large, the image becomes too distorted.

Weight 0.1 | Weight 0.2 | Weight 0.3
sample_w01.png | sample_w02.png | sample_w03.png

Weight 0.4 | Weight 0.5
sample_w04.png | sample_w05.png

Does it really add attributes?

I have shown above that a face image can be made to smile, but does this technique really add attributes to the image? I think it does not. The CNN used here was trained for image recognition, and its Feature Maps have not learned anything about particular attributes. Therefore, what the Feature Map difference between the source and target sets can produce is the "average image difference between the source and the target". If the attribute is a smile, this means a smiling image is generated by adding the "image difference between non-smiling faces and smiling faces" to the original image.

Since only an image difference is added, the placement of facial parts must be aligned between the original image and the source / target images. In fact, if the face positions do not match, the lips may appear in strange places. The face image dataset used here does not include part-placement information or face-orientation attributes, but the position and size of the faces are aligned, so it generally works. For more natural conversions, I think the source / target images should be selected with the placement of facial parts taken into account.

Comparison with GAN-based method

The paper also compares the generated images with GAN-based methods. Please see the paper for the images actually generated. Here, we will compare the characteristics of the methods.

 | DFI | GAN-based
Trained model required? | Required | Not required
Pre-training | Not required | Must train a generative model
Time to generate one image (with GPU) | Tens of seconds | Under a second
Images required at generation time | Dozens to hundreds of similar images | None

About implementation

This section describes where this implementation differs from the description in the paper. If you are not interested in implementation details, you can skip it.

Whether to normalize the Feature map

The paper says "We use the convolutional layers of the normalized VGG-19 network pre-trained on ILSVRC2012," which shows that the Feature Map is normalized. Feature Map normalization is described in "Understanding Deep Image Representations by Inverting Them" and means dividing the Feature Map by its L2 norm. Normalization was not performed in this implementation for the following reasons.

Whether to combine feature maps

In the paper, the intermediate layers conv3_1, conv4_1, and conv5_1 of VGG-19 are used as Feature Maps, and this implementation does the same. The paper also states "The vector $\phi(x)$ consists of concatenated activations of the convnet when applied to image x", which shows that the multiple Feature Maps are concatenated. However, it worked without concatenating the Feature Maps, so this implementation does not concatenate them. (A sketch of what concatenation would look like is shown below.)
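For reference, a concatenated $\phi(x)$ could be computed roughly as follows with torchvision's VGG-19. This is an illustration rather than the chainer-dfi code, and the indices 10, 19, and 28 are my assumed positions of conv3_1, conv4_1, and conv5_1 inside `vgg19().features`.

```python
import torch
import torchvision.models as models

vgg = models.vgg19(pretrained=True).features.eval()
# Assumed indices of conv3_1, conv4_1, conv5_1 in vgg19().features.
LAYERS = {10: "conv3_1", 19: "conv4_1", 28: "conv5_1"}

def concatenated_phi(img):
    """Run one preprocessed image (3, H, W) through VGG-19 and concatenate the
    activations of the three chosen layers into a single flat vector."""
    feats = []
    h = img.unsqueeze(0)
    for i, layer in enumerate(vgg):
        h = layer(h)
        if i in LAYERS:
            feats.append(h.flatten())
    return torch.cat(feats)
```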

Attribute vector normalization

In the paper, the attribute vector $w = \bar{\phi}^{t} - \bar{\phi}^{s}$ is L2-normalized to $w / \|w\|$ before use. This makes the attribute weight $\alpha$ robust to the choice of the source / target sets (so that the same $\alpha$ can be reused). However, normalizing the attribute vector only makes sense if the Feature Map is also normalized. Since the Feature Map is not normalized here, it is natural to set the length of the attribute vector according to the size of the Feature Map, so this implementation uses $w / \|w\|$ multiplied by the L2 norm of the original image's Feature Map as the attribute vector.
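A small sketch of that scaling, assuming plain NumPy arrays for the flattened Feature Maps:

```python
import numpy as np

def attribute_vector(phi_t_mean, phi_s_mean, phi_x):
    """Return w scaled to the L2 norm of the original image's Feature Map,
    instead of the unit-length normalization used in the paper."""
    w = phi_t_mean - phi_s_mean
    return w / np.linalg.norm(w) * np.linalg.norm(phi_x)
```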

References

Deep Feature Interpolation for Image Content Changes: https://arxiv.org/abs/1611.05507
A Neural Algorithm of Artistic Style
Autoencoding beyond pixels using a learned similarity metric
Understanding Deep Image Representations by Inverting Them
Labeled Faces in the Wild (LFW) dataset
chainer-dfi (implementation used in this article): https://github.com/dsanno/chainer-dfi
