ConSinGAN: I tried a GAN that can generate images from a single training image

Introduction

GAN (Generative Adversarial Network), a deep learning technique that generates new images from existing ones, is the deep learning technology that impressed me the most. However, training usually requires a huge amount of data and time, so it was difficult to try as an individual. Then I came across the following paper, which develops a GAN that can be trained from just one training image, so I looked into it.

Improved Techniques for Training Single-Image GANs

This paper was submitted in March of this year. It is an improvement of SinGAN (Single-Image GAN), which was proposed last year in:

SinGAN: Learning a Generative Model from a Single Natural Image

GAN

GAN (Generative Adversarial Network) itself is well known and explained on many sites, so I will omit the details. GAN is one of the "generative models": it learns the distribution of the training data and generates similar data. It consists of a Generator that produces fake data and a Discriminator that distinguishes fakes from real data, and the two are trained alternately, so that the Generator eventually produces data indistinguishable from the real thing. This is often compared to counterfeiting banknotes: the counterfeiter (Generator) learns to deceive the police (Discriminator), while the police learn to detect the counterfeit notes. Let the training data be $\boldsymbol{x}$ with distribution $p_{data}(\boldsymbol{x})$, the noise be $z$ with distribution $p_{z}(z)$, the Generator be $G$, and the Discriminator be $D$. If $D(\boldsymbol{x})$ is the probability that the Discriminator classifies $\boldsymbol{x}$ as real and $G(z)$ is the data the Generator produces from $z$, the objective function is as follows.

\min_{G}\max_{D}V(D,G)=\mathbb{E}_{\boldsymbol{x}\sim p_{data}(\boldsymbol{x})}[\log D(\boldsymbol{x})]+\mathbb{E}_{z\sim p_{z}(z)}[\log(1-D(G(z)))]

The Generator learns to make $V(D,G)$ smaller by fooling the Discriminator into classifying its fakes as real ($D(G(z))\to 1$), while the Discriminator learns to make $V(D,G)$ larger by classifying real data as real ($D(\boldsymbol{x})\to 1$) and fakes as fake ($D(G(z))\to 0$). There are many derivatives of GAN, and many impressive applications have been built on them, such as image-to-image translation (line art ⇔ photo, summer photo ⇔ winter photo) and generating images of people who do not exist.
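As a rough illustration only (my own sketch, not code from the paper), one alternating training step for this objective could look like the following in PyTorch, assuming `G` and `D` are models where `D` outputs a probability:

import torch

# One alternating step of the minimax game above (illustrative sketch).
# real: batch of training data, z: noise from p_z; D outputs a probability.
def gan_step(G, D, real, z, opt_G, opt_D):
    # Discriminator: maximize log D(x) + log(1 - D(G(z)))
    opt_D.zero_grad()
    fake = G(z).detach()  # do not backprop into G here
    loss_D = -(torch.log(D(real)).mean() + torch.log(1 - D(fake)).mean())
    loss_D.backward()
    opt_D.step()
    # Generator: minimize log(1 - D(G(z)))
    opt_G.zero_grad()
    loss_G = torch.log(1 - D(G(z))).mean()
    loss_G.backward()
    opt_G.step()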

SinGAN

A conventional GAN requires a large amount of training data, but SinGAN (Single-Image GAN), as its name suggests, can be trained from a single image. SinGAN's tasks include Unconditional Image Generation and Image Harmonization (see below).

Learning method

As shown in the figure (cited from the Paper Short Summary), SinGAN uses multiple Generators, each of which receives the output image of the previous Generator as input. Each Generator is trained individually, with the weights of the earlier Generators fixed during training (first train G0, then fix G0 and train G1, then fix G0 and G1 and train G2, and so on). The important point is that the Discriminator judges the image as patches rather than as a whole, which makes it possible to generate images that look real locally but differ from the training image globally (the split Colosseum in the figure).
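A minimal sketch of this progressive scheme (my own simplification; `train_stage` and the image pyramid are hypothetical stand-ins for the actual training code):

# Progressive SinGAN-style training (simplified sketch).
# pyramid[n] is the training image downscaled to the resolution of stage n.
def train_singan(generators, discriminators, pyramid, train_stage):
    for n, (G, D) in enumerate(zip(generators, discriminators)):
        # Freeze all previously trained generators before training stage n.
        for prev in generators[:n]:
            for p in prev.parameters():
                p.requires_grad = False
        # Train only stage n; earlier stages just produce its input image.
        train_stage(generators[:n + 1], G, D, pyramid[n])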

ConSinGAN

Since this scheme was found to limit the interaction between Generators, the newly proposed method is ConSinGAN (Concurrently-Single-Image GAN), which trains Generators simultaneously (concurrently) without fixing their weights.

Learning method

If all Generators are trained at the same time, overfitting occurs, so the paper incorporates the following two points:

  1. Train only the last three Generators at the same time.
  2. Decrease the learning rate by a factor of 1/10 for each step back to an earlier Generator, as shown in the figure (cited from the Paper Short Summary). The optimizer is Adam.

Lowering this learning-rate scale increases the diversity of the generated images but loses fidelity to the training image, so there is a trade-off.
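For reference, a minimal sketch of this per-stage learning-rate scaling (my own, assuming PyTorch; `active_stages` is a hypothetical list of the concurrently trained Generator stages, and the base learning rate and betas are assumptions, not the paper's values):

import torch

# Give each concurrently trained stage its own learning rate:
# the earlier the stage, the smaller the rate (scaled by 0.1 per stage).
def make_optimizer(active_stages, base_lr=5e-4, lr_scale=0.1):
    groups = []
    for i, stage in enumerate(reversed(active_stages)):
        groups.append({"params": stage.parameters(),
                       "lr": base_lr * (lr_scale ** i)})
    return torch.optim.Adam(groups, betas=(0.5, 0.999))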

Because ConSinGAN needs fewer Generators with this training method, training takes about 1/6 the time of SinGAN while achieving higher performance.

Model architecture


Both the Generator and the Discriminator stack several convolution layers, as shown in the figure (cited from the paper). The feature map that becomes the input to the next Generator has noise added after upsampling to provide diversity, and a residual connection is used so that the output does not deviate too far from the input.
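A minimal sketch of this step (my own reading of the figure; `conv_block` is a hypothetical stack of convolution layers that preserves the feature-map shape):

import torch
import torch.nn.functional as F

# One generator stage (sketch): upsample the previous stage's feature map,
# add noise for diversity, convolve, and add a residual connection.
def stage_forward(conv_block, prev_features, noise_amp=0.1):
    up = F.interpolate(prev_features, scale_factor=2.0, mode="bilinear",
                       align_corners=False)
    noisy = up + noise_amp * torch.randn_like(up)
    return conv_block(noisy) + up  # residual keeps output close to its input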

Objective function

The objective function at stage $n$ is as follows.

\min_{G_{n}}\max_{D_{n}}L_{\rm adv}(G_n,D_n)+\alpha L_{\rm rec}(G_n)

Here, $L_{\rm adv}(G_n,D_n)$ is the adversarial term, which measures how well the Discriminator tells generated images from real ones using the objective of WGAN-GP (paper: https://arxiv.org/abs/1704.00028), based on the Wasserstein distance, and $L_{\rm rec}(G_n)$ is a reconstruction term that stabilizes training via the distance between a generated image and the training image.

L_{\rm adv}(G_n,D_n)=\mathbb{E}_{z\sim p_{z}(z)}[D(G(z))]-\mathbb{E}_{{\boldsymbol x}\sim p_{data}}[D({\boldsymbol x})]+\lambda \mathbb{E}_{{\hat{\boldsymbol x}}\sim p_{{\hat{\boldsymbol x}}}}[(||\nabla_{{\hat{\boldsymbol x}}} D({\hat{\boldsymbol x}})||_2-1)^2]

L_{\rm rec}(G_n)=||G_n(x_0)-x_n||_2^2

$\alpha$ is a constant (default 10), $\hat{\boldsymbol x}$ is a point on the line segment connecting the training data and the generated data, $\lambda$ is a constant, $x_n$ is the training image at stage $n$, and $x_0$ is the input image to $G_n$. Note that the input used in $L_{\rm adv}(G_n,D_n)$ differs by task: in Unconditional Image Generation it is noise, while in Image Harmonization an augmented version of the training image (partially cropped, color-shifted, etc., with noise added) reportedly works better.
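A minimal sketch of the two terms (my own, in PyTorch; the reductions and the value of `lam` are assumptions, not necessarily the repository's):

import torch

# WGAN-GP gradient penalty term of L_adv: sample x_hat on the line between
# real and generated data and push ||grad of D at x_hat|| toward 1.
def gradient_penalty(D, real, fake, lam=0.1):
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    return lam * ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

# Reconstruction term L_rec: squared L2 distance between G_n(x_0) and x_n.
def rec_loss(G_n, x_0, x_n):
    return ((G_n(x_0) - x_n) ** 2).sum()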

Tasks

This time I tried Unconditional Image Generation and Image Harmonization. Other interesting tasks are also described, so if you are interested, please see the paper.

Unconditional Image Generation

As shown in the figure (cited from the Paper Short Summary), adding random noise $z$ to the training image produces realistic, non-existent images while preserving the global structure. (You can see that the layout adapts even when the image size is changed.)


Image Harmonization

As shown in the figure (cited from the Paper Short Summary), the model is trained on an image (a painting, etc.) and harmonizes objects added to it with the style of the trained image.

Fine tuning

With Image Harmonization, you can get better results by further training the trained model on the naive image (Naive in the figure above); this is Fine-tune in the figure.

Implementation

I cloned the public GitHub repository and ran it on Google Colaboratory, which offers free GPU access.

Setup


#Clone the repository
!git clone https://github.com/tohinz/ConSinGAN.git
#Move into it and install the required libraries
%cd ConSinGAN
!pip install -r requirements.txt

Along the way I got an error saying that some library versions differ from the ones preinstalled on Colaboratory, but everything ran fine.

Unconditional Generation

Place the training image in Images/Generation/. I used a photo of the Rialto Bridge (rialto.jpg, 1867 x 1400 pixels) that I took in Venice some time ago.



!python main_train.py --gpu 0 --train_mode generation --input_name Images/Generation/rialto.jpg

--gpu specifies the GPU to use (0 by default), --train_mode specifies the task (generation, harmonization, etc.), and --input_name is the path of the training image. Although I left them unchanged this time, the learning-rate scale (--lr_scale) and the number of Generator stages (--train_stages) can also be set. By default there are 5 Generators, each trained for 2000 iterations. Training took 69 minutes, probably because of the large image size. The results are saved under TrainedModels/rialto/yyyy_mm_dd_hh_mm_ss_generation_train_depth_3_lr_scale_0.1_act_lrelu_0.05. Here are some generated samples.

(Generated samples: gen_sample_4.jpg, gen_sample_8.jpg, gen_sample_23.jpg)

It's a little noisy, but warped, non-existent variations of the bridge were generated.
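To view the samples inside the notebook, something like this works (the glob pattern is an assumption about the output layout, based on the file names above; adjust it to the actual results directory):

import glob
import matplotlib.pyplot as plt
from PIL import Image

# Show a few generated samples (path pattern assumed, not from the repo docs).
paths = sorted(glob.glob("TrainedModels/rialto/*/gen_samples/*.jpg"))[:3]
fig, axes = plt.subplots(1, max(len(paths), 1), squeeze=False, figsize=(12, 4))
for ax, p in zip(axes[0], paths):
    ax.imshow(Image.open(p))
    ax.axis("off")
plt.show()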

Image Harmonization

Painting × Cat

Let's harmonize a photo of a cat with a painting (a haymaking scene). The training image (hoshikusa.jpg, 300 x 300 pixels) is on the left, and the naive image (hoshikusa_naive.jpg, 142 x 130 pixels) is on the right. Place these in Images/Harmonization/. Originally a mask image that cuts out the added object should also be provided, but since I could not create one, I ran without a mask.

Training


!python main_train.py --gpu 0 --train_mode harmonization --train_stages 3 --min_size 120 --lrelu_alpha 0.3 --niter 1000 --batch_norm --input_name Images/Harmonization/hoshikusa.jpg 

Training took 15 minutes. Next, apply the naive image to the trained model.

Evaluation


!python evaluate_model.py --gpu 0 --model_dir TrainedModels/hoshikusa/yyyy_mm_dd_hh_mm_ss_harmonization_train_depth_3_lr_scale_0.1_BN_act_lrelu_0.3 --naive_img Images/Harmonization/hoshikusa_naive.jpg

The results are saved under TrainedModels/hoshikusa/yyyy_mm_dd_hh_mm_ss_harmonization_train_depth_3_lr_scale_0.1_BN_act_lrelu_0.3/Evaluation/.

The resolution is poor, but the color of the cat in the photo has changed to match the painting. Let's try fine-tuning as well.

Fine tuning


!python main_train.py --gpu 0 --train_mode harmonization --input_name Images/Harmonization/hoshikusa.jpg --naive_img Images/Harmonization/hoshikusa_naive.jpg --fine_tune --model_dir TrainedModels/hoshikusa/yyyy_mm_dd_hh_mm_ss_harmonization_train_depth_3_lr_scale_0.1_BN_act_lrelu_0.3

With the default 2000 iterations (11 minutes), the cat blended into the background color and overfitted, as shown on the left. The paper states that about 500 iterations are sufficient (right).

Black and white background × Ramen

Let's convert a photo into a cartoon style. The left is the training image (pen_building.jpg, 600 x 337 pixels) and the right is the naive image (pen_building_naive.jpg, 283 x 213 pixels).

I ran the same steps, and the results were as follows (training time: 9 minutes). The left is the normal evaluation, and the right is after fine-tuning (100 iterations).

The ramen became cartoon-style. For this task, fine-tuning does not seem necessary.

Summary

This time I read the paper on ConSinGAN, an improved version of SinGAN that can be trained from a single image, and tried the model myself, since GANs are what interest me most personally. Image generation techniques are fun because the results are impressive and easy to understand. However, it was a pity that I could not publish what I made with my favorite anime and manga images due to copyright concerns.
