StyleGAN Paper Introduction & Experiment

Introduction

GANs have been evolving rapidly in recent years, and I wanted to study StyleGAN, one of the most famous of them, so I chose it as my topic.

In the first half, as an introduction to the paper, I summarize what I learned about the structure and characteristics of StyleGAN. In the second half, I describe the results of generating images with an actually trained StyleGAN model.

Paper introduction

StyleGAN (v1)

StyleGAN was announced in 2018. (Paper link) The image below, quoted from the paper, shows examples generated by StyleGAN; I think the quality is high enough that they are indistinguishable from real photographs. stylegan-teaser.png

The features of StyleGAN will be explained below.

Generator structure

The distinguishing feature of StyleGAN lies mainly in the structure of the generator. In the figure below from the paper, the left side shows the generator of a conventional GAN (here, PGGAN) and the right side shows the StyleGAN generator. generator.png

In a conventional GAN, a latent variable (latent z) is randomly sampled and fed into the first layer of the generator. In StyleGAN, on the other hand, the first input of the generator is a learned constant; latent z is first transformed by the Mapping network, and the resulting style is injected at multiple points in the middle of the generator via AdaIN. In addition, randomly generated noise is added at each layer of the generator.
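As a rough sketch of how a style is injected via AdaIN (a hypothetical minimal module for illustration, not the implementation used later in this post): the style w coming out of the Mapping network is turned into a per-channel scale and bias by a learned affine layer and applied to instance-normalized feature maps.

import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Minimal AdaIN sketch: normalize each feature map per channel and per sample,
    then re-scale and shift it with parameters derived from the style vector w."""
    def __init__(self, channels, w_dim=512):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels)       # per-sample, per-channel normalization
        self.affine = nn.Linear(w_dim, channels * 2)  # learned affine map: w -> (scale, bias)

    def forward(self, x, w):
        # x: (N, C, H, W) feature maps, w: (N, w_dim) style from the mapping network
        style = self.affine(w).unsqueeze(-1).unsqueeze(-1)  # (N, 2C, 1, 1)
        scale, bias = style.chunk(2, dim=1)
        return (1 + scale) * self.norm(x) + bias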

Style Mixing

The style information fed to each layer of the generator does not have to be the same; multiple styles can be combined. For example, given latent variables z1 and z2 that generate images A and B, feeding the style derived from z1 up to a certain layer of the generator and the style derived from z2 from that layer onward produces an image that looks like a mixture of A and B. (Style mixing)

It is known that if the switch from z1 to z2 happens in the early layers, the coarse features of B (face orientation and shape) are reflected, whereas if it happens in the later layers, only fine features (hair color, etc.) are reflected.
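A minimal sketch of the idea (assuming a generator that takes one style per layer; mapping and crossover are illustrative names, and 18 layers corresponds to a 1024x1024 generator):

def style_mixing_ws(mapping, z1, z2, crossover, n_layers=18):
    # Map both latents to intermediate styles, then assign one style per generator layer:
    # w1 (from z1) before the crossover layer, w2 (from z2) from the crossover layer onward.
    # An early crossover transfers B's coarse features (face orientation, shape);
    # a late crossover transfers only fine details (hair color, etc.).
    w1, w2 = mapping(z1), mapping(z2)
    return [w1 if i < crossover else w2 for i in range(n_layers)]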

Progressive Growing

This is the training method proposed by PGGAN (Progressive Growing GAN). * StyleGAN is based on PGGAN. It is reported that high-resolution images such as 1024x1024 can be generated stably by raising the resolution of the generated image gradually, adding layers to the generator and discriminator step by step during training.
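A rough sketch of the fade-in used when a new resolution stage is added (my own simplified illustration of the PGGAN idea, not code from the repository): the RGB output of the previous, lower-resolution stage is upsampled and blended with the new stage's output while alpha is ramped from 0 to 1.

import torch.nn.functional as F

def fade_in(rgb_prev, rgb_new, alpha):
    # rgb_prev: output of the previous (half-resolution) stage, rgb_new: output of the new stage.
    # alpha grows from 0 to 1 during training, gradually handing control to the new layers.
    rgb_prev_up = F.interpolate(rgb_prev, scale_factor=2, mode='nearest')
    return (1 - alpha) * rgb_prev_up + alpha * rgb_new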

However, Progressive Growing seems to have some drawbacks, and StyleGAN2, described below, addresses them as well.

StyleGAN2

StyleGAN2 was announced in 2019 as an improved version of StyleGAN. (Paper link) The following is an example of images generated by StyleGAN2. stylegan2-teaser-1024x256.png It is hard to see a difference in quality from StyleGAN at a glance, but it is reported that the characteristic water-droplet pattern produced by StyleGAN has been eliminated and that FID, a metric of image quality, has improved significantly.

The main improvements are as follows.

- AdaIN-equivalent processing realized in a single Conv layer (Weight demodulation)
- Improved regularization (Path length regularization, Lazy regularization)
- Improved network structure, eliminating the need for Progressive Growing

Each item is described below.

Achieves processing equivalent to AdaIN in a single Conv layer

In StyleGAN, the normalization process of AdaIN was found to cause water droplet-like artifacts, and StyleGAN2 has improved this. Below, the structure of the generator is quoted from the paper. stylegan2_generator.png

On the left, (a) and (b) show the original StyleGAN, and the rightmost (d) is the result of changing to a form that does not use AdaIN. The effect of the normalization performed by AdaIN is achieved instead by an operation called Weight demodulation (dividing the weights of the Conv layer by their standard deviation).

The point is that Weight demodulation is based on an assumed distribution of the signal rather than the statistics of the actual input data, and it is reported that this resolved the water-droplet artifact problem.
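A simplified sketch of the modulation/demodulation for a single sample (illustrative only; the shapes and names here are assumptions): the style scales the convolution weights per input channel, and each output channel's weights are then divided by their L2 norm, which plays the role of the normalization that AdaIN used to perform.

import torch
import torch.nn.functional as F

def modulated_conv2d(x, weight, style, eps=1e-8):
    # x: (1, C_in, H, W), weight: (C_out, C_in, k, k), style: (C_in,) scale derived from w.
    w = weight * style.view(1, -1, 1, 1)                    # modulation: scale per input channel
    demod = torch.rsqrt(w.pow(2).sum(dim=[1, 2, 3]) + eps)  # expected 1/std per output channel
    w = w * demod.view(-1, 1, 1, 1)                         # demodulation (no data statistics used)
    return F.conv2d(x, w, padding=weight.shape[-1] // 2)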

Improved regularization

Because the Perceptual Path Length, which indicates how perceptually smooth the latent space is, turns out to be important for the quality of the generated images, it is added as a regularization term. (Path length regularization) My understanding here is a bit vague, but I take it to mean that latent vectors z that are close to each other in the latent space should generate (be trained to generate) perceptually similar images.
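My rough reading of it in code form (a simplified sketch, assuming w has shape (N, 512); target_a stands in for the paper's running average of path lengths): random image-space noise is pushed back through the generator's Jacobian with autograd, and the resulting gradient norm is pulled toward a constant, so that a step of fixed size in w causes a perceptual change of fixed size in the image.

import torch

def path_length_penalty(fake_img, w, target_a):
    # Scale the noise so the inner product does not depend on the image resolution.
    noise = torch.randn_like(fake_img) / (fake_img.shape[2] * fake_img.shape[3]) ** 0.5
    grad, = torch.autograd.grad((fake_img * noise).sum(), w, create_graph=True)
    path_length = grad.pow(2).sum(dim=1).sqrt()  # per-sample norm of J^T y
    return (path_length - target_a).pow(2).mean()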

It is also reported that updating the regularization term less frequently than the main loss term does not hurt the score. (Reducing the update frequency of the regularization term is called "Lazy regularization".) This reduces computation cost and memory usage, and also helps shorten training time.
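In a training loop this would look roughly like the helper below (an illustrative sketch, not the actual implementation): the regularization term is only evaluated every reg_interval steps and is scaled by the interval so that its average strength stays the same.

def add_lazy_regularization(step, main_loss, reg_fn, reg_interval=16):
    # reg_fn computes the (expensive) regularization term, e.g. the path length penalty.
    if step % reg_interval == 0:
        return main_loss + reg_interval * reg_fn()
    return main_loss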

Improved network structure, eliminating the need for Progressive Growing

Progressive Growing has the advantage that high-resolution image generation can be trained stably, but it has the problem that local parts such as eyes and teeth do not follow the overall movement (face orientation). The figure below is an example: you can see that the alignment of the teeth does not move even when the direction of the face changes. Compared with GANs from a few years ago, I find it amazing that we are now discussing details at this level. phase_artifacts.png

The above problem is attributed to Progressive Growing's tendency to overly favor high-frequency details as the resolution is raised step by step, so the network structure was redesigned to make training succeed without it. The experiments show that introducing skip connections to both the generator and the discriminator is effective, and high quality images could be generated without Progressive Growing. (→ This solves the problem of teeth and eyes not following the direction of the face.)
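The "skip" generator can be sketched roughly as follows (a hypothetical outline of the structure, not the repository's code; each block is assumed to upsample its input internally): every resolution block emits an RGB image through its own toRGB layer, and the final image is the running sum of these contributions, upsampled stage by stage, so coarse images get refined without Progressive Growing.

import torch.nn.functional as F

def skip_generator_forward(blocks, to_rgbs, x, ws):
    img = None
    for block, to_rgb, w in zip(blocks, to_rgbs, ws):
        x = block(x, w)        # feature maps at the current resolution
        rgb = to_rgb(x, w)     # this resolution's RGB contribution
        if img is None:
            img = rgb
        else:
            img = F.interpolate(img, scale_factor=2, mode='bilinear', align_corners=False) + rgb
    return img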

... That's it for the study part; from here on, let's actually try image generation with StyleGAN.

Experiment with trained model

In conducting the image generation experiment, I used the StyleGAN implementation published below. (GitHub) stylegans-pytorch

The official StyleGAN implementation is in TensorFlow, but the repository above is a reproduction in PyTorch. The detailed explanations, from environment setup to the procedure for converting the pretrained weights, were very helpful. It supports both StyleGAN1 and StyleGAN2, but this time I tried StyleGAN1. (Because more types of pretrained weights seemed to be available for it.)

Preparation / operation check

The environment is as follows. It is a gaming PC set up with Ubuntu, CUDA, and so on.

OS : Ubuntu 18.04.4 LTS
GPU : GeForce RTX 2060 SUPER x1

The preparation went smoothly by following the README. The images actually generated with the converted weights are shown below. stylegan1_pt.png

Since the same images as the original ones were generated, I could confirm that the weight conversion and the generator's forward pass were working correctly.

Trained models for 2D (anime) character generation are also supported, so I tried those as well. [face_v1_1] anime_face_v1_1_pt.png [portrait_v1] anime_portrait_v1_pt.png

Interpolation experiment

Up to this point I had managed to generate images with StyleGAN, and since I had come this far, I decided to also try interpolating the latent variable z, as is often seen in GAN demo videos.

Referring to the repository's waifu/run_pt_stylegan.py, I wrote a script that interpolates z with the anime-character trained models and saves the result as a gif.

anime_face_interpolation.py


import argparse
from pathlib import Path
import pickle
import numpy as np
import cv2
import torch
from tqdm import tqdm

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model', type=str, default='face_v1_1',
                        choices=['face_v1_1','face_v1_2','portrait_v1','portrait_v2'])
    parser.add_argument('--weight_dir', type=str, default='../../data')
    args = parser.parse_args()
    return args

def prepare_generator(args):
    from run_pt_stylegan import ops_dict, setting
    if 'v1' in args.model:
        from stylegan1 import Generator, name_trans_dict
    else:
        from stylegan2 import Generator, name_trans_dict

    generator = Generator()

    cfg = setting[args.model]
    with (Path(args.weight_dir)/cfg['src_weight']).open('rb') as f:
        src_dict = pickle.load(f)

    new_dict = {k : ops_dict[v[0]](src_dict[v[1]]) \
                for k,v in name_trans_dict.items() if v[1] in src_dict}
    generator.load_state_dict(new_dict)
    return generator

def make_latents_seq():
    n_latent_point = 3
    interpolation_step = 13
    n_image = 3
    latent_dim = 512

    # Randomly generate the latents used as starting points
    points = np.random.randn(n_latent_point, n_image, latent_dim)

    results = []
    for i in range(n_latent_point):
        s = points[i]
        e = points[i+1] if i+1 < n_latent_point else points[0]
        latents_ = np.linspace(s, e, interpolation_step, endpoint=False)  # linear interpolation between anchors
        results.append(latents_)

    return np.concatenate(results)

def generate_image(generator, latents, device):
    img_size = 320
    latents = torch.from_numpy(latents.astype(np.float32))

    with torch.no_grad():
        N, _ = latents.shape
        generator.to(device)
        images = np.empty((N, img_size, img_size, 3), dtype=np.uint8)

        for i in range(N):
            z = latents[i].unsqueeze(0).to(device)
            img = generator(z)
            normalized = (img.clamp(-1, 1) + 1) / 2 * 255
            np_img = normalized.permute(0, 2, 3, 1).squeeze().cpu().numpy().astype(np.uint8)
            images[i] = cv2.resize(np_img, (img_size, img_size),
                                   interpolation=cv2.INTER_CUBIC)

    def make_table(imgs):
        num_H, num_W = 1, 3  # number of images to tile (rows, columns)
        H = W = img_size
        num_total = num_H * num_W

        canvas = np.zeros((H*num_H, W*num_W, 3), dtype=np.uint8)
        for i, p in enumerate(imgs[:num_total]):
            h, w = i//num_W, i%num_W
            # Place each image in its grid cell; [:, :, ::-1] converts RGB to BGR for OpenCV
            canvas[H*h:H*(h+1), W*w:W*(w+1), :] = p[:, :, ::-1]
        return canvas

    return make_table(images)

def save_gif(images, save_path, fps=10):
    from moviepy.editor import ImageSequenceClip
    images = [cv2.cvtColor(img, cv2.COLOR_BGR2RGB) for img in images]
    clip = ImageSequenceClip(images, fps=fps)
    clip.write_gif(save_path)

if __name__ == '__main__':
    args = parse_args()
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

    generator = prepare_generator(args).to(device)

    latents_seq = make_latents_seq()

    print('generate images ...')
    frames = []
    for latents in tqdm(latents_seq):
        img = generate_image(generator, latents, device)
        frames.append(img)

    save_gif(frames, f'{args.model}_interpolation.gif')

In make_latents_seq(), a sequence of z is generated by linear interpolation between three latent anchor points (looping back to the first one at the end).

The results are as follows. [face_v1_1] face_v1_1_interpolation_2c.gif [portrait_v1] portrait_v1_interpolation_2c.gif

As expected, I got videos in which the characters' faces change continuously. It depends on the training data, but it is interesting to see various drawing styles mixed together. (Some of the faces look strangely familiar...) The areas around the head and below the shoulders seem to change rather randomly, but the face itself remains a proper face at every moment of the interpolation. Perhaps this means the latent space is connected in a perceptually smooth way?

Also, if you don't care about the gif file size, you can increase the interpolation_step to make the animation smoother.

Closing

This post introduced the StyleGAN papers and presented an experiment with a trained model. I would appreciate it if you could point out any mistakes in my interpretation of the papers. It would probably take a long time on my home machine, but I would like to try training a model myself someday.
