[Python] Try running wav2pix to generate face images from voice (anime face generation included)

Introduction

The field of deep learning, and Generative Adversarial Networks (GANs) in particular, has grown dramatically in recent years, with active research in areas such as text-to-image synthesis, voice conversion, and sound source separation.

In this post, I will give an informal introduction to wav2pix, a model that generates face images from voice.

Paper: WAV2PIX: SPEECH-CONDITIONED FACE GENERATION USING GENERATIVE ADVERSARIAL NETWORKS

Rough overview

(Figure: wav2pix model overview, from https://imatge-upc.github.io/wav2pix/)

The proposed model consists of three modules: a Speech Encoder, a Generator Network, and a Discriminator Network.

I will briefly explain each module.

First, the Speech Encoder appears to reuse the decoder of the Speech Enhancement Generative Adversarial Network (SEGAN), an end-to-end GAN model for speech enhancement. I will omit the detailed explanation, but please refer to the demo here.

Next, the Generator Network and Discriminator Network appear to be inspired by Least Squares Generative Adversarial Networks (LSGAN). This site helped deepen my understanding of LSGAN.
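As a reminder of what LSGAN changes, here is a minimal sketch of its least-squares losses in plain Python. This is my own illustration of the loss formulas, not code taken from the wav2pix repository:

```python
def mean(xs):
    """Arithmetic mean of a list of discriminator scores."""
    return sum(xs) / len(xs)

def lsgan_d_loss(d_real, d_fake):
    # Least-squares discriminator loss:
    # push D(real) toward 1 and D(fake) toward 0.
    return 0.5 * mean([(x - 1.0) ** 2 for x in d_real]) \
         + 0.5 * mean([x ** 2 for x in d_fake])

def lsgan_g_loss(d_fake):
    # Least-squares generator loss: push D(fake) toward 1.
    return 0.5 * mean([(x - 1.0) ** 2 for x in d_fake])
```

Unlike the original GAN's sigmoid cross-entropy, these quadratic penalties keep gradients flowing even for samples the discriminator classifies confidently, which is the stability benefit LSGAN is known for.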

Quick Start

From here, I will walk through running the wav2pix sample.

Execution environment

The environment I used this time is as follows.

OS: Ubuntu 18.04 LTS
CPU: i3-4130 3.40GHz
Memory: 16GB
GPU: GeForce GTX 1660 Ti (6GB)

Docker Version: Docker version 19.03.8

1. Get & build Dockerfile

imatge-upc/wav2pix describes how to run the code, but I made my own Dockerfile for anyone who finds it troublesome to set up the environment. So this article mainly covers running everything with Docker.

First, let's get the Dockerfile I made. ★ Please note the following points!

--The image to be built is about 5.5GB
--Building the image takes a fair amount of time

host


$ git clone https://github.com/Nahuel-Mk2/docker-wav2pix.git
$ cd docker-wav2pix/
$ docker build . -t docker-wav2pix

When it finishes, confirm that the image exists.

host


$ docker images
REPOSITORY          TAG                               IMAGE ID            CREATED             SIZE
docker-wav2pix      latest                            8265bc421f7a        4 hours ago         5.36GB

2. Start Docker / Train / Test

2.1. Start Docker

Let's start Docker.

host


$ docker run -it --rm --gpus all --ipc=host docker-wav2pix

2.2. Train

A little preparation is needed before training: overwrite the required paths in the config file and save it. ★ If you skip this, both train and test will throw errors at runtime, so be careful!

container


$ echo -e "# training pickle files path:\n"\
"train_faces_path: /home/user/wav2pix/pickle/faces/train_pickle.pkl\n"\
"train_audios_path: /home/user/wav2pix/pickle/audios/train_pickle.pkl\n"\
"# inference pickle files path:\n"\
"inference_faces_path: /home/user/wav2pix/pickle/faces/test_pickle.pkl\n"\
"inference_audios_path: /home/user/wav2pix/pickle/audios/test_pickle.pkl" > /home/user/wav2pix/config.yaml
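If you prefer Python to a long echo command, the same file contents can be produced with a small helper. This is my own sketch; the root path simply matches the container layout used in this article:

```python
def build_config(root="/home/user/wav2pix"):
    """Build the YAML contents written by the echo command above."""
    lines = [
        "# training pickle files path:",
        f"train_faces_path: {root}/pickle/faces/train_pickle.pkl",
        f"train_audios_path: {root}/pickle/audios/train_pickle.pkl",
        "# inference pickle files path:",
        f"inference_faces_path: {root}/pickle/faces/test_pickle.pkl",
        f"inference_audios_path: {root}/pickle/audios/test_pickle.pkl",
    ]
    return "\n".join(lines) + "\n"

# Inside the container you would then write it out with:
#   open("/home/user/wav2pix/config.yaml", "w").write(build_config())
```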

Once that is done, let's run training.

container


$ cd wav2pix
$ python runtime.py

★ Training took about 3 hours in my environment. Either wait patiently, or specify the number of epochs as follows to finish earlier. ★ You can safely ignore the Visdom error.

container


$ python runtime.py --epochs 100

--epochs: Specifying the number of epochs (default 200)

2.3. Test

When training is finished, run the test. Since the trained model must be loaded, more arguments are needed than when running training.

container


$ python runtime.py --pre_trained_disc /home/user/wav2pix/checkpoints/disc_200.pth --pre_trained_gen /home/user/wav2pix/checkpoints/gen_200.pth --inference

--pre_trained_disc: Path to the trained Discriminator
--pre_trained_gen: Path to the trained Generator
--inference: Flag to run inference

When you're done, check the generated image.

host


$ docker cp 89c8d43b0765:/home/user/wav2pix/results/ .

★ If you don't know the CONTAINER ID, run "docker ps" to check it.

host


$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
89c8d43b0765        docker-wav2pix      "/bin/bash"         4 hours ago         Up 4 hours                              vigilant_murdock

3. Images generated from audio (excerpt)

(Figures: ←generated image | real image→ pairs for jaimealtozano and javiermuniz; images omitted)

For the two people in this sample dataset, the generated images are, I think, recognizably faces, and each person's individual characteristics come through to some extent. However, as pointed out in the paper, the images are rough.

Bonus

Ex.1 Creating the dataset needed to generate anime face images

From here, I would like to generate anime face images with wav2pix. That said, no existing dataset pairs audio with anime face images, so you need to create your own. Following the YouTuber dataset built in the paper, I created a Virtual YouTuber (VTuber) dataset.

The figure below shows the dataset-creation flow explained in the paper: a YouTuber video is processed separately into video frames and speech, which are finally combined into paired data. The only major change I made is the face-detection cascade file; the cascade file used is here.

(Figure: dataset-creation flow)
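The pairing step in this flow boils down to timestamp arithmetic: for every face frame that is kept, cut out the chunk of speech surrounding it. A minimal pure-Python sketch; the frame rate, sample rate, and chunk length here are my own assumed values, not numbers taken from the paper:

```python
def audio_window_for_frame(frame_idx, fps=25, sample_rate=16000, chunk_sec=4.0):
    """Return (start, end) sample indices of the speech chunk centred on a video frame."""
    t = frame_idx / fps                      # timestamp of the frame in seconds
    half = chunk_sec / 2.0
    start = max(0, int((t - half) * sample_rate))  # clamp at the start of the audio
    end = start + int(chunk_sec * sample_rate)
    return start, end

# Frame 250 at 25 fps sits at t = 10 s; a 4 s chunk covers samples for 8 s..12 s.
print(audio_window_for_frame(250))  # (128000, 192000)
```

Each such window, together with the cropped face from the corresponding frame, becomes one (audio, image) pair in the pickle files that the training script reads.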

The videos used to create data for the VTubers targeted this time are listed below. For the audio, I used only sections without BGM or sound effects. (Titles abbreviated.)

--Kizuna AI
  [Broadcast accidents will air as-is!] 1-million-subscriber thank-you commemorative LIVE stream!! / [Live broadcast] We all talked about anime!

--Neko
  To become a virtual Youtuber [Live008]

--Suisei Hoshimachi
  [Official] "Suisei Hoshimachi's MUSIC SPACE" #01 first half (broadcast April 5, 2020) / [Official] "Suisei Hoshimachi's MUSIC SPACE" #01 second half (broadcast April 5, 2020) / [Official] "Suisei Hoshimachi's MUSIC SPACE" #04 first half (broadcast April 26, 2020)

Ex.2 Images generated from voice (anime faces)

First, here are the anime face images generated at each epoch.

At epoch 10 the generated faces all look alike, but as the epoch count increases, each character's individuality is clearly reflected in the generated images. Now, let's compare the following two sets to see how close the generated images come to the real thing.

◆ Comparison of generated images (epoch 200) with the real ones

(part1 / part2: ←generated image | real image→; image pairs omitted)

Part 1 confirms that images firmly capturing each VTuber's individuality were generated from their voices. However, part 2 shows that images quite different from the real ones can also appear, which I think is worth noting.

Summary

This time, I explained wav2pix, which generates face images from voice, and ran the sample. I also generated anime face images by swapping in my own dataset. The anime faces came out more recognizable than I expected, so raising the resolution could be a good next step. I also can't help wondering whether, with a more varied set of face images, it might someday be possible to generate illustrations from audio.

Reference site

SPEECH-CONDITIONED FACE GENERATION USING GENERATIVE ADVERSARIAL NETWORKS
I did machine learning to switch the voices of Kizuna AI and Nekomasu
Docker run reference
