[PYTHON] Implement a Conditional Variational Autoencoder (CVAE) in TensorFlow 2

TL;DR

- A type of autoencoder, the **Conditional Variational Autoencoder** (**CVAE**), has been implemented by modifying the official TensorFlow sample.
- I played with the implemented CVAE by training it on MNIST data.
- Sample code is available in the following repository.

github.com/kn1cht/tensorflow_v2_cvae_sample

sample-cvae-mnist.ipynb Google Colab GitHub
sample-cvae-mnist-manifold.ipynb Google Colab GitHub
manifold-cvae-2.png Generated 2D Manifold

Introduction

A **Conditional Variational Autoencoder (CVAE)** is a (semi-)supervised generative model that can generate data corresponding to a given label. It can be realized simply by adding a label as an input to the Variational Autoencoder (VAE) introduced in "Comparison of AutoEncoder, VAE, CVAE - Why can VAE generate continuous images?".

Implementation examples in Chainer, PyTorch, and TensorFlow 1.x can be found online, but there seems to be no example written for TensorFlow 2.x. Therefore, I wrote this article, which also served as a way to study TensorFlow itself. In implementing it, **I modified the official TensorFlow VAE sample as little as possible**, so I hope that following the changes will help you understand how it works.

Environment

Model description

Since easy-to-understand explanations already exist elsewhere, I will only describe the models briefly here.

Autoencoder (AE)

The autoencoder was originally proposed as an unsupervised method for **dimensionality reduction**. An AE consists of two models, an Encoder and a Decoder: the Encoder compresses the input $\boldsymbol{x}$ into $\boldsymbol{z}$, and the Decoder attempts to reproduce $\boldsymbol{x}$ from $\boldsymbol{z}$. The $\boldsymbol{z}$ that appears in the middle is called a **latent variable** and can be seen as representing the characteristics of the data in a lower-dimensional form.

Besides reconstructing the input as-is, AE can also be applied to denoising and anomaly detection.

Variational Autoencoder (VAE)

VAE makes the model usable for **data generation** by incorporating a probability distribution. The Encoder estimates the parameters $\mu$ and $\sigma$ of a multivariate Gaussian distribution, and $\boldsymbol{z}$ is drawn from the resulting distribution. Since $\boldsymbol{z}$ comes from a continuous probability distribution, it is possible to generate data that does not exist in the dataset.

During actual training, $\boldsymbol{z}$ is computed via an approximation called the Reparametrization Trick so that backpropagation can be performed. In addition, regularization is applied by including the KL divergence (Kullback-Leibler divergence) between the obtained distribution and the standard normal distribution in the objective function.
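As a reference, here is a minimal sketch of what the Reparametrization Trick looks like in code, following the convention of the official TensorFlow sample referenced later (shapes and naming are illustrative):

import tensorflow as tf

def reparameterize(mean, logvar):
    # Sample eps ~ N(0, I) and shift/scale it so that gradients can flow
    # through mean and logvar: z = mean + exp(logvar / 2) * eps
    eps = tf.random.normal(shape=tf.shape(mean))
    return eps * tf.exp(logvar * .5) + mean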

The objective function maximized by VAE is shown below. The first term on the right-hand side is the expected log-likelihood of the output obtained from the Decoder, and the second term is the regularization term.

\mathcal{L}(\boldsymbol{x},\boldsymbol{z}) = \mathbb{E}_{q(\boldsymbol{z}|\boldsymbol{x})}[\log p(\boldsymbol{x}|\boldsymbol{z})] - D_{KL}[q(\boldsymbol{z}|\boldsymbol{x})||p(\boldsymbol{z})]

Conditional Variational Autoencoder (CVAE)

CVAE enables ** data generation by specifying a label ** by adding label $ y $ to each of Encoder and Decoder as input. Since it gives label information, it becomes supervised learning, but if you devise it, it seems that it can also be used as semi-supervised learning that does not require labels for all.

The objective function changes from that of VAE as follows. However, only the Encoder and Decoder need to take $y$ into account; the implementation can use the same objective function as VAE.

\mathcal{L}(\boldsymbol{x},\boldsymbol{z},y) = \mathbb{E}_{q(\boldsymbol{z}|\boldsymbol{x},y)}[\log p(\boldsymbol{x}|\boldsymbol{z},y)] - D_{KL}[q(\boldsymbol{z}|\boldsymbol{x},y)||p(\boldsymbol{z}|y)]

Implementation of CVAE

Now let's implement CVAE. According to the original paper, semi-supervised learning is possible by combining a VAE (M1 model) and a CVAE (M2 model), but this article does not go that far: we train with labels attached to all of the data.

The entire code is published in the following repository.

TensorFlow Official VAE Sample

A VAE sample is included in the official TensorFlow tutorials.

**It is published on Google Colaboratory**, so it works with just a click.

Note that here "CVAE" stands for Convolutional Variational Autoencoder, which is **different from the CVAE described in this article**. The model in this sample is an ordinary VAE that simply uses convolution layers.

We will modify this VAE, which learns MNIST, to realize a Conditional VAE.

From VAE to CVAE

The model-definition part extracted from the above sample code is included in the repository as vae.py, and the CVAE created from it as cvae.py.

Since it is easier to understand by looking at the differences, I will explain each part alongside the diff between the two files.

CVAE.__init__()

--- vae.py
+++ cvae.py

-  def __init__(self, latent_dim):
+  def __init__(self, latent_dim, label_size):
     super(CVAE, self).__init__()
-    self.latent_dim = latent_dim
+    (self.latent_dim, self.label_size) = (latent_dim, label_size)
     self.encoder = tf.keras.Sequential(
         [
-            tf.keras.layers.InputLayer(input_shape=(28, 28, 1)),
+            tf.keras.layers.InputLayer(input_shape=(28, 28, label_size + 1)),
             tf.keras.layers.Conv2D(
                 filters=32, kernel_size=3, strides=(2, 2), activation='relu'),
             tf.keras.layers.Conv2D(
@@ -30,7 +31,7 @@ class CVAE(tf.keras.Model):

     self.decoder = tf.keras.Sequential(
         [
-            tf.keras.layers.InputLayer(input_shape=(latent_dim,)),
+            tf.keras.layers.InputLayer(input_shape=(latent_dim + label_size,)),
             tf.keras.layers.Dense(units=7*7*32, activation=tf.nn.relu),
             tf.keras.layers.Reshape(target_shape=(7, 7, 32)),
             tf.keras.layers.Conv2DTranspose(

First is the model definition. The number of label classes for $y$ (10 for MNIST) is passed as label_size, and the input size of both the Encoder and the Decoder is increased by label_size. The label, converted to a one-hot representation, can now be concatenated with the input.
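As a small usage sketch (the concrete values are only illustrative, not prescribed by the repository):

model = CVAE(latent_dim=64, label_size=10)  # 10 label classes for the MNIST digits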

CVAE.sample()

   @tf.function
-  def sample(self, eps=None):
+  def sample(self, eps=None, y=None):
     if eps is None:
       eps = tf.random.normal(shape=(100, self.latent_dim))
-    return self.decode(eps, apply_sigmoid=True)
+    return self.decode(eps, y, apply_sigmoid=True)

sample() receives latent variables and labels and generates data.
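For example, a hedged sketch of generating one image per digit with the model instantiated above:

eps = tf.random.normal(shape=(10, 64))   # one latent vector per digit (latent_dim = 64)
labels = tf.constant(list(range(10)))    # labels 0 through 9
generated = model.sample(eps, labels)    # expected output shape: (10, 28, 28, 1)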

CVAE.encode()

-  def encode(self, x):
-    mean, logvar = tf.split(self.encoder(x), num_or_size_splits=2, axis=1)
+  def encode(self, x, y):
+    n_sample = x.shape[0]
+    image_size = x.shape[1:3]
+
+    y_onehot = tf.reshape(tf.one_hot(y, self.label_size), [n_sample, 1, 1, self.label_size]) # 1 x 1 x label_size
+    k = tf.ones([n_sample, *image_size, 1]) # {image_size} x 1
+    h = tf.concat([x, k * y_onehot], 3) # {image_size} x (1 + label_size)
+
+    mean, logvar = tf.split(self.encoder(h), num_or_size_splits=2, axis=1)
     return mean, logvar

This is the process of feeding the input to the Encoder. First, the label y is converted to its one-hot representation y_onehot. Separately, a tensor k with shape "image size (28 x 28 for MNIST) x 1" and all elements equal to 1 is created. When k * y_onehot is computed, broadcasting expands it to "28 x 28 x label_size", so it can be concatenated with x.

(For this part, I referred to the implementation by ysasaki6023. It seems to work even if you concatenate x and y in a different way.)
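To make the shapes concrete, here is a small check under the same assumptions (MNIST images, 10 labels, a batch of 2):

x = tf.zeros([2, 28, 28, 1])                             # dummy images
y = tf.constant([3, 7])                                  # dummy labels
y_onehot = tf.reshape(tf.one_hot(y, 10), [2, 1, 1, 10])  # (2, 1, 1, 10)
k = tf.ones([2, 28, 28, 1])                              # (2, 28, 28, 1)
h = tf.concat([x, k * y_onehot], 3)                      # broadcasting gives (2, 28, 28, 11)
print(h.shape)                                           # (2, 28, 28, 11)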

CVAE.decode()

-  def decode(self, z, apply_sigmoid=False):
-    logits = self.decoder(z)
+  def decode(self, z, y=None, apply_sigmoid=False):
+    n_sample = z.shape[0]
+    if not y is None:
+      y_onehot = tf.reshape(tf.one_hot(y, self.label_size), [n_sample, self.label_size]) # label_size
+      h = tf.concat([z, y_onehot], 1) # latent_dim + label_size
+    else:
+      h = tf.concat([z, tf.zeros([n_sample, self.label_size])], 1)  # latent_dim + label_size
+    logits = self.decoder(h)
     if apply_sigmoid:
       probs = tf.sigmoid(logits)
       return probs

Similarly, z and y_onehot are concatenated and passed to the Decoder. The check for whether y is None makes it possible to try generating data without passing a label. However, since I did not train without labels, doing so only produced images that did not look like anything recognizable...

compute_loss()

-def compute_loss(model, x):
-  mean, logvar = model.encode(x)
+def compute_loss(model, xy):
+  (x, y) = xy # x: image, y: label
+  mean, logvar = model.encode(x, y)
   z = model.reparameterize(mean, logvar)
-  x_logit = model.decode(z)
+  x_logit = model.decode(z, y)
   cross_ent = tf.nn.sigmoid_cross_entropy_with_logits(logits=x_logit, labels=x)
   logpx_z = -tf.reduce_sum(cross_ent, axis=[1, 2, 3])
   logpz = log_normal_pdf(z, 0., 0.)

This is the computation of the objective function during training. As mentioned above, the implementation of the objective function is the same as for VAE, so the only change is that y is added via the (x, y) argument pair.
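For completeness, the remaining part of the loss (taken from the official sample and left unchanged here) looks roughly like this:

import numpy as np

def log_normal_pdf(sample, mean, logvar, raxis=1):
  # log density of a diagonal Gaussian, summed over the latent dimensions
  log2pi = tf.math.log(2. * np.pi)
  return tf.reduce_sum(
      -.5 * ((sample - mean) ** 2. * tf.exp(-logvar) + logvar + log2pi),
      axis=raxis)

# ...continuation of compute_loss():
#   logqz_x = log_normal_pdf(z, mean, logvar)
#   return -tf.reduce_mean(logpx_z + logpz - logqz_x)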

train_step()

 @tf.function
-def train_step(model, x, optimizer):
+def train_step(model, xy, optimizer):
   """Executes one training step and returns the loss.

   This function computes the loss and gradients, and uses the latter to
   update the model's parameters.
   """
   with tf.GradientTape() as tape:
-    loss = compute_loss(model, x)
+    loss = compute_loss(model, xy)
   gradients = tape.gradient(loss, model.trainable_variables)
   optimizer.apply_gradients(zip(gradients, model.trainable_variables))

This function runs one step of training.
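A minimal training-loop sketch under the settings used later in this article (the loop itself follows the official sample; logging and test-set evaluation are omitted):

epochs = 100
optimizer = tf.keras.optimizers.Adam(1e-4)
model = CVAE(latent_dim=64, label_size=10)

for epoch in range(1, epochs + 1):
  for train_xy in train_dataset_xy:   # (image batch, label batch) pairs, built in the next section
    train_step(model, train_xy, optimizer)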

Input dataset preparation

We use TensorFlow's dataset functionality (tf.data.Dataset) to create **paired image and label inputs** from MNIST. As a TensorFlow beginner, when I saw the official sample I worried, "If I shuffle train_dataset like this, won't the correspondence between images and labels break?" Of course, there are functions that address exactly this need, and they are explained carefully in the following tutorial.

- Load images using tf.data | TensorFlow Core
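Here, x_train, y_train, x_test, and y_test are assumed to have been prepared roughly as in the official sample (images reshaped to 28 x 28 x 1 and binarized); a sketch:

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

def preprocess_images(images):
  images = images.reshape((images.shape[0], 28, 28, 1)) / 255.
  return np.where(images > .5, 1.0, 0.0).astype('float32')  # binarize, as in the official sample

x_train, x_test = preprocess_images(x_train), preprocess_images(x_test)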

train_dataset_x = tf.data.Dataset.from_tensor_slices(x_train)
test_dataset_x = tf.data.Dataset.from_tensor_slices(x_test)
print(train_dataset_x, test_dataset_x)

train_dataset_y = tf.data.Dataset.from_tensor_slices(y_train)
test_dataset_y = tf.data.Dataset.from_tensor_slices(y_test)
print(train_dataset_y, test_dataset_y)

First, convert the images and labels into datasets **without shuffling**.

train_dataset_xy = tf.data.Dataset.zip((train_dataset_x, train_dataset_y))
train_dataset_xy = train_dataset_xy.shuffle(train_size).batch(batch_size)
test_dataset_xy = tf.data.Dataset.zip((test_dataset_x, test_dataset_y))
test_dataset_xy = test_dataset_xy.shuffle(train_size).batch(batch_size)
print(train_dataset_xy, test_dataset_xy)

Once you have each dataset, you can combine them with tf.data.Dataset.zip() to make an **(image, label) pair**. Even if you shuffle or mini-batch from here, the correspondence between the two will not break. Iterating yields tuples of (image tensor, label tensor), so you can use them together or individually.
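For example, one batch can be inspected like this (a quick check, not taken from the repository):

for x_batch, y_batch in train_dataset_xy.take(1):
  print(x_batch.shape, y_batch.shape)  # e.g. (32, 28, 28, 1) and (32,)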

Play with CVAE

Now that we have a CVAE, let's train it on the MNIST data and play with it. The code in this section is available in the repository and on Google Colaboratory. On Google Colab you can actually run it.

sample-cvae-mnist.ipynb Google Colab GitHub
sample-cvae-mnist-manifold.ipynb Google Colab GitHub

Hyperparameters

In the official TensorFlow sample, the dimension of the latent variable (latent_dim) was set to 2 in order to visualize the entire latent space as a 2D image (2D manifold). However, as shown in "Comparison of AutoEncoder, VAE, CVAE - Why can VAE generate continuous images?", the larger the number of dimensions, the clearer the results. Except for the final manifold experiment, **the latent variable has 64 dimensions**.

The entire MNIST dataset (60,000 training and 10,000 test images) was used, and all training data were given their correct labels. The number of epochs is 100 (50 for the manifolds), and the mini-batch size is 32.
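Summarized as code, the settings used here are roughly:

train_size, test_size = 60000, 10000
batch_size = 32
latent_dim = 64   # 2 for the 2D-manifold experiment
label_size = 10
epochs = 100      # 50 for the 2D-manifold experiment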

Image restoration

Let's restore the first 32 images of MNIST.

mnist32.png

First, give the **images together with their correct labels** and run them through the model.
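A hedged sketch of this restoration step, where x_batch and y_batch stand for the 32 images and their labels:

mean, logvar = model.encode(x_batch, y_batch)
z = model.reparameterize(mean, logvar)
restored = model.decode(z, y_batch, apply_sigmoid=True)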

mnist32-cvae.png

The details are blurred, but the digits have been restored naturally.

Next, give all 32 images the **label "8"**. Since no "8" is included among the 32 inputs handled here, this generates data that does not exist in the dataset.
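One possible way to do this, reusing the latent codes z from the sketch above and simply swapping the labels (the exact procedure is an assumption):

y_eight = tf.fill([32], 8)  # label "8" for all 32 images
restored_as_8 = model.decode(z, y_eight, apply_sigmoid=True)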

mnist32-cvae-8.png

Some of them have partially collapsed, but roughly half of the characters pass as an "8". This confirms that **CVAE can generate data for a specified label**.

Continuous change of handwriting

A feature of the VAE family is that it can **generate continuous data**. With VAE, for example, by continuously interpolating from the latent variable of a "0" to the latent variable of a "1", you can create images that gradually change into another digit.

cont-vae.png

In CVAE, the label information is said to be removed from the latent variable, so that the latent variable **comes to express differences in handwriting**. Therefore, by fixing the label and changing the latent variable continuously, we can create images in which the handwriting changes continuously. The inputs were selected from the first 32 MNIST images, as in the previous section.
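A sketch of such an interpolation, assuming z_a and z_b are the encoded latent vectors of the two chosen inputs and the label is fixed (here to "4"):

label = tf.constant([4])
steps = 10
for i in range(steps):
  t = i / (steps - 1)
  z_t = (1. - t) * z_a + t * z_b                 # linear interpolation in latent space
  img = model.decode(z_t, label, apply_sigmoid=True)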

cont-cvae-4.png

These images were generated with latent variables interpolated from a "thick-line 0" to a "thin-line 0", with the label fixed to "4". Going down, the lines become thinner and the aspect ratio of the character changes.

cont-cvae-6.png

These images were generated with latent variables interpolated from a "1 tilted to the right" to a "3 tilted to the left", with the label fixed to "6". You can see that the slant of the digit changes gradually.

2D Manifold

An image generated by sweeping the entire space of a two-dimensional latent variable and arranging the results in a grid is called a **2D manifold**. With VAE, it looks like this.

manifold-vae.png

Let's generate CVAE 2D manifolds for a few labels. Note that the orientation of the image carries no meaning, because the orientation of the latent space changes with every training run.
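A sketch of how such a manifold can be generated for a single label, assuming a model retrained with latent_dim = 2; the uniform grid over the latent space and its value range are my own simplifications:

n = 20                                              # grid resolution
label = tf.constant([2])                            # fixed label, e.g. "2"
grid = np.linspace(-2., 2., n).astype(np.float32)   # simple uniform grid over the latent space
images = []
for zy in grid:
  for zx in grid:
    z = tf.constant([[zx, zy]])                     # requires latent_dim == 2
    images.append(model.sample(z, label))           # one (1, 28, 28, 1) image per grid point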

manifold-cvae-4.png

This is the result when "4" is specified. You can see that the latent variable holds information about the aspect ratio and slant of the character.

manifold-cvae-2.png

I found the result for "2" personally interesting: the space is split by **whether the lower part of the digit is written with a loop or not**. This confirms that CVAE removes the label information from the latent space while retaining the handwriting information.

In conclusion

In this article, we modified the TensorFlow 2 VAE sample to implement a CVAE. I also tried image restoration and continuous variation with a CVAE trained on MNIST.

Since I am a TensorFlow beginner, it was very helpful to implement this while referring to sample code that is known to work. Now that I feel I understand how to write it, I would like to try training on various other data in the future.

Reference articles

- Implementing Conditional variational AutoEncoder with Tensorflow - Qiita: a CVAE implementation report using TensorFlow 1.x. It explains many points to keep in mind for the implementation.
- Comparison of AutoEncoder, VAE, CVAE - Why can VAE generate continuous images?: explains the basics of autoencoders, VAE, and CVAE clearly with figures. It also shows experiments on restoration with varying latent-variable dimensions and on continuous variation.
- Journey around the deep generative model (2): VAE - Qiita: a detailed article that also covers the formula of the objective function. Newer developments such as β-VAE and VQ-VAE are also introduced.
