This article is a continuation of the DCGAN series using the Microsoft Cognitive Toolkit (CNTK).
In Part3, we train a Wasserstein GAN with CNTK using the image data prepared in Part1. It is assumed that CNTK and NVIDIA GPU CUDA are installed.
GAN : DCGAN Part2 - Training DCGAN model covered a face generation model built with a Deep Convolutional Generative Adversarial Network (DCGAN); in Part3 we build and train a Wasserstein GAN.
Wasserstein GAN
The network structure of the Wasserstein GAN [1] implemented this time is the same as in Part2.
However, Wasserstein GAN does not use the sigmoid function in the output layer of the Discriminator, which is therefore called the Critic. This implementation also replaces all Batch Normalization in the Critic with Layer Normalization [2].
The difference between the original GAN [3] (hereafter called Vanilla GAN (VGAN)) and Wasserstein GAN (WGAN) is explained later.
The initial values of the transposed convolution and convolution layer parameters were drawn from a normal distribution [4] with a variance of 0.02.
The loss function implemented this time is shown in the following equations [5].
\max_{C} \mathbb{E}_{x \sim p_{r}(x)}[C(x)] - \mathbb{E}_{z \sim p_z(z)}[C(G(z))] + \lambda \mathbb{E}_{x' \sim p_{x'}(x')}(||\nabla_{x'} C(x')||_2 - 1)^2 \\
\min_{G} -\mathbb{E}_{z \sim p_z(z)}[C(G(z))] \\
x' = \epsilon x + (1 - \epsilon) G(z), \epsilon \sim U[0, 1]
where $C$ and $G$ denote the Critic and the Generator, respectively, $x$ is the input image, $z$ is the latent variable, $p_r$ is the distribution of the real image data, $p_z$ is the prior distribution that generates the fake data, and $U$ denotes the uniform distribution. This time I implemented Wasserstein GAN with gradient penalty [5] and set $\lambda$ to 10.
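Below is a minimal sketch of how the Wasserstein terms of this loss might be expressed in CNTK. The names `generator` and `critic` are assumptions standing in for the Part2 networks (a 100-dimensional latent input and a scalar Critic output with no sigmoid), and the image shape is illustrative; the gradient-penalty term additionally needs $\nabla_{x'} C(x')$, which is handled in wgan_training.py on GitHub.

```python
import cntk as C  # the CNTK library (not the Critic)

# hypothetical networks, assumed to be CNTK Functions with the Part2 architecture:
# generator: (100,) -> (3, 64, 64), critic: (3, 64, 64) -> scalar, no sigmoid
x_real = C.input_variable(shape=(3, 64, 64))  # assumed image shape
z = C.input_variable(shape=(100,))

x_fake = generator(z)

# Wasserstein terms, with signs chosen so that both players minimize their loss;
# the trainer averages these per-sample values over the mini-batch
loss_critic = -(critic(x_real) - critic(x_fake))
loss_generator = -critic(x_fake)
```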
Adam [6] was used as the optimization algorithm for both the Generator and the Critic. The learning rate was set to 1e-4, Adam's hyperparameter $\beta_1$ was set to 0.0, and $\beta_2$ was left at CNTK's default value.
Model training ran for 50,000 iterations with mini-batch training of mini-batch size 16.
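The learner setup might look like the following sketch, continuing from the names in the previous snippet; `C.adam`'s `momentum` argument corresponds to $\beta_1$, and `variance_momentum` ($\beta_2$) is simply left at its default.

```python
import cntk as C

# learning-rate schedule: 1e-4 per sample, as stated above
lr = C.learning_parameter_schedule_per_sample(1e-4)

# beta1 = 0.0 via `momentum`; beta2 (`variance_momentum`) stays at CNTK's default.
# Each learner only updates its own network's parameters, so the Critic update
# does not touch the Generator weights and vice versa.
learner_c = C.adam(critic.parameters, lr=lr, momentum=0.0)
learner_g = C.adam(generator.parameters, lr=lr, momentum=0.0)
```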
- CPU: Intel(R) Core(TM) i7-6700K 4.00GHz
- GPU: NVIDIA GeForce GTX 1060 6GB
- Windows 10 Pro 1909
- CUDA 10.0
- cuDNN 7.6
- Python 3.6.6
- cntk-gpu 2.7
- opencv-contrib-python 4.1.1.26
- numpy 1.17.3
- pandas 0.25.0
The training program is available on GitHub.
wgan_training.py
Although the following lacks proof and rigor in some places, I would like to deepen the understanding of the mathematics behind GAN.
To do this, we start with two measures of discrepancy between probability distributions: the Kullback-Leibler divergence and the Jensen-Shannon divergence.
Kullback-Leibler divergence and Jensen-Shannon divergence
**Kullback-Leibler divergence** is a measure of the discrepancy between two probability distributions $P(x)$ and $Q(x)$, where $H$ represents entropy.
\begin{align}
D_{KL} (P || Q) &= \sum_x P(x) \log \frac{P(x)}{Q(x)} \\
&= H(P, Q) - H(P)
\end{align}
However, KL divergence is not symmetric; that is, in general $D_{KL}(P || Q) \neq D_{KL}(Q || P)$.
The entropy terms $H(P, Q)$ and $H(P)$ are expressed by the following formulas.
H(P, Q) = \mathbb{E}_{x \sim P(x)} [- \log Q(x)] \\
H(P) = \mathbb{E}_{x \sim P(x)} [- \log P(x)]
Using these, the KL divergence can be rewritten as follows.
\begin{align}
D_{KL} (P || Q) &= H(P, Q) - H(P) \\
&= \mathbb{E}_{x \sim P(x)} [- \log Q(x)] - \mathbb{E}_{x \sim P(x)} [- \log P(x)] \\
&= \mathbb{E}_{x \sim P(x)} [- \log Q(x) - (- \log P(x))] \\
&= \mathbb{E}_{x \sim P(x)} [\log P(x) - \log Q(x)]
\end{align}
On the other hand, the **Jensen-Shannon divergence**, which is derived from the KL divergence, is defined as follows.
D_{JS} (P || Q) = \frac{1}{2} D_{KL} (P || M) + \frac{1}{2} D_{KL} (Q || M) \\
M = \frac{P + Q}{2}
JS divergence is symmetric and satisfies $0 \leq D_{JS} \leq 1$ (with the base-2 logarithm; with the natural logarithm the upper bound is $\log 2$). Therefore, a large JS divergence means the two distributions are dissimilar, and a small JS divergence means they are similar.
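These properties are easy to confirm numerically. The sketch below uses hypothetical discrete distributions; the small epsilon guards against $\log 0$.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # KL divergence in nats; eps guards against log(0)
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def js_divergence(p, q):
    # JS divergence in nats; bounded above by log 2
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = np.array([0.7, 0.2, 0.1])  # illustrative distributions
q = np.array([0.1, 0.2, 0.7])

print(kl_divergence(p, q), kl_divergence(q, p))  # differ: KL is asymmetric
print(js_divergence(p, q), js_divergence(q, p))  # equal: JS is symmetric
```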
Vanilla GAN
Generative models, including GAN, are based on the hypothesis that the observed data is produced by some underlying generative process, and they aim to learn that generative model.
First, let $D$ and $G$ be the Discriminator and the Generator, $p_r$ the distribution of the real data, and $p_z$ the prior distribution that generates the fake data. Also, let the evaluation functions of the Discriminator and the Generator be $V_D$ and $V_G$.
Since the Discriminator's task is to distinguish real data from fake data, its evaluation function can be written as follows.
V_D = \mathbb{E}_{x \sim p_r(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log (1 - D(G(z)))]
Furthermore, if we set up a zero-sum game in which the sum of the payoffs of the Discriminator and the Generator is 0, it is natural to define the evaluation function of the Generator as
V_G = - V_D
Here, the Nash equilibrium of a zero-sum game (the solution that is a local maximum with respect to $V_D$ and a local maximum with respect to $V_G$) is known to coincide with the minimax solution, so the loss function of VGAN is defined as follows.
\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_r(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log (1 - D(G(z)))]
Now we need to show that the optimization problem above has a unique solution $G^*$, namely $p_g = p_r$. In what follows, the identity below is used implicitly.
\mathbb{E}_{z \sim p_z(z)}[\log (1 - D(G(z)))] = \mathbb{E}_{x \sim p_g(x)}[\log (1 - D(x))]
Writing the loss function of VGAN in its continuous, integral form,
\begin{align}
V(G, D) &= \int_x p_r(x) \log(D(x))dx + \int_z p_z(z) \log(1 - D(G(z)))dz \\
&= \int_x \left[ p_r(x) \log(D(x)) + p_g(x) \log(1 - D(x)) \right] dx
\end{align}
Here, consider the critical points of the function $f(y) = a \log y + b \log (1 - y)$. Setting the derivative to 0 gives
f'(y) = 0 \Rightarrow \frac{a}{y} - \frac{b}{1 - y} = 0 \Rightarrow y = \frac{a}{a + b}
Also, evaluating the second derivative at $\frac{a}{a + b}$, since $a, b > 0$ it is negative, so the function is concave (convex upward) there.
f'' \left( \frac{a}{a + b} \right) = - \frac{a}{(\frac{a}{a + b})^2} - \frac{b}{(1 - \frac{a}{a + b})^2} < 0
Therefore, $y = \frac{a}{a + b}$ is the maximum point. Hence,
\begin{align}
V(G, D) &= \int_x p_r(x) \log(D(x))dx + \int_z p_z(z) \log(1 - D(G(z)))dz \\
&\leq \int_x \max_y \left[ p_r(x) \log y + p_g(x) \log(1 - y) \right] dx
\end{align}
Thus the maximum is attained at $D_{G}^{*}(x) = \frac{p_r}{p_r + p_g}$, and this maximizer is unique. In practice, however, the optimal $D$ cannot be computed directly, because the true real data distribution $p_r$ is unknown. What this argument does establish is that an optimal solution exists, and that training the Discriminator amounts to approximating it.
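As a quick numerical sanity check of the critical-point calculation above, the sketch below evaluates $f(y)$ on a grid for arbitrary illustrative values of $a$ and $b$:

```python
import numpy as np

# confirm that f(y) = a*log(y) + b*log(1-y) peaks at y = a / (a + b)
a, b = 0.7, 0.3  # arbitrary positive values for illustration
y = np.linspace(1e-6, 1.0 - 1e-6, 1000001)
f = a * np.log(y) + b * np.log(1.0 - y)

print(y[np.argmax(f)])  # ~0.7, the grid point closest to the maximum
print(a / (a + b))      # 0.7, the analytic critical point
```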
Next, before considering the optimal solution for the Generator, recall that the final goal of GAN is $p_g = p_r$. In that case, $D_{G}^{*}$ becomes
D_{G}^{*} = \frac{p_r}{p_r + p_g} = \frac{1}{2}
Now consider the Generator's minimization problem once $D_{G}^{*}$ has been obtained:
\begin{align}
\max_D V(G, D_{G}^{*}) &= \mathbb{E}_{x \sim p_r(x)}[\log D_{G}^{*}(x)] + \mathbb{E}_{z \sim p_z(z)}[\log (1 - D_{G}^{*}(G(z)))] \\
&= \mathbb{E}_{x \sim p_r(x)}[\log D_{G}^{*}(x)] + \mathbb{E}_{x \sim p_g(x)}[\log (1 - D_{G}^{*}(x))] \\
&= \mathbb{E}_{x \sim p_r(x)} \left[\log \frac{p_r}{p_r + p_g} \right] + \mathbb{E}_{x \sim p_g(x)} \left[\log \frac{p_g}{p_r + p_g} \right] \\
&= \mathbb{E}_{x \sim p_r(x)} [\log p_r - \log (p_r + p_g)] + \mathbb{E}_{x \sim p_g(x)} [\log p_g - \log (p_r + p_g)] \\
&= \mathbb{E}_{x \sim p_r(x)} \left[\log p_r - \log \left(\frac{p_r + p_g}{2} \cdot 2 \right) \right] + \mathbb{E}_{x \sim p_g(x)} \left[\log p_g - \log \left(\frac{p_r + p_g}{2} \cdot 2 \right) \right] \\
&= \mathbb{E}_{x \sim p_r(x)} \left[\log p_r - \log \left(\frac{p_r + p_g}{2} \right) - \log 2 \right] + \mathbb{E}_{x \sim p_g(x)} \left[\log p_g - \log \left(\frac{p_r + p_g}{2} \right) - \log 2 \right] \\
&= \mathbb{E}_{x \sim p_r(x)} \left[\log p_r - \log \left(\frac{p_r + p_g}{2} \right) \right] + \mathbb{E}_{x \sim p_r(x)} [- \log 2] + \mathbb{E}_{x \sim p_g(x)} \left[\log p_g - \log \left(\frac{p_r + p_g}{2} \right) \right] + \mathbb{E}_{x \sim p_g(x)} [ - \log 2]
\end{align}
Here, using KL divergence,
\begin{align}
\max_D V(G, D_{G}^{*}) = D_{KL} \left(p_r \middle| \middle| \frac{p_r + p_g}{2} \right) + D_{KL} \left(p_g \middle| \middle| \frac{p_r + p_g}{2} \right) - \log 4
\end{align}
Furthermore, letting $M = \frac{p_r + p_g}{2}$, the definition of the JS divergence gives
\max_D V(G, D_{G}^{*}) = 2 \cdot D_{JS} (p_r || p_g) - \log 4
Since $D_{JS} \geq 0$ with equality exactly when $p_g = p_r$, the global minimum of this objective is $-\log 4$, attained at $p_g = p_r$; at the same time, the Generator's minimization problem can be viewed as minimizing the JS divergence.
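The identity $\max_D V(G, D_{G}^{*}) = 2 \cdot D_{JS}(p_r || p_g) - \log 4$ can also be checked numerically. The sketch below uses hypothetical discrete distributions and the `js_divergence` helper defined earlier:

```python
import numpy as np

p_r = np.array([0.7, 0.2, 0.1])  # illustrative stand-ins for p_r and p_g
p_g = np.array([0.1, 0.2, 0.7])

# optimal discriminator D* = p_r / (p_r + p_g)
d_star = p_r / (p_r + p_g)
v = np.sum(p_r * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star))

rhs = 2.0 * js_divergence(p_r, p_g) - np.log(4.0)
print(v, rhs)  # the two values agree to numerical precision
```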
The theory above tells us that, given a model with sufficient expressiveness and enough real data, the generative model can be learned; in practice, however, training a GAN remains difficult.
Wasserstein GAN
A paper analyzing GAN [7] clarified why VGAN training is difficult, and its authors proposed WGAN [1] as a solution.
In VGAN, the Generator's objective ultimately reduces to the JS divergence; in WGAN, it reduces to the Wasserstein distance.
The Wasserstein distance, also known as the Earth-Mover (EM) distance, is a measure based on the optimal transport problem: it represents the cost of moving one probability distribution onto another.
W(p_r, p_g) = \inf_{\gamma \in \Pi(p_r, p_g)} \mathbb{E}_{(x, y) \sim \gamma} [||x - y||]
The Wasserstein distance has beneficial properties not found in the KL or JS divergence. When two probability distributions do not overlap, the KL divergence diverges and the JS divergence is constant at $\log 2$, so neither reflects how far apart the distributions actually are; the Wasserstein distance, by contrast, varies smoothly, so it provides a usable gradient and optimization by gradient descent is stable.
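The parallel-lines example from the WGAN paper [1] illustrates this. Let $P_0$ be the uniform distribution on $\{0\} \times [0, 1]$ and $P_\theta$ the uniform distribution on $\{\theta\} \times [0, 1]$. Then, for $\theta \neq 0$,

W(P_0, P_\theta) = |\theta|, \quad D_{JS}(P_0 || P_\theta) = \log 2, \quad D_{KL}(P_0 || P_\theta) = + \infty

so only the Wasserstein distance decreases smoothly as $\theta \to 0$.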
The Wasserstein distance can be rewritten using the Kantorovich-Rubinstein duality, which yields the loss function below, where $C$ and $G$ denote the Critic and the Generator.
\min_G \max_{||C||_L \leq K} \mathbb{E}_{x \sim p_r(x)}[C(x)] - \mathbb{E}_{z \sim p_z(z)}[C(G(z))]
However, this duality requires the Critic to satisfy a constraint called Lipschitz continuity, and clipping the weight parameters is used as a way to guarantee it.
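In CNTK, weight clipping might be sketched as below, applied to the Critic's parameters after each update; the clip threshold 0.01 follows the original WGAN paper [1] and is an assumption here, not a value taken from this article's training setup.

```python
import numpy as np

clip_value = 0.01  # threshold from the original WGAN paper [1] (assumed)

# `critic` is the hypothetical CNTK Function from the earlier sketches;
# Parameter.value exposes each weight tensor as a NumPy array
for p in critic.parameters:
    p.value = np.clip(p.value, -clip_value, clip_value)
```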
However, since weight clipping is a brute-force method, training may fail, and the gradient penalty [5] was proposed as a remedy.
The gradient penalty exploits the fact that, for an optimal Critic, the L2 norm of the gradient at any point between the real and generated data equals 1; it is a method that adds a term to the loss function penalizing gradients whose L2 norm deviates from 1.
\lambda \mathbb{E}_{x' \sim p_{x'}(x')}(||\nabla_{x'} C(x')||_2 - 1)^2
The point $x'$ between the real data and the generated data is realized as a random blend of a real image and an image produced by the Generator.
x' = \epsilon x + (1 - \epsilon) G(z), \epsilon \sim U[0, 1]
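Continuing the earlier CNTK sketch, the blend might be written as follows; `C.random.uniform` draws $\epsilon$ as part of the graph (here a single value broadcast over the image axes; `C.random.uniform_like` would give finer per-sample control). Computing the penalty itself additionally requires $\nabla_{x'} C(x')$, which is handled in wgan_training.py.

```python
import cntk as C

# eps ~ U[0, 1], drawn inside the graph and broadcast over the image axes
eps = C.random.uniform((1, 1, 1), low=0.0, high=1.0)

# x' = eps * x + (1 - eps) * G(z); x_real and x_fake come from the earlier sketch
x_hat = eps * x_real + (1.0 - eps) * x_fake
```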
In addition, since weight clipping and the gradient penalty can constrain the Critic so strongly that the Generator's expressiveness suffers, DRAGAN [8], which instead applies a regularization term keeping the L2 norm of the gradient close to 1 only in the neighborhood of the real data, has also been proposed.
The figure below visualizes each loss function during training. The horizontal axis represents the number of iterations, and the vertical axis represents the value of the loss function. The values of both the Critic loss and the Generator loss are very large.
The figure below shows face images generated by the trained Generator. Despite the large loss values, and although some images fail, the model appears to produce better-looking face images than in Part2.
The animation below shows the transition of the generated images during training.
As in Part2, measuring the Inception Score [10] with Inception-v3 [9] gave the following result.
Inception Score 2.14
CNTK 206 Part C: Wasserstein and Loss Sensitive GAN with CIFAR Data
GAN : DCGAN Part1 - Scraping Web images
GAN : DCGAN Part2 - Training DCGAN model