Generative Adversarial Networks (GAN)
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST
Source
Mastering GAN (Generative Adversarial Network) in One Hour
- by Yunjey Choi (M.S. student at Korea University)
- YouTube: https://www.youtube.com/watch?v=odpjk7_tGY0
- Slides: https://www.slideshare.net/NaverEngineering/1-gangenerative-adversarial-network
CSC321 Lecture 19: GAN
- By Prof. Roger Grosse at Univ. of Toronto
- http://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/
CS231n: CNN for Visual Recognition
- Lecture 13: Generative Models
- By Prof. Fei-Fei Li at Stanford University
- http://cs231n.stanford.edu/
1. Discriminative Model vs. Generative Model
Discriminative model
A discriminative model learns the boundary between classes directly from data. Rather than understanding what the data looks like, it focuses on how to tell them apart. Formally, it models the conditional probability:
$$P(y \mid x)$$
where:
- $x$ : input data (e.g., an image)
- $y$ : class label (e.g., man or woman)
Generative model
A generative model learns the distribution of training data itself. Rather than learning to classify existing data, it learns what the data looks like — so that it can create new samples that resemble the training data. Formally, it models:
$$P(x) \quad \text{or} \quad P(x, y)$$
where $x$ is the data (e.g., an image) and $y$ is an optional class label.
2. Probability Perspective of Generative Model
2.1. Probability Basics (Review)
Consider rolling a die. The outcome is not fixed in advance — it can be any value from 1 to 6. We call this outcome a random variable $X$.
For each possible value of $X$, we can assign a probability. This assignment is called a probability mass function (PMF):
$$P(X = x) \quad \text{for } x \in \{1, 2, 3, 4, 5, 6\}$$
The table shows one example. The value 4 never appears ($P(X=4) = 0$), and the value 6 appears twice as often as the others ($P(X=6) = \frac{2}{6}$). The remaining values each occur with probability $\frac{1}{6}$.
The bar chart visualizes this PMF. The height of each bar represents the probability of that outcome. Two properties always hold for any valid PMF:
$$ \begin{align*} P(X = x) &\geq 0 \quad \text{for all } x \\ \sum_{x} P(X = x) &= 1 \end{align*} $$
In practice, we estimate $P(X)$ by counting how often each outcome appears across many trials. As the number of trials increases, the empirical frequency converges to the true theoretical probability — this is the law of large numbers:
$$P(X = x) \approx \frac{\text{count}(X = x)}{N} \quad \text{as } N \to \infty$$
This is the discrete case — $X$ takes values from a finite set. When $X$ is continuous, the PMF is replaced by a probability density function (PDF), and the sum becomes an integral.
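As a rough illustration of estimating a PMF by counting, the sketch below simulates the loaded die described above using only the Python standard library. The function name `estimate_pmf` and the seed are illustrative choices, not part of the original material.

```python
import random
from collections import Counter

# Loaded die from the example above: 4 never appears, 6 is twice as likely.
faces = [1, 2, 3, 4, 5, 6]
true_pmf = [1/6, 1/6, 1/6, 0.0, 1/6, 2/6]

def estimate_pmf(n_trials, seed=0):
    """Estimate P(X = x) by counting outcomes over n_trials simulated rolls."""
    rng = random.Random(seed)
    rolls = rng.choices(faces, weights=true_pmf, k=n_trials)
    counts = Counter(rolls)
    return {x: counts[x] / n_trials for x in faces}

# The empirical frequencies get closer to true_pmf as the number of trials grows.
for n in (100, 10_000, 100_000):
    est = estimate_pmf(n)
    print(n, {x: round(p, 3) for x, p in est.items()})
```

With only 100 rolls the estimates are noisy; with 100,000 rolls they should sit close to the true values, which is the law of large numbers at work.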
2.2. From Dice to Images
Now consider a very different kind of random variable. What if $x$ is an actual image from a training dataset? Each image can be represented as a $64 \times 64 \times 3$ dimensional vector — one value per pixel per color channel, giving 12,288 dimensions in total.
The images above are some realizations (samples) from this space. Each one is a single point in a 12,288-dimensional space, drawn from the unknown distribution $p_{\text{data}}(x)$.
Of course, not every point in this high-dimensional space corresponds to a realistic face. The vast majority of randomly chosen vectors would produce noise. The real face images occupy a very small, structured region of this space — and $p_{\text{data}}(x)$ assigns high probability density precisely to that region.
The goal of the generative model is to learn where that region is, and how probability mass is distributed within it — so that new samples drawn from $p_{\text{model}}(x)$ land in the same region and look like plausible face images.
This is essentially a probability density function estimation problem.
2.3. Probability Density Function of Data and Model
The first figure shows that there exists a true $p_{\text{data}}(x)$ — a probability density function that represents the distribution of actual images. This function is unknown and cannot be accessed directly. All we have is a finite set of samples drawn from it — the training images.
The second figure shows what we are trying to achieve. The goal is to find a $p_{\text{model}}(x)$ — the distribution of images generated by the model — that approximates $p_{\text{data}}(x)$ as closely as possible. If this approximation is good enough, then sampling from $p_{\text{model}}(x)$ will produce new images that are indistinguishable from real ones.
One way to measure how well $p_{\text{model}}(x)$ approximates $p_{\text{data}}(x)$ is the Kullback–Leibler (KL) Divergence:
$$D_{\text{KL}}(p_{\text{data}} \| p_{\text{model}}) = \int p_{\text{data}}(x) \log \frac{p_{\text{data}}(x)}{p_{\text{model}}(x)}\, dx$$
This quantity is zero when the two distributions are identical, and grows larger as they diverge. Training a generative model can be understood as minimizing this gap — pushing $p_{\text{model}}(x)$ progressively closer to $p_{\text{data}}(x)$ until the generated samples become realistic.
The key point is that $p_{\text{data}}(x)$ is unknown, so this divergence cannot be evaluated in closed form. All we can do is judge how well $p_{\text{model}}(x)$ matches $p_{\text{data}}(x)$ from the finite set of training samples, and adjust the model until samples drawn from $p_{\text{model}}(x)$ are indistinguishable from real data.
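To make the KL divergence concrete, here is a minimal sketch of its discrete analogue, where the integral becomes a sum. It reuses the loaded die from Section 2.1 as the "data" distribution. The two candidate models (`fair` and `close`) are made-up numbers for illustration only.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions given as lists of probabilities.

    Terms with p[i] == 0 contribute 0 (the convention 0 * log 0 = 0).
    Returns infinity if q assigns zero probability where p does not.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

p_data = [1/6, 1/6, 1/6, 0.0, 1/6, 2/6]          # loaded die from Section 2.1
fair = [1/6] * 6                                 # a poor model: a fair die
close = [0.17, 0.17, 0.16, 0.01, 0.16, 0.33]     # hypothetical better model

print(kl_divergence(p_data, p_data))  # identical distributions give zero
print(kl_divergence(p_data, fair))    # larger gap
print(kl_divergence(p_data, close))   # smaller gap
```

Note that this computation needs the values of `p_data` themselves, which is exactly what is unavailable for real images.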
2.4. How Does a Neural Network Represent a Probability Distribution?
A neural network is deterministic. Given the same input, it always produces the same output. So it does not represent a probability distribution the way one might first imagine.
One might think the network encodes $p_{\text{model}}(x)$ as a function — with input neurons receiving values of $x$ at regular intervals, and output neurons returning the corresponding density values. This is not what happens.
Instead, the idea is much simpler:
We already know how to sample from a simple distribution — a standard Gaussian $z \sim \mathcal{N} (0, I)$. We feed this sampled value $z$ into the network as input. The network then applies a deterministic transformation:
$$x = G(z)$$
Because $z$ is random, $x$ is also random. Different samples of $z$ produce different outputs $x$. The collection of all such outputs $\{G(z)\}$ follows some distribution — and that distribution is what we call $p_{\text{model}}(x)$.
The network never explicitly computes or stores $p_{\text{model}}(x)$. The distribution exists only implicitly — as the statistical behavior of the outputs across many different inputs $z$. Training adjusts the parameters of $G$ so that this implicit output distribution matches $p_{\text{data}}(x)$ as closely as possible.
The target of this transformation is $p_{\text{data}}(x)$ — but $p_{\text{data}}(x)$ is never known explicitly. What we have instead is a finite set of samples drawn from it: the training images. These samples are the only evidence we have about what $p_{\text{data}}(x)$ looks like.
Training is therefore an indirect process. Rather than minimizing a distance to $p_{\text{data}}(x)$ directly, we adjust the parameters of $G$ so that the outputs $\{G(z)\}$ resemble the training samples as closely as possible. The assumption is that if the generated samples are statistically similar to the training data, then $p_{\text{model}}(x)$ must be close to $p_{\text{data}}(x)$.
The neural network is responsible for this nonlinear transformation. It does not receive any explicit description of $p_{\text{data}}(x)$ — it infers the appropriate mapping entirely from the training samples. This is what it means for a generative model to be data-driven.
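The idea that a deterministic map turns simple noise into samples from another distribution can be sketched in a few lines. The affine function `G` below is a hypothetical stand-in chosen so the result is easy to check; a real generator is a nonlinear neural network whose parameters are learned.

```python
import random
import statistics

def G(z, a=2.0, b=3.0):
    """Toy 'generator': a fixed deterministic map (hypothetical, not learned)."""
    return a * z + b

rng = random.Random(0)
z_samples = [rng.gauss(0.0, 1.0) for _ in range(50_000)]   # z ~ N(0, 1)
x_samples = [G(z) for z in z_samples]                      # x = G(z)

# G never stores a density, yet its outputs follow a distribution
# (here approximately N(3, 2^2)) that we can observe through samples.
print("mean of x:", statistics.mean(x_samples))
print("std  of x:", statistics.stdev(x_samples))
```

The same input always gives the same output, but because the input is random, the outputs collectively define the implicit distribution $p_{\text{model}}(x)$.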
Generative model of high dimensional space
The figure below illustrates the training process of a generative model in high-dimensional image space.
On the left, $z$ is sampled from a unit Gaussian — a simple, low-dimensional distribution that is easy to sample from. This vector is passed into the generative model (a neural network with parameters $\theta$), which applies a nonlinear transformation and produces a sample in image space.
The middle panel shows the generated distribution $\hat{p}(x)$ — the region of image space that the model currently covers. The right panel shows the true data distribution $p(x)$, which is never known explicitly. The black dots represent the training samples — the only observations we have of $p(x)$.
The "loss" connects the two panels. This is the key: training is driven by measuring the discrepancy between $\hat{p}(x)$ and the training samples, and using that signal to update $\theta$. The parameters of the network are adjusted iteratively so that the generated distribution $\hat{p}(x)$ moves closer to $p(x)$.
Generating Images After Training
Once the generative model is trained, generating a new image is straightforward.
Each dimension of the code vector $z$ is sampled independently from a simple distribution — such as a Gaussian or uniform distribution. This vector is then fed into the deterministic generator network, which outputs a new image.
The key point is that the entire mapping — from independent Gaussian noise all the way to a coherent image — is learned from data. After training, the network has encoded the structure of the training distribution into its parameters $\theta$, and a single forward pass through the network is all that is needed to produce a new sample.
Measuring the discrepancy between $\hat{p}(x)$ and the training samples is challenging. How should we define and compute this discrepancy in a way that is both meaningful and computationally tractable?
Ian Goodfellow proposed an elegant way to measure the discrepancy between $\hat{p}(x)$ and $p_{\text{data}}(x)$, which led to the Generative Adversarial Network (GAN). The original paper, "Generative Adversarial Nets," was published in 2014 while Goodfellow was a PhD student at the Université de Montréal under the supervision of Yoshua Bengio. The story behind the idea is well known in the deep learning community: Goodfellow conceived the core concept of the adversarial framework in a single evening and implemented a working prototype the same night. The paper has since become one of the most cited works in the history of deep learning, and Yann LeCun, one of the pioneers of the field, described GAN as "the most interesting idea in the last ten years in machine learning." We will study it in the following section.
3. Generative Adversarial Networks (GAN)
In generative modeling, we would like to train a network that models a distribution over data — such as a distribution over images. As discussed, this requires measuring the discrepancy between the generated distribution $\hat{p}(x)$ and the true data distribution $p_{\text{data}}(x)$, which is challenging to do explicitly.
GANs take a fundamentally different approach: they do not work with any explicit density function at all. Instead, they adopt a game-theoretic approach — measuring the quality of the generator indirectly, through the judgment of a second network.
3.1. Adversarial Nets Framework
One natural way to judge the quality of a generative model is to sample from it and ask: do these samples look real? This is exactly the criterion GANs operationalize.
The goal is to train a generator network to produce samples which are indistinguishable from real data, as judged by a discriminator network whose job is to tell real from fake.
GAN trains two networks simultaneously and in opposition:
The discriminator network $D$ takes an image as input and outputs the probability that it came from the real data distribution. It is trained to correctly classify real images as real and generated images as fake.
The generator network $G$ takes a noise vector $z \sim \mathcal{N}(0, I)$ as input and outputs a synthetic image $G(z)$. It is trained to produce realistic-looking samples that fool the discriminator into believing they are real.
3.2. GAN in Detail
Discriminator Network
The discriminator $D$ is a neural network that takes an image as input and outputs a single scalar value $D(x) \in [0, 1]$, representing the probability that the input image is real.
The figure illustrates the first half of the adversarial framework. A real image $x$ of size $64 \times 64 \times 3$ is fed into the discriminator. The discriminator should classify a real image as real — meaning its output $D(x)$ should be close to 1.
The discriminator is trained with two types of inputs:
- Real images sampled from the training data, for which the target output is 1
- Fake images $G(z)$ produced by the generator, for which the target output is 0
The bottom row of the figure (shown faded) previews the generator side of the framework — a noise vector $z$ is passed through the generator $G$ to produce a fake image $G(z)$, which is then also passed into the discriminator to produce $D(G(z))$. This value should be close to 0 from the discriminator's perspective.
Generator Network
The generator $G$ is a neural network that takes a latent code $z$ as input and produces a synthetic image $G(z)$.
The generator's goal is to create an image that is indistinguishable from real data — to deceive the discriminator into believing that $G(z)$ is a real image. From the generator's perspective, the output $D(G(z))$ should be close to 1.
This is the opposite of what the discriminator wants. The discriminator tries to push $D(G(z))$ toward 0, while the generator tries to push it toward 1. This tension is what drives the adversarial training process.
3.3. Loss Function of GAN
The GAN objective can be derived from a familiar starting point — the binary cross-entropy loss used in logistic regression:
$$\text{loss} = -y \log h(x) - (1-y) \log (1-h(x))$$
where $y \in \{0, 1\}$ is the true label and $h(x)$ is the predicted probability.
The discriminator is essentially a binary classifier that assigns label $y = 1$ to real images and $y = 0$ to fake images. Applying the cross-entropy loss to both cases and taking expectations gives the discriminator's objective:
$$\max_D \; \mathbb{E}_{x \sim p_{\text{data}}} [\log D(x)] + \mathbb{E}_{z \sim p(z)} [\log (1 - D(G(z)))]$$
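The link between binary cross-entropy and the discriminator objective can be sketched as follows. The expectations are replaced by mini-batch averages, and the sign is flipped so that the quantity is a loss to minimize. The numeric values of $D(x)$ and $D(G(z))$ are invented for illustration.

```python
import math

def bce(y, h):
    """Binary cross-entropy for one example: label y in {0, 1}, prediction h in (0, 1)."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

def discriminator_loss(d_real, d_fake):
    """Mini-batch estimate of the negated discriminator objective.

    d_real: D(x) values on real images (label 1)      -> average of -log D(x)
    d_fake: D(G(z)) values on generated images (label 0) -> average of -log(1 - D(G(z)))
    """
    real_term = sum(bce(1, d) for d in d_real) / len(d_real)
    fake_term = sum(bce(0, d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from fake well has a low loss ...
good_d = discriminator_loss(d_real=[0.9, 0.95, 0.8], d_fake=[0.1, 0.05, 0.2])
# ... while one that outputs 0.5 everywhere has loss 2 * log(2).
confused_d = discriminator_loss(d_real=[0.5, 0.5, 0.5], d_fake=[0.5, 0.5, 0.5])
print(good_d, confused_d)
```

Minimizing this loss over $D$ is the same as maximizing the objective written above.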
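The generator's objective is the opposite: it wants to fool the discriminator, so it minimizes the same expression with respect to its own parameters. The first term, $\mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)]$, does not depend on $G$, so only the second term remains: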
$$\min_G \; \mathbb{E}_{z \sim p(z)} [\log (1 - D(G(z)))]$$
However, this formulation causes a practical problem known as saturation. Early in training, when $G$ is poor, the discriminator can reject generated samples with high confidence because they are clearly different from real data. In this regime $D(G(z)) \approx 0$, and $\log(1 - D(G(z)))$ is nearly flat. Assuming a sigmoid output $D = \sigma(a)$, where $a$ is the discriminator's pre-sigmoid value, the derivative of $\log(1 - \sigma(a))$ with respect to $a$ is $-\sigma(a) = -D(G(z)) \approx 0$. The gradient flowing back to $G$ is therefore near zero, and the generator receives almost no learning signal.
To address this, rather than training $G$ to minimize $\log(1 - D(G(z)))$, we instead train $G$ to maximize $\log D(G(z))$:
$$\max_G \; \mathbb{E}_{z \sim p(z)} [\log D(G(z))]$$
This is called the non-saturating game. The two objectives — $\min \log(1-D(G(z)))$ and $\max \log D(G(z))$ — have the same fixed point in theory, but the non-saturating version provides much stronger gradients early in learning, when the generator needs them most.
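The difference in gradient strength can be checked numerically. The sketch below assumes the discriminator ends in a sigmoid, $D(G(z)) = \sigma(a)$, and compares the derivative of each generator loss with respect to the logit $a$. The function names are illustrative.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the loss G minimizes in the original game."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), the loss G minimizes in the non-saturating game."""
    return -(1.0 - sigmoid(a))

# A very negative logit means D confidently rejects the generated sample.
for a in (-6.0, -2.0, 0.0, 2.0):
    print(f"logit={a:+.1f}  D(G(z))={sigmoid(a):.4f}  "
          f"saturating={grad_saturating(a):+.4f}  "
          f"non-saturating={grad_non_saturating(a):+.4f}")
```

When $D(G(z))$ is close to 0, the saturating gradient nearly vanishes while the non-saturating gradient has magnitude close to 1. Both have the same sign, so they push the generator in the same direction.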
3.4. Solving a Minimax Problem
Training a GAN requires optimizing two networks simultaneously — which raises an immediate question: how do we train two networks that have opposing objectives? The answer is to alternate between them. Rather than updating $G$ and $D$ at the same time, we fix one and update the other, then swap. This turns the minimax problem into a sequence of standard gradient steps.
Step 1: Fix $G$ and perform a gradient step to update $D$:
$$\max_{D} \; \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_{z}(z)}\left[\log (1-D(G(z)))\right]$$
Step 2: Fix $D$ and perform a gradient step to update $G$:
$$\max_{G} \; \mathbb{E}_{z \sim p_{z}(z)}\left[\log D(G(z))\right]$$
Since most deep learning frameworks are built around minimization, the equivalent formulation using gradient descent is:
Step 1: Fix $G$ and perform a gradient step to update $D$:
$$\min_{D} \; \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[-\log D(x)\right] + \mathbb{E}_{z \sim p_{z}(z)}\left[-\log (1-D(G(z)))\right]$$
Step 2: Fix $D$ and perform a gradient step to update $G$:
$$\min_{G} \; \mathbb{E}_{z \sim p_{z}(z)}\left[-\log D(G(z))\right]$$
These two formulations are equivalent — the second is simply the first with the sign flipped, expressing the same optimization in terms of minimization rather than maximization. In practice, the two steps are repeated alternately for each mini-batch throughout training.
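The alternating scheme can be sketched on a deliberately tiny problem with hand-written gradients. Everything in this toy setup is an assumption made for illustration: real data is drawn from $\mathcal{N}(4, 1)$, the generator is $G(z) = z + b$ with a single parameter $b$, and the discriminator is $D(x) = \sigma(wx + c)$. It shows the structure of the two steps only and makes no claim about convergence.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def d_loss(w, c, b, xs, zs):
    """Mini-batch estimate of E[-log D(x)] + E[-log(1 - D(G(z)))]."""
    real = sum(-math.log(sigmoid(w * x + c)) for x in xs) / len(xs)
    fake = sum(-math.log(1.0 - sigmoid(w * (z + b) + c)) for z in zs) / len(zs)
    return real + fake

def g_loss(w, c, b, zs):
    """Mini-batch estimate of E[-log D(G(z))] (non-saturating generator loss)."""
    return sum(-math.log(sigmoid(w * (z + b) + c)) for z in zs) / len(zs)

def d_step(w, c, b, xs, zs, lr):
    """Step 1: G (parameter b) is fixed; one gradient-descent step on D's loss."""
    gw = gc = 0.0
    for x in xs:                       # gradient of -log D(x)
        r = 1.0 - sigmoid(w * x + c)
        gw -= r * x / len(xs)
        gc -= r / len(xs)
    for z in zs:                       # gradient of -log(1 - D(G(z)))
        x_fake = z + b
        s = sigmoid(w * x_fake + c)
        gw += s * x_fake / len(zs)
        gc += s / len(zs)
    return w - lr * gw, c - lr * gc

def g_step(w, c, b, zs, lr):
    """Step 2: D (parameters w, c) is fixed; one gradient-descent step on G's loss."""
    gb = sum(-(1.0 - sigmoid(w * (z + b) + c)) * w for z in zs) / len(zs)
    return b - lr * gb

rng = random.Random(0)
w, c, b = 0.0, 0.0, 0.0
lr, batch = 0.05, 64
for it in range(200):
    xs = [rng.gauss(4.0, 1.0) for _ in range(batch)]   # real mini-batch
    zs = [rng.gauss(0.0, 1.0) for _ in range(batch)]   # noise mini-batch
    w, c = d_step(w, c, b, xs, zs, lr)                 # update D, G fixed
    zs = [rng.gauss(0.0, 1.0) for _ in range(batch)]   # fresh noise
    b = g_step(w, c, b, zs, lr)                        # update G, D fixed
print("w, c, b after 200 alternating steps:", w, c, b)
```

Each step is an ordinary gradient-descent update on one set of parameters while the other set is treated as a constant, which is the same pattern the Keras code in the next section follows with `train_on_batch`.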
4. GAN with MNIST
To make the concepts from the previous sections concrete, we will implement a GAN from scratch and observe how it behaves in practice. To keep things simple, we will train a GAN to generate images of the digit 2 from the MNIST dataset, using basic fully-connected neural networks (ANN) for both the generator and the discriminator.
4.1. GAN Implementation
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Load MNIST and keep only the images of the digit 2
(train_x, train_y), _ = tf.keras.datasets.mnist.load_data()
train_x = train_x[np.where(train_y == 2)]
train_x = train_x/255.0               # scale pixel values to [0, 1]
train_x = train_x.reshape(-1, 784)    # flatten each 28 x 28 image
print('train_images :', train_x.shape)

# Generator: 100-dim noise vector -> 784-dim image
generator = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units = 256, activation = 'relu', input_dim = 100),
    tf.keras.layers.Dense(units = 784, activation = 'sigmoid')
])

# Discriminator: 784-dim image -> probability that the image is real
discriminator = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units = 256, activation = 'relu', input_dim = 784),
    tf.keras.layers.Dense(units = 1, activation = 'sigmoid'),
])

discriminator.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.0001),
                      loss = 'binary_crossentropy')
Note that discriminator.trainable = False is set before building the combined model. This ensures that when training the combined model, only the generator's parameters are updated. If this were not set, the discriminator's parameters would also change during the generator update step, which would break the adversarial training process.
The combined model chains the generator and the discriminator together: it takes a noise vector as input, passes it through the generator to produce a fake image, and then passes that image through the (frozen) discriminator to produce a scalar output. This allows gradients to flow back through the discriminator and into the generator during training.
discriminator.trainable = False

combined_input = tf.keras.layers.Input(shape = (100,))
generated = generator(combined_input)
combined_output = discriminator(generated)

combined = tf.keras.models.Model(inputs = combined_input, outputs = combined_output)

combined.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.0002),
                 loss = 'binary_crossentropy')
The make_noise function samples from a standard Gaussian distribution $\mathcal{N}(0, 1)$ to produce the latent code $z$. Since each dimension is sampled independently from the same distribution, this is white noise — and that is why the function is named make_noise.
def make_noise(samples):
    return np.random.normal(0, 1, [samples, 100])

def plot_generated_images(generator, samples = 3):
    noise = make_noise(samples)
    generated_images = generator.predict(noise)
    generated_images = generated_images.reshape(samples, 28, 28)

    for i in range(samples):
        plt.subplot(1, samples, i+1)
        plt.imshow(generated_images[i], 'gray', interpolation = 'nearest')
        plt.axis('off')
    plt.tight_layout()
    plt.show()
Since a GAN has two networks, we train their parameters alternately, fixing one while updating the other.
Step 1: Fix $G$ and perform a gradient step to update $D$:
$$\min_{D} \; \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[-\log D(x)\right] + \mathbb{E}_{z \sim p_{z}(z)}\left[-\log (1-D(G(z)))\right]$$
Step 2: Fix $D$ and perform a gradient step to update $G$:
$$\min_{G} \; \mathbb{E}_{z \sim p_{z}(z)}\left[-\log D(G(z))\right]$$
n_iter = 20000
batch_size = 100

fake = np.zeros(batch_size)
real = np.ones(batch_size)

for i in range(n_iter):
    # Train Discriminator
    noise = make_noise(batch_size)
    generated_images = generator.predict(noise, verbose = 0)

    idx = np.random.randint(0, train_x.shape[0], batch_size)
    real_images = train_x[idx]

    # update D's parameters
    D_loss_real = discriminator.train_on_batch(real_images, real)
    D_loss_fake = discriminator.train_on_batch(generated_images, fake)
    D_loss = D_loss_real + D_loss_fake

    # Train Generator (update G's parameters)
    noise = make_noise(batch_size)
    G_loss = combined.train_on_batch(noise, real)

    if i % 5000 == 0:
        print('Discriminator Loss: ', D_loss)
        print('Generator Loss: ', G_loss)
        plot_generated_images(generator)
4.2. After Training
After training, we no longer need the discriminator. We use only the generator network to produce new images. By sampling a noise vector $z \sim \mathcal{N}(0, 1)$ and passing it through the trained generator, we can generate new images that resemble the training data.
plot_generated_images(generator)
5. Conditional GAN (cGAN)
A standard GAN learns to generate samples from $p_{\text{data}}(x)$ without any control over what is generated. Each sample $G(z)$ is drawn randomly from whatever distribution the generator has learned — there is no way to specify what kind of image should be produced.
Conditional GAN addresses this limitation by conditioning both the generator and the discriminator on additional information $c$, such as a class label.
$$G(z, c) \rightarrow \hat{x} \qquad D(x, c) \rightarrow [0, 1]$$
The conditioning information $c$ is provided as an additional input to both networks. This allows the generator to produce samples that correspond to a specific class, and the discriminator to evaluate not just whether an image looks real, but whether it looks real given the specified condition.
The objective function is a straightforward extension of the original GAN objective:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}(x|c)}\left[\log D(x, c)\right] + \mathbb{E}_{z \sim p_{z}(z)}\left[\log (1 - D(G(z, c), c))\right]$$
The only difference from the unconditional case is that every term is now conditioned on $c$. The generator learns $p_{\text{model}}(x \mid c)$ rather than $p_{\text{model}}(x)$ — the distribution of images given a particular condition.
The conditioning variable $c$ can take many forms in practice:
- a class label (e.g., generate an image of a cat)
- a text description (e.g., generate an image from a caption)
- another image (e.g., image-to-image translation)
- any auxiliary information available at training time
This simple modification transforms GAN from an uncontrolled sampler into a controllable generative model, making it applicable to a much broader range of tasks.
5.1. Conditional GAN Implementation
Rather than studying cGAN theoretically, we will go through a hands-on implementation using the MNIST dataset. This will make the key ideas — how the conditioning information is incorporated into both networks, how training proceeds, and what the results look like — concrete and intuitive.
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
(train_x, train_y), (test_x, test_y) = tf.keras.datasets.mnist.load_data()
train_x, test_x = train_x/255.0 , test_x/255.0
train_x, test_x = train_x.reshape(-1,784), test_x.reshape(-1,784)
train_y = tf.keras.utils.to_categorical(train_y, num_classes = 10)
test_y = tf.keras.utils.to_categorical(test_y, num_classes = 10)
print('train_x: ', train_x.shape)
print('test_x: ', test_x.shape)
print('train_y: ', train_y.shape)
print('test_y: ', test_y.shape)
In cGAN, the generator takes both a noise vector $z$ and a conditioning variable $c$ as input. Here, the conditioning variable is the class label $c$, represented as a one-hot encoded vector of dimension 10 — one element for each digit in MNIST.
The noise vector $z$ has dimension 128, and the label vector $c$ has dimension 10. The two are concatenated before being fed into the network, giving a combined input dimension of $128 + 10 = 138$.
By concatenating $z$ and $c$, the generator learns to produce images that correspond to the specified digit. For example, when $c = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]$ (the one-hot encoding for digit 1), the generator is expected to produce an image that looks like a handwritten 1.
generator_model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units = 256, activation = 'relu', input_dim = 138),
    tf.keras.layers.Dense(units = 784, activation = 'sigmoid')
])

noise = tf.keras.layers.Input(shape = (128,))
label = tf.keras.layers.Input(shape = (10,))

# Concatenate noise (128) and one-hot label (10) into a single 138-dim input
model_input = tf.keras.layers.concatenate([noise, label], axis = 1)
generated_image = generator_model(model_input)

generator = tf.keras.models.Model(inputs = [noise, label], outputs = generated_image)
generator.summary()
Similarly, the discriminator in cGAN takes both an image and a conditioning label as input. The image has dimension 784 (28×28 pixels), and the label is a one-hot encoded vector of dimension 10. The two are concatenated before being fed into the network, giving a combined input dimension of $784 + 10 = 794$.
By conditioning the discriminator on the label $c$, it learns to evaluate not just whether an image looks real, but whether it looks like a plausible example of the specified digit. For instance, if $c$ corresponds to digit 3, the discriminator must judge whether the input image looks like a realistic handwritten 3 — not just a realistic digit in general.
discriminator_model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units = 256, activation = 'relu', input_dim = 794),
    tf.keras.layers.Dense(units = 1, activation = 'sigmoid')
])

input_image = tf.keras.layers.Input(shape = (784,))
label = tf.keras.layers.Input(shape = (10,))

# Concatenate image (784) and one-hot label (10) into a single 794-dim input
model_input = tf.keras.layers.concatenate([input_image, label], axis = 1)
validity = discriminator_model(model_input)

discriminator = tf.keras.models.Model(inputs = [input_image, label], outputs = validity)

discriminator.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.0002),
                      loss = ['binary_crossentropy'])
discriminator.summary()
discriminator.trainable = False

noise = tf.keras.layers.Input(shape = (128,))
label = tf.keras.layers.Input(shape = (10,))

generated_image = generator([noise, label])
validity = discriminator([generated_image, label])

combined = tf.keras.models.Model(inputs = [noise, label], outputs = validity)

combined.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.0002),
                 loss = ['binary_crossentropy'])
combined.summary()
def create_noise(samples):
    return np.random.normal(0, 1, [samples, 128])

def plot_generated_images(generator):
    # One noise vector per digit, paired with labels 0..9
    noise = create_noise(10)
    label = np.arange(0, 10).reshape(-1, 1)
    label_onehot = np.eye(10)[label.reshape(-1)]

    generated_images = generator.predict([noise, label_onehot])

    plt.figure(figsize = (12, 3))
    for i in range(generated_images.shape[0]):
        plt.subplot(1, 10, i + 1)
        plt.imshow(generated_images[i].reshape((28, 28)), 'gray', interpolation = 'nearest')
        plt.title('Digit: {}'.format(i))
        plt.axis('off')
    plt.show()
n_iter = 30000
batch_size = 50

valid = np.ones(batch_size)
fake = np.zeros(batch_size)

for i in range(n_iter):
    # Train Discriminator
    idx = np.random.randint(0, train_x.shape[0], batch_size)
    real_images, labels = train_x[idx], train_y[idx]

    noise = create_noise(batch_size)
    generated_images = generator.predict([noise,labels], verbose = 0)

    d_loss_real = discriminator.train_on_batch([real_images, labels], valid)
    d_loss_fake = discriminator.train_on_batch([generated_images, labels], fake)
    d_loss = d_loss_real + d_loss_fake

    # Train Generator
    noise = create_noise(batch_size)
    labels = np.random.randint(0, 10, batch_size)
    labels_onehot = np.eye(10)[labels]

    g_loss = combined.train_on_batch([noise, labels_onehot], valid)

    if i % 5000 == 0:
        print('Discriminator Loss: ', d_loss)
        print('Generator Loss: ', g_loss)
        plot_generated_images(generator)
5.2. Meaning of Latent Space in cGAN
The cGAN implementation above demonstrates the core idea of conditional generation, even if the quality of the generated images leaves room for improvement. One important question worth asking at this point is: what does the latent space actually represent in a cGAN, and how is it different from a standard GAN?
In a standard GAN, the position in the latent space determines everything about the generated image — both what digit it is and what it looks like. Different positions in the latent space correspond to different digit classes as well as different visual styles.
In cGAN, the class information is provided separately through the conditioning label $c$. This frees the latent space from having to encode class identity. Instead, the latent space only needs to encode lower-level visual features such as stroke width, angle, or writing style.
As shown in the figure, the same point in the latent space can be used to generate two completely different digits — a 1 and a 3 — simply by changing the conditioning label. The latent point encodes how the digit is written, while the label determines which digit is written.
This separation makes the latent space more interpretable and the generation process more controllable:
- The conditioning label $c$ controls the class — which digit to generate
- The latent code $z$ controls the style — features such as stroke width or angle that are shared across all digit classes
In practice, this means we can fix $z$ and vary $c$ to generate different digits with the same handwriting style, or fix $c$ and vary $z$ to generate different styles of the same digit.
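The two experiments just described amount to building two different input batches for the generator. The sketch below constructs those batches with plain Python lists, using the dimensions from Section 5.1 (128 for $z$, 10 for $c$). The helper names `one_hot` and `sample_z` are hypothetical, and the trained `generator` itself is not defined here.

```python
import random

NOISE_DIM, N_CLASSES = 128, 10   # sizes used in the implementation above

def one_hot(digit, n_classes=N_CLASSES):
    """One-hot encoding of a digit label as a list of floats."""
    return [1.0 if k == digit else 0.0 for k in range(n_classes)]

def sample_z(rng, dim=NOISE_DIM):
    """One latent vector with independent N(0, 1) entries."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

rng = random.Random(0)

# (a) Same style, different digits: one z repeated, labels 0..9.
z_fixed = sample_z(rng)
noise_a = [list(z_fixed) for _ in range(N_CLASSES)]
labels_a = [one_hot(d) for d in range(N_CLASSES)]

# (b) Same digit, different styles: ten different z, label fixed to 3.
noise_b = [sample_z(rng) for _ in range(10)]
labels_b = [one_hot(3) for _ in range(10)]

# With the trained cGAN generator from Section 5.1, these batches could be used as, e.g.:
#   images = generator.predict([np.array(noise_a), np.array(labels_a)], verbose = 0)
```

Batch (a) should produce ten different digits written in a similar style, and batch (b) ten variations of the digit 3, to the extent that the trained model has actually separated class from style.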
6. InfoGAN (Information Maximizing GAN)
The cGAN result above raises a natural question: what if we do not have class labels? In cGAN, the conditioning label $c$ must be provided explicitly at training time — meaning we need a labeled dataset. This is a significant limitation when labels are unavailable or expensive to obtain.
In a standard generative model, there is no control over the features of the data being generated. The noise vector $z$ encodes everything about the generated image — digit identity, stroke width, angle, and any other factor of variation — all entangled together in a single unstructured vector. There is no way to manipulate one factor independently of the others. Changing a single dimension of $z$ may simultaneously affect the digit class, the thickness of the strokes, and the tilt of the character in unpredictable ways.
InfoGAN addresses this with a simple but powerful modification to the standard GAN framework. A separate latent code $c$ is introduced alongside the noise vector $z$, and the generator is encouraged to use $c$ to encode distinct, interpretable factors of variation in a disentangled way — learned entirely through unsupervised learning, without any labels. The generator takes both as input:
$$G(z, c) \rightarrow \hat{x}$$
The problem is that without any constraint, the generator will simply ignore $c$ and encode everything in $z$, leaving $c$ entangled with or irrelevant to the output. To prevent this, InfoGAN adds a constraint: the generated image $G(z, c)$ must retain enough information to recover $c$ from it. In other words, if we look at a generated image, we should be able to tell what value of $c$ was used to produce it. This is what mutual information $I(c;\, G(z, c))$ measures — how much knowing the generated image tells us about $c$.
Maximizing this quantity forces $c$ to encode something meaningful and consistent about the output, pushing the model toward a disentangled representation where each dimension of $c$ corresponds to an independent, interpretable factor. The full objective is:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z, c}[\log(1 - D(G(z, c)))] - \lambda I(c;\, G(z, c))$$
To compute this in practice, an auxiliary network $Q$ is added to the framework. $Q$ takes a generated image $\hat{x}$ as input and predicts the latent code $c$ that was used to produce it. If $Q$ can accurately recover $c$ from $G(z, c)$, it means the generated image carries clear information about $c$, and the mutual information is high. More precisely, the log-likelihood that $Q$ assigns to the true code gives a variational lower bound on $I(c;\, G(z, c))$, so training $Q$ (together with $G$) to predict $c$ well maximizes this lower bound rather than the mutual information itself. In practice, $Q$ shares most of its layers with the discriminator $D$, so it adds very little computational overhead.
The latent code $c$ can take various forms. For continuous codes, $c$ typically takes values in the range $[-1, 1]$, encoding smooth, continuous factors of variation such as stroke width or rotation angle. For discrete codes, $c$ can represent categorical information such as digit class.
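As a rough illustration of how the different code types are combined, the sketch below samples a noise vector, a one-hot categorical code, and continuous codes, then concatenates them into one generator input. The function name, the 10-way categorical code, and the uniform noise are assumptions chosen for this example, not part of the implementation in this document.

```python
import numpy as np

def sample_generator_input(batch_size, noise_dim=62, n_categories=10,
                           n_continuous=2, rng=None):
    """Sample z, a one-hot categorical code, and continuous codes, then concatenate."""
    rng = np.random.default_rng() if rng is None else rng
    # Unstructured noise z (uniform in [-1, 1] by assumption)
    z = rng.uniform(-1.0, 1.0, size=(batch_size, noise_dim))
    # Discrete code: one category index per sample, one-hot encoded
    categories = rng.integers(0, n_categories, size=batch_size)
    c_discrete = np.eye(n_categories)[categories]
    # Continuous codes in [-1, 1]
    c_continuous = rng.uniform(-1.0, 1.0, size=(batch_size, n_continuous))
    generator_input = np.concatenate([z, c_discrete, c_continuous], axis=-1)
    return generator_input, categories, c_continuous

generator_input, categories, c_continuous = sample_generator_input(
    4, rng=np.random.default_rng(0))
print(generator_input.shape)  # (4, 74) = 62 noise + 10 one-hot + 2 continuous
```

Keeping the sampled codes alongside the concatenated input matters because the same codes later serve as the regression or classification targets for $Q$.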
The three networks — $G$, $D$, and $Q$ — are trained jointly, and the generative model learns interpretable information from the data entirely by itself.
The two figures below show concrete examples of what InfoGAN learns on MNIST when two continuous latent codes are used.
In the first figure, varying one continuous latent code from left to right produces digit 2s with progressively changing rotation angle — the digit gradually tilts as the code value changes.
In the second figure, varying the other continuous latent code produces digit 2s with progressively changing curvature in the tail — from a sharply curved tail on the left to a more rounded, looped style on the right.
It is important to note that these factors — rotation and tail curvature — were never specified in advance. InfoGAN did not receive any instruction to discover them. They emerged entirely from unsupervised learning, driven only by the mutual information constraint. The interpretation of what each latent code represents is something we as humans infer after training by observing how the generated images change as each code is varied. The model simply learns whatever structure in the data is most consistent and informative — and it is up to us to give that structure a name.
6.1. Meaning of Latent Space in InfoGAN¶
The figure illustrates how the generator in InfoGAN operates at inference time. The same point in the latent space $z$ is used in both cases, but the continuous latent code $c$ is set to $-1$ on the left and $+1$ on the right. The resulting images show the same digit 2 with a noticeably different visual style — demonstrating that the latent code independently controls an interpretable factor of variation while the latent point remains fixed.
This is the key advantage of InfoGAN over a standard GAN. By fixing $z$ and varying $c$, we can control specific attributes of the generated output in a predictable and interpretable way — without any labels or explicit supervision. The latent space $z$ and the latent code $c$ each play a distinct role: $z$ determines the general characteristics of the sample drawn from the data distribution, while $c$ provides fine-grained control over the interpretable factors that InfoGAN has learned to disentangle from the data.
6.2. InfoGAN Implementation¶
Deep Convolutional GAN (DCGAN)
All of the GAN examples so far have used fully connected neural networks for both the generator and the discriminator. While this is sufficient to demonstrate the core ideas, fully connected networks are not well suited for image generation — they do not exploit the spatial structure of images and scale poorly to higher resolutions.
DCGAN is a direct extension of the standard GAN framework that addresses this by replacing fully connected layers with convolutional and convolutional-transpose layers in the discriminator and generator, respectively. Convolutional layers are well suited for image data because they capture local spatial patterns and apply the same filters at every position, so a feature can be detected wherever it appears in the image. As a result, DCGAN produces significantly higher quality images than a fully connected GAN and has become a standard backbone architecture for image generation tasks, including InfoGAN, which we will implement next.
(Simple) InfoGAN Structure as an Example
The figure below shows the DCGAN architecture used as the backbone for our InfoGAN implementation. To keep things simple, the model is trained to generate only the digit 2, and two continuous latent codes are used for $c$.
The generator takes the noise vector $z$, concatenated with the latent code $c$, as input and progressively expands it into a full 28$\times$28 image through a series of convolutional-transpose layers.
The discriminator follows the opposite direction. It takes a 28$\times$28 image as input and passes it through a series of convolutional layers, eventually producing a single scalar output — the probability that the input image is real.
In the InfoGAN extension, the $Q$ network shares the convolutional layers of the discriminator and adds a separate fully connected branch of 128 units on top. This branch outputs the predicted values of the two continuous latent codes $c_1$ and $c_2$. Since $Q$ reuses the discriminator's feature extraction layers, it adds minimal computational overhead while enabling the model to learn disentangled representations from the data.
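The expansion and contraction described above can be checked with simple size arithmetic. The sketch below assumes two stride-2 layers with 'same' padding in each network, under which a transposed convolution multiplies the spatial size by the stride and a convolution divides it (rounding up). The helper names are hypothetical.

```python
import math

def conv_same_out(size, stride):
    """Spatial output size of a strided convolution with 'same' padding."""
    return math.ceil(size / stride)

def conv_transpose_same_out(size, stride):
    """Spatial output size of a strided transposed convolution with 'same' padding."""
    return size * stride

# Generator path: a 7x7 feature map upsampled twice with stride 2
gen_sizes = [7]
for _ in range(2):
    gen_sizes.append(conv_transpose_same_out(gen_sizes[-1], 2))

# Discriminator path: a 28x28 image downsampled twice with stride 2
disc_sizes = [28]
for _ in range(2):
    disc_sizes.append(conv_same_out(disc_sizes[-1], 2))

print(gen_sizes)   # [7, 14, 28]
print(disc_sizes)  # [28, 14, 7]
```

The two paths mirror each other, which is why the generator starts from a 7$\times$7 feature map to end at exactly 28$\times$28.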
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
(train_x, train_y), _ = tf.keras.datasets.mnist.load_data()
train_x = train_x[np.where(train_y == 2)]
train_x = train_x/255.0
train_x = train_x.reshape(-1, 28, 28, 1)
print('train_images :', train_x.shape)
The generator takes a concatenated input of dimension 62 + 2 = 64, where 62 dimensions correspond to the noise vector $z$ and 2 dimensions correspond to the continuous latent codes $c_1$ and $c_2$.
generator = tf.keras.models.Sequential([
tf.keras.layers.Dense(units = 1024,
use_bias = False,
input_shape = (62 + 2,)),
tf.keras.layers.BatchNormalization(),
tf.keras.layers.ReLU(),
tf.keras.layers.Dense(units = 7*7*128,
use_bias = False),
tf.keras.layers.BatchNormalization(),
tf.keras.layers.ReLU(),
tf.keras.layers.Reshape((7, 7, 128)),
tf.keras.layers.Conv2DTranspose(64,
(4, 4),
strides = (2, 2),
padding = 'same',
use_bias = False),
tf.keras.layers.BatchNormalization(),
tf.keras.layers.ReLU(),
tf.keras.layers.Conv2DTranspose(1,
(4, 4),
strides = (2, 2),
padding = 'same',
use_bias = False,
activation = 'sigmoid')
])
The discriminator in InfoGAN is structured around three components that share a common feature extraction backbone.
- The extractor network takes a 28$\times$28$\times$1 image as input and produces a 1024-dimensional feature vector. This shared feature vector is then consumed by both the d_network and the q_network.
- The d_network takes the 1024-dimensional feature vector and maps it to a single scalar output via a sigmoid activation: the probability that the input image is real. This is the standard discriminator output.
- The q_network takes the same 1024-dimensional feature vector and passes it through an additional fully connected layer of 128 units, followed by a final layer that outputs 2 values: the predicted values of the continuous latent codes $c_1$ and $c_2$.
By sharing the extractor between d_network and q_network, the discriminator and $Q$ network are trained jointly on the same feature representation. The d_network learns to distinguish real from fake images, while the q_network learns to recover the latent codes $c$ from the generated images — enforcing the mutual information constraint without significant additional computational cost.
extractor = tf.keras.models.Sequential([
tf.keras.layers.Conv2D(64,
(4, 4),
strides = (2, 2),
padding = 'same',
use_bias = False,
input_shape = [28, 28, 1]),
tf.keras.layers.LeakyReLU(),
tf.keras.layers.Conv2D(128,
(4, 4),
strides = (2, 2),
padding = 'same',
use_bias = False),
tf.keras.layers.LayerNormalization(),
tf.keras.layers.LeakyReLU(),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(units = 1024,
use_bias = False),
tf.keras.layers.LayerNormalization(),
tf.keras.layers.LeakyReLU()
])
d_network = tf.keras.models.Sequential([
tf.keras.layers.Dense(units = 1,
input_shape = (1024,),
use_bias = False,
activation = 'sigmoid')
])
q_network = tf.keras.models.Sequential([
tf.keras.layers.Dense(units = 128,
use_bias = False,
input_shape = (1024,)),
tf.keras.layers.LayerNormalization(),
tf.keras.layers.LeakyReLU(),
tf.keras.layers.Dense(units = 2,
use_bias = False)
])
combined_input = tf.keras.layers.Input(shape = (28, 28, 1))
combined_feature = extractor(combined_input)
combined_output = d_network(combined_feature)
discriminator = tf.keras.models.Model(inputs = combined_input,
outputs = combined_output)
discriminator.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 2e-4),
loss = 'binary_crossentropy')
extractor.trainable = False
d_network.trainable = False
combined_input = tf.keras.layers.Input(shape = (62 + 2,))
generated = generator(combined_input)
combined_feature = extractor(generated)
combined_output = d_network(combined_feature)
combined_d = tf.keras.models.Model(inputs = combined_input,
outputs = combined_output)
combined_d.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 1e-3),
loss = 'binary_crossentropy')
extractor.trainable = False
d_network.trainable = False
combined_input = tf.keras.layers.Input(shape = (62 + 2,))
generated = generator(combined_input)
combined_feature = extractor(generated)
combined_latent = q_network(combined_feature)
combined_q = tf.keras.models.Model(inputs = combined_input,
outputs = combined_latent)
combined_q.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 1e-3),
loss = 'mean_squared_error')
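def make_noise(samples):
    # Noise vector z: 62 dimensions, uniform in [-1, 1)
    return np.random.uniform(-1, 1, size = [samples, 62])

def make_code(samples):
    # Two continuous latent codes (c1, c2), uniform in [-1, 1)
    return 2*np.random.rand(samples, 2) - 1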
def plot_generated_images(generator):
    # Sweep c1 over [-1, 1] with c2 = 0; one z is shared by all five images.
    # Note: z is drawn from a standard normal here, whereas training uses uniform noise.
    z = np.random.randn(1, 62).repeat(5, axis = 0)
    c = np.stack([np.linspace(-1, 1, 5), np.zeros(5)]).T
    noise = np.concatenate([z, c], -1)
    generated_images = generator.predict(noise, verbose = 0)
    generated_images = generated_images.reshape(5, 28, 28)

    print('')
    print('Continuous Latent Code 1')
    for i in range(5):
        plt.subplot(1, 5, i+1)
        plt.imshow(generated_images[i], 'gray', interpolation = 'nearest')
        plt.axis('off')
    plt.tight_layout()
    plt.show()

    # Sweep c2 over [-1, 1] with c1 = 0, using a newly drawn shared z
    z = np.random.randn(1, 62).repeat(5, axis = 0)
    c = np.stack([np.zeros(5), np.linspace(-1, 1, 5)]).T
    noise = np.concatenate([z, c], -1)
    generated_images = generator.predict(noise, verbose = 0)
    generated_images = generated_images.reshape(5, 28, 28)

    print('Continuous Latent Code 2')
    for i in range(5):
        plt.subplot(1, 5, i+1)
        plt.imshow(generated_images[i], 'gray', interpolation = 'nearest')
        plt.axis('off')
    plt.tight_layout()
    plt.show()
    print('')
n_iter = 5000
batch_size = 256

real = np.ones((batch_size, 1))
fake = np.zeros((batch_size, 1))

for i in range(n_iter):

    # Train Discriminator
    for _ in range(2):
        z = make_noise(batch_size)
        c = make_code(batch_size)
        noise = np.concatenate([z, c], -1)
        generated_images = generator.predict(noise, verbose = 0)

        idx = np.random.choice(len(train_x), batch_size, replace = False)
        real_images = train_x[idx]

        D_loss_real = discriminator.train_on_batch(real_images, real)
        D_loss_fake = discriminator.train_on_batch(generated_images, fake)
        D_loss = D_loss_real + D_loss_fake

    # Train Generator & Q Net
    for _ in range(1):
        z = make_noise(batch_size)
        c = make_code(batch_size)
        noise = np.concatenate([z, c], -1)

        G_loss = combined_d.train_on_batch(noise, real)
        Q_loss = combined_q.train_on_batch(noise, c)

    # Print Loss
    if (i + 1) % 500 == 0:
        print('Epoch: {:5d} | Discriminator Loss: {:.3f} | Generator Loss: {:.3f} | Q Net Loss: {:.3f}'.format(i + 1, D_loss, G_loss, Q_loss))
        plot_generated_images(generator)
images_save_1 = []
images_save_2 = []

for i in range(8):
    z = np.random.randn(1, 62).repeat(8, axis = 0)

    # Continuous Latent Code 1
    c = np.stack([np.linspace(-1, 1, 8), np.zeros(8)]).T
    noise = np.concatenate([z, c], -1)
    generated_images = generator.predict(noise, verbose=0)
    generated_images = generated_images.reshape(8, 28, 28)
    images_save_1.append(generated_images)

    # Continuous Latent Code 2
    c = np.stack([np.zeros(8), np.linspace(-1, 1, 8)]).T
    noise = np.concatenate([z, c], -1)
    generated_images = generator.predict(noise, verbose=0)
    generated_images = generated_images.reshape(8, 28, 28)
    images_save_2.append(generated_images)
print('Continuous Latent Code 1')
fig, ax = plt.subplots(8, 8, figsize = (10, 10))
for i in range(8):
    for j in range(8):
        ax[i][j].imshow(images_save_1[i][j], 'gray')
        ax[i][j].set_xticks([])
        ax[i][j].set_yticks([])
plt.show()
From the results, we can post-hoc interpret what each latent code has learned:
- $c_1$ appears to control the curvature of the tail of the digit 2 — transitioning from a sharply curved, angular tail on the left to a more rounded, looped style on the right
print('Continuous Latent Code 2')
fig, ax = plt.subplots(8, 8, figsize = (10, 10))
for i in range(8):
    for j in range(8):
        ax[i][j].imshow(images_save_2[i][j], 'gray')
        ax[i][j].set_xticks([])
        ax[i][j].set_yticks([])
plt.show()
From the results, we can post-hoc interpret what each latent code has learned:
- $c_2$ appears to control the rotation angle — the digit gradually tilts as the code value changes.
These interpretations are post-hoc. The model was never told to learn rotation or curvature — it discovered these attributes entirely on its own, driven only by the mutual information constraint. The smooth, continuous transitions visible across both axes are a characteristic sign of successful disentanglement: each latent code independently controls one interpretable aspect of the output without interfering with the other.
It is also important to note that since these representations are learned from data rather than specified in advance, the specific attributes encoded by each latent code are not guaranteed to be the same across different training runs. A different training run may produce a model where $c_1$ encodes rotation and $c_2$ encodes tail curvature, or where the two codes capture entirely different factors of variation altogether. The mutual information constraint ensures that the latent codes encode something meaningful and disentangled — but it does not determine which specific attributes will be discovered.
7. CycleGAN¶
CycleGAN is an extension of the standard GAN framework designed for image-to-image translation — the task of learning to change the style of an image from one domain to another.
The representative examples are:
- converting Monet paintings to photographs,
- translating zebras to horses, and
- transforming summer landscapes to winter scenes, and vice versa.

In the bottom row of the figure, the same photograph is translated into four different painting styles: Monet, Van Gogh, Cezanne, and Ukiyo-e.
7.1. Limitation of Paired Datasets¶
A natural approach to image-to-image translation would be to train a supervised model on paired examples, each consisting of an input image and its corresponding target image. As shown in the figure below, paired data provides a clean, direct supervision signal.
However, collecting such paired datasets is impractical or outright impossible in most cases. There is no way to obtain a photograph of a horse and its exact zebra equivalent in the same pose, lighting, and background. CycleGAN addresses this by working with unpaired data: two independent sets of images, one from each domain, with no correspondence between them.
As the problem setting becomes more realistic, the assumptions we can make about the data become weaker, and the algorithms must become correspondingly more sophisticated. So far, we have focused on algorithms and network architectures under the assumption that data are well prepared and readily available. However, the cost and time required to prepare training data are far from negligible. Unsupervised learning is generally cheaper than supervised learning, and unpaired data is much easier to obtain than paired data. At the same time, working with less structured data demands more careful and sophisticated algorithm design.
CycleGAN is a good example of this trade-off — it sacrifices the convenience of paired supervision in favor of a more flexible and broadly applicable framework. Let us now see how it overcomes the technical challenges posed by unpaired data.
7.2. From Naive GAN to CycleGAN¶
A naive approach would be to train a single generator $G_{XY}$ that takes an image from domain $X$ (horse) and translates it into domain $Y$ (zebra), supervised only by a discriminator in domain $Y$.
In the GAN examples studied so far, the generator takes a noise vector $z$ as input and produces an image. In this case, however, the input is itself an image — the generator must transform one image into another while preserving its content.
This changes the architectural requirements slightly. A simple fully connected or transposed-convolution network is no longer appropriate, because we are not generating an image from noise; we are transforming an existing one. Instead, CycleGAN uses an autoencoder-style architecture for the generator, as illustrated in the figure.
The generator consists of two parts:
- An encoder that progressively compresses the input image into a compact latent representation, capturing the content of the image in a lower-dimensional space
- A decoder that reconstructs an image from this latent representation, but in the target domain
The latent representation in this autoencoder plays a role analogous to the latent code $z$ in a standard GAN — it is a compressed description of the image content. The key difference is that here the latent space is not explicitly controlled or sampled from a known distribution. It is determined entirely by the encoder's compression of the input image. The decoder then learns to reconstruct from this representation in the style of the target domain, effectively disentangling content from style within the network's internal representation.
As shown in the figure, this setup has a fundamental problem: the adversarial loss alone does not constrain the generator to preserve the content of the input image. The generator could produce a realistic-looking zebra that has nothing to do with the original horse image — a different pose, different background, or an entirely unrelated scene — and still fool the discriminator.
Cycle Consistency Loss
CycleGAN solves this by introducing a second generator $G_{YX}: Y \rightarrow X$ and enforcing a cycle consistency constraint: translating an image from $X$ to $Y$ and then back to $X$ should recover the original image.
$$G_{YX}(G_{XY}(x)) \approx x \qquad G_{XY}(G_{YX}(y)) \approx y$$
As shown in the figure, the horse image is first translated to a zebra by $G_{XY}$, and then translated back to a horse by $G_{YX}$. The cycle consistency loss penalizes any deviation from the original:
$$\mathcal{L}_{\text{cyc}} = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\|G_{YX}(G_{XY}(x)) - x\|_1\right] + \mathbb{E}_{y \sim p_{\text{data}}(y)}\left[\|G_{XY}(G_{YX}(y)) - y\|_1\right]$$
This constraint acts as a form of self-supervision that forces the two generators to be consistent with each other and prevents them from making arbitrary mappings.
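The cycle consistency loss above can be sketched numerically. The example below uses small random vectors as stand-ins for images and simple hand-written functions as stand-ins for the two generators; these toy generators are hypothetical and only serve to show that the loss vanishes when the two mappings invert each other.

```python
import numpy as np

def cycle_consistency_loss(x, y, g_xy, g_yx):
    """Batch mean of ||G_YX(G_XY(x)) - x||_1 plus batch mean of ||G_XY(G_YX(y)) - y||_1."""
    forward = np.abs(g_yx(g_xy(x)) - x).sum(axis=1).mean()
    backward = np.abs(g_xy(g_yx(y)) - y).sum(axis=1).mean()
    return forward + backward

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 4))  # stand-in for a batch from domain X
y = rng.normal(size=(16, 4))  # stand-in for a batch from domain Y

# Toy generators (hypothetical): an exactly invertible pair and a mismatched pair
g_xy = lambda v: 2.0 * v + 1.0
g_yx_inverse = lambda v: (v - 1.0) / 2.0
g_yx_mismatch = lambda v: v

loss_consistent = cycle_consistency_loss(x, y, g_xy, g_yx_inverse)
loss_inconsistent = cycle_consistency_loss(x, y, g_xy, g_yx_mismatch)
print(loss_consistent)    # essentially zero
print(loss_inconsistent)  # clearly positive
```

In a real model both generators are neural networks, and this term is added to the adversarial losses with a weight $\lambda$.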
The full objective combines the adversarial losses for both generators with the cycle consistency loss:
$$\mathcal{L} = \mathcal{L}_{\text{GAN}}(G_{XY}, D_Y) + \mathcal{L}_{\text{GAN}}(G_{YX}, D_X) + \lambda \mathcal{L}_{\text{cyc}}$$
where $\lambda$ controls the relative importance of the cycle consistency loss. The discriminator $D_Y$ judges whether an image belongs to domain $Y$, and $D_X$ judges whether an image belongs to domain $X$. In practice, often only one translation direction is needed (for example, horse to zebra). In that case the adversarial term for the reverse direction can be dropped so that only $D_Y$ is used, although $G_{YX}$ is still required to compute the cycle consistency loss.
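For reference, each adversarial term can be written out in the same log-loss form as the standard GAN objective. This is a sketch under that assumption; practical CycleGAN implementations often substitute a least-squares loss for training stability.

$$\mathcal{L}_{\text{GAN}}(G_{XY}, D_Y) = \mathbb{E}_{y \sim p_{\text{data}}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log(1 - D_Y(G_{XY}(x)))]$$

The term $\mathcal{L}_{\text{GAN}}(G_{YX}, D_X)$ is defined symmetrically, and the generators minimize the full objective while the discriminators maximize it:

$$\min_{G_{XY},\, G_{YX}} \; \max_{D_X,\, D_Y} \; \mathcal{L}$$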
8. Adversarial Autoencoder (AAE)¶
In the CycleGAN architecture discussed above, the latent space of the autoencoder-style generator is determined entirely by the encoder's compression of the input — it is not explicitly controlled or constrained to follow any particular distribution. This points to a more fundamental limitation of standard autoencoders.
An autoencoder is a manifold learning model, not a generative model. It learns a compact representation of the training data by mapping inputs to a lower-dimensional latent space and reconstructing them. However, the structure of this latent space is arbitrary — it is shaped differently for each training run, and there is no guarantee that the space is continuous or well-organized. As a result, sampling from the latent space of a standard autoencoder does not reliably produce meaningful or realistic outputs. The autoencoder learns where the training data lives in the latent space, but it does not learn how to fill that space in a way that supports generation.
To use an autoencoder as a generative model, we need to encode data into a controllable latent space — one with a known probability distribution that we can sample from — and then generate new data by decoding samples drawn from that distribution.
This is where the adversarial approach comes in. By introducing a discriminator that encourages the encoder to map data into a latent space that follows a specified prior distribution — such as $\mathcal{N}(0, I)$ — we can impose explicit structure on the latent space. The result is the Adversarial Autoencoder (AAE), which combines the reconstruction objective of an autoencoder with the distributional matching objective of a GAN.
The figure illustrates the AAE architecture. The upper path is a standard autoencoder: the encoder maps an input image $\mathbf{x}$ to a latent representation $\mathbf{z} \sim q(\mathbf{z}|\mathbf{x})$, and the decoder reconstructs the image from $\mathbf{z}$. This path is trained with a standard reconstruction loss.
The lower path introduces the adversarial component. A prior distribution $p(\mathbf{z})$ — such as a standard Gaussian — is specified in advance, and samples are drawn from it. A discriminator is then placed at the latent space to distinguish between two types of inputs:
- $\mathbf{z} \sim q(\mathbf{z})$: latent codes produced by the encoder from real data
- $\mathbf{z} \sim p(\mathbf{z})$: samples drawn from the specified prior distribution
The discriminator is trained to tell these two sources apart, while the encoder is trained to fool the discriminator — producing latent codes that are indistinguishable from samples drawn from $p(\mathbf{z})$. This adversarial objective forces the aggregate posterior $q(\mathbf{z})$ to match the prior $p(\mathbf{z})$, imposing explicit distributional structure on the latent space.
The two objectives are trained jointly:
- The reconstruction loss ensures that the autoencoder faithfully encodes and decodes the input data
- The adversarial loss ensures that the latent space follows the specified prior distribution
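The two objectives can be written down explicitly. The sketch below assumes a deterministic encoder $\text{Enc}$, a decoder $\text{Dec}$, a latent-space discriminator $D$, and a squared-error reconstruction loss; these symbols are introduced here for illustration.

$$\mathcal{L}_{\text{rec}} = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}\left[\|\mathbf{x} - \text{Dec}(\text{Enc}(\mathbf{x}))\|_2^2\right]$$

$$\min_{\text{Enc}} \max_{D} \; \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}[\log D(\mathbf{z})] + \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log(1 - D(\text{Enc}(\mathbf{x})))]$$

Compared with the standard GAN objective, samples from the prior play the role of real data and encoder outputs play the role of fake data, so the encoder takes the place of the generator.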
Once trained, new data can be generated simply by sampling $\mathbf{z} \sim p(\mathbf{z})$ and passing it through the decoder — something that is not possible with a standard autoencoder, where the structure of the latent space is uncontrolled.
The adversarial idea introduced by GAN — placing a discriminator to provide a learned, adaptive signal — turns out to be a broadly useful principle that extends well beyond image generation. In the AAE, the discriminator is not used to distinguish real images from fake ones, but to match two distributions in the latent space.
8.1. Incorporating Label Information¶
Just as cGAN extends the standard GAN by conditioning on label information, AAE can also incorporate class labels into the adversarial training process.
The upper path remains unchanged — the encoder maps the input image to a latent code, and the decoder reconstructs the image from it. The key difference appears in the lower path: when samples are drawn from the prior $p(\mathbf{z})$, the class label (shown as a one-hot encoded vector) is concatenated with the latent code before being passed into the discriminator.
By conditioning the discriminator on the label, the adversarial objective now encourages the encoder to produce latent codes that match the prior distribution for each class independently. The result is a structured latent space in which different regions correspond to different class labels — making the latent space not only continuous and well-organized, but also semantically meaningful.
8.2. Implementation of AAE¶
Again, the MNIST dataset will be used to demonstrate how AAE can be implemented.
To keep the latent space visualization interpretable, only digits 0 through 5 are used, and the latent space is kept two-dimensional so that the distribution of encoded samples can be plotted directly.
The prior distribution $p(\mathbf{z})$ used here is a Gaussian mixture with six Gaussian clusters arranged in a circle, one for each digit class. Rather than matching a single Gaussian as in a standard AAE, the encoder is encouraged to map each digit class into a separate cluster, producing a structured and interpretable latent space.
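The geometry of this prior can be sketched in a few lines of NumPy: each sample starts as an elongated Gaussian at the origin, is rotated by the angle of its cluster, and is then pushed outward onto a circle. The function name is hypothetical, and the numeric defaults (radius 1.4, standard deviations 0.3 and 0.05) mirror the sampler defined later in this section.

```python
import numpy as np

def sample_circular_mixture(labels, num_labels=6, shift=1.4,
                            std_x=0.3, std_y=0.05, rng=None):
    """Draw one 2-D sample per label from Gaussians arranged on a circle."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    # Elongated Gaussian centered at the origin
    x = rng.normal(0.0, std_x, size=labels.shape)
    y = rng.normal(0.0, std_y, size=labels.shape)
    # Rotate by the cluster angle, then shift outward to radius `shift`
    angle = 2.0 * np.pi * labels / num_labels
    z1 = x * np.cos(angle) - y * np.sin(angle) + shift * np.cos(angle)
    z2 = x * np.sin(angle) + y * np.cos(angle) + shift * np.sin(angle)
    return np.stack([z1, z2], axis=1)

rng = np.random.default_rng(0)
labels = rng.integers(0, 6, size=1000)
z = sample_circular_mixture(labels, rng=rng)
print(z.shape)  # (1000, 2)
```

Because the long axis of each Gaussian points radially outward, the six clusters look like petals around the origin rather than round blobs.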
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import random
mnist = tf.keras.datasets.mnist
(train_x, train_y), (test_x, test_y) = mnist.load_data()
train_x, test_x = train_x.reshape(-1, 784)/255.0, test_x.reshape(-1, 784)/255.0
# Use only digits 0,1,2,3,4,5 to visualize latent space
train_idx = np.where(np.isin(train_y, [0, 1, 2, 3, 4, 5]))[0]
test_idx = np.where(np.isin(test_y, [0, 1, 2, 3, 4, 5]))[0]
train_imgs = train_x[train_idx]
train_labels = train_y[train_idx]
test_imgs = test_x[test_idx]
test_labels = test_y[test_idx]
n_train = train_imgs.shape[0]
n_test = test_imgs.shape[0]
print ("The number of training images : {}, shape : {}".format(n_train, train_imgs.shape))
print ("The number of testing images : {}, shape : {}".format(n_test, test_imgs.shape))
# Define Structure
# Encoder
encoder = tf.keras.models.Sequential([
tf.keras.layers.Dense(units = 500, activation = 'relu', input_shape = (784,)),
tf.keras.layers.Dense(units = 300, activation = 'relu'),
tf.keras.layers.Dense(units = 2, activation = None)
])
# Decoder
decoder = tf.keras.models.Sequential([
tf.keras.layers.Dense(units = 300, activation = 'relu', input_shape = (2,)),
tf.keras.layers.Dense(units = 500, activation = 'relu'),
tf.keras.layers.Dense(units = 784, activation = None)
])
# Discriminator
discriminator = tf.keras.models.Sequential([
tf.keras.layers.Dense(units = 100, activation = 'relu', input_shape = (2,)),
tf.keras.layers.Dense(units = 100, activation = 'relu'),
tf.keras.layers.Dense(units = 1, activation = 'sigmoid')
])
Here, the three models are defined and compiled. Pay attention to which parameters are set to be trainable or untrainable at each point — this may seem unclear at this stage, but will become apparent when the training loop is examined in the next cell.
The autoencoder chains the encoder and decoder together and is compiled with a mean squared error reconstruction loss. Both the encoder and decoder parameters are trainable here.
Before compiling the discriminator, encoder.trainable = False is set. This means that when the discriminator is trained, the encoder parameters will not be updated — only the discriminator parameters will change.
Before building the combined model, discriminator.trainable = False and encoder.trainable = True are set. The combined model chains the encoder and the frozen discriminator together. When this model is trained, the gradient flows through the discriminator and into the encoder — but only the encoder parameters are updated. This is the adversarial update step for the encoder.
autoencoder = tf.keras.models.Sequential([encoder, decoder])
autoencoder.compile(optimizer = tf.keras.optimizers.Adam(0.0005),
loss = 'mean_squared_error')
encoder.trainable = False
discriminator.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.0005),
loss = 'binary_crossentropy')
discriminator.trainable = False
encoder.trainable = True
combined_input = tf.keras.layers.Input(shape = (28*28,))
latent_variable = encoder(combined_input)
combined_output = discriminator(latent_variable )
combined = tf.keras.models.Model(inputs = combined_input, outputs = combined_output)
combined.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.0005),
loss = 'binary_crossentropy')
def gaussian_mixture_sampler(batchsize, ndim, num_labels):
    if ndim % 2 != 0:
        raise Exception("ndim must be a multiple of 2.")

    def sample(x, y, label, num_labels):
        # Rotate (x, y) by the cluster angle r, then shift it outward along that angle
        shift = 1.4
        r = 2 * np.pi / float(num_labels) * float(label)
        new_x = x * np.cos(r) - y * np.sin(r)
        new_y = x * np.sin(r) + y * np.cos(r)
        new_x += shift * np.cos(r)
        new_y += shift * np.sin(r)
        return np.array([new_x, new_y]).reshape((2,)), tf.one_hot(label, num_labels + 1)

    # x_var and y_var are passed to np.random.normal as standard deviations
    x_var = 0.3
    y_var = 0.05
    x = np.random.normal(0, x_var, (batchsize, ndim // 2))
    y = np.random.normal(0, y_var, (batchsize, ndim // 2))
    z = np.empty((batchsize, ndim), dtype = np.float32)
    z_label = np.empty((batchsize, num_labels + 1), dtype = np.float32)

    for batch in range(batchsize):
        for zi in range(ndim // 2):
            z[batch, zi*2:zi*2+2], z_label[batch] = sample(x[batch, zi], y[batch, zi], random.randint(0, num_labels - 1), num_labels)
    return z, z_label
prior, one_hot_label = gaussian_mixture_sampler(1000, 2, 6)
c = np.array(['r', 'g', 'b', 'c', 'y', 'k'])
plt.scatter(prior[:, 0], prior[:, 1], s = 1, c = c[list(np.argmax(one_hot_label, 1))])
plt.show()
def plot_latent_space(encoder, samples = 1000):
    idx = np.random.randint(0, train_imgs.shape[0], samples)
    latent_fake = encoder.predict(train_imgs[idx], verbose = 0)
    c = np.array(['r', 'g', 'b', 'c', 'y', 'k'])
    for i, el in enumerate([0, 1, 2, 3, 4, 5]):
        label_idx = np.where(train_labels[idx] == el)[0]
        plt.scatter(latent_fake[label_idx, 0], latent_fake[label_idx, 1], s = 1, label = el, c = c[i])
    plt.legend()
    plt.xticks([])
    plt.yticks([])
    plt.show()
Training alternates between three steps in each iteration:
- The autoencoder is updated to minimize reconstruction loss, ensuring the encoder and decoder faithfully represent the input data. At this step, both the encoder and decoder parameters are trainable.
- The discriminator is updated to distinguish between latent codes produced by the encoder and samples drawn from the Gaussian mixture prior. At this step, encoder.trainable = False ensures that the encoder parameters are not affected, and only the discriminator is updated.
- The encoder is updated adversarially via the combined model to fool the discriminator, pushing the latent codes toward the prior distribution. At this step, discriminator.trainable = False freezes the discriminator, and only the encoder parameters are updated through the adversarial gradient.
The careful management of which parameters are trainable at each step is essential for correct adversarial training. The same principle applies here as in the GAN training loop: when updating one network, the other must be held fixed to ensure that the gradient flows in the intended direction.
n_iter = 10000
batch_size = 100

fake = np.zeros(batch_size)
real = np.ones(batch_size)

for i in range(n_iter):
    idx = np.random.randint(0, train_imgs.shape[0], batch_size)

    # Train Autoencoder
    AE_loss = autoencoder.train_on_batch(train_imgs[idx], train_imgs[idx])

    # Train Discriminator
    latent_true, _ = gaussian_mixture_sampler(batch_size, 2, 6)
    latent_fake = encoder.predict(train_imgs[idx], verbose = 0)
    D_loss_real = discriminator.train_on_batch(latent_true, real)
    D_loss_fake = discriminator.train_on_batch(latent_fake, fake)
    D_loss = D_loss_real + D_loss_fake

    # Train Encoder
    idx = np.random.randint(0, train_imgs.shape[0], batch_size)
    Adv_loss = combined.train_on_batch(train_imgs[idx], real)

    if i % 1000 == 0:
        print('Autoencoder Loss: ', AE_loss)
        print('Discriminator Loss: ', D_loss)
        print('Adversarial Loss: ', Adv_loss)
        plot_latent_space(encoder)
The previous AAE example demonstrated that the adversarial training successfully pushes the latent space to follow the Gaussian mixture prior — and the resulting latent space appears to be reasonably well separated by digit class. However, this separation is not guaranteed since the model is never explicitly told which cluster should correspond to which digit. The assignment of digit classes to clusters is arbitrary and may vary across different training runs.
To achieve a truly controlled and semantically meaningful latent space — where each digit class is consistently mapped to a specific region — we need to incorporate label information directly into the training process. This is the motivation for the conditional AAE presented in the next example.
# Encoder
encoder = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units = 500, activation = 'relu', input_shape = (784,)),
    tf.keras.layers.Dense(units = 300, activation = 'relu'),
    tf.keras.layers.Dense(units = 2, activation = None)
])

# Decoder
decoder = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units = 300, activation = 'relu', input_shape = (2,)),
    tf.keras.layers.Dense(units = 500, activation = 'relu'),
    tf.keras.layers.Dense(units = 784, activation = None)
])

# Discriminator conditioned on the label
discriminator_label = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units = 100, activation = 'relu', input_shape = (2 + 6 + 1,)),
    tf.keras.layers.Dense(units = 100, activation = 'relu'),
    tf.keras.layers.Dense(units = 1, activation = 'sigmoid')
])
The encoder and decoder architectures are identical to the previous unconditional AAE — the encoder maps a 784-dimensional input to a 2-dimensional latent code, and the decoder reconstructs the 784-dimensional image from it.
The key difference is in the discriminator. In the previous example, discriminator took only the 2-dimensional latent code as input. Here, discriminator_label takes a concatenated input of dimension $2 + 6 + 1 = 9$ — the 2-dimensional latent code combined with a 7-dimensional one-hot label vector. The 7 dimensions correspond to 6 digit classes plus one additional index reserved for unlabeled samples.
By conditioning the discriminator on the label, the adversarial training can now explicitly encourage each digit class to occupy a specific and consistent region of the latent space — addressing the limitation of the unconditional AAE discussed above.
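The 9-dimensional discriminator input described above can be sketched with NumPy alone. This is a shape illustration under assumptions: the latent codes are random stand-ins rather than encoder outputs, and the helper one_hot and the variable names are hypothetical.

```python
import numpy as np

n_classes = 6                 # number of classes, as in the text above
unlabeled_index = n_classes   # extra one-hot slot reserved for unlabeled samples
batch_size = 4

def one_hot(indices, depth):
    """Return a (len(indices), depth) one-hot matrix."""
    out = np.zeros((len(indices), depth), dtype=np.float32)
    out[np.arange(len(indices)), indices] = 1.0
    return out

# Stand-in 2-D latent codes (random values, for shape illustration only)
latent = np.random.randn(batch_size, 2).astype(np.float32)

labeled = one_hot(np.array([0, 3, 5, 1]), n_classes + 1)
unlabeled = one_hot(np.full(batch_size, unlabeled_index), n_classes + 1)

# Latent code (2) + one-hot label (7) = discriminator input (9)
d_input_labeled = np.concatenate([latent, labeled], axis=1)
d_input_unlabeled = np.concatenate([latent, unlabeled], axis=1)

print(d_input_labeled.shape)       # (4, 9)
print(d_input_unlabeled[0, 2:])    # [0. 0. 0. 0. 0. 0. 1.]
```

The last column is active only for unlabeled samples, so the discriminator can tell whether a latent code arrives with a known class or not.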
# Autoencoder: encoder followed by decoder, trained on the reconstruction error
autoencoder = tf.keras.models.Sequential([encoder, decoder])
autoencoder.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.0005),
                    loss = 'mean_squared_error')

# Discriminator: compiled on its own so that only its weights are updated
encoder.trainable = False
discriminator_label.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.0005),
                            loss = 'binary_crossentropy')

# Combined model: discriminator frozen, encoder trainable
discriminator_label.trainable = False
encoder.trainable = True

# Input = flattened image (784) + one-hot label (6 classes + 1 unlabeled index)
combined_input = tf.keras.layers.Input(shape = (28*28 + 6 + 1,))
latent_variable = encoder(combined_input[:, :28*28])
latent_label = tf.concat([latent_variable, combined_input[:, 28*28:]], 1)
combined_label_output = discriminator_label(latent_label)
combined_label = tf.keras.models.Model(inputs = combined_input, outputs = combined_label_output)
combined_label.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.0005),
                       loss = 'binary_crossentropy')
n_iter = 10000
batch_size = 100

fake = np.zeros(batch_size)
real = np.ones(batch_size)

for i in range(n_iter):
    idx = np.random.randint(0, train_imgs.shape[0], batch_size)

    # Train Autoencoder
    AE_loss = autoencoder.train_on_batch(train_imgs[idx], train_imgs[idx])

    # Train Discriminator
    # positive phase: prior samples paired with their mixture-component labels
    latent_true, one_hot_true_label = gaussian_mixture_sampler(batch_size, 2, 6)
    latent_true = tf.concat([latent_true, one_hot_true_label], 1)
    D_loss_real = discriminator_label.train_on_batch(latent_true, real)

    # positive phase: prior samples paired with the unlabeled index (6)
    latent_true, one_hot_true_label = gaussian_mixture_sampler(batch_size, 2, 6)
    one_hot_fake_label = tf.one_hot([6] * batch_size, 6 + 1)
    latent_true = tf.concat([latent_true, one_hot_fake_label], 1)
    D_loss_real += discriminator_label.train_on_batch(latent_true, real)

    # negative phase: encoder outputs paired with the unlabeled index (6)
    latent_fake = encoder.predict(train_imgs[idx], verbose = 0)
    one_hot_fake_label = tf.one_hot([6] * batch_size, 6 + 1)
    latent_fake = tf.concat([latent_fake, one_hot_fake_label], 1)
    D_loss_fake = discriminator_label.train_on_batch(latent_fake, fake)

    D_loss = D_loss_real + D_loss_fake

    # Train Encoder (it plays the generator role; the discriminator is frozen)
    idx = np.random.randint(0, train_imgs.shape[0], batch_size)
    one_hot_fake_label = tf.one_hot([6] * batch_size, 6 + 1)
    train_input = tf.concat([train_imgs[idx], one_hot_fake_label], 1)
    Adv_loss = combined_label.train_on_batch(train_input, real)

    if i % 1000 == 0:
        print('Autoencoder Loss: ', AE_loss)
        print('Discriminator Loss: ', D_loss)
        print('Adversarial Loss: ', Adv_loss)

plot_latent_space(encoder)
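In the loop above, every encoder output is paired with the unlabeled index (6). If integer class labels were available for some samples, one possible way to pass them to the discriminator is sketched below. This is an assumption rather than part of the notebook: the array batch_labels, the convention that -1 marks an unknown class, and the random stand-in for the encoder output are all hypothetical.

```python
import numpy as np

n_classes = 6
batch_size = 5

# Hypothetical stand-ins: encoder outputs and integer labels for one batch.
# A label of -1 marks a sample whose class is unknown (assumed convention).
latent_fake = np.random.randn(batch_size, 2).astype(np.float32)
batch_labels = np.array([2, -1, 0, 5, -1])

# Map unknown labels to the extra "unlabeled" slot (index 6)
slot = np.where(batch_labels < 0, n_classes, batch_labels)
one_hot_label = np.eye(n_classes + 1, dtype=np.float32)[slot]

# Discriminator input: latent code (2) + one-hot label (7)
d_input = np.concatenate([latent_fake, one_hot_label], axis=1)
print(d_input.shape)   # (5, 9)
```

With this pairing, a labeled sample would be compared against prior samples drawn from the mixture component with the same label, while unlabeled samples would be compared against the full mixture.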