Autoencoder

Table of Contents



In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('ul27ZUdYaBY', width = "560", height = "315")
Out[ ]:



1. Unsupervised Learning

Unsupervised learning is a type of machine learning where a model learns patterns, structures, or relationships in the data without labeled outputs. Unlike supervised learning, which uses input-output pairs for training, unsupervised learning works with unlabeled data, meaning the algorithm must discover inherent features or groupings within the dataset.


Definition

  • Unsupervised learning refers to most attempts to extract information from a distribution that do not require human labor to annotate examples
  • Main task is to find the 'best' representation of the data

Dimension Reduction

  • Attempt to compress as much information as possible in a smaller representation
  • Preserve as much information as possible while obeying some constraint aimed at keeping the representation simpler (a minimal linear sketch using the SVD follows below)
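
Before turning to autoencoders, it may help to see what a purely linear reduction looks like. The following minimal sketch (added here for illustration, with random toy data) uses the SVD, the computation behind PCA and the POD revisited in Section 6, to compress 10-dimensional samples to 2 dimensions and reconstruct them.

In [ ]:
import numpy as np

# Toy data: 100 samples in 10 dimensions (values are illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size = (100, 10))

# Center the data and compute a rank-2 projection via the SVD (the idea behind PCA/POD)
X_centered = X - X.mean(axis = 0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices = False)

Z = X_centered @ Vt[:2].T               # compressed 2-D representation
X_hat = Z @ Vt[:2] + X.mean(axis = 0)   # reconstruction from the 2-D representation

print(Z.shape, X_hat.shape)             # (100, 2) (100, 10)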



2. Autoencoders

An autoencoder is a type of neural network used for unsupervised learning that aims to learn efficient, compressed representations of input data. It consists of two main components: an encoder and a decoder. The encoder compresses the input into a lower-dimensional representation (called the latent space or bottleneck), while the decoder reconstructs the input from this compressed representation.

The objective of an autoencoder is to minimize the reconstruction error—the difference between the original input and its reconstruction.


Definition

  • An autoencoder is a neural network that is trained to attempt to copy its input to its output
  • The network consists of two parts: an encoder and a decoder that produce a reconstruction

Encoder and Decoder

  • Encoder function : $z = f(x)$
  • Decoder function : $\hat{x} = g(z)$
  • We train the network so that $g\left(f(x)\right) \approx x$ (a toy numerical sketch follows below)
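
To make the notation concrete, here is a toy sketch (added for illustration, not part of the original notes) in which the encoder and decoder are simple linear maps with random placeholder weights; training would adjust these weights so that $g(f(x)) \approx x$.

In [ ]:
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative): a 4-D input compressed to a 2-D code
W_enc = rng.normal(size = (2, 4))   # untrained placeholder weights
W_dec = rng.normal(size = (4, 2))

def f(x):        # encoder: z = f(x)
    return W_enc @ x

def g(z):        # decoder: x_hat = g(z)
    return W_dec @ z

x = rng.normal(size = 4)
x_hat = g(f(x))  # training adjusts W_enc and W_dec so that g(f(x)) ≈ x
print(x.shape, x_hat.shape)   # (4,) (4,)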

The top network in the figure below shows a trivial autoencoder in which the input data is passed directly to the output without any significant compression or transformation. This structure may be able to reconstruct the input perfectly, but it fails to learn any meaningful abstraction or representation. On the other hand, the bottom network illustrates a typical autoencoder with a squeezed bottleneck, which compresses the input into a lower-dimensional latent space before reconstructing it. This narrow bottleneck structure offers several key benefits:

  • The top structure, though capable of reconstructing the input, does not learn useful or general features. It essentially passes information without processing or summarizing it.

  • In contrast, the bottom structure is meaningful because the bottleneck introduces a constraint that forces the network to learn relevant features instead of copying the input.




Structure of an Autoencoder

  • Encoder (Compression Path):
    • The encoder maps the input $x$ to a lower-dimensional latent representation $z$.
    • This transformation can be represented as:

      $$ z = f(x) $$
    • The encoder typically consists of fully connected (dense) or convolutional layers, followed by non-linear activations (e.g., ReLU, sigmoid).

  • Latent Space (Bottleneck):
    • The bottleneck layer $z$ represents the compressed encoding of the input.
    • The size of this layer controls how much information is retained from the input.

  • Decoder (Reconstruction Path):
    • The decoder maps the latent representation $z$ back to the original input space:

      $$ \hat{x} = g(z) $$
    • The decoder reconstructs the input $x$ as closely as possible to the original input.

  • Loss Function:
    • The loss function measures the reconstruction error, typically using mean squared error (MSE), computed numerically in the short sketch after this list:

      $$ \mathcal{L}(x, \hat{x}) = \|x - \hat{x}\|^2 $$
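
As a quick numerical illustration (the vectors below are placeholders, not real data), the reconstruction loss can be computed directly:

In [ ]:
import numpy as np

x = np.array([0.0, 0.5, 1.0])        # original input
x_hat = np.array([0.1, 0.4, 0.9])    # reconstruction produced by the decoder

loss = np.mean((x - x_hat)**2)       # mean squared reconstruction error
print(loss)                          # ≈ 0.01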

Key Features of Autoencoders

The squeezed bottleneck in an autoencoder acts as a filter that retains essential features while discarding irrelevant or redundant information. It enhances the model's ability to learn compact, robust representations of the data and improves its generalization ability, making it effective for tasks such as dimensionality reduction, denoising, anomaly detection, and representation learning.

(1) Dimensionality Reduction

  • The bottleneck layer forces the network to compress the input data into a lower-dimensional representation.

  • This reduction removes redundant and non-essential information, retaining only the most important features.

  • Example: Instead of storing all pixel values of an image, the autoencoder learns a compact representation (e.g., shape, edges, or texture patterns) to reconstruct the image.

(2) Feature Extraction

  • By limiting the size of the bottleneck, the autoencoder learns meaningful, abstract features rather than memorizing the input.

  • The compressed features in the latent space can be used for classification, clustering, or visualization tasks, similar to principal components in PCA but in a non-linear manner.

(3) Preventing Overfitting

  • A wide bottleneck can allow the autoencoder to simply copy the input data, leading to poor generalization.

  • A squeezed bottleneck enforces a constraint that forces the network to generalize the underlying structure of the data rather than learning a direct mapping of the inputs.

(4) Noise Reduction (Denoising Autoencoders)

  • In tasks like denoising and anomaly detection, the bottleneck prevents the network from reconstructing noise, as the network focuses only on core patterns.

(5) Efficient Encoding for Data Compression

  • Autoencoders are often used for data compression by representing the input data with fewer bits (from high-dimensional space to low-dimensional space).

  • The "squeezing" helps reduce data storage needs while maintaining the ability to reconstruct the original input with minimal loss.



3. Autoencoder with TensorFlow

  • MNIST example

  • Use only (1, 5, 6) digits to visualize in 2-D

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
%matplotlib inline

np.random.seed(42)
In [ ]:
# Load Data

mnist = tf.keras.datasets.mnist

(train_x, train_y), (test_x, test_y) = mnist.load_data()
train_x, test_x = train_x.reshape(-1, 784)/255.0, test_x.reshape(-1, 784)/255.0
In [ ]:
# Use only 1, 5, 6 digits to visualize the latent space

train_idx1 = np.array(np.where(train_y == 1))
train_idx5 = np.array(np.where(train_y == 5))
train_idx6 = np.array(np.where(train_y == 6))
train_idx = np.sort(np.concatenate((train_idx1, train_idx5, train_idx6), axis = None))

test_idx1 = np.array(np.where(test_y == 1))
test_idx5 = np.array(np.where(test_y == 5))
test_idx6 = np.array(np.where(test_y == 6))
test_idx = np.sort(np.concatenate((test_idx1, test_idx5, test_idx6), axis = None))

train_imgs = train_x[train_idx]
train_labels = train_y[train_idx]
test_imgs = test_x[test_idx]
test_labels = test_y[test_idx]

n_train = train_imgs.shape[0]
n_test = test_imgs.shape[0]

print ("The number of training images : {}, shape : {}".format(n_train, train_imgs.shape))
print ("The number of testing images : {}, shape : {}".format(n_test, test_imgs.shape))
The number of training images : 18081, shape : (18081, 784)
The number of testing images : 2985, shape : (2985, 784)
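
As a quick sanity check (an extra cell added here, not in the original notebook), a few of the selected digits can be displayed before defining the model:

In [ ]:
# Quick sanity check: display a few of the selected training digits
plt.figure(figsize = (6, 2))
for i in range(3):
    plt.subplot(1, 3, i + 1)
    plt.imshow(train_imgs[i].reshape(28, 28), 'gray')
    plt.title(str(train_labels[i]))
    plt.xticks([])
    plt.yticks([])
plt.show()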

3.1. Define a Structure of an Autoencoder

  • Input shape and latent variable shape
  • Encoder shape
  • Decoder shape




Note on 2D Latent Space and Digits 1, 5, and 6 from MNIST

In this autoencoder structure, we assign only two neurons in the latent space. This choice serves a specific purpose - it is not necessarily because a 2D latent space is the optimal representation for the data, but rather because it provides an effective way to visualize the compressed data points in a two-dimensional space.

  • A 2D latent space allows the data points to be plotted in a 2D plane, making it easier to visualize how the autoencoder organizes and compresses the input data.

  • Instead of using all 10 digits (0 - 9), using only a subset of digits (e.g., 1, 5, 6) makes the visualization simpler and clearer.

  • Plotting all 10 digits in a 2D latent space can result in overlapping clusters, making it difficult to interpret the latent space representation.


In [ ]:
# Define Structure

# Encoder Structure
encoder = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape = (784,)),
    tf.keras.layers.Dense(units = 500, activation = 'relu'),
    tf.keras.layers.Dense(units = 300, activation = 'relu'),
    tf.keras.layers.Dense(units = 2, activation = None)
    ])

# Decoder Structure
decoder = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape = (2,)),
    tf.keras.layers.Dense(units = 300, activation = 'relu'),
    tf.keras.layers.Dense(units = 500, activation = 'relu'),
    tf.keras.layers.Dense(units = 28*28, activation = None)
    ])

# Autoencoder = Encoder + Decoder
autoencoder = tf.keras.models.Sequential([encoder, decoder])
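
Optionally, the layer shapes and parameter counts can be inspected with the standard Keras summaries (an extra check, not in the original cells):

In [ ]:
# Optional: inspect layer shapes and parameter counts
encoder.summary()
decoder.summary()
autoencoder.summary()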

3.2. Define Loss and Optimizer

Loss

  • Squared loss (for an autoencoder, the target $t_i$ is the input $x_i$ itself and the output $y_i$ is the reconstruction $\hat{x}_i$)

$$ \frac{1}{m}\sum_{i=1}^{m} (t_{i} - y_{i})^2 $$


Optimizer

  • Adam optimizer: one of the most widely used optimizers

In [ ]:
autoencoder.compile(optimizer = tf.keras.optimizers.Adam(0.001),
                    loss = 'mean_squared_error',
                    metrics = ['mse'])
In [ ]:
# Train Model & Evaluate Test Data

training = autoencoder.fit(train_imgs, train_imgs, batch_size = 50, epochs = 10, verbose = 0)
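
To check convergence, the loss recorded in the returned History object can be plotted (an optional check added here):

In [ ]:
# Optional: plot the training loss curve stored in the History object
plt.figure(figsize = (5, 3))
plt.plot(training.history['loss'])
plt.xlabel('epoch')
plt.ylabel('MSE loss')
plt.title('Training Loss', fontsize = 12)
plt.show()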

3.3. Test or Evaluate

  • Test reconstruction performance of the autoencoder
In [ ]:
test_scores = autoencoder.evaluate(test_imgs, test_imgs, verbose = 0)

print('Test loss: {}'.format(test_scores[0]))
print('Mean Squared Error: {} %'.format(test_scores[1]*100))
Test loss: 0.026258012279868126
Mean Squared Error: 2.6258012279868126 %
In [ ]:
# Visualize Evaluation on Test Data

rand_idx = np.random.randint(1, test_imgs.shape[0])
# rand_idx = 6

test_img = test_imgs[rand_idx]
reconst_img = autoencoder.predict(test_img.reshape(1, 28*28), verbose = 0)

plt.figure(figsize = (6, 4))
plt.subplot(1,2,1)
plt.imshow(test_img.reshape(28,28), 'gray')
plt.title('Input Image', fontsize = 12)

plt.xticks([])
plt.yticks([])
plt.subplot(1,2,2)
plt.imshow(reconst_img.reshape(28,28), 'gray')
plt.title('Reconstructed Image', fontsize = 12)
plt.xticks([])
plt.yticks([])

plt.show()
[Output: input image (left) and reconstructed image (right)]

In this autoencoder for MNIST, it is remarkable that a latent space with only two neurons can capture sufficient information to reconstruct an MNIST image, originally represented in a 784-dimensional space.



4. Latent Space

The latent space refers to a lower-dimensional, abstract feature space where input data is encoded by a machine learning model, such as an autoencoder. In this space, the essential characteristics of the input data are represented in a compressed form. The term "latent" implies that the features in this space are hidden or learned representations that capture underlying patterns, rather than explicit input features.

  • Let’s examine how the compressed features of the digits 1, 5, and 6 are distributed in the learned latent space.

  • To see the distribution of the latent variables, we project the 784-dimensional image space onto the 2-dimensional latent space


In [ ]:
idx = np.random.randint(0, len(test_labels), 500)
test_x, test_y = test_imgs[idx], test_labels[idx]
test_x = np.array(test_x)
In [ ]:
test_latent = encoder.predict(test_x, verbose = 0)

plt.figure(figsize = (5, 5))
plt.scatter(test_latent[test_y == 1,0], test_latent[test_y == 1,1], label = '1')
plt.scatter(test_latent[test_y == 5,0], test_latent[test_y == 5,1], label = '5')
plt.scatter(test_latent[test_y == 6,0], test_latent[test_y == 6,1], label = '6')
plt.title('Latent Space', fontsize = 12)
plt.xlabel('$Z_1$', fontsize = 12)
plt.ylabel('$Z_2$', fontsize = 12)
plt.legend(fontsize = 12)
plt.axis('equal')
plt.show()
[Output: scatter plot of the 2-D latent space for digits 1, 5, and 6]

From the results, we can conclude that clustering or classification can be performed in the 2D latent space, rather than in the original 784-dimensional input space. In many practical applications, dimensionality reduction is performed before applying further machine learning algorithms or analyses. This preprocessing step helps to address issues related to high-dimensional data and often improves the performance of subsequent tasks such as clustering, classification, or anomaly detection.
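
As a small illustration of this point (an extra cell added here, assuming scikit-learn is available), k-means can be run directly on the 2-D latent codes computed above:

In [ ]:
# Illustrative only: k-means clustering on the 2-D latent codes (assumes scikit-learn)
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters = 3, n_init = 10, random_state = 0)
cluster_id = kmeans.fit_predict(test_latent)

plt.figure(figsize = (5, 5))
plt.scatter(test_latent[:, 0], test_latent[:, 1], c = cluster_id, s = 10)
plt.title('k-means Clusters in the Latent Space', fontsize = 12)
plt.xlabel('$Z_1$', fontsize = 12)
plt.ylabel('$Z_2$', fontsize = 12)
plt.axis('equal')
plt.show()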


Note:

  • The latent space may change if the model is re-trained.

  • Each latent variable (or dimension) does not have an explicit physical meaning.




5. Data Generation

While autoencoders are primarily designed for representation learning and reconstruction, they can also be used for data generation by propagating values through the decoder. The process involves selecting a latent vector (a point in the latent space) and feeding it into the decoder to reconstruct an output that resembles the original data distribution.


How Data Generation Works in an Autoencoder

(1) Latent Space Exploration:

  • After training, the latent space encodes the structure of the input data. Each point in this space represents compressed features of a particular input (e.g., digits in the MNIST dataset).

  • By selecting specific points in the latent space, we can generate corresponding outputs through the decoder.

(2) Random Latent Vector Selection:

  • Instead of using actual encoded inputs, we can randomly sample latent vectors within a meaningful range of the latent space (typically around the learned distribution).

  • These sampled vectors are fed into the decoder to produce new, generative outputs that resemble the patterns learned during training.

(3) Decoding Process:

  • The decoder interprets the randomly selected latent vector as a set of features and reconstructs an image or data point that corresponds to these features.

  • If the latent vector lies near a cluster representing digit "6" in the MNIST dataset, for example, the output may resemble a handwritten "6".


Key Characteristics of Generative Data in Autoencoders

(1) Reconstruction vs. Generation:

  • In reconstruction, the encoder compresses the input, and the decoder reconstructs the same input.

  • In data generation, the input is replaced by a synthetic latent vector, and the decoder "imagines" a corresponding output based on the patterns learned during training.

(2) Interpolation:

  • By selecting latent values between two points (e.g., between representations of digit "1" and digit "5"), the decoder can generate data that smoothly transitions between these classes, creating hybrid or intermediate samples.

In [ ]:
new_data = np.array([[-3, -5]])

fake_image = decoder.predict(new_data, verbose = 0)

plt.figure(figsize = (9, 4))
plt.subplot(1,2,1)
plt.scatter(test_latent[test_y == 1,0], test_latent[test_y == 1,1], label = '1')
plt.scatter(test_latent[test_y == 5,0], test_latent[test_y == 5,1], label = '5')
plt.scatter(test_latent[test_y == 6,0], test_latent[test_y == 6,1], label = '6')
plt.scatter(new_data[:,0], new_data[:,1], c = 'k', marker = '*', s = 100, label = 'new data')
plt.title('Latent Space', fontsize = 10)
plt.xlabel('$Z_1$', fontsize = 10)
plt.ylabel('$Z_2$', fontsize = 10)
plt.legend(loc = 2)
plt.axis('equal')
plt.subplot(1,2,2)
plt.imshow(fake_image.reshape(28,28), 'gray')
plt.title('Generated Fake Image', fontsize = 10)
plt.xticks([])
plt.yticks([])
plt.show()
[Output: latent space with the new data point (left) and the generated fake image (right)]

5.1. Walk in the Latent Space

A latent space is a compressed, lower-dimensional space that a model uses to represent high-dimensional data. Each point in this space corresponds to some meaningful representation of an input.

A "walk in the latent space" refers to the process of moving smoothly through this space - typically by interpolating between latent vectors - and observing how the outputs change. This is like taking a walk through the model's internal "imagination."


In [ ]:
new_data_1 = np.array([[0, 3]])
new_data_2 = np.array([[-3, -1]])
c = 0.5*(new_data_1 + new_data_2)

fake_image_1 = decoder.predict(new_data_1, verbose = 0)
fake_image_2 = decoder.predict(new_data_2, verbose = 0)

plt.figure(figsize = (6, 4))
plt.scatter(test_latent[test_y == 1,0], test_latent[test_y == 1,1], label = '1')
plt.scatter(test_latent[test_y == 5,0], test_latent[test_y == 5,1], label = '5')
plt.scatter(test_latent[test_y == 6,0], test_latent[test_y == 6,1], label = '6')
plt.scatter(new_data_1[:,0], new_data_1[:,1], c = 'k', marker = '*', s = 100, label = 'new data')
plt.scatter(new_data_2[:,0], new_data_2[:,1], c = 'k', marker = '*', s = 100, label = 'new data')
plt.scatter(c[:,0], c[:,1], c = 'k', marker = 's', s = 100, label = 'new data')
plt.show()

plt.figure(figsize = (6, 4))
plt.subplot(1,2,1)
plt.imshow(fake_image_1.reshape(28,28), 'gray')
plt.title('Generated Image 1', fontsize = 12)

plt.xticks([])
plt.yticks([])
plt.subplot(1,2,2)
plt.imshow(fake_image_2.reshape(28,28), 'gray')
plt.title('Generated Image 2', fontsize = 12)
plt.xticks([])
plt.yticks([])

plt.show()
[Output: latent space with the two selected points and their midpoint]
[Output: the two images decoded from the selected latent points]

In this example, two points in the latent space appear to represent different tilt angles of a digit. This suggests that the latent space captures meaningful variations in the data, such as rotation. However, it is important to note again that these representations are not fixed - they will change if the model is re-trained.


In [ ]:
fake_image_lat = decoder.predict(0.5*(new_data_1 + new_data_2), verbose = 0)

plt.figure(figsize = (6, 4))
plt.subplot(1,2,1)
plt.imshow(0.5*(fake_image_1 + fake_image_2).reshape(28,28), 'gray')
plt.title('Interpolation in original')
plt.xticks([])
plt.yticks([])

plt.subplot(1,2,2)
plt.imshow(fake_image_lat.reshape(28,28), 'gray')
plt.title('Interpolation in latent')
plt.xticks([])
plt.yticks([])

plt.show()
[Output: pixel-space average (left) vs. image decoded from the averaged latent vector (right)]

In the original input space, computing $ \frac{x_1 + x_2}{2} $ results in a simple pixel-wise average of two images, which often appears blurry and lacks interpretability. In contrast, reconstructing the image from the averaged latent vector $ \frac{z_1 + z_2}{2} $ produces a digit whose tilt angle lies roughly between those of $ z_1 $ and $ z_2 $. This demonstrates that interpolation in the latent space can yield semantically meaningful results, unlike direct interpolation in the raw input space.


Vector Arithmetic in Latent Space

Latent space representations learned by generative models can be directly manipulated through simple vector arithmetic to generate new, semantically meaningful outputs. This fascinating property enables intuitive and targeted image generation, such as modifying attributes like gender, facial expression, or accessories.

A major milestone in this area was the 2015 paper by Alec Radford et al. titled "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks." The authors introduced a stable architecture for training deep convolutional neural networks within the GAN framework, known as DCGAN. (While not an autoencoder, it shares a similar underlying philosophy as a generative model.)

In their study, the authors explored the structure of the latent space by training models on various datasets - most notably, a dataset of celebrity faces - and demonstrated that semantic transformations (e.g., adding glasses or changing gender) could be achieved via linear operations in the latent space.



The image above illustrates this concept. A latent vector representing a "man with glasses" is subtracted from a "man without glasses", isolating the direction in latent space corresponding to the "glasses" feature. When this difference is added to the latent vector of a "woman without glasses", the result is a new latent vector that, when decoded, generates an image of a "woman with glasses."

This arithmetic operation can be formally written as:


$$ z_{\text{man with glasses}} - z_{\text{man without glasses}} + z_{\text{woman without glasses}} = z_{\text{woman with glasses}} $$


This example highlights the fact that GANs and similar generative models learn structured and semantically meaningful latent spaces, where directions correspond to interpretable, high-level concepts. Such properties make these models especially powerful in applications such as generative design, facial attribute editing, style transfer, and interactive AI-based creative tools.
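
The same arithmetic can be tried in the 2-D latent space of our MNIST autoencoder. The sketch below (added here for illustration) uses the mean latent codes of digits 1, 5, and 6 as stand-ins for attribute vectors; with only two latent dimensions and no attribute labels, the decoded result is merely suggestive and will not be as clean as the DCGAN example.

In [ ]:
# Illustrative only: vector arithmetic z_1 - z_5 + z_6 in the 2-D MNIST latent space
z_mean_1 = test_latent[test_y == 1].mean(axis = 0)   # average latent code of digit 1
z_mean_5 = test_latent[test_y == 5].mean(axis = 0)   # average latent code of digit 5
z_mean_6 = test_latent[test_y == 6].mean(axis = 0)   # average latent code of digit 6

z_new = z_mean_1 - z_mean_5 + z_mean_6
gen_img = decoder.predict(z_new.reshape(1, 2), verbose = 0)

plt.figure(figsize = (3, 3))
plt.imshow(gen_img.reshape(28, 28), 'gray')
plt.title('Decoded $z_1 - z_5 + z_6$', fontsize = 10)
plt.xticks([])
plt.yticks([])
plt.show()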



5.2. Generative AI in Image Modeling: From Autoencoders to Diffusion Models

Generative AI focuses on learning data distributions in order to generate new, realistic samples - such as images, audio, or text - that resemble those from the original dataset. In the field of image generation, several deep learning architectures have been developed, each marking a significant advancement in generative modeling:

  • Autoencoders
  • Variational Autoencoders (VAEs)
  • Generative Adversarial Networks (GANs)
  • Diffusion Models

Autoencoders are not probabilistic models and often produce outputs that lack sharpness or realism, as we have observed. However, they serve as an important foundational architecture for generative AI.


Generative image modeling has rapidly evolved - from simple reconstructions using autoencoders to high-fidelity and controllable image generation using diffusion models. Each step in this progression - from VAEs to GANs to diffusion models - has contributed to improvements in visual realism, diversity of outputs, and user-driven customization.

Today, tools such as DALL-E and Stable Diffusion have made these technologies widely accessible to researchers, creators, and the general public, enabling a broad range of creative and scientific applications.




6. Revisit the Problem: Flow Around a Circular Cylinder

We previously studied fluid flow past a circular cylinder at low Reynolds number in the context of dimension reduction. In that example, we applied SVD or POD as linear techniques for reducing dimensionality.

Now that we've introduced the autoencoder - a nonlinear, neural network-based approach to dimension reduction - let's apply an autoencoder to the same problem and compare the results.

This example is adapted from the textbook "Data-Driven Science and Engineering" by Steven L. Brunton and J. Nathan Kutz.



Data Description and Visualization

The dataset provided contains flow field data for fluid flow past a circular cylinder at a Reynolds number of Re = 100. The data file, CYLINDER_ALL.mat, includes 151 snapshots of velocity components and vorticity fields:

  • UALL: Horizontal velocity field (u-velocity)
  • VALL: Vertical velocity field (v-velocity)
  • VORTALL: Vorticity field

The flow is studied at a Reynolds number of Re = 100, a regime where vortex shedding occurs periodically and can be captured effectively using dimension reduction.


In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
from scipy import io

flows = io.loadmat('/content/drive/MyDrive/ML/ML_data/CYLINDER_ALL.mat')

flows_mat_u = flows['UALL']
flows_mat_v = flows['VALL']
flows_mat_vort = flows['VORTALL']

flows_mat_vort_normalized = (flows_mat_vort - flows_mat_vort.min()) / (flows_mat_vort.max() - flows_mat_vort.min())
flows_mat_vort_normalized_mean = flows_mat_vort_normalized.mean(axis = 1)
flows_mat_vort_normalized_centered = flows_mat_vort_normalized - flows_mat_vort_normalized_mean[:, None]

flows_mat_vort_normalized_centered = np.array(flows_mat_vort_normalized_centered)
print(flows_mat_vort_normalized_centered.shape)

nx = 449
ny = 199
(89351, 151)
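
Before building the autoencoder, a single raw vorticity snapshot can be visualized as a quick sanity check (an extra cell added here; the reshape convention follows the animation code used later in this section):

In [ ]:
# Quick check: visualize one raw vorticity snapshot (column 0), reshaped to (nx, ny)
plt.figure(figsize = (6, 3))
plt.imshow(flows_mat_vort[:, 0].reshape(nx, ny), 'gray')
plt.title('Vorticity Snapshot 0')
plt.axis('off')
plt.show()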

6.1. Autoencoder



Input Reshaping (or Flattening) for an Autoencoder

In [ ]:
# focus on vorticity field only

train_x = flows_mat_vort_normalized_centered.T.reshape(-1, nx*ny)

print(train_x.shape)
(151, 89351)

Autoencoder with a 2 Dimensional Latent Space

In [ ]:
import tensorflow as tf

# Encoder Structure
encoder = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape = (89351,)),
    tf.keras.layers.Dense(units = 256, activation = 'relu'),
    tf.keras.layers.Dense(units = 64, activation = 'relu'),
    tf.keras.layers.Dense(units = 2, activation = None)
    ])

# Decoder Structure
decoder = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape = (2,)),
    tf.keras.layers.Dense(units = 64, activation = 'relu'),
    tf.keras.layers.Dense(units = 256, activation = 'relu'),
    tf.keras.layers.Dense(units = 449*199, activation = None)
    ])

# Autoencoder = Encoder + Decoder
autoencoder = tf.keras.models.Sequential([encoder, decoder])
In [ ]:
autoencoder.compile(optimizer = tf.keras.optimizers.Adam(),
                    loss = 'mean_squared_error',
                    metrics = ['mse'])
In [ ]:
# Train Model
training = autoencoder.fit(train_x, train_x, epochs = 500, verbose = 0)

Reconstructing Inputs and Visualizing Them in 2D Latent Space

In [ ]:
reconst_x = autoencoder.predict(train_x, verbose = 0)

latent = encoder.predict(train_x, verbose = 0)
In [ ]:
# Let's see the flow and latent space in time

from matplotlib.animation import FuncAnimation
import time
from IPython import display

fig = plt.figure(figsize = (6, 4))

ax1 = fig.add_subplot(1, 3, 1)
ax2 = fig.add_subplot(1, 3, 2)
ax3 = fig.add_subplot(1, 3, 3)

def animate(i):
  ax1.clear()
  ax1.imshow(train_x[i,:].reshape(nx, ny) + flows_mat_vort_normalized_mean.reshape(nx, ny), 'gray')
  ax1.set_title('Ground Truth')
  ax1.axis('off')
  ax2.clear()
  ax2.imshow(reconst_x[i,:].reshape(nx, ny) + flows_mat_vort_normalized_mean.reshape(nx, ny), 'gray')
  ax2.set_title('Reconstructed')
  ax2.axis('off')
  ax3.clear()
  ax3.scatter(latent[:,0], latent[:,1], alpha = 0.3)
  ax3.scatter(latent[i,0], latent[i,1], color = 'r', s = 50)
  ax3.set_title('Latent Space')
  ax3.set_xticks([])
  ax3.set_yticks([])

ani = FuncAnimation(fig, animate, frames = 150, interval = 20)
video = ani.to_html5_video()
html = display.HTML(video)
display.display(html)
plt.close()
In [ ]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')