Dimension Reduction

Table of Contents

1. Statistics¶

Statistics is the branch of mathematics that deals with the collection, analysis, interpretation, presentation, and organization of data. It plays a crucial role in various fields, including science, business, social sciences, and machine learning.

from IPython.display import YouTubeVideo
YouTubeVideo('OWOzaQhdE3Y', width = "560", height = "315")

Statistics vs Probability

Probability

Definition: Probability is the branch of mathematics that deals with predicting the likelihood of future outcomes based on known conditions or assumptions.
Focus: Probability starts with a known model (e.g., theoretical distribution) and aims to predict possible outcomes.

Statistics

Definition: Statistics is the branch of mathematics that involves analyzing existing data to infer properties of the population or predict future outcomes.
Focus: Statistics starts with observed data and attempts to build models or draw conclusions about the underlying distribution.

1.1. Populations and Samples¶

In statistics, understanding the distinction between populations and samples is fundamental to data analysis and inference.

A population includes all the elements from a set of data
A parameter is a quantity computed from a population
- mean, $\mu$
- variance, $\sigma^2$

A sample is a subset of the population.
- one or more observations
A statistic is a quantity computed from samples
- sample mean, $\bar{x}$
- sample variance, $s^2$
- sample correlation, $S_{𝑥𝑦}$

Purpose of Sampling

Samples are used because measuring the entire population is often impractical or impossible.
To make inferences about a population without examining every member.

How to Generate Random Numbers (Samples or data)

Data sampled from population/process/generative model

import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

## random number generation (1D)
m = 1000;

# uniform distribution U(0,1)
x1 = np.random.rand(m,1);

# uniform distribution U(a,b)
a = 1;
b = 5;
x2 = a + (b-a)*np.random.rand(m,1);

# standard normal (Gaussian) distribution N(0,1^2)
# x3 = np.random.normal(0, 1, m)
x3 = np.random.randn(m,1);

# normal distribution N(5,2^2)
x4 = 5 + 2*np.random.randn(m,1);

# random integers
x5 = np.random.randint(1, 6, size = (1,m));

Histogram: A Graphical Representation of Data Distribution

A histogram is a powerful tool for visualizing the distribution of data. It represents the frequency of data points across defined intervals (bins). Unlike bar charts, histograms group continuous data into bins to reveal patterns such as skewness, modality, and outliers.

1.2. Multivariate Statistics¶

Data Representation

Each observation $x^{(i)}$ is treated as a column vector with multiple features (variables)
The dataset $X$ is organized as a matrix, where each row corresponds to a transposed observation vector $\left(x^{(i)}\right)^T$

$$ x^{(i)} = \begin{bmatrix}x_1^{(i)} \\ x_2^{(i)}\\ \vdots \end{bmatrix}, \quad X = \begin{bmatrix} -& (x^{(i)})^T & -\\ - & (x^{(i)})^T & -\\ & \vdots & \\ - & (x^{(m)})^T & -\end{bmatrix}$$

Sample Mean

The sample mean, denoted $\bar x$, is the average of all observation vectors $\left(x^{(i)}, x^{(2)}, \cdots , x^{(m)}\right)$

$$\text{sample mean} \; \bar x = \frac{x^{(1)} + x^{(2)} + \cdots + x^{(m)}}{m} = \frac{1}{m} \sum\limits_{i=1}^{m}x^{(i)} $$

Sample Variance

The sample variance, $S^2$, measures the spread of the data. The formula is:

$$\text{sample variance} \; S^2 = \frac{1}{m-1} \sum\limits_{i=1}^{m}(x^{(i)} - \bar x)^2$$

Population variance

For a complete population of size $N$, the variance is calculated as:

$$\sigma^2 = \frac{1}{N}\sum\limits_{i=1}^{N}(x^{(i)} - \mu)^2$$

1.2.1. Correlation of Two Random Variables¶

This section explores the relationship between two random variables, focusing on variance, covariance, and correlation.

Sample Variance

Measures the spread (dispersion) of a single variable around its mean.

$$S_x = \frac{1}{m-1} \sum\limits_{i=1}^{m}\left(x^{(i)}-\bar x\right)^2$$

Similarly, $S_y$ is the variance of the variable $y$

Sample Covariance

Measures the joint variability of two random variables $x$ and $y$

$$S_{xy} = \frac{1}{m-1} \sum\limits_{i=1}^{m}\left(x^{(i)}-\bar x\right)\left(y^{(i)}-\bar y \right)$$

Interpretation:
- $S_{xy} > 0$: $x$ and $y$ tend to increase together (positive correlation).
- $S_{xy} < 0$: $x$ and $y$ tend to move in opposite directions (negative correlation).
- $S_{xy} \approx 0$: No clear linear relationship.

Sample Covariance Matrix

The covariance matrix summarizes the variances and covariances:

$$S = \begin{bmatrix} S_x & S_{xy} \\ S_{yx} & S_y \end{bmatrix}$$

The diagonal entries are variances $(S_x, S_y)$.
The off-diagonal entries $(S_{xy}, S_{yx})$ represent the covariances.
Since $S_{xy} = S_{yx}$, the matrix is symmetric.

Sample Correlation Coefficient

Measures the strength and direction of the linear relationship between two variables:

$$r = \frac{S_{xy}}{ \sqrt {S_{xx}\cdot S_{yy}} }$$

Range: $-1 \leq r \leq 1$
Interpretation:
- $r = 1$: Perfect positive correlation
- $r = -1$: Perfect negative correlation
- $r = 0$: No linear correlation

Suppose

$$x_1 \leq x_2 \leq \cdots \leq x_n$$

$$y_1 \leq y_2 \leq \cdots \leq y_n$$

1.2.2. Correlation Coefficient¶

The correlation coefficient is a statistical measure that quantifies the strength and direction of the linear relationship between two variables.

Key Properties

Range of Values:

$$0 \leq \left\lvert \text{ correlation coefficient } \right\rvert \leq 1$$

Interpretation:
- $+1 \to \;$ close to a straight line (positive correlation)
- $-1 \to \;$ close to a straight line (negative correlation)
- $\;\;\;0 \to \;$ no linear line (uncorrelated)

Closeness to a line
- A value close to $+1$ or $-1$ implies the data points are tightly clustered around a straight line.
- The correlation does not indicate the slope of the line

Important Note on Causality

Correlation does NOT imply causation.
Even if two variables are strongly correlated, it doesn't necessarily mean one causes the other.

1.2.3. Correlation Coefficient Plot¶

A correlation coefficient plot is a visual tool that shows the strength and direction of correlation between pairs of variables. It's particularly useful in identifying patterns and relationships within a dataset.

seaborn

import seaborn as sns
import pandas as pd

m = 300
x = 2*np.random.randn(m)
y = np.random.randn(m)

xo = np.sort(x)
yo = np.sort(y)
yor = -np.sort(-y)

d = {'col. 1': x, 'col. 2': xo, 'col. 3': yo, 'col. 4': yor}
df = pd.DataFrame(data = d)

sns.pairplot(df, height = 1.5, aspect = 1)
plt.show()

2. Principal Component Analysis (PCA)¶

from IPython.display import YouTubeVideo
YouTubeVideo('-sM2dYlYSoI', width = "560", height = "315")

2.1. Exploring the Concept of 'Dimension' in Various Fields¶

The term dimension is widely used across different disciplines. Let's explore its meaning in several contexts:

(1) In Coordinate Systems

In geometry, coordinates like $x_1, x_2, x_3$ represent the axes in 3-dimensional space.

(2) In Dynamics (Generalized Coordinates)

In mechanical systems, a generalized coordinate is another example of a dimension.
For instance, in a four-bar linkage mechanism, the angle $\theta$ can serve as a generalized coordinate because knowing $\theta$ uniquely determines the entire configuration of the system.
Therefore, the four-bar linkage is a 1D system.

(3) In Machine Learning

The input dimension refers to the number of columns (features) in the input data $x$.
The output dimension refers to the number of columns in the output data $y$.

(4) In Image Processing

For an $8 \times 8$ pixel image, each pixel can have a distinct value (e.g., grayscale intensity or RGB values).
The total number of pixels (64 in this example) can be considered the dimension of the image.

2.2. Dimension Reduction¶

2.2.1. Two Examples of Dimension Reduction¶

Example 1

In this example, we have a system described by two original variables:

$$ \begin{align*} \dot u_1 &= -2(u_1 - u_2) \\ \dot u_2 &= \quad\; (u_1 - u_2) \end{align*} $$

These equations describe the dynamics of $u_1$ and $u_2$, which initially appear to be a 2D system.

We'll introduce new variables based on physical insights:

$$ \begin{align*} v_1 &= u_1 + 2u_2 \quad &\text{total amount of water}\\ v_2 &= u_1 - u_2 \quad &\text{difference in height} \end{align*} $$

By substituting the new variables into the original system:

$$ \begin{align*} \dot v_1 &= 0 \\ \dot v_2 &= -3 v_2 \end{align*} $$

The solution in $v_1$ and $v_2$

$$ \begin{align*} v_1(t) &= v_1(0) \\ v_2(t) &= v_2(0) e^{-3t} \end{align*} $$

Since $v_1(t)$ is constant, it doesn't evolve over time and effectively behaves like a fixed parameter rather than a dynamic variable. Consequently, the system's behavior is entirely governed by $v_2(t)$, the only variable that changes with time.

This is a great example of how dimension reduction (2D $\rightarrow$ 1D) can emerge naturally from analyzing the underlying physics and identifying conserved quantities.

Example 2

In this example, we have an $15 \times 15$ pixel image, which corresponds to 225 dimensions if each pixel is treated as an independent feature.

Key Insight: Why random noise looks meaningless

If each pixel is assigned a random value, the resulting image appears chaotic and lacks any meaningful structure.
This randomness illustrates that while the data exists in 225-dimensional space, not all points in this space represent meaningful images.

Why does this relate to dimension reduction?

Real-World Images are far from random. They follow specific patterns, structures, and correlations.
Consequently, meaningful images occupy only a small subspace within the full 225-dimensional space.
This observation opens the door for dimension reduction - mapping high-dimensional image data into a lower-dimensional subspace that retains meaningful features.

2.2.2. Motivation: Can High-Dimensional Data Be Described in a Simpler Way?¶

In many machine learning and data analysis tasks, datasets often contain numerous features, resulting in high-dimensional data. While this abundance of information can be beneficial, it introduces several challenges such as:

Increased computational complexity
Difficulties in visualizing and interpreting the data
Greater risk of overfitting in machine learning models

To address these issues, we explore dimensionality reduction - a technique designed to simplify complex data by transforming it into a lower dimensional representation while preserving as much meaningful information as possible.

Dimensionality reduction offers several important benefits:

(1) Insights into Low-Dimensional Structures:

Reducing data dimensions often reveals meaningful patterns and structures that may be obscured in high dimensional spaces.
For example, in visualization tasks, reducing data to 2D or 3D makes it easier to interpret and understand.

(2) Improved Generalization and Reduced Overfitting:

With fewer dimensions, the model complexity decreases, reducing the risk of overfitting and improving the model's ability to generalize well to unseen data.

(3) Increased Computational Efficiency:

Many machine learning algorithms scale poorly with increasing data dimensionality. Reducing the number of features significantly speeds up training and inference.

(4) Lower Storage Requirements:

Dimensionality reduction effectively compresses data by representing it with fewer variables, reducing memory and storage costs.

Dimensionality Reduction vs. Feature Selection

Feature Selection involves choosing a subset of existing features that best represent the data.
Dimensionality Reduction, on the other hand, is more like Feature Extraction, where new features are constructed from combinations of the original features.

While both approaches aim to simplify the data, dimensionality reduction actively transforms the original data into a lower dimensional form rather than simply discarding irrelevant features.

In essence, dimensionality reduction helps uncover the intrinsic structure of the data, improving both interpretability and computational efficiency - all while retaining critical information.

Redundancy in Data

Redundancy occurs when two or more features provide overlapping or duplicated information.
Highly correlated features often carry similar patterns, making some dimensions less informative.

(1) Low Redundancy

The data points are spread across both feature axes ($r_1$ and $r_2$) without a clear pattern.
Here, both features provide unique information, and discarding one feature would result in a considerable loss of information.

(2) Moderate Redundancy

(3) High Redundancy

The data points are tightly concentrated, meaning the two features are highly correlated.
Since the two features are strongly correlated, knowing the value of one feature ($r_1$ ) allows us to accurately predict the value of the other feature ($r_2$).
Therefore, the two features essentially contain the same information, and one dimension can be removed with minimal information loss.

Example of Dim Reduction in 2D

Consider a dataset distributed as shown in the following figure.

Each data point has two features $\{x_1,x_2\}$

(1) Ignoring feature $x_2$

Suppose we discard the feature $x_2$, and each example is reduced to

$$x = \{x_1\}$$

(2) Assessing information loss

The key question is: "Are we losing much information by throwing away $x_2$?"
Observing the data distribution reveals that the data points are highly concentrated along the $x_1$-axis, with very little spread along the $x_2$-axis.
This suggests that the data has minimal variance in the $x_2$ direction.

Now consider a dataset distributed as shown in the following figure.

In this case, the dataset is distributed across two features $\{x_1,x_2\}$

(1) The data exhibits significant variance along both axes ($x_1$ and $x_2$)

The spread of points suggests that both features contribute meaningful information

(2) Suppose we attempt to discard $x_2$ and reduce the data to:

$$x = \{x_1\}$$

(3) Significant information loss:

Since both features exhibit substantial variability, ignoring either $x_1$ or $x_2$ would discard important information about the data's structure.
Have to keep both features

While the dataset remains the same, the coordinate system can be changed. This transformation reveals key insights into information retention and loss.

(1) Observing the data in the new axes

After transforming the data into new coordinates ($u_1$ and $u_2$), the variance is now concentrated mostly along the $u_1$-axis.
The spread along the $u_2$-axis is minimal, indicating low variability in that direction.

(2) Discarding a dimension in the new axes

Suppose we discard $u_2$, reducing the data to a 1-dimensional representation:

$$x = \{ u_1 \}$$

Since the data's primary variability is along $u_1$, this reduction results in minimal information loss.

How to Find $u_1$

The key question is: How can we determine the optimal axis $u_1$ that maximizes variance and minimizes information loss?

This is precisely the objective of Principal Component Analysis (PCA).
PCA computes the principal components, which are the directions of maximum variance in the data.
The first principal component ($u_1$) is the direction along which the data has the greatest spread, ensuring the most informative projection.

Core Idea

Maximizing Variance: To identify the new axes (principal components) along which the data exhibits the most spread.
Minimizing Reconstruction Error: To minimize the squared error between the original data and its projection onto the new lower-dimensional space.

In essence, PCA identifies the optimal axis $u_1$ that retains the most important information about the data when reducing dimensions.

2.3. PCA Algorithm¶

In the previous section, we introduced the concept of dimensionality reduction and explored the core ideas behind it. Now, we will dive deeper into the mechanics of how this reduction is achieved - specifically through the Principal Component Analysis (PCA) algorithm.

The goal of PCA is to transform the data into a new coordinate system such that:

The first principal component (denoted as $ u_1 $) captures the maximum variance in the data.
Subsequent principal components ($ u_2, u_3, \cdots $) capture the remaining variance in decreasing order.

In this section, we will systematically study the PCA algorithm and its step-by-step procedure.

2.3.1. Pre-processing¶

Before applying Principal Component Analysis (PCA), it is crucial to preprocess the data to ensure that all features are appropriately scaled and centered. Proper preprocessing improves the accuracy and reliability of PCA results.

Step 1: Given Data

Suppose we are given a dataset consisting of $m$ examples, each with $n$ features:

$$ x^{(i)} = \begin{bmatrix} x^{(i)}_1 \\ \vdots \\x^{(i)}_n \end{bmatrix},\qquad X = \begin{bmatrix} \cdots & (x^{(1)})^T & \cdots\\ \cdots & (x^{(2)})^T & \cdots\\ & \vdots & \\ \cdots & (x^{(m)})^T & \cdots\\ \end{bmatrix}$$

Step 2: Shifting (Zero Mean)

To ensure the data is centered at zero (mean removal):

Compute the mean vector

$$\mu = \frac{1}{m}\sum\limits_{i=1}^{m}x^{(i)} $$

Shift each sample

$$x^{(i)} \leftarrow x^{(i)} - \mu \quad \text{(zero mean)}$$

Step 3: Rescaling (Unit Variance) [Optional]

Compute the variance of each feature

$$\sigma^2_j = \frac{1}{m-1}\sum_{i=1}^{m}\left(x_j^{(i)}\right)^2 $$

Rescale each feature

$$x^{(i)}_j \leftarrow \frac{x^{(i)}_j}{\sigma_j} $$

2.3.2. Maximize Variance¶

The goal of Principal Component Analysis (PCA) is to find the optimal direction (axis) that maximizes the variance in the data when projected onto that direction.

Step 1: Data Projection

Let the data be represented as $X = [x^{(1)}, x^{(2)}, \cdots, x^{(m)}]^T$.
Suppose we want to project the data onto a unit vector $ u_1 $.

Step 2: Maximizing the Variance of Projected Data

PCA identifies the direction $ u_1 $ such that the projected data captures the maximum variance.

$$\begin{align*} \text{variance of projected data} & = \frac{1}{m}\sum\limits_{i=1}^{m}\big(u^Tx^{(i)}\big)^2 = \frac{1}{m}\sum\limits_{i=1}^{m}\big( {x^{(i)}}^Tu\big)^2 \\ & = \frac{1}{m}\sum\limits_{i=1}^{m}\big( {x^{(i)}}^Tu\big)^T\big( {x^{(i)}}^Tu\big) = \frac{1}{m}\sum\limits_{i=1}^{m}u^Tx^{(i)}{x^{(i)}}^Tu \\ & = u^T\left(\frac{1}{m}\sum\limits_{i=1}^{m}x^{(i)}{x^{(i)}}^T \right) u \\ & =u^TSu \qquad (S=\frac{1}{m}X^T X: \text{sample covariance matrix}) \end{align*}$$

Step 3: Optimization Formulation

Maximizing the variance can now be written as the following optimization problem:

$$\begin{align*} \text{maximize} \quad & u^TSu \\ \text{subject to} \quad & u^Tu = 1\end{align*}$$

Step 4: Eigenvalue Analysis

From the properties of eigenvectors and eigenvalues:

$$\begin{align*} & u^TSu = u^T\lambda u = \lambda u^Tu = \lambda \quad (\text{Eigen analysis} : Su = \lambda u) \\ \\ & \implies \text{pick the largest eigenvalue } \lambda _1 \text{ of covariance matrix } S\\ & \implies u = u_1 \, \text{ is the } \,\lambda_1's \,\text{ corresponding eigenvector}\\ & \implies u_1 \text{ is the first principal component (direction of highest variance in the data)} \end{align*}$$

2.3.3. Minimize the Sum-of-Squared Error¶

The objective of PCA can also be framed as minimizing the sum-of-squared errors between the original data points and their projections.

Step 1: Derivation of Error Term

$$ \begin{align*} \lVert e^{(i)} \rVert ^2 & = \lVert x^{(i)} \rVert ^2 - \big( {x^{(i)}}^Tu \big)^2 \\ & = \lVert x^{(i)} \rVert ^2 - \big( {x^{(i)}}^Tu \big)^T\big( {x^{(i)}}^Tu \big) \\ & = \lVert x^{(i)} \rVert ^2 - u^Tx^{(i)}{x^{(i)}}^Tu\\ \end{align*} $$

Step 2: Minimizing the Mean Squared Error

$$\begin{align*} \frac {1}{m} \sum\limits_{i=1}^{m} \lVert e^{(i)} \rVert ^2 & = \frac {1}{m} \sum\limits_{i=1}^{m} \lVert x^{(i)} \rVert ^2 - \frac{1}{m}\sum\limits_{i=1}^{m}u^Tx^{(i)}{x^{(i)}}^Tu \\ & = \frac {1}{m} \sum\limits_{i=1}^{m} \lVert x^{(i)} \rVert ^2 - u^T\left(\frac{1}{m}\sum\limits_{i=1}^{m}x^{(i)}{x^{(i)}}^T\right)u \end{align*}$$

Step 3: Optimization Formulation

Minimizing the error is equivalent to maximizing the variance:

$$\begin{align*} &\text{minimize} \; \underbrace{\frac {1}{m} \sum\limits_{i=1}^{m} \lVert x^{(i)} \rVert ^2}_{\text{constant given $x_i$}} - \underbrace{u^T\left(\frac{1}{m}\sum\limits_{i=1}^{m}x^{(i)}{x^{(i)}}^T\right)u}_{\text{maximize}} \\ \\ \implies &\text{maximize} \; u^T\left(\frac{1}{m}\sum\limits_{i=1}^{m}x^{(i)}{x^{(i)}}^T\right)u = \max \; u^T S u\end{align*}$$

$$ \therefore \; \text{minimize} \; \text{error}^2 = \text{maximize} \; \text{variance}$$

2.3.4. Dimension Reduction Method ($n \rightarrow k$)¶

In PCA, we can reduce the dimensionality of the data while retaining as much meaningful information as possible by following these steps:

Step 1: Select Top $k$ Principal Components

Identify the top $k \; (\leq n)$ eigenvectors corresponding to the largest eigenvalues
Form a matrix $U = [u_1, u_2, \cdots, u_k]$ where the eigenvectors are orthonormal

Step 2: Project Each Data Point onto the New Subspace

Each data point $x^{(i)}$ is projected onto the span of the selected eigenvectors:

$$z^{(i)} = \begin{bmatrix} u_1^Tx^{(i)}\\ u_2^Tx^{(i)}\\ \vdots \\ u_k^Tx^{(i)}\\ \end{bmatrix} \;\text{ or }\; z = U^{T}x $$

In a matrix form

$$Z = XU$$

2.3.5. Linear Regression vs. PCA¶

Linear Regression

A supervised method focused on prediction.
Minimizes the vertical distances (residuals) between the observed $x_2$ values and the regression line.

PCA

An unsupervised method focused on data compression or dimensionality reduction.
Minimizes the orthogonal (perpendicular) distances from the data points to the principal component line.

2.4. PCA in Python¶

Data Generation

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# data generation
m = 1000
mu = np.array([0, 0])
sigma = np.array([[3, 1.5],
                  [1.5, 1]])

X = np.random.multivariate_normal(mu, sigma, m)
X = np.asmatrix(X)

print(X.shape)

fig = plt.figure(figsize = (6, 4))
plt.plot(X[:,0], X[:,1], 'b.', alpha = 0.3)
plt.grid(alpha = 0.3)
plt.xlim([-7, 7])
plt.ylim([-4, 4])
plt.show()

(1000, 2)

PCA Implementation

S = 1/(m-1)*X.T*X

D, U = np.linalg.eig(S)

idx = np.argsort(-D)
D = D[idx]
U = U[:,idx]

print(D, '\n')
print(U)

[3.61832577 0.20528425] 

[[ 0.87990038 -0.47515821]
 [ 0.47515821  0.87990038]]

Plot the principal component

h = U[1,0]/U[0,0]
xp = np.arange(-6, 6, 0.1)
yp = h*xp

fig = plt.figure(figsize = (6, 4))
plt.plot(X[:,0], X[:,1], 'b.', alpha = 0.3)
plt.plot(xp, yp, 'r', linewidth = 3)
plt.grid(alpha = 0.3)
plt.xlim([-7, 7])
plt.ylim([-4, 4])
plt.show()

Plot the histogram of data projected onto $u_1$

Z1 = X*U[:,0]

plt.figure(figsize = (6, 4))
plt.hist(Z1, 51)
plt.xlim([-7, 7])
plt.show()

Plot the histogram of data projected onto $u_2$

Z2 = X*U[:,1]

plt.figure(figsize = (6, 4))
plt.hist(Z2, 21)
plt.xlim([-7, 7])
plt.show()

PCA from sklearn.decomposition

from sklearn.decomposition import PCA

X = np.array(X)

pca = PCA(n_components = 1)
pca.fit(X)

PCA(n_components=1)

PCA(n_components=1)

Plot the histogram of data projected onto $u_1$

Z = pca.transform(X)

plt.figure(figsize = (6, 4))
plt.hist(Z, 51)
plt.xlim([-7, 7])
plt.show()

2.5. PCA Example: Mass-Spring Systsem¶

Consider a simple spring-mass system in which a ball of mass $m$ is attached to a massless, frictionless spring. The system exhibits oscillatory motion, which can ideally be described by a one-dimensional function of time representing the displacement.

Experimental Setup

However, we lack prior knowledge about the system's true dynamics and the optimal measurement direction.

We place three cameras (Camera A, Camera B, and Camera C) at arbitrary angles around the system to capture the ball's position in 3D space.
Each camera records 2D projections of the ball's position.
The measured coordinates from each camera are recorded in their respective arbitrary coordinate systems.
Additionally, external factors such as air resistance, camera imperfections, and potential friction introduce noise into the recorded data.

Challenge

The true underlying dynamics are defined solely by the ball's motion along the $x$-axis, yet:

We lack knowledge of the true $x$-axis in the system.
We have collected redundant and noisy data by measuring along arbitrary camera axes.

Objective

Our goal is to apply Principal Component Analysis (PCA) to:

Identify the most significant direction in the data that represents the system's primary motion.
Reduce the dimensionality of the dataset while retaining the core dynamics.
Extract a simplified, yet accurate, representation of the ball's motion - ideally corresponding to the true $x$-axis displacement.

Video recording from Camera A

from IPython.display import YouTubeVideo
YouTubeVideo('Pkit-64g0eU', width = "560", height = "315")

Video recording from Camera B

from IPython.display import YouTubeVideo
YouTubeVideo('x4lvjVjUUqg', width = "560", height = "315")

Video recording from Camera C

from IPython.display import YouTubeVideo
YouTubeVideo('2t62WkNIqxY', width = "560", height = "315")

Each camera provides two measurements - one for the horizontal position ($x$) and one for the vertical position ($y$) - relative to its own coordinate system. For each time instance $i$, the recorded data can be structured as follows:

$$ x^{(i)} = \begin{bmatrix} x \text{ in camera A} \\ y \text{ in camera A} \\ x \text{ in camera B} \\ y \text{ in camera B} \\ x \text{ in camera C} \\ y \text{ in camera C} \end{bmatrix} $$

This 6-dimensional vector $x^{(i)}$ represents the system's position as captured by all three cameras at time instance $i$.

To organize the entire dataset over $m$ time instances, we construct a data matrix $X$ as follows:

$$ X_{m \times 6} = \begin{bmatrix} \cdots & (x^{(1)})^T & \cdots \\ \cdots & (x^{(2)})^T & \cdots \\ & \vdots & \\ & \vdots & \\ \cdots & (x^{(m)})^T & \cdots \\ \end{bmatrix} $$

Where:

Each row represents the system's recorded position at a particular time instance.
Each column corresponds to one of the six measured coordinates (two from each camera).

We have collected the data for you. Here is the dataset that you can download for analysis.

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

import numpy as np

X = np.load('/content/drive/MyDrive/ML/ML_data/pca_spring.npy')
X = np.asmatrix(X.T)

m = X.shape[0]

print(X.shape)

(273, 6)

# Do not worry about the negative sign for the relative y-axis.
# it is simply a translation to plot the motion as observed by the camera.

plt.figure(figsize = (8, 4))
plt.subplot(1,3,1)
plt.plot(X[:, 0], -X[:, 1], 'r')
plt.axis('equal')
plt.title('Camera 1')

plt.subplot(1,3,2)
plt.plot(X[:, 2], -X[:, 3], 'b')
plt.axis('equal')
plt.title('Camera 2')

plt.subplot(1,3,3)
plt.plot(X[:, 4], -X[:, 5], 'k')
plt.axis('equal')
plt.title('Camera 3')

plt.show()

Apply PCA

X = X - np.mean(X, axis = 0)

S = 1/(m-1)*X.T*X

D, U = np.linalg.eig(S)

idx = np.argsort(-D)
D = D[idx]
U = U[:,idx]

print(D, '\n')
print(U)

[2.46033089e+04 3.22747042e+02 8.73851124e+01 8.19527660e+01
 3.19467195e+01 7.42861585e+00] 

[[ 0.36881064  0.62298194 -0.12571821 -0.42647348  0.52121775 -0.0807439 ]
 [ 0.35632379  0.57286174  0.132303    0.59881765 -0.40143215  0.08734045]
 [ 0.58419477 -0.22610057 -0.20325551 -0.47751523 -0.58153918  0.00857804]
 [ 0.08652315 -0.02671281  0.75692234 -0.14177391 -0.06010869 -0.62861422]
 [ 0.4159798  -0.29900638  0.49374948  0.05637591  0.32442517  0.62075559]
 [-0.46389987  0.37746931  0.32963322 -0.45633202 -0.34660023  0.45308403]]

Plot Eigenvalues

plt.figure(figsize = (6, 4))
plt.stem(np.sqrt(D))
plt.grid(alpha = 0.3)
plt.show()

Projection to the new Eigenvectors

$$Z = XU$$

# relative magnitudes of the principal components

Z = X*U
xp = np.arange(0, m)/24    # 24 frame rate

plt.figure(figsize = (6, 4))
plt.plot(xp, Z)
plt.grid(alpha = 0.3)
plt.show()

Projection to the Principal Component

## projected onto the first principal component
# 6 dim -> 1 dim (dim reduction)
# relative magnitude of the first principal component

plt.figure(figsize = (6, 4))
plt.plot(Z[:,0])
plt.grid(alpha = 0.3)
plt.show()

Discussion:

The first principal component effectively captured the system's oscillatory behavior, corresponding to the ideal spring-mass motion.
PCA successfully reduced the dataset from six dimensions to one while preserving the system's key dynamics.
PCA filtered out minor variations caused by noise, improving data clarity.
It is not the magnitudes of the measurements that are physically meaningful, but rather the relative magnitudes and patterns within the data.
By strategically combining multiple less-reliable measurements, PCA can uncover the true underlying structure of the system, providing a cost-effective alternative to expensive equipment.

3. Singular Value Decomposition (SVD)¶

from IPython.display import YouTubeVideo
YouTubeVideo('sFym4MYOQ-o', width = "560", height = "315")

Geometry of Linear Maps

In the context of linear algebra, the matrix $ A \in \mathbb{R}^{m \times n} $ represents a linear transformation that maps vectors from an $ n $-dimensional space to an $ m $-dimensional space.

The matrix $ A $ can be interpreted as performing a combination of the following transformations:

(1) Rotation:

$ A $ can rotate vectors, changing their direction without altering their length. This preserves angles between vectors.

(2) Scaling (Stretching/Compression):

$ A $ can stretch or compress vectors along certain directions. This changes the vector's magnitude.

(3) Shearing (Skewing):

$ A $ can skew the original space, distorting shapes without strict rotation or scaling.

Mapping the Unit Ball to an Ellipsoid

The unit ball in $ \mathbb{R}^n $, defined as:

$$S = \{ x \in \mathbb{R}^n \mid \|x\| \leq 1 \}$$

is transformed by $ A $ into an ellipsoid in $ \mathbb{R}^m $:

$$AS = \{ Ax \mid x \in S \}$$

This geometric interpretation highlights that linear transformations distort space by rotating, scaling or skewing.

3.1. SVD Algorithm¶

Singular Values and Singular Vectors

The Singular Value Decomposition (SVD) provides a structured way to describe this transformation. The key elements of the SVD are:

the numbers $\sigma_1, \cdots, \sigma_n$ are called the singular values of $A$ by convention, $\sigma_i > 0$
the vectors $u_1, \cdots, u_n$ are unit vectors along the principal semiaxes of $AS$
the unit vectors $v_1, \cdots, v_n$ are the preimages of the principal semiaxes

These satisfy the following relationship:

$$A v_i = \sigma_i u_i$$

Matrix Formulation of SVD

By combining these vectors and values into matrices:

$$\begin{align*} A\begin{bmatrix} v_1& v_2 & \cdots & v_r \end{bmatrix} & =\begin{bmatrix} A v_1 & A v_2 & \cdots & A v_r \end{bmatrix} \\\\ & =\begin{bmatrix} \sigma_1u_1& \sigma_2u_2 & \cdots & \sigma_ru_r \end{bmatrix} \\\\ & =\begin{bmatrix} u_1& u_2 & \cdots & u_r \end{bmatrix}\begin{bmatrix} \sigma_1& & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_r \end{bmatrix} \end{align*}$$

This leads to the compact matrix form:

$$A V = U \Sigma$$

Where:

$U$: Orthogonal matrix containing left singular vectors
$V$: Orthogonal matrix containing right singular vectors
$\Sigma$: Diagonal matrix containing singular values

Thin Singular Value Decomposition

In the case where $A \in \mathbb{R}^{m \times n}$ is skinny (more rows than columns) and full rank (i.e., $r = n$), we define:

$$A v_i = \sigma_i u_i \qquad \text{for } 1 \leq i \leq n$$

Let:

$$ \begin{align*} \hat U &= \begin{bmatrix}u_1 & u_2 & \cdots & u_n \end{bmatrix},\quad \quad \hat \Sigma &= \begin{bmatrix}\sigma_1& & & \\ & \sigma_2 &&\\ &&\ddots&\\ &&&\sigma_n \end{bmatrix}, \quad \quad V &= \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix} \end{align*} $$

In matrix form,

$$AV = \hat U \hat \Sigma$$

Since $V$ is orthogona, we obtain the thin (or reduced) SVD of $A$

$$ A = \hat U \hat \Sigma V^T $$

Full Singular Value Decomposition

In addition to the thin SVD, we can extend the decomposition to form the full SVD by adding extra components to ensure that the resulting matrices are fully orthogonal.

Suppose $A \in \mathbb{R}^{m \times n}$ with $m \geq n$.

To construct the full SVD:

(1) Extend $U$:

$\quad \,$Add extra orthonormal columns to $\hat{U}$ to create a complete orthogonal matrix $U$.

$$U = \begin{bmatrix} u_1 & u_2 & \cdots & u_n & u_{n+1}& \cdots & u_m \end{bmatrix}$$

$\quad \,$Since $U$ is now an orthogonal matrix, $UU^T = I_m$.

(2) Extend $\Sigma$:

$\quad \,$Add extra rows of zeros to the diagonal matrix $\hat{\Sigma}$, ensuring the dimensions align with $U$.

$$ \Sigma = \begin{bmatrix} \sigma_1 & & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_n \\ & & & 0 \\ & & & \vdots \\ & & & 0 \end{bmatrix} $$

(3) Combine in Matrix Form:

$\quad \,$Using these augmented matrices, the full SVD is defined as:

$$A = U \Sigma V^T$$

Where:

$U \in \mathbb{R}^{m \times m}$: an orthogonal matrix.
$\Sigma \in \mathbb{R}^{m \times n}$: a diagonal matrix with singular values.
$V \in \mathbb{R}^{n \times n}$: an orthogonal matrix.

This is the (full) singular value decomposition of $A$

Every matrix $A$ has a singular value decomposition.
The full SVD extends the thin SVD by adding orthogonal components to $U$ and zero rows in $\Sigma$/

Interpretation of SVD

The Singular Value Decomposition (SVD) provides an insightful way to understand a linear transformation $Ax$ by breaking it down into three distinct steps:

Step 1: Rotation by $V^T$

The first transformation is a rotation of the input vector using $V^T$.
This step aligns the data with the principal directions in the row space of the matrix.

Step 2: Diagonal Scaling by $\Sigma$

The second transformation applies scaling by the diagonal matrix $\Sigma$.
Each coordinate is stretched or compressed by the corresponding singular value $\sigma_i$.
These singular values represent the magnitude of the transformation in the corresponding direction.

Step 3: Rotation by $U$

The final transformation is another rotation by the orthogonal matrix $U$.
This step aligns the scaled data with the principal directions in the column space.

SVD factorizes the matrix $ A $ into three distinct components: an orthogonal matrix $ U $, a diagonal matrix $ \Sigma $ containing singular values, and another orthogonal matrix $ V^T $. This decomposition effectively represents $ A $ as a sequence of three linear transformations: a rotation by $ V^T $, followed by a scaling by $ \Sigma $, and then another rotation by $ U $.

3.2. PCA and SVD¶

(1) Any real symmetric and positive definite matrix $B \in \mathbb{R}^{n \times n}$ can be decomposed into its eigen decomposition as follows:

$$ \begin{align*} Bs_1 &= \lambda_1 s_1 \\ Bs_2 &= \lambda_2 s_2 \\ &\vdots\\ Bs_n &= \lambda_n s_n \\\\\\ B\begin{bmatrix}s_1 & s_2 & \cdots & s_n\end{bmatrix} &= \begin{bmatrix}s_1 & s_2 & \cdots & s_n\end{bmatrix}\begin{bmatrix}\lambda_1 & & & \\ & \lambda_2 & & \\ &&\ddots &\\ &&& \lambda_n\end{bmatrix}\\\\\\ B &= S\Lambda S^{-1} = S\Lambda S^T \end{align*} $$

where:

$S$ is an orthogonal matrix containing eigenvectors of $B$ ($\rightarrow SS^T = I$).
$\Lambda$ is a diagonal matrix with corresponding eigenvalues.

(2) A real matrix $A \in \mathbb{R}^{m \times n}$ (when $m>n$) has a singular value decomposition (SVD) given by:

$$A = U\Sigma V^T$$

where:

$U$ is an orthogonal matrix of size $m \times m$.
$V$ is an orthogonal matrix of size $n \times n$.
$\Sigma$ is a diagonal matrix of singular values (size $m \times n$).

4.2.1. Connecting PCA with SVD¶

From the matrix $A$ (assuming $A$ is skinny and full rank), we can construct two positive-definite symmetric matrices:

$$AA^T \quad \text{and} \quad A^TA$$

For $AA^T \in \mathbb{R}^{m \times m}$, $U$ is the matrix of eigenvectors of $AA^T$

$$AA^T = U\Sigma V^T V \Sigma^T U^T = U \Sigma \Sigma^T U^T$$

For $A^TA \in \mathbb{R}^{n \times n}$, $V$ is the matrix of eigenvectors of $A^TA$

$$A^TA = V \Sigma^T U^T U\Sigma V^T = V \Sigma^T \Sigma V^T$$

PCA via SVD

Recall that PCA relies on the covariance matrix:

$$S = \frac{1}{m}X^TX$$

By scaling the data matrix $A = \frac{X}{\sqrt m}$, the covariance matrix becomes:

$$S = A^T A$$

Using the SVD result:

$$S = A^TA = V \Sigma^T \Sigma V^T = V \Lambda V^T$$

$V$ is the matrix of PCA principal components (eigenvectors of $S = A^TA$)
$\sigma_i^2 = \lambda_i$ (singular values squared are the eigenvalues in PCA)

Summary

The eigenvectors of $A^TA$ (from the SVD) provide the principal components in PCA.
The squared singular values of $A$ correspond to the eigenvalues in PCA.

4.2.2. Efficient Computation of SVD¶

We have established the relationship between PCA and SVD. Suppose $A \in \mathbb{R}^{m \times n}$ where $m \gg n$.

To compute the SVD:

$U$: The left singular vectors of $A$
$V$: The right singular vectors of $A$
$\Sigma$: Diagonal matrix of singular values, $\sigma_i = \sqrt{\lambda_i}$

Key Observations:

The columns of $U$ are the eigenvectors of $AA^T \in \mathbb{R}^{m \times m}$.
The columns of $V$ are the eigenvectors of $A^T A \in \mathbb{R}^{n \times n}$.
The diagonal elements of $\Sigma$ are the singular values, $\sigma_i = \sqrt{\lambda_i}$.

Computational Challenge

If $n$ is small, computing $V$ via np.linalg.eig(A^T*A) is feasible since $A^T A \in \mathbb{R}^{n \times n}$. However, computing $U$ directly from np.linalg.eig(A*A^T) is inefficient because:

$AA^T \in \mathbb{R}^{m \times m}$
Since $m \gg n$, this matrix is much larger than $A^T A$, making it computationally expensive.

Efficient Solution

Instead of computing $U$ directly via np.linalg.eig(A*A^T), we can leverage the relationship between $U$ and $V$:

$$u_i = \frac{1}{\sqrt{\lambda_i}} A v_i$$

Where:

$v_i$ is the $i$-th eigenvector of $A^T A$
$\lambda_i$ is the corresponding eigenvalue
$u_i$ is the $i$-th left singular vector of $A$

Why Does This Work?

Starting from the known relationship:

$$A^T A v_i = \lambda_i v_i$$

By multiplying both sides by $A$,

$$A A^T (A v_i) = \lambda_i (A v_i)$$

This confirms that $A v_i$ is an eigenvector of $AA^T$ with eigenvalue $\lambda_i$.

However, this vector is not yet normalized. To normalize it:

$$u_i = \frac{1}{\sqrt{\lambda_i}} A v_i$$

This ensures that $u_i$ has unit length.

This method efficiently computes the left singular vectors $U$ without directly forming the large matrix $AA^T$, making it ideal when $m \gg n$.

Computing $V$ via $\text{eig}(A^T A)$ is efficient since $n$ is relatively small.
Instead of computing $U$ directly via $\text{eig}(AA^T)$, we use the relation $u_i = \frac{1}{\sqrt{\lambda_i}} A v_i$, which is computationally faster and avoids large matrix operations.

Example

Consider the following matrix:

$$A = \begin{bmatrix} 4 & 4 \\ -3 & 3 \end{bmatrix}$$

From the decomposition:

$$ \begin{aligned} A v_1 &= \sigma_1 u_1 \\ A v_2 &= \sigma_2 u_2 \end{aligned} $$

Combining these,

$$ AV = U\Sigma \quad \implies \quad A = U \Sigma V^T $$

To determine $V$ through eigen-analysis

$$A^TA = V\Sigma^TU^TU\Sigma V^T = V\Sigma^T\Sigma V^T = V\begin{bmatrix} \sigma_1^2& & \\ & \sigma_2^2 & \\ & & \ddots \\ \end{bmatrix}V^T$$

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

A = np.array([[4, 4],
              [-3, 3]])

A = np.asmatrix(A)

[D, V] = np.linalg.eig(A.T*A)

print(V, '\n')
print(D)

[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]] 

[32. 18.]

To find $U$ from eigen-analysis

$$AA^T = U\Sigma V^TV\Sigma^TU^T = U\Sigma\Sigma^TU^T = U\begin{bmatrix} \sigma_1^2& & \\ & \sigma_2^2 & \\ & & \ddots \\ \end{bmatrix}U^T$$

[D, U] = np.linalg.eig(A*A.T)

print(U, '\n')
print(D)

[[1. 0.]
 [0. 1.]] 

[32. 18.]

Instead of computing $U$ directly via np.linalg.eig(A*A^T), we can leverage the relationship between $U$ and $V$:

$$u_i = \frac{1}{\sqrt{\lambda_i}}Av_i$$

sigma = np.sqrt(D)
Sigma = np.diag(sigma)
U = A @ V @ np.linalg.inv(Sigma)

print(U)

[[1. 0.]
 [0. 1.]]

However, how to compute the SVD numerically is not the focus of this course. Instead, we will use Python's built-in np.linalg.svd() function to compute the SVD decomposition directly. By doing so, we can concentrate on understanding the underlying theory and practical applications of SVD.

The function np.linalg.svd() performs SVD as follows:

[U, S, VT] = np.linalg.svd(A, full_matrices = True)

print(U.shape, S.shape, VT.shape, '\n')
print(U, '\n')
print(S, '\n')
print(VT)

(2, 2) (2,) (2, 2) 

[[-1.  0.]
 [ 0.  1.]] 

[5.65685425 4.24264069] 

[[-0.70710678 -0.70710678]
 [-0.70710678  0.70710678]]

MIT 18.06 Linear Algebra by Prof. Gilbert Strang

from IPython.display import YouTubeVideo
YouTubeVideo('Nx0lRBaXoz4', width = "560", height = "315")

3.3. Low Rank Approximation: Dimension Reduction¶

3.3.1. Low Rank Approximation¶

In the 2D case, consider a vector $x$ that can be expressed in terms of the basis vectors $v_1$ and $v_2$ (where $\hat v$ denotes a unit vector, though this notation is omitted here for simplicity):

$$x = c_1 v_1 + c_2 v_2$$

Equivalently,

$$x = \langle x, v_1 \rangle v_1 + \langle x, v_2 \rangle v_2$$

Since inner products can be rewritten in matrix notation,

$$x = ( v_1^{T} x) v_1 + ( v_2^T x) v_2$$

SVD Transformation

Next, consider the effect of matrix $A$ on the vector $x$:

$$Ax = c_1 A v_1 + c_2 A v_2$$

By the SVD property,

$$A v_1 = \sigma_1 u_1 \quad \text{and} \quad A v_2 = \sigma_2 u_2$$

Thus,

$$Ax = c_1 \sigma_1 u_1 + c_2 \sigma_2 u_2$$

Equivalently,

$$ \begin{align*} Ax &= u_1 \sigma_1 v_1^T x + u_2 \sigma_2 v_2^T x \\ & = \begin{bmatrix} u_1 & u_2 \end{bmatrix}\begin{bmatrix}\sigma_1 & 0\\ 0 & \sigma_2 \end{bmatrix}\begin{bmatrix} v_1^T \\ v_2^T \end{bmatrix} x\\ & = \begin{bmatrix} u_1 & u_2 \end{bmatrix}\begin{bmatrix}\sigma_1 & 0\\ 0 & \sigma_2 \end{bmatrix}\begin{bmatrix} v_1 & v_2 \end{bmatrix}^T x \end{align*} $$

In matrix form,

$$Ax = U \Sigma V^T x$$

Low-Rank Approximation in General

In practice, if one singular value dominates the others ($\sigma_1 \gg \sigma_2$), we can approximate $A$ with a rank-1 approximation:

$$Ax = u_1 \sigma_1 v_1^T x + u_2 \sigma_2 v_2^T x \approx u_1 \sigma_1 v_1^T x$$

General Low-Rank Approximation of Matrix $A$

For a general matrix $A$, the SVD decomposition allows for constructing a low-rank approximation by retaining only the top $k$ singular values:

Full-rank representation:

$$ A = U_r \Sigma_r V_r^T = \sum_{i=1}^{r} \sigma_i u_i v_i^T $$

Rank-$k$ approximation (with $k \leq r$):

$$ \tilde{A} = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T $$

Example: Matrix

A = np.array([[11.08, 6.82, 1.76, -6.82],
              [2.50, -1.01, -2.60, 1.19],
              [-4.88, -5.07, -3.21, 5.20],
              [-0.49, 1.52, 2.07, -1.66],
              [-14.04, -12.40, -6.66, 12.65],
              [0.27, -8.51, -10.19, 9.15],
              [9.53, -9.84, -17.00, 11.00],
              [-12.01, 3.64, 11.10, -4.48]])

[U, S, VT] = np.linalg.svd(A, full_matrices = True)

U = np.asmatrix(U)
S = np.asmatrix(np.diag(S))
VT = np.asmatrix(VT)

print(U.shape, S.shape, V.shape, '\n')
print(U, '\n')
print(S, '\n')
print(VT)

(8, 8) (4, 4) (2, 2) 

[[-0.24642369  0.44536234  0.6205749   0.32739909  0.45915182  0.05410587
  -0.18650735  0.00956801]
 [ 0.07432183  0.10750102  0.28337342 -0.77803389 -0.10352762  0.32547077
  -0.42245245  0.04655432]
 [ 0.21369502 -0.1903207   0.49492174  0.11043549 -0.4736735  -0.61498898
  -0.24132713 -0.01233476]
 [-0.08223154 -0.02472167  0.20057294  0.06152361 -0.27047805  0.29819169
   0.20131309 -0.86371786]
 [ 0.50375686 -0.55378628  0.13739629 -0.02351069  0.61002316  0.01682912
  -0.07593043 -0.20479838]
 [ 0.43719294  0.03496575 -0.05496539  0.50172005 -0.29664823  0.55199001
  -0.3572128   0.18055856]
 [ 0.59019451  0.42657675  0.21078205 -0.13667122 -0.03298713 -0.00394525
   0.6243326   0.1252985 ]
 [-0.2967926  -0.51321848  0.42787997  0.0231769  -0.14042827  0.34499254
   0.40598717  0.40166774]] 

[[3.68258450e+01 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 2.62368811e+01 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 2.20213307e-02 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 5.10429562e-03]] 

[[-0.03564832 -0.5381828  -0.61426544  0.5759917 ]
 [ 0.92076022  0.16617467 -0.32597333 -0.13538086]
 [-0.13920696 -0.49222525 -0.31155848 -0.80079151]
 [-0.36269992  0.66367127 -0.6475729  -0.09294376]]

[U, S, VT] = np.linalg.svd(A, full_matrices = False)

U = np.asmatrix(U)
S = np.asmatrix(np.diag(S))
VT = np.asmatrix(VT)

print(U.shape, S.shape, V.shape, '\n')
print(U, '\n')
print(S, '\n')
print(VT)

(8, 4) (4, 4) (2, 2) 

[[-0.24642369  0.44536234  0.6205749   0.32739909]
 [ 0.07432183  0.10750102  0.28337342 -0.77803389]
 [ 0.21369502 -0.1903207   0.49492174  0.11043549]
 [-0.08223154 -0.02472167  0.20057294  0.06152361]
 [ 0.50375686 -0.55378628  0.13739629 -0.02351069]
 [ 0.43719294  0.03496575 -0.05496539  0.50172005]
 [ 0.59019451  0.42657675  0.21078205 -0.13667122]
 [-0.2967926  -0.51321848  0.42787997  0.0231769 ]] 

[[3.68258450e+01 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 2.62368811e+01 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 2.20213307e-02 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 5.10429562e-03]] 

[[-0.03564832 -0.5381828  -0.61426544  0.5759917 ]
 [ 0.92076022  0.16617467 -0.32597333 -0.13538086]
 [-0.13920696 -0.49222525 -0.31155848 -0.80079151]
 [-0.36269992  0.66367127 -0.6475729  -0.09294376]]

Question

How to determine a good $k$ so that we can efficiently compress $A$ without losing too much information of $A$?

sig_val = np.diag(S)

plt.figure(figsize = (6, 4))
plt.stem(range(1,5), sig_val)
plt.xticks(range(1,5))
plt.title('Singular Values')
plt.show()

When singular values decay rapidly, retaining only the top $k$ singular values can provide an accurate yet efficient approximation.

k = 2

low_U = U[:, :k]
low_S = S[:k, :k]
low_VT = VT[:k, :]

A_approx = low_U*low_S*low_VT

print(low_U.shape, low_S.shape, low_VT.shape, '\n')
print('A:\n', A ,'\n')
print('A_approx:\n', A_approx, '\n')
print('low_U:\n', low_U, '\n')
print('low_VT:\n', low_VT)

(8, 2) (2, 2) (2, 4) 

A:
 [[ 11.08   6.82   1.76  -6.82]
 [  2.5   -1.01  -2.6    1.19]
 [ -4.88  -5.07  -3.21   5.2 ]
 [ -0.49   1.52   2.07  -1.66]
 [-14.04 -12.4   -6.66  12.65]
 [  0.27  -8.51 -10.19   9.15]
 [  9.53  -9.84 -17.    11.  ]
 [-12.01   3.64  11.1   -4.48]] 

A_approx:
 [[ 11.08250851   6.8256176    1.76533991  -6.80890115]
 [  2.49942829  -1.00429274  -2.60062751   1.19462804]
 [ -4.87827835  -5.06500943  -3.20623934   5.20878009]
 [ -0.48927124   1.52196569   2.07157948  -1.65643381]
 [-14.03962233 -12.39843105  -6.65913505  12.65241176]
 [  0.27076035  -8.51229541 -10.18871873   9.14926874]
 [  9.53039313  -9.83725225 -16.99900559  11.0036522 ]
 [-12.00864542   3.64455947  11.10301226  -4.47244356]] 

low_U:
 [[-0.24642369  0.44536234]
 [ 0.07432183  0.10750102]
 [ 0.21369502 -0.1903207 ]
 [-0.08223154 -0.02472167]
 [ 0.50375686 -0.55378628]
 [ 0.43719294  0.03496575]
 [ 0.59019451  0.42657675]
 [-0.2967926  -0.51321848]] 

low_VT:
 [[-0.03564832 -0.5381828  -0.61426544  0.5759917 ]
 [ 0.92076022  0.16617467 -0.32597333 -0.13538086]]

3.3.2. Visual Illustration¶

Full SVD

Thin (or Economy) SVD

Truncated SVD (Low Rank Approximation)

$$Y \approx U_r \Sigma_r V_r^T$$

First term

$$Y_1 \approx u_1 \sigma_1 v_1^T$$

Second term

$$Y_2 \approx u_2 \sigma_2 v_2^T$$

Third term

$$Y_3 \approx u_3 \sigma_3 v_3^T$$

Best approximation with a finite $r$ decompositions

$$Y \approx Y_1 + Y_2 + Y_3 + \cdots = u_1 \sigma_1 v_1^T + u_2 \sigma_2 v_2^T + u_3 \sigma_3 v_3^T + \cdots = \sum_{j = 1}^{r} u_j \sigma_j v_j^T$$

3.3.3. Example: Image Approximation¶

Image as a Matrix

An image can be represented as a matrix $A$, where each entry corresponds to the intensity of a pixel for grayscale images. By applying SVD, we decompose the image matrix as follows:

$$A = U\Sigma V^T$$

Image Approximation with Rank-$k$ Decomposition

Instead of retaining the full rank of $A$, we can approximate it using only the top $k$ singular values:

$$\tilde A = U_k \Sigma_k V_k^T = \sum_{i=1}^{k}\sigma_i u_i v_i^T \qquad (k \leq r)$$

Downlaod the image file

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

import cv2

img = cv2.imread('/content/drive/MyDrive/ML/ML_data/kaist_me.jpg', 0)
print(img.shape)

(1459, 2593)

plt.figure(figsize = (6, 6))
plt.imshow(img, 'gray')
plt.axis('off')
plt.show()

SVD

A = img

[U, S, VT] = np.linalg.svd(A, full_matrices = False)

U = np.asmatrix(U)
S = np.asmatrix(np.diag(S))
VT = np.asmatrix(VT)

print(U.shape, S.shape, VT.shape, '\n')

(1459, 1459) (1459, 1459) (1459, 2593)

Plot Singular Values

sig_val = np.diag(S)

plt.figure(figsize = (6, 4))
plt.stem(sig_val)
plt.title('Singular Values')
plt.show()

Approximated image when $k = 20$

k = 20

low_U = U[:, :k]
low_S = S[:k, :k]
low_VT = VT[:k, :]

A_hat = low_U*low_S*low_VT

plt.figure(figsize = (8, 5))
plt.subplot(1,2,1)
plt.imshow(A, 'gray')
plt.axis('off')
plt.title('Original image')

plt.subplot(1,2,2)
plt.imshow(A_hat, 'gray')
plt.axis('off')
plt.title('Approximated image w/ rank = 20')

plt.tight_layout()
plt.show()

Approximated images with $k$ varied

K = [1, 2, 3, 4, 10, 20, 30, 100]

plt.figure(figsize = (10, 5))

for i in range(len(K)):
    low_U = U[:, :K[i]]
    low_S = S[:K[i], :K[i]]
    low_VT = VT[:K[i], :]

    A_hat = low_U*low_S*low_VT

    plt.subplot(2,4,i+1), plt.imshow(A_hat, 'gray'), plt.axis('off')
    plt.title('rank {}'.format(K[i]))

plt.tight_layout()
plt.show()

Discussion

Each image is reconstructed using a progressively higher rank approximation, capturing increasing levels of detail.

(1) Rank 1 Image:

Captures only the most dominant pattern or trend in the image - primarily the strongest intensity gradients (e.g., horizontal patterns or key contrast areas).
This is because the first singular value and its corresponding singular vectors dominate the overall structure.

(2) Rank 2 to Rank 4 Images:

As the rank increases, additional singular values and vectors are included, allowing finer details like outlines and texture to emerge.

(3) Rank 10 to Rank 30 Images:

The image progressively becomes clearer, with sharper details and improved accuracy.

(4) Rank 100 Image:

The reconstruction is almost identical to the original image, indicating that a significant portion of the visual information is retained.

Image Compression Using Low-Rank Approximation with SVD

Suppose a satellite captures an image that needs to be transmitted to the ground station. Given the constraints of limited communication bandwidth, it becomes crucial to minimize the amount of transmitted data without significant loss of image quality. In this scenario, low-rank approximation using SVD offers an effective solution.

Instead of transmitting the entire matrix $A$, we only transmit:

The first $k$ columns of $U$
The first $k$ singular values in $\Sigma$
The first $k$ rows of $V^T$

The transmitted data is significantly smaller:

Original data size: $m \times n$
Compressed data size: $k (m + 1 + n)$

3.3.4. Example: Extracting Background from Sequential Images¶

This example demonstrates how SVD can effectively isolate the background from a series of video frames by leveraging low-rank approximation. The objective is to remove moving objects (people) from the images while preserving the static background.

The dataset consists of 20 sequential images (or frames) captured from a video showing people walking. These images contain both dynamic (moving) elements and a static background. By applying SVD, we aim to identify and extract the dominant structure - the background - while minimizing the impact of moving objects.

Example Source: Medium article
Dataset: Download and unzip data

Step 1: Display the Images

A video of people walking, captured by a surveillance camera

from IPython.display import YouTubeVideo
YouTubeVideo('-5R2yvPsbmo', width = "560", height = "315")

Capture each frame as an image

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

import numpy as np
import cv2
import matplotlib.pyplot as plt
%matplotlib inline

# Plot the first 'capimage_0.jpg' image in grayscale.

data_path = '/content/drive/MyDrive/ML/ML_data'
img = cv2.imread(data_path + '/capimage_0.jpg', 0)

plt.figure(figsize = (6, 4))
plt.imshow(img, 'gray')
plt.axis('off')
plt.show()

print(img.shape)

(338, 600)

# Plot the sequential 6 images in grayscale.

plt.figure(figsize = (8, 4))

for i in range(6):
  img_name = ['capimage_' + str(i) + '.jpg']
  img = cv2.imread(data_path + '/' + img_name[0], 0)
  plt.subplot(2, 3, i+1)
  plt.imshow(img, 'gray')
  plt.title(img_name[0], fontsize = 12)
  plt.axis('off')

plt.tight_layout()
plt.show()

Step 2: Construct Matrix $A$

In the previous example, an image was represented as a matrix $A$, allowing us to apply SVD for low-rank approximation. However, in this example, we are dealing with 20 images, each represented as its own matrix. This introduces a new challenge: how to efficiently apply SVD to multiple image matrices simultaneously.

The brilliant (yet conceptually straightforward) idea is to flatten each image into a column vector and then horizontally stack these vectors to construct a new matrix $A$. This consolidated matrix enables us to apply SVD efficiently.

At first glance, you might hesitate since flattening seemingly disrupts the inherent spatial structure of the images. Surprisingly, however, it works well!

Load all 20 grayscale images.
Reshape each image into a column vector of shape $(\text{row} \times \text{col},1)$.
Horizontally stack these vectors to form matrix $A$, resulting in a shape of:

$$A \in \mathbb{R}^{(\text{row} \times \text{col})\times 20}$$

flat_img = []

for i in range(20):
  img = cv2.imread(data_path + '/capimage_' + str(i) + '.jpg', 0)
  flat_img = img.reshape(-1, 1)
  if i == 0:
    A = flat_img
  else:
    A = np.hstack((A, flat_img))

row = img.shape[0]
col = img.shape[1]

print(row)
print(col)
print(A.shape)

338
600
(202800, 20)

Step 3: Compute SVD and Visualize Sigular Values

U, S, VT = np.linalg.svd(A, full_matrices = False)
U = np.asmatrix(U)
S = np.asmatrix(np.diag(S))
VT = np.asmatrix(VT)

print(U.shape, S.shape, VT.shape, '\n')

plt.figure(figsize = (6, 4))
plt.stem(np.diag(S))
plt.title('Singular Values')
plt.show()

(202800, 20) (20, 20) (20, 20)

Step 4: Low-Rank Approximation with $k = 1$

Using only the first singular value, reconstruct the background using the formula:

$$\hat{A} = U_k \Sigma_k V_k^T = \sigma_1 u_1 v_1^T$$

Extract the first column of $\hat A$, which corresponds to the reconstructed background of the first frame.
Reshape it back to its original image dimensions and display the result.

# Plot A_hat[:,0] with a reshape of (row, col)

A_hat = U[:,0] * S[0,0] * VT[0,:]

plt.figure(figsize = (6, 8))
plt.imshow(A_hat[:, 0].reshape(row, col), 'gray')
plt.axis('off')
plt.show()

Discussion

The first singular component captures the most dominant pattern across all frames - in this case, the stationary background.
Moving objects (like people walking) are treated as noise or high-frequency variations and are effectively removed by the low-rank approximation.
This is an impressive example. I personally had never envisioned such an innovative approach to effectively remove moving objects from images.

3.4. Eigenfaces: A SVD-Based Facial Recognition Technique¶

Eigenfaces is a well-known technique in computer vision that applies PCA or SVD for facial recognition. The method is based on the idea that faces can be represented as combinations of key facial features, called eigenfaces. By leveraging SVD, Eigenfaces effectively reduces the dimensionality of face images while retaining the most important features for identification.

$$ \tilde{x} = \sum_{i=1}^{k} \langle x,\hat{u}_i\rangle \,\hat{u}_i $$

Concept of Eigenfaces

The key idea behind Eigenfaces is that any face can be represented as a linear combination of a set of “basis” faces - the eigenfaces - which capture the most dominant features in the dataset.

Given a set of face images:

Each face is reshaped into a column vector.
These column vectors are stacked horizontally to form a matrix $A$.

$$ A = \begin{bmatrix} | & | & \cdots & | \\ x_1 & x_2 & \cdots & x_n \\ | & | & \cdots & | \end{bmatrix} $$

Where:

Each $x_i$ is a reshaped image vector.
$A \in \mathbb{R}^{(m \times n)}\; (m \gg n) $, where $m$ is the total number of pixels in each image and $n$ is the total number of images.

In this example, we will use a set of 12 images of U.S. Presidents to demonstrate the concept of Eigenfaces.

Download the face file

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

import numpy as np
from scipy import io
import matplotlib.pyplot as plt
%matplotlib inline

faces = io.loadmat('/content/drive/MyDrive/ML/ML_data/faces.mat')
face_mat = faces['faces']

row, col, n = face_mat.shape

plt.figure(figsize = (6, 8))

for i in range(n):
    plt.subplot(4, 3, i+1), plt.axis('off')
    plt.imshow(face_mat[:, :, i], 'gray')

plt.show()

Build matrix $A$ out of 12 images

$$ A = \begin{bmatrix} | & | & \cdots & | \\ x_1 & x_2 & \cdots & x_n \\ | & | & \cdots & | \end{bmatrix} $$

A = np.reshape(face_mat, [row*col, n])

Compute the Eigenfaces Using SVD

Given a set of facial images stored as matrix $A$, compute the SVD:

$$A = U \Sigma V^T$$

Where:

$U$: Left singular vectors (the eigenfaces).
$\Sigma$: Diagonal matrix containing the singular values.
$V$: Right singular vectors representing the coefficients (weights).

[U, S, VT] = np.linalg.svd(A, full_matrices = False)

U = np.asmatrix(U)
S = np.asmatrix(np.diag(S))
VT = np.asmatrix(VT)

print(U.shape, S.shape, VT.shape, '\n')

(50000, 12) (12, 12) (12, 12)

Eigenface or eigenmode from $U$

plt.figure(figsize = (6, 8))

for i in range(n):
    plt.subplot(4, 3, i+1), plt.axis('off')
    plt.imshow(U[:, i].reshape(row, col), 'gray')

plt.show()

Singular values

sig_val = np.diag(S)

plt.figure(figsize = (6, 4))
plt.stem(sig_val)
plt.title('Singular Values')
plt.show()

3.4.1. Data Compression¶

Transmit the compressed information

Assume that both parties have eigenfunctions
Only thing to transmit is the corresponding weights of the eigenfunctions

Encode the Image Using Weights

obama = face_mat[:,:,11]

plt.figure(figsize = (3, 4))
plt.imshow(obama, 'gray')
plt.axis('off')
plt.show()

Transmit the Encoded Data

For a given image $x$, project the image onto the eigenface basis:

$$z = U^T x$$

Where:

$z$ is the vector of coefficients (weights) that encodes the original image in the reduced-dimensional subspace.
This encoding drastically reduces the data size since $z$ typically requires far fewer values than the original image.

Instead of transmitting the entire image $x$, only the coefficient vector $z$ is transmitted. The vector $z$ is smaller in size because the SVD retains only the most significant features, discarding noise or less important details.

# sender
x = np.reshape(obama, [row*col, 1])
eigenface_decom = U.T*x

print(eigenface_decom)

[[-37117.28758117]
 [ -1854.11277302]
 [  4846.84543325]
 [  7876.39630423]
 [  -873.87825574]
 [  -450.54219688]
 [  1218.84905328]
 [ -5629.177794  ]
 [ -3180.2611649 ]
 [ -2436.0983698 ]
 [  1741.91910063]
 [  -438.24812691]]

Reconstruct the Image at the Receiver’s End

At the receiver’s end, reconstruct the original image using the transmitted coefficient vector:

$$\hat{x} = U z$$

Where:

$\hat{x}$ is the reconstructed image.
The quality of the reconstructed image depends on the number of retained coefficients (corresponding to the rank $k$).

# receiver
reconst = U*eigenface_decom

plt.figure(figsize = (3, 4))
plt.imshow(reconst.reshape(row, col), 'gray')
plt.axis('off')
plt.show()

Choosing the Rank $k$ for Reconstruction

Large $k$ → Higher quality with more detail, but larger data size.
Small $k$ → Lower data size, but reduced image quality.

In practice, selecting an optimal $k$ balances compression efficiency with visual quality.

k = 9

# sender
x = np.reshape(obama, [row*col, 1])
eigenface_decom = U[:,:k].T*x

# reciever
reconst = U[:,:k]*eigenface_decom

plt.figure(figsize = (3, 4))
plt.imshow(reconst.reshape(row, col), 'gray')
plt.axis('off')
plt.show()

Discussion

SVD efficiently separates meaningful facial features from noise.
Encoding the image as coefficients (weights) significantly reduces the transmitted data.
The method achieves excellent compression without severe loss of image quality.
$U$ contains the eigenmodes.

3.4.2. Face Recognition Using Eigenfaces¶

The Eigenfaces method is a classical yet effective technique for face recognition. While modern deep learning methods dominate large-scale face recognition tasks, Eigenfaces with SVD remains practical for scenarios with limited data or controlled environments.

Step 1: Project Image onto the Eigenface Subspace

Given an unseen input face image $x$, project it onto the eigenface subspace:

$$z = U^T x$$

Where:

$z$ is the vector of projected coefficients (or weights) that represents the image in the reduced-dimensional space.
$U$ contains the eigenmodes (eigenfaces) computed via SVD.

Step 2: Face Identification Using Distance Metrics

For each new input image, compute its coefficient vector $z$.
Compare $z$ with the coefficient vectors stored in the training set.
The most common method for comparison is the Euclidean distance:

$$d(x, x_i) = \lVert z - z_i \rVert_2$$

Where:

$z_i$ is the coefficient vector for the $i$-th individual in the training set.

Step 3: Classification Decision

The identity of the input face is determined by finding the closest match in the training set:

$$\hat{i} = \arg \min_i \; \lVert z - z_i \rVert_2$$

When is the Eigenface Method Preferred Over Deep Learning?

Although deep learning dominates face recognition tasks today, Eigenfaces is particularly useful in the following scenarios:

Limited Data: When only a small dataset is available (e.g., internal employee database).
Controlled Environments: In secure workplaces or facilities where environmental conditions are consistent.
Cost Efficiency: Suitable for applications with limited computational resources.
Simple Implementation: Requires no complex neural network architecture.

Practical Example

Suppose you need to build a face recognition system for identifying employees in your company:

Collect face images of all employees and compute their projected coefficients using the eigenface method.
When a person enters the building, compute the projected coefficients for their image.
Compare these coefficients with those stored in the database using Euclidean distance.
If the closest match is below a predefined threshold, the individual is recognized as an employee; otherwise, access may be denied.

3.4.3. Recognition and Reconstruction of Disguised Face¶

Here, I present an interesting scenario: President John F. Kennedy is wearing sunglasses when he enters the White House.

Download the image file

import cv2

jfk = cv2.imread('/content/drive/MyDrive/ML/ML_data/jfk.png', 0)

plt.figure(figsize = (3, 4))
plt.imshow(jfk, 'gray')
plt.axis('off')
plt.show()

Although our database contains his face image without sunglasses, our system must still successfully recognize him. Let's see how we can achieve this.

# projected coeeficient for new face
x = np.reshape(jfk, [row*col, 1])
eigenface_decom = U.T*x

# compute Euclidean distance to all training set
distances = []

for i in range(A.shape[1]):
  train_image_decomposition = U.T @ A[:, i].reshape(-1, 1)
  distance = np.linalg.norm(eigenface_decom - train_image_decomposition)
  distances.append(distance)


argmin_index = np.argmin(distances)

print(f"The input image is most similar to training image {argmin_index}")

The input image is most similar to training image 2

We can go even further.

Using the low-rank approximation properties of SVD, we can reconstruct his face by emphasizing dominant features while filtering out subtle variations, such as sunglasses.

This method effectively reveals the underlying facial structure, demonstrating how SVD can be used not only for recognition but also for image reconstruction and enhancement.

Projection onto $U$ and Reconstruct

x = np.reshape(jfk, [row*col, 1])

eigenface_decom = U.T*x
reconst = U*eigenface_decom

plt.figure(figsize = (9, 4))

plt.subplot(1,3,1)
plt.imshow(jfk, 'gray')
plt.title('Disguised')
plt.axis('off')

plt.subplot(1,3,2)
plt.imshow(reconst.reshape(row, col), 'gray')
plt.title('Reconstructed')
plt.axis('off')

plt.subplot(1,3,3)
plt.imshow(A[:,2].reshape(row, col), 'gray')
plt.title('Original')
plt.axis('off')

plt.show()

4. Proper Orthogonal Decomposition (POD)¶

The Proper Orthogonal Decomposition (POD) is a powerful method used to analyze data that varies across both space and time. This technique effectively identifies dominant patterns within complex systems and is widely used in fields such as fluid dynamics, structural mechanics, and signal processing.

Suppose we have data that is a function of both space and time:

$$y(x,t)$$

Our objective is to decompose this system into its fundamental building blocks.

A natural starting point is to express the system using a separation of variables approach:

$$y(x,t) = \sum_{j = 1}^{r}u_j(x) a_j(t)$$

Where

$u_j(x)$ represents spatial modes (basis functions)
$a_j(t)$ represents time-dependent coefficients (weights)

We aim to approximate most of the data using the smallest number of terms $r$.

Arrangement of Data

To implement POD efficiently, we organize the data into a matrix form:

Suppose measurements are taken at spatial locations $x_1, x_2, \cdots, x_n$
Suppose measurements are recorded at times $t_1, t_2, \cdots, t_m$
(typically after first subtracting a mean or equilibrium state)

We can construct a data matrix $Y$ with shape $n \times m$

POD Algorithm

$$ \begin{align*} Y_r &= U_r \Sigma_r V^T_r \\ &= \sum^{r}_{j=1}u_j \; \sigma_j v_j^T \\ & = \sum^{r}_{j=1}u_j(x) \; \sigma_j v_j(t)^T = \sum^{r}_{j=1}u_j (x) a_j(t) \end{align*} $$

This decomposition can be interpreted as:

$u_j(x)$: Spatial mode that represents dominant spatial features.
$a_j(t) = \sigma_j v_j(t)$: Temporal coefficient that describes the corresponding dynamic behavior.

4.1. Example 1¶

$$y(x,t) = \sin(\omega t - k x) + \varepsilon $$

Data Arrangement

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# data generated from the function

x = np.linspace(0, 5, 200)
Nx = np.size(x)

dt = 0.01
t = np.arange(0, 1, dt)

k = 2
omega = 4*np.pi

Y = np.zeros([np.size(x), np.size(t)], dtype = 'd')

for i in range(0, np.size(x)):
  for j in range(0, np.size(t)):
    Y[i, j] = np.sin(omega*t[j] - k*x[i]) + 0.1*np.random.randn()

print(Y.shape)

(200, 100)

Plot

# Let's see the signal in time

from matplotlib.animation import FuncAnimation
import time
from IPython import display

fig = plt.figure(figsize = (6, 4))
ax1 = fig.add_subplot(1, 1, 1)

def animate(i):
  ax1.clear()
  ax1.plot(x, Y[:,i], linewidth = 2)
  ax1.set_xlabel('x')
  ax1.set_xlim(0, 5)
  ax1.set_ylim(-2, 2)
  ax1.grid()

ani = FuncAnimation(fig, animate, frames = int(np.size(t)/1), interval = 10)
video = ani.to_html5_video()
html = display.HTML(video)
display.display(html)
plt.close()

Perform the SVD

U, S, VT = np.linalg.svd(Y, full_matrices = False)

print(U.shape)
print(S.shape)
print(VT.shape)

(200, 100)
(100,)
(100, 100)

Plot the Singular Values

plt.figure(figsize = (6, 4))
plt.plot(range(1, 101), S, 'o-')
plt.xticks(np.arange(6))
plt.xlim(0.5, 5.5)
plt.ylabel('Singular Value')
plt.xlabel('Index')
plt.grid(alpha = 0.3)
plt.show()

To achieve an efficient representation, we truncate the SVD by selecting only the top 2 singular values:

$$ y(x,t) \approx u_1(x)\sigma_1 v_1(t) + u_2(x)\sigma_2 v_2(t)$$

Plot $u_1(x)$

plt.figure(figsize = (6, 4))
plt.plot(x, U[:,0])
plt.title('POD Mode 1')
plt.xlabel('x')
plt.ylim([-2, 2])
plt.grid()
plt.show()

Plot $u_2(x)$

plt.figure(figsize = (6, 4))
plt.plot(x, U[:,1])
plt.title('POD Mode 2')
plt.xlabel('x')
plt.ylim([-2, 2])
plt.grid()
plt.show()

$$\begin{align*} y(x,t) &\approx u_1(x)\sigma_1 v_1(t) + u_2(x)\sigma_2 v_2(t) \\ &\approx u_1(x)a_1(t) + u_2(x)a_2(t) \end{align*} $$

Plot $a_1(t)$ and $a_2(t)$

plt.figure(figsize = (6, 4))
plt.plot(t, S[0]*VT[0,:], label = 'a1(t)', linewidth = 2)
plt.plot(t, S[1]*VT[1,:], label = 'a2(t)', linewidth = 2)
plt.xlim([0, 1])
plt.xlabel('time')
plt.title('Mode Coefficients')
plt.legend(loc = 4)
plt.grid()
plt.show()

Low-Rank Approximation

U = np.asmatrix(U)
diagS = np.asmatrix(np.diag(S))
VT = np.asmatrix(VT)

print(U.shape, diagS.shape, VT.shape, '\n')

k = 2

low_U = U[:, :k]
low_S = diagS[:k, :k]
low_VT = VT[:k, :]

low_Y = low_U*low_S*low_VT

print(low_U.shape, low_S.shape, low_VT.shape, '\n')

(200, 100) (100, 100) (100, 100) 

(200, 2) (2, 2) (2, 100)

# Let's see the signal in time

from matplotlib.animation import FuncAnimation
import time
from IPython import display

fig = plt.figure(figsize = (6, 4))
ax1 = fig.add_subplot(1, 1, 1)

def animate(i):
  ax1.clear()
  ax1.plot(x, low_Y[:,i], linewidth = 2)
  ax1.set_xlim(0, 5)
  ax1.set_ylim(-2, 2)
  ax1.grid()

ani = FuncAnimation(fig, animate, frames = int(np.size(t)/1), interval = 10)
video = ani.to_html5_video()
html = display.HTML(video)
display.display(html)
plt.close()

Discussion

The POD method effectively demonstrates its ability to extract meaningful patterns from data in a data-driven manner. The key outcomes and insights from this example are summarized below:

(1) Noise Reduction

By retaining only the dominant modes in the decomposition, POD successfully filters out noise and reconstructs the underlying signal with improved clarity.
This denoising capability arises naturally from the fact that noise typically resides in the smaller singular values, which are discarded during low-rank approximation.

(2) Data-Driven Decomposition

This example illustrates that POD can effectively extract meaningful components without explicit knowledge of the underlying system dynamics. Specifically, POD decomposes the original signal into dominant spatial and temporal modes:

$$y(x,t) \approx u_1(x)a_1(t) + u_2(x)a_2(t)$$

Note: This data-driven decomposition closely aligns with the well-known trigonometric identity.

$$y(x,t) = \sin(\omega t - k x) = \sin(\omega t)\cos(-kx) + \cos(\omega t) \sin(-kx)$$

4.2. Example 2: Fluid Flow Past a Circular Cylinder at Low Reynolds Number¶

This example is adapted from the textbook "Data-Driven Science and Engineering" by Steven L. Brunton and J. Nathan Kutz.

Data Description and Visualization

The dataset provided contains flow field data for fluid flow past a circular cylinder at a Reynolds number of Re = 100. The data file, CYLINDER_ALL.mat, includes 151 snapshots of velocity components and vorticity fields:

UALL: Horizontal velocity field (u-velocity)
VALL: Vertical velocity field (v-velocity)
VORTALL: Vorticity field

The flow is studied at a Reynolds number of Re = 100, a regime where vortex shedding occurs periodically and can be captured effectively using low-rank approximations.

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

from scipy import io

flows = io.loadmat('/content/drive/MyDrive/ML/ML_data/CYLINDER_ALL.mat')

flows_mat_u = flows['UALL']
flows_mat_v = flows['VALL']
flows_mat_vort = flows['VORTALL']

flows_mat_u.shape

(89351, 151)

nx = 449
ny = 199

plt.figure(figsize = (6, 4))
plt.subplot(1,3,1)
plt.imshow(flows_mat_u[:,150].reshape(nx, ny), 'gray')
plt.title('u velocity')
plt.axis('off')

plt.subplot(1,3,2)
plt.imshow(flows_mat_v[:,150].reshape(nx, ny), 'gray')
plt.title('v velocity')
plt.axis('off')

plt.subplot(1,3,3)
plt.imshow(flows_mat_vort[:,150].reshape(nx, ny), 'gray')
plt.title('vorticity')
plt.axis('off')

plt.show()

# Let's see the flow in time

from matplotlib.animation import FuncAnimation
import time
from IPython import display

fig = plt.figure(figsize = (6, 4))
ax1 = fig.add_subplot(1, 3, 1)
ax2 = fig.add_subplot(1, 3, 2)
ax3 = fig.add_subplot(1, 3, 3)

def animate(i):
  ax1.clear()
  ax1.imshow(flows_mat_u[:,i].reshape(nx, ny), 'gray')
  ax1.set_title('u velocity')
  ax1.axis('off')
  ax2.clear()
  ax2.imshow(flows_mat_v[:,i].reshape(nx, ny), 'gray')
  ax2.set_title('v velocity')
  ax2.axis('off')
  ax3.clear()
  ax3.imshow(flows_mat_vort[:,i].reshape(nx, ny), 'gray')
  ax3.set_title('vorticity')
  ax3.axis('off')

ani = FuncAnimation(fig, animate, frames = int(150/1), interval = 10)
video = ani.to_html5_video()
html = display.HTML(video)
display.display(html)
plt.close()

From now on, we will focus on vorticity rather than u and v velocity.

It is common practice to subtract the mean or equilibrium state for clearer analysis.

flows_mat_vort_normalized = (flows_mat_vort - flows_mat_vort.min()) / (flows_mat_vort.max() - flows_mat_vort.min())

flows_mat_vort_normalized_mean = flows_mat_vort_normalized.mean(axis = 1)
flows_mat_vort_normalized_centered = flows_mat_vort_normalized - flows_mat_vort_normalized_mean[:, None]

plt.imshow(flows_mat_vort_normalized_mean.reshape(nx, ny), 'gray')
plt.title('vorticity')
plt.axis('off')
plt.show()

Perform the POD

flows_mat = flows_mat_vort_normalized_centered

U, S, VT = np.linalg.svd(flows_mat, full_matrices = False)
print(U.shape)
print(S.shape)
print(VT.shape)

(89351, 151)
(151,)
(151, 151)

Plot Singular Values

plt.figure(figsize = (6, 4))
plt.plot(S, '-o')
plt.ylabel('Singular Value')
plt.xlabel('Index')
plt.show()

Plot Several Spatial Modes

plt.figure(figsize = (6, 4))

plt.subplot(1,4,1)
plt.imshow(U[:,0].reshape(nx, ny), 'gray')
plt.title('POD Mode 1')
plt.axis('off')

plt.subplot(1,4,2)
plt.imshow(U[:,1].reshape(nx, ny), 'gray')
plt.title('POD Mode 2')
plt.axis('off')

plt.subplot(1,4,3)
plt.imshow(U[:,4].reshape(nx, ny), 'gray')
plt.title('POD Mode 5')
plt.axis('off')

plt.subplot(1,4,4)
plt.imshow(U[:,12].reshape(nx, ny), 'gray')
plt.title('POD Mode 13')
plt.axis('off')

plt.show()

Low Rank Approximation

k = 5

low_U = U[:, :k]
low_S = diagS[:k, :k]
low_VT = VT[:k, :]

print(low_U.shape, low_S.shape, low_VT.shape, '\n')

low_flows_mat = low_U*low_S*low_VT

(89351, 5) (5, 5) (5, 151)

# Let's see the flow in time

from matplotlib.animation import FuncAnimation
import time
from IPython import display

fig = plt.figure(figsize = (6, 4))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)

def animate(i):
  ax1.clear()
  ax1.imshow(flows_mat[:,i].reshape(nx, ny) + flows_mat_vort_normalized_mean.reshape(nx, ny), 'gray')
  ax1.set_title('original')
  ax1.axis('off')
  ax2.clear()
  ax2.imshow(low_flows_mat[:,i].reshape(nx, ny) + flows_mat_vort_normalized_mean.reshape(nx, ny), 'gray')
  ax2.set_title('low rank')
  ax2.axis('off')

ani = FuncAnimation(fig, animate, frames = int(150/1), interval = 10)
video = ani.to_html5_video()
html = display.HTML(video)
display.display(html)
plt.close()

Discussion

In this example, we apply POD to analyze the complex flow field dynamics of fluid flow past a circular cylinder at a Reynolds number of 100. This method effectively identifies dominant flow patterns and provides a low-dimensional representation of the system.

Key Takeaway from Dimension Reduction

Just as the Heliocentric model simplified our understanding of planetary motion by shifting to a better reference frame, effective dimension reduction techniques reveal meaningful insights by aligning data with its most informative structure.

By finding the right transformation or new coordinate system, complex data can become self-explanatory, improving both interpretation and predictive power.

5. Fisher Discriminant Analysis (FDA)¶

from IPython.display import YouTubeVideo
YouTubeVideo('4Pz3TDX5dBs', width = "560", height = "315")

5.1. FDA Algorithm¶

5.1.1. Dimensionality Reduction with Label¶

In many machine learning tasks, dimensionality reduction is a crucial step for improving efficiency and enhancing model performance. While unsupervised methods such as Principal Component Analysis (PCA) effectively reduce dimensionality by maximizing variance, they do not take class labels into account. Consequently, PCA may overlook class-specific patterns that are important for classification or regression.

Fisher Discriminant Analysis (FDA) - also known as Linear Discriminant Analysis (LDA) - addresses this limitation by incorporating label information into the dimensionality reduction process. Unlike PCA, FDA aims to project the data onto a lower dimensional space that maximizes class separability.

Key Characteristics of PCA and FDA

Principal Component Analysis (PCA)

Focuses solely on maximizing data variance.
Does not utilize class label information.
Reduces dimensionality by selecting directions that capture the largest variability in the data.

Fisher Discriminant Analysis (FDA)

Uses class label information to find optimal projection directions.
Ensures that data points from the same class are projected close together.
Projects data points from different classes far apart to improve class separability.

Core Objective of FDA

FDA seeks to achieve two primary objectives simultaneously:

(1) Minimize Within-Class Variance:

FDA ensures that samples belonging to the same class are projected close together.
This is achieved by minimizing the spread of data points within each class.

(2) Maximize Between-Class Variance:

FDA aims to position samples from different classes as far apart as possible.
This is achieved by maximizing the distance between class means.

In this sense, FDA provides a supervised dimensionality reduction method that emphasizes class separation, making it particularly suitable for tasks such as classification.

5.1.2. Projection onto Line $\omega$¶

In Fisher Discriminant Analysis (FDA), the goal is to project high-dimensional data onto a one-dimensional space (a line) in such a way that maximizes class separation.

Suppose each data point $x$ is a two-dimensional vector:

$$x = \begin{bmatrix}x_1 \\ x_2 \end{bmatrix}$$

For simplicity, we assume the data is zero mean. If not, data should be centered:

$$x \leftarrow x - \bar{x}$$

Each data point is projected onto a line defined by the vector $\omega$. The projection is computed as:

$$ \hat{y} = \langle \omega,x\rangle = \omega^Tx = \omega_1x_1 + \omega_2x_2$$

Here, $\hat y$ is a scalar value representing the projected data point on the line.

The original data point $x$ in two dimensions is transformed into the scalar $\hat y$, which is a lower-dimensional representation:

$$x = \begin{bmatrix}x_1 \\ x_2 \end{bmatrix} \quad \rightarrow \quad \hat y \; (\text{scalar})$$

Each data point $x$ is projected onto the line defined by $\omega$, with the projected length along the $\omega$-direction serving as the new data point.

The set of projected points:

$$\{ \hat y^{(1)}, \cdots, \hat y^{(m)} \}$$

represents the distribution of data points in the new one-dimensional space.

Key Question: Which $\omega$ is better for classification?

The choice of $\omega$ significantly impacts the effectiveness of the projection for separating classes.
An optimal $\omega$ should maximize the separation between classes while minimizing the spread of points within each class.

In Fisher Discriminant Analysis, the optimal $\omega$ is selected to achieve this balance by maximizing the between-class variance and minimizing the within-class variance - a concept explored in the subsequent sections.

5.1.3. Fisher Discriminant Analysis¶

The goal of FDA is to find a direction $\omega$ such that:

The distance between class means (when projected onto $\omega$) is maximized.
The variance within each class (when projected onto $\omega$) is minimized.

To achieve this, FDA maximizes the following ratio:

$$\max_\omega \frac{(\text{seperation of projected means})^2}{\text{sum of within class variances}}$$

Equivalently,

$$ \implies \max_\omega \frac {\big(\mu_0^T\omega - \mu_1^T\omega \big)^2}{n_0\omega^TS_0\omega + n_1\omega^TS_1\omega}$$

Where:

$\mu_0$ and $\mu_1$ are the mean vectors for Class 0 and Class 1, respectively.
$n_0$ and $n_1$ are the number of samples in each class.
$S_0$ and $S_1$ are the within-class scatter matrices, which measure the variance of each class.

We aim to find the optimal projection vector $\omega$ that maximizes class separation.

Step 1: Objective Function

The objective is to maximize the following ratio:

$$\omega^* = \text{arg}\max_\omega \bigg\{ \frac{\big((\mu_0^T - \mu_1^T)\omega\big)^2}{n_0\omega^TS_0\omega + n_1\omega^TS_1\omega} \bigg \}$$

By defining:

$m \equiv \mu_0 - \mu_1$: the difference between class means
$\Sigma \equiv n_0S_0 + n_1S_1 = R^TR$: the combined within-class scatter matrix

We rewrite the objective as:

$$J(\omega) = \frac {\big((\mu_0^T - \mu_1^T)\omega\big)^2}{\omega^T(n_0S_0 + n_1S_1)\omega} = \frac {(m^T\omega)^2}{\omega^T\Sigma \omega}$$

Step 2: Coordinate Transformation

Since $ \Sigma $ is symmetric and positive semi-definite, we can factorize it as $ \Sigma = R^T R $, where $ R $ is a square root matrix.

Now introduce a new variable $ u $ such that:

$$u = R\omega \quad \Rightarrow \quad \omega = R^{-1}u$$

This change of variables simplifies the objective function:

$$J(u) = \frac {\big( m^TR^{-1}u\big)^2}{\omega^TR^TR\omega} = \frac{\left( m^T R^{-1} u \right)^2}{u^T u}$$

Step 3: Maximizing the Objective

Since $ u^T u = \lVert u \rVert^2 $, the objective becomes:

$$J(u) = \left( \left( R^{-T} m \right)^T \frac{u}{\lVert u \rVert} \right)^2 = { \left \langle R^{-T} m, \frac{u}{\lVert u \rVert} \right \rangle}^2$$

By the properties of the dot product, this expression is maximized when:

$$u = a R^{-T} m$$

Why?:

Dot product of a unit vector and another vector is maximum when the two have the same direction.

Step 4: Recovering $ \omega $

By substituting back the relationship $ \omega = R^{-1} u $:

$$\omega = a R^{-1} R^{-T} m$$

Since $ R^{-1} R^{-T} = \Sigma^{-1} $, we have:

$$\omega = a \Sigma^{-1} m$$

By substituting back $ m = \mu_0 - \mu_1 $ and $ \Sigma = n_0S_0 + n_1S_1 $, the final solution is:

$$\omega = a (n_0S_0 + n_1S_1)^{-1} (\mu_0 - \mu_1)$$

This result highlights FDA's ability to effectively combine dimensionality reduction and classification by maximizing class separation.

5.2. FDA in Python¶

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Data Generation

# generating data set

n0 = 200
n1 = 200

mu = [0, 0]
sigma = [[0.9, -0.4],
         [-0.4, 0.3]]

np.random.seed(42)
x0 = np.random.multivariate_normal([2.5,2.5], sigma, n0).T        # data in class 0
x1 = np.random.multivariate_normal([1,1], sigma, n1).T            # data in class 1

x0 = np.asmatrix(x0)
x1 = np.asmatrix(x1)

plt.figure(figsize = (6, 4))
plt.plot(x0[0,:], x0[1,:], 'r.', alpha = 0.3)
plt.plot(x1[0,:], x1[1,:], 'b.', alpha = 0.3)
plt.xlim([-2, 6])
plt.ylim([-1, 4.5])
plt.show()

Before determining the optimal $ \omega $, it is helpful to examine how the distribution of projected data points varies depending on different projection directions.

Consider the line represented by the equation:

$$\omega_1 x_1 + \omega_2 x_2 = 0$$

By observing the projected points along these different lines, we can gain insight into how the choice of $ \omega $ influences class separation and data spread. This visualization helps build intuition before proceeding to identify the optimal $ \omega $.

Projection onto x-axis

# x-axis

w = np.array([[1], [0]])
w = np.asmatrix(w)

plt.figure(figsize = (6, 4))
plt.plot(x0[0,:], x0[1,:], 'r.', alpha = 0.3)
plt.plot(x1[0,:], x1[1,:], 'b.', alpha = 0.3)
plt.axhline(0, color = 'k')
plt.xlim([-2, 6])
plt.ylim([-1, 4.5])
plt.show()

Histogram of prejected data

y1 = x0.T*w
y2 = x1.T*w

plt.figure(figsize = (6, 4))
plt.hist(y1, 21, color = 'r', rwidth = 0.5, alpha = 0.4)
plt.hist(y2, 21, color = 'b', rwidth = 0.5, alpha = 0.4)
plt.show()

Projection onto y-axis

# y-axis

w = np.array([[0], [1]])
w = np.asmatrix(w)

plt.figure(figsize = (6, 4))
plt.plot(x0[0,:], x0[1,:], 'r.', alpha = 0.3)
plt.plot(x1[0,:], x1[1,:], 'b.', alpha = 0.3)
plt.axvline(0, color = 'k')
plt.xlim([-2, 6])
plt.ylim([-1, 4.5])
plt.show()

y1 = x0.T*w
y2 = x1.T*w

plt.figure(figsize = (6, 4))
plt.hist(y1, 21, color = 'r', rwidth = 0.5, alpha = 0.4)
plt.hist(y2, 21, color = 'b', rwidth = 0.5, alpha = 0.4)
plt.show()

From both cases, we can observe that there is overlap in their histograms, which leads to potential misclassification when the data is projected into a lower-dimensional space.

Now, let's examine the optimal projection line determined by Fisher Discriminant Analysis (FDA) and analyze the corresponding distribution of projected data to evaluate its effectiveness in improving classification performance.

Projection line by FDA

mu0 = np.mean(x0, axis = 1)
mu1 = np.mean(x1, axis = 1)

S0 = 1/(n0 - 1)*(x0 - mu0)*(x0 - mu0).T
S1 = 1/(n1 - 1)*(x1 - mu1)*(x1 - mu1).T

w = (n0*S0 + n1*S1).I*(mu0 - mu1)
print(w)

[[0.02425817]
 [0.04626859]]

plt.figure(figsize = (6, 4))
plt.plot(x0[0,:], x0[1,:], 'r.', alpha = 0.3)
plt.plot(x1[0,:], x1[1,:], 'b.', alpha = 0.3)

xp = np.arange(-4, 6, 0.1)
yp = w[1,0]/w[0,0]*xp
plt.plot(xp, yp, 'k')
plt.xlim([-2, 6])
plt.ylim([-1, 4.5])
plt.show()

Histogram

y1 = x0.T*w
y2 = x1.T*w

plt.figure(figsize = (6, 4))
plt.hist(y1, 21, color = 'r', rwidth = 0.5, alpha = 0.4)
plt.hist(y2, 21, color = 'b', rwidth = 0.5, alpha = 0.4)
plt.show()

Scikit-learn

# reshape data

X = np.vstack([x0.T, x1.T])
y = np.vstack([np.ones([n0, 1]), np.zeros([n1, 1])])

from sklearn import discriminant_analysis

X = np.array(X)

clf = discriminant_analysis.LinearDiscriminantAnalysis()
clf.fit(X, np.ravel(y))

LinearDiscriminantAnalysis()

LinearDiscriminantAnalysis()

Histogram

# projection

X_LDA = clf.transform(X)

plt.figure(figsize = (6, 4))
plt.hist(X_LDA[0:200], 21, color = 'r', rwidth = 0.5, alpha = 0.4)
plt.hist(X_LDA[200:400], 21, color = 'b', rwidth = 0.5, alpha = 0.4)
plt.show()

Classification boundary

# classifier boundary, not a projection line

clf_w = clf.coef_[0]
clf_w0 = clf.intercept_[0]
print(clf_w)
print(clf_w0)

[ 9.70326866 18.50743511]
-49.45561208450006

Projection line

# projection line

mu0 = np.mean(x0, axis = 1)
mu1 = np.mean(x1, axis = 1)

S0 = 1/(n0 - 1)*(x0 - mu0)*(x0 - mu0).T
S1 = 1/(n1 - 1)*(x1 - mu1)*(x1 - mu1).T

prj_w = (n0*S0 + n1*S1).I*(mu0 - mu1)

xp = np.arange(-4, 6, 0.1)
prj_yp = prj_w[1,0]/prj_w[0,0]*xp
clf_yp = - clf_w[0]/clf_w[1]*xp - clf_w0/clf_w[1]

plt.figure(figsize = (6, 4))
plt.plot(x0[0,:], x0[1,:], 'r.', alpha = 0.3)
plt.plot(x1[0,:], x1[1,:], 'b.', alpha = 0.3)
plt.plot(xp, prj_yp, 'k', label = 'LDA projection line')
plt.plot(xp, clf_yp, 'g', label = 'LDA classification boundary')
plt.legend()
plt.xlim([-2, 6])
plt.ylim([-1, 4.5])
plt.show()

We have studied Fisher Discriminant Analysis (FDA), exploring both its theoretical foundation and practical implementation. While FDA facilitates dimensionality reduction, it is important to emphasize that FDA operates in a supervised manner. Unlike unsupervised methods such as PCA, the primary objective of FDA is not merely to reduce dimensionality but to achieve effective class separation and improve classification performance.

%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')