Dimension Reduction
Table of Contents
Statistics is the branch of mathematics that deals with the collection, analysis, interpretation, presentation, and organization of data. It plays a crucial role in various fields, including science, business, social sciences, and machine learning.
from IPython.display import YouTubeVideo
YouTubeVideo('OWOzaQhdE3Y', width = "560", height = "315")
Statistics vs Probability
Probability
Definition: Probability is the branch of mathematics that deals with predicting the likelihood of future outcomes based on known conditions or assumptions.
Focus: Probability starts with a known model (e.g., theoretical distribution) and aims to predict possible outcomes.
Statistics
Definition: Statistics is the branch of mathematics that involves analyzing existing data to infer properties of the population or predict future outcomes.
Focus: Statistics starts with observed data and attempts to build models or draw conclusions about the underlying distribution.
In statistics, understanding the distinction between populations and samples is fundamental to data analysis and inference.
A population includes all the elements from a set of data
A parameter is a quantity computed from a population
A sample is a subset of the population.
A statistic is a quantity computed from samples
Purpose of Sampling
Samples are used because measuring the entire population is often impractical or impossible.
To make inferences about a population without examining every member.
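As a minimal sketch of this idea (with hypothetical numbers), we can treat an array as the population, draw a random sample from it, and compare the population parameter with the corresponding sample statistic:
import numpy as np

# hypothetical population of 100,000 values
population = np.random.normal(10, 2, 100000)

# a random sample of 100 members
sample = np.random.choice(population, size = 100, replace = False)

print('population mean (parameter):', population.mean())
print('sample mean (statistic):', sample.mean())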
How to Generate Random Numbers (Samples or data)
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
## random number generation (1D)
m = 1000

# uniform distribution U(0,1)
x1 = np.random.rand(m, 1)

# uniform distribution U(a,b)
a = 1
b = 5
x2 = a + (b - a)*np.random.rand(m, 1)

# standard normal (Gaussian) distribution N(0,1^2)
# x3 = np.random.normal(0, 1, m)
x3 = np.random.randn(m, 1)

# normal distribution N(5,2^2)
x4 = 5 + 2*np.random.randn(m, 1)

# random integers from 1 to 5 (upper bound is exclusive)
x5 = np.random.randint(1, 6, size = (1, m))
Histogram: A Graphical Representation of Data Distribution
A histogram is a powerful tool for visualizing the distribution of data. It represents the frequency of data points across defined intervals (bins). Unlike bar charts, histograms group continuous data into bins to reveal patterns such as skewness, modality, and outliers.
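For instance, a minimal sketch (reusing the samples x1 through x4 generated above) plots their histograms to compare the uniform and Gaussian distributions:
# histograms of the generated samples
plt.figure(figsize = (8, 6))
plt.subplot(2, 2, 1), plt.hist(x1, 31), plt.title('U(0,1)')
plt.subplot(2, 2, 2), plt.hist(x2, 31), plt.title('U(1,5)')
plt.subplot(2, 2, 3), plt.hist(x3, 31), plt.title('N(0,1)')
plt.subplot(2, 2, 4), plt.hist(x4, 31), plt.title('N(5,2^2)')
plt.tight_layout()
plt.show()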
Data Representation
Sample Mean
Sample Variance
Population Variance
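As a quick numerical sketch (reusing the Gaussian samples x4 generated earlier), NumPy distinguishes the sample variance (ddof = 1, dividing by $m-1$) from the population variance (ddof = 0, dividing by $m$):
# sample mean
print(np.mean(x4))

# sample variance (divide by m-1)
print(np.var(x4, ddof = 1))

# population variance (divide by m)
print(np.var(x4, ddof = 0))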
This section explores the relationship between two random variables, focusing on variance, covariance, and correlation.
Sample Variance
Sample Covariance
Sample Covariance Matrix
Sample Correlation Coefficient
Suppose
The correlation coefficient is a statistical measure that quantifies the strength and direction of the linear relationship between two variables.
Key Properties
Interpretation:
$+1 \to \;$ close to a straight line (positive correlation)
$-1 \to \;$ close to a straight line (negative correlation)
$\;\;\;0 \to \;$ no linear relationship (uncorrelated)
Closeness to a line
A value close to $+1$ or $-1$ implies the data points are tightly clustered around a straight line.
The correlation does not indicate the slope of the line
Important Note on Causality
Correlation does NOT imply causation.
Even if two variables are strongly correlated, it doesn't necessarily mean one causes the other.
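As a small numerical sketch (with hypothetical data), np.corrcoef illustrates that the correlation coefficient reflects the direction and tightness of a linear relationship, but not its slope:
x_demo = np.random.randn(1000)
y_steep = 10*x_demo + np.random.randn(1000)          # steep positive slope
y_shallow = 0.1*x_demo + 0.01*np.random.randn(1000)  # shallow positive slope
y_neg = -5*x_demo + np.random.randn(1000)            # negative slope
y_rand = np.random.randn(1000)                       # unrelated

print(np.corrcoef(x_demo, y_steep)[0, 1])    # close to +1
print(np.corrcoef(x_demo, y_shallow)[0, 1])  # also close to +1, despite a very different slope
print(np.corrcoef(x_demo, y_neg)[0, 1])      # close to -1
print(np.corrcoef(x_demo, y_rand)[0, 1])     # close to 0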
A correlation coefficient plot is a visual tool that shows the strength and direction of correlation between pairs of variables. It's particularly useful in identifying patterns and relationships within a dataset.
seaborn
import seaborn as sns
import pandas as pd
# x, xo, yo, yor are assumed to be defined in earlier cells (not shown here)
d = {'col. 1': x, 'col. 2': xo, 'col. 3': yo, 'col. 4': yor}
df = pd.DataFrame(data = d)
sns.pairplot(df, height = 1.5, aspect = 1)
plt.show()
from IPython.display import YouTubeVideo
YouTubeVideo('-sM2dYlYSoI', width = "560", height = "315")
The term dimension is widely used across different disciplines. Let's explore its meaning in several contexts:
(1) In Coordinate Systems
(2) In Dynamics (Generalized Coordinates)
(3) In Machine Learning
(4) In Image Processing
In this example, we have a system described by two original variables:
These equations describe the dynamics of $u_1$ and $u_2$, which initially appear to be a 2D system.
We'll introduce new variables based on physical insights:
By substituting the new variables into the original system:
The solution in $v_1$ and $v_2$
Since $v_1(t)$ is constant, it doesn't evolve over time and effectively behaves like a fixed parameter rather than a dynamic variable. Consequently, the system's behavior is entirely governed by $v_2(t)$, the only variable that changes with time.
This is a great example of how dimension reduction (2D $\rightarrow$ 1D) can emerge naturally from analyzing the underlying physics and identifying conserved quantities.
Example 2
In this example, we have a $15 \times 15$ pixel image, which corresponds to 225 dimensions if each pixel is treated as an independent feature.
Key Insight: Why random noise looks meaningless
If each pixel is assigned a random value, the resulting image appears chaotic and lacks any meaningful structure.
This randomness illustrates that while the data exists in 225-dimensional space, not all points in this space represent meaningful images.
Why does this relate to dimension reduction?
Real-World Images are far from random. They follow specific patterns, structures, and correlations.
Consequently, meaningful images occupy only a small subspace within the full 225-dimensional space.
This observation opens the door for dimension reduction - mapping high-dimensional image data into a lower-dimensional subspace that retains meaningful features.
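A minimal sketch of this observation: assigning each of the 225 pixels an independent random value almost never produces a recognizable image.
# a few random points in the 225-dimensional pixel space, shown as 15 x 15 images
plt.figure(figsize = (6, 2))
for i in range(3):
    rand_img = np.random.rand(15, 15)   # 225 independent random pixel values
    plt.subplot(1, 3, i + 1)
    plt.imshow(rand_img, 'gray')
    plt.axis('off')
plt.show()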
In many machine learning and data analysis tasks, datasets often contain numerous features, resulting in high-dimensional data. While this abundance of information can be beneficial, it introduces several challenges such as:
To address these issues, we explore dimensionality reduction - a technique designed to simplify complex data by transforming it into a lower dimensional representation while preserving as much meaningful information as possible.
Dimensionality reduction offers several important benefits:
(1) Insights into Low-Dimensional Structures:
(2) Improved Generalization and Reduced Overfitting:
(3) Increased Computational Efficiency:
(4) Lower Storage Requirements:
Dimensionality Reduction vs. Feature Selection
Feature Selection involves choosing a subset of existing features that best represent the data.
Dimensionality Reduction, on the other hand, is more like Feature Extraction, where new features are constructed from combinations of the original features.
While both approaches aim to simplify the data, dimensionality reduction actively transforms the original data into a lower dimensional form rather than simply discarding irrelevant features.
In essence, dimensionality reduction helps uncover the intrinsic structure of the data, improving both interpretability and computational efficiency - all while retaining critical information.
Redundancy in Data
Redundancy occurs when two or more features provide overlapping or duplicated information.
Highly correlated features often carry similar patterns, making some dimensions less informative.
(1) Low Redundancy
(2) Moderate Redundancy
(3) High Redundancy
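A small sketch (with hypothetical Gaussian data) illustrates these three regimes by generating feature pairs with increasing correlation:
# two features with low, moderate, and high redundancy (hypothetical data)
corrs = [0.0, 0.6, 0.95]
labels = ['Low', 'Moderate', 'High']

plt.figure(figsize = (9, 3))
for i, r in enumerate(corrs):
    cov = [[1, r], [r, 1]]
    data = np.random.multivariate_normal([0, 0], cov, 500)
    plt.subplot(1, 3, i + 1)
    plt.plot(data[:, 0], data[:, 1], '.', alpha = 0.3)
    plt.title('{} redundancy (r = {})'.format(labels[i], r))
    plt.axis('equal')
plt.tight_layout()
plt.show()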
Example of Dim Reduction in 2D
Consider a dataset distributed as shown in the following figure.
Each data point has two features $\{x_1,x_2\}$
(1) Ignoring feature $x_2$
(2) Assessing information loss
The key question is: "Are we losing much information by throwing away $x_2$?"
Observing the data distribution reveals that the data points are highly concentrated along the $x_1$-axis, with very little spread along the $x_2$-axis.
This suggests that the data has minimal variance in the $x_2$ direction.
Now consider a dataset distributed as shown in the following figure.
In this case, the dataset is distributed across two features $\{x_1,x_2\}$
(1) The data exhibits significant variance along both axes ($x_1$ and $x_2$)
(2) Suppose we attempt to discard $x_2$ and reduce the data to:
(3) Significant information loss:
While the dataset remains the same, the coordinate system can be changed. This transformation reveals key insights into information retention and loss.
(1) Observing the data in the new axes
After transforming the data into new coordinates ($u_1$ and $u_2$), the variance is now concentrated mostly along the $u_1$-axis.
The spread along the $u_2$-axis is minimal, indicating low variability in that direction.
(2) Discarding a dimension in the new axes
How to Find $u_1$
The key question is: How can we determine the optimal axis $u_1$ that maximizes variance and minimizes information loss?
Core Idea
In essence, PCA identifies the optimal axis $u_1$ that retains the most important information about the data when reducing dimensions.
In the previous section, we introduced the concept of dimensionality reduction and explored the core ideas behind it. Now, we will dive deeper into the mechanics of how this reduction is achieved - specifically through the Principal Component Analysis (PCA) algorithm.
The goal of PCA is to transform the data into a new coordinate system such that:
In this section, we will systematically study the PCA algorithm and its step-by-step procedure.
Before applying Principal Component Analysis (PCA), it is crucial to preprocess the data to ensure that all features are appropriately scaled and centered. Proper preprocessing improves the accuracy and reliability of PCA results.
Step 1: Given Data
Suppose we are given a dataset consisting of $m$ examples, each with $n$ features:
Step 2: Shifting (Zero Mean)
To ensure the data is centered at zero (mean removal):
Step 3: Rescaling (Unit Variance) [Optional]
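A minimal NumPy sketch of these preprocessing steps (with a small hypothetical data matrix X_demo of shape (m, n), examples in rows):
# hypothetical data: m = 5 examples, n = 2 features
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0],
                   [4.0, 40.0],
                   [5.0, 50.0]])

# Step 2: shifting (zero mean)
X_centered = X_demo - np.mean(X_demo, axis = 0)

# Step 3: rescaling to unit variance (optional)
X_scaled = X_centered / np.std(X_centered, axis = 0)

print(np.mean(X_scaled, axis = 0))   # approximately 0 for each feature
print(np.std(X_scaled, axis = 0))    # 1 for each feature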
The goal of Principal Component Analysis (PCA) is to find the optimal direction (axis) that maximizes the variance in the data when projected onto that direction.
Step 1: Data Projection
Step 2: Maximizing the Variance of Projected Data
PCA identifies the direction $ u_1 $ such that the projected data captures the maximum variance.
Step 3: Optimization Formulation
Maximizing the variance can now be written as the following optimization problem:
Step 4: Eigenvalue Analysis
From the properties of eigenvectors and eigenvalues:
The objective of PCA can also be framed as minimizing the sum-of-squared errors between the original data points and their projections.
Step 1: Derivation of Error Term
Step 2: Minimizing the Mean Squared Error
Step 3: Optimization Formulation
Minimizing the error is equivalent to maximizing the variance:
In PCA, we can reduce the dimensionality of the data while retaining as much meaningful information as possible by following these steps:
Step 1: Select Top $k$ Principal Components
Step 2: Project Each Data Point onto the New Subspace
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# data generation
m = 1000
mu = np.array([0, 0])
sigma = np.array([[3, 1.5],
[1.5, 1]])
X = np.random.multivariate_normal(mu, sigma, m)
X = np.asmatrix(X)
print(X.shape)
fig = plt.figure(figsize = (6, 4))
plt.plot(X[:,0], X[:,1], 'b.', alpha = 0.3)
plt.grid(alpha = 0.3)
plt.xlim([-7, 7])
plt.ylim([-4, 4])
plt.show()
PCA Implementation
S = 1/(m-1)*X.T*X
D, U = np.linalg.eig(S)
idx = np.argsort(-D)
D = D[idx]
U = U[:,idx]
print(D, '\n')
print(U)
Plot the principal component
h = U[1,0]/U[0,0]
xp = np.arange(-6, 6, 0.1)
yp = h*xp
fig = plt.figure(figsize = (6, 4))
plt.plot(X[:,0], X[:,1], 'b.', alpha = 0.3)
plt.plot(xp, yp, 'r', linewidth = 3)
plt.grid(alpha = 0.3)
plt.xlim([-7, 7])
plt.ylim([-4, 4])
plt.show()
Plot the histogram of data projected onto $u_1$
Z1 = X*U[:,0]
plt.figure(figsize = (6, 4))
plt.hist(Z1, 51)
plt.xlim([-7, 7])
plt.show()
Plot the histogram of data projected onto $u_2$
Z2 = X*U[:,1]
plt.figure(figsize = (6, 4))
plt.hist(Z2, 21)
plt.xlim([-7, 7])
plt.show()
PCA from sklearn.decomposition
from sklearn.decomposition import PCA
X = np.array(X)
pca = PCA(n_components = 1)
pca.fit(X)
Plot the histogram of data projected onto $u_1$
Z = pca.transform(X)
plt.figure(figsize = (6, 4))
plt.hist(Z, 51)
plt.xlim([-7, 7])
plt.show()
Consider a simple spring-mass system in which a ball of mass $m$ is attached to a massless, frictionless spring. The system exhibits oscillatory motion, which can ideally be described by a one-dimensional function of time representing the displacement.
Experimental Setup
However, we lack prior knowledge about the system's true dynamics and the optimal measurement direction.
Challenge
The true underlying dynamics are defined solely by the ball's motion along the $x$-axis, yet:
Objective
Our goal is to apply Principal Component Analysis (PCA) to:
Video recording from Camera A
from IPython.display import YouTubeVideo
YouTubeVideo('Pkit-64g0eU', width = "560", height = "315")
Video recording from Camera B
from IPython.display import YouTubeVideo
YouTubeVideo('x4lvjVjUUqg', width = "560", height = "315")
Video recording from Camera C
from IPython.display import YouTubeVideo
YouTubeVideo('2t62WkNIqxY', width = "560", height = "315")
Each camera provides two measurements - one for the horizontal position ($x$) and one for the vertical position ($y$) - relative to its own coordinate system. For each time instance $i$, the recorded data can be structured as follows:
This 6-dimensional vector $x^{(i)}$ represents the system's position as captured by all three cameras at time instance $i$.
To organize the entire dataset over $m$ time instances, we construct a data matrix $X$ as follows:
Where:
We have collected the data for you. Here is the dataset that you can download for analysis.
from google.colab import drive
drive.mount('/content/drive')
import numpy as np
X = np.load('/content/drive/MyDrive/ML/ML_data/pca_spring.npy')
X = np.asmatrix(X.T)
m = X.shape[0]
print(X.shape)
# Do not worry about the negative sign for the relative y-axis.
# it is simply a translation to plot the motion as observed by the camera.
plt.figure(figsize = (8, 4))
plt.subplot(1,3,1)
plt.plot(X[:, 0], -X[:, 1], 'r')
plt.axis('equal')
plt.title('Camera 1')
plt.subplot(1,3,2)
plt.plot(X[:, 2], -X[:, 3], 'b')
plt.axis('equal')
plt.title('Camera 2')
plt.subplot(1,3,3)
plt.plot(X[:, 4], -X[:, 5], 'k')
plt.axis('equal')
plt.title('Camera 3')
plt.show()
Apply PCA
X = X - np.mean(X, axis = 0)
S = 1/(m-1)*X.T*X
D, U = np.linalg.eig(S)
idx = np.argsort(-D)
D = D[idx]
U = U[:,idx]
print(D, '\n')
print(U)
Plot Eigenvalues
plt.figure(figsize = (6, 4))
plt.stem(np.sqrt(D))
plt.grid(alpha = 0.3)
plt.show()
Projection to the new Eigenvectors
# relative magnitudes of the principal components
Z = X*U
xp = np.arange(0, m)/24 # 24 frame rate
plt.figure(figsize = (6, 4))
plt.plot(xp, Z)
plt.grid(alpha = 0.3)
plt.show()
Projection to the Principal Component
## projected onto the first principal component
# 6 dim -> 1 dim (dim reduction)
# relative magnitude of the first principal component
plt.figure(figsize = (6, 4))
plt.plot(Z[:,0])
plt.grid(alpha = 0.3)
plt.show()
Discussion:
from IPython.display import YouTubeVideo
YouTubeVideo('4Pz3TDX5dBs', width = "560", height = "315")
In many machine learning tasks, dimensionality reduction is a crucial step for improving efficiency and enhancing model performance. While unsupervised methods such as Principal Component Analysis (PCA) effectively reduce dimensionality by maximizing variance, they do not take class labels into account. Consequently, PCA may overlook class-specific patterns that are important for classification or regression.
Fisher Discriminant Analysis (FDA) - also known as Linear Discriminant Analysis (LDA) - addresses this limitation by incorporating label information into the dimensionality reduction process. Unlike PCA, FDA aims to project the data onto a lower dimensional space that maximizes class separability.
Key Characteristics of PCA and FDA
Principal Component Analysis (PCA)
Fisher Discriminant Analysis (FDA)
Core Objective of FDA
FDA seeks to achieve two primary objectives simultaneously:
(1) Minimize Within-Class Variance:
(2) Maximize Between-Class Variance:
In this sense, FDA provides a supervised dimensionality reduction method that emphasizes class separation, making it particularly suitable for tasks such as classification.
In Fisher Discriminant Analysis (FDA), the goal is to project high-dimensional data onto a one-dimensional space (a line) in such a way that maximizes class separation.
Suppose each data point $x$ is a two-dimensional vector:
For simplicity, we assume the data is zero mean. If not, data should be centered:
Each data point is projected onto a line defined by the vector $\omega$. The projection is computed as:
Here, $\hat y$ is a scalar value representing the projected data point on the line.
The original data point $x$ in two dimensions is transformed into the scalar $\hat y$, which is a lower-dimensional representation:
Each data point $x$ is projected onto the line defined by $\omega$, with the projected length along the $\omega$-direction serving as the new data point.
The set of projected points:
represents the distribution of data points in the new one-dimensional space.
Key Question: Which $\omega$ is better for classification?
The choice of $\omega$ significantly impacts the effectiveness of the projection for separating classes.
An optimal $\omega$ should maximize the separation between classes while minimizing the spread of points within each class.
In Fisher Discriminant Analysis, the optimal $\omega$ is selected to achieve this balance by maximizing the between-class variance and minimizing the within-class variance - a concept explored in the subsequent sections.
The goal of FDA is to find a direction $\omega$ such that:
The distance between class means (when projected onto $\omega$) is maximized.
The variance within each class (when projected onto $\omega$) is minimized.
To achieve this, FDA maximizes the following ratio:
Equivalently,
Where:
$\mu_0$ and $\mu_1$ are the mean vectors for Class 0 and Class 1, respectively.
$n_0$ and $n_1$ are the number of samples in each class.
$S_0$ and $S_1$ are the within-class scatter matrices, which measure the variance of each class.
We aim to find the optimal projection vector $\omega$ that maximizes class separation.
Step 1: Objective Function
The objective is to maximize the following ratio:
By defining:
We rewrite the objective as:
Step 2: Coordinate Transformation
Since $ \Sigma $ is symmetric and positive semi-definite, we can factorize it as $ \Sigma = R^T R $, where $ R $ is a square root matrix.
Now introduce a new variable $ u $ such that:
This change of variables simplifies the objective function:
Step 3: Maximizing the Objective
Since $ u^T u = \lVert u \rVert^2 $, the objective becomes:
By the properties of the dot product, this expression is maximized when:
Why?:
The dot product of a unit vector with a fixed vector is maximized when the two point in the same direction.
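In symbols, for a unit vector $u$ and any fixed vector $b$:
$$u^T b = \lVert u \rVert \lVert b \rVert \cos\theta = \lVert b \rVert \cos\theta,$$
which attains its maximum value $\lVert b \rVert$ at $\theta = 0$, i.e., when $u$ points in the same direction as $b$, so $u = \frac{b}{\lVert b \rVert}$.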
Step 4: Recovering $ \omega $
By substituting back the relationship $ \omega = R^{-1} u $:
Since $ R^{-1} R^{-T} = \Sigma^{-1} $, we have:
By substituting back $ m = \mu_0 - \mu_1 $ and $ \Sigma = n_0S_0 + n_1S_1 $, the final solution is:
This result highlights FDA's ability to effectively combine dimensionality reduction and classification by maximizing class separation.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Data Generation
# generating data set
n0 = 200
n1 = 200
mu = [0, 0]
sigma = [[0.9, -0.4],
[-0.4, 0.3]]
np.random.seed(42)
x0 = np.random.multivariate_normal([2.5,2.5], sigma, n0).T # data in class 0
x1 = np.random.multivariate_normal([1,1], sigma, n1).T # data in class 1
x0 = np.asmatrix(x0)
x1 = np.asmatrix(x1)
plt.figure(figsize = (6, 4))
plt.plot(x0[0,:], x0[1,:], 'r.', alpha = 0.3)
plt.plot(x1[0,:], x1[1,:], 'b.', alpha = 0.3)
plt.xlim([-2, 6])
plt.ylim([-1, 4.5])
plt.show()
Before determining the optimal $ \omega $, it is helpful to examine how the distribution of projected data points varies depending on different projection directions.
Consider the line represented by the equation:
By observing the projected points along these different lines, we can gain insight into how the choice of $ \omega $ influences class separation and data spread. This visualization helps build intuition before proceeding to identify the optimal $ \omega $.
Projection onto x-axis
# x-axis
w = np.array([[1], [0]])
w = np.asmatrix(w)
plt.figure(figsize = (6, 4))
plt.plot(x0[0,:], x0[1,:], 'r.', alpha = 0.3)
plt.plot(x1[0,:], x1[1,:], 'b.', alpha = 0.3)
plt.axhline(0, color = 'k')
plt.xlim([-2, 6])
plt.ylim([-1, 4.5])
plt.show()
Histogram of projected data
y1 = x0.T*w
y2 = x1.T*w
plt.figure(figsize = (6, 4))
plt.hist(y1, 21, color = 'r', rwidth = 0.5, alpha = 0.4)
plt.hist(y2, 21, color = 'b', rwidth = 0.5, alpha = 0.4)
plt.show()
Projection onto y-axis
# y-axis
w = np.array([[0], [1]])
w = np.asmatrix(w)
plt.figure(figsize = (6, 4))
plt.plot(x0[0,:], x0[1,:], 'r.', alpha = 0.3)
plt.plot(x1[0,:], x1[1,:], 'b.', alpha = 0.3)
plt.axvline(0, color = 'k')
plt.xlim([-2, 6])
plt.ylim([-1, 4.5])
plt.show()
y1 = x0.T*w
y2 = x1.T*w
plt.figure(figsize = (6, 4))
plt.hist(y1, 21, color = 'r', rwidth = 0.5, alpha = 0.4)
plt.hist(y2, 21, color = 'b', rwidth = 0.5, alpha = 0.4)
plt.show()
In both cases, the histograms of the two projected classes overlap, which leads to potential misclassification when the data is projected into a lower-dimensional space.
Now, let's examine the optimal projection line determined by Fisher Discriminant Analysis (FDA) and analyze the corresponding distribution of projected data to evaluate its effectiveness in improving classification performance.
Projection line by FDA
mu0 = np.mean(x0, axis = 1)
mu1 = np.mean(x1, axis = 1)
S0 = 1/(n0 - 1)*(x0 - mu0)*(x0 - mu0).T
S1 = 1/(n1 - 1)*(x1 - mu1)*(x1 - mu1).T
w = (n0*S0 + n1*S1).I*(mu0 - mu1)
print(w)
plt.figure(figsize = (6, 4))
plt.plot(x0[0,:], x0[1,:], 'r.', alpha = 0.3)
plt.plot(x1[0,:], x1[1,:], 'b.', alpha = 0.3)
xp = np.arange(-4, 6, 0.1)
yp = w[1,0]/w[0,0]*xp
plt.plot(xp, yp, 'k')
plt.xlim([-2, 6])
plt.ylim([-1, 4.5])
plt.show()
Histogram
y1 = x0.T*w
y2 = x1.T*w
plt.figure(figsize = (6, 4))
plt.hist(y1, 21, color = 'r', rwidth = 0.5, alpha = 0.4)
plt.hist(y2, 21, color = 'b', rwidth = 0.5, alpha = 0.4)
plt.show()
Scikit-learn
# reshape data
X = np.vstack([x0.T, x1.T])
y = np.vstack([np.ones([n0, 1]), np.zeros([n1, 1])])
from sklearn import discriminant_analysis
X = np.array(X)
clf = discriminant_analysis.LinearDiscriminantAnalysis()
clf.fit(X, np.ravel(y))
Histogram
# projection
X_LDA = clf.transform(X)
plt.figure(figsize = (6, 4))
plt.hist(X_LDA[0:200], 21, color = 'r', rwidth = 0.5, alpha = 0.4)
plt.hist(X_LDA[200:400], 21, color = 'b', rwidth = 0.5, alpha = 0.4)
plt.show()
Classification boundary
# classifier boundary, not a projection line
clf_w = clf.coef_[0]
clf_w0 = clf.intercept_[0]
print(clf_w)
print(clf_w0)
Projection line
# projection line
mu0 = np.mean(x0, axis = 1)
mu1 = np.mean(x1, axis = 1)
S0 = 1/(n0 - 1)*(x0 - mu0)*(x0 - mu0).T
S1 = 1/(n1 - 1)*(x1 - mu1)*(x1 - mu1).T
prj_w = (n0*S0 + n1*S1).I*(mu0 - mu1)
xp = np.arange(-4, 6, 0.1)
prj_yp = prj_w[1,0]/prj_w[0,0]*xp
clf_yp = - clf_w[0]/clf_w[1]*xp - clf_w0/clf_w[1]
plt.figure(figsize = (6, 4))
plt.plot(x0[0,:], x0[1,:], 'r.', alpha = 0.3)
plt.plot(x1[0,:], x1[1,:], 'b.', alpha = 0.3)
plt.plot(xp, prj_yp, 'k', label = 'LDA projection line')
plt.plot(xp, clf_yp, 'g', label = 'LDA classification boundary')
plt.legend()
plt.xlim([-2, 6])
plt.ylim([-1, 4.5])
plt.show()
We have studied Fisher Discriminant Analysis (FDA), exploring both its theoretical foundation and practical implementation. While FDA facilitates dimensionality reduction, it is important to emphasize that FDA operates in a supervised manner. Unlike unsupervised methods such as PCA, the primary objective of FDA is not merely to reduce dimensionality but to achieve effective class separation and improve classification performance.
from IPython.display import YouTubeVideo
YouTubeVideo('sFym4MYOQ-o', width = "560", height = "315")
Geometry of Linear Maps
In the context of linear algebra, the matrix $ A \in \mathbb{R}^{m \times n} $ represents a linear transformation that maps vectors from an $ n $-dimensional space to an $ m $-dimensional space.
The matrix $ A $ can be interpreted as performing a combination of the following transformations:
(1) Rotation:
(2) Scaling (Stretching/Compression):
(3) Shearing (Skewing):
Mapping the Unit Ball to an Ellipsoid
The unit ball in $ \mathbb{R}^n $, defined as:
is transformed by $ A $ into an ellipsoid in $ \mathbb{R}^m $:
This geometric interpretation highlights that linear transformations distort space by rotating, scaling or skewing.
Singular Values and Singular Vectors
The Singular Value Decomposition (SVD) provides a structured way to describe this transformation. The key elements of the SVD are:
the numbers $\sigma_1, \cdots, \sigma_n$ are called the singular values of $A$; by convention, $\sigma_i > 0$
the vectors $u_1, \cdots, u_n$ are unit vectors along the principal semiaxes of $AS$
the unit vectors $v_1, \cdots, v_n$ are the preimages of the principal semiaxes
These satisfy the following relationship:
Matrix Formulation of SVD
By combining these vectors and values into matrices:
This leads to the compact matrix form:
Where:
Thin Singular Value Decomposition
In the case where $A \in \mathbb{R}^{m \times n}$ is skinny (more rows than columns) and full rank (i.e., $r = n$), we define:
Let:
In matrix form,
Since $V$ is orthogonal, we obtain the thin (or reduced) SVD of $A$:
Full Singular Value Decomposition
In addition to the thin SVD, we can extend the decomposition to form the full SVD by adding extra components to ensure that the resulting matrices are fully orthogonal.
Suppose $A \in \mathbb{R}^{m \times n}$ with $m \geq n$.
To construct the full SVD:
(1) Extend $U$:
$\quad \,$Add extra orthonormal columns to $\hat{U}$ to create a complete orthogonal matrix $U$.
$\quad \,$Since $U$ is now an orthogonal matrix, $UU^T = I_m$.
(2) Extend $\Sigma$:
$\quad \,$Add extra rows of zeros to the diagonal matrix $\hat{\Sigma}$, ensuring the dimensions align with $U$.
(3) Combine in Matrix Form:
$\quad \,$Using these augmented matrices, the full SVD is defined as:
Where:
This is the (full) singular value decomposition of $A$
Every matrix $A$ has a singular value decomposition.
The full SVD extends the thin SVD by adding orthogonal components to $U$ and zero rows in $\Sigma$.
Interpretation of SVD
The Singular Value Decomposition (SVD) provides an insightful way to understand a linear transformation $Ax$ by breaking it down into three distinct steps:
Step 1: Rotation by $V^T$
The first transformation is a rotation of the input vector using $V^T$.
This step aligns the data with the principal directions in the row space of the matrix.
Step 2: Diagonal Scaling by $\Sigma$
The second transformation applies scaling by the diagonal matrix $\Sigma$.
Each coordinate is stretched or compressed by the corresponding singular value $\sigma_i$.
These singular values represent the magnitude of the transformation in the corresponding direction.
Step 3: Rotation by $U$
The final transformation is another rotation by the orthogonal matrix $U$.
This step aligns the scaled data with the principal directions in the column space.
SVD factorizes the matrix $ A $ into three distinct components: an orthogonal matrix $ U $, a diagonal matrix $ \Sigma $ containing singular values, and another orthogonal matrix $ V^T $. This decomposition effectively represents $ A $ as a sequence of three linear transformations: a rotation by $ V^T $, followed by a scaling by $ \Sigma $, and then another rotation by $ U $.
(1) Any real symmetric and positive definite matrix $B \in \mathbb{R}^{n \times n}$ can be decomposed into its eigen decomposition as follows:
where:
(2) A real matrix $A \in \mathbb{R}^{m \times n}$ (when $m>n$) has a singular value decomposition (SVD) given by:
where:
From the matrix $A$ (assuming $A$ is skinny and full rank), we can construct two positive-definite symmetric matrices:
For $AA^T \in \mathbb{R}^{m \times m}$, $U$ is the matrix of eigenvectors of $AA^T$
For $A^TA \in \mathbb{R}^{n \times n}$, $V$ is the matrix of eigenvectors of $A^TA$
PCA via SVD
Recall that PCA relies on the covariance matrix:
By scaling the data matrix $A = \frac{X}{\sqrt m}$, the covariance matrix becomes:
Using the SVD result:
Summary
The eigenvectors of $A^TA$ (from the SVD) provide the principal components in PCA.
The squared singular values of $A$ correspond to the eigenvalues in PCA.
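As a quick numerical sanity check (with a small hypothetical matrix A_demo), the eigenvalues of $A^TA$ match the squared singular values of $A$:
# eigen-decomposition of A^T A  vs  SVD of A
A_demo = np.random.randn(100, 3)

eig_vals, eig_vecs = np.linalg.eig(A_demo.T @ A_demo)
_, S_demo, VT_demo = np.linalg.svd(A_demo, full_matrices = False)

print(np.sort(eig_vals)[::-1])   # eigenvalues of A^T A, in descending order
print(S_demo**2)                 # squared singular values of A (should match)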
We have established the relationship between PCA and SVD. Suppose $A \in \mathbb{R}^{m \times n}$ where $m \gg n$.
To compute the SVD:
$U$: The left singular vectors of $A$
$V$: The right singular vectors of $A$
$\Sigma$: Diagonal matrix of singular values, $\sigma_i = \sqrt{\lambda_i}$
Key Observations:
The columns of $U$ are the eigenvectors of $AA^T \in \mathbb{R}^{m \times m}$.
The columns of $V$ are the eigenvectors of $A^T A \in \mathbb{R}^{n \times n}$.
The diagonal elements of $\Sigma$ are the singular values, $\sigma_i = \sqrt{\lambda_i}$.
Computational Challenge
If $n$ is small, computing $V$ via np.linalg.eig(A.T*A) is feasible since $A^T A \in \mathbb{R}^{n \times n}$. However, computing $U$ directly from np.linalg.eig(A*A.T) is inefficient because $AA^T \in \mathbb{R}^{m \times m}$ becomes extremely large when $m \gg n$.
Efficient Solution
Instead of computing $U$ directly via np.linalg.eig(A*A.T), we can leverage the relationship between $U$ and $V$:
Where:
Why Does This Work?
Starting from the known relationship:
By multiplying both sides by $A$,
This confirms that $A v_i$ is an eigenvector of $AA^T$ with eigenvalue $\lambda_i$.
However, this vector is not yet normalized. To normalize it:
This ensures that $u_i$ has unit length.
This method efficiently computes the left singular vectors $U$ without directly forming the large matrix $AA^T$, making it ideal when $m \gg n$.
Example
Consider the following matrix:
From the decomposition:
Combining these,
To determine $V$ through eigen-analysis
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
A = np.array([[4, 4],
[-3, 3]])
A = np.asmatrix(A)
[D, V] = np.linalg.eig(A.T*A)
print(V, '\n')
print(D)
To find $U$ from eigen-analysis
[D, U] = np.linalg.eig(A*A.T)
print(U, '\n')
print(D)
Instead of computing $U$ directly via np.linalg.eig(A*A.T), we can leverage the relationship between $U$ and $V$:
sigma = np.sqrt(D)
Sigma = np.diag(sigma)
U = A @ V @ np.linalg.inv(Sigma)
print(U)
However, how to compute the SVD numerically is not the focus of this course. Instead, we will use NumPy's built-in np.linalg.svd() function to compute the SVD directly. By doing so, we can concentrate on understanding the underlying theory and practical applications of SVD.
The function np.linalg.svd() performs the SVD as follows:
[U, S, VT] = np.linalg.svd(A, full_matrices = True)
print(U.shape, S.shape, VT.shape, '\n')
print(U, '\n')
print(S, '\n')
print(VT)
MIT 18.06 Linear Algebra by Prof. Gilbert Strang
from IPython.display import YouTubeVideo
YouTubeVideo('Nx0lRBaXoz4', width = "560", height = "315")
In the 2D case, consider a vector $x$ that can be expressed in terms of the basis vectors $v_1$ and $v_2$ (where $\hat v$ denotes a unit vector, though this notation is omitted here for simplicity):
Equivalently,
Since inner products can be rewritten in matrix notation,
SVD Transformation
Next, consider the effect of matrix $A$ on the vector $x$:
By the SVD property,
Thus,
Equivalently,
In matrix form,
Low-Rank Approximation in General
In practice, if one singular value dominates the others ($\sigma_1 \gg \sigma_2$), we can approximate $A$ with a rank-1 approximation:
General Low-Rank Approximation of Matrix $A$
For a general matrix $A$, the SVD decomposition allows for constructing a low-rank approximation by retaining only the top $k$ singular values:
A = np.array([[11.08, 6.82, 1.76, -6.82],
[2.50, -1.01, -2.60, 1.19],
[-4.88, -5.07, -3.21, 5.20],
[-0.49, 1.52, 2.07, -1.66],
[-14.04, -12.40, -6.66, 12.65],
[0.27, -8.51, -10.19, 9.15],
[9.53, -9.84, -17.00, 11.00],
[-12.01, 3.64, 11.10, -4.48]])
[U, S, VT] = np.linalg.svd(A, full_matrices = True)
U = np.asmatrix(U)
S = np.asmatrix(np.diag(S))
VT = np.asmatrix(VT)
print(U.shape, S.shape, VT.shape, '\n')
print(U, '\n')
print(S, '\n')
print(VT)
[U, S, VT] = np.linalg.svd(A, full_matrices = False)
U = np.asmatrix(U)
S = np.asmatrix(np.diag(S))
VT = np.asmatrix(VT)
print(U.shape, S.shape, VT.shape, '\n')
print(U, '\n')
print(S, '\n')
print(VT)
Question
How to determine a good $k$ so that we can efficiently compress $A$ without losing too much information of $A$?
sig_val = np.diag(S)
plt.figure(figsize = (6, 4))
plt.stem(range(1,5), sig_val)
plt.xticks(range(1,5))
plt.title('Singular Values')
plt.show()
When singular values decay rapidly, retaining only the top $k$ singular values can provide an accurate yet efficient approximation.
k = 2
low_U = U[:, :k]
low_S = S[:k, :k]
low_VT = VT[:k, :]
A_approx = low_U*low_S*low_VT
print(low_U.shape, low_S.shape, low_VT.shape, '\n')
print('A:\n', A ,'\n')
print('A_approx:\n', A_approx, '\n')
print('low_U:\n', low_U, '\n')
print('low_VT:\n', low_VT)
Thin (or Economy) SVD
Truncated SVD (Low Rank Approximation)
$$Y \approx U_r \Sigma_r V_r^T$$
Image as a Matrix
An image can be represented as a matrix $A$, where each entry corresponds to the intensity of a pixel for grayscale images. By applying SVD, we decompose the image matrix as follows:
Image Approximation with Rank-$k$ Decomposition
Instead of retaining the full rank of $A$, we can approximate it using only the top $k$ singular values:
Download the image file
from google.colab import drive
drive.mount('/content/drive')
import cv2
img = cv2.imread('/content/drive/MyDrive/ML/ML_data/kaist_me.jpg', 0)
print(img.shape)
plt.figure(figsize = (6, 6))
plt.imshow(img, 'gray')
plt.axis('off')
plt.show()
SVD
A = img
[U, S, VT] = np.linalg.svd(A, full_matrices = False)
U = np.asmatrix(U)
S = np.asmatrix(np.diag(S))
VT = np.asmatrix(VT)
print(U.shape, S.shape, VT.shape, '\n')
Plot Singular Values
sig_val = np.diag(S)
plt.figure(figsize = (6, 4))
plt.stem(sig_val)
plt.title('Singular Values')
plt.show()
Approximated image when $k = 20$
k = 20
low_U = U[:, :k]
low_S = S[:k, :k]
low_VT = VT[:k, :]
A_hat = low_U*low_S*low_VT
plt.figure(figsize = (8, 5))
plt.subplot(1,2,1)
plt.imshow(A, 'gray')
plt.axis('off')
plt.title('Original image')
plt.subplot(1,2,2)
plt.imshow(A_hat, 'gray')
plt.axis('off')
plt.title('Approximated image w/ rank = 20')
plt.tight_layout()
plt.show()
Approximated images with $k$ varied
K = [1, 2, 3, 4, 10, 20, 30, 100]
plt.figure(figsize = (10, 5))
for i in range(len(K)):
low_U = U[:, :K[i]]
low_S = S[:K[i], :K[i]]
low_VT = VT[:K[i], :]
A_hat = low_U*low_S*low_VT
plt.subplot(2,4,i+1), plt.imshow(A_hat, 'gray'), plt.axis('off')
plt.title('rank {}'.format(K[i]))
plt.tight_layout()
plt.show()
Discussion
Each image is reconstructed using a progressively higher rank approximation, capturing increasing levels of detail.
(1) Rank 1 Image:
Captures only the most dominant pattern or trend in the image - primarily the strongest intensity gradients (e.g., horizontal patterns or key contrast areas).
This is because the first singular value and its corresponding singular vectors dominate the overall structure.
(2) Rank 2 to Rank 4 Images:
(3) Rank 10 to Rank 30 Images:
(4) Rank 100 Image:
Image Compression Using Low-Rank Approximation with SVD
Suppose a satellite captures an image that needs to be transmitted to the ground station. Given the constraints of limited communication bandwidth, it becomes crucial to minimize the amount of transmitted data without significant loss of image quality. In this scenario, low-rank approximation using SVD offers an effective solution.
Instead of transmitting the entire matrix $A$, we only transmit:
The transmitted data is significantly smaller:
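As a rough sketch of the savings (with hypothetical numbers: an $m \times n$ image and rank $k$), transmitting $U_k$, the $k$ singular values, and $V_k^T$ requires far fewer values than the full matrix:
# hypothetical image size and rank
m_img, n_img, k_img = 600, 800, 20

full_cost = m_img*n_img                            # transmit the full image
low_rank_cost = m_img*k_img + k_img + k_img*n_img  # transmit U_k, the k singular values, and V_k^T

print('full image:', full_cost, 'values')
print('rank-k version:', low_rank_cost, 'values')
print('compression ratio: {:.1f}x'.format(full_cost / low_rank_cost))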
This example demonstrates how SVD can effectively isolate the background from a series of video frames by leveraging low-rank approximation. The objective is to remove moving objects (people) from the images while preserving the static background.
The dataset consists of 20 sequential images (or frames) captured from a video showing people walking. These images contain both dynamic (moving) elements and a static background. By applying SVD, we aim to identify and extract the dominant structure - the background - while minimizing the impact of moving objects.
Download and unzip data
Step 1: Display the Images
from google.colab import drive
drive.mount('/content/drive')
import numpy as np
import cv2
import matplotlib.pyplot as plt
%matplotlib inline
# Plot the first 'capimage_0.jpg' image in grayscale.
data_path = '/content/drive/MyDrive/ML/ML_data'
img = cv2.imread(data_path + '/capimage_0.jpg', 0)
plt.figure(figsize = (6, 4))
plt.imshow(img, 'gray')
plt.axis('off')
plt.show()
print(img.shape)
# Plot the sequential 6 images in grayscale.
plt.figure(figsize = (8, 4))
for i in range(6):
img_name = ['capimage_' + str(i) + '.jpg']
img = cv2.imread(data_path + '/' + img_name[0], 0)
plt.subplot(2, 3, i+1)
plt.imshow(img, 'gray')
plt.title(img_name[0])
plt.axis('off')
plt.tight_layout()
plt.show()
Step 2: Construct Matrix $A$
In the previous example, an image was represented as a matrix $A$, allowing us to apply SVD for low-rank approximation. However, in this example, we are dealing with 20 images, each represented as its own matrix. This introduces a new challenge: how to efficiently apply SVD to multiple image matrices simultaneously.
The brilliant (yet conceptually straightforward) idea is to flatten each image into a column vector and then horizontally stack these vectors to construct a new matrix $A$. This consolidated matrix enables us to apply SVD efficiently.
At first glance, you might hesitate since flattening seemingly disrupts the inherent spatial structure of the images. Surprisingly, however, it works well!
Load all 20 grayscale images.
Reshape each image into a column vector of shape $(\text{row} \times \text{col},1)$.
Horizontally stack these vectors to form matrix $A$, resulting in a shape of:
flat_img = []
for i in range(20):
    img = cv2.imread(data_path + '/capimage_' + str(i) + '.jpg', 0)
    flat_img.append(img.reshape(-1, 1))   # flatten each image into a column vector

A = np.hstack(flat_img)                   # stack the column vectors to form matrix A
row = img.shape[0]
col = img.shape[1]
print(row)
print(col)
print(A.shape)
Step 3: Compute SVD and Visualize Singular Values
U, S, VT = np.linalg.svd(A, full_matrices = False)
U = np.asmatrix(U)
S = np.asmatrix(np.diag(S))
VT = np.asmatrix(VT)
print(U.shape, S.shape, VT.shape, '\n')
plt.figure(figsize = (6, 4))
plt.stem(np.diag(S))
plt.title('Singular Values')
plt.show()
Step 4: Low-Rank Approximation with $k = 1$
Using only the first singular value, reconstruct the background using the formula:
Extract the first column of $\hat A$, which corresponds to the reconstructed background of the first frame.
Reshape it back to its original image dimensions and display the result.
# Plot A_hat[:,0] with a reshape of (row, col)
A_hat = U[:,0] * S[0,0] * VT[0,:]
plt.figure(figsize = (6, 8))
plt.imshow(A_hat[:, 0].reshape(row, col), 'gray')
plt.axis('off')
plt.show()
Discussion
The first singular component captures the most dominant pattern across all frames - in this case, the stationary background.
Moving objects (like people walking) are treated as noise or high-frequency variations and are effectively removed by the low-rank approximation.
This is an impressive example. I personally had never envisioned such an innovative approach to effectively remove moving objects from images.
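As a natural follow-up sketch (reusing A, A_hat, row, and col from the cells above), subtracting the rank-1 background from a frame leaves mostly the moving objects:
# foreground of the first frame = original frame - rank-1 background
foreground = A[:, 0].reshape(row, col) - A_hat[:, 0].reshape(row, col)

plt.figure(figsize = (6, 8))
plt.imshow(np.abs(foreground), 'gray')
plt.title('Moving objects (residual)')
plt.axis('off')
plt.show()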
Eigenfaces is a well-known technique in computer vision that applies PCA or SVD for facial recognition. The method is based on the idea that faces can be represented as combinations of key facial features, called eigenfaces. By leveraging SVD, Eigenfaces effectively reduces the dimensionality of face images while retaining the most important features for identification.
Concept of Eigenfaces
The key idea behind Eigenfaces is that any face can be represented as a linear combination of a set of “basis” faces - the eigenfaces - which capture the most dominant features in the dataset.
Given a set of face images:
Where:
In this example, we will use a set of 12 images of U.S. Presidents to demonstrate the concept of Eigenfaces.
from google.colab import drive
drive.mount('/content/drive')
import numpy as np
from scipy import io
import matplotlib.pyplot as plt
%matplotlib inline
faces = io.loadmat('/content/drive/MyDrive/ML/ML_data/faces.mat')
face_mat = faces['faces']
row, col, n = face_mat.shape
plt.figure(figsize = (6, 8))
for i in range(n):
plt.subplot(4, 3, i+1), plt.axis('off')
plt.imshow(face_mat[:, :, i], 'gray')
plt.show()
Build matrix $A$ out of 12 images
A = np.reshape(face_mat, [row*col, n])
Compute the Eigenfaces Using SVD
Given a set of facial images stored as matrix $A$, compute the SVD:
Where:
[U, S, VT] = np.linalg.svd(A, full_matrices = False)
U = np.asmatrix(U)
S = np.asmatrix(np.diag(S))
VT = np.asmatrix(VT)
print(U.shape, S.shape, VT.shape, '\n')
Eigenface or eigenmode from $U$
plt.figure(figsize = (6, 8))
for i in range(n):   # n = 12 face images
plt.subplot(4, 3, i+1), plt.axis('off')
plt.imshow(U[:, i].reshape(row, col), 'gray')
plt.show()
Singular values
sig_val = np.diag(S)
plt.figure(figsize = (6, 4))
plt.stem(sig_val)
plt.title('Singular Values')
plt.show()
Transmit the compressed information
Encode the Image Using Weights
obama = face_mat[:,:,11]
plt.figure(figsize = (3, 4))
plt.imshow(obama, 'gray')
plt.axis('off')
plt.show()
Transmit the Encoded Data
For a given image $x$, project the image onto the eigenface basis:
Where:
Instead of transmitting the entire image $x$, only the coefficient vector $z$ is transmitted.
The vector $z$ is smaller in size because the SVD retains only the most significant features, discarding noise or less important details.
# sender
x = np.reshape(obama, [row*col, 1])
eigenface_decom = U.T*x
print(eigenface_decom.shape)
Reconstruct the Image at the Receiver’s End
At the receiver’s end, reconstruct the original image using the transmitted coefficient vector:
Where:
# receiver
reconst = U*eigenface_decom
plt.figure(figsize = (3, 4))
plt.imshow(reconst.reshape(row, col), 'gray')
plt.axis('off')
plt.show()
Choosing the Rank $k$ for Reconstruction
In practice, selecting an optimal $k$ balances compression efficiency with visual quality.
k = 9
# sender
x = np.reshape(obama, [row*col, 1])
eigenface_decom = U[:,:k].T*x
# receiver
reconst = U[:,:k]*eigenface_decom
plt.figure(figsize = (3, 4))
plt.imshow(reconst.reshape(row, col), 'gray')
plt.axis('off')
plt.show()
Discussion
SVD efficiently separates meaningful facial features from noise.
Encoding the image as coefficients (weights) significantly reduces the transmitted data.
The method achieves excellent compression without severe loss of image quality.
$U$ contains the eigenmodes.
The Eigenfaces method is a classical yet effective technique for face recognition. While modern deep learning methods dominate large-scale face recognition tasks, Eigenfaces with SVD remains practical for scenarios with limited data or controlled environments.
Step 1: Project Image onto the Eigenface Subspace
Where:
Step 2: Face Identification Using Distance Metrics
Where:
Step 3: Classification Decision
When is the Eigenface Method Preferred Over Deep Learning?
Although deep learning dominates face recognition tasks today, Eigenfaces is particularly useful in the following scenarios:
Practical Example
Suppose you need to build a face recognition system for identifying employees in your company:
Here, I present an interesting scenario: President John F. Kennedy is wearing sunglasses when he enters the White House.
Download the image file
import cv2
jfk = cv2.imread('/content/drive/MyDrive/ML/ML_data/jfk.png', 0)
plt.figure(figsize = (3, 4))
plt.imshow(jfk, 'gray')
plt.axis('off')
plt.show()
Although our database contains his face image without sunglasses, our system must still successfully recognize him. Let's see how we can achieve this.
# projected coefficients for the new face
x = np.reshape(jfk, [row*col, 1])
eigenface_decom = U.T*x
# compute Euclidean distance to all training set
distances = []
for i in range(A.shape[1]):
train_image_decomposition = U.T @ A[:, i].reshape(-1, 1)
distance = np.linalg.norm(eigenface_decom - train_image_decomposition)
distances.append(distance)
argmin_index = np.argmin(distances)
print(f"The input image is most similar to training image {argmin_index}")
We can go even further.
Using the low-rank approximation properties of SVD, we can reconstruct his face by emphasizing dominant features while filtering out subtle variations, such as sunglasses.
This method effectively reveals the underlying facial structure, demonstrating how SVD can be used not only for recognition but also for image reconstruction and enhancement.
Projection onto $U$ and Reconstruct
x = np.reshape(jfk, [row*col, 1])
eigenface_decom = U.T*x
reconst = U*eigenface_decom
plt.figure(figsize = (9, 4))
plt.subplot(1,3,1)
plt.imshow(jfk, 'gray')
plt.title('Disguised')
plt.axis('off')
plt.subplot(1,3,2)
plt.imshow(reconst.reshape(row, col), 'gray')
plt.title('Reconstructed')
plt.axis('off')
plt.subplot(1,3,3)
plt.imshow(A[:,2].reshape(row, col), 'gray')
plt.title('Original')
plt.axis('off')
plt.show()
The Proper Orthogonal Decomposition (POD) is a powerful method used to analyze data that varies across both space and time. This technique effectively identifies dominant patterns within complex systems and is widely used in fields such as fluid dynamics, structural mechanics, and signal processing.
Suppose we have data that is a function of both space and time:
Our objective is to decompose this system into its fundamental building blocks.
A natural starting point is to express the system using a separation of variables approach:
Where
We aim to approximate most of the data using the smallest number of terms $r$.
Arrangement of Data
To implement POD efficiently, we organize the data into a matrix form:
Suppose measurements are taken at spatial locations $x_1, x_2, \cdots, x_n$
Suppose measurements are recorded at times $t_1, t_2, \cdots, t_m$
(typically after first subtracting a mean or equilibrium state)
We can construct a data matrix $Y$ with shape $n \times m$
POD Algorithm
This decomposition can be interpreted as:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# data generated from the function
x = np.linspace(0, 5, 100)
Nx = np.size(x)
dt = 0.01
t = np.arange(0, 5, dt)
k = 2
omega = 4*np.pi
Y = np.zeros([np.size(x), np.size(t)], dtype = 'd')
for i in range(0, np.size(x)):
for j in range(0, np.size(t)):
Y[i, j] = np.sin(omega*t[j] - k*x[i]) + 0.1*np.random.randn()
print(Y.shape)
Plot
# Let's see the signal in time
from matplotlib.animation import FuncAnimation
import time
from IPython import display
fig = plt.figure(figsize = (6, 4))
ax1 = fig.add_subplot(1, 1, 1)
def animate(i):
ax1.clear()
ax1.plot(x, Y[:,i], linewidth = 2)
ax1.set_xlabel('x')
ax1.set_xlim(0, 5)
ax1.set_ylim(-2, 2)
ax1.grid()
ani = FuncAnimation(fig, animate, frames = int(np.size(t)/1), interval = 10)
video = ani.to_html5_video()
html = display.HTML(video)
display.display(html)
plt.close()
Perform the SVD
U, S, VT = np.linalg.svd(Y, full_matrices = False)
print(U.shape)
print(S.shape)
print(VT.shape)
Plot the Singular Values
plt.figure(figsize = (6, 4))
plt.plot(range(1, 101), S, 'o-')
plt.xticks(np.arange(6))
plt.xlim(0.5, 5.5)
plt.ylabel('Singular Value')
plt.xlabel('Index')
plt.grid(alpha = 0.3)
plt.show()
To achieve an efficient representation, we truncate the SVD by selecting only the top 2 singular values:
Plot $u_1(x)$
plt.figure(figsize = (6, 4))
plt.plot(x, U[:,0])
plt.title('POD Mode 1')
plt.xlabel('x')
plt.ylim([-2, 2])
plt.grid()
plt.show()
Plot $u_2(x)$
plt.figure(figsize = (6, 4))
plt.plot(x, U[:,1])
plt.title('POD Mode 2')
plt.xlabel('x')
plt.ylim([-2, 2])
plt.grid()
plt.show()
Plot $a_1(t)$ and $a_2(t)$
plt.figure(figsize = (6, 4))
plt.plot(t, S[0]*VT[0,:], label = 'a1(t)', linewidth = 2)
plt.plot(t, S[1]*VT[1,:], label = 'a2(t)', linewidth = 2)
plt.xlim([0, 5])
plt.xlabel('time')
plt.title('Mode Coefficients')
plt.legend(loc = 4)
plt.grid()
plt.show()
Low-Rank Approximation
U = np.asmatrix(U)
diagS = np.asmatrix(np.diag(S))
VT = np.asmatrix(VT)
print(U.shape, diagS.shape, VT.shape, '\n')
k = 2
low_U = U[:, :k]
low_S = diagS[:k, :k]
low_VT = VT[:k, :]
low_Y = low_U*low_S*low_VT
print(low_U.shape, low_S.shape, low_VT.shape, '\n')
# Let's see the signal in time
from matplotlib.animation import FuncAnimation
import time
from IPython import display
fig = plt.figure(figsize = (6, 4))
ax1 = fig.add_subplot(1, 1, 1)
def animate(i):
ax1.clear()
ax1.plot(x, low_Y[:,i], linewidth = 2)
ax1.set_xlim(0, 5)
ax1.set_ylim(-2, 2)
ax1.grid()
ani = FuncAnimation(fig, animate, frames = int(np.size(t)/1), interval = 10)
video = ani.to_html5_video()
html = display.HTML(video)
display.display(html)
plt.close()
Discussion
The POD method effectively demonstrates its ability to extract meaningful patterns from data in a data-driven manner. The key outcomes and insights from this example are summarized below:
(1) Noise Reduction
By retaining only the dominant modes in the decomposition, POD successfully filters out noise and reconstructs the underlying signal with improved clarity.
This denoising capability arises naturally from the fact that noise typically resides in the smaller singular values, which are discarded during low-rank approximation.
(2) Data-Driven Decomposition
Note: This data-driven decomposition closely aligns with a well-known trigonometric identity (see below).
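Presumably the identity in question is the angle-difference expansion
$$\sin(\omega t - kx) = \sin(\omega t)\cos(kx) - \cos(\omega t)\sin(kx),$$
which already has the separated space-time form that POD recovers from the noisy data: the signal is exactly a sum of two space-time-separable modes, consistent with the two dominant singular values observed above.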
This example is adapted from the textbook "Data-Driven Science and Engineering" by Steven L. Brunton and J. Nathan Kutz.
Data Description and Visualization
The dataset provided contains flow field data for fluid flow past a circular cylinder at a Reynolds number of Re = 100. The data file, CYLINDER_ALL.mat, includes 151 snapshots of velocity components and vorticity fields:
The flow is studied at a Reynolds number of Re = 100, a regime where vortex shedding occurs periodically and can be captured effectively using low-rank approximations.
from google.colab import drive
drive.mount('/content/drive')
from scipy import io
flows = io.loadmat('/content/drive/MyDrive/ML/ML_data/CYLINDER_ALL.mat')
flows_mat_u = flows['UALL']
flows_mat_v = flows['VALL']
flows_mat_vort = flows['VORTALL']
flows_mat_u.shape
nx = 449
ny = 199
plt.figure(figsize = (6, 4))
plt.subplot(1,3,1)
plt.imshow(flows_mat_u[:,150].reshape(nx, ny), 'gray')
plt.title('u velocity')
plt.axis('off')
plt.subplot(1,3,2)
plt.imshow(flows_mat_v[:,150].reshape(nx, ny), 'gray')
plt.title('v velocity')
plt.axis('off')
plt.subplot(1,3,3)
plt.imshow(flows_mat_vort[:,150].reshape(nx, ny), 'gray')
plt.title('vorticity')
plt.axis('off')
plt.show()
# Let's see the flow in time
from matplotlib.animation import FuncAnimation
import time
from IPython import display
fig = plt.figure(figsize = (6, 4))
ax1 = fig.add_subplot(1, 3, 1)
ax2 = fig.add_subplot(1, 3, 2)
ax3 = fig.add_subplot(1, 3, 3)
def animate(i):
ax1.clear()
ax1.imshow(flows_mat_u[:,i].reshape(nx, ny), 'gray')
ax1.set_title('u velocity')
ax1.axis('off')
ax2.clear()
ax2.imshow(flows_mat_v[:,i].reshape(nx, ny), 'gray')
ax2.set_title('v velocity')
ax2.axis('off')
ax3.clear()
ax3.imshow(flows_mat_vort[:,i].reshape(nx, ny), 'gray')
ax3.set_title('vorticity')
ax3.axis('off')
ani = FuncAnimation(fig, animate, frames = int(150/1), interval = 10)
video = ani.to_html5_video()
html = display.HTML(video)
display.display(html)
plt.close()
From now on, we will focus on vorticity rather than u and v velocity.
It is common practice to subtract the mean or equilibrium state for clearer analysis.
flows_mat_vort_normalized = (flows_mat_vort - flows_mat_vort.min()) / (flows_mat_vort.max() - flows_mat_vort.min())
flows_mat_vort_normalized_mean = flows_mat_vort_normalized.mean(axis = 1)
flows_mat_vort_normalized_centered = flows_mat_vort_normalized - flows_mat_vort_normalized_mean[:, None]
plt.imshow(flows_mat_vort_normalized_mean.reshape(nx, ny), 'gray')
plt.title('vorticity')
plt.axis('off')
plt.show()
Perform the POD
flows_mat = flows_mat_vort_normalized_centered
U, S, VT = np.linalg.svd(flows_mat, full_matrices = False)
print(U.shape)
print(S.shape)
print(VT.shape)
Plot Singular Values
plt.figure(figsize = (6, 4))
plt.semilogy(S, '-o')
plt.ylabel('Singular Value')
plt.xlabel('Index')
plt.show()
Plot Several Spatial Modes
plt.figure(figsize = (6, 4))
plt.subplot(1,4,1)
plt.imshow(U[:,0].reshape(nx, ny), 'gray')
plt.title('POD Mode 1')
plt.axis('off')
plt.subplot(1,4,2)
plt.imshow(U[:,1].reshape(nx, ny), 'gray')
plt.title('POD Mode 2')
plt.axis('off')
plt.subplot(1,4,3)
plt.imshow(U[:,4].reshape(nx, ny), 'gray')
plt.title('POD Mode 5')
plt.axis('off')
plt.subplot(1,4,4)
plt.imshow(U[:,12].reshape(nx, ny), 'gray')
plt.title('POD Mode 13')
plt.axis('off')
plt.show()
Low Rank Approximation
k = 5
U = np.asmatrix(U)
diagS = np.asmatrix(np.diag(S))   # rebuild the diagonal matrix from the current singular values
VT = np.asmatrix(VT)

low_U = U[:, :k]
low_S = diagS[:k, :k]
low_VT = VT[:k, :]
print(low_U.shape, low_S.shape, low_VT.shape, '\n')
low_flows_mat = low_U*low_S*low_VT
# Let's see the flow in time
from matplotlib.animation import FuncAnimation
import time
from IPython import display
fig = plt.figure(figsize = (6, 4))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)
def animate(i):
ax1.clear()
ax1.imshow(flows_mat[:,i].reshape(nx, ny) + flows_mat_vort_normalized_mean.reshape(nx, ny), 'gray')
ax1.set_title('original')
ax1.axis('off')
ax2.clear()
ax2.imshow(low_flows_mat[:,i].reshape(nx, ny) + flows_mat_vort_normalized_mean.reshape(nx, ny), 'gray')
ax2.set_title('low rank')
ax2.axis('off')
ani = FuncAnimation(fig, animate, frames = int(150/1), interval = 10)
video = ani.to_html5_video()
html = display.HTML(video)
display.display(html)
plt.close()
Discussion
In this example, we apply POD to analyze the complex flow field dynamics of fluid flow past a circular cylinder at a Reynolds number of 100. This method effectively identifies dominant flow patterns and provides a low-dimensional representation of the system.
Key Takeaway from Dimension Reduction
Just as the Heliocentric model simplified our understanding of planetary motion by shifting to a better reference frame, effective dimension reduction techniques reveal meaningful insights by aligning data with its most informative structure.
By finding the right transformation or new coordinate system, complex data can become self-explanatory, improving both interpretation and predictive power.
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')