Dimension Reduction

Table of Contents



1. Statistics

Statistics is the branch of mathematics that deals with the collection, analysis, interpretation, presentation, and organization of data. It plays a crucial role in various fields, including science, business, social sciences, and machine learning.


In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('OWOzaQhdE3Y', width = "560", height = "315")
Out[ ]:

Statistics vs Probability


Probability

  • Definition: Probability is the branch of mathematics that deals with predicting the likelihood of future outcomes based on known conditions or assumptions.

  • Focus: Probability starts with a known model (e.g., theoretical distribution) and aims to predict possible outcomes.


Statistics

  • Definition: Statistics is the branch of mathematics that involves analyzing existing data to infer properties of the population or predict future outcomes.

  • Focus: Statistics starts with observed data and attempts to build models or draw conclusions about the underlying distribution.


1.1. Populations and Samples

In statistics, understanding the distinction between populations and samples is fundamental to data analysis and inference.

  • A population includes all the elements from a set of data

  • A parameter is a quantity computed from a population

    • mean, $\mu$
    • variance, $\sigma^2$

  • A sample is a subset of the population.

    • one or more observations
  • A statistic is a quantity computed from samples

    • sample mean, $\bar{x}$
    • sample variance, $s^2$
    • sample correlation, $S_{𝑥𝑦}$

Purpose of Sampling

  • Samples are used because measuring the entire population is often impractical or impossible.

  • To make inferences about a population without examining every member.


How to Generate Random Numbers (Samples or data)

  • Data sampled from population/process/generative model

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
In [ ]:
## random number generation (1D)
m = 1000;

# uniform distribution U(0,1)
x1 = np.random.rand(m,1);

# uniform distribution U(a,b)
a = 1;
b = 5;
x2 = a + (b-a)*np.random.rand(m,1);

# standard normal (Gaussian) distribution N(0,1^2)
# x3 = np.random.normal(0, 1, m)
x3 = np.random.randn(m,1);

# normal distribution N(5,2^2)
x4 = 5 + 2*np.random.randn(m,1);

# random integers
x5 = np.random.randint(1, 6, size = (1,m));

Histogram: A Graphical Representation of Data Distribution

A histogram is a powerful tool for visualizing the distribution of data. It represents the frequency of data points across defined intervals (bins). Unlike bar charts, histograms group continuous data into bins to reveal patterns such as skewness, modality, and outliers.




1.2. Multivariate Statistics


Data Representation

  • Each observation $x^{(i)}$ is treated as a column vector with multiple features (variables)
  • The dataset $X$ is organized as a matrix, where each row corresponds to a transposed observation vector $\left(x^{(i)}\right)^T$

$$ x^{(i)} = \begin{bmatrix}x_1^{(i)} \\ x_2^{(i)}\\ \vdots \end{bmatrix}, \quad X = \begin{bmatrix} -& (x^{(i)})^T & -\\ - & (x^{(i)})^T & -\\ & \vdots & \\ - & (x^{(m)})^T & -\end{bmatrix}$$


Sample Mean

  • The sample mean, denoted $\bar x$, is the average of all observation vectors $\left(x^{(i)}, x^{(2)}, \cdots , x^{(m)}\right)$

$$\text{sample mean} \; \bar x = \frac{x^{(1)} + x^{(2)} + \cdots + x^{(m)}}{m} = \frac{1}{m} \sum\limits_{i=1}^{m}x^{(i)} $$


Sample Variance

  • The sample variance, $S^2$, measures the spread of the data. The formula is:

$$\text{sample variance} \; S^2 = \frac{1}{m-1} \sum\limits_{i=1}^{m}(x^{(i)} - \bar x)^2$$


Population variance

  • For a complete population of size $N$, the variance is calculated as:

$$\sigma^2 = \frac{1}{N}\sum\limits_{i=1}^{N}(x^{(i)} - \mu)^2$$



1.2.1. Correlation of Two Random Variables

This section explores the relationship between two random variables, focusing on variance, covariance, and correlation.


Sample Variance

  • Measures the spread (dispersion) of a single variable around its mean.

$$S_x = \frac{1}{m-1} \sum\limits_{i=1}^{m}\left(x^{(i)}-\bar x\right)^2$$


  • Similarly, $S_y$ is the variance of the variable $y$

Sample Covariance

  • Measures the joint variability of two random variables $x$ and $y$

$$S_{xy} = \frac{1}{m-1} \sum\limits_{i=1}^{m}\left(x^{(i)}-\bar x\right)\left(y^{(i)}-\bar y \right)$$


  • Interpretation:
    • $S_{xy} > 0$: $x$ and $y$ tend to increase together (positive correlation).
    • $S_{xy} < 0$: $x$ and $y$ tend to move in opposite directions (negative correlation).
    • $S_{xy} \approx 0$: No clear linear relationship.

Sample Covariance Matrix

  • The covariance matrix summarizes the variances and covariances:

$$S = \begin{bmatrix} S_x & S_{xy} \\ S_{yx} & S_y \end{bmatrix}$$


  • The diagonal entries are variances $(S_x, S_y)$.
  • The off-diagonal entries $(S_{xy}, S_{yx})$ represent the covariances.
  • Since $S_{xy} = S_{yx}$, the matrix is symmetric.

Sample Correlation Coefficient

  • Measures the strength and direction of the linear relationship between two variables:

$$r = \frac{S_{xy}}{ \sqrt {S_{xx}\cdot S_{yy}} }$$


  • Range: $-1 \leq r \leq 1$
  • Interpretation:
    • $r = 1$: Perfect positive correlation
    • $r = -1$: Perfect negative correlation
    • $r = 0$: No linear correlation

Suppose


$$x_1 \leq x_2 \leq \cdots \leq x_n$$

$$y_1 \leq y_2 \leq \cdots \leq y_n$$




1.2.2. Correlation Coefficient

The correlation coefficient is a statistical measure that quantifies the strength and direction of the linear relationship between two variables.


Key Properties

  • Range of Values:

$$0 \leq \left\lvert \text{ correlation coefficient } \right\rvert \leq 1$$


  • Interpretation:

    • $+1 \to \;$ close to a straight line (positive correlation)

    • $-1 \to \;$ close to a straight line (negative correlation)

    • $\;\;\;0 \to \;$ no linear line (uncorrelated)


  • Closeness to a line
    • A value close to $+1$ or $-1$ implies the data points are tightly clustered around a straight line.

    • The correlation does not indicate the slope of the line


Important Note on Causality

  • Correlation does NOT imply causation.

  • Even if two variables are strongly correlated, it doesn't necessarily mean one causes the other.


1.2.3. Correlation Coefficient Plot

A correlation coefficient plot is a visual tool that shows the strength and direction of correlation between pairs of variables. It's particularly useful in identifying patterns and relationships within a dataset.


seaborn

In [ ]:
import seaborn as sns
import pandas as pd

m = 300
x = 2*np.random.randn(m)
y = np.random.randn(m)

xo = np.sort(x)
yo = np.sort(y)
yor = -np.sort(-y)

d = {'col. 1': x, 'col. 2': xo, 'col. 3': yo, 'col. 4': yor}
df = pd.DataFrame(data = d)

sns.pairplot(df, height = 1.5, aspect = 1)
plt.show()
No description has been provided for this image



2. Principal Component Analysis (PCA)


In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('-sM2dYlYSoI', width = "560", height = "315")
Out[ ]:

2.1. Exploring the Concept of 'Dimension' in Various Fields

The term dimension is widely used across different disciplines. Let's explore its meaning in several contexts:


(1) In Coordinate Systems

  • In geometry, coordinates like $x_1, x_2, x_3$ represent the axes in 3-dimensional space.


(2) In Dynamics (Generalized Coordinates)

  • In mechanical systems, a generalized coordinate is another example of a dimension.
  • For instance, in a four-bar linkage mechanism, the angle $\theta$ can serve as a generalized coordinate because knowing $\theta$ uniquely determines the entire configuration of the system.
  • Therefore, the four-bar linkage is a 1D system.


(3) In Machine Learning

  • The input dimension refers to the number of columns (features) in the input data $x$.
  • The output dimension refers to the number of columns in the output data $y$.


(4) In Image Processing

  • For an $8 \times 8$ pixel image, each pixel can have a distinct value (e.g., grayscale intensity or RGB values).
  • The total number of pixels (64 in this example) can be considered the dimension of the image.


2.2. Dimension Reduction

2.2.1. Two Examples of Dimension Reduction


Example 1


In this example, we have a system described by two original variables:


$$ \begin{align*} \dot u_1 &= -2(u_1 - u_2) \\ \dot u_2 &= \quad\; (u_1 - u_2) \end{align*} $$


These equations describe the dynamics of $u_1$ and $u_2$, which initially appear to be a 2D system.


We'll introduce new variables based on physical insights:


$$ \begin{align*} v_1 &= u_1 + 2u_2 \quad &\text{total amount of water}\\ v_2 &= u_1 - u_2 \quad &\text{difference in height} \end{align*} $$


By substituting the new variables into the original system:


$$ \begin{align*} \dot v_1 &= 0 \\ \dot v_2 &= -3 v_2 \end{align*} $$


The solution in $v_1$ and $v_2$


$$ \begin{align*} v_1(t) &= v_1(0) \\ v_2(t) &= v_2(0) e^{-3t} \end{align*} $$


Since $v_1(t)$ is constant, it doesn't evolve over time and effectively behaves like a fixed parameter rather than a dynamic variable. Consequently, the system's behavior is entirely governed by $v_2(t)$, the only variable that changes with time.

This is a great example of how dimension reduction (2D $\rightarrow$ 1D) can emerge naturally from analyzing the underlying physics and identifying conserved quantities.


Example 2

In this example, we have an $15 \times 15$ pixel image, which corresponds to 225 dimensions if each pixel is treated as an independent feature.



Key Insight: Why random noise looks meaningless

  • If each pixel is assigned a random value, the resulting image appears chaotic and lacks any meaningful structure.

  • This randomness illustrates that while the data exists in 225-dimensional space, not all points in this space represent meaningful images.


Why does this relate to dimension reduction?

  • Real-World Images are far from random. They follow specific patterns, structures, and correlations.

  • Consequently, meaningful images occupy only a small subspace within the full 225-dimensional space.

  • This observation opens the door for dimension reduction - mapping high-dimensional image data into a lower-dimensional subspace that retains meaningful features.



2.2.2. Motivation: Can High-Dimensional Data Be Described in a Simpler Way?

In many machine learning and data analysis tasks, datasets often contain numerous features, resulting in high-dimensional data. While this abundance of information can be beneficial, it introduces several challenges such as:

  • Increased computational complexity
  • Difficulties in visualizing and interpreting the data
  • Greater risk of overfitting in machine learning models

To address these issues, we explore dimensionality reduction - a technique designed to simplify complex data by transforming it into a lower dimensional representation while preserving as much meaningful information as possible.


Dimensionality reduction offers several important benefits:

(1) Insights into Low-Dimensional Structures:

  • Reducing data dimensions often reveals meaningful patterns and structures that may be obscured in high dimensional spaces.
  • For example, in visualization tasks, reducing data to 2D or 3D makes it easier to interpret and understand.

(2) Improved Generalization and Reduced Overfitting:

  • With fewer dimensions, the model complexity decreases, reducing the risk of overfitting and improving the model's ability to generalize well to unseen data.

(3) Increased Computational Efficiency:

  • Many machine learning algorithms scale poorly with increasing data dimensionality. Reducing the number of features significantly speeds up training and inference.

(4) Lower Storage Requirements:

  • Dimensionality reduction effectively compresses data by representing it with fewer variables, reducing memory and storage costs.

Dimensionality Reduction vs. Feature Selection

  • Feature Selection involves choosing a subset of existing features that best represent the data.

  • Dimensionality Reduction, on the other hand, is more like Feature Extraction, where new features are constructed from combinations of the original features.

While both approaches aim to simplify the data, dimensionality reduction actively transforms the original data into a lower dimensional form rather than simply discarding irrelevant features.


In essence, dimensionality reduction helps uncover the intrinsic structure of the data, improving both interpretability and computational efficiency - all while retaining critical information.


Redundancy in Data

  • Redundancy occurs when two or more features provide overlapping or duplicated information.

  • Highly correlated features often carry similar patterns, making some dimensions less informative.



(1) Low Redundancy

  • The data points are spread across both feature axes ($r_1$ and $r_2$) without a clear pattern.
  • Here, both features provide unique information, and discarding one feature would result in a considerable loss of information.

(2) Moderate Redundancy

(3) High Redundancy

  • The data points are tightly concentrated, meaning the two features are highly correlated.
  • Since the two features are strongly correlated, knowing the value of one feature ($r_1$ ) allows us to accurately predict the value of the other feature ($r_2$).
  • Therefore, the two features essentially contain the same information, and one dimension can be removed with minimal information loss.

Example of Dim Reduction in 2D

Consider a dataset distributed as shown in the following figure.




Each data point has two features $\{x_1,x_2\}$


(1) Ignoring feature $x_2$

  • Suppose we discard the feature $x_2$, and each example is reduced to

$$x = \{x_1\}$$


(2) Assessing information loss

  • The key question is: "Are we losing much information by throwing away $x_2$?"

  • Observing the data distribution reveals that the data points are highly concentrated along the $x_1$-axis, with very little spread along the $x_2$-axis.

  • This suggests that the data has minimal variance in the $x_2$ direction.


Now consider a dataset distributed as shown in the following figure.



In this case, the dataset is distributed across two features $\{x_1,x_2\}$


(1) The data exhibits significant variance along both axes ($x_1$ and $x_2$)

  • The spread of points suggests that both features contribute meaningful information

(2) Suppose we attempt to discard $x_2$ and reduce the data to:


$$x = \{x_1\}$$


(3) Significant information loss:

  • Since both features exhibit substantial variability, ignoring either $x_1$ or $x_2$ would discard important information about the data's structure.
  • Have to keep both features

While the dataset remains the same, the coordinate system can be changed. This transformation reveals key insights into information retention and loss.




(1) Observing the data in the new axes

  • After transforming the data into new coordinates ($u_1$ and $u_2$), the variance is now concentrated mostly along the $u_1$-axis.

  • The spread along the $u_2$-axis is minimal, indicating low variability in that direction.

(2) Discarding a dimension in the new axes

  • Suppose we discard $u_2$, reducing the data to a 1-dimensional representation:

$$x = \{ u_1 \}$$


  • Since the data's primary variability is along $u_1$, this reduction results in minimal information loss.

How to Find $u_1$

The key question is: How can we determine the optimal axis $u_1$ that maximizes variance and minimizes information loss?

  • This is precisely the objective of Principal Component Analysis (PCA).
  • PCA computes the principal components, which are the directions of maximum variance in the data.
  • The first principal component ($u_1$) is the direction along which the data has the greatest spread, ensuring the most informative projection.

Core Idea

  • Maximizing Variance: To identify the new axes (principal components) along which the data exhibits the most spread.
  • Minimizing Reconstruction Error: To minimize the squared error between the original data and its projection onto the new lower-dimensional space.

In essence, PCA identifies the optimal axis $u_1$ that retains the most important information about the data when reducing dimensions.



2.3. PCA Algorithm

In the previous section, we introduced the concept of dimensionality reduction and explored the core ideas behind it. Now, we will dive deeper into the mechanics of how this reduction is achieved - specifically through the Principal Component Analysis (PCA) algorithm.


The goal of PCA is to transform the data into a new coordinate system such that:

  • The first principal component (denoted as $ u_1 $) captures the maximum variance in the data.
  • Subsequent principal components ($ u_2, u_3, \cdots $) capture the remaining variance in decreasing order.

In this section, we will systematically study the PCA algorithm and its step-by-step procedure.


2.3.1. Pre-processing

Before applying Principal Component Analysis (PCA), it is crucial to preprocess the data to ensure that all features are appropriately scaled and centered. Proper preprocessing improves the accuracy and reliability of PCA results.


Step 1: Given Data

Suppose we are given a dataset consisting of $m$ examples, each with $n$ features:


$$ x^{(i)} = \begin{bmatrix} x^{(i)}_1 \\ \vdots \\x^{(i)}_n \end{bmatrix},\qquad X = \begin{bmatrix} \cdots & (x^{(1)})^T & \cdots\\ \cdots & (x^{(2)})^T & \cdots\\ & \vdots & \\ \cdots & (x^{(m)})^T & \cdots\\ \end{bmatrix}$$


Step 2: Shifting (Zero Mean)

To ensure the data is centered at zero (mean removal):

  • Compute the mean vector

$$\mu = \frac{1}{m}\sum\limits_{i=1}^{m}x^{(i)} $$

  • Shift each sample

$$x^{(i)} \leftarrow x^{(i)} - \mu \quad \text{(zero mean)}$$


Step 3: Rescaling (Unit Variance) [Optional]


  • Compute the variance of each feature

$$\sigma^2_j = \frac{1}{m-1}\sum_{i=1}^{m}\left(x_j^{(i)}\right)^2 $$

  • Rescale each feature

$$x^{(i)}_j \leftarrow \frac{x^{(i)}_j}{\sigma_j} $$



2.3.2. Maximize Variance

The goal of Principal Component Analysis (PCA) is to find the optimal direction (axis) that maximizes the variance in the data when projected onto that direction.



Step 1: Data Projection

  • Let the data be represented as $X = [x^{(1)}, x^{(2)}, \cdots, x^{(m)}]^T$.
  • Suppose we want to project the data onto a unit vector $ u_1 $.

Step 2: Maximizing the Variance of Projected Data

PCA identifies the direction $ u_1 $ such that the projected data captures the maximum variance.


$$\begin{align*} \text{variance of projected data} & = \frac{1}{m}\sum\limits_{i=1}^{m}\big(u^Tx^{(i)}\big)^2 = \frac{1}{m}\sum\limits_{i=1}^{m}\big( {x^{(i)}}^Tu\big)^2 \\ & = \frac{1}{m}\sum\limits_{i=1}^{m}\big( {x^{(i)}}^Tu\big)^T\big( {x^{(i)}}^Tu\big) = \frac{1}{m}\sum\limits_{i=1}^{m}u^Tx^{(i)}{x^{(i)}}^Tu \\ & = u^T\left(\frac{1}{m}\sum\limits_{i=1}^{m}x^{(i)}{x^{(i)}}^T \right) u \\ & =u^TSu \qquad (S=\frac{1}{m}X^T X: \text{sample covariance matrix}) \end{align*}$$


Step 3: Optimization Formulation

Maximizing the variance can now be written as the following optimization problem:


$$\begin{align*} \text{maximize} \quad & u^TSu \\ \text{subject to} \quad & u^Tu = 1\end{align*}$$


Step 4: Eigenvalue Analysis

From the properties of eigenvectors and eigenvalues:


$$\begin{align*} & u^TSu = u^T\lambda u = \lambda u^Tu = \lambda \quad (\text{Eigen analysis} : Su = \lambda u) \\ \\ & \implies \text{pick the largest eigenvalue } \lambda _1 \text{ of covariance matrix } S\\ & \implies u = u_1 \, \text{ is the } \,\lambda_1's \,\text{ corresponding eigenvector}\\ & \implies u_1 \text{ is the first principal component (direction of highest variance in the data)} \end{align*}$$



2.3.3. Minimize the Sum-of-Squared Error

The objective of PCA can also be framed as minimizing the sum-of-squared errors between the original data points and their projections.




Step 1: Derivation of Error Term


$$ \begin{align*} \lVert e^{(i)} \rVert ^2 & = \lVert x^{(i)} \rVert ^2 - \big( {x^{(i)}}^Tu \big)^2 \\ & = \lVert x^{(i)} \rVert ^2 - \big( {x^{(i)}}^Tu \big)^T\big( {x^{(i)}}^Tu \big) \\ & = \lVert x^{(i)} \rVert ^2 - u^Tx^{(i)}{x^{(i)}}^Tu\\ \end{align*} $$


Step 2: Minimizing the Mean Squared Error


$$\begin{align*} \frac {1}{m} \sum\limits_{i=1}^{m} \lVert e^{(i)} \rVert ^2 & = \frac {1}{m} \sum\limits_{i=1}^{m} \lVert x^{(i)} \rVert ^2 - \frac{1}{m}\sum\limits_{i=1}^{m}u^Tx^{(i)}{x^{(i)}}^Tu \\ & = \frac {1}{m} \sum\limits_{i=1}^{m} \lVert x^{(i)} \rVert ^2 - u^T\left(\frac{1}{m}\sum\limits_{i=1}^{m}x^{(i)}{x^{(i)}}^T\right)u \end{align*}$$


Step 3: Optimization Formulation

Minimizing the error is equivalent to maximizing the variance:


$$\begin{align*} &\text{minimize} \; \underbrace{\frac {1}{m} \sum\limits_{i=1}^{m} \lVert x^{(i)} \rVert ^2}_{\text{constant given $x_i$}} - \underbrace{u^T\left(\frac{1}{m}\sum\limits_{i=1}^{m}x^{(i)}{x^{(i)}}^T\right)u}_{\text{maximize}} \\ \\ \implies &\text{maximize} \; u^T\left(\frac{1}{m}\sum\limits_{i=1}^{m}x^{(i)}{x^{(i)}}^T\right)u = \max \; u^T S u\end{align*}$$


$$ \therefore \; \text{minimize} \; \text{error}^2 = \text{maximize} \; \text{variance}$$



2.3.4. Dimension Reduction Method ($n \rightarrow k$)

In PCA, we can reduce the dimensionality of the data while retaining as much meaningful information as possible by following these steps:


Step 1: Select Top $k$ Principal Components

  • Identify the top $k \; (\leq n)$ eigenvectors corresponding to the largest eigenvalues
  • Form a matrix $U = [u_1, u_2, \cdots, u_k]$ where the eigenvectors are orthonormal

Step 2: Project Each Data Point onto the New Subspace

  • Each data point $x^{(i)}$ is projected onto the span of the selected eigenvectors:

$$z^{(i)} = \begin{bmatrix} u_1^Tx^{(i)}\\ u_2^Tx^{(i)}\\ \vdots \\ u_k^Tx^{(i)}\\ \end{bmatrix} \;\text{ or }\; z = U^{T}x $$

  • In a matrix form

$$Z = XU$$


2.3.5. Linear Regression vs. PCA

Linear Regression

  • A supervised method focused on prediction.

  • Minimizes the vertical distances (residuals) between the observed $x_2$ values and the regression line.

PCA

  • An unsupervised method focused on data compression or dimensionality reduction.

  • Minimizes the orthogonal (perpendicular) distances from the data points to the principal component line.



2.4. PCA in Python


Data Generation

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# data generation
m = 1000
mu = np.array([0, 0])
sigma = np.array([[3, 1.5],
                  [1.5, 1]])

X = np.random.multivariate_normal(mu, sigma, m)
X = np.asmatrix(X)

print(X.shape)

fig = plt.figure(figsize = (6, 4))
plt.plot(X[:,0], X[:,1], 'b.', alpha = 0.3)
plt.grid(alpha = 0.3)
plt.xlim([-7, 7])
plt.ylim([-4, 4])
plt.show()
(1000, 2)
No description has been provided for this image

PCA Implementation

In [ ]:
S = 1/(m-1)*X.T*X

D, U = np.linalg.eig(S)

idx = np.argsort(-D)
D = D[idx]
U = U[:,idx]

print(D, '\n')
print(U)
[3.61832577 0.20528425] 

[[ 0.87990038 -0.47515821]
 [ 0.47515821  0.87990038]]

Plot the principal component

In [ ]:
h = U[1,0]/U[0,0]
xp = np.arange(-6, 6, 0.1)
yp = h*xp

fig = plt.figure(figsize = (6, 4))
plt.plot(X[:,0], X[:,1], 'b.', alpha = 0.3)
plt.plot(xp, yp, 'r', linewidth = 3)
plt.grid(alpha = 0.3)
plt.xlim([-7, 7])
plt.ylim([-4, 4])
plt.show()
No description has been provided for this image

Plot the histogram of data projected onto $u_1$

In [ ]:
Z1 = X*U[:,0]

plt.figure(figsize = (6, 4))
plt.hist(Z1, 51)
plt.xlim([-7, 7])
plt.show()
No description has been provided for this image

Plot the histogram of data projected onto $u_2$

In [ ]:
Z2 = X*U[:,1]

plt.figure(figsize = (6, 4))
plt.hist(Z2, 21)
plt.xlim([-7, 7])
plt.show()
No description has been provided for this image

PCA from sklearn.decomposition

In [ ]:
from sklearn.decomposition import PCA

X = np.array(X)

pca = PCA(n_components = 1)
pca.fit(X)
Out[ ]:
PCA(n_components=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Plot the histogram of data projected onto $u_1$

In [ ]:
Z = pca.transform(X)

plt.figure(figsize = (6, 4))
plt.hist(Z, 51)
plt.xlim([-7, 7])
plt.show()
No description has been provided for this image

2.5. PCA Example: Mass-Spring Systsem

Consider a simple spring-mass system in which a ball of mass $m$ is attached to a massless, frictionless spring. The system exhibits oscillatory motion, which can ideally be described by a one-dimensional function of time representing the displacement.




Experimental Setup

However, we lack prior knowledge about the system's true dynamics and the optimal measurement direction.

  • We place three cameras (Camera A, Camera B, and Camera C) at arbitrary angles around the system to capture the ball's position in 3D space.
  • Each camera records 2D projections of the ball's position.
  • The measured coordinates from each camera are recorded in their respective arbitrary coordinate systems.
  • Additionally, external factors such as air resistance, camera imperfections, and potential friction introduce noise into the recorded data.

Challenge

The true underlying dynamics are defined solely by the ball's motion along the $x$-axis, yet:

  • We lack knowledge of the true $x$-axis in the system.
  • We have collected redundant and noisy data by measuring along arbitrary camera axes.

Objective

Our goal is to apply Principal Component Analysis (PCA) to:

  • Identify the most significant direction in the data that represents the system's primary motion.
  • Reduce the dimensionality of the dataset while retaining the core dynamics.
  • Extract a simplified, yet accurate, representation of the ball's motion - ideally corresponding to the true $x$-axis displacement.

Video recording from Camera A

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('Pkit-64g0eU', width = "560", height = "315")
Out[ ]:

Video recording from Camera B

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('x4lvjVjUUqg', width = "560", height = "315")
Out[ ]:

Video recording from Camera C

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('2t62WkNIqxY', width = "560", height = "315")
Out[ ]:

Each camera provides two measurements - one for the horizontal position ($x$) and one for the vertical position ($y$) - relative to its own coordinate system. For each time instance $i$, the recorded data can be structured as follows:


$$ x^{(i)} = \begin{bmatrix} x \text{ in camera A} \\ y \text{ in camera A} \\ x \text{ in camera B} \\ y \text{ in camera B} \\ x \text{ in camera C} \\ y \text{ in camera C} \end{bmatrix} $$


This 6-dimensional vector $x^{(i)}$ represents the system's position as captured by all three cameras at time instance $i$.


To organize the entire dataset over $m$ time instances, we construct a data matrix $X$ as follows:


$$ X_{m \times 6} = \begin{bmatrix} \cdots & (x^{(1)})^T & \cdots \\ \cdots & (x^{(2)})^T & \cdots \\ & \vdots & \\ & \vdots & \\ \cdots & (x^{(m)})^T & \cdots \\ \end{bmatrix} $$


Where:

  • Each row represents the system's recorded position at a particular time instance.
  • Each column corresponds to one of the six measured coordinates (two from each camera).

We have collected the data for you. Here is the dataset that you can download for analysis.


In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
import numpy as np

X = np.load('/content/drive/MyDrive/ML/ML_data/pca_spring.npy')
X = np.asmatrix(X.T)

m = X.shape[0]

print(X.shape)
(273, 6)
In [ ]:
# Do not worry about the negative sign for the relative y-axis.
# it is simply a translation to plot the motion as observed by the camera.

plt.figure(figsize = (8, 4))
plt.subplot(1,3,1)
plt.plot(X[:, 0], -X[:, 1], 'r')
plt.axis('equal')
plt.title('Camera 1')

plt.subplot(1,3,2)
plt.plot(X[:, 2], -X[:, 3], 'b')
plt.axis('equal')
plt.title('Camera 2')

plt.subplot(1,3,3)
plt.plot(X[:, 4], -X[:, 5], 'k')
plt.axis('equal')
plt.title('Camera 3')

plt.show()
No description has been provided for this image

Apply PCA

In [ ]:
X = X - np.mean(X, axis = 0)

S = 1/(m-1)*X.T*X

D, U = np.linalg.eig(S)

idx = np.argsort(-D)
D = D[idx]
U = U[:,idx]

print(D, '\n')
print(U)
[2.46033089e+04 3.22747042e+02 8.73851124e+01 8.19527660e+01
 3.19467195e+01 7.42861585e+00] 

[[ 0.36881064  0.62298194 -0.12571821 -0.42647348  0.52121775 -0.0807439 ]
 [ 0.35632379  0.57286174  0.132303    0.59881765 -0.40143215  0.08734045]
 [ 0.58419477 -0.22610057 -0.20325551 -0.47751523 -0.58153918  0.00857804]
 [ 0.08652315 -0.02671281  0.75692234 -0.14177391 -0.06010869 -0.62861422]
 [ 0.4159798  -0.29900638  0.49374948  0.05637591  0.32442517  0.62075559]
 [-0.46389987  0.37746931  0.32963322 -0.45633202 -0.34660023  0.45308403]]

Plot Eigenvalues

In [ ]:
plt.figure(figsize = (6, 4))
plt.stem(np.sqrt(D))
plt.grid(alpha = 0.3)
plt.show()
No description has been provided for this image

Projection to the new Eigenvectors


$$Z = XU$$

In [ ]:
# relative magnitudes of the principal components

Z = X*U
xp = np.arange(0, m)/24    # 24 frame rate

plt.figure(figsize = (6, 4))
plt.plot(xp, Z)
plt.grid(alpha = 0.3)
plt.show()
No description has been provided for this image

Projection to the Principal Component

In [ ]:
## projected onto the first principal component
# 6 dim -> 1 dim (dim reduction)
# relative magnitude of the first principal component

plt.figure(figsize = (6, 4))
plt.plot(Z[:,0])
plt.grid(alpha = 0.3)
plt.show()
No description has been provided for this image

Discussion:

  • The first principal component effectively captured the system's oscillatory behavior, corresponding to the ideal spring-mass motion.
  • PCA successfully reduced the dataset from six dimensions to one while preserving the system's key dynamics.
  • PCA filtered out minor variations caused by noise, improving data clarity.
  • It is not the magnitudes of the measurements that are physically meaningful, but rather the relative magnitudes and patterns within the data.
  • By strategically combining multiple less-reliable measurements, PCA can uncover the true underlying structure of the system, providing a cost-effective alternative to expensive equipment.



3. Singular Value Decomposition (SVD)


In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('sFym4MYOQ-o', width = "560", height = "315")
Out[ ]:

Geometry of Linear Maps

In the context of linear algebra, the matrix $ A \in \mathbb{R}^{m \times n} $ represents a linear transformation that maps vectors from an $ n $-dimensional space to an $ m $-dimensional space.


The matrix $ A $ can be interpreted as performing a combination of the following transformations:

(1) Rotation:

  • $ A $ can rotate vectors, changing their direction without altering their length. This preserves angles between vectors.

(2) Scaling (Stretching/Compression):

  • $ A $ can stretch or compress vectors along certain directions. This changes the vector's magnitude.

(3) Shearing (Skewing):

  • $ A $ can skew the original space, distorting shapes without strict rotation or scaling.

Mapping the Unit Ball to an Ellipsoid





The unit ball in $ \mathbb{R}^n $, defined as:


$$S = \{ x \in \mathbb{R}^n \mid \|x\| \leq 1 \}$$


is transformed by $ A $ into an ellipsoid in $ \mathbb{R}^m $:


$$AS = \{ Ax \mid x \in S \}$$


This geometric interpretation highlights that linear transformations distort space by rotating, scaling or skewing.


3.1. SVD Algorithm


Singular Values and Singular Vectors

The Singular Value Decomposition (SVD) provides a structured way to describe this transformation. The key elements of the SVD are:

  • the numbers $\sigma_1, \cdots, \sigma_n$ are called the singular values of $A$ by convention, $\sigma_i > 0$

  • the vectors $u_1, \cdots, u_n$ are unit vectors along the principal semiaxes of $AS$

  • the unit vectors $v_1, \cdots, v_n$ are the preimages of the principal semiaxes


These satisfy the following relationship:


$$A v_i = \sigma_i u_i$$


Matrix Formulation of SVD

By combining these vectors and values into matrices:


$$\begin{align*} A\begin{bmatrix} v_1& v_2 & \cdots & v_r \end{bmatrix} & =\begin{bmatrix} A v_1 & A v_2 & \cdots & A v_r \end{bmatrix} \\\\ & =\begin{bmatrix} \sigma_1u_1& \sigma_2u_2 & \cdots & \sigma_ru_r \end{bmatrix} \\\\ & =\begin{bmatrix} u_1& u_2 & \cdots & u_r \end{bmatrix}\begin{bmatrix} \sigma_1& & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_r \end{bmatrix} \end{align*}$$


This leads to the compact matrix form:


$$A V = U \Sigma$$


Where:

  • $U$: Orthogonal matrix containing left singular vectors
  • $V$: Orthogonal matrix containing right singular vectors
  • $\Sigma$: Diagonal matrix containing singular values

Thin Singular Value Decomposition

In the case where $A \in \mathbb{R}^{m \times n}$ is skinny (more rows than columns) and full rank (i.e., $r = n$), we define:


$$A v_i = \sigma_i u_i \qquad \text{for } 1 \leq i \leq n$$


Let:


$$ \begin{align*} \hat U &= \begin{bmatrix}u_1 & u_2 & \cdots & u_n \end{bmatrix},\quad \quad \hat \Sigma &= \begin{bmatrix}\sigma_1& & & \\ & \sigma_2 &&\\ &&\ddots&\\ &&&\sigma_n \end{bmatrix}, \quad \quad V &= \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix} \end{align*} $$


In matrix form,


$$AV = \hat U \hat \Sigma$$


Since $V$ is orthogona, we obtain the thin (or reduced) SVD of $A$


$$ A = \hat U \hat \Sigma V^T $$





Full Singular Value Decomposition

In addition to the thin SVD, we can extend the decomposition to form the full SVD by adding extra components to ensure that the resulting matrices are fully orthogonal.


Suppose $A \in \mathbb{R}^{m \times n}$ with $m \geq n$.


To construct the full SVD:

(1) Extend $U$:

$\quad \,$Add extra orthonormal columns to $\hat{U}$ to create a complete orthogonal matrix $U$.


$$U = \begin{bmatrix} u_1 & u_2 & \cdots & u_n & u_{n+1}& \cdots & u_m \end{bmatrix}$$


$\quad \,$Since $U$ is now an orthogonal matrix, $UU^T = I_m$.


(2) Extend $\Sigma$:

$\quad \,$Add extra rows of zeros to the diagonal matrix $\hat{\Sigma}$, ensuring the dimensions align with $U$.


$$ \Sigma = \begin{bmatrix} \sigma_1 & & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_n \\ & & & 0 \\ & & & \vdots \\ & & & 0 \end{bmatrix} $$


(3) Combine in Matrix Form:

$\quad \,$Using these augmented matrices, the full SVD is defined as:


$$A = U \Sigma V^T$$


Where:

  • $U \in \mathbb{R}^{m \times m}$: an orthogonal matrix.
  • $\Sigma \in \mathbb{R}^{m \times n}$: a diagonal matrix with singular values.
  • $V \in \mathbb{R}^{n \times n}$: an orthogonal matrix.



This is the (full) singular value decomposition of $A$

  • Every matrix $A$ has a singular value decomposition.

  • The full SVD extends the thin SVD by adding orthogonal components to $U$ and zero rows in $\Sigma$/


Interpretation of SVD





The Singular Value Decomposition (SVD) provides an insightful way to understand a linear transformation $Ax$ by breaking it down into three distinct steps:


Step 1: Rotation by $V^T$

  • The first transformation is a rotation of the input vector using $V^T$.

  • This step aligns the data with the principal directions in the row space of the matrix.


Step 2: Diagonal Scaling by $\Sigma$

  • The second transformation applies scaling by the diagonal matrix $\Sigma$.

  • Each coordinate is stretched or compressed by the corresponding singular value $\sigma_i$.

  • These singular values represent the magnitude of the transformation in the corresponding direction.


Step 3: Rotation by $U$

  • The final transformation is another rotation by the orthogonal matrix $U$.

  • This step aligns the scaled data with the principal directions in the column space.


SVD factorizes the matrix $ A $ into three distinct components: an orthogonal matrix $ U $, a diagonal matrix $ \Sigma $ containing singular values, and another orthogonal matrix $ V^T $. This decomposition effectively represents $ A $ as a sequence of three linear transformations: a rotation by $ V^T $, followed by a scaling by $ \Sigma $, and then another rotation by $ U $.



3.2. PCA and SVD

(1) Any real symmetric and positive definite matrix $B \in \mathbb{R}^{n \times n}$ can be decomposed into its eigen decomposition as follows:


$$ \begin{align*} Bs_1 &= \lambda_1 s_1 \\ Bs_2 &= \lambda_2 s_2 \\ &\vdots\\ Bs_n &= \lambda_n s_n \\\\\\ B\begin{bmatrix}s_1 & s_2 & \cdots & s_n\end{bmatrix} &= \begin{bmatrix}s_1 & s_2 & \cdots & s_n\end{bmatrix}\begin{bmatrix}\lambda_1 & & & \\ & \lambda_2 & & \\ &&\ddots &\\ &&& \lambda_n\end{bmatrix}\\\\\\ B &= S\Lambda S^{-1} = S\Lambda S^T \end{align*} $$


where:

  • $S$ is an orthogonal matrix containing eigenvectors of $B$ ($\rightarrow SS^T = I$).
  • $\Lambda$ is a diagonal matrix with corresponding eigenvalues.

(2) A real matrix $A \in \mathbb{R}^{m \times n}$ (when $m>n$) has a singular value decomposition (SVD) given by:


$$A = U\Sigma V^T$$


where:

  • $U$ is an orthogonal matrix of size $m \times m$.
  • $V$ is an orthogonal matrix of size $n \times n$.
  • $\Sigma$ is a diagonal matrix of singular values (size $m \times n$).

4.2.1. Connecting PCA with SVD

From the matrix $A$ (assuming $A$ is skinny and full rank), we can construct two positive-definite symmetric matrices:


$$AA^T \quad \text{and} \quad A^TA$$


For $AA^T \in \mathbb{R}^{m \times m}$, $U$ is the matrix of eigenvectors of $AA^T$


$$AA^T = U\Sigma V^T V \Sigma^T U^T = U \Sigma \Sigma^T U^T$$


For $A^TA \in \mathbb{R}^{n \times n}$, $V$ is the matrix of eigenvectors of $A^TA$


$$A^TA = V \Sigma^T U^T U\Sigma V^T = V \Sigma^T \Sigma V^T$$



PCA via SVD

Recall that PCA relies on the covariance matrix:


$$S = \frac{1}{m}X^TX$$


By scaling the data matrix $A = \frac{X}{\sqrt m}$, the covariance matrix becomes:


$$S = A^T A$$


Using the SVD result:


$$S = A^TA = V \Sigma^T \Sigma V^T = V \Lambda V^T$$


  • $V$ is the matrix of PCA principal components (eigenvectors of $S = A^TA$)
  • $\sigma_i^2 = \lambda_i$ (singular values squared are the eigenvalues in PCA)

Summary

  • The eigenvectors of $A^TA$ (from the SVD) provide the principal components in PCA.

  • The squared singular values of $A$ correspond to the eigenvalues in PCA.


4.2.2. Efficient Computation of SVD

We have established the relationship between PCA and SVD. Suppose $A \in \mathbb{R}^{m \times n}$ where $m \gg n$.


To compute the SVD:

  • $U$: The left singular vectors of $A$

  • $V$: The right singular vectors of $A$

  • $\Sigma$: Diagonal matrix of singular values, $\sigma_i = \sqrt{\lambda_i}$


Key Observations:

  • The columns of $U$ are the eigenvectors of $AA^T \in \mathbb{R}^{m \times m}$.

  • The columns of $V$ are the eigenvectors of $A^T A \in \mathbb{R}^{n \times n}$.

  • The diagonal elements of $\Sigma$ are the singular values, $\sigma_i = \sqrt{\lambda_i}$.


Computational Challenge

If $n$ is small, computing $V$ via np.linalg.eig(A^T*A) is feasible since $A^T A \in \mathbb{R}^{n \times n}$. However, computing $U$ directly from np.linalg.eig(A*A^T) is inefficient because:


  • $AA^T \in \mathbb{R}^{m \times m}$
  • Since $m \gg n$, this matrix is much larger than $A^T A$, making it computationally expensive.

Efficient Solution

Instead of computing $U$ directly via np.linalg.eig(A*A^T), we can leverage the relationship between $U$ and $V$:


$$u_i = \frac{1}{\sqrt{\lambda_i}} A v_i$$


Where:

  • $v_i$ is the $i$-th eigenvector of $A^T A$
  • $\lambda_i$ is the corresponding eigenvalue
  • $u_i$ is the $i$-th left singular vector of $A$

Why Does This Work?

Starting from the known relationship:


$$A^T A v_i = \lambda_i v_i$$


By multiplying both sides by $A$,


$$A A^T (A v_i) = \lambda_i (A v_i)$$


This confirms that $A v_i$ is an eigenvector of $AA^T$ with eigenvalue $\lambda_i$.

However, this vector is not yet normalized. To normalize it:


$$u_i = \frac{1}{\sqrt{\lambda_i}} A v_i$$


This ensures that $u_i$ has unit length.


This method efficiently computes the left singular vectors $U$ without directly forming the large matrix $AA^T$, making it ideal when $m \gg n$.

  • Computing $V$ via $\text{eig}(A^T A)$ is efficient since $n$ is relatively small.
  • Instead of computing $U$ directly via $\text{eig}(AA^T)$, we use the relation $u_i = \frac{1}{\sqrt{\lambda_i}} A v_i$, which is computationally faster and avoids large matrix operations.

Example

Consider the following matrix:


$$A = \begin{bmatrix} 4 & 4 \\ -3 & 3 \end{bmatrix}$$


From the decomposition:


$$ \begin{aligned} A v_1 &= \sigma_1 u_1 \\ A v_2 &= \sigma_2 u_2 \end{aligned} $$


Combining these,


$$ AV = U\Sigma \quad \implies \quad A = U \Sigma V^T $$


To determine $V$ through eigen-analysis


$$A^TA = V\Sigma^TU^TU\Sigma V^T = V\Sigma^T\Sigma V^T = V\begin{bmatrix} \sigma_1^2& & \\ & \sigma_2^2 & \\ & & \ddots \\ \end{bmatrix}V^T$$


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
A = np.array([[4, 4],
              [-3, 3]])

A = np.asmatrix(A)

[D, V] = np.linalg.eig(A.T*A)

print(V, '\n')
print(D)
[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]] 

[32. 18.]

To find $U$ from eigen-analysis


$$AA^T = U\Sigma V^TV\Sigma^TU^T = U\Sigma\Sigma^TU^T = U\begin{bmatrix} \sigma_1^2& & \\ & \sigma_2^2 & \\ & & \ddots \\ \end{bmatrix}U^T$$


In [ ]:
[D, U] = np.linalg.eig(A*A.T)

print(U, '\n')
print(D)
[[1. 0.]
 [0. 1.]] 

[32. 18.]

Instead of computing $U$ directly via np.linalg.eig(A*A^T), we can leverage the relationship between $U$ and $V$:


$$u_i = \frac{1}{\sqrt{\lambda_i}}Av_i$$


In [ ]:
sigma = np.sqrt(D)
Sigma = np.diag(sigma)
U = A @ V @ np.linalg.inv(Sigma)

print(U)
[[1. 0.]
 [0. 1.]]

However, how to compute the SVD numerically is not the focus of this course. Instead, we will use Python's built-in np.linalg.svd() function to compute the SVD decomposition directly. By doing so, we can concentrate on understanding the underlying theory and practical applications of SVD.


The function np.linalg.svd() performs SVD as follows:


In [ ]:
[U, S, VT] = np.linalg.svd(A, full_matrices = True)

print(U.shape, S.shape, VT.shape, '\n')
print(U, '\n')
print(S, '\n')
print(VT)
(2, 2) (2,) (2, 2) 

[[-1.  0.]
 [ 0.  1.]] 

[5.65685425 4.24264069] 

[[-0.70710678 -0.70710678]
 [-0.70710678  0.70710678]]

MIT 18.06 Linear Algebra by Prof. Gilbert Strang

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('Nx0lRBaXoz4', width = "560", height = "315")
Out[ ]:

3.3. Low Rank Approximation: Dimension Reduction

3.3.1. Low Rank Approximation



In the 2D case, consider a vector $x$ that can be expressed in terms of the basis vectors $v_1$ and $v_2$ (where $\hat v$ denotes a unit vector, though this notation is omitted here for simplicity):


$$x = c_1 v_1 + c_2 v_2$$


Equivalently,


$$x = \langle x, v_1 \rangle v_1 + \langle x, v_2 \rangle v_2$$


Since inner products can be rewritten in matrix notation,


$$x = ( v_1^{T} x) v_1 + ( v_2^T x) v_2$$


SVD Transformation

Next, consider the effect of matrix $A$ on the vector $x$:


$$Ax = c_1 A v_1 + c_2 A v_2$$


By the SVD property,


$$A v_1 = \sigma_1 u_1 \quad \text{and} \quad A v_2 = \sigma_2 u_2$$


Thus,


$$Ax = c_1 \sigma_1 u_1 + c_2 \sigma_2 u_2$$


Equivalently,


$$ \begin{align*} Ax &= u_1 \sigma_1 v_1^T x + u_2 \sigma_2 v_2^T x \\ & = \begin{bmatrix} u_1 & u_2 \end{bmatrix}\begin{bmatrix}\sigma_1 & 0\\ 0 & \sigma_2 \end{bmatrix}\begin{bmatrix} v_1^T \\ v_2^T \end{bmatrix} x\\ & = \begin{bmatrix} u_1 & u_2 \end{bmatrix}\begin{bmatrix}\sigma_1 & 0\\ 0 & \sigma_2 \end{bmatrix}\begin{bmatrix} v_1 & v_2 \end{bmatrix}^T x \end{align*} $$


In matrix form,


$$Ax = U \Sigma V^T x$$


Low-Rank Approximation in General

In practice, if one singular value dominates the others ($\sigma_1 \gg \sigma_2$), we can approximate $A$ with a rank-1 approximation:


$$Ax = u_1 \sigma_1 v_1^T x + u_2 \sigma_2 v_2^T x \approx u_1 \sigma_1 v_1^T x$$


General Low-Rank Approximation of Matrix $A$

For a general matrix $A$, the SVD decomposition allows for constructing a low-rank approximation by retaining only the top $k$ singular values:

  • Full-rank representation:

$$ A = U_r \Sigma_r V_r^T = \sum_{i=1}^{r} \sigma_i u_i v_i^T $$

  • Rank-$k$ approximation (with $k \leq r$):

$$ \tilde{A} = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T $$


Example: Matrix

In [ ]:
A = np.array([[11.08, 6.82, 1.76, -6.82],
              [2.50, -1.01, -2.60, 1.19],
              [-4.88, -5.07, -3.21, 5.20],
              [-0.49, 1.52, 2.07, -1.66],
              [-14.04, -12.40, -6.66, 12.65],
              [0.27, -8.51, -10.19, 9.15],
              [9.53, -9.84, -17.00, 11.00],
              [-12.01, 3.64, 11.10, -4.48]])


In [ ]:
[U, S, VT] = np.linalg.svd(A, full_matrices = True)

U = np.asmatrix(U)
S = np.asmatrix(np.diag(S))
VT = np.asmatrix(VT)

print(U.shape, S.shape, V.shape, '\n')
print(U, '\n')
print(S, '\n')
print(VT)
(8, 8) (4, 4) (2, 2) 

[[-0.24642369  0.44536234  0.6205749   0.32739909  0.45915182  0.05410587
  -0.18650735  0.00956801]
 [ 0.07432183  0.10750102  0.28337342 -0.77803389 -0.10352762  0.32547077
  -0.42245245  0.04655432]
 [ 0.21369502 -0.1903207   0.49492174  0.11043549 -0.4736735  -0.61498898
  -0.24132713 -0.01233476]
 [-0.08223154 -0.02472167  0.20057294  0.06152361 -0.27047805  0.29819169
   0.20131309 -0.86371786]
 [ 0.50375686 -0.55378628  0.13739629 -0.02351069  0.61002316  0.01682912
  -0.07593043 -0.20479838]
 [ 0.43719294  0.03496575 -0.05496539  0.50172005 -0.29664823  0.55199001
  -0.3572128   0.18055856]
 [ 0.59019451  0.42657675  0.21078205 -0.13667122 -0.03298713 -0.00394525
   0.6243326   0.1252985 ]
 [-0.2967926  -0.51321848  0.42787997  0.0231769  -0.14042827  0.34499254
   0.40598717  0.40166774]] 

[[3.68258450e+01 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 2.62368811e+01 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 2.20213307e-02 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 5.10429562e-03]] 

[[-0.03564832 -0.5381828  -0.61426544  0.5759917 ]
 [ 0.92076022  0.16617467 -0.32597333 -0.13538086]
 [-0.13920696 -0.49222525 -0.31155848 -0.80079151]
 [-0.36269992  0.66367127 -0.6475729  -0.09294376]]


In [ ]:
[U, S, VT] = np.linalg.svd(A, full_matrices = False)

U = np.asmatrix(U)
S = np.asmatrix(np.diag(S))
VT = np.asmatrix(VT)

print(U.shape, S.shape, V.shape, '\n')
print(U, '\n')
print(S, '\n')
print(VT)
(8, 4) (4, 4) (2, 2) 

[[-0.24642369  0.44536234  0.6205749   0.32739909]
 [ 0.07432183  0.10750102  0.28337342 -0.77803389]
 [ 0.21369502 -0.1903207   0.49492174  0.11043549]
 [-0.08223154 -0.02472167  0.20057294  0.06152361]
 [ 0.50375686 -0.55378628  0.13739629 -0.02351069]
 [ 0.43719294  0.03496575 -0.05496539  0.50172005]
 [ 0.59019451  0.42657675  0.21078205 -0.13667122]
 [-0.2967926  -0.51321848  0.42787997  0.0231769 ]] 

[[3.68258450e+01 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 2.62368811e+01 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 2.20213307e-02 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 5.10429562e-03]] 

[[-0.03564832 -0.5381828  -0.61426544  0.5759917 ]
 [ 0.92076022  0.16617467 -0.32597333 -0.13538086]
 [-0.13920696 -0.49222525 -0.31155848 -0.80079151]
 [-0.36269992  0.66367127 -0.6475729  -0.09294376]]

Question

How to determine a good $k$ so that we can efficiently compress $A$ without losing too much information of $A$?


In [ ]:
sig_val = np.diag(S)

plt.figure(figsize = (6, 4))
plt.stem(range(1,5), sig_val)
plt.xticks(range(1,5))
plt.title('Singular Values')
plt.show()
No description has been provided for this image

When singular values decay rapidly, retaining only the top $k$ singular values can provide an accurate yet efficient approximation.


In [ ]:
k = 2

low_U = U[:, :k]
low_S = S[:k, :k]
low_VT = VT[:k, :]

A_approx = low_U*low_S*low_VT

print(low_U.shape, low_S.shape, low_VT.shape, '\n')
print('A:\n', A ,'\n')
print('A_approx:\n', A_approx, '\n')
print('low_U:\n', low_U, '\n')
print('low_VT:\n', low_VT)
(8, 2) (2, 2) (2, 4) 

A:
 [[ 11.08   6.82   1.76  -6.82]
 [  2.5   -1.01  -2.6    1.19]
 [ -4.88  -5.07  -3.21   5.2 ]
 [ -0.49   1.52   2.07  -1.66]
 [-14.04 -12.4   -6.66  12.65]
 [  0.27  -8.51 -10.19   9.15]
 [  9.53  -9.84 -17.    11.  ]
 [-12.01   3.64  11.1   -4.48]] 

A_approx:
 [[ 11.08250851   6.8256176    1.76533991  -6.80890115]
 [  2.49942829  -1.00429274  -2.60062751   1.19462804]
 [ -4.87827835  -5.06500943  -3.20623934   5.20878009]
 [ -0.48927124   1.52196569   2.07157948  -1.65643381]
 [-14.03962233 -12.39843105  -6.65913505  12.65241176]
 [  0.27076035  -8.51229541 -10.18871873   9.14926874]
 [  9.53039313  -9.83725225 -16.99900559  11.0036522 ]
 [-12.00864542   3.64455947  11.10301226  -4.47244356]] 

low_U:
 [[-0.24642369  0.44536234]
 [ 0.07432183  0.10750102]
 [ 0.21369502 -0.1903207 ]
 [-0.08223154 -0.02472167]
 [ 0.50375686 -0.55378628]
 [ 0.43719294  0.03496575]
 [ 0.59019451  0.42657675]
 [-0.2967926  -0.51321848]] 

low_VT:
 [[-0.03564832 -0.5381828  -0.61426544  0.5759917 ]
 [ 0.92076022  0.16617467 -0.32597333 -0.13538086]]

3.3.2. Visual Illustration


Full SVD


Thin (or Economy) SVD


Truncated SVD (Low Rank Approximation)

$$Y \approx U_r \Sigma_r V_r^T$$

  • First term

$$Y_1 \approx u_1 \sigma_1 v_1^T$$

  • Second term

$$Y_2 \approx u_2 \sigma_2 v_2^T$$

  • Third term

$$Y_3 \approx u_3 \sigma_3 v_3^T$$


  • Best approximation with a finite $r$ decompositions

$$Y \approx Y_1 + Y_2 + Y_3 + \cdots = u_1 \sigma_1 v_1^T + u_2 \sigma_2 v_2^T + u_3 \sigma_3 v_3^T + \cdots = \sum_{j = 1}^{r} u_j \sigma_j v_j^T$$


3.3.3. Example: Image Approximation

Image as a Matrix

An image can be represented as a matrix $A$, where each entry corresponds to the intensity of a pixel for grayscale images. By applying SVD, we decompose the image matrix as follows:


$$A = U\Sigma V^T$$

Image Approximation with Rank-$k$ Decomposition

Instead of retaining the full rank of $A$, we can approximate it using only the top $k$ singular values:


$$\tilde A = U_k \Sigma_k V_k^T = \sum_{i=1}^{k}\sigma_i u_i v_i^T \qquad (k \leq r)$$


Downlaod the image file

In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
import cv2

img = cv2.imread('/content/drive/MyDrive/ML/ML_data/kaist_me.jpg', 0)
print(img.shape)
(1459, 2593)
In [ ]:
plt.figure(figsize = (6, 6))
plt.imshow(img, 'gray')
plt.axis('off')
plt.show()
No description has been provided for this image

SVD

In [ ]:
A = img

[U, S, VT] = np.linalg.svd(A, full_matrices = False)

U = np.asmatrix(U)
S = np.asmatrix(np.diag(S))
VT = np.asmatrix(VT)

print(U.shape, S.shape, VT.shape, '\n')
(1459, 1459) (1459, 1459) (1459, 2593) 


Plot Singular Values

In [ ]:
sig_val = np.diag(S)

plt.figure(figsize = (6, 4))
plt.stem(sig_val)
plt.title('Singular Values')
plt.show()
No description has been provided for this image

Approximated image when $k = 20$

In [ ]:
k = 20

low_U = U[:, :k]
low_S = S[:k, :k]
low_VT = VT[:k, :]

A_hat = low_U*low_S*low_VT

plt.figure(figsize = (8, 5))
plt.subplot(1,2,1)
plt.imshow(A, 'gray')
plt.axis('off')
plt.title('Original image')

plt.subplot(1,2,2)
plt.imshow(A_hat, 'gray')
plt.axis('off')
plt.title('Approximated image w/ rank = 20')

plt.tight_layout()
plt.show()
No description has been provided for this image

Approximated images with $k$ varied

In [ ]:
K = [1, 2, 3, 4, 10, 20, 30, 100]

plt.figure(figsize = (10, 5))

for i in range(len(K)):
    low_U = U[:, :K[i]]
    low_S = S[:K[i], :K[i]]
    low_VT = VT[:K[i], :]

    A_hat = low_U*low_S*low_VT

    plt.subplot(2,4,i+1), plt.imshow(A_hat, 'gray'), plt.axis('off')
    plt.title('rank {}'.format(K[i]))

plt.tight_layout()
plt.show()