Dimension Reduction
Table of Contents
- 1. Statistics
  - 1.1. Populations and Samples
    - 1.1.1. How to Generate Random Numbers (Samples or Data)
    - 1.1.2. Histogram: Graphical Representation of Data Distribution
  - 1.2. Inference
  - 1.3. Law of Large Numbers
  - 1.4. Central Limit Theorem
  - 1.5. Multivariate Statistics
- 2. Principal Component Analysis (PCA)
  - 2.1. PCA Algorithm
    - 2.1.1. Pre-processing
    - 2.1.2. Maximize Variance
    - 2.1.3. Minimize the Sum-of-Squared Error
    - 2.1.4. Dimension Reduction Method (n \rightarrow k)
  - 2.2. PCA Visualization
  - 2.3. Python Codes
  - 2.4. PCA Example
- 3. Fisher Discriminant Analysis (FDA) or Linear Discriminant Analysis (LDA)
  - 3.1. FDA Algorithm
    - 3.1.1. Dimensionality Reduction with Label
    - 3.1.2. Projection onto Line \omega
    - 3.1.3. Fisher Discriminant Analysis
  - 3.2. Python Code
- 4. Singular Value Decomposition (SVD)
A population includes all the elements from a set of data.
A parameter is a quantity computed from a population.
- mean, \mu
- variance, \sigma^2
A sample is a subset of the population.
- one or more observations
A statistic is a quantity computed from a sample.
- sample mean, \bar{x}
- sample variance, s^2
- sample correlation, S_{xy}
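To make these definitions concrete, here is a minimal sketch (hypothetical numbers, not part of the original notes) contrasting population parameters with the corresponding sample statistics.
import numpy as np

# a hypothetical finite population and its parameters
population = np.arange(1, 10001)
mu = np.mean(population)                 # population mean (parameter)
sigma2 = np.var(population)              # population variance (parameter)

# a sample drawn from the population and its statistics
sample = np.random.choice(population, size = 100, replace = False)
xbar = np.mean(sample)                   # sample mean (statistic)
s2 = np.var(sample, ddof = 1)            # sample variance (statistic, divides by m-1)

print(mu, sigma2)
print(xbar, s2)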
1.1.1. How to Generate Random Numbers (Samples or Data)
- Data sampled from population/process/generative model
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
## random number generation (1D)
m = 1000

# uniform distribution U(0,1)
x1 = np.random.rand(m, 1)

# uniform distribution U(a,b)
a = 1
b = 5
x2 = a + (b - a)*np.random.rand(m, 1)

# standard normal (Gaussian) distribution N(0,1^2)
# x3 = np.random.normal(0, 1, m)
x3 = np.random.randn(m, 1)

# normal distribution N(5,2^2)
x4 = 5 + 2*np.random.randn(m, 1)

# random integers in {1, ..., 5} (upper bound is exclusive)
x5 = np.random.randint(1, 6, size = (1, m))
1.2. Inference
- The true population or process is modeled probabilistically.
- Sampling supplies us with realizations from the probability model.
- We compute something from the sample, but recognize that we could just as easily have gotten a different set of realizations.
- We want to infer the characteristics of the true probability model from our one sample.
1.3. Law of Large Numbers
- The sample mean converges to the population mean as the sample size gets large
\bar{x} \rightarrow \mu_x \quad \text{as} \quad m \rightarrow \infty
- True for any probability density function
- sample mean and sample variance
\bar{x} = \frac{x_1 + x_2 + \cdots + x_m}{m}, \qquad s^2 = \frac{\sum_{i=1}^{m} (x_i - \bar{x})^2}{m-1}
- suppose x \sim U[0,1]
# statistics
# numerically understand statistics
m = 100
x = np.random.rand(m, 1)

# xbar = 1/m*np.sum(x, axis = 0)
# np.mean(x, axis = 0)

xbar = 1/m*np.sum(x)
varbar = 1/(m - 1)*np.sum((x - xbar)**2)

print(xbar)
print(np.mean(x))
print(varbar)
print(np.var(x))    # note: np.var divides by m by default; np.var(x, ddof = 1) divides by m-1
# various sample sizes m
m = np.arange(10, 2000, 20)
means = []

for i in m:
    x = np.random.normal(10, 30, i)
    means.append(np.mean(x))

plt.figure(figsize = (6, 4))
plt.plot(m, means, 'bo', markersize = 4)
plt.axhline(10, c = 'k', linestyle = 'dashed')
plt.xlabel('# of samples (= sample size)', fontsize = 15)
plt.ylabel('sample mean', fontsize = 15)
plt.ylim([0, 20])
plt.show()
1.4. Central Limit Theorem
- The sample mean (not the samples themselves) will be approximately normally distributed as the sample size m \rightarrow \infty
\bar{x} = \frac{x_1 + x_2 + \cdots + x_m}{m}
- More samples provide more confidence (or less uncertainty)
- Note: true regardless of the distribution of the population
\bar{x} \rightarrow N\left(\mu_x, \left(\frac{\sigma}{\sqrt{m}}\right)^2\right)
N = 100
m = np.array([10, 40, 160])     # sample sizes m

S1 = []                         # sample means (or sample averages)
S2 = []
S3 = []

for i in range(N):
    S1.append(np.mean(np.random.rand(m[0], 1)))
    S2.append(np.mean(np.random.rand(m[1], 1)))
    S3.append(np.mean(np.random.rand(m[2], 1)))

plt.figure(figsize = (6, 4))
plt.subplot(1,3,1), plt.hist(S1, 21), plt.xlim([0, 1]), plt.title('m = ' + str(m[0])), plt.yticks([])
plt.subplot(1,3,2), plt.hist(S2, 21), plt.xlim([0, 1]), plt.title('m = ' + str(m[1])), plt.yticks([])
plt.subplot(1,3,3), plt.hist(S3, 21), plt.xlim([0, 1]), plt.title('m = ' + str(m[2])), plt.yticks([])
plt.show()
1.5. Multivariate Statistics
x^{(i)} = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \\ \vdots \end{bmatrix}, \qquad X = \begin{bmatrix} - & (x^{(1)})^T & - \\ - & (x^{(2)})^T & - \\ & \vdots & \\ - & (x^{(m)})^T & - \end{bmatrix}
- m observations (x^{(1)}, x^{(2)}, \cdots, x^{(m)})
\text{sample mean} \quad \bar{x} = \frac{x^{(1)} + x^{(2)} + \cdots + x^{(m)}}{m} = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}
\text{sample variance} \quad S^2 = \frac{1}{m-1}\sum_{i=1}^{m}\left(x^{(i)} - \bar{x}\right)^2
\left(\text{Note: population variance} \quad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x^{(i)} - \mu\right)^2\right)
1.5.1. Correlation of Two Random Variables
\begin{align*} \text{Sample variance:} \quad & S_x = \frac{1}{m-1}\sum_{i=1}^{m}\left(x^{(i)} - \bar{x}\right)^2 \\ \text{Sample covariance:} \quad & S_{xy} = \frac{1}{m-1}\sum_{i=1}^{m}\left(x^{(i)} - \bar{x}\right)\left(y^{(i)} - \bar{y}\right) \\ \text{Sample covariance matrix:} \quad & S = \begin{bmatrix} S_x & S_{xy} \\ S_{yx} & S_y \end{bmatrix} \\ \text{Sample correlation coefficient:} \quad & r = \frac{S_{xy}}{\sqrt{S_{xx} \cdot S_{yy}}} \end{align*}
- Strength of the linear relationship between two variables, x and y
- Assume
x_1 \leq x_2 \leq \cdots \leq x_n
y_1 \leq y_2 \leq \cdots \leq y_n
1.5.2. Correlation Coefficient
- +1 \rightarrow close to a straight line
- -1 \rightarrow close to a straight line
- Indicates how close the points are to a straight line, but
- gives no information on the slope
- 0 \leq \lvert \text{correlation coefficient} \rvert \leq 1
- 0 (uncorrelated) \; \longleftrightarrow \; 1 (linearly correlated)
- Does not tell anything about causality
1.5.3. Correlation Coefficient Plot
- Plots correlation coefficients among pairs of variables
- http://rpsychologist.com/d3/correlation/
1.5.4. Covariance Matrix
\Sigma = \begin{bmatrix} E[(X_1-\mu_1)(X_1-\mu_1)] & E[(X_1-\mu_1)(X_2-\mu_2)] & \cdots & E[(X_1-\mu_1)(X_n-\mu_n)] \\ E[(X_2-\mu_2)(X_1-\mu_1)] & E[(X_2-\mu_2)(X_2-\mu_2)] & \cdots & E[(X_2-\mu_2)(X_n-\mu_n)] \\ \vdots & \vdots & \ddots & \vdots \\ E[(X_n-\mu_n)(X_1-\mu_1)] & E[(X_n-\mu_n)(X_2-\mu_2)] & \cdots & E[(X_n-\mu_n)(X_n-\mu_n)] \end{bmatrix}
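As a quick numerical illustration of the covariance matrix, here is a sketch with made-up, correlated data (np.cov is not used elsewhere in this notebook; it applies the 1/(m-1) normalization by default).
# sample covariance matrix of two (hypothetical) correlated variables
m = 300
x = np.random.randn(m)
y = 0.5*x + 0.5*np.random.randn(m)

S = np.cov(x, y)          # 2 x 2 sample covariance matrix (divides by m-1)
r = np.corrcoef(x, y)     # corresponding correlation coefficient matrix

print(S, '\n')
print(r)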
# correlation coefficient
m = 300
x = np.random.rand(m)
y = np.random.rand(m)
xo = np.sort(x)
yo = np.sort(y)
yor = -np.sort(-y)
plt.figure(figsize = (6, 6))
plt.plot(x, y, 'ko', label = 'random')
plt.plot(xo, yo, 'ro', label = 'sorted')
plt.plot(xo, yor, 'bo', label = 'reversely ordered')
plt.xticks([])
plt.yticks([])
plt.xlabel('x', fontsize = 20)
plt.ylabel('y', fontsize = 20)
plt.axis('equal')
plt.legend(fontsize = 12)
plt.show()
print(np.corrcoef(x,y), '\n')
print(np.corrcoef(xo,yo), '\n')
print(np.corrcoef(xo,yor))
# correlation coefficient
m = 300
x = 2*np.random.randn(m)
y = np.random.randn(m)
xo = np.sort(x)
yo = np.sort(y)
yor = -np.sort(-y)
plt.figure(figsize = (6, 6))
plt.plot(x, y, 'ko', label = 'random')
plt.plot(xo, yo, 'ro', label = 'sorted')
plt.plot(xo, yor, 'bo', label = 'reversely ordered')
plt.xticks([])
plt.yticks([])
plt.xlabel('x', fontsize = 20)
plt.ylabel('y', fontsize = 20)
plt.axis('equal')
plt.legend(fontsize = 12)
plt.show()
print(np.corrcoef(x,y), '\n')
print(np.corrcoef(xo,yo), '\n')
print(np.corrcoef(xo,yor))
import seaborn as sns
import pandas as pd
d = {'col. 1': x, 'col. 2': xo, 'col. 3': yo, 'col. 4': yor}
df = pd.DataFrame(data = d)
sns.pairplot(df, height = 1.5, aspect = 1)
plt.show()
2. Principal Component Analysis (PCA)
Motivation: Can we describe high-dimensional data in a "simpler" way?
→ Dimension reduction without losing too much information
→ Find a low-dimensional, yet useful representation of the data
- Why dimensionality reduction?
- insights into the low-dimensional structures in the data (visualization)
- Fewer dimensions ⇒ less chance of overfitting ⇒ better generalization
- Speeding up learning algorithms
- Most algorithms scale badly with increasing data dimensionality
- Less storage requirements (data compression)
- Note: Dimensionality Reduction is different from Feature Selection
- ... although the goals are similar
- Dimensionality reduction is more like “Feature Extraction”
- Constructing a small set of new features from the original features
- How?
idea: highly correlated data contains redundant features
Each example x has 2 features \{x_1, x_2\}
Consider ignoring the feature x_2 for each example
Each 2-dimensional example x now becomes 1-dimensional x = \{x_1\}
Are we losing much information by throwing away x_2?
Yes, the data has substantial variance along both features (i.e., both axes)
Now consider a change of axes
Each example x has 2 features \{u_1, u_2\}
Consider ignoring the feature u_2 for each example
Each 2-dimensional example x now becomes 1-dimensional x = \{u_1\}
Are we losing much information by throwing away u_2?
No. Most of the data spread is along u_1 (very little variance along u_2)
- Data \rightarrow projection onto unit vector \hat{u}_1
- PCA is used when we want projections capturing maximum variance directions
- Principal Components (PC): directions of maximum variability in the data
- Roughly speaking, PCA does a change of axes that can represent the data in a succinct manner
- HOW?
- Maximize variance (most separable)
- Minimize the sum-of-squares (minimum squared error)
2.1. PCA Algorithm
2.1.1. Pre-processing
- Given data
x^{(i)} = \begin{bmatrix} x_1^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix}, \qquad X = \begin{bmatrix} \cdots & (x^{(1)})^T & \cdots \\ \cdots & (x^{(2)})^T & \cdots \\ & \vdots & \\ \cdots & (x^{(m)})^T & \cdots \end{bmatrix}
- Shifting (zero mean) and rescaling (unit variance)
- Shift to zero mean
\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad x^{(i)} \leftarrow x^{(i)} - \mu \quad (\text{zero mean})
- [optional] Rescaling (unit variance)
\sigma_j^2 = \frac{1}{m-1}\sum_{i=1}^{m}\left(x_j^{(i)}\right)^2, \qquad x_j^{(i)} \leftarrow \frac{x_j^{(i)}}{\sigma_j}
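A minimal NumPy sketch of this pre-processing step (zero-mean shift plus the optional rescaling), assuming the data matrix X stores one observation per row; the generated data here is only illustrative.
# pre-processing sketch: shift to zero mean, then (optionally) rescale to unit variance
X = np.random.multivariate_normal([2, 3], [[3, 1.5], [1.5, 1]], 1000)

mu = np.mean(X, axis = 0)
X = X - mu                                  # zero mean

sigma = np.std(X, axis = 0, ddof = 1)
X = X/sigma                                 # [optional] unit variance

print(np.mean(X, axis = 0))                 # ~ 0
print(np.std(X, axis = 0, ddof = 1))        # ~ 1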
2.1.2. Maximize Variance
Find the unit vector u that maximizes the variance of the projections
Note: m \approx m-1 for big data
\begin{align*} \text{variance of projected data} &= \frac{1}{m}\sum_{i=1}^{m}\left(u^Tx^{(i)}\right)^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)T}u\right)^2 \\ &= \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)T}u\right)^T\left(x^{(i)T}u\right) = \frac{1}{m}\sum_{i=1}^{m}u^Tx^{(i)}x^{(i)T}u \\ &= u^T\left(\frac{1}{m}\sum_{i=1}^{m}x^{(i)}x^{(i)T}\right)u = u^TSu \qquad \left(S = \frac{1}{m}X^TX: \text{sample covariance matrix}\right) \end{align*}
- In an optimization form
\begin{align*} \text{maximize} \quad & u^TSu \\ \text{subject to} \quad & u^Tu = 1 \end{align*}
\begin{align*} u^TSu = u^T\lambda u = \lambda u^Tu = \lambda \qquad & (\text{eigen-analysis}: \; Su = \lambda u) \\ \implies & \text{pick the largest eigenvalue } \lambda_1 \text{ of the covariance matrix } S \\ \implies & u = u_1 \text{ is the eigenvector corresponding to } \lambda_1 \\ \implies & u_1 \text{ is the first principal component (the direction of highest variance in the data)} \end{align*}
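The following small check (illustrative only, with synthetic data) confirms numerically that the projected variance u^TSu is largest along the leading eigenvector of S, compared with a random unit vector.
# numerical check: u'Su is maximized by the leading eigenvector of S
X = np.random.multivariate_normal([0, 0], [[3, 1.5], [1.5, 1]], 5000)   # zero-mean data

S = X.T @ X/len(X)                          # sample covariance (1/m version)
D, U = np.linalg.eigh(S)                    # eigh returns eigenvalues in ascending order

u1 = U[:, -1]                               # eigenvector of the largest eigenvalue
u_rand = np.random.randn(2)
u_rand = u_rand/np.linalg.norm(u_rand)      # a random unit vector

print(u1 @ S @ u1, D[-1])                   # projected variance along u1 equals lambda_1
print(u_rand @ S @ u_rand)                  # no larger than lambda_1 for any unit vector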
2.1.3. Minimize the Sum-of-Squared Error
\begin{align*} \lVert e^{(i)} \rVert^2 &= \lVert x^{(i)} \rVert^2 - \left(x^{(i)T}u\right)^2 \\ &= \lVert x^{(i)} \rVert^2 - \left(x^{(i)T}u\right)^T\left(x^{(i)T}u\right) = \lVert x^{(i)} \rVert^2 - u^Tx^{(i)}x^{(i)T}u \end{align*}
\begin{align*} \frac{1}{m}\sum_{i=1}^{m}\lVert e^{(i)} \rVert^2 &= \frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)} \rVert^2 - \frac{1}{m}\sum_{i=1}^{m}u^Tx^{(i)}x^{(i)T}u \\ &= \frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)} \rVert^2 - u^T\left(\frac{1}{m}\sum_{i=1}^{m}x^{(i)}x^{(i)T}\right)u \end{align*}
- In an optimization form
\text{minimize} \; \underbrace{\frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)} \rVert^2}_{\text{constant given } x^{(i)}} - \underbrace{u^T\left(\frac{1}{m}\sum_{i=1}^{m}x^{(i)}x^{(i)T}\right)u}_{\text{maximize}} \; \implies \; \text{maximize} \; u^T\left(\frac{1}{m}\sum_{i=1}^{m}x^{(i)}x^{(i)T}\right)u = \max \; u^TSu
\therefore \; \text{minimizing the sum-of-squared error is equivalent to maximizing the variance of the projected data}
2.1.4. Dimension Reduction Method (n \rightarrow k)
Choose top k (orthonormal) eigenvectors, U = [u_1, u_2, \cdots, u_k]
Project x_i onto span \{ u_1, u_2, \cdots, u_k\}
z^{(i)} = \begin{bmatrix} u_1^Tx^{(i)}\\ u_2^Tx^{(i)}\\ \vdots \\ u_k^Tx^{(i)}\\ \end{bmatrix} \;\text{ or }\; z = U^{T}x
- Pictorial summary of PCA
x^{(i)} \rightarrow \text{projection onto unit vector } u \implies u^Tx^{(i)} = \text{distance from the origin along } u
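A minimal sketch of this n \rightarrow k projection (here 3 \rightarrow 2, with synthetic data); the Python code in section 2.3 walks through the 2 \rightarrow 1 case in more detail.
# dimension reduction n -> k: stack the top-k eigenvectors and project, z = U^T x
X = np.random.multivariate_normal(np.zeros(3),
                                  [[3.0, 1.5, 0.5],
                                   [1.5, 1.0, 0.2],
                                   [0.5, 0.2, 0.3]], 1000)   # hypothetical 3-D data

S = X.T @ X/len(X)                 # covariance of the (zero-mean) data
D, U = np.linalg.eigh(S)
U = U[:, ::-1]                     # reorder columns: largest eigenvalue first

k = 2
Z = X @ U[:, :k]                   # each row is z = U^T x, 3-D -> 2-D
print(Z.shape)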
2.2.2. Eigenvectors
- From the given basis \{\hat{x}_1, \hat{x}_2\} to the transformed basis \{\hat{u}_1, \hat{u}_2\}
[\hat{u}_1\; \hat{u}_2]=[\hat{x}_1 \; \hat{x}_2]\begin{bmatrix} c_1 & c_3 \\c_2 & c_4\end{bmatrix}
2.3. Python Codes
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# data generation
m = 5000
mu = np.array([0, 0])
sigma = np.array([[3, 1.5],
[1.5, 1]])
X = np.random.multivariate_normal(mu, sigma, m)
X = np.asmatrix(X)
fig = plt.figure(figsize = (6, 4))
plt.plot(X[:,0], X[:,1], 'k.', alpha = 0.3)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
S = 1/(m-1)*X.T*X
D, U = np.linalg.eig(S)
idx = np.argsort(-D)
D = D[idx]
U = U[:,idx]
print(D, '\n')
print(U)
h = U[1,0]/U[0,0]
xp = np.arange(-6, 6, 0.1)
yp = h*xp
fig = plt.figure(figsize = (6, 4))
plt.plot(X[:,0], X[:,1], 'k.', alpha = 0.3)
plt.plot(xp, yp, 'r', linewidth = 3)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
Z = X*U[:,0]
plt.figure(figsize = (6, 4))
plt.hist(Z, 51)
plt.show()
Z = X*U[:,1]
plt.figure(figsize = (6, 4))
plt.hist(Z, 51)
plt.show()
from sklearn.decomposition import PCA
X = np.array(X)
pca = PCA(n_components = 1)
pca.fit(X)
u = pca.transform(X)
plt.figure(figsize = (6, 4))
plt.hist(u, 51)
plt.show()
%%html
<center><iframe
width="560" height="315" src="https://www.youtube.com/embed/Pkit-64g0eU" frameborder="0" allowfullscreen>
</iframe></center>
%%html
<center><iframe
width="560" height="315" src="https://www.youtube.com/embed/x4lvjVjUUqg" frameborder="0" allowfullscreen>
</iframe></center>
%%html
<center><iframe
width="560" height="315" src="https://www.youtube.com/embed/2t62WkNIqxY" frameborder="0" allowfullscreen>
</iframe></center>
x^{(i)} = \begin{bmatrix} x \text{ in camera 1} \\ y \text{ in camera 1} \\ x \text{ in camera 2} \\ y \text{ in camera 2} \\ x \text{ in camera 3} \\ y \text{ in camera 3} \end{bmatrix},\qquad X_{m \times 6} = \begin{bmatrix} \cdots & (x^{(1)})^T & \cdots \\ \cdots & (x^{(2)})^T & \cdots \\ & \vdots & \\ & \vdots & \\ \cdots & (x^{(m)})^T & \cdots \\ \end{bmatrix}
Download files
from google.colab import drive
drive.mount('/content/drive')
from six.moves import cPickle
X = cPickle.load(open('/content/drive/MyDrive/ML_Colab/ML_data/pca_spring.pkl','rb'))
X = np.asmatrix(X.T)
print(X.shape)
m = X.shape[0]
plt.figure(figsize = (6, 4))
plt.subplot(1,3,1)
plt.plot(X[:, 0], -X[:, 1], 'r')
plt.axis('equal')
plt.title('Camera 1')
plt.subplot(1,3,2)
plt.plot(X[:, 2], -X[:, 3], 'b')
plt.axis('equal')
plt.title('Camera 2')
plt.subplot(1,3,3)
plt.plot(X[:, 4], -X[:, 5], 'k')
plt.axis('equal')
plt.title('Camera 3')
plt.show()
X = X - np.mean(X, axis = 0)
S = 1/(m-1)*X.T*X
D, U = np.linalg.eig(S)
idx = np.argsort(-D)
D = D[idx]
U = U[:,idx]
print(D, '\n')
print(U)
plt.figure(figsize = (6, 4))
plt.stem(np.sqrt(D))
plt.grid(alpha = 0.3)
plt.show()
# relative magnitudes of the principal components
Z = X*U
xp = np.arange(0, m)/24 # 24 frame rate
plt.figure(figsize = (6, 4))
plt.plot(xp, Z)
plt.yticks([])
plt.show()
## projected onto the first principal component
# 6 dim -> 1 dim (dim reduction)
# relative magnitude of the first principal component
Z = X*U[:,0]
plt.figure(figsize = (6, 4))
plt.plot(Z)
plt.yticks([])
plt.show()
Reference: John P Cunningham & Byron M Yu, Dimensionality reduction for large-scale neural recordings, Nature Neuroscience 17, 1500–1509 (2014)
3. Fisher Discriminant Analysis (FDA) or Linear Discriminant Analysis (LDA)
3.1. FDA Algorithm
3.1.1. Dimensionality Reduction with Label
Dimensionality reduction with label information (when the ultimate goal is classification/regression)
PCA ignores label information even if it is available
- Only chooses directions of maximum variance
Fisher Discriminant Analysis (FDA) takes into account the label information
- It is also called Linear Discriminant Analysis (LDA)
FDA/LDA projects data while preserving class separation
- Examples from same class are put closely together by the projection
- Examples from different classes are placed far apart by the projection
3.1.2. Projection onto Line \omega
- Linear regression projects each data point
- assume zero mean, otherwise x \leftarrow x - \bar{x}
- \omega_0 = 0
\hat{y} = \langle \omega,x\rangle = \omega^Tx = \omega_1x_1 + \omega_2x_2
- Dimension reduction
x = \begin{bmatrix}x_1 \\ x_2 \end{bmatrix} \rightarrow \hat y \; (\text{scalar})
- Each data point x = \begin{bmatrix}x_1 \\ x_2 \end{bmatrix} is projected onto \omega (projected length on \omega direction)
- For a given \omega, distribution of the projected points \{ \hat y^{(1)}, \cdots, \hat y^{(m)} \} is specified.
Question: Which \omega is better for classification?
3.1.3. Fisher Discriminant Analysis
- Find \omega so that when projected onto \omega,
- the classes are maximally separated (maximize distance between classes)
- Each class is tight (minimize variance of each class)
\max_\omega \frac{(\text{separation of projected means})^2}{\text{sum of within-class variances}}
\implies \max_\omega \frac {\big(\mu_0^T\omega - \mu_1^T\omega \big)^2}{n_0\omega^TS_0\omega + n_1\omega^TS_1\omega}
\omega = \text{arg}\max_\omega \bigg\{ \frac{\big((\mu_0^T - \mu_1^T)\omega\big)^2}{n_0\omega^TS_0\omega + n_1\omega^TS_1\omega} \bigg \}
J(\omega) = \frac {\big((\mu_0^T - \mu_1^T)\omega\big)^2}{\omega^T(n_0S_0 + n_1S_1)\omega} = \frac {(m^T\omega)^2}{\omega^T\Sigma \omega}
\begin{align*} m &\equiv \mu_0 - \mu_1 &\\ \Sigma &\equiv n_0S_0 + n_1S_1 = R^TR \qquad &\text{We can always write } \Sigma \text{ like this, where } R \text{ is a square root matrix} \\ u &\equiv R\omega \rightarrow \omega = R^{-1}u \qquad &\text{Using } R, \text{change the coordinate systems from } \omega \text{ to } u \end{align*}
J(u) = \frac {\big( m^TR^{-1}u\big)^2}{\omega^TR^TR\omega} = \frac {\bigg( \big(R^{-T}m \big)^Tu\bigg)^2}{u^Tu}= \bigg( \left(R^{-T}m \right)^T \frac{u}{\parallel u \parallel} \bigg)^2
J(u) = \bigg( \left(R^{-T}m \right)^T \frac{u}{\lVert u \rVert} \bigg)^2 \; \text{is maximum when} \; u = a\,R^{-T}m
- Why?
- Dot product of a unit vector and another vector is maximum when the two have the same direction.
\begin{align*} u &= aR^{-T}m = aR^{-T}(\mu_0 - \mu_1)\\ \omega &= R^{-1}u = aR^{-1}R^{-T}(\mu_0 - \mu_1) = a\big(R^TR\big)^{-1}(\mu_0 - \mu_1) = a\Sigma^{-1}(\mu_0 - \mu_1) \\ \\ \therefore \; \omega &= a(n_0S_0 + n_1S_1)^{-1}(\mu_0 - \mu_1) \end{align*}
3.2. Python Code
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# generating data set
n0 = 200
n1 = 200
mu = [0, 0]
sigma = [[0.9, -0.4],
[-0.4, 0.3]]
x0 = np.random.multivariate_normal([2.5,2.5], sigma, n0).T    # data in class 0
x1 = np.random.multivariate_normal([1,1], sigma, n1).T        # data in class 1
x0 = np.asmatrix(x0)
x1 = np.asmatrix(x1)
mu0 = np.mean(x0, axis = 1)
mu1 = np.mean(x1, axis = 1)
S0 = 1/(n0 - 1)*(x0 - mu0)*(x0 - mu0).T
S1 = 1/(n1 - 1)*(x1 - mu1)*(x1 - mu1).T
w = (n0*S0 + n1*S1).I*(mu0 - mu1)
print(w)
Projection line and histogram
plt.figure(figsize = (6, 4))
plt.plot(x0[0,:], x0[1,:], 'r.')
plt.plot(x1[0,:], x1[1,:], 'b.')
xp = np.arange(-4, 6, 0.1)
yp = w[1,0]/w[0,0]*xp
plt.plot(xp, yp, 'k')
plt.axis('equal')
plt.ylim([-2, 6])
plt.show()
y1 = x0.T*w
y2 = x1.T*w
plt.figure(figsize = (6, 4))
plt.hist(y1, 21, color = 'r', rwidth = 0.5)
plt.hist(y2, 21, color = 'b', rwidth = 0.5)
plt.show()
Different \omega (x axis)
w = np.array([[1], [0]])
w = np.asmatrix(w)
plt.figure(figsize = (6, 4))
plt.plot(x0[0,:], x0[1,:], 'r.')
plt.plot(x1[0,:], x1[1,:], 'b.')
plt.axhline(0, color = 'k')
plt.axis('equal')
plt.ylim([-2, 6])
plt.show()
y1 = x0.T*w
y2 = x1.T*w
plt.figure(figsize = (6, 4))
plt.hist(y1, 21, color = 'r', rwidth = 0.5)
plt.hist(y2, 21, color = 'b', rwidth = 0.5)
plt.show()
Different \omega (y axis)
w = np.array([[0], [1]])
w = np.asmatrix(w)
plt.figure(figsize = (6, 4))
plt.plot(x0[0,:], x0[1,:], 'r.')
plt.plot(x1[0,:], x1[1,:], 'b.')
plt.axvline(0, color = 'k')
plt.axis('equal')
plt.ylim([-2, 6])
plt.show()
y1 = x0.T*w
y2 = x1.T*w
plt.figure(figsize = (6, 4))
plt.hist(y1, 21, color = 'r', rwidth = 0.5)
plt.hist(y2, 21, color = 'b', rwidth = 0.5)
plt.show()
# reshape data
X = np.vstack([x0.T, x1.T])
y = np.vstack([np.ones([n0, 1]), np.zeros([n1, 1])])
from sklearn import discriminant_analysis
X = np.array(X)
clf = discriminant_analysis.LinearDiscriminantAnalysis()
clf.fit(X, np.ravel(y))
# projection
X_LDA = clf.transform(X)
plt.figure(figsize = (6, 4))
plt.hist(X_LDA[0:200], 21, color = 'r', rwidth = 0.5)
plt.hist(X_LDA[200:400], 21, color = 'b', rwidth = 0.5)
plt.show()
# classifier boundary, not a projection line
clf_w = clf.coef_[0]
clf_w0 = clf.intercept_[0]
print(clf_w)
print(clf_w0)
# projection line
mu0 = np.mean(x0, axis = 1)
mu1 = np.mean(x1, axis = 1)
S0 = 1/(n0 - 1)*(x0 - mu0)*(x0 - mu0).T
S1 = 1/(n1 - 1)*(x1 - mu1)*(x1 - mu1).T
prj_w = (n0*S0 + n1*S1).I*(mu0 - mu1)
xp = np.arange(-4, 6, 0.1)
prj_yp = prj_w[1,0]/prj_w[0,0]*xp
clf_yp = - clf_w[0]/clf_w[1]*xp - clf_w0/clf_w[1]
plt.figure(figsize = (6, 4))
plt.plot(x0[0,:], x0[1,:], 'r.')
plt.plot(x1[0,:], x1[1,:], 'b.')
plt.plot(xp, prj_yp, 'k', label = 'LDA projection line')
plt.plot(xp, clf_yp, 'g', label = 'LDA classification boundary')
plt.legend()
plt.axis('equal')
plt.ylim([-2, 6])
plt.show()
4. Singular Value Decomposition (SVD)
Geometry of Linear Maps
Matrix A (or linear transformation) = rotate + stretch/compress
An extremely important fact: the image of the unit ball under A is an ellipsoid
\begin{align*} S &= \{ x \in \mathbb{R}^n \mid \; \Vert x \Vert \leq 1 \}\\ AS &= \{Ax \mid x\in S\} \end{align*}
4.1. SVD Algorithm
Singular Values and Singular Vectors
the numbers \sigma_1, \cdots, \sigma_n are called the singular values of A (by convention, \sigma_i > 0)
the vectors u_1, \cdots, u_n are unit vectors along the principal semiaxes of AS
the vectors v_1, \cdots, v_n are the preimages of the principal semiaxes, defined so that
A v_i = \sigma_i u_i
\begin{align*} A\begin{bmatrix} v_1& v_2 & \cdots & v_r \end{bmatrix} & =\begin{bmatrix} A v_1 & A v_2 & \cdots & A v_r \end{bmatrix} \\ & =\begin{bmatrix} \sigma_1u_1& \sigma_2u_2 & \cdots & \sigma_ru_r \end{bmatrix} \\ & =\begin{bmatrix} u_1& u_2 & \cdots & u_r \end{bmatrix}\begin{bmatrix} \sigma_1& & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_r \end{bmatrix} \\ \\ \therefore \; AV & = U\Sigma \quad (r \leq m, n) \end{align*}
Thin Singular Value Decomposition
We have A \in \mathbb{R}^{m \times n}, skinny and full rank (i.e., r = n), and
A v_i = \sigma_i u_i \quad \text{for } 1 \leq i \leq n
let
\begin{align*} \hat U &= \begin{bmatrix}u_1 & u_2 & \cdots & u_n \end{bmatrix}\\ \\ \quad \hat \Sigma &= \begin{bmatrix}\sigma_1& & & \\ & \sigma_2 &&\\ &&\ddots&\\ &&&\sigma_n \end{bmatrix}\\ \\ \quad V &= \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix} \end{align*}
in matrix form, the above equation is AV = \hat U \hat \Sigma and since V is orthogonal
A = \hat U \hat \Sigma V^T
called the thin (or reduced) SVD of A
Extending \hat U to a square orthogonal U and padding \hat \Sigma with zeros gives A = U \Sigma V^T, the (full) singular value decomposition of A
Every matrix A has a singular value decomposition.
If A is not full rank, then some of the diagonal entries of \Sigma will be zero
The SVD decomposes the linear map into
rotate by V^T
diagonal scaling by \sigma_i
rotate by U
Note that, unlike the eigen-decomposition, input and output directions are different
- Example
\begin{array}{Icr} A = \begin{bmatrix} 4 & 4 \\ -3 & 3 \end{bmatrix} \end{array} \quad \quad \begin{array}{Icr} \begin{align*} &\textbf{orthonormal basis} \\ &v_1, v_2 \text{ in row space } \mathbb{R}^2 \\ &u_1, u_2 \text{ in column space } \mathbb{R}^2 \\ &\sigma_1 > 0 \\ &\sigma_2 > 0 \end{align*} \end{array}
\begin{array}{lcr} A v_1 = \sigma_1 u_1\\ A v_2 = \sigma_2 u_2 \end{array} \; \implies \; \begin{array}{lcr} AV = U\Sigma \\ A = U\Sigma V^{-1} = U\Sigma V^T \end{array}
- A^TA: (small) positive definite & symmetric
A^TA = V\Sigma^TU^TU\Sigma V^T = V\Sigma^T\Sigma V^T = V\begin{bmatrix} \sigma_1^2& & \\ & \sigma_2^2 & \\ & & \ddots \\ \end{bmatrix}V^T
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
A = np.array([[4, 4],
[-3, 3]])
A = np.asmatrix(A)
[D, V] = np.linalg.eig(A.T*A)
print(V, '\n')
print(D)
- AA^T: (large) positive definite & symmetric
AA^T = U\Sigma V^TV\Sigma^TU^T = U\Sigma\Sigma^TU^T = U\begin{bmatrix} \sigma_1^2& & \\ & \sigma_2^2 & \\ & & \ddots \\ \end{bmatrix}U^T
[D, U] = np.linalg.eig(A*A.T)
print(U, '\n')
print(D)
A = U\Sigma V^T
[U, S, VT] = np.linalg.svd(A, full_matrices = True)
print(U.shape, S.shape, VT.shape, '\n')
print(U, '\n')
print(S, '\n')
print(VT)
4.2. PCA and SVD
- Any real symmetric and positive definite matrix B has an eigen-decomposition
B = S\Lambda S^T
- A real matrix (m\times n) A, where m>n, has the decomposition
A = U\Sigma V^T
- From A (skinny and full rank) we can construct two positive-definite symmetric matrices, AA^T and A^TA
\begin{align*} AA^T &= U\Sigma V^T V \Sigma^T U^T = U \Sigma \Sigma^T U^T \quad (m \times m, \,n\, \text{eigenvalues and }\, m-n \; \text{zeros},\; U\; \text{eigenvectors})\\ A^TA &= V \Sigma^T U^T U\Sigma V^T = V \Sigma^T \Sigma V^T \quad (n \times n, \,n \;\text{eigenvalues},\; V\; \text{eigenvectors}) \end{align*}
- PCA by SVD
B = A^TA = V \Sigma^T \Sigma V^T = V \Lambda V^T
- V is eigenvectors of B = A^TA
- \sigma_i^2 = \lambda_i
- Note that in PCA,
S = \frac{1}{m}X^TX = \left(\frac{X}{\sqrt{m}}\right)^T \frac{X}{\sqrt{m}} = A^TA = V\Lambda V^T \qquad \left(\text{with } A = \frac{X}{\sqrt{m}}\right)
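As a numerical sanity check of this PCA-SVD relationship (a sketch with synthetic data, not from the original notes): the right singular vectors of A = X/\sqrt{m} are the eigenvectors of S, and \sigma_i^2 = \lambda_i.
# PCA via SVD: right singular vectors of X/sqrt(m) = eigenvectors of S
m = 5000
X = np.random.multivariate_normal([0, 0], [[3, 1.5], [1.5, 1]], m)   # zero-mean data

A = X/np.sqrt(m)
_, sig, VT = np.linalg.svd(A, full_matrices = False)

S = X.T @ X/m
lamb, V = np.linalg.eigh(S)

print(sig**2, '\n')            # sigma_i^2
print(lamb[::-1], '\n')        # eigenvalues of S, largest first (same values)
print(VT, '\n')                # right singular vectors (as rows)
print(V[:, ::-1].T)            # eigenvectors of S (same rows, up to sign)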
%%html
<center><iframe src="https://www.youtube.com/embed/Nx0lRBaXoz4?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
4.3. Low Rank Approximation: Dimension Reduction
- In the 2D case
\begin{align*} x & = c_1 v_1 + c_2 v_2 \\ & = \langle x, v_1 \rangle v_1 + \langle x, v_2 \rangle v_2\\ & = ( v_1^{T} x) v_1 + ( v_2^T x) v_2 \\ \\ Ax & = c_1 A v_1 + c_2 A v_2 \\ & = c_1\sigma_1 u_1 + c_2 \sigma_2 u_2 \\ & = u_1 \sigma_1 v_1^Tx + u_2\sigma_2 v_2^T x \\ \\ & = \begin{bmatrix} u_1 & u_2 \end{bmatrix}\begin{bmatrix}\sigma_1 & 0\\ 0 & \sigma_2 \end{bmatrix}\begin{bmatrix} v_1^T \\ v_2^T \end{bmatrix} x \\ \\ & = U \Sigma V^Tx \\\\ & \approx u_1\sigma_1 v_1^Tx \quad (\text{if } \sigma_1 \gg \sigma_2) \end{align*}
- Low rank approximation of matrix A in general
\begin{align*} A &= U_r \Sigma_r V_r^T = \sum_{i=1}^{r}\sigma_i u_i v_i^T\\ \\ \tilde A &= U_k \Sigma_k V_k^T = \sum_{i=1}^{k}\sigma_i u_i v_i^T \qquad (k \leq r) \end{align*}
A = np.array([[11.08, 6.82, 1.76, -6.82],
[2.50, -1.01, -2.60, 1.19],
[-4.88, -5.07, -3.21, 5.20],
[-0.49, 1.52, 2.07, -1.66],
[-14.04, -12.40, -6.66, 12.65],
[0.27, -8.51, -10.19, 9.15],
[9.53, -9.84, -17.00, 11.00],
[-12.01, 3.64, 11.10, -4.48]])
[U, S, VT] = np.linalg.svd(A, full_matrices = True)
U = np.asmatrix(U)
S = np.asmatrix(np.diag(S))
VT = np.asmatrix(VT)
print(U.shape, S.shape, VT.shape, '\n')
print(U, '\n')
print(S, '\n')
print(VT)
[U, S, VT] = np.linalg.svd(A, full_matrices = False)
U = np.asmatrix(U)
S = np.asmatrix(np.diag(S))
VT = np.asmatrix(VT)
print(U.shape, S.shape, VT.shape, '\n')
print(U, '\n')
print(S, '\n')
print(VT)
Question
How do we determine a good k so that we can efficiently compress A without losing too much of the information in A?
sig_val = np.diag(S)
plt.figure(figsize = (6, 4))
plt.stem(range(1,5), sig_val)
plt.xticks(range(1,5))
plt.title('Singular Values')
plt.show()
k = 2
low_U = U[:, :k]
low_S = S[:k, :k]
low_VT = VT[:k, :]
A_approx = low_U*low_S*low_VT
print(low_U.shape, low_S.shape, low_VT.shape, '\n')
print('A:\n', A ,'\n')
print('A_approx:\n', A_approx, '\n')
print('low_U:\n', low_U, '\n')
print('low_VT:\n', low_VT)
4.3.2. Example: Image Approximation
Learn how to reduce the amount of storage needed for an image using the singular value decomposition.
Approximation of A
\tilde A = U_k \Sigma_k V_k^T = \sum_{i=1}^{k}\sigma_i u_i v_i^T \qquad (k \leq r)
Download the image file
from google.colab import drive
drive.mount('/content/drive')
import cv2
img = cv2.imread('/content/drive/MyDrive/ML_Colab/ML_data/postech.jpg', 0)
print(img.shape)
plt.figure(figsize = (6, 6))
plt.imshow(img, 'gray')
plt.axis('off')
plt.show()
A = img
[U, S, VT] = np.linalg.svd(A)
U = np.asmatrix(U)
S = np.asmatrix(np.diag(S))
VT = np.asmatrix(VT)
print(U.shape, S.shape, VT.shape, '\n')
print(U, '\n')
print(S, '\n')
sig_val = np.diag(S)
plt.figure(figsize = (6, 4))
plt.stem(sig_val)
plt.title('Singular Values')
plt.show()
Approximated image when k = 20
k = 20
low_U = U[:, :k]
low_S = S[:k, :k]
low_VT = VT[:k, :]
A_hat = low_U*low_S*low_VT
plt.figure(figsize = (8, 4))
plt.subplot(1,2,1), plt.imshow(A, 'gray'), plt.axis('off'), plt.title('Original image')
plt.subplot(1,2,2), plt.imshow(A_hat, 'gray'), plt.axis('off'), plt.title('Approximated image w/ rank = 20')
plt.show()
Approximated images with k varied
K = [1, 2, 3, 4, 10, 20, 30, 600]
plt.figure(figsize = (8, 5))
for i in range(len(K)):
    low_U = U[:, :K[i]]
    low_S = S[:K[i], :K[i]]
    low_VT = VT[:K[i], :]
    A_hat = low_U*low_S*low_VT

    plt.subplot(2,4,i+1), plt.imshow(A_hat, 'gray'), plt.axis('off')
    plt.title('rank {}'.format(K[i]))

plt.show()
4.3.3. Eigenface
Further edited by Tim Chartier, Davidson College, 2007
Best approximation of x with the eigenfaces U (eigenvectors)
\tilde{x} = \sum_{i=1}^{k} \langle x,\hat{u}_i\rangle \,\hat{u}_i
12 USA Presidents
Download the face file
from scipy import io
faces = io.loadmat('/content/drive/MyDrive/ML_Colab/ML_data/faces.mat')
face_mat = faces['faces']
row, col, m = face_mat.shape
plt.figure(figsize = (6, 8))
for i in range(m):
    plt.subplot(4, 3, i+1), plt.axis('off')
    plt.imshow(face_mat[:, :, i], 'gray')
plt.show()
SVD
A = np.reshape(face_mat, [row*col, m])
[U, S, VT] = np.linalg.svd(A, full_matrices = False)
U = np.asmatrix(U)
S = np.asmatrix(np.diag(S))
VT = np.asmatrix(VT)
print(U.shape, S.shape, VT.shape, '\n')
Eigenface
plt.figure(figsize = (6, 8))
for i in range(m):
    plt.subplot(4, 3, i+1), plt.axis('off')
    plt.imshow(U[:, i].reshape(row, col), 'gray')
plt.show()
Singular values
sig_val = np.diag(S)
plt.figure(figsize = (6, 4))
plt.stem(sig_val)
plt.title('Singular Values')
plt.show()
Transmit the compressed information
- Assume that both parties have the eigenfaces
- The only thing to transmit is the corresponding weights of the eigenfaces
obama = face_mat[:,:,11]
plt.figure(figsize = (3, 4))
plt.imshow(obama, 'gray')
plt.axis('off')
plt.show()
# sender
x = np.reshape(obama, [row*col, 1])
eigenface_decom = U.T*x
print(obama.shape, eigenface_decom.shape)
# receiver
reconst = U*eigenface_decom
plt.figure(figsize = (3, 4))
plt.imshow(reconst.reshape(row, col), 'gray')
plt.axis('off')
plt.show()
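To actually compress, only the first k weights need to be transmitted. Here is a sketch of that idea, reusing U, row, col, and obama from the cells above (k = 6 is an arbitrary choice).
# transmit only k weights instead of all row*col pixel values
k = 6
x = np.reshape(obama, [row*col, 1])
eigenface_decom_k = U[:, :k].T*x          # k weights to transmit

# receiver reconstructs from the k weights and the shared eigenfaces
reconst_k = U[:, :k]*eigenface_decom_k

plt.figure(figsize = (3, 4))
plt.imshow(reconst_k.reshape(row, col), 'gray')
plt.axis('off')
plt.title('k = {}'.format(k))
plt.show()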
Seeing through a Disguise with SVD
- Acquire new image
- original image disguised
- download the image file
jfk = cv2.imread('/content/drive/MyDrive/ML_Colab/ML_data/jfk.png', 0)
plt.figure(figsize = (3, 4))
plt.imshow(jfk, 'gray')
plt.axis('off')
plt.show()
Projection onto U
x = np.reshape(jfk, [row*col, 1])
eigenface_decom = U.T*x
reconst = U*eigenface_decom
plt.figure(figsize = (3, 4))
plt.imshow(reconst.reshape(row, col), 'gray')
plt.axis('off')
plt.show()