Dimension Reduction
Table of Contents
- 1. Statistics
  - 1.1. Populations and Samples
    - 1.1.1. How to Generate Random Numbers (Samples or Data)
    - 1.1.2. Histogram: Graphical Representation of Data Distribution
  - 1.2. Inference
  - 1.3. Law of Large Numbers
  - 1.4. Central Limit Theorem
  - 1.5. Multivariate Statistics
- 2. Principal Component Analysis (PCA)
  - 2.1. PCA Algorithm
    - 2.1.1. Pre-processing
    - 2.1.2. Maximize Variance
    - 2.1.3. Minimize the Sum-of-Squared Error
    - 2.1.4. Dimension Reduction Method (n \rightarrow k)
  - 2.2. PCA Visualization
  - 2.3. Python Codes
  - 2.4. PCA Example
- 3. Fisher Discriminant Analysis (FDA) or Linear Discriminant Analysis (LDA)
  - 3.1. FDA Algorithm
    - 3.1.1. Dimensionality Reduction with Label
    - 3.1.2. Projection onto Line \omega
    - 3.1.3. Fisher Discriminant Analysis
  - 3.2. Python Code
- 4. Singular Value Decomposition (SVD)
A population includes all the elements from a set of data.
A parameter is a quantity computed from a population.
- mean, \mu
- variance, \sigma^2
A sample is a subset of the population.
- one or more observations
A statistic is a quantity computed from a sample.
- sample mean, \bar{x}
- sample variance, s^2
- sample correlation, S_{xy}
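To make these definitions concrete, here is a minimal sketch (hypothetical numbers, not part of the original notes) contrasting population parameters with the corresponding sample statistics.
import numpy as np

# a hypothetical finite population and its parameters
population = np.arange(1, 10001)
mu = np.mean(population)                 # population mean (parameter)
sigma2 = np.var(population)              # population variance (parameter)

# a sample drawn from the population and its statistics
sample = np.random.choice(population, size = 100, replace = False)
xbar = np.mean(sample)                   # sample mean (statistic)
s2 = np.var(sample, ddof = 1)            # sample variance (statistic, divides by m-1)

print(mu, sigma2)
print(xbar, s2)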
1.1.1. How to Generate Random Numbers (Samples or Data)
- Data sampled from population/process/generative model
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
## random number generation (1D)
m = 1000

# uniform distribution U(0,1)
x1 = np.random.rand(m, 1)

# uniform distribution U(a,b)
a = 1
b = 5
x2 = a + (b - a)*np.random.rand(m, 1)

# standard normal (Gaussian) distribution N(0,1^2)
# x3 = np.random.normal(0, 1, m)
x3 = np.random.randn(m, 1)

# normal distribution N(5,2^2)
x4 = 5 + 2*np.random.randn(m, 1)

# random integers in {1, ..., 5} (upper bound is exclusive)
x5 = np.random.randint(1, 6, size = (1, m))
1.2. Inference
- The true population or process is modeled probabilistically.
- Sampling supplies us with realizations from the probability model.
- We compute something from the sample, but recognize that we could just as easily have gotten a different set of realizations.
- We want to infer the characteristics of the true probability model from our one sample.
1.3. Law of Large Numbers
- The sample mean converges to the population mean as the sample size gets large
\bar{x} \rightarrow \mu_x \quad \text{as} \quad m \rightarrow \infty
- True for any probability density function
- sample mean and sample variance
\bar{x} = \frac{x_1 + x_2 + \cdots + x_m}{m}, \qquad s^2 = \frac{\sum_{i=1}^{m} (x_i - \bar{x})^2}{m-1}
- suppose x \sim U[0,1]
# statistics
# numerically understand statistics
m = 100
x = np.random.rand(m, 1)

# xbar = 1/m*np.sum(x, axis = 0)
# np.mean(x, axis = 0)

xbar = 1/m*np.sum(x)
varbar = 1/(m - 1)*np.sum((x - xbar)**2)

print(xbar)
print(np.mean(x))
print(varbar)
print(np.var(x))    # note: np.var divides by m by default; np.var(x, ddof = 1) divides by m-1
# various sample sizes m
m = np.arange(10, 2000, 20)
means = []

for i in m:
    x = np.random.normal(10, 30, i)
    means.append(np.mean(x))

plt.figure(figsize = (6, 4))
plt.plot(m, means, 'bo', markersize = 4)
plt.axhline(10, c = 'k', linestyle = 'dashed')
plt.xlabel('# of samples (= sample size)', fontsize = 15)
plt.ylabel('sample mean', fontsize = 15)
plt.ylim([0, 20])
plt.show()
1.4. Central Limit Theorem
- The sample mean (not the samples themselves) will be approximately normally distributed as the sample size m \rightarrow \infty
\bar{x} = \frac{x_1 + x_2 + \cdots + x_m}{m}
- More samples provide more confidence (or less uncertainty)
- Note: true regardless of the distribution of the population
\bar{x} \rightarrow N\left(\mu_x, \left(\frac{\sigma}{\sqrt{m}}\right)^2\right)
N = 100
m = np.array([10, 40, 160])     # sample sizes m

S1 = []                         # sample means (or sample averages)
S2 = []
S3 = []

for i in range(N):
    S1.append(np.mean(np.random.rand(m[0], 1)))
    S2.append(np.mean(np.random.rand(m[1], 1)))
    S3.append(np.mean(np.random.rand(m[2], 1)))

plt.figure(figsize = (6, 4))
plt.subplot(1,3,1), plt.hist(S1, 21), plt.xlim([0, 1]), plt.title('m = ' + str(m[0])), plt.yticks([])
plt.subplot(1,3,2), plt.hist(S2, 21), plt.xlim([0, 1]), plt.title('m = ' + str(m[1])), plt.yticks([])
plt.subplot(1,3,3), plt.hist(S3, 21), plt.xlim([0, 1]), plt.title('m = ' + str(m[2])), plt.yticks([])
plt.show()
1.5. Multivariate Statistics
x^{(i)} = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \\ \vdots \end{bmatrix}, \qquad X = \begin{bmatrix} - & (x^{(1)})^T & - \\ - & (x^{(2)})^T & - \\ & \vdots & \\ - & (x^{(m)})^T & - \end{bmatrix}
- m observations (x^{(1)}, x^{(2)}, \cdots, x^{(m)})
\text{sample mean} \quad \bar{x} = \frac{x^{(1)} + x^{(2)} + \cdots + x^{(m)}}{m} = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}
\text{sample variance} \quad S^2 = \frac{1}{m-1}\sum_{i=1}^{m}\left(x^{(i)} - \bar{x}\right)^2
\left(\text{Note: population variance} \quad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x^{(i)} - \mu\right)^2\right)
1.5.1. Correlation of Two Random Variables
\begin{align*} \text{Sample variance:} \quad & S_x = \frac{1}{m-1}\sum_{i=1}^{m}\left(x^{(i)} - \bar{x}\right)^2 \\ \text{Sample covariance:} \quad & S_{xy} = \frac{1}{m-1}\sum_{i=1}^{m}\left(x^{(i)} - \bar{x}\right)\left(y^{(i)} - \bar{y}\right) \\ \text{Sample covariance matrix:} \quad & S = \begin{bmatrix} S_x & S_{xy} \\ S_{yx} & S_y \end{bmatrix} \\ \text{Sample correlation coefficient:} \quad & r = \frac{S_{xy}}{\sqrt{S_{xx} \cdot S_{yy}}} \end{align*}
- Strength of the linear relationship between two variables, x and y
- Assume
x_1 \leq x_2 \leq \cdots \leq x_n
y_1 \leq y_2 \leq \cdots \leq y_n
1.5.2. Correlation Coefficient
- +1 \rightarrow close to a straight line
- -1 \rightarrow close to a straight line
- Indicates how close the points are to a straight line, but
- gives no information on the slope
- 0 \leq \lvert \text{correlation coefficient} \rvert \leq 1
- 0 (uncorrelated) \; \longleftrightarrow \; 1 (linearly correlated)
- Does not tell anything about causality
1.5.3. Correlation Coefficient Plot
- Plots correlation coefficients among pairs of variables
- http://rpsychologist.com/d3/correlation/
1.5.4. Covariance Matrix
\Sigma = \begin{bmatrix} E[(X_1-\mu_1)(X_1-\mu_1)] & E[(X_1-\mu_1)(X_2-\mu_2)] & \cdots & E[(X_1-\mu_1)(X_n-\mu_n)] \\ E[(X_2-\mu_2)(X_1-\mu_1)] & E[(X_2-\mu_2)(X_2-\mu_2)] & \cdots & E[(X_2-\mu_2)(X_n-\mu_n)] \\ \vdots & \vdots & \ddots & \vdots \\ E[(X_n-\mu_n)(X_1-\mu_1)] & E[(X_n-\mu_n)(X_2-\mu_2)] & \cdots & E[(X_n-\mu_n)(X_n-\mu_n)] \end{bmatrix}
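As a quick numerical illustration of the covariance matrix, here is a sketch with made-up, correlated data (np.cov is not used elsewhere in this notebook; it applies the 1/(m-1) normalization by default).
# sample covariance matrix of two (hypothetical) correlated variables
m = 300
x = np.random.randn(m)
y = 0.5*x + 0.5*np.random.randn(m)

S = np.cov(x, y)          # 2 x 2 sample covariance matrix (divides by m-1)
r = np.corrcoef(x, y)     # corresponding correlation coefficient matrix

print(S, '\n')
print(r)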
# correlation coefficient
m = 300
x = np.random.rand(m)
y = np.random.rand(m)
xo = np.sort(x)
yo = np.sort(y)
yor = -np.sort(-y)
plt.figure(figsize = (6, 6))
plt.plot(x, y, 'ko', label = 'random')
plt.plot(xo, yo, 'ro', label = 'sorted')
plt.plot(xo, yor, 'bo', label = 'reversely ordered')
plt.xticks([])
plt.yticks([])
plt.xlabel('x', fontsize = 20)
plt.ylabel('y', fontsize = 20)
plt.axis('equal')
plt.legend(fontsize = 12)
plt.show()
print(np.corrcoef(x,y), '\n')
print(np.corrcoef(xo,yo), '\n')
print(np.corrcoef(xo,yor))
# correlation coefficient
m = 300
x = 2*np.random.randn(m)
y = np.random.randn(m)
xo = np.sort(x)
yo = np.sort(y)
yor = -np.sort(-y)
plt.figure(figsize = (6, 6))
plt.plot(x, y, 'ko', label = 'random')
plt.plot(xo, yo, 'ro', label = 'sorted')
plt.plot(xo, yor, 'bo', label = 'reversely ordered')
plt.xticks([])
plt.yticks([])
plt.xlabel('x', fontsize = 20)
plt.ylabel('y', fontsize = 20)
plt.axis('equal')
plt.legend(fontsize = 12)
plt.show()
print(np.corrcoef(x,y), '\n')
print(np.corrcoef(xo,yo), '\n')
print(np.corrcoef(xo,yor))
import seaborn as sns
import pandas as pd
d = {'col. 1': x, 'col. 2': xo, 'col. 3': yo, 'col. 4': yor}
df = pd.DataFrame(data = d)
sns.pairplot(df, height = 1.5, aspect = 1)
plt.show()
2. Principal Component Analysis (PCA)
Motivation: Can we describe high-dimensional data in a "simpler" way?
→ Dimension reduction without losing too much information
→ Find a low-dimensional, yet useful representation of the data
- Why dimensionality reduction?
- insights into the low-dimensional structures in the data (visualization)
- Fewer dimensions ⇒ less chance of overfitting ⇒ better generalization
- Speeding up learning algorithms
- Most algorithms scale badly with increasing data dimensionality
- Less storage requirements (data compression)
- Note: Dimensionality Reduction is different from Feature Selection
- ... although the goals are similar
- Dimensionality reduction is more like “Feature Extraction”
- Constructing a small set of new features from the original features
- How?
idea: highly correlated data contains redundant features
Each example x has 2 features \{x_1, x_2\}
Consider ignoring the feature x_2 for each example
Each 2-dimensional example x now becomes 1-dimensional x = \{x_1\}
Are we losing much information by throwing away x_2?
Yes, the data has substantial variance along both features (i.e., both axes)
Now consider a change of axes
Each example x has 2 features \{u_1, u_2\}
Consider ignoring the feature u_2 for each example
Each 2-dimensional example x now becomes 1-dimensional x = \{u_1\}
Are we losing much information by throwing away u_2?
No. Most of the data spread is along u_1 (very little variance along u_2)
- Data \rightarrow projection onto unit vector \hat{u}_1
- PCA is used when we want projections capturing maximum variance directions
- Principal Components (PC): directions of maximum variability in the data
- Roughly speaking, PCA does a change of axes that can represent the data in a succinct manner
- HOW?
- Maximize variance (most separable)
- Minimize the sum-of-squares (minimum squared error)
2.1. PCA Algorithm
2.1.1. Pre-processing
- Given data
x^{(i)} = \begin{bmatrix} x_1^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix}, \qquad X = \begin{bmatrix} \cdots & (x^{(1)})^T & \cdots \\ \cdots & (x^{(2)})^T & \cdots \\ & \vdots & \\ \cdots & (x^{(m)})^T & \cdots \end{bmatrix}
- Shifting (zero mean) and rescaling (unit variance)
- Shift to zero mean
\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad x^{(i)} \leftarrow x^{(i)} - \mu \quad (\text{zero mean})
- [optional] Rescaling (unit variance)
\sigma_j^2 = \frac{1}{m-1}\sum_{i=1}^{m}\left(x_j^{(i)}\right)^2, \qquad x_j^{(i)} \leftarrow \frac{x_j^{(i)}}{\sigma_j}
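A minimal NumPy sketch of this pre-processing step (zero-mean shift plus the optional rescaling), assuming the data matrix X stores one observation per row; the generated data here is only illustrative.
# pre-processing sketch: shift to zero mean, then (optionally) rescale to unit variance
X = np.random.multivariate_normal([2, 3], [[3, 1.5], [1.5, 1]], 1000)

mu = np.mean(X, axis = 0)
X = X - mu                                  # zero mean

sigma = np.std(X, axis = 0, ddof = 1)
X = X/sigma                                 # [optional] unit variance

print(np.mean(X, axis = 0))                 # ~ 0
print(np.std(X, axis = 0, ddof = 1))        # ~ 1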
2.1.2. Maximize Variance
Find the unit vector u that maximizes the variance of the projections
Note: m \approx m-1 for big data
\begin{align*} \text{variance of projected data} &= \frac{1}{m}\sum_{i=1}^{m}\left(u^Tx^{(i)}\right)^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)T}u\right)^2 \\ &= \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)T}u\right)^T\left(x^{(i)T}u\right) = \frac{1}{m}\sum_{i=1}^{m}u^Tx^{(i)}x^{(i)T}u \\ &= u^T\left(\frac{1}{m}\sum_{i=1}^{m}x^{(i)}x^{(i)T}\right)u = u^TSu \qquad \left(S = \frac{1}{m}X^TX: \text{sample covariance matrix}\right) \end{align*}
- In an optimization form
\begin{align*} \text{maximize} \quad & u^TSu \\ \text{subject to} \quad & u^Tu = 1 \end{align*}
\begin{align*} u^TSu = u^T\lambda u = \lambda u^Tu = \lambda \qquad & (\text{eigen-analysis}: \; Su = \lambda u) \\ \implies & \text{pick the largest eigenvalue } \lambda_1 \text{ of the covariance matrix } S \\ \implies & u = u_1 \text{ is the eigenvector corresponding to } \lambda_1 \\ \implies & u_1 \text{ is the first principal component (the direction of highest variance in the data)} \end{align*}
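The following small check (illustrative only, with synthetic data) confirms numerically that the projected variance u^TSu is largest along the leading eigenvector of S, compared with a random unit vector.
# numerical check: u'Su is maximized by the leading eigenvector of S
X = np.random.multivariate_normal([0, 0], [[3, 1.5], [1.5, 1]], 5000)   # zero-mean data

S = X.T @ X/len(X)                          # sample covariance (1/m version)
D, U = np.linalg.eigh(S)                    # eigh returns eigenvalues in ascending order

u1 = U[:, -1]                               # eigenvector of the largest eigenvalue
u_rand = np.random.randn(2)
u_rand = u_rand/np.linalg.norm(u_rand)      # a random unit vector

print(u1 @ S @ u1, D[-1])                   # projected variance along u1 equals lambda_1
print(u_rand @ S @ u_rand)                  # no larger than lambda_1 for any unit vector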
2.1.3. Minimize the Sum-of-Squared Error
\begin{align*} \lVert e^{(i)} \rVert^2 &= \lVert x^{(i)} \rVert^2 - \left(x^{(i)T}u\right)^2 \\ &= \lVert x^{(i)} \rVert^2 - \left(x^{(i)T}u\right)^T\left(x^{(i)T}u\right) = \lVert x^{(i)} \rVert^2 - u^Tx^{(i)}x^{(i)T}u \end{align*}
\begin{align*} \frac{1}{m}\sum_{i=1}^{m}\lVert e^{(i)} \rVert^2 &= \frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)} \rVert^2 - \frac{1}{m}\sum_{i=1}^{m}u^Tx^{(i)}x^{(i)T}u \\ &= \frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)} \rVert^2 - u^T\left(\frac{1}{m}\sum_{i=1}^{m}x^{(i)}x^{(i)T}\right)u \end{align*}
- In an optimization form
\text{minimize} \; \underbrace{\frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)} \rVert^2}_{\text{constant given } x^{(i)}} - \underbrace{u^T\left(\frac{1}{m}\sum_{i=1}^{m}x^{(i)}x^{(i)T}\right)u}_{\text{maximize}} \; \implies \; \text{maximize} \; u^T\left(\frac{1}{m}\sum_{i=1}^{m}x^{(i)}x^{(i)T}\right)u = \max \; u^TSu
\therefore \; \text{minimizing the sum-of-squared error is equivalent to maximizing the variance of the projected data}
2.1.4. Dimension Reduction Method (n \rightarrow k)
Choose top k (orthonormal) eigenvectors, U = [u_1, u_2, \cdots, u_k]
Project x_i onto span \{ u_1, u_2, \cdots, u_k\}
z^{(i)} = \begin{bmatrix} u_1^Tx^{(i)}\\ u_2^Tx^{(i)}\\ \vdots \\ u_k^Tx^{(i)}\\ \end{bmatrix} \;\text{ or }\; z = U^{T}x
- Pictorial summary of PCA
x^{(i)} \rightarrow \text{projection onto unit vector } u \implies u^Tx^{(i)} = \text{distance from the origin along } u
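A minimal sketch of this n \rightarrow k projection (here 3 \rightarrow 2, with synthetic data); the Python code in section 2.3 walks through the 2 \rightarrow 1 case in more detail.
# dimension reduction n -> k: stack the top-k eigenvectors and project, z = U^T x
X = np.random.multivariate_normal(np.zeros(3),
                                  [[3.0, 1.5, 0.5],
                                   [1.5, 1.0, 0.2],
                                   [0.5, 0.2, 0.3]], 1000)   # hypothetical 3-D data

S = X.T @ X/len(X)                 # covariance of the (zero-mean) data
D, U = np.linalg.eigh(S)
U = U[:, ::-1]                     # reorder columns: largest eigenvalue first

k = 2
Z = X @ U[:, :k]                   # each row is z = U^T x, 3-D -> 2-D
print(Z.shape)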
2.2.2. Eigenvectors
- From the given basis \{\hat{x}_1, \hat{x}_2\} to the transformed basis \{\hat{u}_1, \hat{u}_2\}
[\hat{u}_1\; \hat{u}_2]=[\hat{x}_1 \; \hat{x}_2]\begin{bmatrix} c_1 & c_3 \\c_2 & c_4\end{bmatrix}
2.3. Python Codes
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# data generation
m = 5000
mu = np.array([0, 0])
sigma = np.array([[3, 1.5],
[1.5, 1]])
X = np.random.multivariate_normal(mu, sigma, m)
X = np.asmatrix(X)
fig = plt.figure(figsize = (6, 4))
plt.plot(X[:,0], X[:,1], 'k.', alpha = 0.3)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
S = 1/(m-1)*X.T*X
D, U = np.linalg.eig(S)
idx = np.argsort(-D)
D = D[idx]
U = U[:,idx]
print(D, '\n')
print(U)
h = U[1,0]/U[0,0]
xp = np.arange(-6, 6, 0.1)
yp = h*xp
fig = plt.figure(figsize = (6, 4))
plt.plot(X[:,0], X[:,1], 'k.', alpha = 0.3)
plt.plot(xp, yp, 'r', linewidth = 3)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
Z = X*U[:,0]
plt.figure(figsize = (6, 4))
plt.hist(Z, 51)
plt.show()
Z = X*U[:,1]
plt.figure(figsize = (6, 4))
plt.hist(Z, 51)
plt.show()
from sklearn.decomposition import PCA
X = np.array(X)
pca = PCA(n_components = 1)
pca.fit(X)
u = pca.transform(X)
plt.figure(figsize = (6, 4))
plt.hist(u, 51)
plt.show()
%%html
<center><iframe
width="560" height="315" src="https://www.youtube.com/embed/Pkit-64g0eU" frameborder="0" allowfullscreen>
</iframe></center>
%%html
<center><iframe
width="560" height="315" src="https://www.youtube.com/embed/x4lvjVjUUqg" frameborder="0" allowfullscreen>
</iframe></center>
%%html
<center><iframe
width="560" height="315" src="https://www.youtube.com/embed/2t62WkNIqxY" frameborder="0" allowfullscreen>
</iframe></center>
x^{(i)} = \begin{bmatrix} x \text{ in camera 1} \\ y \text{ in camera 1} \\ x \text{ in camera 2} \\ y \text{ in camera 2} \\ x \text{ in camera 3} \\ y \text{ in camera 3} \end{bmatrix},\qquad X_{m \times 6} = \begin{bmatrix} \cdots & (x^{(1)})^T & \cdots \\ \cdots & (x^{(2)})^T & \cdots \\ & \vdots & \\ & \vdots & \\ \cdots & (x^{(m)})^T & \cdots \\ \end{bmatrix}
Download files
from google.colab import drive
drive.mount('/content/drive')
from six.moves import cPickle
X = cPickle.load(open('/content/drive/MyDrive/ML_Colab/ML_data/pca_spring.pkl','rb'))
X = np.asmatrix(X.T)
print(X.shape)
m = X.shape[0]
plt.figure(figsize = (6, 4))
plt.subplot(1,3,1)
plt.plot(X[:, 0], -X[:, 1], 'r')
plt.axis('equal')
plt.title('Camera 1')
plt.subplot(1,3,2)
plt.plot(X[:, 2], -X[:, 3], 'b')
plt.axis('equal')
plt.title('Camera 2')
plt.subplot(1,3,3)
plt.plot(X[:, 4], -X[:, 5], 'k')
plt.axis('equal')
plt.title('Camera 3')
plt.show()
X = X - np.mean(X, axis = 0)
S = 1/(m-1)*X.T*X
D, U = np.linalg.eig(S)
idx = np.argsort(-D)
D = D[idx]
U = U[:,idx]
print(D, '\n')
print(U)
plt.figure(figsize = (6, 4))
plt.stem(np.sqrt(D))
plt.grid(alpha = 0.3)
plt.show()
# relative magnitudes of the principal components
Z = X*U
xp = np.arange(0, m)/24 # 24 frame rate
plt.figure(figsize = (6, 4))
plt.plot(xp, Z)
plt.yticks([])
plt.show()
## projected onto the first principal component
# 6 dim -> 1 dim (dim reduction)
# relative magnitude of the first principal component
Z = X*U[:,0]
plt.figure(figsize = (6, 4))
plt.plot(Z)
plt.yticks([])
plt.show()
Reference: John P Cunningham & Byron M Yu, Dimensionality reduction for large-scale neural recordings, Nature Neuroscience 17, 1500–1509 (2014)
3. Fisher Discriminant Analysis (FDA) or Linear Discriminant Analysis (LDA)
3.1. FDA Algorithm
3.1.1. Dimensionality Reduction with Label
Dimensionality reduction with label information (when the ultimate goal is classification/regression)
PCA ignores label information even if it is available
- Only chooses directions of maximum variance
Fisher Discriminant Analysis (FDA) takes into account the label information
- It is also called Linear Discriminant Analysis (LDA)
FDA/LDA projects data while preserving class separation
- Examples from same class are put closely together by the projection
- Examples from different classes are placed far apart by the projection
3.1.2. Projection onto Line \omega
- Linear regression projects each data point
- assume zero mean, otherwise x \leftarrow x - \bar{x}
- \omega_0 = 0
\hat{y} = \langle \omega,x\rangle = \omega^Tx = \omega_1x_1 + \omega_2x_2
- Dimension reduction
x = \begin{bmatrix}x_1 \\ x_2 \end{bmatrix} \rightarrow \hat y \; (\text{scalar})
- Each data point x = \begin{bmatrix}x_1 \\ x_2 \end{bmatrix} is projected onto \omega (projected length on \omega direction)
- For a given \omega, distribution of the projected points \{ \hat y^{(1)}, \cdots, \hat y^{(m)} \} is specified.
Question: Which \omega is better for classification?
3.1.3. Fisher Discriminant Analysis
- Find \omega so that when projected onto \omega,
- the classes are maximally separated (maximize distance between classes)
- Each class is tight (minimize variance of each class)
\max_\omega \frac{(\text{separation of projected means})^2}{\text{sum of within-class variances}}
\implies \max_\omega \frac {\big(\mu_0^T\omega - \mu_1^T\omega \big)^2}{n_0\omega^TS_0\omega + n_1\omega^TS_1\omega}
\omega = \text{arg}\max_\omega \bigg\{ \frac{\big((\mu_0^T - \mu_1^T)\omega\big)^2}{n_0\omega^TS_0\omega + n_1\omega^TS_1\omega} \bigg \}
J(\omega) = \frac {\big((\mu_0^T - \mu_1^T)\omega\big)^2}{\omega^T(n_0S_0 + n_1S_1)\omega} = \frac {(m^T\omega)^2}{\omega^T\Sigma \omega}
\begin{align*} m &\equiv \mu_0 - \mu_1 &\\ \Sigma &\equiv n_0S_0 + n_1S_1 = R^TR \qquad &\text{We can always write } \Sigma \text{ like this, where } R \text{ is a square root matrix} \\ u &\equiv R\omega \rightarrow \omega = R^{-1}u \qquad &\text{Using } R, \text{change the coordinate systems from } \omega \text{ to } u \end{align*}
J(u) = \frac {\big( m^TR^{-1}u\big)^2}{\omega^TR^TR\omega} = \frac {\bigg( \big(R^{-T}m \big)^Tu\bigg)^2}{u^Tu}= \bigg( \left(R^{-T}m \right)^T \frac{u}{\parallel u \parallel} \bigg)^2
J(u) = \bigg( \left(R^{-T}m \right)^T \frac{u}{\lVert u \rVert} \bigg)^2 \; \text{is maximum when} \; u = a\,R^{-T}m
- Why?
- Dot product of a unit vector and another vector is maximum when the two have the same direction.
\begin{align*} u &= aR^{-T}m = aR^{-T}(\mu_0 - \mu_1)\\ \omega &= R^{-1}u = aR^{-1}R^{-T}(\mu_0 - \mu_1) = a\big(R^TR\big)^{-1}(\mu_0 - \mu_1) = a\Sigma^{-1}(\mu_0 - \mu_1) \\ \\ \therefore \; \omega &= a(n_0S_0 + n_1S_1)^{-1}(\mu_0 - \mu_1) \end{align*}
3.2. Python Code
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# generating data set
n0 = 200
n1 = 200
mu = [0, 0]
sigma = [[0.9, -0.4],
[-0.4, 0.3]]
x0 = np.random.multivariate_normal([2.5,2.5], sigma, n0).T    # data in class 0
x1 = np.random.multivariate_normal([1,1], sigma, n1).T        # data in class 1
x0 = np.asmatrix(x0)
x1 = np.asmatrix(x1)
mu0 = np.mean(x0, axis = 1)
mu1 = np.mean(x1, axis = 1)
S0 = 1/(n0 - 1)*(x0 - mu0)*(x0 - mu0).T
S1 = 1/(n1 - 1)*(x1 - mu1)*(x1 - mu1).T
w = (n0*S0 + n1*S1).I*(mu0 - mu1)
print(w)
Projection line and histogram
plt.figure(figsize = (6, 4))
plt.plot(x0[0,:], x0[1,:], 'r.')
plt.plot(x1[0,:], x1[1,:], 'b.')
xp = np.arange(-4, 6, 0.1)
yp = w[1,0]/w[0,0]*xp
plt.plot(xp, yp, 'k')
plt.axis('equal')
plt.ylim([-2, 6])
plt.show()
y1 = x0.T*w
y2 = x1.T*w
plt.figure(figsize = (6, 4))
plt.hist(y1, 21, color = 'r', rwidth = 0.5)
plt.hist(y2, 21, color = 'b', rwidth = 0.5)
plt.show()
Different \omega (x axis)
w = np.array([[1], [0]])
w = np.asmatrix(w)
plt.figure(figsize = (6, 4))
plt.plot(x0[0,:], x0[1,:], 'r.')
plt.plot(x1[0,:], x1[1,:], 'b.')
plt.axhline(0, color = 'k')
plt.axis('equal')
plt.ylim([-2, 6])
plt.show()
y1 = x0.T*w
y2 = x1.T*w
plt.figure(figsize = (6, 4))
plt.hist(y1, 21, color = 'r', rwidth = 0.5)
plt.hist(y2, 21, color = 'b', rwidth = 0.5)
plt.show()
Different \omega (y axis)
w = np.array([[0], [1]])
w = np.asmatrix(w)
plt.figure(figsize = (6, 4))
plt.plot(x0[0,:], x0[1,:], 'r.')
plt.plot(x1[0,:], x1[1,:], 'b.')
plt.axvline(0, color = 'k')
plt.axis('equal')
plt.ylim([-2, 6])
plt.show()
y1 = x0.T*w
y2 = x1.T*w
plt.figure(figsize = (6, 4))
plt.hist(y1, 21, color = 'r', rwidth = 0.5)
plt.hist(y2, 21, color = 'b', rwidth = 0.5)
plt.show()
# reshape data
X = np.vstack([x0.T, x1.T])
y = np.vstack([np.ones([n0, 1]), np.zeros([n1, 1])])
from sklearn import discriminant_analysis
X = np.array(X)
clf = discriminant_analysis.LinearDiscriminantAnalysis()
clf.fit(X, np.ravel(y))
# projection
X_LDA = clf.transform(X)
plt.figure(figsize = (6, 4))
plt.hist(X_LDA[0:200], 21, color = 'r', rwidth = 0.5)
plt.hist(X_LDA[200:400], 21, color = 'b', rwidth = 0.5)
plt.show()
# classifier boundary, not a projection line
clf_w = clf.coef_[0]
clf_w0 = clf.intercept_[0]
print(clf_w)
print(clf_w0)
# projection line
mu0 = np.mean(x0, axis = 1)
mu1 = np.mean(x1, axis = 1)
S0 = 1/(n0 - 1)*(x0 - mu0)*(x0 - mu0).T
S1 = 1/(n1 - 1)*(x1 - mu1)*(x1 - mu1).T
prj_w = (n0*S0 + n1*S1).I*(mu0 - mu1)
xp = np.arange(-4, 6, 0.1)
prj_yp = prj_w[1,0]/prj_w[0,0]*xp
clf_yp = - clf_w[0]/clf_w[1]*xp - clf_w0/clf_w[1]
plt.figure(figsize = (6, 4))
plt.plot(x0[0,:], x0[1,:], 'r.')
plt.plot(x1[0,:], x1[1,:], 'b.')
plt.plot(xp, prj_yp, 'k', label = 'LDA projection line')
plt.plot(xp, clf_yp, 'g', label = 'LDA classification boundary')
plt.legend()
plt.axis('equal')
plt.ylim([-2, 6])
plt.show()
4. Singular Value Decomposition (SVD)
Geometry of Linear Maps
Matrix A (or linear transformation) = rotate + stretch/compress
An extremely important fact: the image of the unit ball under A is an ellipsoid
\begin{align*} S &= \{ x \in \mathbb{R}^n \mid \; \Vert x \Vert \leq 1 \}\\ AS &= \{Ax \mid x\in S\} \end{align*}
4.1. SVD Algorithm
Singular Values and Singular Vectors
the numbers \sigma_1, \cdots, \sigma_n are called the singular values of A (by convention, \sigma_i > 0)
the vectors u_1, \cdots, u_n are unit vectors along the principal semiaxes of AS
the vectors v_1, \cdots, v_n are the preimages of the principal semiaxes, defined so that
A v_i = \sigma_i u_i
\begin{align*} A\begin{bmatrix} v_1& v_2 & \cdots & v_r \end{bmatrix} & =\begin{bmatrix} A v_1 & A v_2 & \cdots & A v_r \end{bmatrix} \\ & =\begin{bmatrix} \sigma_1u_1& \sigma_2u_2 & \cdots & \sigma_ru_r \end{bmatrix} \\ & =\begin{bmatrix} u_1& u_2 & \cdots & u_r \end{bmatrix}\begin{bmatrix} \sigma_1& & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_r \end{bmatrix} \\ \\ \therefore \; AV & = U\Sigma \quad (r \leq m, n) \end{align*}
Thin Singular Value Decomposition
We have A \in \mathbb{R}^{m \times n}, skinny and full rank (i.e., r = n), and
A v_i = \sigma_i u_i \quad \text{for } 1 \leq i \leq n
let
\begin{align*} \hat U &= \begin{bmatrix}u_1 & u_2 & \cdots & u_n \end{bmatrix}\\ \\ \quad \hat \Sigma &= \begin{bmatrix}\sigma_1& & & \\ & \sigma_2 &&\\ &&\ddots&\\ &&&\sigma_n \end{bmatrix}\\ \\ \quad V &= \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix} \end{align*}
in matrix form, the above equation is AV = \hat U \hat \Sigma and since V is orthogonal
A = \hat U \hat \Sigma V^T
called the thin (or reduced) SVD of A
Extending \hat U to a square orthogonal U and padding \hat \Sigma with zeros gives A = U \Sigma V^T, the (full) singular value decomposition of A
Every matrix A has a singular value decomposition.
If A is not full rank, then some of the diagonal entries of \Sigma will be zero
The SVD decomposes the linear map into
rotate by V^T
diagonal scaling by \sigma_i
rotate by U
Note that, unlike the eigen-decomposition, input and output directions are different
- Example
\begin{array}{Icr} A = \begin{bmatrix} 4 & 4 \\ -3 & 3 \end{bmatrix} \end{array} \quad \quad \begin{array}{Icr} \begin{align*} &\textbf{orthonormal basis} \\ &v_1, v_2 \text{ in row space } \mathbb{R}^2 \\ &u_1, u_2 \text{ in column space } \mathbb{R}^2 \\ &\sigma_1 > 0 \\ &\sigma_2 > 0 \end{align*} \end{array}
\begin{array}{lcr} A v_1 = \sigma_1 u_1\\ A v_2 = \sigma_2 u_2 \end{array} \; \implies \; \begin{array}{lcr} AV = U\Sigma \\ A = U\Sigma V^{-1} = U\Sigma V^T \end{array}
- A^TA: (small) positive definite & symmetric
A^TA = V\Sigma^TU^TU\Sigma V^T = V\Sigma^T\Sigma V^T = V\begin{bmatrix} \sigma_1^2& & \\ & \sigma_2^2 & \\ & & \ddots \\ \end{bmatrix}V^T
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
A = np.array([[4, 4],
[-3, 3]])
A = np.asmatrix(A)
[D, V] = np.linalg.eig(A.T*A)
print(V, '\n')
print(D)
- AA^T: (large) positive definite & symmetric
AA^T = U\Sigma V^TV\Sigma^TU^T = U\Sigma\Sigma^TU^T = U\begin{bmatrix} \sigma_1^2& & \\ & \sigma_2^2 & \\ & & \ddots \\ \end{bmatrix}U^T
[D, U] = np.linalg.eig(A*A.T)
print(U, '\n')
print(D)
A = U\Sigma V^T
[U, S, VT] = np.linalg.svd(A, full_matrices = True)
print(U.shape, S.shape, VT.shape, '\n')
print(U, '\n')
print(S, '\n')
print(VT)
4.2. PCA and SVD
- Any real symmetric and positive definite matrix B has an eigen-decomposition
B = S\Lambda S^T
- A real matrix (m\times n) A, where m>n, has the decomposition
A = U\Sigma V^T
- From A (skinny and full rank) we can construct two positive-definite symmetric matrices, AA^T and A^TA
\begin{align*} AA^T &= U\Sigma V^T V \Sigma^T U^T = U \Sigma \Sigma^T U^T \quad (m \times m, \,n\, \text{eigenvalues and }\, m-n \; \text{zeros},\; U\; \text{eigenvectors})\\ A^TA &= V \Sigma^T U^T U\Sigma V^T = V \Sigma^T \Sigma V^T \quad (n \times n, \,n \;\text{eigenvalues},\; V\; \text{eigenvectors}) \end{align*}
- PCA by SVD
B = A^TA = V \Sigma^T \Sigma V^T = V \Lambda V^T
- V is eigenvectors of B = A^TA
- \sigma_i^2 = \lambda_i
- Note that in PCA,
S = \frac{1}{m}X^TX = \left(\frac{X}{\sqrt{m}}\right)^T \frac{X}{\sqrt{m}} = A^TA = V\Lambda V^T \qquad \left(\text{with } A = \frac{X}{\sqrt{m}}\right)
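As a numerical sanity check of this PCA-SVD relationship (a sketch with synthetic data, not from the original notes): the right singular vectors of A = X/\sqrt{m} are the eigenvectors of S, and \sigma_i^2 = \lambda_i.
# PCA via SVD: right singular vectors of X/sqrt(m) = eigenvectors of S
m = 5000
X = np.random.multivariate_normal([0, 0], [[3, 1.5], [1.5, 1]], m)   # zero-mean data

A = X/np.sqrt(m)
_, sig, VT = np.linalg.svd(A, full_matrices = False)

S = X.T @ X/m
lamb, V = np.linalg.eigh(S)

print(sig**2, '\n')            # sigma_i^2
print(lamb[::-1], '\n')        # eigenvalues of S, largest first (same values)
print(VT, '\n')                # right singular vectors (as rows)
print(V[:, ::-1].T)            # eigenvectors of S (same rows, up to sign)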
%%html
<center><iframe src="https://www.youtube.com/embed/Nx0lRBaXoz4?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
4.3. Low Rank Approximation: Dimension Reduction
- In the 2D case
\begin{align*} x & = c_1 v_1 + c_2 v_2 \\ & = \langle x, v_1 \rangle v_1 + \langle x, v_2 \rangle v_2\\ & = ( v_1^{T} x) v_1 + ( v_2^T x) v_2 \\ \\ Ax & = c_1 A v_1 + c_2 A v_2 \\ & = c_1\sigma_1 u_1 + c_2 \sigma_2 u_2 \\ & = u_1 \sigma_1 v_1^Tx + u_2\sigma_2 v_2^T x \\ \\ & = \begin{bmatrix} u_1 & u_2 \end{bmatrix}\begin{bmatrix}\sigma_1 & 0\\ 0 & \sigma_2 \end{bmatrix}\begin{bmatrix} v_1^T \\ v_2^T \end{bmatrix} x \\ \\ & = U \Sigma V^Tx \\\\ & \approx u_1\sigma_1 v_1^Tx \quad (\text{if } \sigma_1 \gg \sigma_2) \end{align*}
- Low rank approximation of matrix A in general
\begin{align*} A &= U_r \Sigma_r V_r^T = \sum_{i=1}^{r}\sigma_i u_i v_i^T\\ \\ \tilde A &= U_k \Sigma_k V_k^T = \sum_{i=1}^{k}\sigma_i u_i v_i^T \qquad (k \leq r) \end{align*}
A = np.array([[11.08, 6.82, 1.76, -6.82],
[2.50, -1.01, -2.60, 1.19],
[-4.88, -5.07, -3.21, 5.20],
[-0.49, 1.52, 2.07, -1.66],
[-14.04, -12.40, -6.66, 12.65],
[0.27, -8.51, -10.19, 9.15],
[9.53, -9.84, -17.00, 11.00],
[-12.01, 3.64, 11.10, -4.48]])
[U, S, VT] = np.linalg.svd(A, full_matrices = True)
U = np.asmatrix(U)
S = np.asmatrix(np.diag(S))
VT = np.asmatrix(VT)
print(U.shape, S.shape, VT.shape, '\n')
print(U, '\n')
print(S, '\n')
print(VT)
[U, S, VT] = np.linalg.svd(A, full_matrices = False)
U = np.asmatrix(U)
S = np.asmatrix(np.diag(S))
VT = np.asmatrix(VT)
print(U.shape, S.shape, VT.shape, '\n')
print(U, '\n')
print(S, '\n')
print(VT)
Question
How do we determine a good k so that we can efficiently compress A without losing too much of the information in A?
sig_val = np.diag(S)
plt.figure(figsize = (6, 4))
plt.stem(range(1,5), sig_val)
plt.xticks(range(1,5))
plt.title('Singular Values')
plt.show()
k = 2
low_U = U[:, :k]
low_S = S[:k, :k]
low_VT = VT[:k, :]
A_approx = low_U*low_S*low_VT
print(low_U.shape, low_S.shape, low_VT.shape, '\n')
print('A:\n', A ,'\n')
print('A_approx:\n', A_approx, '\n')
print('low_U:\n', low_U, '\n')
print('low_VT:\n', low_VT)
4.3.2. Example: Image Approximation
Learn how to reduce the amount of storage needed for an image using the singular value decomposition.
Approximation of A
\tilde A = U_k \Sigma_k V_k^T = \sum_{i=1}^{k}\sigma_i u_i v_i^T \qquad (k \leq r)
Download the image file
from google.colab import drive
drive.mount('/content/drive')
import cv2
img = cv2.imread('/content/drive/MyDrive/ML_Colab/ML_data/postech.jpg', 0)
print(img.shape)
plt.figure(figsize = (6, 6))
plt.imshow(img, 'gray')
plt.axis('off')
plt.show()
A = img
[U, S, VT] = np.linalg.svd(A)
U = np.asmatrix(U)
S = np.asmatrix(np.diag(S))
VT = np.asmatrix(VT)
print(U.shape, S.shape, VT.shape, '\n')
print(U, '\n')
print(S, '\n')
sig_val = np.diag(S)
plt.figure(figsize = (6, 4))
plt.stem(sig_val)
plt.title('Singular Values')
plt.show()
Approximated image when k = 20
k = 20
low_U = U[:, :k]
low_S = S[:k, :k]
low_VT = VT[:k, :]
A_hat = low_U*low_S*low_VT
plt.figure(figsize = (8, 4))
plt.subplot(1,2,1), plt.imshow(A, 'gray'), plt.axis('off'), plt.title('Original image')
plt.subplot(1,2,2), plt.imshow(A_hat, 'gray'), plt.axis('off'), plt.title('Approximated image w/ rank = 20')
plt.show()
Approximated images with k varied
K = [1, 2, 3, 4, 10, 20, 30, 600]
plt.figure(figsize = (8, 5))
for i in range(len(K)):
    low_U = U[:, :K[i]]
    low_S = S[:K[i], :K[i]]
    low_VT = VT[:K[i], :]
    A_hat = low_U*low_S*low_VT

    plt.subplot(2,4,i+1), plt.imshow(A_hat, 'gray'), plt.axis('off')
    plt.title('rank {}'.format(K[i]))

plt.show()
4.3.3. Eigenface
Further edited by Tim Chartier, Davidson College, 2007
Best approximation of x with the eigenfaces U (eigenvectors)
\tilde{x} = \sum_{i=1}^{k} \langle x,\hat{u}_i\rangle \,\hat{u}_i
12 USA Presidents
Download the face file
from scipy import io
faces = io.loadmat('/content/drive/MyDrive/ML_Colab/ML_data/faces.mat')
face_mat = faces['faces']
row, col, m = face_mat.shape
plt.figure(figsize = (6, 8))
for i in range(m):
    plt.subplot(4, 3, i+1), plt.axis('off')
    plt.imshow(face_mat[:, :, i], 'gray')
plt.show()
SVD
A = np.reshape(face_mat, [row*col, m])
[U, S, VT] = np.linalg.svd(A, full_matrices = False)
U = np.asmatrix(U)
S = np.asmatrix(np.diag(S))
VT = np.asmatrix(VT)
print(U.shape, S.shape, VT.shape, '\n')
Eigenface
plt.figure(figsize = (6, 8))
for i in range(m):
    plt.subplot(4, 3, i+1), plt.axis('off')
    plt.imshow(U[:, i].reshape(row, col), 'gray')
plt.show()
Singular values
sig_val = np.diag(S)
plt.figure(figsize = (6, 4))
plt.stem(sig_val)
plt.title('Singular Values')
plt.show()
Transmit the compressed information
- Assume that both parties have the eigenfaces
- The only thing to transmit is the corresponding weights of the eigenfaces
obama = face_mat[:,:,11]
plt.figure(figsize = (3, 4))
plt.imshow(obama, 'gray')
plt.axis('off')
plt.show()
# sender
x = np.reshape(obama, [row*col, 1])
eigenface_decom = U.T*x
print(obama.shape, eigenface_decom.shape)
# receiver
reconst = U*eigenface_decom
plt.figure(figsize = (3, 4))
plt.imshow(reconst.reshape(row, col), 'gray')
plt.axis('off')
plt.show()
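To actually compress, only the first k weights need to be transmitted. Here is a sketch of that idea, reusing U, row, col, and obama from the cells above (k = 6 is an arbitrary choice).
# transmit only k weights instead of all row*col pixel values
k = 6
x = np.reshape(obama, [row*col, 1])
eigenface_decom_k = U[:, :k].T*x          # k weights to transmit

# receiver reconstructs from the k weights and the shared eigenfaces
reconst_k = U[:, :k]*eigenface_decom_k

plt.figure(figsize = (3, 4))
plt.imshow(reconst_k.reshape(row, col), 'gray')
plt.axis('off')
plt.title('k = {}'.format(k))
plt.show()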
Seeing through a Disguise with SVD
- Acquire new image
- original image disguised
- download the image file
jfk = cv2.imread('/content/drive/MyDrive/ML_Colab/ML_data/jfk.png', 0)
plt.figure(figsize = (3, 4))
plt.imshow(jfk, 'gray')
plt.axis('off')
plt.show()
Projection onto U
x = np.reshape(jfk, [row*col, 1])
eigenface_decom = U.T*x
reconst = U*eigenface_decom
plt.figure(figsize = (3, 4))
plt.imshow(reconst.reshape(row, col), 'gray')
plt.axis('off')
plt.show()