Unsupervised Learning: Dimension Reduction


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Table of Contents

1. Principal Component Analysis (PCA)
2. PCA Algorithm
3. PCA Visualization
4. Python Codes
5. PCA Example
6. Video Lectures


Motivation: Can we describe high-dimensional data in a "simpler" way?

$\quad \rightarrow$ Dimension reduction without losing too much information
$\quad \rightarrow$ Find a low-dimensional, yet useful representation of the data

  • Why dimensionality reduction?
    • Insights into the low-dimensional structure of the data (visualization)
    • Fewer dimensions ⇒ less chance of overfitting ⇒ better generalization
    • Speeding up learning algorithms
      • Most algorithms scale badly as the data dimensionality increases
    • Less storage required (data compression)
    • Note: dimensionality reduction is different from feature selection
      • ... although the goals are related
    • Dimensionality reduction is more like “feature extraction”
      • Constructing a small set of new features from the original features
  • How?
    • Idea: highly correlated data contains redundant features




  • Each example $x$ has 2 features $\{x_1,x_2\}$

  • Consider ignoring the feature $x_2$ for each example

  • Each 2-dimensional example $x$ now becomes 1-dimensional $x = \{x_1\}$

  • Are we losing much information by throwing away $x_2$ ?




  • Yes, the data has substantial variance along both features (i.e. both axes)



  • Now consider a change of axes

  • Each example $x$ has 2 features $\{u_1,u_2\}$

  • Consider ignoring the feature $u_2$ for each example

  • Each 2-dimensional example $x$ now becomes 1-dimensional $x = \{u_1\}$

  • Are we losing much information by throwing away $u_2$ ?

  • No. Most of the data spread is along $u_1$ (very little variance along $u_2$)



  • Data $\rightarrow$ projection onto unit vector $\hat{u}_1$
  • PCA is used when we want projections capturing maximum variance directions
  • Principal Components (PC): directions of maximum variability in the data
  • Roughly speaking, PCA does a change of axes that can represent the data in a succinct manner



  • How?
    1. Maximize the variance (most separable)
    2. Minimize the sum-of-squared error (minimum squared error)

2. PCA Algorithm

2.1. Pre-processing

  • Given data
$$ x^{(i)} = \begin{bmatrix} x^{(i)}_1 \\ \vdots \\x^{(i)}_n \end{bmatrix},\qquad X = \begin{bmatrix} \cdots & (x^{(1)})^T & \cdots\\ \cdots & (x^{(2)})^T & \cdots\\ & \vdots & \\ \cdots & (x^{(m)})^T & \cdots\\ \end{bmatrix}$$
  • Shifting (zero mean) and rescaling (unit variance)
    1. Shift to zero mean $$ \begin{align*} \mu &= \frac{1}{m}\sum\limits_{i=1}^{m}x^{(i)} \\ x^{(i)} &\leftarrow x^{(i)} - \mu \quad \text{(zero mean)} \end{align*} $$
    2. [optional] Rescaling (unit variance) $$\begin{align*} \sigma^2_j &= \frac{1}{m-1}\sum\limits_{i=1}^{m}\left(x_j^{(i)}\right)^2 \\ x^{(i)}_j &\leftarrow \frac{x^{(i)}_j}{\sigma_j} \end{align*}$$ A NumPy sketch of both steps follows below.
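As a minimal sketch of this pre-processing (not the lecture's code; the array name X_raw and the synthetic data are illustrative), the two steps amount to a feature-wise shift and scale:

import numpy as np

# illustrative data: m = 100 examples with n = 3 features
X_raw = np.random.randn(100, 3) * [2.0, 0.5, 1.0] + [1.0, -2.0, 0.0]

mu = X_raw.mean(axis = 0)            # feature-wise mean
Xc = X_raw - mu                      # 1. shift to zero mean

sigma = Xc.std(axis = 0, ddof = 1)   # feature-wise standard deviation (1/(m-1) normalization)
Xs = Xc / sigma                      # 2. [optional] rescale to unit variance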

2.2. Maximize Variance

  • Find unit vector $u$ such that maximizes variance of projections

    Note: for large $m$, normalizing by $m$ or $m-1$ makes little difference ($m \approx m-1$ for big data)

$$\begin{align*} \text{variance of projected data} & = \frac{1}{m}\sum\limits_{i=1}^{m}\big(u^Tx^{(i)}\big)^2 = \frac{1}{m}\sum\limits_{i=1}^{m}\big( {x^{(i)}}^Tu\big)^2 \\ & = \frac{1}{m}\sum\limits_{i=1}^{m}\big( {x^{(i)}}^Tu\big)^T\big( {x^{(i)}}^Tu\big) = \frac{1}{m}\sum\limits_{i=1}^{m}u^Tx^{(i)}{x^{(i)}}^Tu \\ & = u^T\left(\frac{1}{m}\sum\limits_{i=1}^{m}x^{(i)}{x^{(i)}}^T \right) u \\ & =u^TSu \qquad (S=\frac{1}{m}X^T X: \text{sample covariance matrix}) \end{align*}$$
  • In an optimization form
$$\begin{align*} \text{maximize} \quad & u^TSu \\ \text{subject to} \quad & u^Tu = 1\end{align*}$$


$$\begin{align*} & u^TSu = u^T\lambda u = \lambda u^Tu = \lambda \quad (\text{eigen-analysis}: \; Su = \lambda u) \\ \\ & \implies \text{pick the largest eigenvalue } \lambda_1 \text{ of the covariance matrix } S\\ & \implies u = u_1 \text{ is the eigenvector corresponding to } \lambda_1\\ & \implies u_1 \text{ is the first principal component (the direction of highest variance in the data)} \end{align*}$$
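A quick numerical sanity check of this result (a sketch, not part of the lecture code): the projected variance $u^TSu$ along the top eigenvector should dominate that along any random unit direction.

import numpy as np

np.random.seed(0)
X = np.random.multivariate_normal([0, 0], [[3, 1.5], [1.5, 1]], 1000)
X = X - X.mean(axis = 0)                  # zero mean
S = X.T @ X / X.shape[0]                  # sample covariance matrix

D, U = np.linalg.eigh(S)                  # eigenvalues in ascending order
u1 = U[:, -1]                             # eigenvector of the largest eigenvalue

v = np.random.randn(100, 2)
v = v / np.linalg.norm(v, axis = 1, keepdims = True)   # 100 random unit vectors

print(u1 @ S @ u1)                        # variance along u1
print((v @ S * v).sum(axis = 1).max())    # best random direction, should be smaller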

2.3. Minimize the Sum-of-Squared Error

  • Let $e^{(i)} = x^{(i)} - \big(u^Tx^{(i)}\big)u$ be the residual of projecting $x^{(i)}$ onto the unit vector $u$; by the Pythagorean theorem

$$ \begin{align*} \lVert e^{(i)} \rVert ^2 & = \lVert x^{(i)} \rVert ^2 - \big( {x^{(i)}}^Tu \big)^2 \\ & = \lVert x^{(i)} \rVert ^2 - \big( {x^{(i)}}^Tu \big)^T\big( {x^{(i)}}^Tu \big) \\ & = \lVert x^{(i)} \rVert ^2 - u^Tx^{(i)}{x^{(i)}}^Tu \end{align*} $$



  • Averaging over all $m$ examples

$$\begin{align*} \frac {1}{m} \sum\limits_{i=1}^{m} \lVert e^{(i)} \rVert ^2 & = \frac {1}{m} \sum\limits_{i=1}^{m} \lVert x^{(i)} \rVert ^2 - \frac{1}{m}\sum\limits_{i=1}^{m}u^Tx^{(i)}{x^{(i)}}^Tu \\ & = \frac {1}{m} \sum\limits_{i=1}^{m} \lVert x^{(i)} \rVert ^2 - u^T\left(\frac{1}{m}\sum\limits_{i=1}^{m}x^{(i)}{x^{(i)}}^T\right)u \end{align*}$$
  • In an optimization form


$$\begin{align*} &\text{minimize} \; \underbrace{\frac {1}{m} \sum\limits_{i=1}^{m} \lVert x^{(i)} \rVert ^2}_{\text{constant given $x_i$}} - \underbrace{u^T\left(\frac{1}{m}\sum\limits_{i=1}^{m}x^{(i)}{x^{(i)}}^T\right)u}_{\text{maximize}} \\ \\ \implies &\text{maximize} \; u^T\left(\frac{1}{m}\sum\limits_{i=1}^{m}x^{(i)}{x^{(i)}}^T\right)u = \max \; u^T S u\end{align*}$$


$$ \therefore \; \text{minimize} \; error^2 = \text{maximize} \; variance$$
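A small numerical check of this equivalence (a sketch with illustrative variable names): for any unit vector $u$, the mean squared residual plus the projected variance equals the constant $\frac{1}{m}\sum_i \lVert x^{(i)} \rVert^2$, so minimizing one is the same as maximizing the other.

import numpy as np

np.random.seed(1)
X = np.random.multivariate_normal([0, 0], [[3, 1.5], [1.5, 1]], 1000)
X = X - X.mean(axis = 0)
m = X.shape[0]

u = np.random.randn(2)
u = u / np.linalg.norm(u)            # an arbitrary unit direction

proj = X @ u                         # scalar projections u^T x^(i)
E = X - np.outer(proj, u)            # residuals e^(i)

mse = (E**2).sum(axis = 1).mean()    # (1/m) sum ||e^(i)||^2
pvar = (proj**2).mean()              # (1/m) sum (u^T x^(i))^2
total = (X**2).sum(axis = 1).mean()  # (1/m) sum ||x^(i)||^2

print(np.isclose(mse + pvar, total)) # True for any unit u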

2.4. Dimension Reduction Method ($n \rightarrow k$)

  1. Choose top $k$ (orthonormal) eigenvectors, $U = [u_1, u_2, \cdots, u_k]$

  2. Project $x^{(i)}$ onto span $\{ u_1, u_2, \cdots, u_k\}$ (see the sketch after the equation below)


$$z^{(i)} = \begin{bmatrix} u_1^Tx^{(i)}\\ u_2^Tx^{(i)}\\ \vdots \\ u_k^Tx^{(i)}\\ \end{bmatrix} \;\text{ or }\; z = U^{T}x $$
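As a minimal NumPy sketch of this step (assuming the eigenvectors have already been sorted by decreasing eigenvalue; $k$ and the variable names are illustrative):

import numpy as np

np.random.seed(2)
X = np.random.multivariate_normal([0, 0, 0],
                                  [[3, 1.5, 0.2],
                                   [1.5, 1, 0.1],
                                   [0.2, 0.1, 0.5]], 500)
X = X - X.mean(axis = 0)

S = X.T @ X / (X.shape[0] - 1)        # sample covariance, n x n
D, U = np.linalg.eig(S)
idx = np.argsort(-D)                  # sort eigenvalues in decreasing order
U = U[:, idx]

k = 2
Uk = U[:, :k]                         # top-k eigenvectors, n x k
Z = X @ Uk                            # rows are z^(i) = Uk^T x^(i), so Z is m x k
print(Z.shape)                        # (500, 2)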

  • Pictorial summary of PCA



$\qquad \qquad \qquad x^{(i)} \rightarrow$ projection onto unit vector $u \implies u^Tx^{(i)} = $ distance from the origin along $u$

3. PCA Visualization

3.1. Eigenvalues

  • $\lambda_1, \lambda_2$ indicate the variance along the corresponding eigenvectors, respectively.

  • The larger the eigenvalue, the more dominant the corresponding feature (eigenvector)
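For instance, the fraction of total variance captured by each principal component follows directly from the eigenvalues; a one-line sketch (the eigenvalues shown are illustrative, close to the ones computed in Section 4):

import numpy as np

D = np.array([3.84, 0.20])   # eigenvalues of S (illustrative values)
print(D / D.sum())           # explained variance ratio per component, here about [0.95, 0.05]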

3.2. Eigenvectors

  • From the given basis $\{\hat{x}_1, \hat{x}_2\}$ to the transformed basis $\{\hat{u}_1, \hat{u}_2\}$



$$[\hat{u}_1\; \hat{u}_2]=[\hat{x}_1 \; \hat{x}_2]\begin{bmatrix} c_1 & c_3 \\c_2 & c_4\end{bmatrix}$$
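In code, this change-of-basis matrix is simply the eigenvector matrix $U$: its columns hold the coefficients $c_1, \cdots, c_4$ expressing $\hat{u}_1, \hat{u}_2$ in the original basis. A small sketch, taking $\hat{x}_1, \hat{x}_2$ to be the standard basis and using illustrative eigenvector values:

import numpy as np

X_hat = np.eye(2)                     # columns: original basis x1_hat, x2_hat
C = np.array([[0.879, -0.476],        # eigenvector matrix U (illustrative values)
              [0.476,  0.879]])

U_hat = X_hat @ C                     # columns: transformed basis u1_hat, u2_hat
print(U_hat)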

4. Python Codes

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# data generation
m = 5000
mu = np.array([0, 0])
sigma = np.array([[3, 1.5], 
                  [1.5, 1]])

X = np.random.multivariate_normal(mu, sigma, m)
X = np.asmatrix(X)

fig = plt.figure(figsize = (10, 8))
plt.plot(X[:,0], X[:,1], 'k.', alpha = 0.3)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
In [2]:
S = 1/(m-1)*X.T*X       # sample covariance (data generated with zero mean)

D, U = np.linalg.eig(S)

idx = np.argsort(-D)    # sort eigenvalues/eigenvectors in decreasing order
D = D[idx]
U = U[:,idx]

print(D, '\n')
print(U)
[3.83773922 0.20372196] 

[[ 0.87940402 -0.47607622]
 [ 0.47607622  0.87940402]]
In [3]:
h = U[1,0]/U[0,0]           # slope of the first principal direction
xp = np.arange(-6, 6, 0.1)
yp = h*xp

fig = plt.figure(figsize = (10, 8))
plt.plot(X[:,0], X[:,1], 'k.', alpha = 0.3)
plt.plot(xp, yp, 'r', linewidth = 3)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
In [4]:
Z = X*U[:,0]                # projection onto the first principal component

plt.figure(figsize = (10, 8))
plt.hist(Z, 51)
plt.show()
In [5]:
Z = X*U[:,1]                # projection onto the second principal component

plt.figure(figsize = (10, 8))
plt.hist(Z, 51)
plt.show()
In [6]:
from sklearn.decomposition import PCA

pca = PCA(n_components = 1)
pca.fit(X)
Out[6]:
PCA(copy=True, iterated_power='auto', n_components=1, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)
In [7]:
pca.components_
Out[7]:
array([[-0.87940299, -0.47607813]])
In [8]:
pca.explained_variance_
Out[8]:
array([3.8377386])
In [9]:
pca.explained_variance_ratio_
Out[9]:
array([0.94961564])
In [10]:
u = pca.transform(X)        # project data onto the first principal component

plt.figure(figsize = (10, 8))
plt.hist(u, 51)
plt.show()
In [11]:
h = pca.components_[0][1] / pca.components_[0][0]   # slope of the first principal direction
xp = np.arange(-6, 6, 0.1)
yp = h*xp

fig = plt.figure(figsize = (10, 8))
plt.plot(X[:,0], X[:,1], 'k.', alpha = 0.3)
plt.plot(xp, yp, 'r', linewidth = 3)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()

5. PCA Example

  • Multiple video camera recordings of a spring-mass system




In [12]:
%%html
<center><iframe src="https://www.youtube.com/embed/HdZQmDlEPu4" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
In [13]:
%%html
<center><iframe src="https://www.youtube.com/embed/lt9Z4d_Z0jg" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
In [14]:
%%html
<center><iframe src="https://www.youtube.com/embed/E-1ItDu-2EQ" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
$$ x^{(i)} = \begin{bmatrix} x \text{ in camera 1} \\ y \text{ in camera 1} \\ x \text{ in camera 2} \\ y \text{ in camera 2} \\ x \text{ in camera 3} \\ y \text{ in camera 3} \end{bmatrix},\qquad X_{m \times 6} = \begin{bmatrix} \cdots & (x^{(1)})^T & \cdots \\ \cdots & (x^{(2)})^T & \cdots \\ & \vdots & \\ & \vdots & \\ \cdots & (x^{(m)})^T & \cdots \\ \end{bmatrix} $$

Download files

In [15]:
from six.moves import cPickle

X = cPickle.load(open('./data_files/pca_spring.pkl','rb'))
X = np.asmatrix(X.T)

print(X.shape)
m = X.shape[0]
(273, 6)
In [16]:
plt.figure(figsize = (12, 6))
plt.subplot(1,3,1)
plt.plot(X[:, 0], -X[:, 1], 'r')
plt.axis('equal')
plt.title('Camera 1')

plt.subplot(1,3,2)
plt.plot(X[:, 2], -X[:, 3], 'b')
plt.axis('equal')
plt.title('Camera 2')

plt.subplot(1,3,3)
plt.plot(X[:, 4], -X[:, 5], 'k')
plt.axis('equal')
plt.title('Camera 3')

plt.show()
In [17]:
X = X - np.mean(X, axis = 0)    # shift to zero mean

S = 1/(m-1)*X.T*X               # sample covariance

D, U = np.linalg.eig(S)

idx = np.argsort(-D)            # sort eigenvalues/eigenvectors in decreasing order
D = D[idx]
U = U[:,idx]

print(D, '\n')
print(U)
[2.46033089e+04 3.22747042e+02 8.73851124e+01 8.19527660e+01
 3.19467195e+01 7.42861585e+00] 

[[ 0.36881064  0.62298194 -0.12571821 -0.42647348  0.52121775 -0.0807439 ]
 [ 0.35632379  0.57286174  0.132303    0.59881765 -0.40143215  0.08734045]
 [ 0.58419477 -0.22610057 -0.20325551 -0.47751523 -0.58153918  0.00857804]
 [ 0.08652315 -0.02671281  0.75692234 -0.14177391 -0.06010869 -0.62861422]
 [ 0.4159798  -0.29900638  0.49374948  0.05637591  0.32442517  0.62075559]
 [-0.46389987  0.37746931  0.32963322 -0.45633202 -0.34660023  0.45308403]]
In [18]:
plt.figure(figsize = (10,8))
plt.stem(np.sqrt(D))
plt.grid(alpha = 0.3)
plt.show()
In [19]:
# relative magnitudes of the principal components

Z = X*U
xp = np.arange(0, m)/24    # 24 frame rate

plt.figure(figsize = (10, 8))
plt.plot(xp, Z)
plt.yticks([])
plt.show()
In [20]:
## projected onto the first principal component
# 6 dim -> 1 dim (dim reduction)
# relative magnitude of the first principal component

Z = X*U[:,0]

plt.figure(figsize = (10, 8))
plt.plot(Z)
plt.yticks([])
plt.show()

Reference: John P. Cunningham and Byron M. Yu, "Dimensionality reduction for large-scale neural recordings," Nature Neuroscience 17, 1500–1509 (2014)



6. Video Lectures

In [21]:
%%html
<center><iframe src="https://www.youtube.com/embed/OpYwMXvBGLo?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
In [22]:
%%html
<center><iframe src="https://www.youtube.com/embed/hgA3to0KNBY?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
In [23]:
%%html
<center><iframe src="https://www.youtube.com/embed/DmaiB5Kr36M?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
In [24]:
%%html
<center><iframe src="https://www.youtube.com/embed/NXIr6eleakI?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
In [25]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')