Dimension Reduction

By Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST

# 1. Multivariate Statistics

Correlation of Two Random Variables

\begin{align*}
\text{Sample variance} : S_{xx} &= \frac{1}{m-1} \sum\limits_{i=1}^{m}\left(x^{(i)}-\bar x\right)^2 \\
\text{Sample covariance} : S_{xy} &= \frac{1}{m-1} \sum\limits_{i=1}^{m}\left(x^{(i)}-\bar x\right)\left(y^{(i)}-\bar y \right)\\
\text{Sample covariance matrix} : S &= \begin{bmatrix} S_{xx} & S_{xy} \\ S_{yx} & S_{yy} \end{bmatrix}\\
\text{Sample correlation coefficient} : r &= \frac{S_{xy}}{ \sqrt {S_{xx}\cdot S_{yy}} }
\end{align*}
• The correlation coefficient $r$ measures the strength of the linear relationship between two variables, $x$ and $y$, and always lies in $[-1, 1]$
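
The sample statistics above can be computed directly from their definitions. The following is a minimal sketch, assuming NumPy is available; the arrays `x` and `y` are made-up toy data, and the variable names `S_xx`, `S_yy`, `S_xy` are chosen here only for illustration.

```python
import numpy as np

# Hypothetical toy data: m = 5 paired observations of x and y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
m = len(x)

# Sample variances and sample covariance, using the 1/(m-1) normalization.
S_xx = np.sum((x - x.mean()) ** 2) / (m - 1)
S_yy = np.sum((y - y.mean()) ** 2) / (m - 1)
S_xy = np.sum((x - x.mean()) * (y - y.mean())) / (m - 1)

# 2x2 sample covariance matrix (symmetric, since S_yx equals S_xy).
S = np.array([[S_xx, S_xy],
              [S_xy, S_yy]])

# Sample correlation coefficient.
r = S_xy / np.sqrt(S_xx * S_yy)

print(S)
print(r)
```

As a sanity check, `np.cov(x, y)` and `np.corrcoef(x, y)[0, 1]` should agree with `S` and `r`, since `np.cov` also normalizes by $m-1$ by default.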

Correlation Coefficient Plot

Covariance Matrix

$$\Sigma = \begin{bmatrix} E[(X_1-\mu_1)(X_1-\mu_1)]& E[(X_1-\mu_1)(X_2-\mu_2)] & \cdots &E[(X_1-\mu_1)(X_n-\mu_n)]\\ E[(X_2-\mu_2)(X_1-\mu_1)]& E[(X_2-\mu_2)(X_2-\mu_2)] & \cdots &E[(X_2-\mu_2)(X_n-\mu_n)]\\ \vdots & \vdots & \ddots & \vdots\\ E[(X_n-\mu_n)(X_1-\mu_1)]& E[(X_n-\mu_n)(X_2-\mu_2)] & \cdots &E[(X_n-\mu_n)(X_n-\mu_n)]\\ \end{bmatrix}$$
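
The covariance matrix is defined through expectations, so in practice it is estimated from samples. The sketch below is one way to do that, assuming NumPy and a data matrix `X` with one sample per row; the synthetic data, the seed, and the name `Sigma_hat` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: m = 1000 samples of n = 3 variables, one sample per row.
# X_2 is built to depend on X_1, while X_3 is independent of both.
m = 1000
x1 = rng.normal(size=m)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=m)
x3 = rng.normal(size=m)
X = np.column_stack([x1, x2, x3])

# Replace each expectation by a sample average:
# center every column, then average the products of centered entries.
Xc = X - X.mean(axis=0)
Sigma_hat = Xc.T @ Xc / (m - 1)

print(Sigma_hat)
```

The entry in row $i$, column $j$ estimates $E[(X_i-\mu_i)(X_j-\mu_j)]$, so the matrix is symmetric and its diagonal holds the variances. `np.cov(X, rowvar=False)` should return the same estimate.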

# 2. Principal Component Analysis (PCA)

## 2.1. Motivation

• Can we describe high-dimensional data in a "simpler" way?

$\quad \rightarrow$ Dimension reduction without losing too much information
$\quad \rightarrow$ Find a low-dimensional, yet useful representation of the data

• How?

Idea: highly correlated data contain redundant features

• Now consider a change of axes

• Each example $x$ has 2 features $\{u_1,u_2\}$

• Consider ignoring the feature $u_2$ for each example

• Each 2-dimensional example $x$ now becomes 1-dimensional $x = \{u_1\}$

• Are we losing much information by throwing away $u_2$ ?

• No. Most of the data spread is along $u_1$ (very little variance along $u_2$)

• Data $\rightarrow$ projection onto unit vector $\hat{u}_1$
• PCA is used when we want projections onto the directions of maximum variance
• Principal Components (PC): directions of maximum variability in the data
• Roughly speaking, PCA does a change of axes that can represent the data in a succinct manner
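
The change of axes and the projection onto $\hat{u}_1$ can be illustrated numerically. This is a minimal sketch, not a full PCA implementation: it relies on the standard result that the direction of maximum variance is the eigenvector of the sample covariance matrix with the largest eigenvalue, and the synthetic 2-D data and the names `u1`, `u2`, `z`, `explained` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-D data whose two features are strongly correlated.
m = 500
t = rng.normal(size=m)
X = np.column_stack([t, 0.5 * t + 0.1 * rng.normal(size=m)])

# Center the data and form the sample covariance matrix.
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / (m - 1)

# eigh returns eigenvalues in ascending order for a symmetric matrix,
# so the last eigenvector is the direction of largest variance.
eigvals, eigvecs = np.linalg.eigh(S)
u1 = eigvecs[:, -1]  # first principal direction (unit vector)
u2 = eigvecs[:, 0]   # second principal direction, orthogonal to u1

# Project each centered example onto u1: 2-D -> 1-D.
z = Xc @ u1

# Fraction of the total variance retained by keeping only u1.
explained = eigvals[-1] / eigvals.sum()
print(explained)
```

Because almost all of the spread in this toy data lies along `u1`, dropping the `u2` coordinate discards very little variance, which is the situation described in the bullets above.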