Singular Value Decomposition

 by Seungchul LeeiSystems Design Labhttp://isystems.unist.ac.kr/UNIST

# 1. Geometry of Linear Maps¶

Matrix $A$ (or linear transformation) = rotate + stretch/compress

an extremely important fact:

every matrix $A \in \mathbb{R}^{m \times n}$ maps the unit ball in $\mathbb{R}^n$ to an ellipsoid in $\mathbb{R}^m$
\begin{align*} S &= \{ x \in \mathbb{R}^n \mid \; \Vert x \Vert \leq 1 \}\\ AS &= \{Ax \mid x\in \mathbf{S}\}\end{align*}

# 2. Singular Values and Singular Vectors¶

• the numbers $\sigma_1, \cdots, \sigma_n$ are called the singular values of $A$ by convention, $\sigma_i > 0$

• the vectors $u_1, \cdots, u_n$ these are unit vectors along the principal semiaxes of $AS$

• the vectors $\upsilon_1, \cdots, \upsilon_n$ these are the preimages of the principal semiaxes, defined so that $$A\upsilon_i = \sigma_i u_i$$

## 2.1. Graphical Understanding¶

\begin{align*} A\begin{bmatrix} \upsilon_1& \upsilon_2 & \cdots & \upsilon_r \end{bmatrix} & =\begin{bmatrix} A\upsilon_1 & A\upsilon_2 & \cdots & A\upsilon_r \end{bmatrix} \\ & =\begin{bmatrix} \sigma_1u_1& \sigma_2u_2 & \cdots & \sigma_ru_r \end{bmatrix} \\ & =\begin{bmatrix} u_1& u_2 & \cdots & u_r \end{bmatrix}\begin{bmatrix} \sigma_1& & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_r \end{bmatrix} \\ \\ \therefore \; AV & = U\Sigma \quad (r \leq m, n) \end{align*}

## 2.2. Thin Singular Value Decomposition¶

We have $A \in \mathbb{R}^{m \times n}$, skinny and full rank (i.e., $r = n$), and

$$A\upsilon_i = \sigma_i u_i \quad \text{for } 1 \leq i \leq n$$

let

$$\hat U = \begin{bmatrix}u_1 & u_2 & \cdots & u_n \end{bmatrix}, \quad \hat \Sigma = \begin{bmatrix}\sigma_1& & & \\ & \sigma_2 &&\\ &&\ddots&\\ &&&\sigma_n \end{bmatrix}, \quad V = \begin{bmatrix} \upsilon_1 & \upsilon_2 & \cdots & \upsilon_n \end{bmatrix}$$

in matrix form, the above equation is $AV = \hat U \hat \Sigma$ and since $V$ is orthogonal

$$A = \hat U \hat \Sigma V^T$$

called the thin (or reduced) SVD of $A$

## 2.3. Full Singular Value Decomposition¶

We can add extra orthonormal columns to $U$ to make an orthogonal matrix $(m \geq n)$

$$U = \begin{bmatrix} u_1 & u_2 & \cdots & u_m \end{bmatrix}$$

We also add extra rows of zeros to $\hat \Sigma$, so

$$A = U\Sigma V^T$$

this is the (full) singular value decomposition of $A$

Every matrix $A$ has a singular value decomposition.

If $A$ is not full rank, then some of the diagonal entries of $\Sigma$ will be zero

## 2.4. SVD - Matrix factorization¶

• for any matrix A $$A = U\Sigma V^T$$

• for symmetric and positive definite matrix A

$$A = S\Lambda S^{-1} = S\Lambda S^T\quad (S: \text{eigenvectors})$$

## 2.5. Interpretation of SVD¶

The SVD decomposes the linear map into

• rotate by $V^T$

• diagonal scaling by $\sigma_i$

• ratate by $U$

Note that, unlike the eigen-decomposition, input and output directions are different

## 2.6. Example¶

\begin{array}{Icr} A = \begin{bmatrix} 4 & 4 \\ -3 & 3 \end{bmatrix} \end{array} \quad \quad \begin{array}{Icr} \begin{align*} &\textbf{orthonormal basis} \\ &\upsilon_1, \upsilon_2 \text{ in row space } \mathbb{R}^2 \\ &u_1, u_2 \text{ in column space } \mathbb{R}^2 \\ &\sigma_1 > 0 \\ &\sigma_2 > 0 \end{align*} \end{array}

$$\begin{array}{lcr} A\upsilon_1 = \sigma_1 u_1\\ A\upsilon_2 = \sigma_2 u_2 \end{array} \; \implies \; \begin{array}{lcr} AV = U\Sigma \\ A = U\Sigma V^{-1} = U\Sigma V^T \end{array}$$

1) $A^TA$: (small) positive definite & symmetric

$$V\Sigma^TU^TU\Sigma V^T = V\Sigma^T\Sigma V^T = V\begin{bmatrix} \sigma_1^2& & \\ & \sigma_2^2 & \\ & & \ddots \\ \end{bmatrix}V^T$$
In [1]:
A = [4 4;-3 3];
[V,D] = eig(A'*A)

Out[1]:
V =

-0.7071    0.7071
0.7071    0.7071

D =

18     0
0    32

2) $AA^T$: (large) positive definite & symmetric

$$U\Sigma V^TV\Sigma^TU^T = U\Sigma\Sigma^TU^T = U\begin{bmatrix} \sigma_1^2& & \\ & \sigma_2^2 & \\ & & \ddots \\ \end{bmatrix}U^T$$
In [2]:
A = [4 4;-3 3];
[U,D] = eig(A*A')

Out[2]:
U =

0     1
1     0

D =

18     0
0    32
In [6]:
[U,S,V] = svd(A)

Out[6]:
U =

-1     0
0     1

S =

5.6569         0
0    4.2426

V =

-0.7071   -0.7071
-0.7071    0.7071

# 3. PCA and SVD¶

• Any real symmetric and positive definite matrix B has a eigen decomposition
$$B = S\Lambda S^T$$
• A real matrix $(m\times n)$ A, where $m>n$, has the decompsition
$$A = U\Sigma V^T$$
• From A (skinny and full rank) we can construct two positive-definite symmetric matrices, $AA^T$ and $A^TA$
\begin{align*} AA^T &= U\Sigma V^T V \Sigma^T U^T = U \Sigma \Sigma^T U^T \quad (m \times m, \,n\, \text{eigenvalues and }\, m-n \; \text{zeros},\; U\; \text{eigenvectors})\\ A^TA &= V \Sigma^T U^T U\Sigma V^T = V \Sigma^T \Sigma V^T \quad (n \times n, \,n \;\text{eigenvalues},\; V\; \text{eigenvectors}) \end{align*}
• PCA by SVD $$B = A^TA = V \Sigma^T \Sigma V^T = V \Lambda V^T$$
• $V$ is eigenvectors of $B = A^TA$
• $\sigma_i^2 = \lambda_i$
• Note that in PCA, $$VS V^T \quad \left(\text{where }S = \frac{1}{m}X^TX = \left(\frac{X}{\sqrt{m}}\right)^T \frac{X}{\sqrt{m}} \right)$$

# 4. Low Rank Approximation: Dimension Reduction¶

## 4.1. Low Rank Approximation¶

• in 2D case
 \begin{align*} x & = c_1 \upsilon_1 + c_2 \upsilon_2 \\ & = \langle x, \upsilon_1 \rangle \upsilon_1 + \langle x, \upsilon_2 \rangle \upsilon_2\\ & = ( \upsilon_1^{T} x) \upsilon_1 + ( \upsilon_2^T x) \upsilon_2 \end{align*} \begin{align*} Ax & = c_1A \upsilon_1 + c_2 A \upsilon_2 \\ & = c_1\sigma_1 u_1 + c_2 \sigma_2 u_2 \\ & = u_1 \sigma_1 \upsilon_1^Tx + u_2\sigma_2 \upsilon_2^T x \approx u_1\sigma_1 \upsilon_1^Tx \quad (\text{if } \sigma_1 \gg \sigma_2)\\ \\ & = \begin{bmatrix} u_1 & u_2 \end{bmatrix}\begin{bmatrix}\sigma_1 & 0\\ 0 & \sigma_2 \end{bmatrix}\begin{bmatrix} \upsilon_1^T \\ \upsilon_2^T \end{bmatrix} x \\ \\ & = U \Sigma V^Tx \approx u_1\sigma_1 \upsilon_1^Tx \quad (\text{if } \sigma_1 \gg \sigma_2) \end{align*}
• Low rank approximation of matrix A in general \begin{align*} A &= U_r \Sigma_r V_r^T = \sum_{i=1}^{r}\sigma_i u_i \upsilon_i^T\\ \tilde A &= U_k \Sigma_k V_k^T = \sum_{i=1}^{k}\sigma_i u_i \upsilon_i^T \qquad (k \leq r)\end{align*}

## 4.2. Least Squares and SVD¶

• Consider $$\min_x \, \lVert Ax - b \rVert _2^2$$

• Let $A = U\Sigma V^T$

• Using the orthogonality of $U$ and $V$ we have

$$\lVert Ax - b \rVert _2^2 = \lVert U^T (AVV^Tx - b) \rVert _2^2 = \lVert \underbrace{\Sigma}_{\text{diagonal}} \underbrace{V^Tx}_{z}-U^Tb \rVert_2^2 = \sum_{i=1}^{r} \left( \sigma_i z_i - u_i^T b \right)^2 + \sum_{i=r+1}^{m} (u_i^Tb)^2$$
• The (minimum norm) solution of $\min_x \, \lVert Ax - b \rVert _2^2$ is given
\begin{align*} z_i &= \begin{cases} \frac{u_i^T b}{\sigma_i} & \quad i=1,\cdots,r\\ 0 & \quad i = r+1,\cdots,n \end{cases}\\ x &= Vz \\ x &= \sum_{i=1}^{r}\frac{u_i^T b}{\sigma_i} \upsilon_i \end{align*}

## 4.3. Regularized Least Squares and SVD¶

$$\min_{x} \;\lVert Ax - b\rVert_2^2 + \lambda \lVert x \rVert_2^2 = \min_x \,\Bigg\Vert \begin{bmatrix} A\\ \sqrt{\lambda}I \end{bmatrix} x - \begin{bmatrix} b\\ 0 \end{bmatrix} \Bigg\Vert^2_{2}$$\begin{align*} \left( A^T A + \lambda I \right) x &= A^T b \\ \left( V\Sigma^T U^T U\Sigma V^T + \lambda V V^T \right) x &= V\Sigma^T U^T b \\ V \left( \Sigma^T \Sigma + \lambda I\right)V^T x &= V\Sigma^T U^T b \\ \left( \Sigma^T \Sigma + \lambda I\right)V^T x &= \Sigma^T U^T b \qquad (V^Tx = z)\\ \underbrace{\left( \Sigma^T \Sigma + \lambda I\right)}_{\text{diagonal}}z &= \Sigma^T U^T b \end{align*}
• The (minimum norm) solution is given
\begin{align*} z_i &= \begin{cases} \frac{\sigma_i (u_i^T b)}{\sigma_i^2 + \lambda} & \quad i=1,\cdots,r\\ 0 & \quad i = r+1,\cdots,n \end{cases}\\ x &= Vz \\ x &= \sum_{i=1}^{r}\frac{\sigma_i (u_i^T b)}{\sigma_i^2 + \lambda} \upsilon_i = \sum_{i=1}^{r}\frac{u_i^T b}{\sigma_i + \frac{\lambda}{\sigma_i }} \upsilon_i \end{align*}

# 5. Example¶

In [7]:
A = ...
[11.08   6.82    1.76  -6.82;
2.50  -1.01   -2.60   1.19;
-4.88  -5.07   -3.21   5.20;
-0.49   1.52    2.07  -1.66;
-14.04  -12.40  -6.66  12.65;
0.27  -8.51  -10.19   9.15;
9.53  -9.84  -17.00  11.00;
-12.01   3.64   11.10  -4.48];

[U,S,V] = svd(A)

Out[7]:
U =

-0.2464    0.4454    0.6206    0.3274    0.4592    0.0541   -0.1865    0.0096
0.0743    0.1075    0.2834   -0.7780   -0.1035    0.3255   -0.4225    0.0466
0.2137   -0.1903    0.4949    0.1104   -0.4737   -0.6150   -0.2413   -0.0123
-0.0822   -0.0247    0.2006    0.0615   -0.2705    0.2982    0.2013   -0.8637
0.5038   -0.5538    0.1374   -0.0235    0.6100    0.0168   -0.0759   -0.2048
0.4372    0.0350   -0.0550    0.5017   -0.2966    0.5520   -0.3572    0.1806
0.5902    0.4266    0.2108   -0.1367   -0.0330   -0.0039    0.6243    0.1253
-0.2968   -0.5132    0.4279    0.0232   -0.1404    0.3450    0.4060    0.4017

S =

36.8258         0         0         0
0   26.2369         0         0
0         0    0.0220         0
0         0         0    0.0051
0         0         0         0
0         0         0         0
0         0         0         0
0         0         0         0

V =

-0.0356    0.9208   -0.1392   -0.3627
-0.5382    0.1662   -0.4922    0.6637
-0.6143   -0.3260   -0.3116   -0.6476
0.5760   -0.1354   -0.8008   -0.0929

Question

How to determine a good $k$ so that we can efficiently compress $A$ without losing too much information of $A$?

In [8]:
sig_val = diag(S);
sig_val = sig_val(1:4);

figure; stem(sig_val);  ylim([-1 40]);  xlim([0.8 4.1])
set(gca,'xtick',[1,2,3,4])
title('Singular Values')

Out[8]:
In [10]:
k = 2;

low_U = U(:,1:k);
low_S = S(1:k,1:k);
low_V = V(:,1:k);

A
A_approx = low_U*low_S*low_V'
low_V
low_U

Out[10]:
A =

11.0800    6.8200    1.7600   -6.8200
2.5000   -1.0100   -2.6000    1.1900
-4.8800   -5.0700   -3.2100    5.2000
-0.4900    1.5200    2.0700   -1.6600
-14.0400  -12.4000   -6.6600   12.6500
0.2700   -8.5100  -10.1900    9.1500
9.5300   -9.8400  -17.0000   11.0000
-12.0100    3.6400   11.1000   -4.4800

A_approx =

11.0825    6.8256    1.7653   -6.8089
2.4994   -1.0043   -2.6006    1.1946
-4.8783   -5.0650   -3.2062    5.2088
-0.4893    1.5220    2.0716   -1.6564
-14.0396  -12.3984   -6.6591   12.6524
0.2708   -8.5123  -10.1887    9.1493
9.5304   -9.8373  -16.9990   11.0037
-12.0086    3.6446   11.1030   -4.4724

low_V =

-0.0356    0.9208
-0.5382    0.1662
-0.6143   -0.3260
0.5760   -0.1354

low_U =

-0.2464    0.4454
0.0743    0.1075
0.2137   -0.1903
-0.0822   -0.0247
0.5038   -0.5538
0.4372    0.0350
0.5902    0.4266
-0.2968   -0.5132

## 5.1. Example : Image Approximation¶

• Learn how to reduce the amount of storage needed for an image using the singular value decomposition.
• Approximation of $A$
$$\tilde A = U_k \Sigma_k V_k^T = \sum_{i=1}^{k}\sigma_i u_i \upsilon_i^T \qquad (k \leq r)$$
In [6]:
%plot -s 800,420
g = rgb2gray(w);                    % convert rgb to gray

subplot(1,2,1), imshow(w)
subplot(1,2,2), imshow(g)

A = double(g);                      % uint8 -> double
[U,S,V] = svd(A,0);                 % reduced SVD

Out[6]:
In [3]:
%plot -s 560,420
sig_val = diag(S);
figure; stem(sig_val(1:30));  title('Singular Values')

Out[3]:

Approximated image when k = 30

In [4]:
k = 30;                 % low rank approximation

low_U = U(:,1:k);
low_S = S(1:k,1:k);
low_V = V(:,1:k);

A_approx = low_U*low_S*low_V';

figure; imshow(uint8(A_approx));

Out[4]:

Approximated images with k varied

In [5]:
%plot -s 800,600
K = [0 1 2 3 8 10 20 30];

figure;

subplot(2,4,1), imshow(g);  title('rank 400','fontsize',8)

for i = 2:length(K)
low_U = U(:,1:K(i));
low_S = S(1:K(i),1:K(i));
low_V = V(:,1:K(i));

A_approx = low_U*low_S*low_V';

subplot(2,4,i)
imshow(uint8(A_approx));
title(['rank ',num2str(K(i))],'fontsize',8)
end

Out[5]:

## 5.2. Reduce Noise in a Dataset¶

Create the initial mesh

In [5]:
[X,Y] = meshgrid(0:pi/30:3*pi/2,0:pi/30:3*pi/2);
Z = sin(X.*Y*pi/3);   % evaluate function over the grid

Out[5]:

Plot initial (exact) data

In [6]:
figure(1),  clf
surf(X,Y,Z);
view(-105,30)
axis equal

Out[6]:

Induce noise

In [7]:
maxNoise = .1;    % Change this parameter for the amount of noise to add

[nRows, nCols] = size(Z);
noise = 2*maxNoise*rand(nRows,nCols) - maxNoise;
Znoise = Z + noise;

Out[7]:

Plot noisy data

In [8]:
figure(2),  clf
surf(X,Y,Znoise);
view(-105,30)
axis equal

Out[8]:

Perform SVD

In [9]:
[U,S,V] = svd(Znoise,0);

Out[9]:

Plot the singular values

In [10]:
figure(3),  clf
stem(diag(S),'filled')

Out[10]:

Form rank-k approximation

In [11]:
k = 8;
Znoiseless = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';           % compute rank_k approximation

Out[11]:

Plot rank-k approximation

In [12]:
figure(4),  clf
surf(X,Y,Znoiseless);
view(-105,30)
axis equal

Out[12]:

All plots

In [13]:
%plot -s 800,600
figure(5),  clf
subplot(2,2,1),
surf(X,Y,Z);
view(-105,30)
axis equal

subplot(2,2,2),
surf(X,Y,Znoise);
view(-105,30)
axis equal

subplot(2,2,3),
stem(diag(S),'filled')

subplot(2,2,4),  surf(X,Y,Znoiseless);
view(-105,30)
axis equal