Singular Value Decomposition

by Seungchul Lee
iSystems Design Lab
http://isystems.unist.ac.kr/
UNIST

Table of Contents

1. Geometry of Linear Maps

Matrix $A$ (or linear transformation) = rotate + stretch/compress

an extremely important fact:


every matrix $A \in \mathbb{R}^{m \times n}$ maps the unit ball in $\mathbb{R}^n$ to an ellipsoid in $\mathbb{R}^m$
$$\begin{align*} S &= \{ x \in \mathbb{R}^n \mid \; \Vert x \Vert \leq 1 \}\\ AS &= \{Ax \mid x\in S\}\end{align*}$$
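A quick way to visualize this in $\mathbb{R}^2$; a minimal sketch that pushes points on the unit circle through the matrix used later in Section 2.6:

theta = linspace(0, 2*pi, 200);
S = [cos(theta); sin(theta)];       % points on the unit circle
A = [4 4; -3 3];                    % the matrix from the example in Section 2.6
AS = A*S;                           % image of the unit circle: an ellipse

figure; hold on; axis equal
plot(S(1,:), S(2,:), 'b')           % unit circle
plot(AS(1,:), AS(2,:), 'r')         % resulting ellipse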

2. Singular Values and Singular Vectors

  • the numbers $\sigma_1, \cdots, \sigma_n$ are called the singular values of $A$; by convention, $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n > 0$

  • the vectors $u_1, \cdots, u_n$ are the unit vectors along the principal semiaxes of $AS$

  • the vectors $\upsilon_1, \cdots, \upsilon_n$ are the preimages of the principal semiaxes, defined so that $$ A\upsilon_i = \sigma_i u_i$$

2.1. Graphical Understanding



$$\begin{align*} A\begin{bmatrix} \upsilon_1& \upsilon_2 & \cdots & \upsilon_r \end{bmatrix} & =\begin{bmatrix} A\upsilon_1 & A\upsilon_2 & \cdots & A\upsilon_r \end{bmatrix} \\ & =\begin{bmatrix} \sigma_1u_1& \sigma_2u_2 & \cdots & \sigma_ru_r \end{bmatrix} \\ & =\begin{bmatrix} u_1& u_2 & \cdots & u_r \end{bmatrix}\begin{bmatrix} \sigma_1& & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_r \end{bmatrix} \\ \\ \therefore \; AV & = U\Sigma \quad (r \leq m, n) \end{align*}$$
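This relation is easy to verify numerically; a minimal sketch using MATLAB's svd and the matrix from the example in Section 2.6:

A = [4 4; -3 3];
[U, Sigma, V] = svd(A);
norm(A*V - U*Sigma)                    % should be (numerically) zero
norm(A*V(:,1) - Sigma(1,1)*U(:,1))     % column-wise: A*v_i = sigma_i*u_i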

2.2. Thin Singular Value Decomposition

We have $A \in \mathbb{R}^{m \times n}$, skinny and full rank (i.e., $r = n$), and

$$A\upsilon_i = \sigma_i u_i \quad \text{for } 1 \leq i \leq n$$

let

$$\hat U = \begin{bmatrix}u_1 & u_2 & \cdots & u_n \end{bmatrix}, \quad \hat \Sigma = \begin{bmatrix}\sigma_1& & & \\ & \sigma_2 &&\\ &&\ddots&\\ &&&\sigma_n \end{bmatrix}, \quad V = \begin{bmatrix} \upsilon_1 & \upsilon_2 & \cdots & \upsilon_n \end{bmatrix} $$

in matrix form, the above equations read $AV = \hat U \hat \Sigma$, and since $V$ is orthogonal

$$ A = \hat U \hat \Sigma V^T $$

called the thin (or reduced) SVD of $A$
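MATLAB computes this directly with the economy-size option svd(A,0), which is also used in Section 5 below; a minimal sketch with an arbitrary skinny matrix:

A = randn(4, 2);                  % skinny and (almost surely) full rank
[Uhat, Shat, V] = svd(A, 0);      % thin / reduced SVD: Uhat is 4x2, Shat is 2x2
norm(A - Uhat*Shat*V')            % reconstruction error ~ 0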



2.3. Full Singular Value Decomposition

We can add extra orthonormal columns to $\hat U$ to make it an orthogonal matrix $U$ (here $m \geq n$)

$$ U = \begin{bmatrix} u_1 & u_2 & \cdots & u_m \end{bmatrix}$$

We also add extra rows of zeros to $\hat \Sigma$, so

$$ A = U\Sigma V^T$$



this is the (full) singular value decomposition of $A$

Every matrix $A$ has a singular value decomposition.

If $A$ is not full rank, then some of the diagonal entries of $\Sigma$ will be zero
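A minimal sketch of this fact, using a rank-1 matrix built from an outer product of two arbitrary vectors:

a = [1; 2; 3];  b = [2; 1];
A = a*b';                         % 3x2 matrix of rank 1
svd(A)                            % second singular value is (numerically) zero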

2.4. SVD - Matrix factorization

  • for any matrix $A$ $$ A = U\Sigma V^T $$

  • for a symmetric and positive definite matrix $A$

$$A = S\Lambda S^{-1} = S\Lambda S^T\quad (S: \text{eigenvectors})$$
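For a symmetric positive definite matrix the two factorizations agree (up to the ordering and signs of the vectors); a minimal sketch with an arbitrary $2\times 2$ SPD matrix:

B = [3 1; 1 2];                   % symmetric positive definite
[S_eig, Lambda] = eig(B);         % B = S*Lambda*S'
[U, Sigma, V] = svd(B);           % B = U*Sigma*V'
sort(diag(Lambda), 'descend')     % eigenvalues ...
diag(Sigma)                       % ... equal the singular values here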

2.5. Interpretation of SVD


The SVD decomposes the linear map into

  • rotate by $V^T$

  • diagonal scaling by $\sigma_i$

  • rotate by $U$

Note that, unlike the eigendecomposition, the input and output directions are different
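The three stages can be applied one at a time; a minimal sketch using the matrix from the example in Section 2.6 and an arbitrary input vector:

A = [4 4; -3 3];
[U, Sigma, V] = svd(A);
x = [1; 2];                       % arbitrary input

x1 = V'*x;                        % 1) rotate (coordinates in the v-basis)
x2 = Sigma*x1;                    % 2) scale each coordinate by sigma_i
x3 = U*x2;                        % 3) rotate into the output directions

norm(A*x - x3)                    % same as applying A directly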

2.6. Example

$$\begin{array}{c} A = \begin{bmatrix} 4 & 4 \\ -3 & 3 \end{bmatrix} \end{array} \quad \quad \begin{array}{c} \begin{aligned} &\textbf{orthonormal basis} \\ &\upsilon_1, \upsilon_2 \text{ in row space } \mathbb{R}^2 \\ &u_1, u_2 \text{ in column space } \mathbb{R}^2 \\ &\sigma_1 > 0 \\ &\sigma_2 > 0 \end{aligned} \end{array} $$


$$ \begin{array}{lcr} A\upsilon_1 = \sigma_1 u_1\\ A\upsilon_2 = \sigma_2 u_2 \end{array} \; \implies \; \begin{array}{lcr} AV = U\Sigma \\ A = U\Sigma V^{-1} = U\Sigma V^T \end{array}$$

1) $A^TA$: (small) positive definite & symmetric

$$V\Sigma^TU^TU\Sigma V^T = V\Sigma^T\Sigma V^T = V\begin{bmatrix} \sigma_1^2& & \\ & \sigma_2^2 & \\ & & \ddots \\ \end{bmatrix}V^T$$
In [1]:
A = [4 4;-3 3];
[V,D] = eig(A'*A)
Out[1]:
V =

   -0.7071    0.7071
    0.7071    0.7071


D =

    18     0
     0    32

2) $AA^T$: (large) positive definite & symmetric

$$U\Sigma V^TV\Sigma^TU^T = U\Sigma\Sigma^TU^T = U\begin{bmatrix} \sigma_1^2& & \\ & \sigma_2^2 & \\ & & \ddots \\ \end{bmatrix}U^T$$
In [2]:
A = [4 4;-3 3];
[U,D] = eig(A*A')
Out[2]:
U =

     0     1
     1     0


D =

    18     0
     0    32
In [6]:
[U,S,V] = svd(A)
Out[6]:
U =

    -1     0
     0     1


S =

    5.6569         0
         0    4.2426


V =

   -0.7071   -0.7071
   -0.7071    0.7071
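
The two eigendecompositions above and the SVD are consistent: the eigenvalues $18$ and $32$ of $A^TA$ (and $AA^T$) are the squares of the singular values, $4.2426 \approx \sqrt{18}$ and $5.6569 \approx \sqrt{32}$. Note that eig returns the eigenvalues in ascending order, and the signs of its eigenvectors may differ from those returned by svd. A quick check:

A = [4 4; -3 3];
sqrt(sort(eig(A'*A), 'descend'))     % 5.6569, 4.2426
svd(A)                               % same values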

3. PCA and SVD

  • Any real symmetric and positive definite matrix $B$ has an eigendecomposition
$$B = S\Lambda S^T$$
  • A real $(m\times n)$ matrix $A$, where $m>n$, has the decomposition
$$A = U\Sigma V^T$$
  • From $A$ (skinny and full rank) we can construct two symmetric positive semidefinite matrices, $AA^T$ and $A^TA$
$$\begin{align*} AA^T &= U\Sigma V^T V \Sigma^T U^T = U \Sigma \Sigma^T U^T \quad (m \times m, \,n\, \text{positive eigenvalues and }\, m-n \; \text{zeros},\; U\; \text{eigenvectors})\\ A^TA &= V \Sigma^T U^T U\Sigma V^T = V \Sigma^T \Sigma V^T \quad (n \times n, \,n \;\text{positive eigenvalues},\; V\; \text{eigenvectors}) \end{align*}$$
  • PCA by SVD $$B = A^TA = V \Sigma^T \Sigma V^T = V \Lambda V^T$$
    • $V$ contains the eigenvectors of $B = A^TA$
    • $\sigma_i^2 = \lambda_i$
    • Note that in PCA the same decomposition is applied to the sample covariance (see the sketch after this list): $$ \frac{1}{m}X^TX = \left(\frac{X}{\sqrt{m}}\right)^T \frac{X}{\sqrt{m}} = V\Lambda V^T$$
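A minimal sketch of PCA via the SVD, assuming a centered data matrix $X$ with $m$ samples in its rows; the data here is synthetic and only for illustration:

m = 500;
X = randn(m, 2)*[2 1; 0 0.5];         % synthetic, correlated 2-D data
X = X - mean(X);                      % center each column

[~, Sig, V] = svd(X/sqrt(m), 0);      % thin SVD of the scaled data matrix
diag(Sig).^2                          % sigma_i^2 = lambda_i ...
sort(eig(cov(X, 1)), 'descend')       % ... the eigenvalues of the sample covariance

The columns of V are the principal directions, obtained without ever forming $X^TX$ explicitly.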

4. Low Rank Approximation: Dimension Reduction

4.1. Low Rank Approximation

  • in the 2D case
$$\begin{align*} x & = c_1 \upsilon_1 + c_2 \upsilon_2 \\ & = \langle x, \upsilon_1 \rangle \upsilon_1 + \langle x, \upsilon_2 \rangle \upsilon_2\\ & = ( \upsilon_1^{T} x) \upsilon_1 + ( \upsilon_2^T x) \upsilon_2 \end{align*}$$ $$\begin{align*} Ax & = c_1A \upsilon_1 + c_2 A \upsilon_2 \\ & = c_1\sigma_1 u_1 + c_2 \sigma_2 u_2 \\ & = u_1 \sigma_1 \upsilon_1^Tx + u_2\sigma_2 \upsilon_2^T x \approx u_1\sigma_1 \upsilon_1^Tx \quad (\text{if } \sigma_1 \gg \sigma_2)\\ \\ & = \begin{bmatrix} u_1 & u_2 \end{bmatrix}\begin{bmatrix}\sigma_1 & 0\\ 0 & \sigma_2 \end{bmatrix}\begin{bmatrix} \upsilon_1^T \\ \upsilon_2^T \end{bmatrix} x \\ \\ & = U \Sigma V^Tx \approx u_1\sigma_1 \upsilon_1^Tx \quad (\text{if } \sigma_1 \gg \sigma_2) \end{align*}$$
  • Low rank approximation of matrix $A$ in general (see the sketch after this list) $$\begin{align*} A &= U_r \Sigma_r V_r^T = \sum_{i=1}^{r}\sigma_i u_i \upsilon_i^T\\ \tilde A &= U_k \Sigma_k V_k^T = \sum_{i=1}^{k}\sigma_i u_i \upsilon_i^T \qquad (k \leq r)\end{align*}$$
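A minimal sketch of the truncation as a running sum of rank-1 terms; the matrix here is random, and the same idea is applied to real data in Section 5:

A = randn(8, 4);
[U, S, V] = svd(A);

k = 2;
A_k = zeros(size(A));
for i = 1:k
    A_k = A_k + S(i,i)*U(:,i)*V(:,i)';        % add the i-th rank-1 term
end
norm(A_k - U(:,1:k)*S(1:k,1:k)*V(:,1:k)')     % identical to the block form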

4.2. Least Squares and SVD

  • Consider $$ \min_x \, \lVert Ax - b \rVert _2^2$$

  • Let $A = U\Sigma V^T$

  • Using the orthogonality of $U$ and $V$ we have

$$ \lVert Ax - b \rVert _2^2 = \lVert U^T (AVV^Tx - b) \rVert _2^2 = \lVert \underbrace{\Sigma}_{\text{diagonal}} \underbrace{V^Tx}_{z}-U^Tb \rVert_2^2 = \sum_{i=1}^{r} \left( \sigma_i z_i - u_i^T b \right)^2 + \sum_{i=r+1}^{m} (u_i^Tb)^2$$
  • The (minimum norm) solution of $ \min_x \, \lVert Ax - b \rVert _2^2$ is given by (a numerical check is sketched after this list)
$$ \begin{align*} z_i &= \begin{cases} \frac{u_i^T b}{\sigma_i} & \quad i=1,\cdots,r\\ 0 & \quad i = r+1,\cdots,n \end{cases}\\ x &= Vz \\ x &= \sum_{i=1}^{r}\frac{u_i^T b}{\sigma_i} \upsilon_i \end{align*}$$
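A minimal sketch checking this formula against MATLAB's backslash solver on random full-rank data:

A = randn(10, 3);  b = randn(10, 1);
[U, S, V] = svd(A, 0);            % thin SVD, r = 3 here

x_svd = V * ((U'*b) ./ diag(S));  % x = sum_i (u_i'b / sigma_i) v_i
x_ls  = A \ b;                    % built-in least squares
norm(x_svd - x_ls)                % (numerically) the same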

4.3. Regularized Least Squares and SVD

$$ \min_{x} \;\lVert Ax - b\rVert_2^2 + \lambda \lVert x \rVert_2^2 = \min_x \,\Bigg\Vert \begin{bmatrix} A\\ \sqrt{\lambda}I \end{bmatrix} x - \begin{bmatrix} b\\ 0 \end{bmatrix} \Bigg\Vert^2_{2} $$

$$\begin{align*} \left( A^T A + \lambda I \right) x &= A^T b \\ \left( V\Sigma^T U^T U\Sigma V^T + \lambda V V^T \right) x &= V\Sigma^T U^T b \\ V \left( \Sigma^T \Sigma + \lambda I\right)V^T x &= V\Sigma^T U^T b \\ \left( \Sigma^T \Sigma + \lambda I\right)V^T x &= \Sigma^T U^T b \qquad (V^Tx = z)\\ \underbrace{\left( \Sigma^T \Sigma + \lambda I\right)}_{\text{diagonal}}z &= \Sigma^T U^T b \end{align*} $$
  • The (minimum norm) solution is given by (a numerical check is sketched below)
$$ \begin{align*} z_i &= \begin{cases} \frac{\sigma_i (u_i^T b)}{\sigma_i^2 + \lambda} & \quad i=1,\cdots,r\\ 0 & \quad i = r+1,\cdots,n \end{cases}\\ x &= Vz \\ x &= \sum_{i=1}^{r}\frac{\sigma_i (u_i^T b)}{\sigma_i^2 + \lambda} \upsilon_i = \sum_{i=1}^{r}\frac{u_i^T b}{\sigma_i + \frac{\lambda}{\sigma_i }} \upsilon_i \end{align*}$$
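A minimal sketch checking this formula against the regularized normal equations on random data ($\lambda$ is an arbitrary value):

A = randn(10, 3);  b = randn(10, 1);  lambda = 0.1;
[U, S, V] = svd(A, 0);
s = diag(S);

x_svd = V * ((s .* (U'*b)) ./ (s.^2 + lambda));   % sigma_i(u_i'b)/(sigma_i^2 + lambda)
x_reg = (A'*A + lambda*eye(3)) \ (A'*b);          % regularized normal equations
norm(x_svd - x_reg)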

5. Example

In [7]:
A = ...
    [11.08   6.82    1.76  -6.82;
      2.50  -1.01   -2.60   1.19;
     -4.88  -5.07   -3.21   5.20;
     -0.49   1.52    2.07  -1.66;
    -14.04  -12.40  -6.66  12.65;
      0.27  -8.51  -10.19   9.15;
      9.53  -9.84  -17.00  11.00;
    -12.01   3.64   11.10  -4.48];

[U,S,V] = svd(A)
Out[7]:
U =

   -0.2464    0.4454    0.6206    0.3274    0.4592    0.0541   -0.1865    0.0096
    0.0743    0.1075    0.2834   -0.7780   -0.1035    0.3255   -0.4225    0.0466
    0.2137   -0.1903    0.4949    0.1104   -0.4737   -0.6150   -0.2413   -0.0123
   -0.0822   -0.0247    0.2006    0.0615   -0.2705    0.2982    0.2013   -0.8637
    0.5038   -0.5538    0.1374   -0.0235    0.6100    0.0168   -0.0759   -0.2048
    0.4372    0.0350   -0.0550    0.5017   -0.2966    0.5520   -0.3572    0.1806
    0.5902    0.4266    0.2108   -0.1367   -0.0330   -0.0039    0.6243    0.1253
   -0.2968   -0.5132    0.4279    0.0232   -0.1404    0.3450    0.4060    0.4017


S =

   36.8258         0         0         0
         0   26.2369         0         0
         0         0    0.0220         0
         0         0         0    0.0051
         0         0         0         0
         0         0         0         0
         0         0         0         0
         0         0         0         0


V =

   -0.0356    0.9208   -0.1392   -0.3627
   -0.5382    0.1662   -0.4922    0.6637
   -0.6143   -0.3260   -0.3116   -0.6476
    0.5760   -0.1354   -0.8008   -0.0929

Question

How do we determine a good $k$ so that we can efficiently compress $A$ without losing too much of its information?
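One common heuristic (not part of the original notebook) is to keep the smallest $k$ that captures most of the "energy" $\sum_{i \leq k}\sigma_i^2 \,/\, \sum_i \sigma_i^2$; a minimal sketch using the S computed above:

sig_val = diag(S);
energy = cumsum(sig_val.^2) / sum(sig_val.^2);    % fraction of energy captured by rank k
k = find(energy > 0.99, 1)                        % smallest k capturing 99% of the energy

For this A, k = 2 already captures essentially all of the energy, which is why the rank-2 approximation below reproduces A almost exactly.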

In [8]:
sig_val = diag(S);
sig_val = sig_val(1:4);

figure; stem(sig_val);  ylim([-1 40]);  xlim([0.8 4.1])
set(gca,'xtick',[1,2,3,4])
title('Singular Values')
Out[8]:
In [10]:
k = 2;

low_U = U(:,1:k);
low_S = S(1:k,1:k);
low_V = V(:,1:k);

A
A_approx = low_U*low_S*low_V'
low_V
low_U
Out[10]:
A =

   11.0800    6.8200    1.7600   -6.8200
    2.5000   -1.0100   -2.6000    1.1900
   -4.8800   -5.0700   -3.2100    5.2000
   -0.4900    1.5200    2.0700   -1.6600
  -14.0400  -12.4000   -6.6600   12.6500
    0.2700   -8.5100  -10.1900    9.1500
    9.5300   -9.8400  -17.0000   11.0000
  -12.0100    3.6400   11.1000   -4.4800


A_approx =

   11.0825    6.8256    1.7653   -6.8089
    2.4994   -1.0043   -2.6006    1.1946
   -4.8783   -5.0650   -3.2062    5.2088
   -0.4893    1.5220    2.0716   -1.6564
  -14.0396  -12.3984   -6.6591   12.6524
    0.2708   -8.5123  -10.1887    9.1493
    9.5304   -9.8373  -16.9990   11.0037
  -12.0086    3.6446   11.1030   -4.4724


low_V =

   -0.0356    0.9208
   -0.5382    0.1662
   -0.6143   -0.3260
    0.5760   -0.1354


low_U =

   -0.2464    0.4454
    0.0743    0.1075
    0.2137   -0.1903
   -0.0822   -0.0247
    0.5038   -0.5538
    0.4372    0.0350
    0.5902    0.4266
   -0.2968   -0.5132

5.1. Example: Image Approximation

  • Learn how to reduce the amount of storage needed for an image using the singular value decomposition (a rough storage estimate is sketched after this list).
  • Approximation of $A$
$$\tilde A = U_k \Sigma_k V_k^T = \sum_{i=1}^{k}\sigma_i u_i \upsilon_i^T \qquad (k \leq r)$$
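The saving comes from storing only $k(m+n+1)$ numbers ($U_k$, $\Sigma_k$, $V_k$) instead of $mn$; a minimal sketch of the ratio, with an assumed $400 \times 640$ image size for illustration (the true size depends on the loaded file):

m = 400;  n = 640;  k = 30;       % assumed image size; k as in the cells below
storage_full = m*n;               % original number of pixels
storage_svd  = k*(m + n + 1);     % k columns of U and V plus k singular values
storage_svd / storage_full        % fraction of the original storage (~12% here)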
In [6]:
%plot -s 800,420
load([pwd,'\data_files\spaceshuttle.mat']);
g = rgb2gray(w);                    % convert rgb to gray

subplot(1,2,1), imshow(w)
subplot(1,2,2), imshow(g)

A = double(g);                      % uint8 -> double
[U,S,V] = svd(A,0);                 % reduced SVD
Out[6]:
In [3]:
%plot -s 560,420
sig_val = diag(S);
figure; stem(sig_val(1:30));  title('Singular Values')
Out[3]:

Approximated image when k = 30

In [4]:
k = 30;                 % low rank approximation

low_U = U(:,1:k);
low_S = S(1:k,1:k);
low_V = V(:,1:k);

A_approx = low_U*low_S*low_V';

figure; imshow(uint8(A_approx));
Out[4]:

Approximated images with k varied

In [5]:
%plot -s 800,600
K = [0 1 2 3 8 10 20 30];

figure;

subplot(2,4,1), imshow(g);  title('rank 400','fontsize',8)

for i = 2:length(K)
    low_U = U(:,1:K(i));
    low_S = S(1:K(i),1:K(i));
    low_V = V(:,1:K(i));

    A_approx = low_U*low_S*low_V';

    subplot(2,4,i)
    imshow(uint8(A_approx));
    title(['rank ',num2str(K(i))],'fontsize',8)
end
Out[5]:

5.2. Reduce Noise in a Dataset

Create the initial mesh

In [5]:
[X,Y] = meshgrid(0:pi/30:3*pi/2,0:pi/30:3*pi/2);
Z = sin(X.*Y*pi/3);   % evaluate function over the grid
Out[5]:

Plot initial (exact) data

In [6]:
figure(1),  clf
surf(X,Y,Z);
view(-105,30)
axis equal
Out[6]:

Add noise

In [7]:
maxNoise = .1;    % Change this parameter for the amount of noise to add

[nRows, nCols] = size(Z);
noise = 2*maxNoise*rand(nRows,nCols) - maxNoise;
Znoise = Z + noise;
Out[7]:

Plot noisy data

In [8]:
figure(2),  clf
surf(X,Y,Znoise);
view(-105,30)
axis equal
Out[8]:

Perform SVD

In [9]:
[U,S,V] = svd(Znoise,0);
Out[9]:

Plot the singular values

In [10]:
figure(3),  clf
stem(diag(S),'filled')
Out[10]:

Form rank-k approximation

In [11]:
k = 8;
Znoiseless = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';           % compute rank_k approximation
Out[11]:

Plot rank-k approximation

In [12]:
figure(4),  clf
surf(X,Y,Znoiseless);
view(-105,30)
axis equal
Out[12]:

All plots

In [13]:
%plot -s 800,600
figure(5),  clf
subplot(2,2,1),
surf(X,Y,Z);
view(-105,30)
axis equal

subplot(2,2,2),
surf(X,Y,Znoise);
view(-105,30)
axis equal

subplot(2,2,3),
stem(diag(S),'filled')

subplot(2,2,4),  surf(X,Y,Znoiseless);
view(-105,30)
axis equal