Regression

Table of Contents

1. Linear Regression
2. Multivariate Linear Regression
3. Nonlinear Regression
4. Overfitting
5. Example: De-noising Signal
6. Example: Total Variation Image Reconstruction


Regression is a fundamental concept in machine learning, playing a crucial role in understanding and predicting continuous outcomes based on input features. Unlike classification, which assigns data points to discrete categories, regression models aim to establish a relationship between independent variables (features) and a continuous dependent variable (target).

At its core, regression seeks to capture patterns in data and predict numerical values by minimizing the error between the predicted values and the actual values. This makes regression indispensable for a wide range of applications.


In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('IRrGVQV8vZ8', width = "560", height = "315")
Out[ ]:

Consider a linear regression


  • Given $\; \begin{cases}x_{i} \; \text{: inputs} \\y_{i} \; \text{: outputs}\end{cases}$

$$x= \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{m} \end{bmatrix}, \qquad y= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix}$$
  • $ \hat{y}_{i} $: predicted output

$$ \hat{y}_{i} = f(x_{i}\,; \theta) \; \text{ in general}$$
  • In many cases, a linear model is used to predict $y_{i}$

  • Unknown model parameters $ \theta = \begin{bmatrix}\theta_{0} \\\theta_{1} \\\end{bmatrix}$

$$ y_i \approx \hat{y}_{i} = \theta_{0} + \theta_{1}x_{i} \; \quad \text{ such that }\quad \min\limits_{\theta_{0}, \theta_{1}}\sum\limits_{i = 1}^{m} \left(\hat{y}_{i} - y_{i} \right)^2$$
  • Need to find $\theta_{0}$ and $\theta_{1}$

$$ y= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix} \approx \begin{bmatrix} \hat y_{1} \\ \hat y_{2} \\ \vdots \\ \hat y_{m} \end{bmatrix} = \theta_{0} \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} + \theta_{1} \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{m} \end{bmatrix}$$





1.1. Re-cast the Problem as Least Squares

  • For convenience, we define a function that maps inputs to feature vectors, $\phi$

$$\begin{array}{Icr}\begin{align*} \hat{y}_{i} & = \theta_0 + x_i \theta_1 = 1 \cdot \theta_0 + x_i \theta_1 \\ \\ & = \begin{bmatrix}1 & x_{i}\end{bmatrix}\begin{bmatrix}\theta_{0} \\ \theta_{1}\end{bmatrix} \\\\ & =\begin{bmatrix}1 \\ x_{i} \end{bmatrix}^{T}\begin{bmatrix}\theta_{0} \\ \theta_{1}\end{bmatrix} \\\\ & =\phi^{T}(x_{i})\theta \end{align*}\end{array} \begin{array}{Icr} \quad \quad \text{feature vector} \; \phi(x_{i}) = \begin{bmatrix}1 \\ x_{i}\end{bmatrix} \end{array}$$



$$\Phi = \begin{bmatrix}1 & x_{1} \\ 1 & x_{2} \\ \vdots \\1 & x_{m} \end{bmatrix}=\begin{bmatrix}\phi^T(x_{1}) \\\phi^T(x_{2}) \\\vdots \\\phi^T(x_{m}) \end{bmatrix} \quad \implies \quad \hat{y} = \begin{bmatrix}\hat{y}_{1} \\\hat{y}_{2} \\\vdots \\\hat{y}_{m}\end{bmatrix}=\Phi\theta$$
  • Optimization problem (or least squares)

$$\min\limits_{\theta_{0}, \theta_{1}}\sum\limits_{i = 1}^{m} (\hat{y}_{i} - y_{i})^2 =\color{red}{\min\limits_{\theta}\lVert\Phi\theta-y\rVert^2_2} \qquad \qquad \left(\text{same as} \; \min_{x} \lVert Ax-b \rVert_2^2 \right)$$



$$ \text{solution} \; \theta^* = (A^TA)^{-1}A^T b $$
  • Why? (here, $A_i$ means the $i$th row of matrix $A$)

$$\sum_{i=1}^{m} \lvert \underbrace{ A_i x - b_i}_{\text{scalar}} \rvert^2 = \left\| \begin{bmatrix} A_1 x - b_1 \\ \vdots \\ A_m x - b_m \end{bmatrix} \right\|^2 = \left\| \begin{bmatrix} A_1 \\ \vdots \\ A_m \end{bmatrix} x - \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix} \right\|^2 = \lVert \underbrace{Ax-b}_{\text{vector}} \rVert_2^2$$

Note




$$\begin{array}{Icr} \text{input} \\ x_{i} \end{array} \quad \rightarrow \quad \begin{array}{Icr} \text{feature} \\ \begin{bmatrix} 1 \\ x_{i} \end{bmatrix} \end{array} \quad \rightarrow \quad \begin{array}{Icr} \text{predicted output} \\ \hat{y}_{i} \end{array}$$
$$\begin{array}{Icr} \begin{bmatrix} 1 & x_{1} \\1 & x_{2}\\\vdots & \vdots\\1 & x_{m}\end{bmatrix}\begin{bmatrix}\theta_0\\\theta_1\end{bmatrix}=\begin{bmatrix}y_{1} \\y_{2} \\\vdots \\y_{m}\end{bmatrix} \\ \begin{array}{Icr} \uparrow \\ \vec{A}_1 \end{array} \;\; \begin{array}{Icr} \uparrow \\ \vec{A}_2 \end{array} \quad \begin{array}{Icr} \uparrow \\ \vec{x} \end{array} \quad\quad \;\; \begin{array}{Icr} \uparrow \\ \vec{B} \end{array} \end{array} \quad \begin{array}{Icr} \quad \text{over-determined or} \\ \quad \text{projection} \end{array}$$
$$A(= \Phi) = \left[ \vec{A}_1 \;\; \vec{A}_2 \right]$$

1.2. Solve the Optimization in Linear Regression

1.2.1. Use Linear Algebra

  • known as least squares

$$ \theta = (A^TA)^{-1}A^T y $$
In [84]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [85]:
# data points in column vector [input, output]
x = np.array([0.1, 0.4, 0.7, 1.2, 1.3, 1.7, 2.2, 2.8, 3.0, 4.0, 4.3, 4.4, 4.9]).reshape(-1, 1)
y = np.array([0.5, 0.9, 1.1, 1.5, 1.5, 2.0, 2.2, 2.8, 2.7, 3.0, 3.5, 3.7, 3.9]).reshape(-1, 1)

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', alpha = 0.3)
plt.xlabel('X')
plt.ylabel('Y')
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
In [86]:
m = y.shape[0]
#A = np.hstack([np.ones([m, 1]), x])
A = np.hstack([x**0, x])
A = np.asmatrix(A)

theta = (A.T*A).I*A.T*y

print('theta:\n', theta)
theta:
 [[0.65306531]
 [0.67129519]]
In [87]:
# to plot
plt.figure(figsize = (6, 4))
plt.xlabel('X')
plt.ylabel('Y')
plt.plot(x, y, 'ko', label = "data", alpha = 0.3)

# to plot a straight line (fitted line)
xp = np.arange(0, 5, 0.01).reshape(-1, 1)
yp = theta[0,0] + theta[1,0]*xp

plt.plot(xp, yp, 'r', linewidth = 2, label = "regression")
plt.legend()
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()

1.2.2. Use Gradient Descent


$$\min_{\theta} ~ \lVert \hat y - y \rVert_2^2 = \min_{\theta} ~ \lVert A\theta - y \rVert_2^2$$
$$ \begin{align*} f &= (A\theta-y)^T(A\theta-y) = (\theta^TA^T-y^T)(A\theta-y) \\ &= \theta^TA^TA\theta - \theta^TA^Ty - y^TA\theta + y^Ty \\\\ \nabla f &= A^TA\theta + A^TA\theta - A^Ty - A^Ty = 2(A^TA\theta - A^Ty) \end{align*} $$
$$ \theta \leftarrow \theta - \alpha \nabla f $$
In [88]:
theta = np.random.randn(2,1)
theta = np.asmatrix(theta)

alpha = 0.001

for _ in range(1000):
    df = 2*(A.T*A*theta - A.T*y)
    theta = theta - alpha*df

print(theta)
[[0.65261835]
 [0.67143025]]
In [89]:
# to plot
plt.figure(figsize = (6, 4))
plt.xlabel('X')
plt.ylabel('Y')
plt.plot(x, y, 'ko', label = "data", alpha = 0.3)

# to plot a straight line (fitted line)
xp = np.arange(0, 5, 0.01).reshape(-1, 1)
yp = theta[0,0] + theta[1,0]*xp

plt.plot(xp, yp, 'r', linewidth = 2, label = "regression")
plt.legend()
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()

1.2.3. Use CVXPY Optimization


$$\min_{\theta} ~ \lVert \hat y - y \rVert_2 = \min_{\theta} ~ \lVert A\theta - y \rVert_2$$
In [90]:
import cvxpy as cvx

theta2 = cvx.Variable([2, 1])
obj = cvx.Minimize(cvx.norm(A@theta2 - y, 2))
cvx.Problem(obj,[]).solve()

print('theta:\n', theta2.value)
theta:
 [[0.65306531]
 [0.67129519]]
In [91]:
# to plot straight lines (fitted lines)
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', label = 'data', alpha = 0.3)
xp = np.arange(0, 5, 0.01).reshape(-1,1)
yp2 = theta2.value[0,0] + theta2.value[1,0]*xp

plt.plot(xp, yp2, 'r', linewidth = 2, label = '$L_2$')
plt.axis('equal')
plt.legend(loc = 5)
plt.grid(alpha = 0.3)
plt.show()

1.2.4. Scikit-Learn


  • Machine Learning in Python
  • Simple and efficient tools for data mining and data analysis
  • Accessible to everybody, and reusable in various contexts
  • Built on NumPy, SciPy, and matplotlib
  • Open source, commercially usable - BSD license
  • https://scikit-learn.org/stable/index.html#
In [92]:
from sklearn import linear_model
In [93]:
reg = linear_model.LinearRegression()
reg.fit(x, y)
Out[93]:
LinearRegression()
In [94]:
reg.coef_
Out[94]:
array([[0.67129519]])
In [95]:
reg.intercept_
Out[95]:
array([0.65306531])
In [96]:
# to plot
plt.figure(figsize = (6, 4))
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.plot(x, y, 'ko', label = "data", alpha = 0.3)

# to plot a straight line (fitted line)
plt.plot(xp, reg.predict(xp), 'r', linewidth = 2, label = "regression")
plt.legend(fontsize = 12)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()

We demonstrated four different implementations of linear regression: least squares, gradient descent, CVXPY, and Scikit-Learn.



1.2.5. L1 Norm

In the objective function, the error is defined as the difference between the predicted value and the ground truth. To measure the magnitude or strength of this error, the L2 norm is typically used, as it provides a smooth and differentiable measure of error, making it convenient for optimization.

However, as we have learned, there are many different norms available. It is often interesting to explore what happens when a different norm, such as the L1 norm, is used instead. The choice of norm can significantly impact the behavior and properties of the optimization process, affecting factors such as robustness, sparsity, and computational efficiency.


Let's use $L_1$ norm


$$ \min_{\theta} ~ \lVert \hat y - y \rVert_1 = \min_{\theta} ~ \lVert A\theta - y \rVert_1 $$

You might wonder how to solve this optimization problem since the L1 norm is used in the objective function. However, we will use CVXPY to find the optimal solution. You do not need to worry too much about the algorithm running beneath CVXPY.


In [97]:
theta1 = cvx.Variable([2, 1])
obj = cvx.Minimize(cvx.norm(A@theta1 - y, 1))
cvx.Problem(obj).solve()

print('theta:\n', theta1.value)
theta:
 [[0.6258208]
 [0.685448 ]]
In [98]:
# to plot data
plt.figure(figsize = (6, 4))
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.plot(x, y, 'ko', label = 'data', alpha = 0.3)

# to plot straight lines (fitted lines)
xp = np.arange(0, 5, 0.01).reshape(-1, 1)
yp1 = theta1.value[0,0] + theta1.value[1,0]*xp
yp2 = theta2.value[0,0] + theta2.value[1,0]*xp

plt.plot(xp, yp1, 'b', linewidth = 2, label = '$L_1$')
plt.plot(xp, yp2, 'r', linewidth = 2, label = '$L_2$')
plt.legend()
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()

Interestingly, the $L_1$ norm also provides a reasonable linear approximation. Therefore, at this point, you might conclude that the choice of distance definition does not significantly impact the results. However, let’s examine the following example, which includes outliers in the dataset, to further explore this effect.


What if outliers exist?

  • Fit with the different norms
  • Discuss the result
  • It is important to understand what makes them different.

In [99]:
# add outliers
x = np.vstack([x, np.array([0.5, 3.8]).reshape(-1, 1)])
y = np.vstack([y, np.array([3.0, 1.0]).reshape(-1, 1)])

A = np.hstack([x**0, x])
A = np.asmatrix(A)
In [100]:
A.shape
Out[100]:
(15, 2)
In [102]:
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', label = 'data', alpha = 0.3)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
In [103]:
theta2 = cvx.Variable([2, 1])
obj2 = cvx.Minimize(cvx.norm(A@theta2 - y, 2))
prob2 = cvx.Problem(obj2).solve()
In [104]:
# to plot straight lines (fitted lines)
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', label = 'data', alpha = 0.3)
xp = np.arange(0, 5, 0.01).reshape(-1,1)
yp2 = theta2.value[0,0] + theta2.value[1,0]*xp

plt.plot(xp, yp2, 'r', linewidth = 2, label = '$L_2$')
plt.axis('equal')
plt.legend(loc = 5)
plt.grid(alpha = 0.3)
plt.show()
In [108]:
theta1 = cvx.Variable([2, 1])
obj1 = cvx.Minimize(cvx.norm(A@theta1 - y, 1))
prob1 = cvx.Problem(obj1).solve()
In [109]:
# to plot straight lines (fitted lines)
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', label = 'data', alpha = 0.3)
xp = np.arange(0, 5, 0.01).reshape(-1,1)
yp1 = theta1.value[0,0] + theta1.value[1,0]*xp

plt.plot(xp, yp1, 'b', linewidth = 2, label = '$L_1$')
plt.axis('equal')
plt.legend(loc = 5)
plt.grid(alpha = 0.3)
plt.show()
In [110]:
# to plot data
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', label = 'data', alpha = 0.3)
plt.title('$L_1$ and $L_2$ Regression w/ Outliers', fontsize = 12)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)

# to plot straight lines (fitted lines)
xp = np.arange(0, 5, 0.01).reshape(-1,1)
yp1 = theta1.value[0,0] + theta1.value[1,0]*xp
yp2 = theta2.value[0,0] + theta2.value[1,0]*xp

plt.plot(xp, yp1, 'b', linewidth = 2, label = '$L_1$')
plt.plot(xp, yp2, 'r', linewidth = 2, label = '$L_2$')
plt.axis('scaled')
plt.legend(loc = 5)
plt.grid(alpha = 0.3)
plt.show()

The key differences between these two regression approaches arise from how they handle outliers in the data.


(1) Sensitivity to Outliers in L2-Norm Regression

  • The L2 norm minimizes the sum of squared errors:


    $$\min ~ \sum \left( \hat y - y \right)^2 $$
  • Squaring the residuals magnifies the effect of large errors, meaning that outliers have a significant influence on the final regression line.

  • As a result, the L2 regression line (red) is pulled towards the outliers, making it less robust.


(2) Robustness of L1-Norm Regression

  • The L1 norm minimizes the sum of absolute errors:


    $$\min ~ \sum \lvert \hat y - y \rvert $$
  • Since absolute differences are not squared, large errors (outliers) contribute less to the total loss compared to the L2 norm.

  • This makes the L1 regression line (blue) less sensitive to outliers, leading to a more robust fit that better represents the majority of the data.
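
To see numerically why squaring magnifies large errors, the following minimal sketch uses hypothetical residual values (chosen purely for illustration) and compares the share of the total L1 and L2 losses contributed by a single outlier.

In [ ]:
# hypothetical residuals: four small errors and one large outlier
import numpy as np

residuals = np.array([0.1, -0.2, 0.15, -0.1, 5.0])

l1_contrib = np.abs(residuals)    # contributions to the L1 loss
l2_contrib = residuals**2         # contributions to the L2 loss

# the outlier accounts for a far larger share of the squared loss,
# which is why the L2 fit is pulled toward it
print('outlier share of L1 loss: {:.2f}'.format(l1_contrib[-1]/l1_contrib.sum()))
print('outlier share of L2 loss: {:.2f}'.format(l2_contrib[-1]/l2_contrib.sum()))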


2. Multivariate Linear Regression

Multivariate Linear Regression is an extension of simple linear regression where multiple independent variables (features) are used to predict a dependent variable. This is useful in scenarios where multiple factors influence an outcome.


2.1. Linear Regression for Multivariate Data


$$ \hat{y} = \theta_0 + \theta_{1}x_1 + \theta_{2}x_2 $$



$$\phi \left(x^{(i)}\right) = \begin{bmatrix}1\\x^{(i)}_{1}\\x^{(i)}_{2} \end{bmatrix}$$
In [ ]:
# for 3D plot
from mpl_toolkits.mplot3d import Axes3D
In [ ]:
# y = theta0 + theta1*x1 + theta2*x2 + noise

n = 200
x1 = np.random.randn(n, 1)
x2 = np.random.randn(n, 1)
noise = 0.5*np.random.randn(n, 1);

y = 2 + 1*x1 + 3*x2 + noise

fig = plt.figure(figsize = (8, 6))
ax = fig.add_subplot(1, 1, 1, projection = '3d')
ax.set_title('Generated Data', fontsize = 12)
ax.set_xlabel('$X_1$', fontsize = 12)
ax.set_ylabel('$X_2$', fontsize = 12)
ax.set_zlabel('Y', fontsize = 12)
ax.scatter(x1, x2, y, marker = '.', label = 'Data')
#ax.view_init(30,30)
plt.legend(fontsize = 12)
plt.show()



$$\Phi = \begin{bmatrix}1 & x_{1}^{(1)} & x_{2}^{(1)}\\1 & x_{1}^{(2)} & x_{2}^{(2)}\\ \vdots \\1 & x_{1}^{(m)} & x_{2}^{(m)} \end{bmatrix} \quad \implies \quad \hat{y} = \begin{bmatrix}\hat{y}^{(1)} \\\hat{y}^{(2)} \\\vdots \\\hat{y}^{(m)}\end{bmatrix}=\Phi\theta$$



$$\implies \theta^{*} = (\Phi^T \Phi)^{-1} \Phi^T y$$

In matrix form, the least-squares solution remains identical in its formulation. The only difference is that when extending to multivariate regression, more columns are included in the feature matrix $\Phi$, representing additional independent variables.


In [ ]:
A = np.hstack([np.ones((n, 1)), x1, x2])
A = np.asmatrix(A)
theta = (A.T*A).I*A.T*y

X1, X2 = np.meshgrid(np.arange(np.min(x1), np.max(x1), 0.5),
                     np.arange(np.min(x2), np.max(x2), 0.5))
YP = theta[0,0] + theta[1,0]*X1 + theta[2,0]*X2

fig = plt.figure(figsize = (8, 6))
ax = fig.add_subplot(1, 1, 1, projection = '3d')
ax.set_title('Regression', fontsize = 12)
ax.set_xlabel('$X_1$', fontsize = 12)
ax.set_ylabel('$X_2$', fontsize = 12)
ax.set_zlabel('Y', fontsize = 12)
ax.scatter(x1, x2, y, marker = '.', label = 'Data')
ax.plot_wireframe(X1, X2, YP, color = 'k', alpha = 0.3, label = 'Regression Plane')
#ax.view_init(30,30)
plt.legend(fontsize = 12)
plt.show()

2.2. Feature (or Input) Importance in Multivariate Linear Regression

In multivariate linear regression, multiple features (inputs) contribute to predicting the target variable. However, not all features are equally important. Understanding feature importance helps in model interpretability, feature selection, and improving prediction performance.


$$ \hat{y} = \theta_0 + \theta_{1}x_1 + \theta_{2}x_2 + \cdots $$

The importance of a feature is directly related to the magnitude of its coefficient $\theta_i$. A larger absolute value of $\theta_i$ suggests a stronger influence on the target variable.

  • The larger the absolute value $\lvert \theta_i \rvert$, the more impact the feature has on the predicted outcome.

  • Potential issue: Different features may have different scales, which can distort comparisons.
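
One common remedy, sketched below on synthetic data (the two features, their scales, and the coefficients are illustrative assumptions, not taken from the examples above), is to standardize the features before fitting so that coefficient magnitudes become directly comparable; scikit-learn's StandardScaler can be used for this.

In [ ]:
# illustrative sketch: standardize features so that coefficient magnitudes are comparable
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

np.random.seed(0)
x1_demo = np.random.randn(200, 1)          # feature on a small scale
x2_demo = 100*np.random.randn(200, 1)      # second feature on a much larger scale
y_demo = 2 + 3*x1_demo + 0.03*x2_demo + 0.5*np.random.randn(200, 1)

X_demo = np.hstack([x1_demo, x2_demo])

raw = LinearRegression().fit(X_demo, y_demo)
std = LinearRegression().fit(StandardScaler().fit_transform(X_demo), y_demo)

print('raw coefficients:         ', raw.coef_)   # scales distort the comparison
print('standardized coefficients:', std.coef_)   # magnitudes now comparable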


2.3. Feature Selection in Multivariate Linear Regression

In multivariate linear regression, some features may have little to no impact on the target variable. If a feature’s coefficient is relatively small compared to others, it contributes minimally to the prediction and can be eliminated to simplify the model. Simplifying the model leads to better generalization and computational efficiency.


Example

Before Feature Selection:


$$ y = 3.2x_1 + 0.01x_2 + 2.5x_3 + 0.0001x_4 $$

Here, $x_2$ and $x_4$ have negligible impact.


After Feature Selection:


$$ y \approx 3.2x_1 + 2.5x_3 $$

Simpler model, easier to interpret, and more efficient.
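
A minimal sketch of this selection step, using the coefficients from the example above and a hand-picked threshold (an illustrative assumption), keeps only the features whose coefficient magnitude exceeds the threshold.

In [ ]:
# illustrative sketch: keep only features whose coefficient magnitude exceeds a threshold
import numpy as np

theta_example = np.array([3.2, 0.01, 2.5, 0.0001])   # coefficients of x1, x2, x3, x4 from the example
threshold = 0.1                                      # hand-picked cut-off, for illustration only

selected = np.where(np.abs(theta_example) > threshold)[0]
print('selected features:', ['x{}'.format(i + 1) for i in selected])   # -> ['x1', 'x3']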



3. Nonlinear Regression

Linear regression is a foundational technique in statistical modeling that assumes a straight-line (linear) relationship between the independent variables and the dependent variable. It finds the line that best fits the data points by minimizing the sum of squared errors. This approach works well if the underlying relationship in the data is approximately linear. However, when the data exhibit a nonlinear pattern, a straight line may not adequately capture the relationship, leading to large errors or systematic patterns in the residuals.

When a linear model is insufficient, one should consider using nonlinear regression techniques that can accommodate curvature or other complex nonlinear behaviors in the data. In such cases, it is natural to transition from a linear regression approach to a nonlinear regression model in order to better fit the data.

This section explores how to construct a regression model capable of approximating non-linearly distributed data.

(= Linear Regression for Non-linear Data)



3.1. Nonlinear Regression with Polynomial Features

One of the simplest approaches to nonlinear regression is polynomial regression, which extends linear regression by incorporating polynomial terms to capture curvature in the data.

Consider a dataset consisting of paired data points $(x_i, y_i)$, where each $x_i$ is an independent variable and $y_i$ is the corresponding dependent variable. Instead of modeling $y$ as a simple linear function of $x$, polynomial regression extends the feature set by including higher-order terms:


$$ \begin{align*} y &= \theta_0 + \theta_1 x + \theta_2 x^2 + \text{noise} \end{align*} $$



$$\phi(x_{i}) = \begin{bmatrix}1\\x_{i}\\x_{i}^2 \end{bmatrix}$$

Here, a quadratic model is used as an example, and $\Phi$ is constructed using powers of $x$, forming an augmented feature space:


$$\Phi = \begin{bmatrix}1 & x_{1} & x_{1}^2 \\ 1 & x_{2} & x_{2}^2 \\ \vdots \\ 1 & x_{m} & x_{m}^2\end{bmatrix} \quad \implies \quad \hat{y} = \begin{bmatrix}\hat{y}_1 \\\hat{y}_2 \\\vdots \\\hat{y}_m\end{bmatrix}=\Phi\theta$$

where each row represents an observation, and each column corresponds to a different polynomial term.


Although the equation appears nonlinear in terms of $x$, it remains linear in terms of the unknown parameters $\theta$. The model is still solvable using linear regression techniques, such as the least-squares method:


$$\implies \theta^{*} = (\Phi^T \Phi)^{-1} \Phi^T y$$
In [ ]:
# y = theta0 + theta1*x + theta2*x^2 + noise

n = 100
x = -5 + 15*np.random.rand(n, 1)
noise = 10*np.random.randn(n, 1)

y = 10 + 1*x + 2*x**2 + noise

plt.figure(figsize = (6, 4))
plt.xlabel('X')
plt.ylabel('Y')
plt.plot(x, y, 'o', markersize = 4, alpha = 0.3)
plt.grid(alpha = 0.3)
plt.show()
In [ ]:
A = np.hstack([x**0, x, x**2])
A = np.asmatrix(A)

theta = (A.T*A).I*A.T*y
print('theta:\n', theta)
theta:
 [[11.47683907]
 [ 1.07754213]
 [ 1.93265672]]
In [ ]:
xp = np.linspace(np.min(x), np.max(x))
yp = theta[0,0] + theta[1,0]*xp + theta[2,0]*xp**2

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', markersize = 4, alpha = 0.3)
plt.plot(xp, yp, 'r', linewidth = 2)

plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.show()

By understanding this example, we can effectively model nonlinear patterns using polynomial regression while leveraging the simplicity of linear regression methods.


3.2. Linear Basis Function Models

In the above section, we introduced polynomial regression. However, here, we interpret the same method from a slightly different perspective. We have learned that matrix-vector multiplication can be thought of as a linear combination of the columns of the matrix, with coefficients given by the vector. If $\Phi$ consists of columns representing basis functions, then $\Phi \theta$ represents one of the functions that is spanned by these basis functions. The least-squares solution $\theta^*$ will provide the optimal coefficients for this function.

This approach is referred to as linear basis function models. With this perspective, polynomial regression can be viewed as a special case of linear basis function models, where the basis functions are polynomials.


Consider linear combinations of fixed nonlinear functions of the input variables, of the form


$$ \begin{bmatrix} 1 & x_{1} & x_1^2\\1 & x_{2} & x_2^2\\\vdots & \vdots\\1 & x_{m} & x_m^2 \end{bmatrix} \begin{bmatrix}\theta_0\\\theta_1 \\ \theta_2 \end{bmatrix} \quad \implies \quad \begin{bmatrix} \mid & \mid & \mid \\ b_0(x) & b_1(x) & b_2(x)\\ \mid & \mid & \mid \end{bmatrix} \begin{bmatrix}\theta_0\\\theta_1 \\ \theta_2 \end{bmatrix} $$
$$ \hat{y}=\sum_{i=0}^d{\theta_i b_i(x)} = \Phi \theta$$

1) Polynomial functions


$$b_i(x) = x^i, \quad i = 0,\cdots,d$$
In [ ]:
xp = np.arange(-1, 1, 0.01).reshape(-1, 1)
polybasis = np.hstack([xp**i for i in range(6)])

plt.figure(figsize = (6, 4))

for i in range(6):
    plt.plot(xp, polybasis[:,i], label = '$x^{}$'.format(i))

plt.title('Polynomial Basis')
plt.xlabel('X')
plt.ylabel('Y')
plt.axis([-1, 1, -1.1, 1.1])
plt.grid(alpha = 0.3)
plt.legend()
plt.show()

For instance, if the coefficients are $[-1, 1, 1, -1, 1, -1]^T$ with the polynomial basis, the nonlinear function would be:


$$\hat y = -1 + 1 \cdot x + 1 \cdot x^2 - 1 \cdot x^3 + 1 \cdot x^4 -1 \cdot x^5$$
In [ ]:
theta = np.array([-1, 1, 1, -1, 1, -1])
yhat = polybasis.dot(theta)

plt.figure(figsize = (6, 4))

plt.plot(xp, yhat)
plt.xlabel('X')
plt.ylabel('Y')
plt.axis([-1, 1, -1.1, 1.1])
plt.grid(alpha = 0.3)
plt.show()

​The form of a nonlinear function is determined by the specific values of its coefficients. Consequently, solving a nonlinear regression problem involves identifying the set of coefficients that enable the function to most accurately approximate the given data points.



2) RBF functions

A Radial Basis Function (RBF) is a type of basis function whose value depends only on the distance between the input and a fixed center point.

  • Gaussian RBF (most commonly used) with bandwidth $\sigma$ and $k$ RBF centers $\mu_i \in \mathbb{R}^n$

$$ b_i(x) = \exp \left( - \frac{\lVert x-\mu_i \rVert^2}{2\sigma^2}\right) $$
  • Useful for capturing localized patterns

In [ ]:
d = 9

u = np.linspace(-1, 1, d)
sigma = 0.2

rbfbasis = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])

plt.figure(figsize = (6, 4))

for i in range(d):
    plt.plot(xp, rbfbasis[:,i], label='$\mu = {}$'.format(u[i]))

plt.title('RBF basis')
plt.xlabel('X')
plt.ylabel('Y')
plt.axis([-1, 1, -0.1, 1.1])
plt.legend(loc = 'lower right')
plt.grid(alpha = 0.3)
plt.show()
In [ ]:
theta = np.array([1, 1, 1, -1, -1, 1, -1, -1, -1])
yhat = rbfbasis.dot(theta)

plt.figure(figsize = (6, 4))

plt.plot(xp, yhat)
plt.xlabel('X')
plt.ylabel('Y')
# plt.axis([-1, 1, -1.1, 1.1])
plt.grid(alpha = 0.3)
plt.show()

Nonlinear regression with linear basis function models

We are now ready to perform nonlinear regression on the given data points.


(1) Polynomial functions


In [ ]:
n = 100
x = -5 + 15*np.random.rand(n, 1)
noise = 10*np.random.randn(n, 1)

y = 10 + 1*x + 2*x**2 + noise

plt.figure(figsize = (6, 4))
plt.xlabel('X')
plt.ylabel('Y')
plt.plot(x, y, 'o', markersize = 4, alpha = 0.3)
plt.grid(alpha = 0.3)
plt.show()
In [ ]:
xp = np.linspace(np.min(x), np.max(x)).reshape(-1, 1)

d = 3

A = np.hstack([x**i for i in range(d)])
polybasis = np.hstack([xp**i for i in range(d)])

A = np.asmatrix(A)
polybasis = np.asmatrix(polybasis)

theta = (A.T*A).I*A.T*y
yp = polybasis*theta

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', markersize = 4, alpha = 0.3)
plt.plot(xp[:,0], yp[:,0], linewidth = 2)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.show()

(2) RBF functions


In [ ]:
xp = np.linspace(np.min(x), np.max(x)).reshape(-1, 1)

d = 5
u = np.linspace(-6, 12, d)
sigma = 3

A = np.hstack([np.exp(-(x-u[i])**2/(2*sigma**2)) for i in range(d)])
rbfbasis = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])

A = np.asmatrix(A)
rbfbasis = np.asmatrix(rbfbasis)

theta = (A.T*A).I*A.T*y
yp = rbfbasis*theta

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', markersize = 4, alpha = 0.3)
plt.plot(xp, yp)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.show()

Nonlinear regression utilizing polynomial features and linear basis function models employing polynomial basis functions are fundamentally equivalent in their mathematical structure. Both approaches aim to model complex, nonlinear relationships by transforming the original input variables into a higher-dimensional space where linear regression techniques can be applied. In the case of polynomial regression, this transformation involves augmenting the input data with polynomial terms (e.g., $ x, x^2, x^3, \cdots $), effectively enabling the modeling of nonlinear patterns within the data. Similarly, linear basis function models utilize polynomial basis functions to achieve a comparable transformation, facilitating the capture of nonlinear relationships through a linear modeling framework.

Despite this equivalence, the two methodologies offer distinct perspectives in terms of implementation and interpretation. Polynomial regression is often viewed as an extension of linear regression, where additional polynomial terms are included to account for curvature in the data. This approach is straightforward and leverages the simplicity of linear models, making it accessible and interpretable. On the other hand, linear basis function models provide a more generalized framework, allowing for the incorporation of various types of basis functions beyond polynomials, such as splines or radial basis functions. This flexibility enables practitioners to tailor the model to the specific characteristics of the data, potentially enhancing predictive performance and interpretability.


4. Overfitting

In machine learning, one of the most common challenges is overfitting. Overfitting occurs when a model learns not only the important patterns in the training data but also the noise and random fluctuations. Instead of performing well on new, unseen data, the model becomes highly tuned to the training data, resulting in poor generalization.

This is similar to a student who memorizes answers to specific questions rather than understanding the concepts, leading to poor performance when asked different questions. In the same way, an overfitted model may perform very well on the training set but fail to make accurate predictions on test data.

Understanding and addressing overfitting is essential for building models that generalize effectively. We introduce the concept of overfitting, its causes, and its impact on model performance. We will also cover practical methods to prevent overfitting, such as regularization.

With clear examples and explanations, this section aims to provide an understanding of overfitting and how to handle it. By the end of this section, you will be equipped with the knowledge to create models that strike the right balance between fitting the data and making reliable predictions on new data.


4.1. Nonlinear Regression with 10 Data Points

We will consider a simple case where we have ten data points and build a non-linear regression model using polynomial functions of varying degrees. By visualizing the fitted models, we will observe how the degree of the polynomial affects the fit and identify cases of underfitting, appropriate fitting, and overfitting.


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# 10 data points
n = 10
x = np.linspace(-4.5, 4.5, 10).reshape(-1, 1)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512]).reshape(-1, 1)

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()

Degree 1 (Linear Regression):

  • The model underfits the data
  • It fails to capture the non-linear pattern and results in a straight-line fit.

In [ ]:
A = np.hstack([x**0, x])
A = np.asmatrix(A)

theta = (A.T*A).I*A.T*y
print(theta)
[[-0.7774    ]
 [-0.71070424]]
In [ ]:
# to plot
xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)
yp = theta[0,0] + theta[1,0]*xp

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp[:,0], yp[:,0], linewidth = 2)
plt.title('Linear Regression')
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()

Degree 3 (Cubic Polynomial):

  • The model fits the data well
  • It captures the underlying trend without being too flexible or too rigid.

In [ ]:
A = np.hstack([x**0, x, x**2, x**3])
A = np.asmatrix(A)

theta = (A.T*A).I*A.T*y
print(theta)
[[ 0.33669062]
 [-0.58052442]
 [-0.13504129]
 [-0.00888599]]
In [ ]:
# to plot
xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)
yp = theta[0,0] + theta[1,0]*xp + theta[2,0]*xp**2 + theta[3,0]*xp**3

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp[:,0], yp[:,0], linewidth = 2)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()

Degree 9 (High-Degree Polynomial):

  • The model overfits the data
  • It closely follows every data point, including noise, resulting in a wavy curve that fails to generalize well to unseen data.

In [ ]:
A = np.hstack([x**i for i in range(10)])
A = np.asmatrix(A)

theta = (A.T*A).I*A.T*y

# to plot
xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)

polybasis = np.hstack([xp**i for i in range(10)])
polybasis = np.asmatrix(polybasis)

yp = polybasis*theta

plt.figure(figsize = (6, 8))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp[:,0], yp[:,0], linewidth = 2)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
In [ ]:
d = [1, 3, 5, 9]
RSS = []

plt.figure(figsize = (8, 6))
plt.suptitle('Nonlinear Regression', fontsize = 12)

for k in range(4):
    A = np.hstack([x**i for i in range(d[k]+1)])
    polybasis = np.hstack([xp**i for i in range(d[k]+1)])

    A = np.asmatrix(A)
    polybasis = np.asmatrix(polybasis)

    theta = (A.T*A).I*A.T*y
    yp = polybasis*theta

    RSS.append(np.linalg.norm(y - A*theta, 2)**2)

    plt.subplot(2, 2, k+1)
    plt.plot(x, y, 'o', alpha = 0.5)
    plt.plot(xp, yp)
    plt.axis([-5, 5, -12, 6])
    plt.title('degree = {}'.format(d[k]))
    plt.grid(alpha=0.3)

plt.show()

Residual Sum of Squares


In [ ]:
plt.figure(figsize = (6, 4))
plt.stem(d, RSS, label = 'RSS')
plt.title('Residual Sum of Squares', fontsize = 12)
plt.xlabel('degree', fontsize = 12)
plt.ylabel('RSS', fontsize = 12)
plt.legend(fontsize = 12)
plt.grid(alpha = 0.3)
plt.show()

Residual Sum of Squares (RSS) decreases as the polynomial degree increases. A 9th-order polynomial regression applied to a dataset with 10 data points will result in zero RSS. However, a lower error does not necessarily indicate a better model, as overfitting may compromise generalization.


Key Takeaways

  • Underfitting: Happens when the model is too simple (e.g., a low-degree polynomial) and fails to capture the pattern.

  • Good Fit: Achieved when the model is complex enough to capture the trend but not so flexible that it fits noise.

  • Overfitting: Happens when the model is too complex (e.g., a high-degree polynomial) and memorizes noise, reducing generalization.


By testing different degrees of polynomial functions, this example demonstrates the importance of finding the right balance between model complexity and generalization to avoid underfitting and overfitting.


We will repeat the same case to build a nonlinear regression model using RBF functions.


In [ ]:
xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)

d = 5
u = np.linspace(-4.5, 4.5, d)
sigma = 2

A = np.hstack([np.exp(-(x-u[i])**2/(2*sigma**2)) for i in range(d)])
rbfbasis = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])

A = np.asmatrix(A)
rbfbasis = np.asmatrix(rbfbasis)

theta = (A.T*A).I*A.T*y
yp = rbfbasis*theta

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp, yp)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
In [ ]:
d = [2, 4, 6, 10]
sigma = 2

plt.figure(figsize = (8, 6))

for k in range(4):
    u = np.linspace(-4.5, 4.5, d[k])

    A = np.hstack([np.exp(-(x-u[i])**2/(2*sigma**2)) for i in range(d[k])])
    rbfbasis = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d[k])])

    A = np.asmatrix(A)
    rbfbasis = np.asmatrix(rbfbasis)

    theta = (A.T*A).I*A.T*y
    yp = rbfbasis*theta

    plt.subplot(2, 2, k+1)
    plt.plot(x, y, 'o', alpha = 0.5)
    plt.plot(xp, yp)
    plt.axis([-5, 5, -12, 6])
    plt.title('num RBFs = {}'.format(d[k]), fontsize = 10)
    plt.grid(alpha = 0.3)

plt.show()

4.2. Overfitting

Overfitting occurs when a machine learning model performs well on the training data but poorly on unseen or test data. The model essentially "memorizes" the noise and details in the training set rather than learning the general patterns that can be applied to new data.


Overfitting Problem

  • Have you come across a situation where your model performed exceptionally well on the training data but was not able to predict the test data?
  • One of the most common problems data science professionals face is avoiding overfitting.

Causes of Overfitting

  • Model Complexity: A model with too many parameters (e.g., deep neural networks) can overfit the training data.
  • Insufficient Training Data: If the training dataset is small, the model may fit every point precisely, capturing noise rather than general patterns.
  • Noisy Data: If the data has a lot of noise or irrelevant features, the model may try to learn this noise.
  • Too Many Training Iterations: Training for too long can lead to overfitting, where the model starts fitting noise instead of actual patterns.

Signs of Overfitting

  • High Training Accuracy, Low Test Accuracy: The model performs well on the training set but poorly on the validation/test set.
  • Large Gap Between Training and Validation Loss: Training loss continues to decrease, but validation loss increases.

Generalization Error

Generalization Error refers to the difference between a machine learning model's performance on the training data and its performance on unseen data (test or validation data). In other words, the generalization error measures the difference between the expected loss on unseen test data and the loss on the training data. It measures how well the model can generalize to new, unseen examples. A model with a low generalization error performs well not only on the training set but also on new data.

Mathematically, the generalization error can be defined as:


$$ \text{Generalization Error} = \mathbb{E}_{D_{\text{test}}}[\mathcal{L}(f(x), y)] - \mathbb{E}_{D_{\text{train}}}[\mathcal{L}(f(x), y)], $$

where:

  • $D_{\text{test}}$ and $D_{\text{train}}$ are the test and training data distributions.
  • $f(x)$ is the model's prediction function.
  • $\mathcal{L}(f(x), y)$ is the loss function measuring the difference between the predicted and true values.
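
As a purely illustrative sketch of this definition, the gap can be estimated by holding out part of the data: fit on a training split, evaluate the same loss on a held-out split, and compare. The synthetic data, the split, and the degree-9 polynomial below are arbitrary choices for demonstration.

In [ ]:
# illustrative sketch: estimate the generalization gap with a held-out split (synthetic data)
import numpy as np

np.random.seed(0)
x_all = np.random.uniform(-4.5, 4.5, (30, 1))
y_all = 0.5*x_all - 0.1*x_all**2 + np.random.randn(30, 1)

x_tr, y_tr = x_all[:20], y_all[:20]     # training split
x_te, y_te = x_all[20:], y_all[20:]     # held-out test split

deg = 9                                 # deliberately flexible model
A_tr = np.hstack([x_tr**i for i in range(deg + 1)])
A_te = np.hstack([x_te**i for i in range(deg + 1)])

theta_fit, *_ = np.linalg.lstsq(A_tr, y_tr, rcond = None)

train_loss = np.mean((A_tr@theta_fit - y_tr)**2)
test_loss = np.mean((A_te@theta_fit - y_te)**2)
print('train loss:', train_loss)
print('test loss :', test_loss)
print('estimated generalization gap:', test_loss - train_loss)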

Techniques to Prevent Overfitting


Regularization

  • Prevent overfitting by adding a penalty to the loss function, encouraging simpler models with smaller weights.
  • This penalty discourages the model from fitting noise in the training data by constraining the values of the model's parameters.
  • Regularization helps the model generalize better to unseen data by making it simpler and less sensitive to outliers.

Early Stopping

  • Monitors the validation loss during training and stops when the loss starts increasing to prevent overfitting.




Data Augmentation

  • Increases the training set size by applying transformations (e.g., rotations, flips, cropping) to existing data, making the model more generalizable.





Cross-Validation

  • Splits the data into multiple folds to train and validate the model on different subsets, ensuring that the model is not overfitting specific subsets of data.
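
A minimal scikit-learn sketch of 5-fold cross-validation is shown below; the synthetic data, the Ridge estimator, and alpha = 0.1 are arbitrary illustrative choices, not part of the notebook's dataset.

In [ ]:
# illustrative sketch: 5-fold cross-validation with scikit-learn on synthetic data
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

np.random.seed(0)
X_cv = np.random.randn(50, 3)
y_cv = X_cv @ np.array([1.0, 2.0, 0.0]) + 0.1*np.random.randn(50)

scores = cross_val_score(Ridge(alpha = 0.1), X_cv, y_cv, cv = 5)   # R^2 score on each fold
print('fold scores:', scores)
print('mean score :', scores.mean())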

Reduce Model Complexity

  • Use simpler models with fewer parameters to avoid memorizing the training data.

Among the various techniques to prevent overfitting, the regularization method will be the primary focus of the following session.


4.3. Regularization to Reduce Overfitting

L2 regularization, also known as Ridge regularization or weight decay, involves adding a penalty equal to the sum of the squared weights to the loss function. It discourages large weights by penalizing their size, helping the model learn simpler and more robust patterns. Overfitting is often associated with very large estimated parameters $\theta$.


We want to balance

  • how well the function fits the data

  • the magnitude of the coefficients



$$ \begin{align*} \text{Total cost } = \;&\underbrace{\text{measure of fit}}_{RSS(\theta)} + \;\lambda \cdot \underbrace{\text{measure of magnitude of coefficients}}_{\lambda \cdot \lVert \theta \rVert_2^2} \\ \\ \implies &\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2 \end{align*} $$


where $ RSS(\theta) = \lVert \Phi\theta - y \rVert^2_2 $ (= Residual Sum of Squares) and $\lambda$ is a tuning parameter to be determined separately
  • the second term, $\lambda \cdot \lVert \theta \rVert_2^2$, called a shrinkage penalty, is small when $\theta_1, \cdots,\theta_d$ are close to zero, and so it has the effect of shrinking the estimates of $\theta_j$ towards zero

  • The tuning parameter $\lambda$ serves to control the relative impact of these two terms on the regression coefficient estimates

  • known as ridge regression
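
For reference, this ridge objective also admits a closed-form solution analogous to the least-squares formula used earlier (a standard result, stated here without derivation):

$$\theta^{*} = \left(\Phi^T \Phi + \lambda I\right)^{-1} \Phi^T y$$

A minimal NumPy sketch with a synthetic $\Phi$ and $y$ (illustrative only):

In [ ]:
# sketch: closed-form ridge solution  theta* = (Phi^T Phi + lambda I)^{-1} Phi^T y
import numpy as np

np.random.seed(0)
Phi_demo = np.random.randn(20, 5)      # synthetic feature matrix
y_demo = np.random.randn(20, 1)        # synthetic targets
lamb_demo = 0.1

theta_ridge = np.linalg.solve(Phi_demo.T@Phi_demo + lamb_demo*np.eye(Phi_demo.shape[1]),
                              Phi_demo.T@y_demo)
print(theta_ridge)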


How L2 Regularization Works

  • Effect on Weights:

    • L2 regularization forces the model to keep weights small.
  • Interpretation:

    • L2 regularization distributes the penalty uniformly across all weights, shrinking their magnitude.
  • Overfitting Reduction:

    • By limiting the model's ability to learn large weights, the model becomes less likely to overfit the noise in the training data.

Example:

Start with a rich representation


$$\min\; \lVert \Phi \theta - y \rVert_2^2$$
In [ ]:
# CVXPY code

import cvxpy as cvx

d = 10
u = np.linspace(-4.5, 4.5, d)

sigma = 1

A = np.hstack([np.exp(-(x-u[i])**2/(2*sigma**2)) for i in range(d)])
rbfbasis = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])

A = np.asmatrix(A)
rbfbasis = np.asmatrix(rbfbasis)

theta = cvx.Variable([d, 1])
obj = cvx.Minimize(cvx.sum_squares(A@theta - y))
prob = cvx.Problem(obj).solve()

yp = rbfbasis*theta.value

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp, yp)
plt.xlabel('X')
plt.ylabel('Y')
plt.axis([-5, 5, -12, 6])
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()

As observed in the graph above, using 10 RBF functions for 10 data points results in a perfect fit, achieving zero RSS for the training data. However, this clearly indicates overfitting.


Visualizing the learned coefficients of a regression model is a valuable step in understanding the influence of each basis.


In [ ]:
plt.figure(figsize = (6, 4))
plt.title(r'Ridge: magnitude of $\theta$', fontsize = 12)
plt.xlabel(r'$\theta$', fontsize = 12)
plt.ylabel('magnitude', fontsize = 12)
plt.stem(np.linspace(1, 10, 10).reshape(-1, 1), theta.value)
plt.xlim([0.5, 10.5])
# plt.ylim([-5, 5])
plt.grid(alpha = 0.3)
plt.show()

  • Then, regularize coefficients $\theta$

$$\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2$$
In [ ]:
# ridge regression

lamb = 0.1

theta = cvx.Variable([d, 1])
obj = cvx.Minimize(cvx.sum_squares(A@theta - y) + lamb*cvx.sum_squares(theta))
prob = cvx.Problem(obj).solve()

yp = rbfbasis*theta.value

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp, yp, label = 'Ridge')
plt.xlabel('X')
plt.ylabel('Y')
plt.axis([-5, 5, -12, 6])
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()

Plot the values of all the coefficients.


In [ ]:
# Regularization (= ridge nonlinear regression) encourages small weights, but not exactly 0

plt.figure(figsize = (6, 4))
plt.title(r'Ridge: magnitude of $\theta$', fontsize = 12)
plt.xlabel(r'$\theta$', fontsize = 12)
plt.ylabel('magnitude', fontsize = 12)
plt.stem(np.linspace(1, 10, 10).reshape(-1, 1), theta.value)
plt.xlim([0.5, 10.5])
plt.ylim([-5, 5])
plt.grid(alpha = 0.3)
plt.show()

Larger $\lambda$ forces the model to keep weights small.


In [ ]:
lamb = np.arange(0, 3, 0.01)

theta_record = []
for k in lamb:
    theta = cvx.Variable([d, 1])
    obj = cvx.Minimize(cvx.sum_squares(A@theta - y) + k*cvx.sum_squares(theta))
    prob = cvx.Problem(obj).solve()
    theta_record.append(np.ravel(theta.value))

plt.figure(figsize = (6, 4))
plt.plot(lamb, theta_record, linewidth = 1)
plt.title('Ridge coefficients as a function of regularization', fontsize = 12)
plt.xlabel('$\lambda$', fontsize = 12)
plt.ylabel(r'weight $\theta$', fontsize = 12)
plt.grid(alpha = 0.3)
plt.show()

From this example, we learn that regularization, specifically ridge regression, can be an effective method for overcoming overfitting challenges.



4.4. Sparsity for Feature Selection using LASSO

Ridge regression applies a penalty on the magnitude of the weights using the L2 norm. Why not consider using the L1 norm instead?


Try this cost instead of ridge...


$$ \begin{align*} \text{Total cost } = \;&\underbrace{\text{measure of fit}}_{RSS(\theta)} + \;\lambda \cdot \underbrace{\text{measure of magnitude of coefficients}}_{\lambda \cdot \lVert \theta \rVert_1} \\ \\ \implies &\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_1 \end{align*}$$

Consider the same problem but with a different objective function:


$$\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_1$$
In [ ]:
# LASSO regression

lamb = 2
theta = cvx.Variable([d, 1])
obj = cvx.Minimize(cvx.sum_squares(A@theta - y) + lamb*cvx.norm(theta, 1))
prob = cvx.Problem(obj).solve()

yp = rbfbasis*theta.value

plt.figure(figsize = (6, 4))
plt.title('LASSO Regularization (L1)', fontsize = 12)
plt.xlabel('X')
plt.ylabel('Y')
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp, yp)
plt.axis([-5, 5, -12, 6])
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()

By selecting an appropriate value for $ \lambda $, the above example demonstrates that the L1 norm can also effectively mitigate overfitting.


This naturally leads to the consideration of the differences between L2 and L1 regularization techniques.


In [ ]:
# Regularization (= LASSO nonlinear regression) encourages zero weights

plt.figure(figsize = (6, 4))
plt.title(r'LASSO: magnitude of $\theta$', fontsize = 12)
plt.xlabel(r'$\theta$')
plt.ylabel('magnitude')
plt.stem(np.arange(1,11), theta.value)
plt.xlim([0.5, 10.5])
plt.ylim([-6, 2])
plt.grid(alpha = 0.3)
plt.show()

  • 'Shrink' some coefficients exactly to 0

    • i.e., knock out certain features
    • the $l_1$ penalty has the effect of forcing some of the coefficient estimates to be exactly zero
  • Non-zero coefficients indicate the 'selected' features


In [ ]:
lamb = np.arange(0, 3, 0.01)

theta_record = []
for k in lamb:
    theta = cvx.Variable([d, 1])
    obj = cvx.Minimize(cvx.sum_squares(A@theta - y) + k*cvx.norm(theta, 1))
    prob = cvx.Problem(obj).solve()
    theta_record.append(np.ravel(theta.value))

plt.figure(figsize = (6, 4))
plt.plot(lamb, theta_record, linewidth = 1)
plt.title('LASSO coefficients as a function of regularization', fontsize = 12)
plt.xlabel('$\lambda$')
plt.ylabel(r'weight $\theta$')
plt.show()

Sparsity and Feature Selection:

The L1 penalty in LASSO (Least Absolute Shrinkage and Selection Operator) encourages sparsity in the coefficient estimates, meaning it drives some coefficients to be exactly zero. This property is particularly useful for feature selection, as it allows the model to identify and retain only the most relevant predictors, simplifying the model and potentially improving interpretability.


In [ ]:
# reduced order model
# we will use only theta 2, 3, 8, 10

d = 4
u = np.array([-3.5, -2.5, 2.5, 4.5])

sigma = 1

A = np.hstack([np.exp(-(x-u[i])**2/(2*sigma**2)) for i in range(d)])
rbfbasis = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])

A = np.asmatrix(A)
rbfbasis = np.asmatrix(rbfbasis)

theta = cvx.Variable([d, 1])
obj = cvx.Minimize(cvx.norm(A@theta - y, 2))
prob = cvx.Problem(obj).solve()

yp = rbfbasis*theta.value

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp, yp)
plt.xlabel('X')
plt.ylabel('Y')
plt.axis([-5, 5, -12, 6])
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()

We numerically demonstrate that the L1 penalty enforces sparsity in the coefficient estimates. But why does this occur?


To better understand this phenomenon, we can explore alternative but equivalent optimization formulations for both Ridge and LASSO regression.


$$ \begin{array}{rl} \begin{align*} \min_{\theta} \quad & \lVert \Phi \theta - y \rVert_2^2 \\ \text{subject to} \quad & \lVert \theta \rVert_{\color{red}1} \leq s_1 \end{align*} \end{array} \quad\quad\quad\quad \begin{array}{rl} \min_{\theta} \; & \lVert \Phi \theta - y \rVert_2^2 \\ \text{subject to} \; & \lVert \theta \rVert_{\color{red}2} \leq s_2 \end{array} $$




These constrained formulations reveal the underlying geometric properties of L1 and L2 regularization. The L1 constraint encourages solutions where some coefficients are exactly zero, resulting in sparse parameter estimates. In contrast, the L2 constraint tends to shrink all coefficients uniformly, rarely producing exact zeros but effectively controlling overall model complexity.
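
To connect the constrained and penalized views numerically, the minimal sketch below solves the L1-constrained form with CVXPY, rebuilding a 10-RBF design matrix from the data $x$, $y$ used above; the bound $s_1 = 4$ is an arbitrary illustrative choice.

In [ ]:
# sketch: the L1-constrained form of LASSO solved directly with CVXPY
# (rebuilds a 10-RBF design matrix from x as above; the bound s1 is an arbitrary illustrative choice)

d_c = 10
u_c = np.linspace(-4.5, 4.5, d_c)
sigma_c = 1

A_c = np.hstack([np.exp(-(x - u_c[i])**2/(2*sigma_c**2)) for i in range(d_c)])

s1 = 4
theta_c = cvx.Variable([d_c, 1])
obj_c = cvx.Minimize(cvx.sum_squares(A_c@theta_c - y))
const_c = [cvx.norm(theta_c, 1) <= s1]
cvx.Problem(obj_c, const_c).solve()

print('L1 norm of theta:', np.abs(theta_c.value).sum())
print(theta_c.value)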


4.5. Scikit-Learn

In [ ]:
# 10 data points

n = 10
x = np.linspace(-4.5, 4.5, 10).reshape(-1, 1)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512]).reshape(-1, 1)

d = 10
u = np.linspace(-4.5, 4.5, d)

sigma = 1
A = np.hstack([np.exp(-(x-u[i])**2/(2*sigma**2)) for i in range(d)])
In [ ]:
from sklearn import linear_model

reg = linear_model.Ridge(alpha = 0.1)
reg.fit(A,y)
Out[ ]:
Ridge(alpha=0.1)
In [ ]:
# to plot

xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)
Ap = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp, reg.predict(Ap), linewidth = 2, label = "Ridge")
plt.title('Ridge Regression (L2)', fontsize = 12)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
In [ ]:
reg = linear_model.Lasso(alpha = 0.005)
reg.fit(A,y)
Out[ ]:
Lasso(alpha=0.005)
In [ ]:
# to plot

xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)
Ap = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp, reg.predict(Ap), linewidth = 2, label = "LASSO")
plt.title('LASSO Regression (L1)', fontsize = 12)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()

We have covered the essential concepts of regression, including both theory and implementation. Now, we will explore additional applications of regression.


5. Example: De-noising Signal

We start with a signal represented by a vector $x \in \mathbb{R}^n$. The coefficients $x_i$ correspond to the value of some function of time, evaluated (or sampled, in the language of signal processing) at evenly spaced points. It is usually assumed that the signal does not vary too rapidly, so that typically $x_i \approx x_{i+1}$.


Suppose we have a signal $x$, which does not vary too rapidly and that $x$ is corrupted by some small, rapidly varying noise $\varepsilon$, i.e., $x_{cor} = x + \varepsilon$.

Then if we want to reconstruct $x$ from $x_{cor}$ we should solve (with $\hat{x}$ as the parameter)


$$ \text{minimize} \quad \lVert \hat{x} - x_{cor}\rVert_2^2 + \mu\sum_{i=1}^{n-1}(x_{i+1}-x_i)^2 $$

where the parameter $\mu$ controls the ''smoothness'' of $\hat{x}$.


Source:


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import cvxpy as cvx

%matplotlib inline
In [ ]:
n = 200
t = np.arange(n).reshape(-1,1)
x = 0.5 * np.sin((2*np.pi/n)*t) * (np.sin(0.01*t))
x_cor = x + 0.05*np.random.randn(n,1)

plt.figure(figsize = (6, 4))
plt.subplot(2,1,1)
plt.plot(t,x,'-', linewidth = 2)
plt.axis([0, n, -0.6, 0.6])
plt.xticks([])
plt.title('original signal' , fontsize = 15)
plt.ylabel('$x_{original}$', fontsize = 15)

plt.subplot(2,1,2)
plt.plot(t, x_cor,'-', linewidth = 1)
plt.axis([0, n, -0.6, 0.6])
plt.title('corrupted signal', fontsize = 15)
plt.xlabel('n', fontsize = 15)
plt.ylabel('$x_{corrupted}$', fontsize = 15)
plt.show()

5.1. Transform de-noising in time into an optimization problem



$$X = \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix} $$
$$\large \min\limits_{X}\;\left\{\underbrace{\lVert(X - X_{cor})\rVert^2_{2}}_{\text{how much } X \text{ deviates from }X_{cor}} + \mu \underbrace{\sum_{k = 1}^{n-1}(x_{k+1}-x_{k})^2}_{\text{penalize rapid changes of } X}\right\}$$



  • $\min\limits_{X}\;\lVert(X - X_{cor})\rVert^2_{2}$: $\quad$ How much $X$ deviates from $X_{cor}$

  • $\mu\sum\limits_{k = 1}^{n-1}(x_{k+1}-x_{k})^2$: $\quad$ penalize rapid changes of $X$

  • $\mu$ : to adjust the relative weight of 1) & 2)

  • In a vector form

$\quad$1) $X - X_{cor} = I_n X - X_{cor}$

$\quad$2) $\sum\;(x_{k+1}-x_{k})^2 \implies$


$$(x_{2} - x_{1}) - 0 = \begin{bmatrix} -1,&1,&0,&\cdots&0 \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix}-0 $$$$(x_{3} - x_{2}) - 0 = \begin{bmatrix} 0,&-1,&1,&\cdots&0 \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix}-0 $$
$$ \vdots $$
$$\implies \left \Arrowvert \; \begin{bmatrix} -1&1&0&\cdots&0&0 \\ 0&-1&1&\cdots&0&0 \\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&\cdots&-1&1 \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix} - \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \; \right\Arrowvert^2_{2} $$
$$\hspace{3.2cm}D \quad\quad\quad\quad\quad\quad\qquad X \quad\; - \; \; 0 $$




$$\begin{align*} \left\Vert I_n X-X_{cor}\big\Vert^2_{2} + \mu\big\Vert DX-0\right\Vert^2_{2} & = \big\Vert Ax-b\big\Vert^2_{2} \\ \\ & = \Bigg\Vert \begin{bmatrix} I_n\\ \sqrt{\mu}D \end{bmatrix} X- \begin{bmatrix} X_{cor}\\ 0 \end{bmatrix} \Bigg\Vert^2_{2} \end{align*}$$
$$ \text{where} \; A = \begin{bmatrix} I_n\\ \sqrt{\mu}D \end{bmatrix},\quad b = \begin{bmatrix} X_{cor}\\ 0 \end{bmatrix} $$



  • Then, plug $A$, $b$ to $(A^TA)^{-1}A^Tb$
In [ ]:
mu = 100

D = np.zeros([n-1, n])
D[:,0:n-1] -= np.eye(n-1)

D[:,1:n] += np.eye(n-1)
A = np.vstack([np.eye(n), np.sqrt(mu)*D])

b = np.vstack([x_cor, np.zeros([n-1,1])])

A = np.asmatrix(A)
b = np.asmatrix(b)

x_reconst = (A.T*A).I*A.T*b

plt.figure(figsize = (6, 2))
plt.plot(t, x_cor, '-', linewidth = 1, alpha = 0.5, label = 'corrupted');
plt.plot(t, x_reconst, 'r', linewidth = 2, label = 'reconstructed')
plt.title('$\mu = {}$'.format(mu), fontsize = 15)
plt.legend(fontsize = 12)
plt.show()

As demonstrated, the original signal can be effectively reconstructed from the corrupted one using regression techniques. While this process highlights the potential of regression for noise reduction, it is important to note that such tasks are typically achieved more effectively and efficiently using low-pass filters in either the time or frequency domain. Nevertheless, this example serves to highlight the broader applicability of regression methods, demonstrating their potential to address various practical problems.


This emphasizes the importance of formulating problems in the form of a regression model, as it can provide valuable insights and effective solutions even in contexts that may not initially appear to be related to regression models.


5.1.1. with different $\mu$'s (see how $\mu$ affects smoothing results)

In [ ]:
plt.figure(figsize = (6, 6))

mu = [0, 10, 100];

for i in range(len(mu)):
    A = np.vstack([np.eye(n), np.sqrt(mu[i])*D])
    b = np.vstack([x_cor, np.zeros([n-1,1])])

    A = np.asmatrix(A)
    b = np.asmatrix(b)

    x_reconst = (A.T*A).I*A.T*b

    plt.subplot(3,1,i+1)
    plt.plot(t, x_cor, '-', linewidth = 1, alpha = 0.5)
    plt.plot(t, x_reconst, 'r', linewidth = 2)
    plt.ylabel('$\mu = {}$'.format(mu[i]), fontsize = 15)

plt.show()

5.1.2. CVXPY Implementation

  • cvxpy toolbox to numerically solve

$$ \min\; \left\{ \lVert x-x_{cor}\rVert_2^2 + \mu \lVert Dx \rVert_2^2 \right\}$$
In [ ]:
mu = 100

x_reconst = cvx.Variable([n,1])
#obj = cvx.Minimize(cvx.sum_squares(x_reconst - x_cor) + mu*cvx.sum_squares(x_reconst[1:n]-x_reconst[0:n-1]))
obj = cvx.Minimize(cvx.sum_squares(x_reconst - x_cor) + mu*cvx.sum_squares(D@x_reconst))

prob = cvx.Problem(obj).solve()

plt.figure(figsize = (6, 2))
plt.plot(t, x_cor, '-', linewidth = 1, alpha = 0.5, label = 'corrupted');
plt.plot(t, x_reconst.value, 'r', linewidth = 2, label = 'reconstructed')
plt.title('$\mu = {}$'.format(mu), fontsize = 15)
plt.legend(fontsize = 12)
plt.show()
In [ ]:
plt.figure(figsize = (6, 6))

mu = [0, 10, 100]

for i in range(len(mu)):
    x_reconst = cvx.Variable([n,1])
    obj = cvx.Minimize(cvx.sum_squares(x_reconst - x_cor) + mu[i]*cvx.sum_squares(D @ x_reconst))
    prob = cvx.Problem(obj).solve()

    plt.subplot(3,1,i+1)
    plt.plot(t,x_cor,'-', linewidth = 1, alpha = 0.5)
    plt.plot(t,x_reconst.value, 'r', linewidth = 2)
    plt.ylabel('$\mu = {}$'.format(int(mu[i])), fontsize = 15)

plt.show()

5.1.3. L2 Norm

  • CVXPY strongly encourages eliminating quadratic forms, that is, functions like sum_squares, sum(square(.)), or quad_form

  • Whenever possible, construct equivalent models using norm instead


$$ \min \; \left\{ \lVert x-x_{cor}\rVert_2 + \gamma \lVert Dx \rVert_2 \right \}$$
In [ ]:
plt.figure(figsize = (6, 6))

gammas = [0, 2, 7]

for i in range(len(gammas)):
    x_reconst = cvx.Variable([n,1])
    obj = cvx.Minimize(cvx.norm(x_reconst-x_cor, 2) + gammas[i]*(cvx.norm(D @ x_reconst, 2)))
    prob = cvx.Problem(obj).solve()

    plt.subplot(3,1,i+1)
    plt.plot(t,x_cor,'-', linewidth = 1, alpha = 0.5)
    plt.plot(t,x_reconst.value, 'r', linewidth = 2)
    plt.ylabel('$ \gamma = {}$'.format(gammas[i]), fontsize = 15)

plt.show()

5.1.4. L2 Norm with a Constraint


$$\min \; \lVert Dx \rVert_2$$$$ s.t. \quad \lVert x-x_{cor} \rVert_2 < \beta $$
In [ ]:
beta = 0.8

x_reconst = cvx.Variable([n,1])
obj = cvx.Minimize(cvx.norm(D @ x_reconst, 2))
const = [cvx.norm(x_reconst-x_cor, 2) <= beta]
prob = cvx.Problem(obj, const).solve()

plt.figure(figsize = (6, 2))
plt.plot(t,x_cor,'-', linewidth = 1, alpha = 0.5)
plt.plot(t,x_reconst.value, 'r', linewidth = 2)
plt.ylabel(r'$\beta = {}$'.format(beta), fontsize = 15)
plt.show()
In [ ]:
plt.figure(figsize = (6, 6))

beta = [0.1, 0.6, 0.8]

for i in range(len(beta)):

    x_reconst = cvx.Variable([n,1])
    obj = cvx.Minimize(cvx.norm(x_reconst[1:n] - x_reconst[0:n-1], 2))
    const = [cvx.norm(x_reconst-x_cor, 2) <= beta[i]]
    prob = cvx.Problem(obj, const).solve()

    plt.subplot(len(beta),1,i+1)
    plt.plot(t,x_cor,'-', linewidth = 1, alpha = 0.5)
    plt.plot(t,x_reconst.value, 'r', linewidth = 2)
    plt.ylabel(r'$\beta = {}$'.format(beta[i]), fontsize = 15)

plt.show()

5.2. Signal with Sharp Transition + Noise

Suppose we have a signal $x$, which is mostly smooth, but has several rapid variations (or jumps). If we apply quadratic smoothing on this signal, then in order to remove the noise we will not be able to preserve the signal's sharp transitions.


First, apply the same method that we used for smoothing signals before


In [ ]:
n = 200
t = np.arange(n).reshape(-1,1)

exact = np.vstack([np.ones([50,1]), -np.ones([50,1]), np.ones([50,1]), -np.ones([50,1])])
x = exact + 0.5*np.sin((2*np.pi/n)*t)
x_cor = x + 0.1*np.random.randn(n,1)

plt.figure(figsize = (6, 4))
plt.subplot(2,1,1)
plt.plot(t, x)
plt.ylim([-2.0,2.0])
plt.ylabel('signal', fontsize = 15)
plt.subplot(2,1,2)
plt.plot(t, x_cor, linewidth = 1)
plt.ylabel('corrupted signal', fontsize = 15)
plt.xlabel('x', fontsize = 15)
plt.show()

5.2.1. L2 Norm (Quadratic Smoothing)

In [ ]:
plt.figure(figsize = (6, 6))

beta = [0.5, 2, 4]

for i in range(len(beta)):
    x_reconst = cvx.Variable([n,1])
    obj = cvx.Minimize(cvx.norm(x_reconst[1:n] - x_reconst[0:n-1], 2))
    const = [cvx.norm(x_reconst - x_cor, 2) <= beta[i]]
    prob = cvx.Problem(obj, const).solve()

    plt.subplot(len(beta), 1, i+1)
    plt.plot(t, x_cor, linewidth = 1, alpha = 0.5)
    plt.plot(t, x_reconst.value, 'r', linewidth = 2)
    plt.ylabel(r'$\beta = {}$'.format(beta[i]), fontsize = 15)

plt.show()

  • Quadratic smoothing (L2 norm penalty) smooths out both the noise and the sharp transitions in the signal, but this is not what we want

  • Any ideas?


5.2.2. L1 Norm

We can instead apply L1 norm on the signal by solving


$$\min \; \lVert x - x_{cor} \rVert_2 + \lambda \sum_{i=1}^{n-1} \;\lvert x_{i+1}-x_i \rvert $$

where the parameter $\lambda$ controls the ''smoothness'' of $x$.


$$\min \; \lVert Dx \rVert_1$$$$ s.t. \quad \lVert x-x_{cor} \rVert_2 < \beta $$
In [ ]:
plt.figure(figsize = (6, 6))

beta = [0.5, 2, 4]

for i in range(len(beta)):
    x_reconst = cvx.Variable([n,1])
    obj = cvx.Minimize(cvx.norm(x_reconst[1:n] - x_reconst[0:n-1], 1))
    const = [cvx.norm(x_reconst-x_cor, 2) <= beta[i]]
    prob = cvx.Problem(obj, const).solve()

    plt.subplot(len(beta), 1, i+1)
    plt.plot(t, x_cor, linewidth = 1, alpha = 0.5)
    plt.plot(t, x_reconst.value, 'r', linewidth = 2)
    plt.ylabel(r'$\beta = {}$'.format(beta[i]), fontsize = 15)

plt.show()

Consider the result presented above and think about why.


Total Variation (TV) smoothing is a robust and widely used technique in signal and image processing, designed to reduce noise while effectively preserving important structural details such as edges. Unlike traditional smoothing methods, which tend to introduce excessive blurring, TV smoothing excels at maintaining sharp transitions within the data.


One notable characteristic of TV smoothing is its ability to preserve sharp transitions in the signal, which is often desirable in practical applications. This property ensures that key features, such as edges in images or abrupt changes in signals, remain well-defined even after the noise is reduced.



6. Example: Total Variation Image Reconstruction

Here, consider Total Variation (TV) smoothing applied to image processing as the final example of regression.


In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [ ]:
import cv2

imbw = cv2.imread('/content/drive/MyDrive/ML_Colab/ML_data/dog.jpg', 0)

row = 150
col = 150
resized_imbw = cv2.resize(imbw, (row, col))

plt.figure(figsize = (6, 6))
plt.imshow(resized_imbw, 'gray')
plt.axis('off')
plt.show()

  • Question: Apply the $L_1$ norm to the image, and guess what kind of image will be produced?

$$\min \; \lVert Dx \rVert_1$$$$ s.t. \quad \lVert x-x_{cor} \rVert_2 < \beta $$
In [ ]:
n = row*col
imbws = resized_imbw.reshape(-1, 1)

beta = 1500

x = cvx.Variable([n,1])
obj = cvx.Minimize(cvx.norm(x[1:n] - x[0:n-1],1))
const = [cvx.norm(x - imbws,2) <= beta]
prob = cvx.Problem(obj, const).solve()

imbwr = x.value.reshape(row, col)

plt.figure(figsize = (6, 6))
plt.imshow(imbwr,'gray')
plt.axis('off')
plt.show()

This example demonstrates how regression principles extend beyond conventional predictive models and can be applied effectively to complex tasks such as image denoising and signal reconstruction.


In [ ]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')