Regression
1. Linear Regression
Consider a linear regression.
$\text{Given} \; \begin{cases} x_{i} \; \text{: inputs} \\ y_{i} \; \text{: outputs} \end{cases}$ , Find $\theta_{0}$ and $\theta_{1}$
$$x= \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{m} \end{bmatrix}, \qquad y= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix} \approx \hat{y}_{i} = \theta_{0} + \theta_{1}x_{i}$$
$ \hat{y}_{i} $ : predicted output
$ \theta = \begin{bmatrix} \theta_{0} \\ \theta_{1} \\ \end{bmatrix} $ : Model parameters
$$ \hat{y}_{i} = f(x_{i}\,; \theta) \; \text{ in general}$$
- In many cases, a linear model is used to predict $y_{i}$
$$ \hat{y}_{i} = \theta_{0} + \theta_{1}x_{i} \; \quad \text{ such that }\quad \min\limits_{\theta_{0}, \theta_{1}}\sum\limits_{i = 1}^{m} (\hat{y}_{i} - y_{i})^2$$
1.1. Re-cast Problem as a Least Squares
- For convenience, we define a function that maps inputs to feature vectors, $\phi$
$$\begin{array}{Icr}\begin{align*} \hat{y}_{i} & = \theta_0 + x_i \theta_1 = 1 \cdot \theta_0 + x_i \theta_1 \\ \\ & = \begin{bmatrix}1 & x_{i}\end{bmatrix}\begin{bmatrix}\theta_{0} \\ \theta_{1}\end{bmatrix} \\\\ & =\begin{bmatrix}1 \\ x_{i} \end{bmatrix}^{T}\begin{bmatrix}\theta_{0} \\ \theta_{1}\end{bmatrix} \\\\ & =\phi^{T}(x_{i})\theta \end{align*}\end{array} \begin{array}{Icr} \quad \quad \text{feature vector} \; \phi(x_{i}) = \begin{bmatrix}1 \\ x_{i}\end{bmatrix} \end{array}$$
$$\Phi = \begin{bmatrix}1 & x_{1} \\ 1 & x_{2} \\ \vdots \\1 & x_{m} \end{bmatrix}=\begin{bmatrix}\phi^T(x_{1}) \\\phi^T(x_{2}) \\\vdots \\\phi^T(x_{m}) \end{bmatrix} \quad \implies \quad \hat{y} = \begin{bmatrix}\hat{y}_{1} \\\hat{y}_{2} \\\vdots \\\hat{y}_{m}\end{bmatrix}=\Phi\theta$$
- Optimization problem
$$\min\limits_{\theta_{0}, \theta_{1}}\sum\limits_{i = 1}^{m} (\hat{y}_{i} - y_{i})^2 =\min\limits_{\theta}\lVert\Phi\theta-y\rVert^2_2 \qquad \qquad \left(\text{same as} \; \min_{x} \lVert Ax-b \rVert_2^2 \right)$$
$$ \text{solution} \; \theta^* = (\Phi^{T}\Phi)^{-1}\Phi^{T} y $$
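This closed-form solution follows from setting the gradient of the least-squares objective to zero (the normal equations); the same gradient appears again in the gradient descent section below:
$$\nabla_{\theta} \, \lVert \Phi\theta - y \rVert_2^2 = 2\,\Phi^{T}(\Phi\theta - y) = 0 \quad \implies \quad \Phi^{T}\Phi \, \theta^{*} = \Phi^{T}y$$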
Note
$$\begin{array}{Icr} \text{input} \\ x_{i} \end{array} \quad \rightarrow \quad \begin{array}{Icr} \text{feature} \\ \begin{bmatrix} 1 \\ x_{i} \end{bmatrix} \end{array} \quad \rightarrow \quad \begin{array}{Icr} \text{predicted output} \\ \hat{y}_{i} \end{array}$$
$$\begin{array}{Icr} \begin{bmatrix} 1 & x_{1} \\1 & x_{2}\\\vdots & \vdots\\1 & x_{m}\end{bmatrix}\begin{bmatrix}\theta_0\\\theta_1\end{bmatrix}=\begin{bmatrix}y_{1} \\y_{2} \\\vdots \\y_{m}\end{bmatrix} \\ \begin{array}{Icr} \uparrow \\ \vec{A}_1 \end{array} \;\; \begin{array}{Icr} \uparrow \\ \vec{A}_2 \end{array} \quad \begin{array}{Icr} \uparrow \\ \vec{x} \end{array} \quad\quad \;\; \begin{array}{Icr} \uparrow \\ \vec{B} \end{array} \end{array} \quad \begin{array}{Icr} \quad \text{over-determined or} \\ \quad \text{projection} \end{array}$$
$$A(= \Phi) = \left[ \vec{A}_1 \;\vec{A}_2 \right]$$
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# data points in column vector [input, output]
x = np.array([0.1, 0.4, 0.7, 1.2, 1.3, 1.7, 2.2, 2.8, 3.0, 4.0, 4.3, 4.4, 4.9]).reshape(-1, 1)
y = np.array([0.5, 0.9, 1.1, 1.5, 1.5, 2.0, 2.2, 2.8, 2.7, 3.0, 3.5, 3.7, 3.9]).reshape(-1, 1)
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko')
plt.title('Data', fontsize = 12)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.ylim([0, 4.5])
plt.show()
m = y.shape[0]
#A = np.hstack([np.ones([m, 1]), x])
A = np.hstack([x**0, x])
A = np.asmatrix(A)
theta = (A.T*A).I*A.T*y
print('theta:\n', theta)
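As a quick sanity check (a sketch, not part of the original workflow), the same least-squares solution can be obtained with NumPy's built-in solver np.linalg.lstsq, which avoids forming the explicit inverse:
# optional check: solve the same least-squares problem with numpy's lstsq
theta_lstsq, *_ = np.linalg.lstsq(np.asarray(A), y, rcond = None)
print('theta (lstsq):\n', theta_lstsq)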
# to plot
plt.figure(figsize = (6, 4))
plt.title('Regression', fontsize = 12)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.plot(x, y, 'ko', label = "data")
# to plot a straight line (fitted line)
xp = np.arange(0, 5, 0.01).reshape(-1, 1)
yp = theta[0,0] + theta[1,0]*xp
plt.plot(xp, yp, 'r', linewidth = 2, label = "regression")
plt.legend()
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.ylim([0, 4.5])
plt.show()
1.2.2. Use Gradient Descent
$$\min_{\theta} ~ \lVert \hat y - y \rVert_2^2 = \min_{\theta} ~ \lVert A\theta - y \rVert_2^2$$
$$ \begin{align*} f &= (A\theta-y)^T(A\theta-y) = (\theta^TA^T-y^T)(A\theta-y) \\ &= \theta^TA^TA\theta - \theta^TA^Ty - y^TA\theta + y^Ty \\\\ \nabla f &= A^TA\theta + A^TA\theta - A^Ty - A^Ty = 2(A^TA\theta - A^Ty) \end{align*} $$
$$ \theta \leftarrow \theta - \alpha \nabla f $$
theta = np.random.randn(2,1)
theta = np.asmatrix(theta)
alpha = 0.001
for _ in range(1000):
    df = 2*(A.T*A*theta - A.T*y)
    theta = theta - alpha*df

print(theta)
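To check that gradient descent has converged, one can record the objective value at every iteration and plot it; a minimal sketch (theta_gd and loss_history are illustrative names, not from the original):
# optional: re-run gradient descent while recording the objective to check convergence
theta_gd = np.asmatrix(np.random.randn(2, 1))
loss_history = []
for _ in range(1000):
    loss_history.append(np.linalg.norm(A*theta_gd - y, 2)**2)
    df = 2*(A.T*A*theta_gd - A.T*y)
    theta_gd = theta_gd - alpha*df

plt.figure(figsize = (6, 4))
plt.plot(loss_history, linewidth = 2)
plt.xlabel('iteration', fontsize = 12)
plt.ylabel('objective', fontsize = 12)
plt.grid(alpha = 0.3)
plt.show()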
# to plot
plt.figure(figsize = (6, 4))
plt.title('Regression', fontsize = 12)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.plot(x, y, 'ko', label = "data")
# to plot a straight line (fitted line)
xp = np.arange(0, 5, 0.01).reshape(-1, 1)
yp = theta[0,0] + theta[1,0]*xp
plt.plot(xp, yp, 'r', linewidth = 2, label = "regression")
plt.legend()
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()
1.2.3. Use CVXPY Optimization
$$\min_{\theta} ~ \lVert \hat y - y \rVert_2 = \min_{\theta} ~ \lVert A\theta - y \rVert_2$$
import cvxpy as cvx
theta2 = cvx.Variable([2, 1])
obj = cvx.Minimize(cvx.norm(A@theta2 - y, 2))
cvx.Problem(obj,[]).solve()
print('theta:\n', theta2.value)
By the way, do we have to use only $L_2$ norm? No.
- Let's use $L_1$ norm
$$ \min_{\theta} ~ \lVert \hat y - y \rVert_1 = \min_{\theta} ~ \lVert A\theta - y \rVert_1 $$
theta1 = cvx.Variable([2, 1])
obj = cvx.Minimize(cvx.norm(A@theta1 - y, 1))
cvx.Problem(obj).solve()
print('theta:\n', theta1.value)
# to plot data
plt.figure(figsize = (6, 4))
plt.title('$L_1$ and $L_2$ Regression', fontsize = 12)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.plot(x, y, 'ko', label = 'data')
# to plot straight lines (fitted lines)
xp = np.arange(0, 5, 0.01).reshape(-1, 1)
yp1 = theta1.value[0,0] + theta1.value[1,0]*xp
yp2 = theta2.value[0,0] + theta2.value[1,0]*xp
plt.plot(xp, yp1, 'b', linewidth = 2, label = '$L_1$')
plt.plot(xp, yp2, 'r', linewidth = 2, label = '$L_2$')
plt.legend()
plt.axis('equal')
plt.xlim([0, 5])
plt.grid(alpha = 0.3)
plt.show()
$L_1$ norm also provides a decent linear approximation.
What if outliers exist?
- Fit with different norms
- Discuss the results
- It is important to understand what makes them different.
# add outliers
x = np.vstack([x, np.array([0.5, 3.8]).reshape(-1, 1)])
y = np.vstack([y, np.array([3.9, 0.3]).reshape(-1, 1)])
A = np.hstack([x**0, x])
A = np.asmatrix(A)
A.shape
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', label = 'data')
plt.axis('equal')
plt.xlim([0, 5])
plt.grid(alpha = 0.3)
plt.show()
theta2 = cvx.Variable([2, 1])
obj2 = cvx.Minimize(cvx.norm(A@theta2 - y, 2))
prob2 = cvx.Problem(obj2).solve()
# to plot straight lines (fitted lines)
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', label = 'data')
xp = np.arange(0, 5, 0.01).reshape(-1,1)
yp2 = theta2.value[0,0] + theta2.value[1,0]*xp
plt.plot(xp, yp2, 'r', linewidth = 2, label = '$L_2$')
plt.axis('equal')
plt.xlim([0, 5])
plt.legend(loc = 5)
plt.grid(alpha = 0.3)
plt.show()
theta1 = cvx.Variable([2, 1])
obj1 = cvx.Minimize(cvx.norm(A@theta1 - y, 1))
prob1 = cvx.Problem(obj1).solve()
# to plot straight lines (fitted lines)
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', label = 'data')
xp = np.arange(0, 5, 0.01).reshape(-1,1)
yp1 = theta1.value[0,0] + theta1.value[1,0]*xp
plt.plot(xp, yp1, 'b', linewidth = 2, label = '$L_1$')
plt.axis('equal')
plt.xlim([0, 5])
plt.legend(loc = 5)
plt.grid(alpha = 0.3)
plt.show()
# to plot data
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', label = 'data')
plt.title('$L_1$ and $L_2$ Regression w/ Outliers', fontsize = 12)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
# to plot straight lines (fitted lines)
xp = np.arange(0, 5, 0.01).reshape(-1,1)
yp1 = theta1.value[0,0] + theta1.value[1,0]*xp
yp2 = theta2.value[0,0] + theta2.value[1,0]*xp
plt.plot(xp, yp1, 'b', linewidth = 2, label = '$L_1$')
plt.plot(xp, yp2, 'r', linewidth = 2, label = '$L_2$')
plt.axis('scaled')
plt.xlim([0, 5])
plt.legend(loc = 5)
plt.grid(alpha = 0.3)
plt.show()
Think about what makes them different.
1.2.4. Scikit-Learn
- Machine Learning in Python
- Simple and efficient tools for data mining and data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib
- Open source, commercially usable - BSD license
- https://scikit-learn.org/stable/index.html#
x = np.array([0.1, 0.4, 0.7, 1.2, 1.3, 1.7, 2.2, 2.8, 3.0, 4.0, 4.3, 4.4, 4.9]).reshape(-1, 1)
y = np.array([0.5, 0.9, 1.1, 1.5, 1.5, 2.0, 2.2, 2.8, 2.7, 3.0, 3.5, 3.7, 3.9]).reshape(-1, 1)
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(x, y)
reg.coef_        # slope of the fitted line (theta1)
reg.intercept_   # intercept (theta0)
# to plot
plt.figure(figsize = (6, 4))
plt.title('Regression', fontsize = 12)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.plot(x, y, 'ko', label = "data")
# to plot a straight line (fitted line)
plt.plot(xp, reg.predict(xp), 'r', linewidth = 2, label = "regression")
plt.legend(fontsize = 12)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()
2. Multivariate Linear Regression
(= linear regression for multivariate data)
$$ \hat{y} = \theta_0 + \theta_{1}x_1 + \theta_{2}x_2 $$
$$\phi \left(x^{(i)}\right) = \begin{bmatrix}1\\x^{(i)}_{1}\\x^{(i)}_{2} \end{bmatrix}$$
# for 3D plot
from mpl_toolkits.mplot3d import Axes3D
# y = theta0 + theta1*x1 + theta2*x2 + noise
n = 200
x1 = np.random.randn(n, 1)
x2 = np.random.randn(n, 1)
noise = 0.5*np.random.randn(n, 1);
y = 2 + 1*x1 + 3*x2 + noise
fig = plt.figure(figsize = (8, 6))
ax = fig.add_subplot(1, 1, 1, projection = '3d')
ax.set_title('Generated Data', fontsize = 12)
ax.set_xlabel('$X_1$', fontsize = 12)
ax.set_ylabel('$X_2$', fontsize = 12)
ax.set_zlabel('Y', fontsize = 12)
ax.scatter(x1, x2, y, marker = '.', label = 'Data')
#ax.view_init(30,30)
plt.legend(fontsize = 12)
plt.show()
$$\Phi = \begin{bmatrix}1 & x_{1}^{(1)} & x_{2}^{(1)}\\1 & x_{1}^{(2)} & x_{2}^{(2)}\\ \vdots \\1 & x_{1}^{(m)} & x_{2}^{(m)} \end{bmatrix} \quad \implies \quad \hat{y} = \begin{bmatrix}\hat{y}^{(1)} \\\hat{y}^{(2)} \\\vdots \\\hat{y}^{(m)}\end{bmatrix}=\Phi\theta$$
$$\implies \theta^{*} = (\Phi^T \Phi)^{-1} \Phi^T y$$
A = np.hstack([np.ones((n, 1)), x1, x2])
A = np.asmatrix(A)
theta = (A.T*A).I*A.T*y
X1, X2 = np.meshgrid(np.arange(np.min(x1), np.max(x1), 0.5),
                     np.arange(np.min(x2), np.max(x2), 0.5))
YP = theta[0,0] + theta[1,0]*X1 + theta[2,0]*X2
fig = plt.figure(figsize = (8, 6))
ax = fig.add_subplot(1, 1, 1, projection = '3d')
ax.set_title('Regression', fontsize = 12)
ax.set_xlabel('$X_1$', fontsize = 12)
ax.set_ylabel('$X_2$', fontsize = 12)
ax.set_zlabel('Y', fontsize = 12)
ax.scatter(x1, x2, y, marker = '.', label = 'Data')
ax.plot_wireframe(X1, X2, YP, color = 'k', alpha = 0.3, label = 'Regression Plane')
#ax.view_init(30,30)
plt.legend(fontsize = 12)
plt.show()
3. Nonlinear Regression
(= Linear Regression for Non-linear Data)
Same as linear regression, just with non-linear features
Method 1: constructing explicit feature vectors
- Polynomial features
- Radial basis function (RBF) features
Method 2: implicit feature vectors, kernel trick (optional)
3.1. Nonlinear Regression with Polynomial Features
(here, a quadratic model is used as an example)
$$ \begin{align*} y &= \theta_0 + \theta_1 x + \theta_2 x^2 + \text{noise} \end{align*} $$
$$\phi(x_{i}) = \begin{bmatrix}1\\x_{i}\\x_{i}^2 \end{bmatrix}$$
$$\Phi = \begin{bmatrix}1 & x_{1} & x_{1}^2 \\ 1 & x_{2} & x_{2}^2 \\ \vdots \\ 1 & x_{m} & x_{m}^2\end{bmatrix} \quad \implies \quad \hat{y} = \begin{bmatrix}\hat{y}_1 \\\hat{y}_2 \\\vdots \\\hat{y}_m\end{bmatrix}=\Phi\theta$$
$$\implies \theta^{*} = (\Phi^T \Phi)^{-1} \Phi^T y$$
# y = theta0 + theta1*x + theta2*x^2 + noise
n = 100
x = -5 + 15*np.random.rand(n, 1)
noise = 10*np.random.randn(n, 1)
y = 10 + 1*x + 2*x**2 + noise
plt.figure(figsize = (6, 4))
plt.title('True x and y', fontsize = 12)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.plot(x, y, 'o', markersize = 4, label = 'actual')
plt.xlim([np.min(x), np.max(x)])
plt.grid(alpha = 0.3)
plt.legend(fontsize = 12)
plt.show()
A = np.hstack([x**0, x, x**2])
A = np.asmatrix(A)
theta = (A.T*A).I*A.T*y
print('theta:\n', theta)
xp = np.linspace(np.min(x), np.max(x))
yp = theta[0,0] + theta[1,0]*xp + theta[2,0]*xp**2
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', markersize = 4, label = 'actual')
plt.plot(xp, yp, 'r', linewidth = 2, label = 'estimated')
plt.title('Nonlinear regression', fontsize = 12)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.xlim([np.min(x), np.max(x)])
plt.grid(alpha = 0.3)
plt.legend(fontsize = 12)
plt.show()
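The same quadratic fit can be reproduced with scikit-learn by generating the polynomial features explicitly; a minimal sketch using PolynomialFeatures and LinearRegression (the fitted numbers will vary from run to run since the data are random):
# optional: the same fit with scikit-learn polynomial features
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(degree = 2, include_bias = False)
X_poly = poly.fit_transform(x)            # columns: [x, x^2]

reg = LinearRegression()
reg.fit(X_poly, y)
print('theta0 (intercept):', reg.intercept_)
print('theta1, theta2 (coefficients):', reg.coef_)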
4. Overfitting
This is an important piece of code that you might want to fully understand, or even memorize.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# 10 data points
n = 10
x = np.linspace(-4.5, 4.5, 10).reshape(-1, 1)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512]).reshape(-1, 1)
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
A = np.hstack([x**0, x])
A = np.asmatrix(A)
theta = (A.T*A).I*A.T*y
print(theta)
# to plot
xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)
yp = theta[0,0] + theta[1,0]*xp
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp[:,0], yp[:,0], linewidth = 2, label = 'Linear')
plt.title('Linear Regression', fontsize = 12)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.legend(fontsize = 12)
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
A = np.hstack([x**0, x, x**2])
A = np.asmatrix(A)
theta = (A.T*A).I*A.T*y
print(theta)
# to plot
xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)
yp = theta[0,0] + theta[1,0]*xp + theta[2,0]*xp**2
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp[:,0], yp[:,0], linewidth = 2, label = '2nd degree')
plt.title('Nonlinear Regression with Polynomial Functions', fontsize = 12)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.legend(fontsize = 12)
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
A = np.hstack([x**i for i in range(10)])
A = np.asmatrix(A)
theta = (A.T*A).I*A.T*y
print(theta)
# to plot
xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)
polybasis = np.hstack([xp**i for i in range(10)])
polybasis = np.asmatrix(polybasis)
yp = polybasis*theta
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp[:,0], yp[:,0], linewidth = 2, label = '9th degree')
plt.title('Nonlinear Regression with Polynomial Functions', fontsize = 12)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.legend(fontsize = 12)
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
d = [1, 3, 5, 9]
RSS = []
plt.figure(figsize = (8, 6))
plt.suptitle('Nonlinear Regression', fontsize = 12)
for k in range(4):
    A = np.hstack([x**i for i in range(d[k]+1)])
    polybasis = np.hstack([xp**i for i in range(d[k]+1)])
    A = np.asmatrix(A)
    polybasis = np.asmatrix(polybasis)
    theta = (A.T*A).I*A.T*y
    yp = polybasis*theta
    RSS.append(np.linalg.norm(y - A*theta, 2)**2)
    plt.subplot(2, 2, k+1)
    plt.plot(x, y, 'o')
    plt.plot(xp, yp)
    plt.axis([-5, 5, -12, 6])
    plt.title('degree = {}'.format(d[k]))
    plt.grid(alpha = 0.3)
plt.show()
plt.figure(figsize = (6, 4))
plt.stem(d, RSS, label = 'RSS')
plt.title('Residual Sum of Squares', fontsize = 12)
plt.xlabel('degree', fontsize = 12)
plt.ylabel('RSS', fontsize = 12)
plt.legend(fontsize = 12)
plt.grid(alpha = 0.3)
plt.show()
4.1. Linear Basis Function Models
Construct explicit feature vectors
Consider linear combinations of fixed nonlinear functions of the input variables, of the form
$$ \begin{bmatrix} 1 & x_{1} & x_1^2\\1 & x_{2} & x_2^2\\\vdots & \vdots\\1 & x_{m} & x_m^2 \end{bmatrix} \begin{bmatrix}\theta_0\\\theta_1 \\ \theta_2 \end{bmatrix} \quad \Rightarrow \quad \begin{bmatrix} \mid & \mid & \mid \\ b_0(x) & b_1(x) & b_2(x)\\ \mid & \mid & \mid \end{bmatrix} \begin{bmatrix}\theta_0\\\theta_1 \\ \theta_2 \end{bmatrix} $$
$$ \hat{y}=\sum_{i=0}^d{\theta_i b_i(x)} = \Phi \theta$$
1) Polynomial functions
$$b_i(x) = x^i, \quad i = 0,\cdots,d$$
xp = np.arange(-1, 1, 0.01).reshape(-1, 1)
polybasis = np.hstack([xp**i for i in range(6)])
plt.figure(figsize = (6, 4))
for i in range(6):
    plt.plot(xp, polybasis[:,i], label = '$x^{}$'.format(i))
plt.title('Polynomial Basis', fontsize = 12)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.axis([-1, 1, -1.1, 1.1])
plt.grid(alpha = 0.3)
plt.legend(fontsize = 12)
plt.show()
2) RBF functions
With bandwidth $\sigma$ and $k$ RBF centers $\mu_i \in \mathbb{R}^n$
$$ b_i(x) = \exp \left( - \frac{\lVert x-\mu_i \rVert^2}{2\sigma^2}\right) $$
d = 9
u = np.linspace(-1, 1, d)
sigma = 0.2
rbfbasis = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])
plt.figure(figsize = (6, 4))
for i in range(d):
    plt.plot(xp, rbfbasis[:,i], label = r'$\mu = {}$'.format(u[i]))
plt.title('RBF basis', fontsize = 12)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.axis([-1, 1, -0.1, 1.1])
plt.legend(loc = 'lower right', fontsize = 10)
plt.grid(alpha = 0.3)
plt.show()
With many features, our prediction function becomes very expressive
Can lead to overfitting (low error on input data points, but high error nearby)
xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)
d = 10
u = np.linspace(-4.5, 4.5, d)
sigma = 0.2
A = np.hstack([np.exp(-(x-u[i])**2/(2*sigma**2)) for i in range(d)])
rbfbasis = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])
A = np.asmatrix(A)
rbfbasis = np.asmatrix(rbfbasis)
theta = (A.T*A).I*A.T*y
yp = rbfbasis*theta
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, yp, label = 'Overfitted')
plt.title('(Overfitted) Regression with RBF basis', fontsize = 12)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.grid(alpha = 0.3)
plt.legend(fontsize = 12)
plt.axis('equal')
plt.show()
d = [2, 4, 6, 10]
sigma = 1
plt.figure(figsize = (10, 8))
for k in range(4):
    u = np.linspace(-4.5, 4.5, d[k])
    A = np.hstack([np.exp(-(x-u[i])**2/(2*sigma**2)) for i in range(d[k])])
    rbfbasis = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d[k])])
    A = np.asmatrix(A)
    rbfbasis = np.asmatrix(rbfbasis)
    theta = (A.T*A).I*A.T*y
    yp = rbfbasis*theta
    plt.subplot(2, 2, k+1)
    plt.plot(x, y, 'o')
    plt.plot(xp, yp)
    plt.axis([-5, 5, -12, 6])
    plt.title('num RBFs = {}'.format(d[k]), fontsize = 10)
    plt.grid(alpha = 0.3)
plt.suptitle('Nonlinear Regression with RBF Functions', fontsize = 12)
plt.show()
4.2. Regularization (Shrinkage Methods)
Overfitting is often associated with very large estimated parameters $\theta$.
We want to balance
how well the function fits the data
the magnitude of the coefficients
$$ \begin{align*} \text{Total cost } = \;&\underbrace{\text{measure of fit}}_{RSS(\theta)} + \;\lambda \cdot \underbrace{\text{measure of magnitude of coefficients}}_{\lambda \cdot \lVert \theta \rVert_2^2} \\ \\ \implies &\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2 \end{align*} $$
where $RSS(\theta) = \lVert \Phi\theta - y \rVert^2_2$ (= Residual Sum of Squares) and $\lambda$ is a tuning parameter to be determined separately
The second term, $\lambda \cdot \lVert \theta \rVert_2^2$, called a shrinkage penalty, is small when $\theta_1, \cdots,\theta_d$ are close to zero, and so it has the effect of shrinking the estimates of $\theta_j$ towards zero
The tuning parameter $\lambda$ serves to control the relative impact of these two terms on the regression coefficient estimates
This approach is known as ridge regression
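For reference, unlike the LASSO introduced later, this ridge objective has a closed-form solution analogous to the least-squares solution above (a standard result):
$$\theta^{*} = (\Phi^{T}\Phi + \lambda I)^{-1}\Phi^{T}y$$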
- Example: start from rich representation.
$$\min\; \lVert \Phi \theta - y \rVert_2^2$$
# CVXPY code
import cvxpy as cvx
d = 10
u = np.linspace(-4.5, 4.5, d)
sigma = 1
A = np.hstack([np.exp(-(x-u[i])**2/(2*sigma**2)) for i in range(d)])
rbfbasis = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])
A = np.asmatrix(A)
rbfbasis = np.asmatrix(rbfbasis)
theta = cvx.Variable([d, 1])
obj = cvx.Minimize(cvx.sum_squares(A@theta - y))
prob = cvx.Problem(obj).solve()
yp = rbfbasis*theta.value
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, yp, label = 'Overfitted')
plt.title('(Overfitted) Regression', fontsize = 12)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.axis([-5, 5, -12, 6])
plt.legend(fontsize = 12)
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
- Start from rich representation. Then, regularize coefficients $\theta$
$$\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2$$
# ridge regression
lamb = 0.1
theta = cvx.Variable([d, 1])
obj = cvx.Minimize(cvx.sum_squares(A@theta - y) + lamb*cvx.sum_squares(theta))
prob = cvx.Problem(obj).solve()
yp = rbfbasis*theta.value
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, yp, label = 'Ridge')
plt.title('Ridge Regularization (L2)', fontsize = 12)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.axis([-5, 5, -12, 6])
plt.legend(fontsize = 12)
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
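As a quick check (a sketch, assuming the closed-form ridge solution $(\Phi^{T}\Phi + \lambda I)^{-1}\Phi^{T}y$ stated above), the CVXPY result can be compared with a direct NumPy computation:
# optional check: closed-form ridge solution, to compare with the CVXPY result
A_np = np.asarray(A)
theta_ridge = np.linalg.solve(A_np.T @ A_np + lamb*np.eye(d), A_np.T @ y)
print(theta_ridge)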
# Regularization (= ridge nonlinear regression) encourages small weights, but not exactly 0
plt.figure(figsize = (6, 4))
plt.title(r'Ridge: magnitude of $\theta$', fontsize = 12)
plt.xlabel(r'$\theta$', fontsize = 12)
plt.ylabel('magnitude', fontsize = 12)
plt.stem(np.linspace(1, 10, 10).reshape(-1, 1), theta.value)
plt.xlim([0.5, 10.5])
plt.ylim([-5, 5])
plt.grid(alpha = 0.3)
plt.show()
lamb = np.arange(0,3,0.01)
theta_record = []
for k in lamb:
    theta = cvx.Variable([d, 1])
    obj = cvx.Minimize(cvx.sum_squares(A@theta - y) + k*cvx.sum_squares(theta))
    prob = cvx.Problem(obj).solve()
    theta_record.append(np.ravel(theta.value))
plt.figure(figsize = (6, 4))
plt.plot(lamb, theta_record, linewidth = 1)
plt.title('Ridge coefficients as a function of regularization', fontsize = 12)
plt.xlabel('$\lambda$', fontsize = 12)
plt.ylabel(r'weight $\theta$', fontsize = 12)
plt.show()
4.3. Sparsity for Feature Selection using LASSO
Least Squares with a penalty on the $l_1$ norm of the parameters
Start with full model (all possible features)
'Shrink' some coefficients exactly to 0
- i.e., knock out certain features
- the $l_1$ penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero
Non-zero coefficients indicate 'selected' features
Try this cost instead of ridge...
$$
\begin{align*}
\text{Total cost } = \;&\underbrace{\text{measure of fit}}_{RSS(\theta)} + \;\lambda \cdot \underbrace{\text{measure of magnitude of coefficients}}_{\lambda \cdot \lVert \theta \rVert_1} \\ \\
\implies &\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_1
\end{align*}$$
- $\lambda$ is a tuning parameter that balances fit and sparsity
- Other equivalent forms of the optimization
$$ \begin{array}{rl} \begin{align*} \min_{\theta} \quad & \lVert \Phi \theta - y \rVert_2^2 \\ \text{subject to} \quad & \lVert \theta \rVert_1 \leq s_1 \end{align*} \end{array} \quad\quad\quad\quad \begin{array}{rl} \min_{\theta} \; & \lVert \Phi \theta - y \rVert_2^2 \\ \text{subject to} \; & \lVert \theta \rVert_2 \leq s_2 \end{array} $$
$$\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_1$$
# LASSO regression
lamb = 2
theta = cvx.Variable([d, 1])
obj = cvx.Minimize(cvx.sum_squares(A@theta - y) + lamb*cvx.norm(theta, 1))
prob = cvx.Problem(obj).solve()
yp = rbfbasis*theta.value
plt.figure(figsize = (6, 4))
plt.title('LASSO Regularization (L1)', fontsize = 12)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, yp, label = 'LASSO')
plt.axis([-5, 5, -12, 6])
plt.legend(fontsize = 12)
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
# Regularization (= LASSO nonlinear regression) encourages zero weights
plt.figure(figsize = (6, 4))
plt.title(r'LASSO: magnitude of $\theta$', fontsize = 12)
plt.xlabel(r'$\theta$', fontsize = 12)
plt.ylabel('magnitude', fontsize = 12)
plt.stem(np.arange(1,11), theta.value)
plt.xlim([0.5, 10.5])
plt.ylim([-5,1])
plt.grid(alpha = 0.3)
plt.show()
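To make the sparsity explicit, one can count how many LASSO coefficients end up numerically at zero; a minimal sketch (the 1e-4 threshold is an arbitrary choice, not from the original):
# optional: count the (near-)zero LASSO coefficients
n_zero = np.sum(np.abs(theta.value) < 1e-4)
print('{} of {} coefficients are (near) zero'.format(n_zero, d))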
lamb = np.arange(0,3,0.01)
theta_record = []
for k in lamb:
    theta = cvx.Variable([d, 1])
    obj = cvx.Minimize(cvx.sum_squares(A@theta - y) + k*cvx.norm(theta, 1))
    prob = cvx.Problem(obj).solve()
    theta_record.append(np.ravel(theta.value))
plt.figure(figsize = (6, 4))
plt.plot(lamb, theta_record, linewidth = 1)
plt.title('LASSO coefficients as a function of regularization', fontsize = 12)
plt.xlabel('$\lambda$', fontsize = 12)
plt.ylabel(r'weight $\theta$', fontsize = 12)
plt.show()
# reduced-order model
# we will use only theta 2, 3, 8, 10 (i.e., the RBF centers at u = -3.5, -2.5, 2.5, 4.5)
d = 4
u = np.array([-3.5, -2.5, 2.5, 4.5])
sigma = 1
A = np.hstack([np.exp(-(x-u[i])**2/(2*sigma**2)) for i in range(d)])
rbfbasis = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])
A = np.asmatrix(A)
rbfbasis = np.asmatrix(rbfbasis)
theta = cvx.Variable([d, 1])
obj = cvx.Minimize(cvx.norm(A@theta - y, 2))
prob = cvx.Problem(obj).solve()
yp = rbfbasis*theta.value
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, yp, label = 'Overfitted')
plt.title('(Overfitted) Regression', fontsize = 12)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.axis([-5, 5, -12, 6])
plt.legend(fontsize = 12)
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
# 1-D illustration: on this range, the fit term (x-1)^2 + 1/6*(x-1)^3 has its minimum near x = 1,
# and the ridge-type penalty k*x^2 pulls the minimizer x_star toward zero
x = np.arange(-4,4,0.1)
k = 4
y = (x-1)**2 + 1/6*(x-1)**3 + k*x**2
x_star = x[np.argmin(y)]
print(x_star)
plt.plot(x, y, 'g', linewidth = 2.5)
plt.axvline(x = x_star, color = 'k', linewidth = 1, linestyle = '--')
plt.ylim([0,10])
plt.show()
for k in [0,1,2,4]:
    y = (x-1)**2 + 1/6*(x-1)**3 + k*x**2
    x_star = x[np.argmin(y)]
    plt.plot(x, y, 'g', linewidth = 2.5)
    plt.axvline(x = x_star, color = 'k', linewidth = 1, linestyle = '--')
    plt.ylim([0,10])
    plt.title('Ridge: k = {}'.format(k))
    plt.show()
# the LASSO-type penalty k*|x| is not differentiable at 0 and can push x_star exactly to zero
x = np.arange(-4,4,0.1)
k = 2
y = (x-1)**2 + 1/6*(x-1)**3 + k*abs(x)
x_star = x[np.argmin(y)]
print(x_star)
plt.plot(x, y, 'g', linewidth = 2.5)
plt.axvline(x = x_star, color = 'k', linewidth = 1, linestyle = '--')
plt.ylim([0,10])
plt.show()
for k in [0,1,2]:
    y = (x-1)**2 + 1/6*(x-1)**3 + k*abs(x)
    x_star = x[np.argmin(y)]
    plt.plot(x, y, 'g', linewidth = 2.5)
    plt.axvline(x = x_star, color = 'k', linewidth = 1, linestyle = '--')
    plt.ylim([0,10])
    plt.title('LASSO: k = {}'.format(k))
    plt.show()
4.5. Scikit-Learn
- RBF functions $b_i(x)$ as basis
$$ \begin{align*} \Phi &= \begin{bmatrix} \mid & \mid & & \mid \\ b_0(x) & b_1(x) & \cdots & b_d(x) \\ \mid & \mid & & \mid \end{bmatrix} = \begin{bmatrix} 1 & \mid & & \mid \\ \vdots & b_1(x) & \cdots & b_d(x) \\ 1 & \mid & & \mid \end{bmatrix}\\\\ \implies \Phi \theta &= \sum_{i=0}^d \theta_i b_i(x) : \; \text{a linearly-combined function of RBF basis with coefficients of } \theta \end{align*} $$
# 10 data points
n = 10
x = np.linspace(-4.5, 4.5, 10).reshape(-1, 1)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512]).reshape(-1, 1)
d = 10
u = np.linspace(-4.5, 4.5, d)
sigma = 1
A = np.hstack([np.exp(-(x-u[i])**2/(2*sigma**2)) for i in range(d)])
from sklearn import linear_model
reg = linear_model.Ridge(alpha = 0.1)
reg.fit(A,y)
# to plot
xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)
Ap = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = "data")
plt.plot(xp, reg.predict(Ap), linewidth = 2, label = "Ridge")
plt.title('Ridge Regression (L2)', fontsize = 12)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.legend(fontsize = 12)
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
reg = linear_model.Lasso(alpha = 0.005)
reg.fit(A,y)
# to plot
xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)
Ap = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = "data")
plt.plot(xp, reg.predict(Ap), linewidth = 2, label = "LASSO")
plt.title('LASSO Regression (L1)', fontsize = 12)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.legend(fontsize = 12)
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
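It can also be instructive to print the fitted scikit-learn LASSO coefficients directly; with a large enough alpha, some entries are driven exactly to zero (how many depends on alpha, so treat this as a sketch):
# optional: inspect the LASSO coefficients; some entries may be exactly zero
print('intercept:', reg.intercept_)
print('coefficients:\n', reg.coef_)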
5. De-noising Signal
We start with a signal represented by a vector $x \in \mathbb{R}^n$. The coefficients $x_i$ correspond to the value of some function of time, evaluated (or sampled, in the language of signal processing) at evenly spaced points. It is usually assumed that the signal does not vary too rapidly, which means that usually, we have $x_i \approx x_{i+1}$.
Suppose we have a signal $x$, which does not vary too rapidly and that $x$ is corrupted by some small, rapidly varying noise $\varepsilon$, i.e. $x_{cor} = x + \varepsilon$.
Then, if we want to reconstruct $x$ from $x_{cor}$, we should solve (with $\hat{x}$ as the variable)
$$ \text{minimize} \quad \lVert \hat{x} - x_{cor}\rVert_2^2 + \mu\sum_{i=1}^{n-1}(\hat{x}_{i+1}-\hat{x}_i)^2 $$
where the parameter $\mu$ controls the ''smoothness'' of $\hat{x}$.
Source:
- Boyd & Vandenberghe's book "Convex Optimization"
- http://cvxr.com/cvx/examples/ (Figures 6.8-6.10: Quadratic smoothing)
- Week 4 of Linear and Integer Programming (Coursera, Univ. of Colorado)
import numpy as np
import matplotlib.pyplot as plt
import cvxpy as cvx
%matplotlib inline
n = 200
t = np.arange(n).reshape(-1,1)
x = 0.5 * np.sin((2*np.pi/n)*t) * (np.sin(0.01*t))
x_cor = x + 0.05*np.random.randn(n,1)
plt.figure(figsize = (6, 4))
plt.subplot(2,1,1)
plt.plot(t,x,'-', linewidth = 2)
plt.axis([0, n, -0.6, 0.6])
plt.xticks([])
plt.title('original signal' , fontsize = 15)
plt.ylabel('$x_{original}$', fontsize = 15)
plt.subplot(2,1,2)
plt.plot(t, x_cor,'-', linewidth = 1)
plt.axis([0, n, -0.6, 0.6])
plt.title('corrupted signal', fontsize = 15)
plt.xlabel('n', fontsize = 15)
plt.ylabel('$x_{corrupted}$', fontsize = 15)
plt.show()
5.1. Transform De-noising in Time into an Optimization Problem
$$X = \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix} $$
$$\large \min\limits_{X}\;\left\{\underbrace{\lVert(X - X_{cor})\rVert^2_{2}}_{\text{how much } X \text{ deviates from }X_{cor}} + \mu \underbrace{\sum_{k = 1}^{n-1}(x_{k+1}-x_{k})^2}_{\text{penalize rapid changes of } X}\right\}$$
$\quad$1) $\min\limits_{X}\;\lVert(X - X_{cor})\rVert^2_{2}$: $\quad$ How much $X$ deviates from $X_{cor}$
$\quad$2) $\mu\sum\limits_{k = 1}^{n-1}(x_{k+1}-x_{k})^2$: $\quad$ penalize rapid changes of $X$
$\quad$3) $\mu$ : to adjust the relative weight of 1) & 2)
- In a vector form
$\quad$1) $X - X_{cor} = I_n X - X_{cor}$
$\quad$2) $\sum\;(x_{k+1}-x_{k})^2 \implies$
$$(x_{2} - x_{1}) - 0 = \begin{bmatrix} -1,&1,&0,&\cdots&0 \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix}-0 $$
$$(x_{3} - x_{2}) - 0 = \begin{bmatrix} 0,&-1,&1,&\cdots&0 \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix}-0 $$
$$ \vdots $$
$$\implies \left \Arrowvert \; \begin{bmatrix} -1&1&0&\cdots&0&0 \\ 0&-1&1&\cdots&0&0 \\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&\cdots&-1&1 \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix} - \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \; \right\Arrowvert^2_{2} $$
$$\hspace{3.2cm}D \quad\quad\quad\quad\quad\quad\qquad X \quad\; - \; \; 0 $$
$$\begin{align*} \left\Vert I_n X-X_{cor}\big\Vert^2_{2} + \mu\big\Vert DX-0\right\Vert^2_{2} & = \big\Vert Ax-b\big\Vert^2_{2} \\ \\ & = \Bigg\Vert \begin{bmatrix} I_n\\ \sqrt{\mu}D \end{bmatrix} X- \begin{bmatrix} X_{cor}\\ 0 \end{bmatrix} \Bigg\Vert^2_{2} \end{align*}$$
$$ \text{where} \; A = \begin{bmatrix} I_n\\ \sqrt{\mu}D \end{bmatrix},\quad b = \begin{bmatrix} X_{cor}\\ 0 \end{bmatrix} $$
- Then, plug $A$, $b$ to $(A^TA)^{-1}A^Tb$
mu = 100
D = np.zeros([n-1, n])
D[:,0:n-1] -= np.eye(n-1)
D[:,1:n] += np.eye(n-1)
A = np.vstack([np.eye(n), np.sqrt(mu)*D])
b = np.vstack([x_cor, np.zeros([n-1,1])])
A = np.asmatrix(A)
b = np.asmatrix(b)
x_reconst = (A.T*A).I*A.T*b
plt.figure(figsize = (6, 2))
plt.plot(t, x_cor, '-', linewidth = 1, alpha = 0.5, label = 'corrupted');
plt.plot(t, x_reconst, 'r', linewidth = 2, label = 'reconstructed')
plt.title('$\mu = {}$'.format(mu), fontsize = 15)
plt.legend(fontsize = 12)
plt.show()
5.1.1. With different $\mu$'s (see how $\mu$ affects smoothing results)
plt.figure(figsize = (6, 6))
mu = [0, 10, 100];
for i in range(len(mu)):
    A = np.vstack([np.eye(n), np.sqrt(mu[i])*D])
    b = np.vstack([x_cor, np.zeros([n-1,1])])
    A = np.asmatrix(A)
    b = np.asmatrix(b)
    x_reconst = (A.T*A).I*A.T*b
    plt.subplot(3,1,i+1)
    plt.plot(t, x_cor, '-', linewidth = 1, alpha = 0.5)
    plt.plot(t, x_reconst, 'r', linewidth = 2)
    plt.ylabel(r'$\mu = {}$'.format(mu[i]), fontsize = 15)
plt.show()
5.1.2. CVXPY Implementation
CVXPY: a toolbox to numerically solve
$$ \min\; \left\{ \lVert x-x_{cor}\rVert_2^2 + \mu \lVert Dx \rVert_2^2 \right\}$$
mu = 100
x_reconst = cvx.Variable([n,1])
#obj = cvx.Minimize(cvx.sum_squares(x_reconst - x_cor) + mu*cvx.sum_squares(x_reconst[1:n]-x_reconst[0:n-1]))
obj = cvx.Minimize(cvx.sum_squares(x_reconst - x_cor) + mu*cvx.sum_squares(D@x_reconst))
prob = cvx.Problem(obj).solve()
plt.figure(figsize = (6, 2))
plt.plot(t, x_cor, '-', linewidth = 1, alpha = 0.5, label = 'corrupted');
plt.plot(t, x_reconst.value, 'r', linewidth = 2, label = 'reconstructed')
plt.title('$\mu = {}$'.format(mu), fontsize = 15)
plt.legend(fontsize = 12)
plt.show()
plt.figure(figsize = (6, 6))
mu = [0, 10, 100]
for i in range(len(mu)):
    x_reconst = cvx.Variable([n,1])
    obj = cvx.Minimize(cvx.sum_squares(x_reconst - x_cor) + mu[i]*cvx.sum_squares(D @ x_reconst))
    prob = cvx.Problem(obj).solve()
    plt.subplot(3,1,i+1)
    plt.plot(t, x_cor, '-', linewidth = 1, alpha = 0.5)
    plt.plot(t, x_reconst.value, 'r', linewidth = 2)
    plt.ylabel(r'$\mu = {}$'.format(int(mu[i])), fontsize = 15)
plt.show()
5.1.3. L2 Norm
CVXPY strongly encourages eliminating quadratic forms, that is, functions like sum_squares, sum(square(.)), or quad_form, whenever it is possible to construct equivalent models using norm instead.
$$ \min \; \left\{ \lVert x-x_{cor}\rVert_2 + \gamma \lVert Dx \rVert_2 \right \}$$
plt.figure(figsize = (6, 6))
gammas = [0, 2, 7]
for i in range(len(gammas)):
    x_reconst = cvx.Variable([n,1])
    obj = cvx.Minimize(cvx.norm(x_reconst - x_cor, 2) + gammas[i]*(cvx.norm(D @ x_reconst, 2)))
    prob = cvx.Problem(obj).solve()
    plt.subplot(3,1,i+1)
    plt.plot(t, x_cor, '-', linewidth = 1, alpha = 0.5)
    plt.plot(t, x_reconst.value, 'r', linewidth = 2)
    plt.ylabel(r'$\gamma = {}$'.format(gammas[i]), fontsize = 15)
plt.show()
5.1.4. L2 Norm with a Constraint
$$\min \; \lVert Dx \rVert_2$$
$$ s.t. \quad \lVert x-x_{cor} \rVert_2 < \beta $$
beta = 0.8
x_reconst = cvx.Variable([n,1])
obj = cvx.Minimize(cvx.norm(D @ x_reconst, 2))
const = [cvx.norm(x_reconst-x_cor, 2) <= beta]
prob = cvx.Problem(obj, const).solve()
plt.figure(figsize = (6, 2))
plt.plot(t,x_cor,'-', linewidth = 1, alpha = 0.5)
plt.plot(t,x_reconst.value, 'r', linewidth = 2)
plt.ylabel(r'$\beta = {}$'.format(beta), fontsize = 15)
plt.show()
plt.figure(figsize = (6, 6))
beta = [0.1, 0.6, 0.8]
for i in range(len(beta)):
    x_reconst = cvx.Variable([n,1])
    obj = cvx.Minimize(cvx.norm(x_reconst[1:n] - x_reconst[0:n-1], 2))
    const = [cvx.norm(x_reconst - x_cor, 2) <= beta[i]]
    prob = cvx.Problem(obj, const).solve()
    plt.subplot(len(beta),1,i+1)
    plt.plot(t, x_cor, '-', linewidth = 1, alpha = 0.5)
    plt.plot(t, x_reconst.value, 'r', linewidth = 2)
    plt.ylabel(r'$\beta = {}$'.format(beta[i]), fontsize = 15)
plt.show()
5.2. Signal with Sharp Transition + Noise
Suppose we have a signal $x$, which is mostly smooth, but has several rapid variations (or jumps). If we apply quadratic smoothing on this signal, then in order to remove the noise we will not be able to preserve the signal's sharp transitions.
- First, apply the same method that we used for smoothing signals before
n = 200
t = np.arange(n).reshape(-1,1)
exact = np.vstack([np.ones([50,1]), -np.ones([50,1]), np.ones([50,1]), -np.ones([50,1])])
x = exact + 0.5*np.sin((2*np.pi/n)*t)
x_cor = x + 0.1*np.random.randn(n,1)
plt.figure(figsize = (6, 4))
plt.subplot(2,1,1)
plt.plot(t, x)
plt.ylim([-2.0,2.0])
plt.ylabel('signal', fontsize = 15)
plt.subplot(2,1,2)
plt.plot(t, x_cor, linewidth = 1)
plt.ylabel('corrupted signal', fontsize = 15)
plt.xlabel('x', fontsize = 15)
plt.show()
5.2.1. L2 Norm (Quadratic Smoothing)
plt.figure(figsize = (6, 6))
beta = [0.5, 2, 4]
for i in range(len(beta)):
    x_reconst = cvx.Variable([n,1])
    obj = cvx.Minimize(cvx.norm(x_reconst[1:n] - x_reconst[0:n-1], 2))
    const = [cvx.norm(x_reconst - x_cor, 2) <= beta[i]]
    prob = cvx.Problem(obj, const).solve()
    plt.subplot(len(beta), 1, i+1)
    plt.plot(t, x_cor, linewidth = 1, alpha = 0.5)
    plt.plot(t, x_reconst.value, 'r', linewidth = 2)
    plt.ylabel(r'$\beta = {}$'.format(beta[i]), fontsize = 15)
plt.show()
Quadratic smoothing smooths out the noise, but it also smooths out the sharp transitions in the signal, which is not what we want
Any ideas?
5.2.2. L1 Norm
We can instead apply total variation reconstruction on the signal by solving
$$\min \; \lVert x - x_{cor} \rVert_2 + \lambda \sum_{i=1}^{n-1} \;\lvert x_{i+1}-x_i \rvert $$
where the parameter $\lambda$ controls the ''smoothness'' of $x$.
$$\min \; \lVert Dx \rVert_1$$
$$ s.t. \quad \lVert x-x_{cor} \rVert_2 < \beta $$
plt.figure(figsize = (6, 6))
beta = [0.5, 2, 4]
for i in range(len(beta)):
    x_reconst = cvx.Variable([n,1])
    obj = cvx.Minimize(cvx.norm(x_reconst[1:n] - x_reconst[0:n-1], 1))
    const = [cvx.norm(x_reconst - x_cor, 2) <= beta[i]]
    prob = cvx.Problem(obj, const).solve()
    plt.subplot(len(beta), 1, i+1)
    plt.plot(t, x_cor, linewidth = 1, alpha = 0.5)
    plt.plot(t, x_reconst.value, 'r', linewidth = 2)
    plt.ylabel(r'$\beta = {}$'.format(beta[i]), fontsize = 15)
plt.show()
Total Variation (TV) smoothing preserves the sharp transitions in the signal, which is what we want
Note how TV reconstruction does a better job of preserving the sharp transitions in the signal while removing the noise.
from google.colab import drive
drive.mount('/content/drive')
import cv2
imbw = cv2.imread('/content/drive/MyDrive/ML_Colab/ML_data/dog.jpg', 0)
row = 150
col = 150
resized_imbw = cv2.resize(imbw, (row, col))
plt.figure(figsize = (6, 6))
plt.imshow(resized_imbw, 'gray')
plt.axis('off')
plt.show()
- Question: apply the $L_1$ norm to the image, and guess what kind of image will be produced
$$\min \; \lVert Dx \rVert_1$$
$$ s.t. \quad \lVert x-x_{cor} \rVert_2 < \beta $$
n = row*col
imbws = resized_imbw.reshape(-1, 1)   # flatten the image into a 1-D signal (column vector)
beta = 1500

# treat the flattened image as a 1-D signal and apply total variation smoothing
x = cvx.Variable([n,1])
obj = cvx.Minimize(cvx.norm(x[1:n] - x[0:n-1], 1))
const = [cvx.norm(x - imbws, 2) <= beta]
prob = cvx.Problem(obj, const).solve()
imbwr = x.value.reshape(row, col)
plt.figure(figsize = (6, 6))
plt.imshow(imbwr,'gray')
plt.axis('off')
plt.show()