Regression
1. Linear Regression¶
Regression is a fundamental concept in machine learning, playing a crucial role in understanding and predicting continuous outcomes based on input features. Unlike classification, which assigns data points to discrete categories, regression models aim to establish a relationship between independent variables (features) and a continuous dependent variable (target).
At its core, regression seeks to capture patterns in data and predict numerical values by minimizing the error between the predicted values and the actual values. This makes regression indispensable for a wide range of applications.
from IPython.display import YouTubeVideo
YouTubeVideo('IRrGVQV8vZ8', width = "560", height = "315")
1.1. Linear Regression: Problem Setup¶
Let's begin by considering a simple linear regression problem, where the goal is to predict a continuous output based on a single input feature.
Given a dataset of input-output pairs:
$$\; \begin{cases}x_{i} \; \text{: inputs} \\y_{i} \; \text{: outputs}\end{cases} \qquad \text{for} \;i = 1, 2, \cdots, m$$
We can organize the inputs and outputs as column vectors:
$$x= \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{m} \end{bmatrix}, \qquad y= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix}$$
Prediction Model
Let $\hat{y}_i$ be the predicted output for input $x_i$. In general, we assume there's a function $f(x_i; \theta)$ that maps the input to the predicted output, parameterized by $\theta$:
$$ \hat{y}_{i} = f(x_{i}\,; \theta) $$
For linear regression, we use a linear model to approximate the relationship between $x_i$ and $y_i$:
$$y_i \approx \hat{y}_{i} = \theta_{0} + \theta_{1}x_{i}$$
Here,
- $\theta_0$ is the intercept (bias term),
- $\theta_1$ is the slope (weight) of the input, and
- $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}$ are the model parameters we want to learn.
Objective: Minimizing Prediction Error
To find the best values of $\theta_0$ and $\theta_1$, we minimize the sum of squared errors between the predicted values $\hat{y}_i$ and the actual values $y_i$:
$$ \min\limits_{\theta_{0}, \theta_{1}}\sum\limits_{i = 1}^{m} \left(\hat{y}_{i} - y_{i} \right)^2$$
This is known as the least squares criterion and is a standard method for fitting linear models.
Vector Form of the Model
We can express the model for all data points at once using vectors and matrix operations:
$$ y= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix} \approx \begin{bmatrix} \hat y_{1} \\ \hat y_{2} \\ \vdots \\ \hat y_{m} \end{bmatrix} = \theta_{0} \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} + \theta_{1} \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{m} \end{bmatrix}$$
1.2. Re-casting the Problem as Least Squares¶
For convenience, we define a function $\phi$ that maps an input to its feature vector:
$$\begin{array}{Icr}\begin{align*} \hat{y}_{i} & = \theta_0 + x_i \theta_1 = 1 \cdot \theta_0 + x_i \theta_1 \\ \\ & = \begin{bmatrix}1 & x_{i}\end{bmatrix}\begin{bmatrix}\theta_{0} \\ \theta_{1}\end{bmatrix} \\\\ & =\begin{bmatrix}1 \\ x_{i} \end{bmatrix}^{T}\begin{bmatrix}\theta_{0} \\ \theta_{1}\end{bmatrix} \\\\ & =\phi^{T}(x_{i})\theta \end{align*}\end{array} \begin{array}{Icr} \quad \quad \text{feature vector} \; \phi(x_{i}) = \begin{bmatrix}1 \\ x_{i}\end{bmatrix} \end{array}$$
We can then construct the matrix $\Phi$ by stacking the transposed feature vectors for all $m$ data points:
$$\Phi = \begin{bmatrix}1 & x_{1} \\ 1 & x_{2} \\ \vdots & \vdots \\1 & x_{m} \end{bmatrix}=\begin{bmatrix}\phi^T(x_{1}) \\\phi^T(x_{2}) \\\vdots \\\phi^T(x_{m}) \end{bmatrix} \quad \implies \quad \hat{y} = \begin{bmatrix}\hat{y}_{1} \\\hat{y}_{2} \\\vdots \\\hat{y}_{m}\end{bmatrix}=\Phi\theta$$
Optimization Problem (or Least Squares)
We aim to minimize the total squared error:
$$\min\limits_{\theta_{0}, \theta_{1}}\sum\limits_{i = 1}^{m} (\hat{y}_{i} - y_{i})^2 =\color{red}{\min\limits_{\theta}\lVert\Phi\theta-y\rVert^2_2} \qquad \qquad \left(\text{same as} \; \min_{x} \lVert Ax-b \rVert_2^2 \right)$$
The solution to this least squares problem is:
$$ \theta^* = \left(\Phi^T\Phi\right)^{-1}\Phi^T y \qquad \left(\text{in the generic form, } x^* = (A^TA)^{-1}A^T b \right)$$
Why? (here, $A_i$ means the $i$th row of matrix $A$)
$$\sum_{i=1}^{m} \lvert \underbrace{ A_i x - b_i}_{\text{scalar}} \rvert^2 = \left\| \begin{bmatrix} A_1 x - b_1 \\ \vdots \\ A_m x - b_m \end{bmatrix} \right\|^2 = \left\| \begin{bmatrix} A_1 \\ \vdots \\ A_m \end{bmatrix} x - \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix} \right\|^2 = \lVert \underbrace{Ax-b}_{\text{vector}} \rVert_2^2$$
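Setting the gradient of this quadratic objective to zero yields the normal equations, which is where the closed-form solution above comes from (a standard derivation, restated here for completeness; the same gradient reappears in the gradient descent section below):
$$\nabla_x \lVert Ax-b \rVert_2^2 = 2A^T(Ax - b) = 0 \quad \implies \quad A^TA\,x^* = A^Tb \quad \implies \quad x^* = (A^TA)^{-1}A^Tb,$$
assuming $A^TA$ is invertible (i.e., $A$ has full column rank).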
Note
From Input to Prediction (Feature Mapping + Linear Model)
$$\begin{array}{Icr} \text{input} \\ x_{i} \end{array} \quad \rightarrow \quad \begin{array}{Icr} \text{feature vector} \\ \begin{bmatrix} 1 \\ x_{i} \end{bmatrix} \end{array} \quad \rightarrow \quad \begin{array}{Icr} \text{predicted output} \\ \hat{y}_{i} \end{array}$$
Matrix Form of the Linear System
We collect all feature vectors into a design matrix and model the predictions as:
$$ \underbrace{ \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix} }_{A = \Phi} \underbrace{ \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix} }_{\theta} = \underbrace{ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} }_{y} $$
We can read off the columns of $A$ as feature (column) vectors:
$$ \begin{array}{cccccc} \vec{A}_1 = \begin{bmatrix}1 \\ 1 \\ \vdots \\ 1\end{bmatrix}, & \quad \vec{A}_2 = \begin{bmatrix}x_1 \\ x_2 \\ \vdots \\ x_m\end{bmatrix} , &\quad \quad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}, & \quad \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} \end{array} $$
This system is typically over-determined (more equations than unknowns), so we seek the best approximate solution via least squares projection.
$$A(= \Phi) = \begin{bmatrix} \vec{A}_1 & \vec{A}_2 \end{bmatrix}$$
1.3. Solving the Linear Regression Problem¶
1.3.1. Solving with Least Squares (Normal Equation)¶
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# data points in column vector [input, output]
x = np.array([0.1, 0.4, 0.7, 1.2, 1.3, 1.7, 2.2, 2.8, 3.0, 4.0, 4.3, 4.4, 4.9]).reshape(-1, 1)
y = np.array([0.5, 0.9, 1.1, 1.5, 1.5, 2.0, 2.2, 2.8, 2.7, 3.0, 3.5, 3.7, 3.9]).reshape(-1, 1)
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', alpha = 0.3)
plt.xlabel('X')
plt.ylabel('Y')
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
m = y.shape[0]
#A = np.hstack([np.ones([m, 1]), x])
A = np.hstack([x**0, x])
A = np.asmatrix(A)
theta = (A.T*A).I*A.T*y
print('theta:\n', theta)
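As a side note (not part of the original derivation), explicitly forming $(A^TA)^{-1}$ can be numerically fragile when $A$ is ill-conditioned; np.linalg.lstsq solves the same least squares problem more robustly and should give essentially the same $\theta$:
# numerically stabler alternative to forming (A^T A)^{-1} explicitly:
# np.linalg.lstsq solves the same problem  min ||A theta - y||_2^2
theta_lstsq, _, _, _ = np.linalg.lstsq(np.asarray(A), np.asarray(y), rcond = None)
print('theta (np.linalg.lstsq):\n', theta_lstsq)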
# to plot
plt.figure(figsize = (6, 4))
plt.xlabel('X')
plt.ylabel('Y')
plt.plot(x, y, 'ko', label = "data", alpha = 0.3)
# to plot a straight line (fitted line)
xp = np.arange(0, 5, 0.01).reshape(-1, 1)
yp = theta[0,0] + theta[1,0]*xp
plt.plot(xp, yp, 'r', linewidth = 2, label = "regression")
# plt.legend()
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
1.3.2. Solving with Gradient Descent¶
Instead of solving the normal equation analytically, we can also minimize the loss function using gradient descent, an iterative optimization method.
We aim to solve:
$$\min_{\theta} ~ \lVert \hat y - y \rVert_2^2 = \min_{\theta} ~ \lVert A\theta - y \rVert_2^2$$
Let the objective function be:
$$f(\theta) = \lVert A\theta - y \rVert_2^2 = (A\theta - y)^T (A\theta - y)$$
Expand the objective function:
$$ \begin{align*} f(\theta) &= (A\theta-y)^T(A\theta-y) \\ &= (\theta^TA^T-y^T)(A\theta-y) \\ &= \theta^TA^TA\theta - \theta^TA^Ty - y^TA\theta + y^Ty \\\\ \nabla_{\theta} f &= A^TA\theta + A^TA\theta - A^Ty - A^Ty = 2(A^TA\theta - A^Ty) \end{align*} $$
Gradient descent update rule:
$$ \theta \leftarrow \theta - \alpha \nabla_{\theta} f = \theta - 2 \alpha (A^TA\theta - A^Ty) $$
theta = np.random.randn(2,1)
theta = np.asmatrix(theta)
alpha = 0.001
for _ in range(1000):
    df = 2*(A.T*A*theta - A.T*y)
    theta = theta - alpha*df

print(theta)
# to plot
plt.figure(figsize = (6, 4))
plt.xlabel('X')
plt.ylabel('Y')
plt.plot(x, y, 'ko', label = "data", alpha = 0.3)
# to plot a straight line (fitted line)
xp = np.arange(0, 5, 0.01).reshape(-1, 1)
yp = theta[0,0] + theta[1,0]*xp
plt.plot(xp, yp, 'r', linewidth = 2, label = "regression")
plt.legend()
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
1.3.3. Solving with CVXPY (Convex Optimization)¶
Another way to solve the linear regression problem is by using CVXPY, a Python library for convex optimization.
$$\min_{\theta} ~ \lVert \hat y - y \rVert_2 = \min_{\theta} ~ \lVert A\theta - y \rVert_2$$
import cvxpy as cvx
theta2 = cvx.Variable([2, 1])
obj = cvx.Minimize(cvx.norm(A@theta2 - y, 2))
cvx.Problem(obj,[]).solve()
print('theta:\n', theta2.value)
# to plot straight lines (fitted lines)
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', label = 'data', alpha = 0.3)
xp = np.arange(0, 5, 0.01).reshape(-1,1)
yp2 = theta2.value[0,0] + theta2.value[1,0]*xp
plt.plot(xp, yp2, 'r', linewidth = 2, label = '$L_2$')
plt.axis('equal')
plt.legend(loc = 5)
plt.grid(alpha = 0.3)
plt.show()
1.3.4. Scikit-Learn¶
Scikit-Learn is a powerful and user-friendly machine learning library in Python. It provides efficient implementations of many standard algorithms, including linear regression.
- Simple and efficient tools for data mining and data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and Matplotlib
- Open source
- https://scikit-learn.org/stable/index.html#
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(x, y)
reg.coef_
reg.intercept_
# to plot
plt.figure(figsize = (6, 4))
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.plot(x, y, 'ko', label = "data", alpha = 0.3)
# to plot a straight line (fitted line)
plt.plot(xp, reg.predict(xp), 'r', linewidth = 2, label = "regression")
plt.legend(fontsize = 12)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
We have demonstrated four different implementations of linear regression: least squares, gradient descent, CVXPY, and Scikit-Learn.
1.3.5. L1 Norm¶
In the objective function, the error is defined as the difference between the predicted value and the ground truth. To measure the magnitude or strength of this error, the L2 norm is typically used, as it provides a smooth and differentiable measure of error, making it convenient for optimization.
However, as we have learned, there are many different norms available. It is often interesting to explore what happens when a different norm, such as the L1 norm, is used instead. The choice of norm can significantly impact the behavior and properties of the optimization process, affecting factors such as robustness, sparsity, and computational efficiency.
Let's use the $L_1$ norm:
$$ \min_{\theta} ~ \lVert \hat y - y \rVert_1 = \min_{\theta} ~ \lVert A\theta - y \rVert_1 $$
You might wonder how to solve this optimization problem since the L1 norm is used in the objective function. However, we will use CVXPY to find the optimal solution. You do not need to worry too much about the algorithm running beneath CVXPY.
theta1 = cvx.Variable([2, 1])
obj = cvx.Minimize(cvx.norm(A@theta1 - y, 1))
cvx.Problem(obj).solve()
print('theta:\n', theta1.value)
# to plot data
plt.figure(figsize = (6, 4))
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.plot(x, y, 'ko', label = 'data', alpha = 0.3)
# to plot straight lines (fitted lines)
xp = np.arange(0, 5, 0.01).reshape(-1, 1)
yp1 = theta1.value[0,0] + theta1.value[1,0]*xp
yp2 = theta2.value[0,0] + theta2.value[1,0]*xp
plt.plot(xp, yp1, 'b', linewidth = 2, label = '$L_1$')
plt.plot(xp, yp2, 'r', linewidth = 2, label = '$L_2$')
plt.legend()
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
Interestingly, the $L_1$ norm also provides a reasonable linear approximation. Therefore, at this point, you might conclude that the choice of distance definition does not significantly impact the results.
However, to truly understand their differences, let's consider a scenario where outliers are present in the dataset.
What Happens When Outliers Exist?
We'll fit the data using both the L1 and L2 norms.
Then, we'll compare the resulting models and discuss the differences.
This will help highlight the sensitivity of each norm to outliers and clarify why the choice of distance metric matters.
# add outliers
x = np.vstack([x, np.array([0.5, 3.8]).reshape(-1, 1)])
y = np.vstack([y, np.array([3.0, 1.0]).reshape(-1, 1)])
A = np.hstack([x**0, x])
A = np.asmatrix(A)
A.shape
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', label = 'data', alpha = 0.3)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
L2 norm
theta2 = cvx.Variable([2, 1])
obj2 = cvx.Minimize(cvx.norm(A@theta2 - y, 2))
prob2 = cvx.Problem(obj2).solve()
# to plot straight lines (fitted lines)
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', label = 'data', alpha = 0.3)
xp = np.arange(0, 5, 0.01).reshape(-1,1)
yp2 = theta2.value[0,0] + theta2.value[1,0]*xp
plt.plot(xp, yp2, 'r', linewidth = 2, label = '$L_2$')
plt.axis('equal')
plt.legend(loc = 5)
plt.grid(alpha = 0.3)
plt.show()
L1 norm
theta1 = cvx.Variable([2, 1])
obj1 = cvx.Minimize(cvx.norm(A@theta1 - y, 1))
prob1 = cvx.Problem(obj1).solve()
# to plot straight lines (fitted lines)
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', label = 'data', alpha = 0.3)
xp = np.arange(0, 5, 0.01).reshape(-1,1)
yp1 = theta1.value[0,0] + theta1.value[1,0]*xp
plt.plot(xp, yp1, 'b', linewidth = 2, label = '$L_1$')
plt.axis('equal')
plt.legend(loc = 5)
plt.grid(alpha = 0.3)
plt.show()
# to plot data
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', label = 'data', alpha = 0.3)
plt.title('$L_1$ and $L_2$ Regression w/ Outliers', fontsize = 12)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
# to plot straight lines (fitted lines)
xp = np.arange(0, 5, 0.01).reshape(-1,1)
yp1 = theta1.value[0,0] + theta1.value[1,0]*xp
yp2 = theta2.value[0,0] + theta2.value[1,0]*xp
plt.plot(xp, yp1, 'b', linewidth = 2, label = '$L_1$')
plt.plot(xp, yp2, 'r', linewidth = 2, label = '$L_2$')
plt.axis('scaled')
plt.legend(loc = 5)
plt.grid(alpha = 0.3)
plt.show()
The key differences between these two regression approaches arise from how they handle outliers in the data.
(1) Sensitivity to Outliers in L2-Norm Regression
The L2 norm minimizes the sum of squared errors:
$$\min ~ \sum \left( \hat y - y \right)^2 $$
Squaring the residuals magnifies the effect of large errors, meaning that outliers have a significant influence on the final regression line.
As a result, the L2 regression line (red) is pulled towards the outliers, making it less robust.
(2) Robustness of L1-Norm Regression
The L1 norm minimizes the sum of absolute errors:
$$\min ~ \sum \lvert \hat y - y \rvert $$
Since absolute differences are not squared, large errors (outliers) contribute less to the total loss compared to the L2 norm.
This makes the L1 regression line (blue) less sensitive to outliers, leading to a more robust fit that better represents the majority of the data.
2. Multivariate Linear Regression¶
Multivariate Linear Regression is an extension of simple linear regression where multiple independent variables (features) are used to predict a dependent variable. This is useful in scenarios where multiple factors influence an outcome.
2.1. Linear Regression for Multivariate Data¶
The model for multivariate linear regression with two input features is:
$$ \hat{y} = \theta_0 + \theta_{1}x_1 + \theta_{2}x_2 $$
We can express this using a feature vector:
$$\phi \left(x^{(i)}\right) = \begin{bmatrix}1\\x^{(i)}_{1}\\x^{(i)}_{2} \end{bmatrix}$$
# for 3D plot
from mpl_toolkits.mplot3d import Axes3D
# y = theta0 + theta1*x1 + theta2*x2 + noise
n = 200
x1 = np.random.randn(n, 1)
x2 = np.random.randn(n, 1)
noise = 0.5*np.random.randn(n, 1);
y = 2 + 1*x1 + 3*x2 + noise
fig = plt.figure(figsize = (8, 6))
ax = fig.add_subplot(1, 1, 1, projection = '3d')
ax.set_title('Generated Data', fontsize = 12)
ax.set_xlabel('$X_1$', fontsize = 12)
ax.set_ylabel('$X_2$', fontsize = 12)
ax.set_zlabel('Y', fontsize = 12)
ax.scatter(x1, x2, y, marker = '.', label = 'Data')
#ax.view_init(30,30)
plt.legend(fontsize = 12)
plt.show()
We can represent the multivariate linear regression problem compactly in matrix form:
$$\Phi = \begin{bmatrix}1 & x_{1}^{(1)} & x_{2}^{(1)}\\1 & x_{1}^{(2)} & x_{2}^{(2)}\\ \vdots & \vdots & \vdots \\1 & x_{1}^{(m)} & x_{2}^{(m)} \end{bmatrix} \quad \implies \quad \hat{y} = \begin{bmatrix}\hat{y}^{(1)} \\\hat{y}^{(2)} \\\vdots \\\hat{y}^{(m)}\end{bmatrix}=\Phi\theta$$
The corresponding least squares solution for the optimal parameter vector $\theta$ is:
$$ \theta^{*} = (\Phi^T \Phi)^{-1} \Phi^T y$$
In matrix form, the least-squares solution remains identical in its formulation. The only difference is that when extending to multivariate regression, more columns are included in the feature matrix $\Phi$, representing additional independent variables.
A = np.hstack([np.ones((n, 1)), x1, x2])
A = np.asmatrix(A)
theta = (A.T*A).I*A.T*y
X1, X2 = np.meshgrid(np.arange(np.min(x1), np.max(x1), 0.5),
np.arange(np.min(x2), np.max(x2), 0.5))
YP = theta[0,0] + theta[1,0]*X1 + theta[2,0]*X2
fig = plt.figure(figsize = (8, 6))
ax = fig.add_subplot(1, 1, 1, projection = '3d')
ax.set_title('Regression', fontsize = 12)
ax.set_xlabel('$X_1$', fontsize = 12)
ax.set_ylabel('$X_2$', fontsize = 12)
ax.set_zlabel('Y', fontsize = 12)
ax.scatter(x1, x2, y, marker = '.', label = 'Data')
ax.plot_wireframe(X1, X2, YP, color = 'k', alpha = 0.3, label = 'Regression Plane')
#ax.view_init(30,30)
plt.legend(fontsize = 12)
plt.show()
2.2. Feature (or Input) Importance in Multivariate Linear Regression¶
In multivariate linear regression, multiple features (inputs) contribute to predicting the target variable. However, not all features are equally important.
Understanding feature importance is essential for:
- Model interpretability
- Feature selection
- Improving generalization and performance
The prediction model is expressed as:
$$ \hat{y} = \theta_0 + \theta_{1}x_1 + \theta_{2}x_2 + \cdots $$
The importance of a feature is directly related to the magnitude of its coefficient $\theta_i$.
- A larger $\lvert \theta_i \rvert$ suggests a stronger influence on the predicted outcome.
- A smaller $\lvert \theta_i \rvert$ implies the feature has minimal effect and may be considered for removal.
Potential issue: Features may be on different scales, which can distort the comparison of coefficients. Feature scaling or standardization is important before interpreting weights.
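As a quick sketch of this point (reusing the generated data $x_1$, $x_2$, $y$ from the previous section; the factor of 1000 and the use of scikit-learn's StandardScaler are illustrative assumptions, not part of the original example):
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# artificially put the two features on very different scales
X_raw = np.hstack([x1, 1000*x2])
X_std = StandardScaler().fit_transform(X_raw)      # zero mean, unit variance per column

reg_raw = LinearRegression().fit(X_raw, y)
reg_std = LinearRegression().fit(X_std, y)

print('raw coefficients          :', reg_raw.coef_)   # weight on the rescaled feature shrinks by ~1000x
print('standardized coefficients :', reg_std.coef_)   # magnitudes now reflect relative importance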
2.3. Feature Selection in Multivariate Linear Regression¶
In multivariate linear regression, some features may have little impact on the target variable. If a feature's coefficient is relatively small compared to others, it contributes minimally to the prediction and can be eliminated to simplify the model. Simplifying the model leads to better generalization and computational efficiency.
Example
Before feature selection:
$$ \hat y = 3.2x_1 + 0.01x_2 - 2.5x_3 + 0.0001x_4 $$
Here, $x_2$ and $x_4$ have negligible impact on the output.
After feature selection:
$$ \hat y = 3.2x_1 - 2.5x_3 $$
By eliminating $x_2$ and $x_4$, the model becomes simpler, easier to interpret, and computationally efficient.
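A hypothetical numerical sketch of this idea (the simulated four-feature data, the variable names, and the 0.1 threshold below are illustrative assumptions):
# hypothetical illustration: drop features whose fitted coefficients are negligible
np.random.seed(0)
m_fs = 200
X_fs = np.random.randn(m_fs, 4)
y_fs = 3.2*X_fs[:, [0]] + 0.01*X_fs[:, [1]] - 2.5*X_fs[:, [2]] + 0.0001*X_fs[:, [3]] + 0.1*np.random.randn(m_fs, 1)

Phi_fs = np.hstack([np.ones([m_fs, 1]), X_fs])
theta_full = np.linalg.lstsq(Phi_fs, y_fs, rcond = None)[0]
print('full model coefficients:\n', theta_full.ravel())

# keep only the features whose coefficient magnitude exceeds a chosen threshold
keep = np.abs(theta_full[1:, 0]) > 0.1
Phi_red = np.hstack([np.ones([m_fs, 1]), X_fs[:, keep]])
theta_red = np.linalg.lstsq(Phi_red, y_fs, rcond = None)[0]
print('selected features:', np.where(keep)[0] + 1)
print('reduced model coefficients:\n', theta_red.ravel())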
(1) Real-World Insight:
If the original model relied on four sensors to measure the input features, feature selection suggests that only two sensors are truly necessary - reducing cost, complexity, and maintenance without sacrificing performance.
(2) Control Perspective:
In a control system, where the predicted output $\hat y$ needs to match a desired target:
- The regression model can be used proactively to adjust inputs and steer the system.
- If $\hat y$ is too low, you can:
  - Increase $x_1$ (since it has a positive weight)
  - Decrease $x_3$ (since it has a negative weight)
This makes regression models not just predictive but also actionable, guiding decisions and interventions in real-time systems.
3. Nonlinear Regression¶
Linear regression is a foundational technique in statistical modeling that assumes a straight-line (linear) relationship between the independent variables and the dependent variable. It finds the line that best fits the data points by minimizing the sum of squared errors. This approach works well if the underlying relationship in the data is approximately linear. However, when the data exhibit a nonlinear pattern, a straight line may not adequately capture the relationship, leading to large errors or systematic patterns in the residuals.
When a linear model is insufficient, one should consider using nonlinear regression techniques that can accommodate curvature or other complex nonlinear behaviors in the data. In such cases, it is natural to transition from a linear regression approach to a nonlinear regression model in order to better fit the data.
This section explores how to construct a regression model capable of approximating non-linearly distributed data.
3.1. Nonlinear Regression with Polynomial Features¶
One of the simplest approaches to nonlinear regression is polynomial regression, which extends linear regression by incorporating polynomial terms to capture curvature in the data.
Consider a dataset consisting of paired data points $(x_i, y_i)$, where each $x_i$ is an independent variable and $y_i$ is the corresponding dependent variable. Instead of modeling $y$ as a simple linear function of $x$, polynomial regression extends the feature set by including higher-order terms:
$$ \begin{align*} y &= \theta_0 + \theta_1 x + \theta_2 x^2 + \text{noise} \end{align*} $$
$$\phi(x_{i}) = \begin{bmatrix}1\\x_{i}\\x_{i}^2 \end{bmatrix}$$
Here, a quadratic model is used as an example, and $\Phi$ is constructed using powers of $x$, forming an augmented feature space:
$$\Phi = \begin{bmatrix}1 & x_{1} & x_{1}^2 \\ 1 & x_{2} & x_{2}^2 \\ \vdots & \vdots & \vdots \\ 1 & x_{m} & x_{m}^2\end{bmatrix} \quad \implies \quad \hat{y} = \begin{bmatrix}\hat{y}_1 \\\hat{y}_2 \\\vdots \\\hat{y}_m\end{bmatrix}=\Phi\theta$$
where each row represents an observation, and each column corresponds to a different polynomial term.
Although the equation appears nonlinear in terms of $x$, it remains linear in terms of the unknown parameters $\theta$. The model is still solvable using linear regression techniques, such as the least-squares method:
$$\implies \theta^{*} = (\Phi^T \Phi)^{-1} \Phi^T y$$
# y = theta0 + theta1*x + theta2*x^2 + noise
n = 100
x = -5 + 15*np.random.rand(n, 1)
noise = 10*np.random.randn(n, 1)
y = 10 + 1*x + 2*x**2 + noise
plt.figure(figsize = (6, 4))
plt.xlabel('X')
plt.ylabel('Y')
plt.plot(x, y, 'o', markersize = 4, alpha = 0.3)
plt.grid(alpha = 0.3)
plt.show()
A = np.hstack([x**0, x, x**2])
A = np.asmatrix(A)
theta = (A.T*A).I*A.T*y
print('theta:\n', theta)
xp = np.linspace(np.min(x), np.max(x))
yp = theta[0,0] + theta[1,0]*xp + theta[2,0]*xp**2
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', markersize = 4, alpha = 0.3)
plt.plot(xp, yp, 'r', linewidth = 2)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.show()
By understanding this example, we can effectively model nonlinear patterns using polynomial regression while leveraging the simplicity of linear regression methods.
3.2. Linear Basis Function Models¶
In the section above, we introduced polynomial regression. Here, we interpret the same method from a slightly different perspective. We have learned that matrix-vector multiplication can be thought of as a linear combination of the columns of the matrix, with coefficients given by the vector. If the columns of $\Phi$ represent basis functions, then $\Phi \theta$ is a function in the span of these basis functions.
This approach is referred to as linear basis function models. With this perspective, polynomial regression can be viewed as a special case of linear basis function models, where the basis functions are polynomials.
Consider linear combinations of fixed nonlinear functions of the input variables, of the form
$$ \begin{bmatrix} 1 & x_{1} & x_1^2\\1 & x_{2} & x_2^2\\\vdots & \vdots & \vdots\\1 & x_{m} & x_m^2 \end{bmatrix} \begin{bmatrix}\theta_0\\\theta_1 \\ \theta_2 \end{bmatrix} \quad \implies \quad \begin{bmatrix} \mid & \mid & \mid \\ b_0(x) & b_1(x) & b_2(x)\\ \mid & \mid & \mid \end{bmatrix} \begin{bmatrix}\theta_0\\\theta_1 \\ \theta_2 \end{bmatrix} $$
$$ \hat{y}=\sum_{i=0}^d{\theta_i b_i(x)} = \Phi \theta$$
1) Polynomial functions
$$b_i(x) = x^i, \quad i = 0,\cdots,d$$
xp = np.arange(-1, 1, 0.01).reshape(-1, 1)
polybasis = np.hstack([xp**i for i in range(6)])
plt.figure(figsize = (6, 4))
for i in range(6):
    plt.plot(xp, polybasis[:,i], label = '$x^{}$'.format(i))
plt.title('Polynomial Basis')
plt.xlabel('X')
plt.ylabel('Y')
plt.axis([-1, 1, -1.1, 1.1])
plt.grid(alpha = 0.3)
plt.legend()
plt.show()
For instance, if the coefficients are $[-1, 1, 1, -1, 1, -1]^T$ with the polynomial basis, the nonlinear function would be:
$$\hat y = -1 + 1 \cdot x + 1 \cdot x^2 - 1 \cdot x^3 + 1 \cdot x^4 -1 \cdot x^5$$
theta = np.array([-1, 1, 1, -1, 1, -1])
yhat = polybasis.dot(theta)
plt.figure(figsize = (6, 4))
plt.plot(xp, yhat)
plt.xlabel('X')
plt.ylabel('Y')
plt.axis([-1, 1, -1.1, 1.1])
plt.grid(alpha = 0.3)
plt.show()
The form of a nonlinear function is determined by the specific values of its coefficients. Consequently, solving a nonlinear regression problem involves identifying the set of coefficients that enable the function to most accurately approximate the given data points.
2) RBF functions
A Radial Basis Function (RBF) is a type of basis function whose value depends only on the distance between the input and a fixed center point.
- Gaussian RBF (most commonly used) with bandwidth $\sigma$ and $k$ RBF centers $\mu_i \in \mathbb{R}^n$
$$ b_i(x) = \exp \left( - \frac{\lVert x-\mu_i \rVert^2}{2\sigma^2}\right) $$
- Useful for capturing localized patterns
d = 9
u = np.linspace(-1, 1, d)
sigma = 0.2
rbfbasis = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])
plt.figure(figsize = (6, 4))
for i in range(d):
    plt.plot(xp, rbfbasis[:,i], label = r'$\mu = {}$'.format(u[i]))
plt.title('RBF basis')
plt.xlabel('X')
plt.ylabel('Y')
plt.axis([-1, 1, -0.1, 1.1])
plt.legend(loc = 'lower right')
plt.grid(alpha = 0.3)
plt.show()
theta = np.array([1, 1, 1, -1, -1, 1, -1, -1, -1])
yhat = rbfbasis.dot(theta)
plt.figure(figsize = (6, 4))
plt.plot(xp, yhat)
plt.xlabel('X')
plt.ylabel('Y')
# plt.axis([-1, 1, -1.1, 1.1])
plt.grid(alpha = 0.3)
plt.show()
Nonlinear regression with linear basis function models
We are now ready to perform nonlinear regression on the given data points.
(1) Polynomial functions
n = 100
x = -5 + 15*np.random.rand(n, 1)
noise = 10*np.random.randn(n, 1)
y = 10 + 1*x + 2*x**2 + noise
plt.figure(figsize = (6, 4))
plt.xlabel('X')
plt.ylabel('Y')
plt.plot(x, y, 'o', markersize = 4, alpha = 0.3)
plt.grid(alpha = 0.3)
plt.show()
xp = np.linspace(np.min(x), np.max(x)).reshape(-1, 1)
d = 3
A = np.hstack([x**i for i in range(d)])
polybasis = np.hstack([xp**i for i in range(d)])
A = np.asmatrix(A)
polybasis = np.asmatrix(polybasis)
theta = (A.T*A).I*A.T*y
yp = polybasis*theta
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', markersize = 4, alpha = 0.3)
plt.plot(xp[:,0], yp[:,0], linewidth = 2)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.show()
(2) RBF functions
xp = np.linspace(np.min(x), np.max(x)).reshape(-1, 1)
d = 5
u = np.linspace(-6, 12, d)
sigma = 3
A = np.hstack([np.exp(-(x-u[i])**2/(2*sigma**2)) for i in range(d)])
rbfbasis = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])
A = np.asmatrix(A)
rbfbasis = np.asmatrix(rbfbasis)
theta = (A.T*A).I*A.T*y
yp = rbfbasis*theta
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', markersize = 4, alpha = 0.3)
plt.plot(xp, yp)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.show()
Nonlinear regression utilizing polynomial features and linear basis function models employing polynomial basis functions are fundamentally equivalent in their mathematical structure. Both approaches aim to model complex, nonlinear relationships by transforming the original input variables into a higher-dimensional space where linear regression techniques can be applied. In the case of polynomial regression, this transformation involves augmenting the input data with polynomial terms (e.g., $ x, x^2, x^3, \cdots $), effectively enabling the modeling of nonlinear patterns within the data. Similarly, linear basis function models utilize polynomial basis functions to achieve a comparable transformation, facilitating the capture of nonlinear relationships through a linear modeling framework.
Despite this equivalence, the two methodologies offer distinct perspectives in terms of implementation and interpretation. Polynomial regression is often viewed as an extension of linear regression, where additional polynomial terms are included to account for curvature in the data. This approach is straightforward and leverages the simplicity of linear models, making it accessible and interpretable. On the other hand, linear basis function models provide a more generalized framework, allowing for the incorporation of various types of basis functions beyond polynomials, such as splines or radial basis functions. This flexibility enables practitioners to tailor the model to the specific characteristics of the data, potentially enhancing predictive performance and interpretability.
4. Overfitting¶
In machine learning, one of the most common challenges is overfitting. Overfitting occurs when a model learns not only the important patterns in the training data but also the noise and random fluctuations. Instead of performing well on new, unseen data, the model becomes highly tuned to the training data, resulting in poor generalization.
This is similar to a student who memorizes answers to specific questions rather than understanding the concepts, leading to poor performance when asked different questions. In the same way, an overfitted model may perform very well on the training set but fail to make accurate predictions on test data.
Understanding and addressing overfitting is essential for building models that generalize effectively. We introduce the concept of overfitting, its causes, and its impact on model performance. We will also cover practical methods to prevent overfitting, such as regularization.
With clear examples and explanations, this section aims to provide an understanding of overfitting and how to handle it. By the end of this section, you will be equipped with the knowledge to create models that strike the right balance between fitting the data and making reliable predictions on new data.
4.1. Nonlinear Regression with 10 Data Points¶
We will consider a simple case where we have ten data points and build a non-linear regression model using polynomial functions of varying degrees. By visualizing the fitted models, we will observe how the degree of the polynomial affects the fit and identify cases of underfitting, appropriate fitting, and overfitting.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# 10 data points
n = 10
x = np.linspace(-4.5, 4.5, 10).reshape(-1, 1)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512]).reshape(-1, 1)
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
Degree 1 (Linear Regression):
- The model underfits the data
- It fails to capture the non-linear pattern and results in a straight-line fit.
A = np.hstack([x**0, x])
A = np.asmatrix(A)
theta = (A.T*A).I*A.T*y
print(theta)
# to plot
xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)
yp = theta[0,0] + theta[1,0]*xp
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp[:,0], yp[:,0], linewidth = 2)
plt.title('Linear Regression')
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
Degree 3 (Cubic Polynomial):
- The model fits the data well
- It captures the underlying trend without being too flexible or too rigid.
A = np.hstack([x**0, x, x**2, x**3])
A = np.asmatrix(A)
theta = (A.T*A).I*A.T*y
print(theta)
# to plot
xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)
yp = theta[0,0] + theta[1,0]*xp + theta[2,0]*xp**2 + theta[3,0]*xp**3
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp[:,0], yp[:,0], linewidth = 2)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
Degree 9 (High-Degree Polynomial):
- The model overfits the data
- It closely follows every data point, including noise, resulting in a wavy curve that fails to generalize well to unseen data.
A = np.hstack([x**i for i in range(10)])
A = np.asmatrix(A)
theta = (A.T*A).I*A.T*y
# to plot
xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)
polybasis = np.hstack([xp**i for i in range(10)])
polybasis = np.asmatrix(polybasis)
yp = polybasis*theta
plt.figure(figsize = (6, 8))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp[:,0], yp[:,0], linewidth = 2)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
d = [1, 3, 5, 9]
RSS = []
plt.figure(figsize = (8, 6))
plt.suptitle('Nonlinear Regression', fontsize = 12)
for k in range(4):
    A = np.hstack([x**i for i in range(d[k]+1)])
    polybasis = np.hstack([xp**i for i in range(d[k]+1)])
    A = np.asmatrix(A)
    polybasis = np.asmatrix(polybasis)
    theta = (A.T*A).I*A.T*y
    yp = polybasis*theta
    RSS.append(np.linalg.norm(y - A*theta, 2)**2)
    plt.subplot(2, 2, k+1)
    plt.plot(x, y, 'o', alpha = 0.5)
    plt.plot(xp, yp)
    plt.axis([-5, 5, -12, 6])
    plt.title('degree = {}'.format(d[k]))
    plt.grid(alpha = 0.3)

plt.show()
Residual Sum of Squares
plt.figure(figsize = (6, 4))
plt.stem(d, RSS, label = 'RSS')
plt.title('Residual Sum of Squares', fontsize = 12)
plt.xlabel('degree', fontsize = 12)
plt.ylabel('RSS', fontsize = 12)
plt.legend(fontsize = 12)
plt.grid(alpha = 0.3)
plt.show()
Residual Sum of Squares (RSS) decreases as the polynomial degree increases. A 9th-order polynomial regression applied to a dataset with 10 data points will result in zero RSS. However, a lower error does not necessarily indicate a better model, as overfitting may compromise generalization.
Key Takeaways
- Underfitting: Happens when the model is too simple (e.g., a low-degree polynomial) and fails to capture the pattern.
- Good Fit: Achieved when the model is complex enough to capture the trend but not so flexible that it fits noise.
- Overfitting: Happens when the model is too complex (e.g., a high-degree polynomial) and memorizes noise, reducing generalization.
By testing different degrees of polynomial functions, this example demonstrates the importance of finding the right balance between model complexity and generalization to avoid underfitting and overfitting.
We will repeat the same case to build a nonlinear regression model using RBF functions.
xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)
d = 5
u = np.linspace(-4.5, 4.5, d)
sigma = 2
A = np.hstack([np.exp(-(x-u[i])**2/(2*sigma**2)) for i in range(d)])
rbfbasis = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])
A = np.asmatrix(A)
rbfbasis = np.asmatrix(rbfbasis)
theta = (A.T*A).I*A.T*y
yp = rbfbasis*theta
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp, yp)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
d = [2, 4, 6, 10]
sigma = 2
plt.figure(figsize = (8, 6))
for k in range(4):
    u = np.linspace(-4.5, 4.5, d[k])
    A = np.hstack([np.exp(-(x-u[i])**2/(2*sigma**2)) for i in range(d[k])])
    rbfbasis = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d[k])])
    A = np.asmatrix(A)
    rbfbasis = np.asmatrix(rbfbasis)
    theta = (A.T*A).I*A.T*y
    yp = rbfbasis*theta
    plt.subplot(2, 2, k+1)
    plt.plot(x, y, 'o', alpha = 0.5)
    plt.plot(xp, yp)
    plt.axis([-5, 5, -12, 6])
    plt.title('num RBFs = {}'.format(d[k]), fontsize = 10)
    plt.grid(alpha = 0.3)

plt.show()
4.2. Overfitting¶
Overfitting occurs when a machine learning model performs well on the training data but poorly on unseen or test data. The model essentially "memorizes" the noise and details in the training set rather than learning the general patterns that can be applied to new data.
Overfitting Problem
- Have you come across a situation where your model performed exceptionally well on training data but was unable to predict test data accurately?
- One of the most common problems data science professionals face is avoiding overfitting.
Causes of Overfitting
- Model Complexity: A model with too many parameters (e.g., deep neural networks) can overfit the training data.
- Insufficient Training Data: If the training dataset is small, the model may fit every point precisely, capturing noise rather than general patterns.
- Noisy Data: If the data has a lot of noise or irrelevant features, the model may try to learn this noise.
- Too Many Training Iterations: Training for too long can lead to overfitting, where the model starts fitting noise instead of actual patterns.
Signs of Overfitting
- High Training Accuracy, Low Test Accuracy: The model performs well on the training set but poorly on the validation/test set.
- Large Gap Between Training and Validation Loss: Training loss continues to decrease, but validation loss increases.
Generalization Error
Generalization Error refers to the difference between a machine learning model's performance on the training data and its performance on unseen data (test or validation data). In other words, the generalization error measures the difference between the expected loss on unseen test data and the loss on the training data. It measures how well the model can generalize to new, unseen examples. A model with a low generalization error performs well not only on the training set but also on new data.
Mathematically, the generalization error can be defined as:
$$ \text{Generalization Error} = \mathbb{E}_{D_{\text{test}}}[\mathcal{L}(f(x), y)] - \mathbb{E}_{D_{\text{train}}}[\mathcal{L}(f(x), y)], $$
where:
- $D_{\text{test}}$ and $D_{\text{train}}$ are the test and training data distributions.
- $f(x)$ is the model's prediction function.
- $\mathcal{L}(f(x), y)$ is the loss function measuring the difference between the predicted and true values.
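To make this gap concrete, the sketch below fits polynomials of increasing degree to a small synthetic training set and evaluates the mean squared error on a separate test set (the data-generating function, noise level, and degrees are illustrative assumptions):
# illustrative sketch: training error vs. test error as model complexity grows
np.random.seed(1)
f_true = lambda x: 0.5*x - 0.1*x**2                  # assumed underlying function
x_tr = np.random.uniform(-4.5, 4.5, (10, 1))         # small training set
x_te = np.random.uniform(-4.5, 4.5, (200, 1))        # held-out test set
y_tr = f_true(x_tr) + 0.5*np.random.randn(10, 1)
y_te = f_true(x_te) + 0.5*np.random.randn(200, 1)

for deg in [1, 3, 9]:
    Phi_tr = np.hstack([x_tr**i for i in range(deg + 1)])
    Phi_te = np.hstack([x_te**i for i in range(deg + 1)])
    th = np.linalg.lstsq(Phi_tr, y_tr, rcond = None)[0]
    mse_tr = np.mean(np.square(Phi_tr @ th - y_tr))
    mse_te = np.mean(np.square(Phi_te @ th - y_te))
    # training error keeps shrinking with degree, while test error typically grows once noise is fit
    print('degree {}: train MSE = {:.3f}, test MSE = {:.3f}'.format(deg, mse_tr, mse_te))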
Techniques to Prevent Overfitting
(1) Regularization
- Prevent overfitting by adding a penalty to the loss function, encouraging simpler models with smaller weights.
- This penalty discourages the model from fitting noise in the training data by constraining the values of the model's parameters.
- Regularization helps the model generalize better to unseen data by making it simpler and less sensitive to outliers.
(2) Early Stopping
- Monitors the validation loss during training and stops when the loss starts increasing to prevent overfitting.
(3) Data Augmentation
- Increases the training set size by applying transformations (e.g., rotations, flips, cropping) to existing data, making the model more generalizable.
(4) Cross-Validation
- Splits the data into multiple folds to train and validate the model on different subsets, ensuring that the model is not overfitting specific subsets of data.
(5) Reduce Model Complexity
- Use simpler models with fewer parameters to avoid memorizing the training data.
Among the various techniques to prevent overfitting, the regularization method will be the primary focus of the following session.
4.3. Regularization to Reduce Overfitting¶
L2 regularization, also known as Ridge Regularization or weight decay, adds a penalty equal to the sum of the squared weights to the loss function. It discourages large weights by penalizing their size, helping the model learn simpler and more robust patterns. Overfitting is often associated with very large estimated parameters $\theta$.
We want to balance:
- how well the function fits the data
- the magnitude of the coefficients
$$ \begin{align*} \text{Total cost } = \;&\underbrace{\text{measure of fit}}_{RSS(\theta)} + \;\lambda \cdot \underbrace{\text{measure of magnitude of coefficients}}_{\lambda \cdot \lVert \theta \rVert_2^2} \\ \\ \implies &\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2 \end{align*} $$
where
- the first term, $ RSS(\theta) = \lVert \Phi\theta - y \rVert^2_2 $, is the residual sum of squares,
- the second term, $\lambda \cdot \lVert \theta \rVert_2^2$, called a shrinkage penalty, is small when $\theta_1, \cdots,\theta_d$ are close to zero, and so it has the effect of shrinking the estimates of $\theta_j$ towards zero, and
- the tuning parameter $\lambda$ controls the relative impact of these two terms on the regression coefficient estimates.
This formulation is known as ridge regression.
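For reference, this ridge objective also has a closed-form solution (a standard result; the examples below solve it numerically with CVXPY instead):
$$\nabla_\theta \left( \lVert \Phi\theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2 \right) = 2\Phi^T(\Phi\theta - y) + 2\lambda\theta = 0 \quad \implies \quad \theta^* = \left(\Phi^T\Phi + \lambda I\right)^{-1}\Phi^T y$$
For $\lambda > 0$, the matrix $\Phi^T\Phi + \lambda I$ is always invertible, which is an additional practical benefit of regularization.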
How L2 Regularization Works
Effect on Weights:
- L2 regularization forces the model to keep weights small.
Interpretation:
- L2 regularization distributes the penalty uniformly across all weights, shrinking their magnitude.
Overfitting Reduction:
- By limiting the model's ability to learn large weights, the model becomes less likely to overfit the noise in the training data.
Example:
Start with a rich representation
$$\min\; \lVert \Phi \theta - y \rVert_2^2$$
# CVXPY code
import cvxpy as cvx
d = 10
u = np.linspace(-4.5, 4.5, d)
sigma = 1
A = np.hstack([np.exp(-(x-u[i])**2/(2*sigma**2)) for i in range(d)])
rbfbasis = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])
A = np.asmatrix(A)
rbfbasis = np.asmatrix(rbfbasis)
theta = cvx.Variable([d, 1])
obj = cvx.Minimize(cvx.sum_squares(A@theta - y))
prob = cvx.Problem(obj).solve()
yp = rbfbasis*theta.value
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp, yp)
plt.xlabel('X')
plt.ylabel('Y')
plt.axis([-5, 5, -12, 6])
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
As observed in the graph above, using 10 RBF functions for 10 data points results in a perfect fit, achieving zero RSS for the training data. However, this clearly indicates overfitting.
Visualizing the learned coefficients of a regression model is a valuable step in understanding the influence of each basis.
plt.figure(figsize = (6, 4))
plt.title(r'Ridge: magnitude of $\theta$', fontsize = 12)
plt.xlabel(r'$\theta$', fontsize = 12)
plt.ylabel('magnitude', fontsize = 12)
plt.stem(np.linspace(1, 10, 10).reshape(-1, 1), theta.value)
plt.xlim([0.5, 10.5])
# plt.ylim([-5, 5])
plt.grid(alpha = 0.3)
plt.show()
Then, regularize coefficients $\theta$
$$\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2$$
# ridge regression
lamb = 0.1
theta = cvx.Variable([d, 1])
obj = cvx.Minimize(cvx.sum_squares(A@theta - y) + lamb*cvx.sum_squares(theta))
prob = cvx.Problem(obj).solve()
yp = rbfbasis*theta.value
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp, yp, label = 'Ridge')
plt.xlabel('X')
plt.ylabel('Y')
plt.axis([-5, 5, -12, 6])
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
Plot the values of all the coefficients.
# Regularization (= ridge nonlinear regression) encourages small weights, but not exactly 0
plt.figure(figsize = (6, 4))
plt.title(r'Ridge: magnitude of $\theta$', fontsize = 12)
plt.xlabel(r'$\theta$', fontsize = 12)
plt.ylabel('magnitude', fontsize = 12)
plt.stem(np.linspace(1, 10, 10).reshape(-1, 1), theta.value)
plt.xlim([0.5, 10.5])
plt.ylim([-5, 5])
plt.grid(alpha = 0.3)
plt.show()
Larger $\lambda$ forces the model to keep weights small.
lamb = np.arange(0, 3, 0.01)
theta_record = []
for k in lamb:
    theta = cvx.Variable([d, 1])
    obj = cvx.Minimize(cvx.sum_squares(A@theta - y) + k*cvx.sum_squares(theta))
    prob = cvx.Problem(obj).solve()
    theta_record.append(np.ravel(theta.value))
plt.figure(figsize = (6, 4))
plt.plot(lamb, theta_record, linewidth = 1)
plt.title('Ridge coefficients as a function of regularization', fontsize = 12)
plt.xlabel('$\lambda$', fontsize = 12)
plt.ylabel(r'weight $\theta$', fontsize = 12)
plt.grid(alpha = 0.3)
plt.show()
From this example, we learn that regularization, specifically ridge regression, can be an effective method for overcoming overfitting challenges.
4.4. Sparsity for Feature Selection using LASSO¶
Ridge regression applies a penalty on the magnitude of the weights using the L2 norm. Why not consider using the L1 norm instead?
Try this cost instead of ridge...
$$ \begin{align*} \text{Total cost } = \;&\underbrace{\text{measure of fit}}_{RSS(\theta)} + \;\lambda \cdot \underbrace{\text{measure of magnitude of coefficients}}_{\lambda \cdot \lVert \theta \rVert_1} \\ \\ \implies &\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_1 \end{align*}$$
Consider the same problem but with a different objective function:
$$\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_1$$
# LASSO regression
lamb = 2
theta = cvx.Variable([d, 1])
obj = cvx.Minimize(cvx.sum_squares(A@theta - y) + lamb*cvx.norm(theta, 1))
prob = cvx.Problem(obj).solve()
yp = rbfbasis*theta.value
plt.figure(figsize = (6, 4))
plt.title('LASSO Regularization (L1)', fontsize = 12)
plt.xlabel('X')
plt.ylabel('Y')
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp, yp)
plt.axis([-5, 5, -12, 6])
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
By selecting an appropriate value for $ \lambda $, the above example demonstrates that the L1 norm can also effectively mitigate overfitting.
This naturally leads to the consideration of the differences between L2 and L1 regularization techniques.
# Regularization (= Lasso nonlinear regression) encourages zero weights
plt.figure(figsize = (6, 4))
plt.title(r'LASSO: magnitude of $\theta$', fontsize = 12)
plt.xlabel(r'$\theta$')
plt.ylabel('magnitude')
plt.stem(np.arange(1,11), theta.value)
plt.xlim([0.5, 10.5])
plt.ylim([-6, 2])
plt.grid(alpha = 0.3)
plt.show()
The L1 penalty (used in Lasso Regression) has the unique property of:
- Shrinking some coefficients exactly to 0
- This effectively removes certain features from the model.
- In other words, it performs automatic feature selection.
As a result:
- Only features with non-zero coefficients are considered important or selected by the model.
- This leads to sparser, more interpretable models, especially useful when dealing with high-dimensional data.
lamb = np.arange(0, 3, 0.01)
theta_record = []
for k in lamb:
    theta = cvx.Variable([d, 1])
    obj = cvx.Minimize(cvx.sum_squares(A@theta - y) + k*cvx.norm(theta, 1))
    prob = cvx.Problem(obj).solve()
    theta_record.append(np.ravel(theta.value))
plt.figure(figsize = (6, 4))
plt.plot(lamb, theta_record, linewidth = 1)
plt.title('LASSO coefficients as a function of regularization', fontsize = 12)
plt.xlabel('$\lambda$')
plt.ylabel(r'weight $\theta$')
plt.show()
Sparsity and Feature Selection:
The L1 penalty in LASSO (Least Absolute Shrinkage and Selection Operator) encourages sparsity in the coefficient estimates, meaning it drives some coefficients to be exactly zero. This property is particularly useful for feature selection, as it allows the model to identify and retain only the most relevant predictors, simplifying the model and potentially improving interpretability.
# reduced order model
# we will use only theta 2, 3, 8, 10
d = 4
u = np.array([-3.5, -2.5, 2.5, 4.5])
sigma = 1
A = np.hstack([np.exp(-(x-u[i])**2/(2*sigma**2)) for i in range(d)])
rbfbasis = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])
A = np.asmatrix(A)
rbfbasis = np.asmatrix(rbfbasis)
theta = cvx.Variable([d, 1])
obj = cvx.Minimize(cvx.norm(A@theta - y, 2))
prob = cvx.Problem(obj).solve()
yp = rbfbasis*theta.value
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp, yp)
plt.xlabel('X')
plt.ylabel('Y')
plt.axis([-5, 5, -12, 6])
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
We numerically demonstrated that the L1 penalty enforces sparsity in the coefficient estimates. But why does this occur?
To better understand this phenomenon, we can explore alternative but equivalent optimization formulations for both Ridge and LASSO regression.
$$ \begin{array}{rl} \begin{align*} \min_{\theta} \quad & \lVert \Phi \theta - y \rVert_2^2 \\ \text{subject to} \quad & \lVert \theta \rVert_{\color{red}1} \leq s_1 \end{align*} \end{array} \quad\quad\quad\quad \begin{array}{rl} \min_{\theta} \; & \lVert \Phi \theta - y \rVert_2^2 \\ \text{subject to} \; & \lVert \theta \rVert_{\color{red}2} \leq s_2 \end{array} $$
These constrained formulations reveal the underlying geometric properties of L1 and L2 regularization. The L1 constraint encourages solutions where some coefficients are exactly zero, resulting in sparse parameter estimates. In contrast, the L2 constraint tends to shrink all coefficients uniformly, rarely producing exact zeros but effectively controlling overall model complexity.
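This geometric intuition can be visualized with a small sketch (illustrative only: the quadratic loss contours below are arbitrary and not derived from the regression problem above):
# illustrative geometry of L1 vs. L2 constraints in 2D parameter space
t1, t2 = np.meshgrid(np.linspace(-2, 2, 400), np.linspace(-2, 2, 400))
loss = (t1 - 1.2)**2 + 3*(t2 - 0.8)**2               # arbitrary quadratic loss, centered off the axes

fig, axes = plt.subplots(1, 2, figsize = (9, 4))
for ax, title in zip(axes, ['L1 ball (diamond)', 'L2 ball (circle)']):
    ax.contour(t1, t2, loss, levels = 10, alpha = 0.5)   # loss contours
    ax.axhline(0, color = 'k', linewidth = 0.5)
    ax.axvline(0, color = 'k', linewidth = 0.5)
    ax.set_title(title)
    ax.set_xlabel(r'$\theta_1$')
    ax.set_ylabel(r'$\theta_2$')
    ax.set_aspect('equal')

# feasible sets: the L1 ball has corners on the axes, so the constrained optimum
# often lands at a corner (exact zeros); the L2 ball is smooth, so it rarely does
axes[0].plot([1, 0, -1, 0, 1], [0, 1, 0, -1, 0], 'b', linewidth = 2)
angles = np.linspace(0, 2*np.pi, 200)
axes[1].plot(np.cos(angles), np.sin(angles), 'r', linewidth = 2)
plt.tight_layout()
plt.show()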
4.5. Scikit-Learn¶
# 10 data points
n = 10
x = np.linspace(-4.5, 4.5, 10).reshape(-1, 1)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512]).reshape(-1, 1)
d = 10
u = np.linspace(-4.5, 4.5, d)
sigma = 1
A = np.hstack([np.exp(-(x-u[i])**2/(2*sigma**2)) for i in range(d)])
from sklearn import linear_model
reg = linear_model.Ridge(alpha = 0.1)
reg.fit(A,y)
# to plot
xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)
Ap = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp, reg.predict(Ap), linewidth = 2, label = "Ridge")
plt.title('Ridge Regression (L2)', fontsize = 12)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
reg = linear_model.Lasso(alpha = 0.005)
reg.fit(A,y)
# to plot
xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)
Ap = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp, reg.predict(Ap), linewidth = 2, label = "LASSO")
plt.title('LASSO Regression (L1)', fontsize = 12)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
We have covered the essential concepts of regression, including both theory and implementation. Now, we will explore additional examples of regression.
5. Example: De-noising Signal¶
We start with a signal represented by a vector $x \in \mathbb{R}^n$. The coefficients $x_i$ correspond to the value of some function of time, evaluated (or sampled, in the language of signal processing) at evenly spaced points. It is usually assumed that the signal does not vary too rapidly, which means that usually, we have $x_i \approx x_{i+1}$.
Suppose we have a signal $x$, which does not vary too rapidly and that $x$ is corrupted by some small, rapidly varying noise $\varepsilon$, i.e., $x_{\text{cor}} = x + \varepsilon$.
Then, if we want to reconstruct $x$ from $x_{\text{cor}}$, we should solve (with $\hat{x}$ as the variable)
$$ \text{minimize} \quad \lVert \hat{x} - x_{\text{cor}}\rVert_2^2 + \mu\sum_{i=1}^{n-1}(\hat{x}_{i+1}-\hat{x}_i)^2 $$
where the parameter $\mu$ controls the "smoothness" of $\hat{x}$.
Source:
- Boyd & Vandenberghe's book "Convex Optimization"
- http://cvxr.com/cvx/examples/ (Figures 6.8-6.10: Quadratic smoothing)
- Week 4 of Linear and Integer Programming by Coursera of Univ. of Colorado
import numpy as np
import matplotlib.pyplot as plt
import cvxpy as cvx
%matplotlib inline
n = 200
t = np.arange(n).reshape(-1,1)
x = 0.5 * np.sin((2*np.pi/n)*t) * (np.sin(0.01*t))
x_cor = x + 0.05*np.random.randn(n,1)
plt.figure(figsize = (6, 4))
plt.subplot(2,1,1)
plt.plot(t,x,'-', linewidth = 2)
plt.axis([0, n, -0.6, 0.6])
plt.xticks([])
plt.title('original signal' , fontsize = 15)
plt.ylabel('$x_{original}$', fontsize = 15)
plt.subplot(2,1,2)
plt.plot(t, x_cor,'-', linewidth = 1)
plt.axis([0, n, -0.6, 0.6])
plt.title('corrupted signal', fontsize = 15)
plt.xlabel('n', fontsize = 15)
plt.ylabel('$x_{corrupted}$', fontsize = 15)
plt.show()
5.1. Transforming Denoising in Time into an Optimization Problem¶
We represent the signal as a column vector:
$$X = \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix} $$
We aim to recover a smooth and denoised version of the signal $X$, given a corrupted version $X_{\text{cor}}$. This can be framed as an optimization problem:
$$\large \min\limits_{X}\;\left\{\underbrace{\lVert(X - X_{\text{cor}})\rVert^2_{2}}_{\text{how much } X \text{ deviates from }X_{\text{cor}}} + \mu \underbrace{\sum_{k = 1}^{n-1}(x_{k+1}-x_{k})^2}_{\text{penalize rapid changes of } X}\right\}$$
(1) Fidelity Term:
$$\min_{X}\;\lVert(X - X_{\text{cor}})\rVert^2_{2}$$
$\quad \;$This term ensures the recovered signal stays close to the original (possibly noisy) observations.
(2) Smoothness Term
$$\sum_{k = 1}^{n-1}(x_{k+1}-x_{k})^2$$
$\quad \;$This term penalizes rapid changes in the signal, encouraging smooth transitions from one time point to the next.
(3) $\mu$ : to adjust the relative weight of (1) & (2)
Vectorizing the Denoising Optimization Problem
To solve the denoising problem efficiently, we express both components of the cost function in vector-matrix form.
(1) $X - X_{\text{cor}} = I_n X - X_{\text{cor}}$
(2) $\sum\;(x_{k+1}-x_{k})^2 \implies$
$$(x_{2} - x_{1}) - 0 = \begin{bmatrix} -1,&1,&0,&\cdots&0 \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix}-0 $$
$$(x_{3} - x_{2}) - 0 = \begin{bmatrix} 0,&-1,&1,&\cdots&0 \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix}-0 $$
$$ \vdots $$
$$\implies \left \Arrowvert \; \begin{bmatrix} -1&1&0&\cdots&0&0 \\ 0&-1&1&\cdots&0&0 \\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&\cdots&-1&1 \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix} - \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \; \right\Arrowvert^2_{2} $$
$$\hspace{3.2cm}D \quad\quad\quad\quad\quad\quad\qquad X \quad\; - \; \; 0 $$
We now combine both terms:
$$\begin{align*} \big\Vert I_n X-X_{\text{cor}}\big\Vert^2_{2} + \mu\big\Vert DX-0\big\Vert^2_{2} & = \big\Vert AX-b\big\Vert^2_{2} \\ \\ & = \Bigg\Vert \begin{bmatrix} I_n\\ \sqrt{\mu}D \end{bmatrix} X- \begin{bmatrix} X_{\text{cor}}\\ 0 \end{bmatrix} \Bigg\Vert^2_{2} \end{align*}$$
$$ \text{where} \; A = \begin{bmatrix} I_n\\ \sqrt{\mu}D \end{bmatrix},\quad b = \begin{bmatrix} X_{\text{cor}}\\ 0 \end{bmatrix} $$
This is equivalent to a standard least-squares problem. To find the optimal denoised signal $X^*$, solve using the normal equation:
$$X^* = (A^TA)^{-1}A^Tb$$
mu = 100
D = np.zeros([n-1, n])
D[:,0:n-1] -= np.eye(n-1)
D[:,1:n] += np.eye(n-1)
A = np.vstack([np.eye(n), np.sqrt(mu)*D])
b = np.vstack([x_cor, np.zeros([n-1,1])])
A = np.asmatrix(A)
b = np.asmatrix(b)
x_reconst = (A.T*A).I*A.T*b
plt.figure(figsize = (6, 2))
plt.plot(t, x_cor, '-', linewidth = 1, alpha = 0.5, label = 'corrupted');
plt.plot(t, x_reconst, 'r', linewidth = 2, label = 'reconstructed')
plt.title('$\mu = {}$'.format(mu), fontsize = 15)
plt.legend(fontsize = 12)
plt.show()
As demonstrated, the original signal can be effectively reconstructed from the corrupted one using regression techniques. While this process highlights the potential of regression for noise reduction, it is important to note that such tasks are typically achieved more effectively and efficiently using low-pass filters in either the time or frequency domain. Nevertheless, this example serves to highlight the broader applicability of regression methods, demonstrating their potential to address various practical problems.
Another important point is that this example emphasizes the value of formulating problems in the form of a regression model, as doing so can provide insights and effective solutions even in contexts that may not initially seem related to regression.
Different $\mu$'s (See How $\mu$ Affects Smoothing Results)
plt.figure(figsize = (6, 6))
mu = [0, 10, 100]
for i in range(len(mu)):
    A = np.vstack([np.eye(n), np.sqrt(mu[i])*D])
    b = np.vstack([x_cor, np.zeros([n-1,1])])
    A = np.asmatrix(A)
    b = np.asmatrix(b)
    x_reconst = (A.T*A).I*A.T*b
    plt.subplot(3,1,i+1)
    plt.plot(t, x_cor, '-', linewidth = 1, alpha = 0.5)
    plt.plot(t, x_reconst, 'r', linewidth = 2)
    plt.ylabel(r'$\mu = {}$'.format(mu[i]), fontsize = 15)

plt.show()
5.1.1. CVXPY Implementation¶
To solve the denoising optimization problem numerically, we can use the cvxpy library:
$$\min_{X}\;\lVert(X - X_{\text{cor}})\rVert^2_{2} + \mu\sum_{k = 1}^{n-1}(x_{k+1}-x_{k})^2$$
mu = 100
x_reconst = cvx.Variable([n,1])
obj = cvx.Minimize(cvx.sum_squares(x_reconst - x_cor) + mu*cvx.sum_squares(x_reconst[1:n]-x_reconst[0:n-1]))
# obj = cvx.Minimize(cvx.sum_squares(x_reconst - x_cor) + mu*cvx.sum_squares(D@x_reconst))
prob = cvx.Problem(obj).solve()
plt.figure(figsize = (6, 2))
plt.plot(t, x_cor, '-', linewidth = 1, alpha = 0.5, label = 'corrupted');
plt.plot(t, x_reconst.value, 'r', linewidth = 2, label = 'reconstructed')
plt.title('$\mu = {}$'.format(mu), fontsize = 15)
plt.legend(fontsize = 12)
plt.show()
plt.figure(figsize = (6, 6))
mu = [0, 10, 100]
for i in range(len(mu)):
    x_reconst = cvx.Variable([n,1])
    obj = cvx.Minimize(cvx.sum_squares(x_reconst - x_cor) + mu[i]*cvx.sum_squares(D @ x_reconst))
    prob = cvx.Problem(obj).solve()
    plt.subplot(3,1,i+1)
    plt.plot(t, x_cor, '-', linewidth = 1, alpha = 0.5)
    plt.plot(t, x_reconst.value, 'r', linewidth = 2)
    plt.ylabel(r'$\mu = {}$'.format(int(mu[i])), fontsize = 15)

plt.show()
5.1.2. L2 Norm¶
CVXPY strongly encourages the use of norm expressions (such as cvx.norm) over explicit quadratic forms such as:
cvx.sum_squares()
cvx.sum(cvx.square())
cvx.quad_form()
Norm atoms are convex, numerically well behaved, and handled directly under CVXPY's disciplined convex programming (DCP) rules.
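As a quick sanity check (an added illustration, reusing n, x_cor, and D from the cells above), both the squared-penalty and the norm-based objectives pass CVXPY's DCP check; the norm-based form is the one used below.
# DCP check for the two objective styles (illustrative addition)
x_chk = cvx.Variable([n, 1])
obj_sq  = cvx.Minimize(cvx.sum_squares(x_chk - x_cor) + 100*cvx.sum_squares(D @ x_chk))
obj_nrm = cvx.Minimize(cvx.norm(x_chk - x_cor, 2) + 7*cvx.norm(D @ x_chk, 2))
print(obj_sq.is_dcp(), obj_nrm.is_dcp())    # both satisfy the DCP rules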
Preferred Objective Form:
$$ \min \; \left\{ \lVert X - X_{\text{cor}}\rVert_2 + \gamma \lVert DX \rVert_2 \right \}$$
plt.figure(figsize = (6, 6))

gammas = [0, 2, 7]

for i in range(len(gammas)):
    x_reconst = cvx.Variable([n,1])
    obj = cvx.Minimize(cvx.norm(x_reconst - x_cor, 2) + gammas[i]*cvx.norm(D @ x_reconst, 2))
    prob = cvx.Problem(obj).solve()

    plt.subplot(3, 1, i+1)
    plt.plot(t, x_cor, '-', linewidth = 1, alpha = 0.5)
    plt.plot(t, x_reconst.value, 'r', linewidth = 2)
    plt.ylabel(r'$\gamma = {}$'.format(gammas[i]), fontsize = 15)

plt.show()
5.1.3. L2 Norm with a Constraint¶
Instead of combining the fidelity and smoothness terms in a single objective function, we can pose the problem as a constrained optimization:
$$ \begin{align*} \min \quad & \lVert DX \rVert_2 \\ \text{subject to} \quad & \lVert X-X_{\text{cor}} \rVert_2 \leq \beta \end{align*} $$
Where
The objective function $ \lVert DX \rVert_2 $ encourages smoothness in the recovered signal.
The constraint $ \lVert X - X_{\text{cor}} \rVert_2 \leq \beta $ ensures that the denoised signal does not deviate too far from the observed noisy input.
The parameter $ \beta $ sets a tolerance level for how much deviation is acceptable.
beta = 0.8
x_reconst = cvx.Variable([n,1])
obj = cvx.Minimize(cvx.norm(D @ x_reconst, 2))
const = [cvx.norm(x_reconst-x_cor, 2) <= beta]
prob = cvx.Problem(obj, const).solve()
plt.figure(figsize = (6, 2))
plt.plot(t,x_cor,'-', linewidth = 1, alpha = 0.5)
plt.plot(t,x_reconst.value, 'r', linewidth = 2)
plt.ylabel(r'$\beta = {}$'.format(beta), fontsize = 15)
plt.show()
plt.figure(figsize = (6, 6))

beta = [0.1, 0.6, 0.8]

for i in range(len(beta)):
    x_reconst = cvx.Variable([n,1])
    obj = cvx.Minimize(cvx.norm(x_reconst[1:n] - x_reconst[0:n-1], 2))
    const = [cvx.norm(x_reconst - x_cor, 2) <= beta[i]]
    prob = cvx.Problem(obj, const).solve()

    plt.subplot(len(beta), 1, i+1)
    plt.plot(t, x_cor, '-', linewidth = 1, alpha = 0.5)
    plt.plot(t, x_reconst.value, 'r', linewidth = 2)
    plt.ylabel(r'$\beta = {}$'.format(beta[i]), fontsize = 15)

plt.show()
5.2. Signal with Sharp Transition + Noise¶
The previous example demonstrated that the optimization techniques we learned from linear regression can also be effectively applied to the problem of denoising continuous and corrupted signals.
Now, suppose we are dealing with a signal $x$ that is mostly smooth, but contains a few sharp transitions - such as sudden jumps or step-like changes. These are common in real-world signals, such as changes in system states, edges in images, or abrupt shifts in sensor readings.
n = 200
t = np.arange(n).reshape(-1,1)

# Piecewise-constant component (sharp transitions) + slow sinusoid, then add noise
exact = np.vstack([np.ones([50,1]), -np.ones([50,1]), np.ones([50,1]), -np.ones([50,1])])
x = exact + 0.5*np.sin((2*np.pi/n)*t)
x_cor = x + 0.1*np.random.randn(n,1)

plt.figure(figsize = (6, 4))
plt.subplot(2,1,1)
plt.plot(t, x)
plt.ylim([-2.0,2.0])
plt.ylabel('signal', fontsize = 15)
plt.subplot(2,1,2)
plt.plot(t, x_cor, linewidth = 1)
plt.ylabel('corrupted signal', fontsize = 15)
plt.xlabel('t', fontsize = 15)
plt.show()
5.2.1. L2 Norm (Quadratic Smoothing)¶
Let's start by applying the same L2-based smoothing technique used before. We'll observe how it performs when the underlying signal includes sharp transitions.
plt.figure(figsize = (6, 6))

beta = [0.5, 2, 4]

for i in range(len(beta)):
    x_reconst = cvx.Variable([n,1])
    obj = cvx.Minimize(cvx.norm(x_reconst[1:n] - x_reconst[0:n-1], 2))
    const = [cvx.norm(x_reconst - x_cor, 2) <= beta[i]]
    prob = cvx.Problem(obj, const).solve()

    plt.subplot(len(beta), 1, i+1)
    plt.plot(t, x_cor, linewidth = 1, alpha = 0.5)
    plt.plot(t, x_reconst.value, 'r', linewidth = 2)
    plt.ylabel(r'$\beta = {}$'.format(beta[i]), fontsize = 15)

plt.show()
Quadratic smoothing (an L2 norm penalty) smooths out both the noise and the sharp transitions in the signal, which is not what we want.
Why?
If we apply the same quadratic smoothing method (based on the L2 norm of differences between adjacent points), we will indeed reduce noise. However, this method also tends to:
Smooth out the sharp transitions in the signal
Blur the regions that should remain distinct
Fail to capture important structural information
This is because the quadratic penalty strongly discourages large differences between adjacent values, regardless of whether those differences are due to noise or actual signal features.
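A small numerical illustration (an addition to the original text) makes this concrete, and previews the L1 penalty introduced next: a squared penalty is far cheaper when a change is spread over many small steps, while an absolute-value penalty charges the same total either way.
# L2 vs. L1 penalty on a sharp jump (illustrative addition)
jump   = np.array([2.0])      # one sharp transition of height 2
spread = np.full(10, 0.2)     # the same total change spread over 10 small steps

print(np.sum(jump**2), np.sum(spread**2))              # L2 penalty: 4.0 vs 0.4 -> spreading (blurring) is cheaper
print(np.sum(np.abs(jump)), np.sum(np.abs(spread)))    # L1 penalty: 2.0 vs 2.0 -> sharp jumps cost no more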
5.2.2. L1 Norm¶
We can instead apply an L1 norm penalty to the differences of the signal by solving
$$\min \; \lVert X - X_{\text{cor}} \rVert_2 + \lambda \sum_{i=1}^{n-1} \;\lvert x_{i+1}-x_i \rvert $$
Where
The first term ensures the denoised signal stays close to the corrupted signal.
The second term enforces sparsity in differences, encouraging flat regions and allowing for a few sharp changes.
The regularization parameter $\lambda$ controls the balance between data fidelity and smoothness.
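The sketch below (an added cell, not part of the original) implements this penalized form directly; $\lambda = 2$ is an arbitrary choice, and n, t, and x_cor are assumed from the cells above.
# Penalized (lambda-weighted) total variation form (illustrative addition)
lam = 2

x_tv = cvx.Variable([n,1])
obj = cvx.Minimize(cvx.norm(x_tv - x_cor, 2) + lam*cvx.norm(x_tv[1:n] - x_tv[0:n-1], 1))
cvx.Problem(obj).solve()

plt.figure(figsize = (6, 2))
plt.plot(t, x_cor, linewidth = 1, alpha = 0.5, label = 'corrupted')
plt.plot(t, x_tv.value, 'r', linewidth = 2, label = 'reconstructed')
plt.title(r'$\lambda = {}$'.format(lam), fontsize = 15)
plt.legend(fontsize = 12)
plt.show()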
An alternative form focuses on enforcing a hard constraint on data fidelity, while minimizing total variation:
$$ \begin{align*} \min \quad & \lVert DX \rVert_{\color{red} 1} \\ \text{subject to} \quad & \lVert X - X_{\text{cor}} \rVert_2 \leq \beta \end{align*} $$
plt.figure(figsize = (6, 6))

beta = [0.5, 2, 4]

for i in range(len(beta)):
    x_reconst = cvx.Variable([n,1])
    obj = cvx.Minimize(cvx.norm(x_reconst[1:n] - x_reconst[0:n-1], 1))
    const = [cvx.norm(x_reconst - x_cor, 2) <= beta[i]]
    prob = cvx.Problem(obj, const).solve()

    plt.subplot(len(beta), 1, i+1)
    plt.plot(t, x_cor, linewidth = 1, alpha = 0.5)
    plt.plot(t, x_reconst.value, 'r', linewidth = 2)
    plt.ylabel(r'$\beta = {}$'.format(beta[i]), fontsize = 15)

plt.show()
Consider the results presented above and reflect on why they behave the way they do.
Total Variation (TV) Smoothing
Total Variation (TV) smoothing is a powerful and widely used technique in signal and image processing. Its goal is to reduce noise while preserving important structural details - especially sharp transitions, such as edges or abrupt changes in the signal.
Unlike traditional smoothing methods (e.g., those based on the L2 norm), which often introduce unwanted blurring across the entire signal, TV smoothing excels at maintaining sharp features while still achieving effective denoising.
Why TV Works Well for Sharp Transitions
One of the most notable strengths of TV smoothing is its ability to preserve discontinuities in the data:
In image processing: TV helps maintain edges, where pixel intensities change suddenly.
In time-series or 1D signals: It retains jumps or step-like behaviors, which might represent important events or transitions.
This is possible because TV denoising uses L1 regularization on signal differences, which:
Penalizes large changes, but less aggressively than the L2 norm
Encourages sparsity in the derivative (i.e., most differences are zero, but a few sharp jumps are allowed)
By balancing noise reduction with the preservation of sharp transitions, TV smoothing provides a practical and robust solution for many real-world denoising problems.
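A quick check (an added cell, reusing the last x_reconst from the L1 loop above) confirms this sparsity: most consecutive differences of the TV-denoised signal are essentially zero, with only a handful of larger jumps. The 1e-3 threshold is an arbitrary choice, and exact counts depend on solver tolerance.
# Sparsity of the differences of the TV-denoised signal (illustrative addition)
d = np.diff(x_reconst.value.flatten())
print('differences larger than 1e-3:', np.sum(np.abs(d) > 1e-3), 'out of', d.size)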
6. Example: Total Variation Image Reconstruction¶
In this final example, we apply Total Variation (TV) smoothing to an image - a practical demonstration of how regression principles can extend beyond standard prediction tasks into areas like image denoising and signal reconstruction.
from google.colab import drive
drive.mount('/content/drive')

import cv2

# Load the image in grayscale (flag 0) and resize it
imbw = cv2.imread('/content/drive/MyDrive/ML/ML_data/dog.jpg', 0)

row = 150
col = 150
resized_imbw = cv2.resize(imbw, (row, col))

plt.figure(figsize = (6, 6))
plt.imshow(resized_imbw, 'gray')
plt.axis('off')
plt.show()
Question: What kind of image do you expect if we apply an L1 norm on the image gradients?
$$ \begin{align*} \min \quad & \lVert DX \rVert_{\color{red} 1} \\ \text{subject to} \quad & \lVert X - X_{\text{cor}} \rVert_2 \leq \beta \end{align*} $$
n = row*col
imbws = resized_imbw.reshape(-1, 1)     # flatten the image into a single column vector

beta = 1500

x = cvx.Variable([n,1])
obj = cvx.Minimize(cvx.norm(x[1:n] - x[0:n-1], 1))   # L1 norm of differences along the flattened pixels
const = [cvx.norm(x - imbws, 2) <= beta]
prob = cvx.Problem(obj, const).solve()

imbwr = x.value.reshape(row, col)

plt.figure(figsize = (6, 6))
plt.imshow(imbwr, 'gray')
plt.axis('off')
plt.show()
Applying this optimization yields an image that is:
Denoised, with noise and small pixel variations suppressed
Sharp, as edges and contours are preserved
Piecewise smooth, with flat regions where possible and sharp transitions where necessary
This example demonstrates how regression principles extend beyond conventional predictive models and can be applied effectively to complex tasks such as image denoising and signal reconstruction.
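The cell above penalizes differences only along the flattened pixel vector. As a final sketch (an addition, reusing resized_imbw, row, col, and beta from above), CVXPY's built-in tv atom applies total variation along both image axes:
# 2D total variation with CVXPY's tv atom (illustrative addition)
X = cvx.Variable((row, col))
obj = cvx.Minimize(cvx.tv(X))                           # differences along rows and columns
const = [cvx.norm(X - resized_imbw, 'fro') <= beta]     # keep the result close to the observed image
cvx.Problem(obj, const).solve()

plt.figure(figsize = (6, 6))
plt.imshow(X.value, 'gray')
plt.axis('off')
plt.show()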