Regression

Table of Contents

1. Linear Regression
2. Multivariate Linear Regression
3. Nonlinear Regression
4. Overfitting
5. Example: De-noising Signal
6. Example: Total Variation Image Reconstruction


Regression is a fundamental concept in machine learning, playing a crucial role in understanding and predicting continuous outcomes based on input features. Unlike classification, which assigns data points to discrete categories, regression models aim to establish a relationship between independent variables (features) and a continuous dependent variable (target).

At its core, regression seeks to capture patterns in data and predict numerical values by minimizing the error between the predicted values and the actual values. This makes regression indispensable for a wide range of applications.


In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('IRrGVQV8vZ8', width = "560", height = "315")
Out[ ]:

Consider a linear regression


  • Given $\; \begin{cases}x_{i} \; \text{: inputs} \\y_{i} \; \text{: outputs}\end{cases}$

$$x= \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{m} \end{bmatrix}, \qquad y= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix}$$
  • $ \hat{y}_{i} $: predicted output

$$ \hat{y}_{i} = f(x_{i}\,; \theta) \; \text{ in general}$$
  • In many cases, a linear model is used to predict $y_{i}$

  • Unknown model parameters $ \theta = \begin{bmatrix}\theta_{0} \\\theta_{1} \\\end{bmatrix}$

$$ y_i \approx \hat{y}_{i} = \theta_{0} + \theta_{1}x_{i} \; \quad \text{ such that }\quad \min\limits_{\theta_{0}, \theta_{1}}\sum\limits_{i = 1}^{m} \left(\hat{y}_{i} - y_{i} \right)^2$$
  • Need to find $\theta_{0}$ and $\theta_{1}$

$$ y= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix} \approx \begin{bmatrix} \hat y_{1} \\ \hat y_{2} \\ \vdots \\ \hat y_{m} \end{bmatrix} = \theta_{0} \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} + \theta_{1} \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{m} \end{bmatrix}$$





1.1. Re-cast the Problem as Least Squares

  • For convenience, we define a function that maps inputs to feature vectors, $\phi$

$$\begin{array}{Icr}\begin{align*} \hat{y}_{i} & = \theta_0 + x_i \theta_1 = 1 \cdot \theta_0 + x_i \theta_1 \\ \\ & = \begin{bmatrix}1 & x_{i}\end{bmatrix}\begin{bmatrix}\theta_{0} \\ \theta_{1}\end{bmatrix} \\\\ & =\begin{bmatrix}1 \\ x_{i} \end{bmatrix}^{T}\begin{bmatrix}\theta_{0} \\ \theta_{1}\end{bmatrix} \\\\ & =\phi^{T}(x_{i})\theta \end{align*}\end{array} \begin{array}{Icr} \quad \quad \text{feature vector} \; \phi(x_{i}) = \begin{bmatrix}1 \\ x_{i}\end{bmatrix} \end{array}$$



$$\Phi = \begin{bmatrix}1 & x_{1} \\ 1 & x_{2} \\ \vdots \\1 & x_{m} \end{bmatrix}=\begin{bmatrix}\phi^T(x_{1}) \\\phi^T(x_{2}) \\\vdots \\\phi^T(x_{m}) \end{bmatrix} \quad \implies \quad \hat{y} = \begin{bmatrix}\hat{y}_{1} \\\hat{y}_{2} \\\vdots \\\hat{y}_{m}\end{bmatrix}=\Phi\theta$$
  • Optimization problem (or least squares)

$$\min\limits_{\theta_{0}, \theta_{1}}\sum\limits_{i = 1}^{m} (\hat{y}_{i} - y_{i})^2 =\color{red}{\min\limits_{\theta}\lVert\Phi\theta-y\rVert^2_2} \qquad \qquad \left(\text{same as} \; \min_{x} \lVert Ax-b \rVert_2^2 \right)$$



$$ \text{solution} \; \theta^* = (A^TA)^{-1}A^T b $$
  • Why? (here, $A_i$ means the $i$th row of matrix $A$)

$$\sum_{i=1}^{m} \lvert \underbrace{ A_i x - b_i}_{\text{scalar}} \rvert^2 = \left\| \begin{bmatrix} A_1 x - b_1 \\ \vdots \\ A_m x - b_m \end{bmatrix} \right\|^2 = \left\| \begin{bmatrix} A_1 \\ \vdots \\ A_m \end{bmatrix} x - \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix} \right\|^2 = \lVert \underbrace{Ax-b}_{\text{vector}} \rVert_2^2$$

Note




$$\begin{array}{Icr} \text{input} \\ x_{i} \end{array} \quad \rightarrow \quad \begin{array}{Icr} \text{feature} \\ \begin{bmatrix} 1 \\ x_{i} \end{bmatrix} \end{array} \quad \rightarrow \quad \begin{array}{Icr} \text{predicted output} \\ \hat{y}_{i} \end{array}$$
$$\begin{array}{Icr} \begin{bmatrix} 1 & x_{1} \\1 & x_{2}\\\vdots & \vdots\\1 & x_{m}\end{bmatrix}\begin{bmatrix}\theta_0\\\theta_1\end{bmatrix}=\begin{bmatrix}y_{1} \\y_{2} \\\vdots \\y_{m}\end{bmatrix} \\ \begin{array}{Icr} \uparrow \\ \vec{A}_1 \end{array} \;\; \begin{array}{Icr} \uparrow \\ \vec{A}_2 \end{array} \quad \begin{array}{Icr} \uparrow \\ \vec{x} \end{array} \quad\quad \;\; \begin{array}{Icr} \uparrow \\ \vec{B} \end{array} \end{array} \quad \begin{array}{Icr} \quad \text{over-determined or} \\ \quad \text{projection} \end{array}$$
$$A(= \Phi) = \left[ \vec{A}_1 \;\; \vec{A}_2 \right]$$

1.2. Solve the Optimization in Linear Regression

1.2.1. Use Linear Algebra

  • known as least squares

$$ \theta = (A^TA)^{-1}A^T y $$
In [84]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [85]:
# data points in column vector [input, output]
x = np.array([0.1, 0.4, 0.7, 1.2, 1.3, 1.7, 2.2, 2.8, 3.0, 4.0, 4.3, 4.4, 4.9]).reshape(-1, 1)
y = np.array([0.5, 0.9, 1.1, 1.5, 1.5, 2.0, 2.2, 2.8, 2.7, 3.0, 3.5, 3.7, 3.9]).reshape(-1, 1)

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', alpha = 0.3)
plt.xlabel('X')
plt.ylabel('Y')
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
In [86]:
m = y.shape[0]
#A = np.hstack([np.ones([m, 1]), x])
A = np.hstack([x**0, x])
A = np.asmatrix(A)

theta = (A.T*A).I*A.T*y

print('theta:\n', theta)
theta:
 [[0.65306531]
 [0.67129519]]
In [87]:
# to plot
plt.figure(figsize = (6, 4))
plt.xlabel('X')
plt.ylabel('Y')
plt.plot(x, y, 'ko', label = "data", alpha = 0.3)

# to plot a straight line (fitted line)
xp = np.arange(0, 5, 0.01).reshape(-1, 1)
yp = theta[0,0] + theta[1,0]*xp

plt.plot(xp, yp, 'r', linewidth = 2, label = "regression")
plt.legend()
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()

1.2.2. Use Gradient Descent


$$\min_{\theta} ~ \lVert \hat y - y \rVert_2^2 = \min_{\theta} ~ \lVert A\theta - y \rVert_2^2$$
$$ \begin{align*} f &= (A\theta-y)^T(A\theta-y) = (\theta^TA^T-y^T)(A\theta-y) \\ &= \theta^TA^TA\theta - \theta^TA^Ty - y^TA\theta + y^Ty \\\\ \nabla f &= A^TA\theta + A^TA\theta - A^Ty - A^Ty = 2(A^TA\theta - A^Ty) \end{align*} $$
$$ \theta \leftarrow \theta - \alpha \nabla f $$
In [88]:
theta = np.random.randn(2,1)
theta = np.asmatrix(theta)

alpha = 0.001

for _ in range(1000):
    df = 2*(A.T*A*theta - A.T*y)
    theta = theta - alpha*df

print(theta)
[[0.65261835]
 [0.67143025]]
In [89]:
# to plot
plt.figure(figsize = (6, 4))
plt.xlabel('X')
plt.ylabel('Y')
plt.plot(x, y, 'ko', label = "data", alpha = 0.3)

# to plot a straight line (fitted line)
xp = np.arange(0, 5, 0.01).reshape(-1, 1)
yp = theta[0,0] + theta[1,0]*xp

plt.plot(xp, yp, 'r', linewidth = 2, label = "regression")
plt.legend()
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()

1.2.3. Use CVXPY Optimization


$$\min_{\theta} ~ \lVert \hat y - y \rVert_2 = \min_{\theta} ~ \lVert A\theta - y \rVert_2$$
In [90]:
import cvxpy as cvx

theta2 = cvx.Variable([2, 1])
obj = cvx.Minimize(cvx.norm(A@theta2 - y, 2))
cvx.Problem(obj,[]).solve()

print('theta:\n', theta2.value)
theta:
 [[0.65306531]
 [0.67129519]]
In [91]:
# to plot straight lines (fitted lines)
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', label = 'data', alpha = 0.3)
xp = np.arange(0, 5, 0.01).reshape(-1,1)
yp2 = theta2.value[0,0] + theta2.value[1,0]*xp

plt.plot(xp, yp2, 'r', linewidth = 2, label = '$L_2$')
plt.axis('equal')
plt.legend(loc = 5)
plt.grid(alpha = 0.3)
plt.show()

1.2.4. Scikit-Learn


  • Machine Learning in Python
  • Simple and efficient tools for data mining and data analysis
  • Accessible to everybody, and reusable in various contexts
  • Built on NumPy, SciPy, and matplotlib
  • Open source, commercially usable - BSD license
  • https://scikit-learn.org/stable/index.html#
In [92]:
from sklearn import linear_model
In [93]:
reg = linear_model.LinearRegression()
reg.fit(x, y)
Out[93]:
LinearRegression()
In [94]:
reg.coef_
Out[94]:
array([[0.67129519]])
In [95]:
reg.intercept_
Out[95]:
array([0.65306531])
In [96]:
# to plot
plt.figure(figsize = (6, 4))
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.plot(x, y, 'ko', label = "data", alpha = 0.3)

# to plot a straight line (fitted line)
plt.plot(xp, reg.predict(xp), 'r', linewidth = 2, label = "regression")
plt.legend(fontsize = 12)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()

We demonstrated four different implementations of linear regression: least squares, gradient descent, CVXPY, and Scikit-Learn.



1.2.5. L1 Norm

In the objective function, the error is defined as the difference between the predicted value and the ground truth. To measure the magnitude or strength of this error, the L2 norm is typically used, as it provides a smooth and differentiable measure of error, making it convenient for optimization.

However, as we have learned, there are many different norms available. It is often interesting to explore what happens when a different norm, such as the L1 norm, is used instead. The choice of norm can significantly impact the behavior and properties of the optimization process, affecting factors such as robustness, sparsity, and computational efficiency.


Let's use $L_1$ norm


$$ \min_{\theta} ~ \lVert \hat y - y \rVert_1 = \min_{\theta} ~ \lVert A\theta - y \rVert_1 $$

You might wonder how to solve this optimization problem since the L1 norm is used in the objective function. However, we will use CVXPY to find the optimal solution. You do not need to worry too much about the algorithm running beneath CVXPY.


In [97]:
theta1 = cvx.Variable([2, 1])
obj = cvx.Minimize(cvx.norm(A@theta1 - y, 1))
cvx.Problem(obj).solve()

print('theta:\n', theta1.value)
theta:
 [[0.6258208]
 [0.685448 ]]
In [98]:
# to plot data
plt.figure(figsize = (6, 4))
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.plot(x, y, 'ko', label = 'data', alpha = 0.3)

# to plot straight lines (fitted lines)
xp = np.arange(0, 5, 0.01).reshape(-1, 1)
yp1 = theta1.value[0,0] + theta1.value[1,0]*xp
yp2 = theta2.value[0,0] + theta2.value[1,0]*xp

plt.plot(xp, yp1, 'b', linewidth = 2, label = '$L_1$')
plt.plot(xp, yp2, 'r', linewidth = 2, label = '$L_2$')
plt.legend()
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()

Interestingly, the $L_1$ norm also provides a reasonable linear approximation. Therefore, at this point, you might conclude that the choice of distance definition does not significantly impact the results. However, let’s examine the following example, which includes outliers in the dataset, to further explore this effect.


What if outliers exist?

  • Fit with the different norms
  • Discuss the result
  • It is important to understand what makes them different.

In [99]:
# add outliers
x = np.vstack([x, np.array([0.5, 3.8]).reshape(-1, 1)])
y = np.vstack([y, np.array([3.0, 1.0]).reshape(-1, 1)])

A = np.hstack([x**0, x])
A = np.asmatrix(A)
In [100]:
A.shape
Out[100]:
(15, 2)
In [102]:
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', label = 'data', alpha = 0.3)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
In [103]:
theta2 = cvx.Variable([2, 1])
obj2 = cvx.Minimize(cvx.norm(A@theta2 - y, 2))
prob2 = cvx.Problem(obj2).solve()
In [104]:
# to plot straight lines (fitted lines)
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', label = 'data', alpha = 0.3)
xp = np.arange(0, 5, 0.01).reshape(-1,1)
yp2 = theta2.value[0,0] + theta2.value[1,0]*xp

plt.plot(xp, yp2, 'r', linewidth = 2, label = '$L_2$')
plt.axis('equal')
plt.legend(loc = 5)
plt.grid(alpha = 0.3)
plt.show()
In [108]:
theta1 = cvx.Variable([2, 1])
obj1 = cvx.Minimize(cvx.norm(A@theta1 - y, 1))
prob1 = cvx.Problem(obj1).solve()
In [109]:
# to plot straight lines (fitted lines)
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', label = 'data', alpha = 0.3)
xp = np.arange(0, 5, 0.01).reshape(-1,1)
yp1 = theta1.value[0,0] + theta1.value[1,0]*xp

plt.plot(xp, yp1, 'b', linewidth = 2, label = '$L_1$')
plt.axis('equal')
plt.legend(loc = 5)
plt.grid(alpha = 0.3)
plt.show()
In [110]:
# to plot data
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', label = 'data', alpha = 0.3)
plt.title('$L_1$ and $L_2$ Regression w/ Outliers', fontsize = 12)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)

# to plot straight lines (fitted lines)
xp = np.arange(0, 5, 0.01).reshape(-1,1)
yp1 = theta1.value[0,0] + theta1.value[1,0]*xp
yp2 = theta2.value[0,0] + theta2.value[1,0]*xp

plt.plot(xp, yp1, 'b', linewidth = 2, label = '$L_1$')
plt.plot(xp, yp2, 'r', linewidth = 2, label = '$L_2$')
plt.axis('scaled')
plt.legend(loc = 5)
plt.grid(alpha = 0.3)
plt.show()

The key differences between these two regression approaches arise from how they handle outliers in the data.


(1) Sensitivity to Outliers in L2-Norm Regression

  • The L2 norm minimizes the sum of squared errors:


    $$\min ~ \sum \left( \hat y - y \right)^2 $$
  • Squaring the residuals magnifies the effect of large errors, meaning that outliers have a significant influence on the final regression line.

  • As a result, the L2 regression line (red) is pulled towards the outliers, making it less robust.


(2) Robustness of L1-Norm Regression

  • The L1 norm minimizes the sum of absolute errors:


    $$\min ~ \sum \lvert \hat y - y \rvert $$
  • Since absolute differences are not squared, large errors (outliers) contribute less to the total loss compared to the L2 norm.

  • This makes the L1 regression line (blue) less sensitive to outliers, leading to a more robust fit that better represents the majority of the data.
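
To see numerically why squaring magnifies large errors, the following minimal sketch uses hypothetical residual values (chosen purely for illustration) and compares the share of the total L1 and L2 losses contributed by a single outlier.

In [ ]:
# hypothetical residuals: four small errors and one large outlier
import numpy as np

residuals = np.array([0.1, -0.2, 0.15, -0.1, 5.0])

l1_contrib = np.abs(residuals)    # contributions to the L1 loss
l2_contrib = residuals**2         # contributions to the L2 loss

# the outlier accounts for a far larger share of the squared loss,
# which is why the L2 fit is pulled toward it
print('outlier share of L1 loss: {:.2f}'.format(l1_contrib[-1]/l1_contrib.sum()))
print('outlier share of L2 loss: {:.2f}'.format(l2_contrib[-1]/l2_contrib.sum()))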


2. Multivariate Linear Regression

Multivariate Linear Regression is an extension of simple linear regression where multiple independent variables (features) are used to predict a dependent variable. This is useful in scenarios where multiple factors influence an outcome.


2.1. Linear Regression for Multivariate Data


$$ \hat{y} = \theta_0 + \theta_{1}x_1 + \theta_{2}x_2 $$



$$\phi \left(x^{(i)}\right) = \begin{bmatrix}1\\x^{(i)}_{1}\\x^{(i)}_{2} \end{bmatrix}$$
In [ ]:
# for 3D plot
from mpl_toolkits.mplot3d import Axes3D
In [ ]:
# y = theta0 + theta1*x1 + theta2*x2 + noise

n = 200
x1 = np.random.randn(n, 1)
x2 = np.random.randn(n, 1)
noise = 0.5*np.random.randn(n, 1);

y = 2 + 1*x1 + 3*x2 + noise

fig = plt.figure(figsize = (8, 6))
ax = fig.add_subplot(1, 1, 1, projection = '3d')
ax.set_title('Generated Data', fontsize = 12)
ax.set_xlabel('$X_1$', fontsize = 12)
ax.set_ylabel('$X_2$', fontsize = 12)
ax.set_zlabel('Y', fontsize = 12)
ax.scatter(x1, x2, y, marker = '.', label = 'Data')
#ax.view_init(30,30)
plt.legend(fontsize = 12)
plt.show()



$$\Phi = \begin{bmatrix}1 & x_{1}^{(1)} & x_{2}^{(1)}\\1 & x_{1}^{(2)} & x_{2}^{(2)}\\ \vdots \\1 & x_{1}^{(m)} & x_{2}^{(m)} \end{bmatrix} \quad \implies \quad \hat{y} = \begin{bmatrix}\hat{y}^{(1)} \\\hat{y}^{(2)} \\\vdots \\\hat{y}^{(m)}\end{bmatrix}=\Phi\theta$$



$$\implies \theta^{*} = (\Phi^T \Phi)^{-1} \Phi^T y$$

In matrix form, the least-squares solution remains identical in its formulation. The only difference is that when extending to multivariate regression, more columns are included in the feature matrix $\Phi$, representing additional independent variables.


In [ ]:
A = np.hstack([np.ones((n, 1)), x1, x2])
A = np.asmatrix(A)
theta = (A.T*A).I*A.T*y

X1, X2 = np.meshgrid(np.arange(np.min(x1), np.max(x1), 0.5),
                     np.arange(np.min(x2), np.max(x2), 0.5))
YP = theta[0,0] + theta[1,0]*X1 + theta[2,0]*X2

fig = plt.figure(figsize = (8, 6))
ax = fig.add_subplot(1, 1, 1, projection = '3d')
ax.set_title('Regression', fontsize = 12)
ax.set_xlabel('$X_1$', fontsize = 12)
ax.set_ylabel('$X_2$', fontsize = 12)
ax.set_zlabel('Y', fontsize = 12)
ax.scatter(x1, x2, y, marker = '.', label = 'Data')
ax.plot_wireframe(X1, X2, YP, color = 'k', alpha = 0.3, label = 'Regression Plane')
#ax.view_init(30,30)
plt.legend(fontsize = 12)
plt.show()

2.2. Feature (or Input) Importance in Multivariate Linear Regression

In multivariate linear regression, multiple features (inputs) contribute to predicting the target variable. However, not all features are equally important. Understanding feature importance helps in model interpretability, feature selection, and improving prediction performance.


$$ \hat{y} = \theta_0 + \theta_{1}x_1 + \theta_{2}x_2 + \cdots $$

The importance of a feature is directly related to the magnitude of its coefficient $\theta_i$. A larger absolute value of $\theta_i$ suggests a stronger influence on the target variable.

  • The larger the absolute value $\lvert \theta_i \rvert$, the more impact the feature has on the predicted outcome.

  • Potential issue: Different features may have different scales, which can distort comparisons.
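
One common remedy, sketched below on synthetic data (the two features, their scales, and the coefficients are illustrative assumptions, not taken from the examples above), is to standardize the features before fitting so that coefficient magnitudes become directly comparable; scikit-learn's StandardScaler can be used for this.

In [ ]:
# illustrative sketch: standardize features so that coefficient magnitudes are comparable
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

np.random.seed(0)
x1_demo = np.random.randn(200, 1)          # feature on a small scale
x2_demo = 100*np.random.randn(200, 1)      # second feature on a much larger scale
y_demo = 2 + 3*x1_demo + 0.03*x2_demo + 0.5*np.random.randn(200, 1)

X_demo = np.hstack([x1_demo, x2_demo])

raw = LinearRegression().fit(X_demo, y_demo)
std = LinearRegression().fit(StandardScaler().fit_transform(X_demo), y_demo)

print('raw coefficients:         ', raw.coef_)   # scales distort the comparison
print('standardized coefficients:', std.coef_)   # magnitudes now comparable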


2.3. Feature Selection in Multivariate Linear Regression

In multivariate linear regression, some features may have little to no impact on the target variable. If a feature’s coefficient is relatively small compared to others, it contributes minimally to the prediction and can be eliminated to simplify the model. Simplifying the model leads to better generalization and computational efficiency.


Example

Before Feature Selection:


$$ y = 3.2x_1 + 0.01x_2 + 2.5x_3 + 0.0001x_4 $$

Here, $x_2$ and $x_4$ have negligible impact.


After Feature Selection:


$$ y \approx 3.2x_1 + 2.5x_3 $$

Simpler model, easier to interpret, and more efficient.
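
A minimal sketch of this selection step, using the coefficients from the example above and a hand-picked threshold (an illustrative assumption), keeps only the features whose coefficient magnitude exceeds the threshold.

In [ ]:
# illustrative sketch: keep only features whose coefficient magnitude exceeds a threshold
import numpy as np

theta_example = np.array([3.2, 0.01, 2.5, 0.0001])   # coefficients of x1, x2, x3, x4 from the example
threshold = 0.1                                      # hand-picked cut-off, for illustration only

selected = np.where(np.abs(theta_example) > threshold)[0]
print('selected features:', ['x{}'.format(i + 1) for i in selected])   # -> ['x1', 'x3']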



3. Nonlinear Regression

Linear regression is a foundational technique in statistical modeling that assumes a straight-line (linear) relationship between the independent variables and the dependent variable. It finds the line that best fits the data points by minimizing the sum of squared errors. This approach works well if the underlying relationship in the data is approximately linear. However, when the data exhibit a nonlinear pattern, a straight line may not adequately capture the relationship, leading to large errors or systematic patterns in the residuals.

When a linear model is insufficient, one should consider using nonlinear regression techniques that can accommodate curvature or other complex nonlinear behaviors in the data. In such cases, it is natural to transition from a linear regression approach to a nonlinear regression model in order to better fit the data.

This section explores how to construct a regression model capable of approximating non-linearly distributed data.

(= Linear Regression for Non-linear Data)



3.1. Nonlinear Regression with Polynomial Features

One of the simplest approaches to nonlinear regression is polynomial regression, which extends linear regression by incorporating polynomial terms to capture curvature in the data.

Consider a dataset consisting of paired data points $(x_i, y_i)$, where each $x_i$ is an independent variable and $y_i$ is the corresponding dependent variable. Instead of modeling $y$ as a simple linear function of $x$, polynomial regression extends the feature set by including higher-order terms:


$$ \begin{align*} y &= \theta_0 + \theta_1 x + \theta_2 x^2 + \text{noise} \end{align*} $$



$$\phi(x_{i}) = \begin{bmatrix}1\\x_{i}\\x_{i}^2 \end{bmatrix}$$

Here, a quadratic model is used as an example, and $\Phi$ is constructed using powers of $x$, forming an augmented feature space:


$$\Phi = \begin{bmatrix}1 & x_{1} & x_{1}^2 \\ 1 & x_{2} & x_{2}^2 \\ \vdots \\ 1 & x_{m} & x_{m}^2\end{bmatrix} \quad \implies \quad \hat{y} = \begin{bmatrix}\hat{y}_1 \\\hat{y}_2 \\\vdots \\\hat{y}_m\end{bmatrix}=\Phi\theta$$

where each row represents an observation, and each column corresponds to a different polynomial term.


Although the equation appears nonlinear in terms of $x$, it remains linear in terms of the unknown parameters $\theta$. The model is still solvable using linear regression techniques, such as the least-squares method:


$$\implies \theta^{*} = (\Phi^T \Phi)^{-1} \Phi^T y$$
In [ ]:
# y = theta0 + theta1*x + theta2*x^2 + noise

n = 100
x = -5 + 15*np.random.rand(n, 1)
noise = 10*np.random.randn(n, 1)

y = 10 + 1*x + 2*x**2 + noise

plt.figure(figsize = (6, 4))
plt.xlabel('X')
plt.ylabel('Y')
plt.plot(x, y, 'o', markersize = 4, alpha = 0.3)
plt.grid(alpha = 0.3)
plt.show()
In [ ]:
A = np.hstack([x**0, x, x**2])
A = np.asmatrix(A)

theta = (A.T*A).I*A.T*y
print('theta:\n', theta)
theta:
 [[11.47683907]
 [ 1.07754213]
 [ 1.93265672]]
In [ ]:
xp = np.linspace(np.min(x), np.max(x))
yp = theta[0,0] + theta[1,0]*xp + theta[2,0]*xp**2

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', markersize = 4, alpha = 0.3)
plt.plot(xp, yp, 'r', linewidth = 2)

plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.show()

By understanding this example, we can effectively model nonlinear patterns using polynomial regression while leveraging the simplicity of linear regression methods.


3.2. Linear Basis Function Models

In the above section, we introduced polynomial regression. However, here, we interpret the same method from a slightly different perspective. We have learned that matrix-vector multiplication can be thought of as a linear combination of the columns of the matrix, with coefficients given by the vector. If $\Phi$ consists of columns representing basis functions, then $\Phi \theta$ represents one of the functions that is spanned by these basis functions. The least-squares solution $\theta^*$ will provide the optimal coefficients for this function.

This approach is referred to as linear basis function models. With this perspective, polynomial regression can be viewed as a special case of linear basis function models, where the basis functions are polynomials.


Consider linear combinations of fixed nonlinear functions of the input variables, of the form


$$ \begin{bmatrix} 1 & x_{1} & x_1^2\\1 & x_{2} & x_2^2\\\vdots & \vdots\\1 & x_{m} & x_m^2 \end{bmatrix} \begin{bmatrix}\theta_0\\\theta_1 \\ \theta_2 \end{bmatrix} \quad \implies \quad \begin{bmatrix} \mid & \mid & \mid \\ b_0(x) & b_1(x) & b_2(x)\\ \mid & \mid & \mid \end{bmatrix} \begin{bmatrix}\theta_0\\\theta_1 \\ \theta_2 \end{bmatrix} $$
$$ \hat{y}=\sum_{i=0}^d{\theta_i b_i(x)} = \Phi \theta$$

1) Polynomial functions


$$b_i(x) = x^i, \quad i = 0,\cdots,d$$
In [ ]:
xp = np.arange(-1, 1, 0.01).reshape(-1, 1)
polybasis = np.hstack([xp**i for i in range(6)])

plt.figure(figsize = (6, 4))

for i in range(6):
    plt.plot(xp, polybasis[:,i], label = '$x^{}$'.format(i))

plt.title('Polynomial Basis')
plt.xlabel('X')
plt.ylabel('Y')
plt.axis([-1, 1, -1.1, 1.1])
plt.grid(alpha = 0.3)
plt.legend()
plt.show()

For instance, if the coefficients are $[-1, 1, 1, -1, 1, -1]^T$ with the polynomial basis, the nonlinear function would be:


$$\hat y = -1 + 1 \cdot x + 1 \cdot x^2 - 1 \cdot x^3 + 1 \cdot x^4 -1 \cdot x^5$$
In [ ]:
theta = np.array([-1, 1, 1, -1, 1, -1])
yhat = polybasis.dot(theta)

plt.figure(figsize = (6, 4))

plt.plot(xp, yhat)
plt.xlabel('X')
plt.ylabel('Y')
plt.axis([-1, 1, -1.1, 1.1])
plt.grid(alpha = 0.3)
plt.show()

​The form of a nonlinear function is determined by the specific values of its coefficients. Consequently, solving a nonlinear regression problem involves identifying the set of coefficients that enable the function to most accurately approximate the given data points.



2) RBF functions

A Radial Basis Function (RBF) is a type of basis function whose value depends only on the distance between the input and a fixed center point.

  • Gaussian RBF (most commonly used) with bandwidth $\sigma$ and $k$ RBF centers $\mu_i \in \mathbb{R}^n$

$$ b_i(x) = \exp \left( - \frac{\lVert x-\mu_i \rVert^2}{2\sigma^2}\right) $$
  • Useful for capturing localized patterns

In [ ]:
d = 9

u = np.linspace(-1, 1, d)
sigma = 0.2

rbfbasis = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])

plt.figure(figsize = (6, 4))

for i in range(d):
    plt.plot(xp, rbfbasis[:,i], label='$\mu = {}$'.format(u[i]))

plt.title('RBF basis')
plt.xlabel('X')
plt.ylabel('Y')
plt.axis([-1, 1, -0.1, 1.1])
plt.legend(loc = 'lower right')
plt.grid(alpha = 0.3)
plt.show()
In [ ]:
theta = np.array([1, 1, 1, -1, -1, 1, -1, -1, -1])
yhat = rbfbasis.dot(theta)

plt.figure(figsize = (6, 4))

plt.plot(xp, yhat)
plt.xlabel('X')
plt.ylabel('Y')
# plt.axis([-1, 1, -1.1, 1.1])
plt.grid(alpha = 0.3)
plt.show()

Nonlinear regression with linear basis function models

We are now ready to perform nonlinear regression on the given data points.


(1) Polynomial functions


In [ ]:
n = 100
x = -5 + 15*np.random.rand(n, 1)
noise = 10*np.random.randn(n, 1)

y = 10 + 1*x + 2*x**2 + noise

plt.figure(figsize = (6, 4))
plt.xlabel('X')
plt.ylabel('Y')
plt.plot(x, y, 'o', markersize = 4, alpha = 0.3)
plt.grid(alpha = 0.3)
plt.show()
In [ ]:
xp = np.linspace(np.min(x), np.max(x)).reshape(-1, 1)

d = 3

A = np.hstack([x**i for i in range(d)])
polybasis = np.hstack([xp**i for i in range(d)])

A = np.asmatrix(A)
polybasis = np.asmatrix(polybasis)

theta = (A.T*A).I*A.T*y
yp = polybasis*theta

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', markersize = 4, alpha = 0.3)
plt.plot(xp[:,0], yp[:,0], linewidth = 2)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.show()

(2) RBF functions


In [ ]:
xp = np.linspace(np.min(x), np.max(x)).reshape(-1, 1)

d = 5
u = np.linspace(-6, 12, d)
sigma = 3

A = np.hstack([np.exp(-(x-u[i])**2/(2*sigma**2)) for i in range(d)])
rbfbasis = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])

A = np.asmatrix(A)
rbfbasis = np.asmatrix(rbfbasis)

theta = (A.T*A).I*A.T*y
yp = rbfbasis*theta

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', markersize = 4, alpha = 0.3)
plt.plot(xp, yp)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.show()

Nonlinear regression utilizing polynomial features and linear basis function models employing polynomial basis functions are fundamentally equivalent in their mathematical structure. Both approaches aim to model complex, nonlinear relationships by transforming the original input variables into a higher-dimensional space where linear regression techniques can be applied. In the case of polynomial regression, this transformation involves augmenting the input data with polynomial terms (e.g., $ x, x^2, x^3, \cdots $), effectively enabling the modeling of nonlinear patterns within the data. Similarly, linear basis function models utilize polynomial basis functions to achieve a comparable transformation, facilitating the capture of nonlinear relationships through a linear modeling framework.

Despite this equivalence, the two methodologies offer distinct perspectives in terms of implementation and interpretation. Polynomial regression is often viewed as an extension of linear regression, where additional polynomial terms are included to account for curvature in the data. This approach is straightforward and leverages the simplicity of linear models, making it accessible and interpretable. On the other hand, linear basis function models provide a more generalized framework, allowing for the incorporation of various types of basis functions beyond polynomials, such as splines or radial basis functions. This flexibility enables practitioners to tailor the model to the specific characteristics of the data, potentially enhancing predictive performance and interpretability.


4. Overfitting

In machine learning, one of the most common challenges is overfitting. Overfitting occurs when a model learns not only the important patterns in the training data but also the noise and random fluctuations. Instead of performing well on new, unseen data, the model becomes highly tuned to the training data, resulting in poor generalization.

This is similar to a student who memorizes answers to specific questions rather than understanding the concepts, leading to poor performance when asked different questions. In the same way, an overfitted model may perform very well on the training set but fail to make accurate predictions on test data.

Understanding and addressing overfitting is essential for building models that generalize effectively. We introduce the concept of overfitting, its causes, and its impact on model performance. We will also cover practical methods to prevent overfitting, such as regularization.

With clear examples and explanations, this section aims to provide an understanding of overfitting and how to handle it. By the end of this section, you will be equipped with the knowledge to create models that strike the right balance between fitting the data and making reliable predictions on new data.


4.1. Nonlinear Regression with 10 Data Points

We will consider a simple case where we have ten data points and build a non-linear regression model using polynomial functions of varying degrees. By visualizing the fitted models, we will observe how the degree of the polynomial affects the fit and identify cases of underfitting, appropriate fitting, and overfitting.


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# 10 data points
n = 10
x = np.linspace(-4.5, 4.5, 10).reshape(-1, 1)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512]).reshape(-1, 1)

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()

Degree 1 (Linear Regression):

  • The model underfits the data
  • It fails to capture the non-linear pattern and results in a straight-line fit.

In [ ]:
A = np.hstack([x**0, x])
A = np.asmatrix(A)

theta = (A.T*A).I*A.T*y
print(theta)
[[-0.7774    ]
 [-0.71070424]]
In [ ]:
# to plot
xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)
yp = theta[0,0] + theta[1,0]*xp

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp[:,0], yp[:,0], linewidth = 2)
plt.title('Linear Regression')
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()

Degree 3 (Cubic Polynomial):

  • The model fits the data well
  • It captures the underlying trend without being too flexible or too rigid.

In [ ]:
A = np.hstack([x**0, x, x**2, x**3])
A = np.asmatrix(A)

theta = (A.T*A).I*A.T*y
print(theta)
[[ 0.33669062]
 [-0.58052442]
 [-0.13504129]
 [-0.00888599]]
In [ ]:
# to plot
xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)
yp = theta[0,0] + theta[1,0]*xp + theta[2,0]*xp**2 + theta[3,0]*xp**3

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp[:,0], yp[:,0], linewidth = 2)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()

Degree 9 (High-Degree Polynomial):

  • The model overfits the data
  • It closely follows every data point, including noise, resulting in a wavy curve that fails to generalize well to unseen data.

In [ ]:
A = np.hstack([x**i for i in range(10)])
A = np.asmatrix(A)

theta = (A.T*A).I*A.T*y

# to plot
xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)

polybasis = np.hstack([xp**i for i in range(10)])
polybasis = np.asmatrix(polybasis)

yp = polybasis*theta

plt.figure(figsize = (6, 8))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp[:,0], yp[:,0], linewidth = 2)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
In [ ]:
d = [1, 3, 5, 9]
RSS = []

plt.figure(figsize = (8, 6))
plt.suptitle('Nonlinear Regression', fontsize = 12)

for k in range(4):
    A = np.hstack([x**i for i in range(d[k]+1)])
    polybasis = np.hstack([xp**i for i in range(d[k]+1)])

    A = np.asmatrix(A)
    polybasis = np.asmatrix(polybasis)

    theta = (A.T*A).I*A.T*y
    yp = polybasis*theta

    RSS.append(np.linalg.norm(y - A*theta, 2)**2)

    plt.subplot(2, 2, k+1)
    plt.plot(x, y, 'o', alpha = 0.5)
    plt.plot(xp, yp)
    plt.axis([-5, 5, -12, 6])
    plt.title('degree = {}'.format(d[k]))
    plt.grid(alpha=0.3)

plt.show()

Residual Sum of Squares


In [ ]:
plt.figure(figsize = (6, 4))
plt.stem(d, RSS, label = 'RSS')
plt.title('Residual Sum of Squares', fontsize = 12)
plt.xlabel('degree', fontsize = 12)
plt.ylabel('RSS', fontsize = 12)
plt.legend(fontsize = 12)
plt.grid(alpha = 0.3)
plt.show()

Residual Sum of Squares (RSS) decreases as the polynomial degree increases. A 9th-order polynomial regression applied to a dataset with 10 data points will result in zero RSS. However, a lower error does not necessarily indicate a better model, as overfitting may compromise generalization.


Key Takeaways

  • Underfitting: Happens when the model is too simple (e.g., a low-degree polynomial) and fails to capture the pattern.

  • Good Fit: Achieved when the model is complex enough to capture the trend but not so flexible that it fits noise.

  • Overfitting: Happens when the model is too complex (e.g., a high-degree polynomial) and memorizes noise, reducing generalization.


By testing different degrees of polynomial functions, this example demonstrates the importance of finding the right balance between model complexity and generalization to avoid underfitting and overfitting.


We will repeat the same case to build a nonlinear regression model using RBF functions.


In [ ]:
xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)

d = 5
u = np.linspace(-4.5, 4.5, d)
sigma = 2

A = np.hstack([np.exp(-(x-u[i])**2/(2*sigma**2)) for i in range(d)])
rbfbasis = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])

A = np.asmatrix(A)
rbfbasis = np.asmatrix(rbfbasis)

theta = (A.T*A).I*A.T*y
yp = rbfbasis*theta

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp, yp)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
In [ ]:
d = [2, 4, 6, 10]
sigma = 2

plt.figure(figsize = (8, 6))

for k in range(4):
    u = np.linspace(-4.5, 4.5, d[k])

    A = np.hstack([np.exp(-(x-u[i])**2/(2*sigma**2)) for i in range(d[k])])
    rbfbasis = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d[k])])

    A = np.asmatrix(A)
    rbfbasis = np.asmatrix(rbfbasis)

    theta = (A.T*A).I*A.T*y
    yp = rbfbasis*theta

    plt.subplot(2, 2, k+1)
    plt.plot(x, y, 'o', alpha = 0.5)
    plt.plot(xp, yp)
    plt.axis([-5, 5, -12, 6])
    plt.title('num RBFs = {}'.format(d[k]), fontsize = 10)
    plt.grid(alpha = 0.3)

plt.show()

4.2. Overfitting

Overfitting occurs when a machine learning model performs well on the training data but poorly on unseen or test data. The model essentially "memorizes" the noise and details in the training set rather than learning the general patterns that can be applied to new data.


Overfitting Problem

  • Have you come across a situation where your model performed exceptionally well on the training data but was not able to predict the test data?
  • One of the most common problems data science professionals face is avoiding overfitting.

Causes of Overfitting

  • Model Complexity: A model with too many parameters (e.g., deep neural networks) can overfit the training data.
  • Insufficient Training Data: If the training dataset is small, the model may fit every point precisely, capturing noise rather than general patterns.
  • Noisy Data: If the data has a lot of noise or irrelevant features, the model may try to learn this noise.
  • Too Many Training Iterations: Training for too long can lead to overfitting, where the model starts fitting noise instead of actual patterns.

Signs of Overfitting

  • High Training Accuracy, Low Test Accuracy: The model performs well on the training set but poorly on the validation/test set.
  • Large Gap Between Training and Validation Loss: Training loss continues to decrease, but validation loss increases.

Generalization Error

Generalization Error refers to the difference between a machine learning model's performance on the training data and its performance on unseen data (test or validation data). In other words, the generalization error measures the difference between the expected loss on unseen test data and the loss on the training data. It measures how well the model can generalize to new, unseen examples. A model with a low generalization error performs well not only on the training set but also on new data.

Mathematically, the generalization error can be defined as:


$$ \text{Generalization Error} = \mathbb{E}_{D_{\text{test}}}[\mathcal{L}(f(x), y)] - \mathbb{E}_{D_{\text{train}}}[\mathcal{L}(f(x), y)], $$

where:

  • $D_{\text{test}}$ and $D_{\text{train}}$ are the test and training data distributions.
  • $f(x)$ is the model's prediction function.
  • $\mathcal{L}(f(x), y)$ is the loss function measuring the difference between the predicted and true values.
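
As a purely illustrative sketch of this definition, the gap can be estimated by holding out part of the data: fit on a training split, evaluate the same loss on a held-out split, and compare. The synthetic data, the split, and the degree-9 polynomial below are arbitrary choices for demonstration.

In [ ]:
# illustrative sketch: estimate the generalization gap with a held-out split (synthetic data)
import numpy as np

np.random.seed(0)
x_all = np.random.uniform(-4.5, 4.5, (30, 1))
y_all = 0.5*x_all - 0.1*x_all**2 + np.random.randn(30, 1)

x_tr, y_tr = x_all[:20], y_all[:20]     # training split
x_te, y_te = x_all[20:], y_all[20:]     # held-out test split

deg = 9                                 # deliberately flexible model
A_tr = np.hstack([x_tr**i for i in range(deg + 1)])
A_te = np.hstack([x_te**i for i in range(deg + 1)])

theta_fit, *_ = np.linalg.lstsq(A_tr, y_tr, rcond = None)

train_loss = np.mean((A_tr@theta_fit - y_tr)**2)
test_loss = np.mean((A_te@theta_fit - y_te)**2)
print('train loss:', train_loss)
print('test loss :', test_loss)
print('estimated generalization gap:', test_loss - train_loss)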

Techniques to Prevent Overfitting


Regularization

  • Prevent overfitting by adding a penalty to the loss function, encouraging simpler models with smaller weights.
  • This penalty discourages the model from fitting noise in the training data by constraining the values of the model's parameters.
  • Regularization helps the model generalize better to unseen data by making it simpler and less sensitive to outliers.

Early Stopping

  • Monitors the validation loss during training and stops when the loss starts increasing to prevent overfitting.




Data Augmentation

  • Increases the training set size by applying transformations (e.g., rotations, flips, cropping) to existing data, making the model more generalizable.





Cross-Validation

  • Splits the data into multiple folds to train and validate the model on different subsets, ensuring that the model is not overfitting specific subsets of data.
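
A minimal scikit-learn sketch of 5-fold cross-validation is shown below; the synthetic data, the Ridge estimator, and alpha = 0.1 are arbitrary illustrative choices, not part of the notebook's dataset.

In [ ]:
# illustrative sketch: 5-fold cross-validation with scikit-learn on synthetic data
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

np.random.seed(0)
X_cv = np.random.randn(50, 3)
y_cv = X_cv @ np.array([1.0, 2.0, 0.0]) + 0.1*np.random.randn(50)

scores = cross_val_score(Ridge(alpha = 0.1), X_cv, y_cv, cv = 5)   # R^2 score on each fold
print('fold scores:', scores)
print('mean score :', scores.mean())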

Reduce Model Complexity

  • Use simpler models with fewer parameters to avoid memorizing the training data.

Among the various techniques to prevent overfitting, the regularization method will be the primary focus of the following session.


4.3. Regularization to Reduce Overfitting

L2 regularization, also known as Ridge regularization or weight decay, involves adding a penalty equal to the sum of the squared weights to the loss function. It discourages large weights by penalizing their size, helping the model learn simpler and more robust patterns. Overfitting is often associated with very large estimated parameters $\theta$.


We want to balance

  • how well the function fits the data

  • the magnitude of the coefficients



$$ \begin{align*} \text{Total cost } = \;&\underbrace{\text{measure of fit}}_{RSS(\theta)} + \;\lambda \cdot \underbrace{\text{measure of magnitude of coefficients}}_{\lambda \cdot \lVert \theta \rVert_2^2} \\ \\ \implies &\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2 \end{align*} $$


where $ RSS(\theta) = \lVert \Phi\theta - y \rVert^2_2 $ (= Residual Sum of Squares) and $\lambda$ is a tuning parameter to be determined separately
  • the second term, $\lambda \cdot \lVert \theta \rVert_2^2$, called a shrinkage penalty, is small when $\theta_1, \cdots,\theta_d$ are close to zero, and so it has the effect of shrinking the estimates of $\theta_j$ towards zero

  • The tuning parameter $\lambda$ serves to control the relative impact of these two terms on the regression coefficient estimates

  • known as ridge regression
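
For reference, this ridge objective also admits a closed-form solution analogous to the least-squares formula used earlier (a standard result, stated here without derivation):

$$\theta^{*} = \left(\Phi^T \Phi + \lambda I\right)^{-1} \Phi^T y$$

A minimal NumPy sketch with a synthetic $\Phi$ and $y$ (illustrative only):

In [ ]:
# sketch: closed-form ridge solution  theta* = (Phi^T Phi + lambda I)^{-1} Phi^T y
import numpy as np

np.random.seed(0)
Phi_demo = np.random.randn(20, 5)      # synthetic feature matrix
y_demo = np.random.randn(20, 1)        # synthetic targets
lamb_demo = 0.1

theta_ridge = np.linalg.solve(Phi_demo.T@Phi_demo + lamb_demo*np.eye(Phi_demo.shape[1]),
                              Phi_demo.T@y_demo)
print(theta_ridge)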


How L2 Regularization Works

  • Effect on Weights:

    • L2 regularization forces the model to keep weights small.
  • Interpretation:

    • L2 regularization distributes the penalty uniformly across all weights, shrinking their magnitude.
  • Overfitting Reduction:

    • By limiting the model's ability to learn large weights, the model becomes less likely to overfit the noise in the training data.

Example:

Start with a rich representation


$$\min\; \lVert \Phi \theta - y \rVert_2^2$$
In [ ]:
# CVXPY code

import cvxpy as cvx

d = 10
u = np.linspace(-4.5, 4.5, d)

sigma = 1

A = np.hstack([np.exp(-(x-u[i])**2/(2*sigma**2)) for i in range(d)])
rbfbasis = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])

A = np.asmatrix(A)
rbfbasis = np.asmatrix(rbfbasis)

theta = cvx.Variable([d, 1])
obj = cvx.Minimize(cvx.sum_squares(A@theta - y))
prob = cvx.Problem(obj).solve()

yp = rbfbasis*theta.value

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp, yp)
plt.xlabel('X')
plt.ylabel('Y')
plt.axis([-5, 5, -12, 6])
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()

As observed in the graph above, using 10 RBF functions for 10 data points results in a perfect fit, achieving zero RSS for the training data. However, this clearly indicates overfitting.


Visualizing the learned coefficients of a regression model is a valuable step in understanding the influence of each basis.


In [ ]:
plt.figure(figsize = (6, 4))
plt.title(r'Ridge: magnitude of $\theta$', fontsize = 12)
plt.xlabel(r'$\theta$', fontsize = 12)
plt.ylabel('magnitude', fontsize = 12)
plt.stem(np.linspace(1, 10, 10).reshape(-1, 1), theta.value)
plt.xlim([0.5, 10.5])
# plt.ylim([-5, 5])
plt.grid(alpha = 0.3)
plt.show()

  • Then, regularize coefficients $\theta$

$$\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2$$
In [ ]:
# ridge regression

lamb = 0.1

theta = cvx.Variable([d, 1])
obj = cvx.Minimize(cvx.sum_squares(A@theta - y) + lamb*cvx.sum_squares(theta))
prob = cvx.Problem(obj).solve()

yp = rbfbasis*theta.value

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp, yp, label = 'Ridge')
plt.xlabel('X')
plt.ylabel('Y')
plt.axis([-5, 5, -12, 6])
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()

Plot the values of all the coefficients.


In [ ]:
# Regularization (= ridge nonlinear regression) encourages small weights, but not exactly 0

plt.figure(figsize = (6, 4))
plt.title(r'Ridge: magnitude of $\theta$', fontsize = 12)
plt.xlabel(r'$\theta$', fontsize = 12)
plt.ylabel('magnitude', fontsize = 12)
plt.stem(np.linspace(1, 10, 10).reshape(-1, 1), theta.value)
plt.xlim([0.5, 10.5])
plt.ylim([-5, 5])
plt.grid(alpha = 0.3)
plt.show()

Larger $\lambda$ forces the model to keep weights small.


In [ ]:
lamb = np.arange(0, 3, 0.01)

theta_record = []
for k in lamb:
    theta = cvx.Variable([d, 1])
    obj = cvx.Minimize(cvx.sum_squares(A@theta - y) + k*cvx.sum_squares(theta))
    prob = cvx.Problem(obj).solve()
    theta_record.append(np.ravel(theta.value))

plt.figure(figsize = (6, 4))
plt.plot(lamb, theta_record, linewidth = 1)
plt.title('Ridge coefficients as a function of regularization', fontsize = 12)
plt.xlabel('$\lambda$', fontsize = 12)
plt.ylabel(r'weight $\theta$', fontsize = 12)
plt.grid(alpha = 0.3)
plt.show()

From this example, we learn that regularization, specifically ridge regression, can be an effective method for overcoming overfitting challenges.



4.4. Sparsity for Feature Selection using LASSO

Ridge regression applies a penalty on the magnitude of the weights using the L2 norm. Why not consider using the L1 norm instead?


Try this cost instead of ridge...


$$ \begin{align*} \text{Total cost } = \;&\underbrace{\text{measure of fit}}_{RSS(\theta)} + \;\lambda \cdot \underbrace{\text{measure of magnitude of coefficients}}_{\lambda \cdot \lVert \theta \rVert_1} \\ \\ \implies &\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_1 \end{align*}$$

Consider the same problem but with a different objective function:


$$\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_1$$
In [ ]:
# LASSO regression

lamb = 2
theta = cvx.Variable([d, 1])
obj = cvx.Minimize(cvx.sum_squares(A@theta - y) + lamb*cvx.norm(theta, 1))
prob = cvx.Problem(obj).solve()

yp = rbfbasis*theta.value

plt.figure(figsize = (6, 4))
plt.title('LASSO Regularization (L1)', fontsize = 12)
plt.xlabel('X')
plt.ylabel('Y')
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp, yp)
plt.axis([-5, 5, -12, 6])
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()

By selecting an appropriate value for $ \lambda $, the above example demonstrates that the L1 norm can also effectively mitigate overfitting.


This naturally leads to the consideration of the differences between L2 and L1 regularization techniques.


In [ ]:
# Regularization (= LASSO nonlinear regression) encourages zero weights

plt.figure(figsize = (6, 4))
plt.title(r'LASSO: magnitude of $\theta$', fontsize = 12)
plt.xlabel(r'$\theta$')
plt.ylabel('magnitude')
plt.stem(np.arange(1,11), theta.value)
plt.xlim([0.5, 10.5])
plt.ylim([-6, 2])
plt.grid(alpha = 0.3)
plt.show()

  • 'Shrink' some coefficients exactly to 0

    • i.e., knock out certain features
    • the $l_1$ penalty has the effect of forcing some of the coefficient estimates to be exactly zero
  • Non-zero coefficients indicate the 'selected' features


In [ ]:
lamb = np.arange(0, 3, 0.01)

theta_record = []
for k in lamb:
    theta = cvx.Variable([d, 1])
    obj = cvx.Minimize(cvx.sum_squares(A@theta - y) + k*cvx.norm(theta, 1))
    prob = cvx.Problem(obj).solve()
    theta_record.append(np.ravel(theta.value))

plt.figure(figsize = (6, 4))
plt.plot(lamb, theta_record, linewidth = 1)
plt.title('LASSO coefficients as a function of regularization', fontsize = 12)
plt.xlabel('$\lambda$')
plt.ylabel(r'weight $\theta$')
plt.show()

Sparsity and Feature Selection:

The L1 penalty in LASSO (Least Absolute Shrinkage and Selection Operator) encourages sparsity in the coefficient estimates, meaning it drives some coefficients to be exactly zero. This property is particularly useful for feature selection, as it allows the model to identify and retain only the most relevant predictors, simplifying the model and potentially improving interpretability.


In [ ]:
# reduced order model
# we will use only theta 2, 3, 8, 10

d = 4
u = np.array([-3.5, -2.5, 2.5, 4.5])

sigma = 1

A = np.hstack([np.exp(-(x-u[i])**2/(2*sigma**2)) for i in range(d)])
rbfbasis = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])

A = np.asmatrix(A)
rbfbasis = np.asmatrix(rbfbasis)

theta = cvx.Variable([d, 1])
obj = cvx.Minimize(cvx.norm(A@theta - y, 2))
prob = cvx.Problem(obj).solve()

yp = rbfbasis*theta.value

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp, yp)
plt.xlabel('X')
plt.ylabel('Y')
plt.axis([-5, 5, -12, 6])
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()

We numerically demonstrate that the L1 penalty enforces sparsity in the coefficient estimates. But why does this occur?


To better understand this phenomenon, we can explore alternative but equivalent optimization formulations for both Ridge and LASSO regression.


$$ \begin{array}{rl} \begin{align*} \min_{\theta} \quad & \lVert \Phi \theta - y \rVert_2^2 \\ \text{subject to} \quad & \lVert \theta \rVert_{\color{red}1} \leq s_1 \end{align*} \end{array} \quad\quad\quad\quad \begin{array}{rl} \min_{\theta} \; & \lVert \Phi \theta - y \rVert_2^2 \\ \text{subject to} \; & \lVert \theta \rVert_{\color{red}2} \leq s_2 \end{array} $$




These constrained formulations reveal the underlying geometric properties of L1 and L2 regularization. The L1 constraint encourages solutions where some coefficients are exactly zero, resulting in sparse parameter estimates. In contrast, the L2 constraint tends to shrink all coefficients uniformly, rarely producing exact zeros but effectively controlling overall model complexity.
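
To connect the constrained and penalized views numerically, the minimal sketch below solves the L1-constrained form with CVXPY, rebuilding a 10-RBF design matrix from the data $x$, $y$ used above; the bound $s_1 = 4$ is an arbitrary illustrative choice.

In [ ]:
# sketch: the L1-constrained form of LASSO solved directly with CVXPY
# (rebuilds a 10-RBF design matrix from x as above; the bound s1 is an arbitrary illustrative choice)

d_c = 10
u_c = np.linspace(-4.5, 4.5, d_c)
sigma_c = 1

A_c = np.hstack([np.exp(-(x - u_c[i])**2/(2*sigma_c**2)) for i in range(d_c)])

s1 = 4
theta_c = cvx.Variable([d_c, 1])
obj_c = cvx.Minimize(cvx.sum_squares(A_c@theta_c - y))
const_c = [cvx.norm(theta_c, 1) <= s1]
cvx.Problem(obj_c, const_c).solve()

print('L1 norm of theta:', np.abs(theta_c.value).sum())
print(theta_c.value)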


4.5. Scikit-Learn

In [ ]:
# 10 data points

n = 10
x = np.linspace(-4.5, 4.5, 10).reshape(-1, 1)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512]).reshape(-1, 1)

d = 10
u = np.linspace(-4.5, 4.5, d)

sigma = 1
A = np.hstack([np.exp(-(x-u[i])**2/(2*sigma**2)) for i in range(d)])
In [ ]:
from sklearn import linear_model

reg = linear_model.Ridge(alpha = 0.1)
reg.fit(A,y)
Out[ ]:
Ridge(alpha=0.1)
In [ ]:
# to plot

xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)
Ap = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp, reg.predict(Ap), linewidth = 2, label = "Ridge")
plt.title('Ridge Regression (L2)', fontsize = 12)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()
In [ ]:
reg = linear_model.Lasso(alpha = 0.005)
reg.fit(A,y)
Out[ ]:
Lasso(alpha=0.005)
In [ ]:
# to plot

xp = np.arange(-4.5, 4.5, 0.01).reshape(-1, 1)
Ap = np.hstack([np.exp(-(xp-u[i])**2/(2*sigma**2)) for i in range(d)])

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(xp, reg.predict(Ap), linewidth = 2, label = "LASSO")
plt.title('LASSO Regression (L1)', fontsize = 12)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.show()

We have covered the essential concepts of regression, including both theory and implementation. Now, we will explore additional applications of regression.


5. Example: De-noising Signal

We start with a signal represented by a vector $x \in \mathbb{R}^n$. The coefficients $x_i$ correspond to the value of some function of time, evaluated (or sampled, in the language of signal processing) at evenly spaced points. It is usually assumed that the signal does not vary too rapidly, so that typically $x_i \approx x_{i+1}$.


Suppose we have a signal $x$, which does not vary too rapidly and that $x$ is corrupted by some small, rapidly varying noise $\varepsilon$, i.e., $x_{cor} = x + \varepsilon$.

Then if we want to reconstruct $x$ from $x_{cor}$ we should solve (with $\hat{x}$ as the parameter)


$$ \text{minimize} \quad \lVert \hat{x} - x_{cor}\rVert_2^2 + \mu\sum_{i=1}^{n-1}(x_{i+1}-x_i)^2 $$

where the parameter $\mu$ controls the ''smoothness'' of $\hat{x}$.


Source:


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import cvxpy as cvx

%matplotlib inline
In [ ]:
n = 200
t = np.arange(n).reshape(-1,1)
x = 0.5 * np.sin((2*np.pi/n)*t) * (np.sin(0.01*t))
x_cor = x + 0.05*np.random.randn(n,1)

plt.figure(figsize = (6, 4))
plt.subplot(2,1,1)
plt.plot(t,x,'-', linewidth = 2)
plt.axis([0, n, -0.6, 0.6])
plt.xticks([])
plt.title('original signal' , fontsize = 15)
plt.ylabel('$x_{original}$', fontsize = 15)

plt.subplot(2,1,2)
plt.plot(t, x_cor,'-', linewidth = 1)
plt.axis([0, n, -0.6, 0.6])
plt.title('corrupted signal', fontsize = 15)
plt.xlabel('n', fontsize = 15)
plt.ylabel('$x_{corrupted}$', fontsize = 15)
plt.show()

5.1. Transform de-noising in time into an optimization problem



$$X = \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix} $$
$$\large \min\limits_{X}\;\left\{\underbrace{\lVert(X - X_{cor})\rVert^2_{2}}_{\text{how much } X \text{ deviates from }X_{cor}} + \mu \underbrace{\sum_{k = 1}^{n-1}(x_{k+1}-x_{k})^2}_{\text{penalize rapid changes of } X}\right\}$$



  • $\min\limits_{X}\;\lVert(X - X_{cor})\rVert^2_{2}$: $\quad$ How much $X$ deviates from $X_{cor}$

  • $\mu\sum\limits_{k = 1}^{n-1}(x_{k+1}-x_{k})^2$: $\quad$ penalize rapid changes of $X$

  • $\mu$ : to adjust the relative weight of 1) & 2)

  • In a vector form

$\quad$1) $X - X_{cor} = I_n X - X_{cor}$

$\quad$2) $\sum\;(x_{k+1}-x_{k})^2 \implies$


$$(x_{2} - x_{1}) - 0 = \begin{bmatrix} -1,&1,&0,&\cdots&0 \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix}-0 $$$$(x_{3} - x_{2}) - 0 = \begin{bmatrix} 0,&-1,&1,&\cdots&0 \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix}-0 $$
$$ \vdots $$
$$\implies \left \Arrowvert \; \begin{bmatrix} -1&1&0&\cdots&0&0 \\ 0&-1&1&\cdots&0&0 \\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&\cdots&-1&1 \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix} - \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \; \right\Arrowvert^2_{2} $$
$$\hspace{3.2cm}D \quad\quad\quad\quad\quad\quad\qquad X \quad\; - \; \; 0 $$




$$\begin{align*} \left\Vert I_n X-X_{cor}\big\Vert^2_{2} + \mu\big\Vert DX-0\right\Vert^2_{2} & = \big\Vert Ax-b\big\Vert^2_{2} \\ \\ & = \Bigg\Vert \begin{bmatrix} I_n\\ \sqrt{\mu}D \end{bmatrix} X- \begin{bmatrix} X_{cor}\\ 0 \end{bmatrix} \Bigg\Vert^2_{2} \end{align*}$$
$$ \text{where} \; A = \begin{bmatrix} I_n\\ \sqrt{\mu}D \end{bmatrix},\quad b = \begin{bmatrix} X_{cor}\\ 0 \end{bmatrix} $$



  • Then, plug $A$, $b$ to $(A^TA)^{-1}A^Tb$
In [ ]:
mu = 100

D = np.zeros([n-1, n])
D[:,0:n-1] -= np.eye(n-1)

D[:,1:n] += np.eye(n-1)
A = np.vstack([np.eye(n), np.sqrt(mu)*D])

b = np.vstack([x_cor, np.zeros([n-1,1])])

A = np.asmatrix(A)
b = np.asmatrix(b)

x_reconst = (A.T*A).I*A.T*b

plt.figure(figsize = (6, 2))
plt.plot(t, x_cor, '-', linewidth = 1, alpha = 0.5, label = 'corrupted');
plt.plot(t, x_reconst, 'r', linewidth = 2, label = 'reconstructed')
plt.title('$\mu = {}$'.format(mu), fontsize = 15)
plt.legend(fontsize = 12)
plt.show()

As demonstrated, the original signal can be effectively reconstructed from the corrupted one using regression techniques. While this process highlights the potential of regression for noise reduction, it is important to note that such tasks are typically achieved more effectively and efficiently using low-pass filters in either the time or frequency domain. Nevertheless, this example serves to highlight the broader applicability of regression methods, demonstrating their potential to address various practical problems.


This emphasizes the importance of formulating problems in the form of a regression model, as it can provide valuable insights and effective solutions even in contexts that may not initially appear to be related to regression models.


5.1.1. with different $\mu$'s (see how $\mu$ affects smoothing results)

In [ ]:
plt.figure(figsize = (6, 6))

mu = [0, 10, 100];

for i in range(len(mu)):
    A = np.vstack([np.eye(n), np.sqrt(mu[i])*D])
    b = np.vstack([x_cor, np.zeros([n-1,1])])

    A = np.asmatrix(A)
    b = np.asmatrix(b)

    x_reconst = (A.T*A).I*A.T*b

    plt.subplot(3,1,i+1)
    plt.plot(t, x_cor, '-', linewidth = 1, alpha = 0.5)
    plt.plot(t, x_reconst, 'r', linewidth = 2)
    plt.ylabel('$\mu = {}$'.format(mu[i]), fontsize = 15)

plt.show()

5.1.2. CVXPY Implementation

  • cvxpy toolbox to numerically solve

$$ \min\; \left\{ \lVert x-x_{cor}\rVert_2^2 + \mu \lVert Dx \rVert_2^2 \right\}$$
In [ ]:
mu = 100

x_reconst = cvx.Variable([n,1])
#obj = cvx.Minimize(cvx.sum_squares(x_reconst - x_cor) + mu*cvx.sum_squares(x_reconst[1:n]-x_reconst[0:n-1]))
obj = cvx.Minimize(cvx.sum_squares(x_reconst - x_cor) + mu*cvx.sum_squares(D@x_reconst))

prob = cvx.Problem(obj).solve()

plt.figure(figsize = (6, 2))
plt.plot(t, x_cor, '-', linewidth = 1, alpha = 0.5, label = 'corrupted');
plt.plot(t, x_reconst.value, 'r', linewidth = 2, label = 'reconstructed')
plt.title('$\mu = {}$'.format(mu), fontsize = 15)
plt.legend(fontsize = 12)
plt.show()
In [ ]:
plt.figure(figsize = (6, 6))

mu = [0, 10, 100]

for i in range(len(mu)):
    x_reconst = cvx.Variable([n,1])
    obj = cvx.Minimize(cvx.sum_squares(x_reconst - x_cor) + mu[i]*cvx.sum_squares(D @ x_reconst))
    prob = cvx.Problem(obj).solve()

    plt.subplot(3,1,i+1)
    plt.plot(t,x_cor,'-', linewidth = 1, alpha = 0.5)
    plt.plot(t,x_reconst.value, 'r', linewidth = 2)
    plt.ylabel('$\mu = {}$'.format(int(mu[i])), fontsize = 15)

plt.show()

5.1.3. L2 Norm

  • CVXPY strongly encourages eliminating quadratic forms, that is, functions like sum_squares, sum(square(.)), or quad_form

  • Whenever possible, construct equivalent models using norm instead


$$ \min \; \left\{ \lVert x-x_{cor}\rVert_2 + \gamma \lVert Dx \rVert_2 \right \}$$
In [ ]:
plt.figure(figsize = (6, 6))

gammas = [0, 2, 7]

for i in range(len(gammas)):
    x_reconst = cvx.Variable([n,1])
    obj = cvx.Minimize(cvx.norm(x_reconst-x_cor, 2) + gammas[i]*(cvx.norm(D @ x_reconst, 2)))
    prob = cvx.Problem(obj).solve()

    plt.subplot(3,1,i+1)
    plt.plot(t,x_cor,'-', linewidth = 1, alpha = 0.5)
    plt.plot(t,x_reconst.value, 'r', linewidth = 2)
    plt.ylabel('$ \gamma = {}$'.format(gammas[i]), fontsize = 15)

plt.show()

5.1.4. L2 Norm with a Constraint


$$\min \; \lVert Dx \rVert_2$$$$ s.t. \quad \lVert x-x_{cor} \rVert_2 < \beta $$
In [ ]:
beta = 0.8

x_reconst = cvx.Variable([n,1])
obj = cvx.Minimize(cvx.norm(D @ x_reconst, 2))
const = [cvx.norm(x_reconst-x_cor, 2) <= beta]
prob = cvx.Problem(obj, const).solve()

plt.figure(figsize = (6, 2))
plt.plot(t,x_cor,'-', linewidth = 1, alpha = 0.5)
plt.plot(t,x_reconst.value, 'r', linewidth = 2)
plt.ylabel(r'$\beta = {}$'.format(beta), fontsize = 15)
plt.show()
In [ ]:
plt.figure(figsize = (6, 6))

beta = [0.1, 0.6, 0.8]

for i in range(len(beta)):

    x_reconst = cvx.Variable([n,1])
    obj = cvx.Minimize(cvx.norm(x_reconst[1:n] - x_reconst[0:n-1], 2))
    const = [cvx.norm(x_reconst-x_cor, 2) <= beta[i]]
    prob = cvx.Problem(obj, const).solve()

    plt.subplot(len(beta),1,i+1)
    plt.plot(t,x_cor,'-', linewidth = 1, alpha = 0.5)
    plt.plot(t,x_reconst.value, 'r', linewidth = 2)
    plt.ylabel(r'$\beta = {}$'.format(beta[i]), fontsize = 15)

plt.show()

5.2. Signal with Sharp Transition + Noise

Suppose we have a signal $x$, which is mostly smooth, but has several rapid variations (or jumps). If we apply quadratic smoothing on this signal, then in order to remove the noise we will not be able to preserve the signal's sharp transitions.


First, apply the same method that we used for smoothing signals before


In [ ]:
n = 200
t = np.arange(n).reshape(-1,1)

exact = np.vstack([np.ones([50,1]), -np.ones([50,1]), np.ones([50,1]), -np.ones([50,1])])
x = exact + 0.5*np.sin((2*np.pi/n)*t)
x_cor = x + 0.1*np.random.randn(n,1)

plt.figure(figsize = (6, 4))
plt.subplot(2,1,1)
plt.plot(t, x)
plt.ylim([-2.0,2.0])
plt.ylabel('signal', fontsize = 15)
plt.subplot(2,1,2)
plt.plot(t, x_cor, linewidth = 1)
plt.ylabel('corrupted signal', fontsize = 15)
plt.xlabel('x', fontsize = 15)
plt.show()

5.2.1. L2 Norm (Quadratic Smoothing)

In [ ]:
plt.figure(figsize = (6, 6))

beta = [0.5, 2, 4]

for i in range(len(beta)):
    x_reconst = cvx.Variable([n,1])
    obj = cvx.Minimize(cvx.norm(x_reconst[1:n] - x_reconst[0:n-1], 2))
    const = [cvx.norm(x_reconst - x_cor, 2) <= beta[i]]
    prob = cvx.Problem(obj, const).solve()

    plt.subplot(len(beta), 1, i+1)
    plt.plot(t, x_cor, linewidth = 1, alpha = 0.5)
    plt.plot(t, x_reconst.value, 'r', linewidth = 2)
    plt.ylabel(r'$\beta = {}$'.format(beta[i]), fontsize = 15)

plt.show()

  • Quadratic smoothing (L2 norm penalty) smooths out both the noise and the sharp transitions in the signal, but this is not what we want

  • Any ideas?


5.2.2. L1 Norm

We can instead apply L1 norm on the signal by solving


$$\min \; \lVert x - x_{cor} \rVert_2 + \lambda \sum_{i=1}^{n-1} \;\lvert x_{i+1}-x_i \rvert $$

where the parameter $\lambda$ controls the ''smoothness'' of $x$.


$$\min \; \lVert Dx \rVert_1$$$$ s.t. \quad \lVert x-x_{cor} \rVert_2 < \beta $$
In [ ]:
plt.figure(figsize = (6, 6))

beta = [0.5, 2, 4]

for i in range(len(beta)):
    x_reconst = cvx.Variable([n,1])
    obj = cvx.Minimize(cvx.norm(x_reconst[1:n] - x_reconst[0:n-1], 1))
    const = [cvx.norm(x_reconst-x_cor, 2) <= beta[i]]
    prob = cvx.Problem(obj, const).solve()

    plt.subplot(len(beta), 1, i+1)
    plt.plot(t, x_cor, linewidth = 1, alpha = 0.5)
    plt.plot(t, x_reconst.value, 'r', linewidth = 2)
    plt.ylabel(r'$\beta = {}$'.format(beta[i]), fontsize = 15)

plt.show()

Consider the result presented above and think about why.


Total Variation (TV) smoothing is a robust and widely used technique in signal and image processing, designed to reduce noise while effectively preserving important structural details such as edges. Unlike traditional smoothing methods, which tend to introduce excessive blurring, TV smoothing excels at maintaining sharp transitions within the data.


One notable characteristic of TV smoothing is its ability to preserve sharp transitions in the signal, which is often desirable in practical applications. This property ensures that key features, such as edges in images or abrupt changes in signals, remain well-defined even after the noise is reduced.



6. Example: Total Variation Image Reconstruction

Here, consider Total Variation (TV) smoothing applied to image processing as the final example of regression.


In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [ ]:
import cv2

imbw = cv2.imread('/content/drive/MyDrive/ML_Colab/ML_data/dog.jpg', 0)

row = 150
col = 150
resized_imbw = cv2.resize(imbw, (row, col))

plt.figure(figsize = (6, 6))
plt.imshow(resized_imbw, 'gray')
plt.axis('off')
plt.show()

  • Question: Apply the $L_1$ norm to the image, and guess what kind of image will be produced?

$$\min \; \lVert Dx \rVert_1$$$$ s.t. \quad \lVert x-x_{cor} \rVert_2 < \beta $$
In [ ]:
n = row*col
imbws = resized_imbw.reshape(-1, 1)

beta = 1500

x = cvx.Variable([n,1])
obj = cvx.Minimize(cvx.norm(x[1:n] - x[0:n-1],1))
const = [cvx.norm(x - imbws,2) <= beta]
prob = cvx.Problem(obj, const).solve()

imbwr = x.value.reshape(row, col)

plt.figure(figsize = (6, 6))
plt.imshow(imbwr,'gray')
plt.axis('off')
plt.show()

This example demonstrates how regression principles extend beyond conventional predictive models and can be applied effectively to complex tasks such as image denoising and signal reconstruction.


In [ ]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')