Classification
Table of Contents
from IPython.display import YouTubeVideo
YouTubeVideo('IRrGVQV8vZ8?si=_hkRnh2jEhJVDXsY&start=931', width = "560", height = "315")
Classification is another core task in machine learning, aimed at categorizing data into predefined classes or categories based on input features. Unlike regression, where the target variable is continuous, classification deals with discrete outputs. It plays a crucial role in a variety of real-world applications, such as spam filtering, medical diagnosis, and image recognition.
Key Characteristics of Classification
Type of Classification Problems
(1) Binary Classification:
(2) Multiclass Classification:
Topics Covered in This Section
To build a solid understanding of classification techniques, we will explore the following key algorithms:
(1) Perceptron:
(2) Support Vector Machines (SVM):
(3) Logistic Regression:
The Perceptron is one of the simplest types of artificial neural networks and is used for binary classification. It was invented by Frank Rosenblatt in 1958 and serves as the foundation for more advanced neural networks.
To understand how the perceptron works, let's consider a bank loan approval scenario where the bank decides whether to approve or reject a loan application based on specific criteria.
Suppose a decision can be made based on a simple linear combination of features:
where the sign function is:
Introduce an artificial coordinate $x_0 = 1$:
In a vector form
Classification boundary: hyperplane $g(x) = 0 \implies \omega^T x = 0$
Given the vector $ \omega $, the decision boundary is defined by the hyperplane:
This hyperplane is orthogonal to the vector $ \omega $, meaning $ \omega $ determines the direction perpendicular to the boundary.
Blue Points: Since all blue points are located on the same side of the hyperplane relative to $ \omega $, they satisfy:
Green Points: Conversely, the green points are positioned on the opposite side of the hyperplane relative to $ \omega $, satisfying:
In this linear classification framework, $ \omega $ not only defines the orientation of the boundary but also plays a crucial role in determining which points belong to each class based on their position relative to the hyperplane.
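To make this geometric picture concrete, the following minimal sketch (with a hypothetical weight vector and two illustrative points, the bias folded in as the first component) checks on which side of the hyperplane $\omega^T x = 0$ each point falls.
import numpy as np

w = np.array([-3.0, 0.8, 1.0])          # hypothetical omega = [w0, w1, w2]

# two illustrative points, each augmented with x0 = 1
points = np.array([[1.0, 6.0, 2.0],     # expected on the positive side of the hyperplane
                   [1.0, 1.0, -1.0]])   # expected on the negative side

print(np.sign(points @ w))              # prints [ 1. -1.]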
Learning a Hyperplane for Classification
(1) Goal:
The objective is to learn the hyperplane $g_{\omega}(x)=0$ using the given training data. This hyperplane serves as the decision boundary that separates different classes.
(2) How to find $\omega$:
The parameter vector $\omega$ defines the orientation and position of the hyperplane. The classification rule is determined as follows:
This formulation highlights the fundamental idea behind linear classifiers, where the decision boundary is defined by a linear function of the input features. By appropriately determining $\omega$, we can effectively separate the data points into their respective classes.
Note that in traditional methods, the vector $\omega$ is often determined by domain experts based on their prior knowledge and insights about the system. However, in machine learning, the objective is to estimate $\omega$ directly from the data without any inherent bias or preconception. This data-driven approach allows the model to adaptively learn the optimal decision boundary, potentially uncovering complex patterns that may not be apparent through manual design.
Key Insights
'Learning from Data' as a Paradigm Shift:
The concept of "learning from data" represents a significant philosophical shift for many engineers. Traditional engineering methods often rely on explicitly defined models and expert-derived parameters. In contrast, machine learning emphasizes empirical learning, where patterns and relationships are discovered directly from observed data.
Estimating $\omega$:
There are various techniques available to estimate $\omega$, each designed to suit different types of data and modeling requirements. We will explore several key algorithms for effectively estimating $\omega$. These methods form the foundation of machine learning approaches to classification.
We will first walk through the perceptron algorithm and then explore the underlying principles that explain how and why it works.
The perceptron implements
Given the training set
Why Perceptron Updates Work?
Note:
(1) Perceptron Update Rule (discrete version):
$$\omega \leftarrow \omega \pm 1 \cdot x_n$$

(2) Gradient Descent Update Rule (continuous version):

$$\omega \leftarrow \omega - \alpha \nabla_{\omega} f$$

Understanding this similarity provides insight into the foundational mechanisms of various learning algorithms, illustrating how they iteratively adjust model parameters to enhance performance.
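As a sketch of why the two rules coincide, consider the per-sample perceptron loss introduced later in this section, $f(\omega) = \max\left(0, -y_n\, \omega^T x_n\right)$. Its (sub)gradient is

$$\nabla_{\omega} f = \begin{cases} -\,y_n x_n, & \text{if } y_n\, \omega^T x_n < 0 \;\;(\text{misclassified})\\ 0, & \text{otherwise} \end{cases}$$

so gradient descent with step size $\alpha = 1$ on a misclassified point gives $\omega \leftarrow \omega + y_n x_n$, which is exactly the perceptron update, with the sign $\pm 1$ supplied by the label $y_n$.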
Summary
The perceptron is a simple yet powerful model for binary classification tasks like bank loan approval. It classifies applicants based on features like credit score, income, and debt. While it works well for linearly separable data, it has limitations for more complex datasets. Understanding how the perceptron learns and updates its weights is fundamental to understanding modern neural networks.
This diagram symbolically illustrates the Perceptron, a fundamental model in machine learning. The Perceptron operates as follows:
(1) Input Stage
The model takes multiple inputs, denoted as $ x_0, x_1, \cdots, x_d $, each associated with a corresponding weight $ \omega_0, \omega_1, \cdots, \omega_d $.
These weights are key parameters that the model adjusts during the learning process.
(2) Weighted Sum Calculation
(3) Activation Function
(4) Weight Update (Learning Process)
Important Insight for Deep Learning
This diagram is also significant because the Perceptron can be viewed as a simple neuron in an Artificial Neural Network (ANN). When we study deep learning later, this structure will serve as the fundamental building block of more complex models. Understanding this visualization now will provide valuable intuition when exploring neural networks and their hierarchical architecture.
Perceptron loss function
If you do not want to explicitly check whether each point is misclassified or not, you can write the perceptron loss function in a more compact expression that automatically accumulates contributions only from misclassified points.
The loss for an individual sample is defined as:
Note:
$\text{sign}\left(\omega^T x_n \right) \neq y_n$ is equivalent to $ y_n \cdot \left(\omega^T x_n \right) < 0$
$\text{ReLU}(z) = \max(0, z)$: the Rectified Linear Unit (ReLU) will be revisited later in the discussion, as it plays a crucial role not only in perceptron loss formulation but also in modern deep learning architectures where it is widely used as an activation function.
This function returns zero when the point is correctly classified and returns a positive value proportional to the margin violation when misclassified.
Compact Expression for Total Perceptron Loss
The total loss across all samples in the dataset $D$ is given by:
This summation aggregates the loss over all training samples and only penalizes incorrectly classified points.
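A minimal NumPy sketch of this compact loss (assuming labels $y_n \in \{-1, +1\}$ and inputs already augmented with the artificial coordinate $x_0 = 1$, as described next):
import numpy as np

def perceptron_loss(w, X, y):
    margins = y * (X @ w)                     # y_n * (w^T x_n); negative means misclassified
    return np.sum(np.maximum(0.0, -margins))  # ReLU of the negated margins

# tiny hypothetical example: one correctly classified point, one misclassified point
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 1.0, -2.0]])
y = np.array([1.0, -1.0])
w = np.array([0.0, 1.0, 0.0])
print(perceptron_loss(w, X, y))               # prints 1.0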
By adding an artificial coordinate $x_0 = 1$, we simplify the perceptron implementation and avoid the need to handle the bias separately. This technique is commonly used in machine learning to streamline calculations and make the formulation more uniform.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# training data generation
m = 100
x1 = 8*np.random.rand(m, 1)
x2 = 7*np.random.rand(m, 1) - 4
g = 0.8*x1 + x2 - 3
C1 = np.where(g >= 1)
C0 = np.where(g < -1)
print(C1)
C1 = np.where(g >= 1)[0]
C0 = np.where(g < -1)[0]
print(C1.shape)
print(C0.shape)
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.title('Linearly Separable Classes')
plt.legend(loc = 1)
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.show()
X1 = np.hstack([np.ones([C1.shape[0],1]), x1[C1], x2[C1]])
X0 = np.hstack([np.ones([C0.shape[0],1]), x1[C0], x2[C0]])
X = np.vstack([X1, X0])
y = np.vstack([np.ones([C1.shape[0],1]), -np.ones([C0.shape[0],1])])
X = np.asmatrix(X)
y = np.asmatrix(y)
where $(x, y)$ is a misclassified training point
w = np.ones([3,1])
w = np.asmatrix(w)
n_iter = y.shape[0]
for k in range(n_iter):
for i in range(n_iter):
if y[i,0] != np.sign(X[i,:]*w)[0,0]:
w += y[i,0]*X[i,:].T
print(w)
x1p = np.linspace(0,8,100).reshape(-1,1)
x2p = - w[1,0]/w[2,0]*x1p - w[0,0]/w[2,0]
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, c = 'k', linewidth = 3, label = 'perceptron')
plt.xlim([0, 8])
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.legend(loc = 1)
plt.show()
Perceptron using Scikit-Learn
In scikit-learn, the Perceptron includes the bias term by default, so you don't need to add an artificial coordinate manually. This makes it more convenient for implementing and training perceptron models on datasets.
X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])
X = np.vstack([X1, X0])
y = np.vstack([np.ones([C1.shape[0],1]), -np.ones([C0.shape[0],1])])
from sklearn import linear_model
clf = linear_model.Perceptron(tol=1e-3)
clf.fit(X, np.ravel(y))
clf.predict([[3, -2]])
clf.predict([[6, 2]])
clf.coef_
clf.intercept_
w0 = clf.intercept_[0]
w1 = clf.coef_[0,0]
w2 = clf.coef_[0,1]
x1p = np.linspace(0,8,100).reshape(-1,1)
x2p = - w1/w2*x1p - w0/w2
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, c = 'k', linewidth = 4, label = 'perceptron')
plt.xlim([0, 8])
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.legend(loc = 1)
plt.show()
Limitations
The Perceptron Algorithm successfully identifies a separating hyperplane if the data is linearly separable. However, there are typically many possible hyperplanes that can separate the data. The Perceptron algorithm does not guarantee the optimal hyperplane - it merely finds one of the possible solutions.
Improvements
While the Perceptron identifies one valid separating hyperplane, identifying the best hyperplane requires an optimization framework. This concept is crucial in understanding more advanced models like Support Vector Machines (SVM) and Logistic Regression, which are designed to find optimal decision boundaries with improved performance and generalization.
In the late 1960s, Marvin Minsky and Seymour Papert published the influential book "Perceptrons" (1969), where they mathematically demonstrated the limitations of the perceptron. One key result from their work was that single-layer perceptrons cannot solve the XOR problem (exclusive OR), highlighting a significant limitation of early neural networks.
Stagnation in Neural Network Research:
Rediscovery with Multi-Layer Networks:
As the first classification algorithm that improves upon the simple perceptron, we will study Support Vector Machines (SVM) in this section. To build the foundation for understanding SVMs, we will first examine how to compute the distance from a point to a line. This concept is crucial, as it will later be used to identify and optimize the best possible hyperplane separator.
(1) If $\vec p$ and $\vec q$ are on the decision line
(2) Compute $d$, the signed normal distance from the origin to the line
(3) Compute $h$, the signed normal distance from $x$ to the line
(1) Find a distance between $g(x) = -1$ and $g(x) = 1$
(2) Another method to find a distance between $g(x) = -1$ and $g(x) = 1$
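As a quick numerical check of these distance formulas, here is a minimal sketch. It assumes the line used in the data generation below, $g(x) = \tfrac{2}{3}x_1 + x_2 - 6 = 0$, computes the signed distance $h$ from a point to the line, and verifies that the band between $g(x) = -1$ and $g(x) = 1$ has width $2/\lVert \omega \rVert$.
import numpy as np

# line g(x) = w^T x + w0 = 0 with w = [2/3, 1], w0 = -6 (matching the data generation below)
w = np.array([2/3, 1.0])
w0 = -6.0

# signed distance from a point x to the line: h = (w^T x + w0) / ||w||
x = np.array([4.0, 3.0])
print((w @ x + w0) / np.linalg.norm(w))

# distance between the lines g(x) = -1 and g(x) = 1: 2 / ||w||
print(2 / np.linalg.norm(w))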
Is it possible to distinguish between $C_1$ and $C_0$ based on the value of $x$?
We need to find a separating hyperplane (or a line in 2D)
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
## training data generation
np.random.seed(1)
x1 = 9*np.random.rand(100, 1)
x2 = 6*np.random.rand(100, 1)
g = 1/3*(2*x1 + 3*x2 - 18)
C1 = np.where(g >= 1)[0]
C0 = np.where(g < -1)[0]
xp = np.linspace(0,9,100).reshape(-1,1)
ypt = -2/3*xp + 6
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', linewidth = 3, label = 'True')
plt.title('Linearly and Strictly Separable Classes')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 3)
plt.xlim([0, 9])
plt.ylim([0, 6])
plt.show()
With this setup, the goal is to find a linear classifier that separates the data points using a decision boundary defined by:
The Key Insight in SVM: Introducing a Margin
In the traditional perceptron algorithm, the goal is simply to find any hyperplane that separates the data. However, SVM introduces a crucial improvement - maximizing the margin between the classes to improve robustness.
If the data is strictly separable, we can always establish a margin condition:
where $b$ is a positive constant representing the margin distance.
Why Does This Work?
The above margin condition may seem arbitrary at first, but it is a crucial step in building a robust classifier. Here's why:
By requiring points to satisfy these inequalities, we ensure that no data points lie arbitrarily close to the decision boundary.
This margin condition provides a 'buffer zone' that reduces the model's sensitivity to small perturbations or noise in the data.
Scaling to Simplify the Problem
The next step is to scale the inequalities such that the margin boundary conditions are set to $\pm 1$. This step simplifies the mathematics without changing the geometry of the problem.
If the data is strictly separable, we can always establish a margin condition:
At first glance, it may seem counterintuitive that scaling is acceptable. However, the key insight is that only the relative positioning of points matters in defining the decision boundary. Scaling by $b$ preserves the geometry of the hyperplane and the relative margins. Therefore, the resulting classifier remains functionally identical but now operates under simplified conditions.
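To spell out the scaling step, a short sketch of the algebra (consistent with the margin condition above, where $b > 0$):

$$\omega^T x + \omega_0 \geq b \;\;\Longleftrightarrow\;\; \left(\tfrac{1}{b}\,\omega\right)^T x + \tfrac{1}{b}\,\omega_0 \geq 1$$

Dividing both $\omega$ and $\omega_0$ by $b$ leaves the hyperplane $\omega^T x + \omega_0 = 0$ unchanged as a set of points, but places the margin boundaries exactly at $\pm 1$.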
Now we will begin studying Support Vector Machine (SVM) algorithms. It is important to keep in mind that the concept of a 'buffer zone' - a critical idea linked to the robustness of the classifier - is first introduced here in SVM. This buffer zone ensures that data points are not positioned too close to the decision boundary, enhancing the model's ability to generalize effectively and resist overfitting in the presence of noise or small perturbations. This key innovation distinguishes SVM from earlier linear classifiers like the Perceptron.
# see how data are generated
xp = np.linspace(0,9,100).reshape(-1,1)
ypt = -2/3*xp + 6
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', alpha = 0.3, linewidth = 3, label = 'True')
plt.plot(xp, ypt-1, '--k', alpha = 0.3)
plt.plot(xp, ypt+1, '--k', alpha = 0.3)
plt.title('Linearly and Strictly Separable Classes')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 3)
plt.xlim([0, 9])
plt.ylim([0, 6])
plt.show()
Settings:
$n \;(=2)$ features
$m = N + M$ data points in training set
We will approach SVM as an optimization problem.
In a naive expression:
To simplify the notation and formulation, we will introduce an artificial feature by setting $x_0 = 1$
These constraints ensure that all correctly classified points lie outside the margin boundaries, establishing a buffer zone that enhances the classifier's robustness. However, the appropriate objective function to be minimized has not yet been determined.
In Matrix Form
Optimization in Form 1 (Standard Form with Separate Bias Term)
# CVXPY
import cvxpy as cvx
X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])
X1 = np.asmatrix(X1)
X0 = np.asmatrix(X0)
N = X1.shape[0]
M = X0.shape[0]
w0 = cvx.Variable([1,1])
w = cvx.Variable([2,1])
obj = cvx.Minimize(1)   # constant objective: a pure feasibility problem, only the constraints matter
const = [w0 + X1@w >= 1, w0 + X0@w <= -1]
prob = cvx.Problem(obj, const).solve()
w0 = w0.value
w = w.value
xp = np.linspace(0,9,100).reshape(-1,1)
yp = - w[0,0]/w[1,0]*xp - w0/w[1,0]
ypt = -2/3*xp + 6
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', alpha = 0.3, label = 'True')
plt.plot(xp, ypt-1, '--k', alpha = 0.3)
plt.plot(xp, ypt+1, '--k', alpha = 0.3)
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Attempt 1')
plt.title('Linearly and Strictly Separable Classes')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 3)
plt.xlim([0, 9])
plt.ylim([0, 6])
plt.show()
Interestingly, this optimization process leads to one of the possible linear boundaries that can separate the data.
Optimization in Form 2
import cvxpy as cvx
N = C1.shape[0]
M = C0.shape[0]
X1 = np.hstack([np.ones([N,1]), x1[C1], x2[C1]])
X0 = np.hstack([np.ones([M,1]), x1[C0], x2[C0]])
X1 = np.asmatrix(X1)
X0 = np.asmatrix(X0)
w = cvx.Variable([3,1])
obj = cvx.Minimize(1)
const = [X1@w >= 1, X0@w <= -1]
prob = cvx.Problem(obj, const).solve()
w = w.value
xp = np.linspace(0,9,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]
ypt = -2/3*xp + 6
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', alpha = 0.3, label = 'True')
plt.plot(xp, ypt-1, '--k', alpha = 0.3)
plt.plot(xp, ypt+1, '--k', alpha = 0.3)
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Attempt 1')
plt.title('Linearly and Strictly Separable Classes')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 3)
plt.xlim([0, 9])
plt.ylim([0, 6])
plt.show()
It is evident that the above implementation may fail to produce a valid boundary when the data is not linearly separable. In real-world scenarios, datasets often contain noise, errors, or outliers that deviate from the true underlying patterns.
Consequently, under these conditions, a strict margin condition (e.g., $\;y_i (\omega_0 + \omega^T x_i) \geq 1$) may be impossible to satisfy for all data points.
X1 = np.hstack([np.ones([N,1]), x1[C1], x2[C1]])
X0 = np.hstack([np.ones([M,1]), x1[C0], x2[C0]])
outlier = np.array([1, 8.5, 1.8]).reshape(1,-1)
X0 = np.vstack([X0, outlier])
X1 = np.asmatrix(X1)
X0 = np.asmatrix(X0)
plt.figure(figsize = (6, 4))
plt.plot(X1[:,1], X1[:,2], 'ro', alpha = 0.4, label = 'C1')
plt.plot(X0[:,1], X0[:,2], 'bo', alpha = 0.4, label = 'C0')
plt.title('When Outliers Exist')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 3)
plt.xlim([0, 9])
plt.ylim([0, 6])
plt.show()
w = cvx.Variable([3,1])
obj = cvx.Minimize(1)
const = [X1@w >= 1, X0@w <= -1]
prob = cvx.Problem(obj, const).solve()
print(w.value)
As observed, no feasible solution exists when the data is not linearly separable using a strict margin condition.
Allowing Misclassifications
To address this, we introduce the idea of soft margins - a critical enhancement to the original SVM formulation. In this approach:
In real-world scenarios, data is often not linearly separable due to noise, errors, or outliers. To address this, we introduce a more flexible framework known as the Soft Margin SVM, which allows some training points to violate the margin conditions.
Relaxing the Constraints
Key Idea Behind Slack Variables
Each slack variable corresponds to a specific data point:
Both $u$ and $v$ are non-negative ($u_i \geq 0$ and $v_i \geq 0$)
Optimization Problem for the Non-separable Case
The objective is now to minimize the total margin violations (slack variables), effectively controlling the number of misclassified points and their deviation from the margin.
w = cvx.Variable([3,1])
u = cvx.Variable([N,1])
v = cvx.Variable([M+1,1])
obj = cvx.Minimize(np.ones((1,N))@u + np.ones((1,M+1))@v)
const = [X1@w >= 1-u, X0@w <= -(1-v), u >= 0, v >= 0 ]
prob = cvx.Problem(obj, const).solve()
w = w.value
xp = np.linspace(0,9,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]
ypt = -2/3*xp + 6
plt.figure(figsize = (6, 4))
plt.plot(X1[:,1], X1[:,2], 'ro', alpha = 0.4, label = 'C1')
plt.plot(X0[:,1], X0[:,2], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', alpha = 0.3, label = 'True')
plt.plot(xp, ypt-1, '--k', alpha = 0.3)
plt.plot(xp, ypt+1, '--k', alpha = 0.3)
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Attempt 2')
plt.plot(xp, yp-1/w[2,0], '--g')
plt.plot(xp, yp+1/w[2,0], '--g')
plt.title('When Outliers Exist')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 3)
plt.xlim([0, 9])
plt.ylim([0, 6])
plt.show()
The presence of outliers can significantly affect the position and orientation of the hyperplane. As a result, the hyperplane may no longer accurately represent the optimal division between classes. This issue arises because, in the second attempt, the linear classifier aims to separate the data as much as possible, making it highly sensitive to noise or extreme points.
Further Improvement: Large Margin for Better Generalization
The core insight in SVM is that a large margin not only separates the data effectively but also improves the model's generalization on unseen data.
By maximizing the margin (i.e., buffer zone), the model achieves:
Multiple objectives
Use gamma ($\gamma$) as a weighting between the following:
g = 2
w = cvx.Variable([3,1])
u = cvx.Variable([N,1])
v = cvx.Variable([M+1,1])
obj = cvx.Minimize(cvx.norm(w,2) + g*(np.ones((1,N))@u + np.ones((1,M+1))@v))
const = [X1@w >= 1-u, X0@w <= -(1-v), u >= 0, v >= 0 ]
prob = cvx.Problem(obj, const).solve()
w = w.value
xp = np.linspace(0,9,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]
ypt = -2/3*xp + 6
plt.figure(figsize = (6, 4))
plt.plot(X1[:,1], X1[:,2], 'ro', alpha = 0.4, label = 'C1')
plt.plot(X0[:,1], X0[:,2], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', alpha = 0.3, label = 'True')
plt.plot(xp, ypt-1, '--k', alpha = 0.3)
plt.plot(xp, ypt+1, '--k', alpha = 0.3)
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Attempt 3')
plt.plot(xp, yp-1/w[2,0], '--g')
plt.plot(xp, yp+1/w[2,0], '--g')
plt.title('When Outliers Exist')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 3)
plt.xlim([0, 9])
plt.ylim([0, 6])
plt.show()
By shifting focus from simply separating the data to maximizing the margin, SVM introduces a powerful improvement that enhances performance, particularly in the presence of noise and outliers. This margin-based approach is a fundamental reason why SVMs are widely regarded as one of the most effective classifiers in machine learning.
Probably the most popular/influential classification algorithm
A hyperplane based classifier (like the Perceptron)
Additionally uses the maximum margin principle
$$ \text{maximize } \{\text{minimum distance}\} $$
X = np.vstack([X1, X0])
y = np.vstack([np.ones([N,1]), -np.ones([M+1,1])])
m = N + M + 1
g = 2
w = cvx.Variable([3,1])
d = cvx.Variable([m,1])
obj = cvx.Minimize(cvx.norm(w,2) + g*(np.ones([1,m])@d))
const = [cvx.multiply(y, X@w) >= 1-d, d >= 0]
prob = cvx.Problem(obj, const).solve()
w = w.value
xp = np.linspace(0,9,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]
ypt = -2/3*xp + 6
plt.figure(figsize = (6, 4))
plt.plot(X1[:,1], X1[:,2], 'ro', alpha = 0.4, label = 'C1')
plt.plot(X0[:,1], X0[:,2], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', alpha = 0.3, label = 'True')
plt.plot(xp, ypt-1, '--k', alpha = 0.3)
plt.plot(xp, ypt+1, '--k', alpha = 0.3)
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Attempt 3')
plt.plot(xp, yp-1/w[2,0], '--g')
plt.plot(xp, yp+1/w[2,0], '--g')
plt.title('When Outliers Exist')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 3)
plt.xlim([0, 9])
plt.ylim([0, 6])
plt.show()
Throughout the development of SVM, we encourage you to recognize the continuous improvement process - a progression that starts with a simple scenario, encounters new challenges, and subsequently introduces innovative ideas to address those challenges.
We began with a straightforward linear classifier designed to separate clean, linearly separable data.
Upon encountering non-separable data due to noise, errors, or outliers, we introduced slack variables to relax the margin conditions, enhancing the model's robustness.
To further improve generalization on unseen data, we adopted the concept of a large margin, which maximizes the distance between the decision boundary and the closest data points, making the model more resilient to noise and improving overall performance.
This step-by-step refinement mirrors the natural progression of real-world problem-solving - starting with a basic concept, confronting obstacles, and evolving the model to achieve better performance and broader applicability.
from IPython.display import YouTubeVideo
YouTubeVideo('IRrGVQV8vZ8?si=8H3cmkAUyNIu2NDb&start=2750', width = "560", height = "315")
Perceptron $\rightarrow$ SVM $\rightarrow$ Logistic Regression
Perceptron: Utilizes only the sign of the data to determine class labels, relying solely on binary outcomes without considering the distance of points from the decision boundary.
SVM: Focuses on only a subset of data points - specifically, those that lie directly on the margin (called support vectors) - to define the optimal hyperplane. While this approach effectively maximizes the margin, it ignores other data points that may still carry useful distance information.
Logistic Regression: To better utilize the distance information of all data points, we turn to logistic regression, which considers the positions of all points in the dataset. This broader perspective enhances the model's ability to capture patterns and improve generalization.
Let's start with the case of only two data points. Which linear classification boundary would be considered better?
On the left: The classification boundary is positioned near the center of the data points.
On the right: The classification boundary is biased toward one of the data points.
The boundary on the right is positioned too close to one of the data points, leaving minimal space (margin) on that side. This smaller margin increases the risk of misclassifying new data points that may fall within this narrow buffer zone. The boundary that maintains a balanced margin between the data points is preferred for its improved stability and performance.
Basic idea:
Why?
Roughly speaking, this optimization of $\max \prod_i \lvert h_i \rvert$ tends to position a hyperplane in the middle of two classes
Considering all data points may seem more effective than SVM, which relies only on the support vectors near the margin. However, some data points may be located far from the decision boundary. These extreme points (or outliers) can negatively impact the model’s performance by overly influencing the decision boundary due to the multiplication of all $\lvert h_i \rvert$.
To mitigate this negative impact, we apply a squeezing function that compresses the range of extreme values, mapping $(-\infty, +\infty)$ to $(0,1)$. This transformation ensures that points far from the decision boundary have minimal impact on the model's outcome.
By squeezing the values in this way, we create a more robust model that effectively emphasizes the most informative points near the decision boundary while reducing the adverse effects of outliers.
Here, let's examine the sigmoid function, also known as the logistic function, denoted as $\sigma(z)$, which often serves as a squeezing function.
# plot a sigmoid function
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
z = np.linspace(-4,4,100)
s = 1/(1 + np.exp(-z))
plt.figure(figsize = (6, 3))
plt.plot(z, s)
plt.xlim([-4, 4])
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
The output of the logistic function is bounded between 0 and 1, making it suitable for binary classification tasks.
The logistic function compresses large positive and negative values:
One significant advantage of the logistic function is that its output can be interpreted as a probability.
The Role of Distance in Classification
We have studied the Perceptron, SVM, and Logistic Regression as key algorithms for classification. The primary distinction among these methods lies in how each approach defines and utilizes the distance from the decision boundary to the data points. The figures below provide intuitive visual insights into these differences.
Note:
Although the term "regression" appears in logistic regression, it is, in fact, a classification algorithm. This terminology may seem counterintuitive at first. However, upon closer examination, logistic regression can be understood as a regression-based approach in which the distances between data points and the linear decision boundary are transformed via a logistic (sigmoid) function. This transformation maps the outputs to a probability distribution within the range [0, 1], thereby enabling binary classification. Notably, the transformed distance is continuous, which aligns with the regression-like nature of the model.
Goal: we need to fit $\omega$ to our data
For a single data point $(x,y)$ with parameters $\omega$
It can be compactly written as (since $y$ is either $0$ or $1$)
For $m$ training data points, the likelihood function of the parameters:
It is easier to work with the log likelihood.
Then, the logistic regression problem can be solved as a (convex) optimization problem as
To use the gradient descent method, we need to find the derivative of it
Think about a single data point with a single parameter $\omega$ for simplicity.
For $m$ training data points with the parameters $\omega$
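A sketch of the resulting gradient, reconstructed to match the gradient-descent code below (with $h_\omega(x) = \sigma(\omega^T x)$ and labels $y \in \{0, 1\}$):

$$\nabla_{\omega}\, \ell(\omega) = \sum_{i=1}^{m} \left(y_i - h_\omega(x_i)\right) x_i = X^T\left(y - h_\omega(X)\right)$$

Since the log likelihood is to be maximized, the code below performs gradient descent on $-\ell$, using $\nabla_\omega(-\ell) = -X^T\left(y - h_\omega(X)\right)$.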
# data generation
m = 100
w = np.array([[-6], [2], [1]])
X = np.hstack([np.ones([m,1]), 4*np.random.rand(m,1), 4*np.random.rand(m,1)])
w = np.asmatrix(w)
X = np.asmatrix(X)
y = 1/(1 + np.exp(-X*w)) > 0.5
C1 = np.where(y == True)[0]
C0 = np.where(y == False)[0]
y = np.empty([m,1])
y[C1] = 1
y[C0] = 0
plt.figure(figsize = (6, 4))
plt.plot(X[C1,1], X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,1], X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 1)
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
# be careful with matrix shape
def h(x,w):
return 1/(1 + np.exp(-x*w))
Gradient descent
w = np.zeros([3,1])
alpha = 0.01
for i in range(10000):
df = -X.T*(y - h(X,w))
w = w - alpha*df
print(w)
xp = np.linspace(0,4,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]
plt.figure(figsize = (6, 4))
plt.plot(X[C1,1], X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,1], X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 4, label = 'Logistic Regression')
plt.title('Logistic Regression')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 1)
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
We can re-order the training data so
The likelihood function
The log likelihood function
Since $\ell$ is a concave function of $\omega$, the logistic regression problem can be solved as a convex optimization problem
Refer to cvxpy functions
scalar function: cvx.sum(x) $= \sum_{ij} x_{ij}$
elementwise function: cvx.logistic(x) $= \log \left(1+e^{x} \right)$
import cvxpy as cvx
w = cvx.Variable([3, 1])
obj = cvx.Maximize(y.T@X@w - cvx.sum(cvx.logistic(X@w)))
prob = cvx.Problem(obj).solve()
w = w.value
xp = np.linspace(0,4,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]
plt.figure(figsize = (6, 4))
plt.plot(X[C1,1], X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,1], X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 4, label = 'Logistic Regression')
plt.title('Logistic Regression')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 1)
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
In a more compact form
Change $y \in \{0,+1\} \; \rightarrow \; y \in \{-1,+1\}$ for computational convenience
y = np.empty([m, 1])
y[C1] = 1
y[C0] = -1
y = np.asmatrix(y)
w = cvx.Variable([3, 1])
obj = cvx.Minimize(cvx.sum(cvx.logistic(-cvx.multiply(y, X@w))))
prob = cvx.Problem(obj).solve()
w = w.value
xp = np.linspace(0,4,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]
plt.figure(figsize = (6, 4))
plt.plot(X[C1,1], X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,1], X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 4, label = 'Logistic Regression')
plt.title('Logistic Regression')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 1)
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
X = X[:, 1:3]
X.shape
from sklearn import linear_model
clf = linear_model.LogisticRegression(solver = 'lbfgs')
clf.fit(np.asarray(X), np.ravel(y))
clf.coef_
clf.intercept_
w0 = clf.intercept_[0]
w1 = clf.coef_[0,0]
w2 = clf.coef_[0,1]
xp = np.linspace(0,4,100).reshape(-1,1)
yp = - w1/w2*xp - w0/w2
plt.figure(figsize = (6, 4))
plt.plot(X[C1,0], X[C1,1], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,0], X[C0,1], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 4, label = 'Logistic Regression')
plt.title('Logistic Regression')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 1)
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
You might have seen the concept of entropy in physics. In fact, the concept of entropy is closely related to the cross-entropy that we just encountered in logistic regression.
Entropy in Physics
In statistical physics, entropy measures the uncertainty or disorder of a system:
where:
Entropy quantifies how much "randomness" or "information" is needed to describe the system's configuration.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
p = np.linspace(0.01, 0.99, 100)
S = -p*np.log(p) - (1-p)*np.log(1-p)
plt.figure(figsize = (6, 4))
plt.plot(p, S, linewidth = 3)
plt.xlabel('p')
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
Cross-Entropy in Logistic Regression
In logistic regression, cross-entropy measures how well the predicted probability distribution matches the true distribution:
where:
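As a small concrete illustration (a sketch assuming true labels $y \in \{0, 1\}$ and predicted probabilities $p$), the binary cross-entropy can be computed as:
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y*np.log(p) + (1 - y)*np.log(1 - p))

y_true = np.array([1, 0, 1, 1])            # hypothetical true labels
p_hat = np.array([0.9, 0.2, 0.7, 0.4])     # hypothetical predicted probabilities
print(binary_cross_entropy(y_true, p_hat))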
Connecting the Concepts
Multiclass classification is an extension of binary classification where the goal is to assign each input instance to one of three or more distinct classes. Unlike binary classification, where the model predicts between two outcomes, multiclass classification requires handling multiple labels effectively.
Generalization to more than 2 classes is straightforward
(1) one vs. all (one vs. rest)
(2) one vs. one
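Both strategies are available off the shelf in scikit-learn. A minimal sketch with hypothetical three-class data (the data and variable names here are illustrative only):
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

# hypothetical 3-class toy data
Xm = np.random.rand(150, 2)
ym = np.random.randint(0, 3, 150)

ovr = OneVsRestClassifier(LogisticRegression()).fit(Xm, ym)   # one vs. rest
ovo = OneVsOneClassifier(LogisticRegression()).fit(Xm, ym)    # one vs. one

print(ovr.predict([[0.5, 0.5]]), ovo.predict([[0.5, 0.5]]))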
Softmax (Multinomial Logistic Regression)
Directly models the probability distribution over multiple classes
Using the soft-max function instead of the logistic function (refer to UFLDL Tutorial)
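A minimal NumPy sketch of the softmax function itself (shifting by the maximum is a standard trick for numerical stability):
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / np.sum(e)        # normalized so the outputs sum to 1

print(softmax(np.array([2.0, 1.0, 0.1])))   # a valid probability distribution over 3 classes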
One-Hot Encoding
One-hot encoding is a technique used to represent categorical variables as numerical vectors. It is commonly applied in machine learning, particularly for tasks involving classification or categorical feature encoding.
Suppose you have three classes for a classification problem:
In this representation:
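For instance, with three classes labeled 0, 1, and 2, a minimal one-hot encoding sketch (with hypothetical labels) is:
import numpy as np

labels = np.array([0, 2, 1, 0])   # hypothetical class labels
one_hot = np.eye(3)[labels]       # each row contains a single 1 at the class index
print(one_hot)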
In real-world scenarios, data is often non-linearly separable due to complex patterns, interactions, or overlapping features. Nonlinear classification refers to the process of classifying data points that cannot be separated by a straight line (or hyperplane in higher dimensions). Unlike linear classifiers, which rely on linear decision boundaries, nonlinear classifiers are designed to capture complex patterns and relationships within the data.
One method to achieve nonlinear classification is to transform the original features into a higher-dimensional space where linear separation becomes possible.
Kernels: make linear model work in nonlinear settings
1D Example
Consider the binary classification problem
2D Example
Each example is defined by two features, $ x = \begin{bmatrix}x_1\\ x_2 \end{bmatrix} $
No linear separator exists for this data
Each example now has three features (derived from the old representation)
The data now becomes linearly separable in the new representation
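A minimal sketch of the quadratic feature map used later in this section (the code there also prepends the constant feature $x_0 = 1$), $\phi(x) = \left(x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2\right)$, which lifts each 2D point into a 3D space where a linear separator can exist:
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2)*x1*x2, x2**2])   # 2D input -> 3D feature space

print(phi(np.array([0.5, -0.5])))   # an illustrative point near the origin
print(phi(np.array([1.8, 1.6])))    # an illustrative point farther from the origin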
Visual Illustration of Kernel Mapping
from IPython.display import YouTubeVideo
YouTubeVideo('3liCbRZPrZA', width = "560", height = "315")
Selecting the Appropriate Kernel
Determining the appropriate kernel function is a crucial aspect of effectively applying kernel-based methods. While we have demonstrated that applying the right kernel enables linear classification techniques to handle nonlinearly distributed data, we have not yet addressed the process of selecting this optimal kernel.
Throughout our previous discussions, we assumed that the kernel function was predefined. However, identifying the most suitable kernel for a given dataset remains a non-trivial challenge. Techniques such as the kernel trick offer a systematic approach to finding effective kernel functions, but we will not explore this topic further.
The primary reason for this decision is that in deep learning, the model architecture inherently learns effective feature transformations directly from the data. Unlike traditional kernel methods, modern deep learning frameworks are capable of automatically discovering complex and data-driven feature mappings, eliminating the need for manually selecting or designing an optimal kernel.
In essence, while understanding kernel selection is valuable in classical machine learning, deep learning models provide a more flexible and adaptive solution by learning these transformations directly during training.
X1 = np.array([[-1.1,0],[-0.3,0.1],[-0.9,1],[0.8,0.4],[0.4,0.9],[0.3,-0.6],
[-0.5,0.3],[-0.8,0.6],[-0.5,-0.5]])
X0 = np.array([[-1,-1.3], [-1.6,2.2],[0.9,-0.7],[1.6,0.5],[1.8,-1.1],[1.6,1.6],
[-1.6,-1.7],[-1.4,1.8],[1.6,-0.9],[0,-1.6],[0.3,1.7],[-1.6,0],[-2.1,0.2]])
X1 = np.asmatrix(X1)
X0 = np.asmatrix(X0)
plt.figure(figsize = (6, 4))
plt.plot(X1[:,0], X1[:,1], 'ro', label = 'C1')
plt.plot(X0[:,0], X0[:,1], 'bo', label = 'C0')
plt.title('SVM for Nonlinear Data')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 1)
plt.show()
N = X1.shape[0]
M = X0.shape[0]
X = np.vstack([X1, X0])
y = np.vstack([np.ones([N,1]), -np.ones([M,1])])
X = np.asmatrix(X)
y = np.asmatrix(y)
m = N + M
Z = np.hstack([np.ones([m,1]), np.square(X[:,0]), np.sqrt(2)*np.multiply(X[:,0],X[:,1]), np.square(X[:,1])])
g = 10
w = cvx.Variable([4, 1])
d = cvx.Variable([m, 1])
obj = cvx.Minimize(cvx.norm(w, 2) + g*np.ones([1,m])@d)
const = [cvx.multiply(y, Z@w) >= 1-d, d>=0]
prob = cvx.Problem(obj, const).solve()
w = w.value
print(w)
# to plot
[X1gr, X2gr] = np.meshgrid(np.arange(-3,3,0.1), np.arange(-3,3,0.1))
Xp = np.hstack([X1gr.reshape(-1,1), X2gr.reshape(-1,1)])
Xp = np.asmatrix(Xp)
m = Xp.shape[0]
Zp = np.hstack([np.ones([m,1]), np.square(Xp[:,0]), np.sqrt(2)*np.multiply(Xp[:,0], Xp[:,1]), np.square(Xp[:,1])])
q = Zp*w
B = []
for i in range(m):
if q[i,0] > 0:
B.append(Xp[i,:])
B = np.vstack(B)
plt.figure(figsize = (6, 4))
plt.plot(X1[:,0], X1[:,1], 'ro', label = 'C1')
plt.plot(X0[:,0], X0[:,1], 'bo', label = 'C0')
plt.plot(B[:,0], B[:,1], 'gs', markersize = 10, alpha = 0.1, label = 'SVM')
plt.title('SVM with Kernel')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 1)
plt.show()
X1 = np.array([[-1.1,0],[-0.3,0.1],[-0.9,1],[0.8,0.4],[0.4,0.9],[0.3,-0.6],
[-0.5,0.3],[-0.8,0.6],[-0.5,-0.5]])
X0 = np.array([[-1,-1.3], [-1.6,2.2],[0.9,-0.7],[1.6,0.5],[1.8,-1.1],[1.6,1.6],
[-1.6,-1.7],[-1.4,1.8],[1.6,-0.9],[0,-1.6],[0.3,1.7],[-1.6,0],[-2.1,0.2]])
X1 = np.asmatrix(X1)
X0 = np.asmatrix(X0)
plt.figure(figsize = (6, 4))
plt.plot(X1[:,0], X1[:,1], 'ro', label = 'C1')
plt.plot(X0[:,0], X0[:,1], 'bo', label = 'C0')
plt.title('Logistic Regression for Nonlinear Data')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 4)
plt.show()
N = X1.shape[0]
M = X0.shape[0]
X = np.vstack([X1, X0])
y = np.vstack([np.ones([N,1]), -np.ones([M,1])])
X = np.asmatrix(X)
y = np.asmatrix(y)
m = N + M
Z = np.hstack([np.ones([m,1]), np.sqrt(2)*X[:,0], np.sqrt(2)*X[:,1], np.square(X[:,0]),
np.sqrt(2)*np.multiply(X[:,0], X[:,1]), np.square(X[:,1])])
w = cvx.Variable([6, 1])
obj = cvx.Minimize(cvx.sum(cvx.logistic(-cvx.multiply(y,Z @ w))))
prob = cvx.Problem(obj).solve()
w = w.value
# to plot
[X1gr, X2gr] = np.meshgrid(np.arange(-3,3,0.1), np.arange(-3,3,0.1))
Xp = np.hstack([X1gr.reshape(-1,1), X2gr.reshape(-1,1)])
Xp = np.asmatrix(Xp)
m = Xp.shape[0]
Zp = np.hstack([np.ones([m,1]), np.sqrt(2)*Xp[:,0], np.sqrt(2)*Xp[:,1], np.square(Xp[:,0]),
np.sqrt(2)*np.multiply(Xp[:,0], Xp[:,1]), np.square(Xp[:,1])])
q = Zp*w
B = []
for i in range(m):
if q[i,0] > 0:
B.append(Xp[i,:])
B = np.vstack(B)
plt.figure(figsize = (6, 4))
plt.plot(X1[:,0], X1[:,1], 'ro', label = 'C1')
plt.plot(X0[:,0], X0[:,1], 'bo', label = 'C0')
plt.plot(B[:,0], B[:,1], 'gs', markersize = 10, alpha = 0.1, label = 'Logistic Regression')
plt.title('Logistic Regression with Kernel')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 4)
plt.show()
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')