Classification

Table of Contents



1. Classification


In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('IRrGVQV8vZ8?si=_hkRnh2jEhJVDXsY&start=931', width = "560", height = "315")
Out[ ]:

Classification is another core task in machine learning, aimed at categorizing data into predefined classes or categories based on input features. Unlike regression, where the target variable is continuous, classification deals with discrete outputs. It plays a crucial role in a variety of real-world applications, such as

  • Spam detection: Classifying emails as either spam or not spam
  • Image recognition: Identifying objects, animals, or faces in images
  • Medical diagnosis: Predicting whether a patient has a particular condition based on diagnostic data

Key Characteristics of Classification

  • The output variable $y$ is discrete and often categorical.
  • The goal is to develop a model that accurately determines the class to which a new input belongs.
  • Classification models are evaluated based on their ability to correctly predict class labels for unseen data.

Types of Classification Problems

(1) Binary Classification:

  • Involves two classes (e.g., spam vs. non-spam, positive vs. negative).
  • Often employs decision boundaries to separate the two categories.

(2) Multiclass Classification:

  • Involves more than two classes (e.g., classifying handwritten digits from 0 to 9).
  • Typically treated as an extension of binary classification by combining multiple binary classifiers.

Topics Covered in This Section

To build a solid understanding of classification techniques, we will explore the following key algorithms:

(1) Perceptron:

  • A foundational algorithm inspired by biological neurons, ideal for linearly separable data.

(2) Support Vector Machines (SVM):

  • A powerful technique that identifies the optimal hyperplane to maximize the margin between different classes.

(3) Logistic Regression:

  • A widely used method for binary classification that models the probability of class membership using the logistic function.

2. Perceptron

The Perceptron is one of the simplest types of artificial neural networks and is used for binary classification. It was invented by Frank Rosenblatt in 1958 and serves as the foundation for more advanced neural networks.

To understand how the perceptron works, let's consider a bank loan approval scenario where the bank decides whether to approve or reject a loan application based on specific criteria.


  • For input $x = \begin{bmatrix}x_1\\ \vdots\\ x_d \end{bmatrix}$: the attributes (or features) of a customer

  • For weights $\omega = \begin{bmatrix}\omega_1\\ \vdots\\ \omega_d \end{bmatrix}$: the weight (importance) assigned to each attribute

Suppose a decision can be made based on a simple linear combination of features:


$$\begin{align*} \text{Approve credit if} \; & \sum\limits_{i=1}^{d}\omega_ix_i > \text{threshold}, \\ \text{Deny credit if} \; & \sum\limits_{i=1}^{d}\omega_ix_i < \text{threshold}. \end{align*}$$



$$g(x) = \text{sign} \left(\left( \sum\limits_{i=1}^{d}\omega_ix_i \right)- \text{threshold} \right) = \text{sign}\left(\left( \sum\limits_{i=1}^{d}\omega_ix_i \right)+ \omega_0\right)$$



where $\omega_0 = -\text{threshold}$ and the sign function is defined as:


$$ \text{sign}(x) = \begin{cases} 1, &\text{if }\; x > 0\\ 0, &\text{if }\; x = 0\\ -1, &\text{if }\; x < 0 \end{cases} $$




Introduce an artificial coordinate $x_0 = 1$:

  • To simplify the perceptron formulation, we can include the bias term $\omega_0$ into the weight vector by adding an artificial coordinate $x_0 = 1$ to the input vector $x$. This eliminates the explicit bias term and simplifies the computation.

$$g(x) = \text{sign}\left( \sum\limits_{\color{red}{i=0}}^{d}\omega_ix_i \right)$$

In a vector form


$$g(x) = \text{sign}\left( \omega^T x \right)$$
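To make this concrete, here is a small sketch (the weights and feature values below are made up purely for illustration, not part of the bank example) that evaluates $g(x) = \text{sign}(\omega^T x)$ for a single applicant:

In [ ]:
import numpy as np

# hypothetical weights: [w0 (bias), credit score weight, income weight]
w_toy = np.array([-2.0, 0.03, 0.5])

# one applicant, with the artificial coordinate x0 = 1 prepended
x_toy = np.array([1.0, 70.0, 1.2])

# g(x) = sign(w^T x): +1 -> approve, -1 -> deny
print(np.sign(w_toy @ x_toy))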

Classification boundary: hyperplane $g(x) = 0 \implies \omega^T x = 0$

  • Separates the $d$-dimensional input space into two half-spaces
  • Defined by an outward-pointing normal vector $\omega$
  • With the artificial coordinate $x_0 = 1$, the hyperplane $\omega^T x = 0$ passes through the origin of the augmented space
  • $\omega$ is orthogonal to any vector lying on the hyperplane

$$ \begin{align*} \omega = \begin{bmatrix}\omega_1 \\ \omega_2 \end{bmatrix}, \quad x = \begin{bmatrix} x_1 \\ x_2\end{bmatrix} &\implies g(x) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 = \omega_0 + \omega^T x \qquad \text{or}\\\\\\ \omega = \begin{bmatrix}\omega_0 \\ \omega_1 \\ \omega_2 \end{bmatrix}, \quad x = \begin{bmatrix} 1 \\ x_1 \\ x_2\end{bmatrix} &\implies g(x) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 = \omega^T x \end{align*} $$



Given the vector $ \omega $, the decision boundary is defined by the hyperplane:


$$ \omega^T x = 0 $$

This hyperplane is orthogonal to the vector $ \omega $, meaning $ \omega $ determines the direction perpendicular to the boundary.


  • Blue Points: Since all blue points are located on the same side of the hyperplane relative to $ \omega $, they satisfy:


    $$ \omega^T x > 0 $$
    This corresponds to the region where $ g(x) > 0 $, classifying these points as belonging to Class 1.

  • Green Points: Conversely, the green points are positioned on the opposite side of the hyperplane relative to $ \omega $, satisfying:


    $$ \omega^T x < 0 $$
    This corresponds to the region where $ g(x) < 0 $, classifying these points as belonging to Class 0.

In this linear classification framework, $ \omega $ not only defines the orientation of the boundary but also plays a crucial role in determining which points belong to each class based on their position relative to the hyperplane.


Learning a Hyperplane for Classification

(1) Goal:

The objective is to learn the hyperplane $g_{\omega}(x)=0$ using the given training data. This hyperplane serves as the decision boundary that separates different classes.


(2) How to find $\omega$:

The parameter vector $\omega$ defines the orientation and position of the hyperplane. The classification rule is determined as follows:


  • For all data points belonging to Class 1:
$$g(x) > 0$$
  • For all data points belonging to Class 0:
$$g(x) < 0$$

This formulation highlights the fundamental idea behind linear classifiers, where the decision boundary is defined by a linear function of the input features. By appropriately determining $\omega$, we can effectively separate the data points into their respective classes.


Note that in traditional methods, the vector $\omega$ is often determined by domain experts based on their prior knowledge and insights about the system. However, in machine learning, the objective is to estimate $\omega$ directly from the data without any inherent bias or preconception. This data-driven approach allows the model to adaptively learn the optimal decision boundary, potentially uncovering complex patterns that may not be apparent through manual design.


Key Insights

  • 'Learning from Data' as a Paradigm Shift:

    The concept of "learning from data" represents a significant philosophical shift for many engineers. Traditional engineering methods often rely on explicitly defined models and expert-derived parameters. In contrast, machine learning emphasizes empirical learning, where patterns and relationships are discovered directly from observed data.


  • Estimating $\omega$:

    There are various techniques available to estimate $\omega$, each designed to suit different types of data and modeling requirements. We will explore several key algorithms for effectively estimating $\omega$. These methods form the foundation of machine learning approaches to classification.


2.1. Perceptron Algorithm

We will first walk through the perceptron algorithm and then explore the underlying principles that explain how and why it works.


The perceptron implements


$$h(x) = \text{sign}\left( \omega^Tx \right)$$

Given the training set


$$(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N) \quad \text{where } y_i \in \{-1,1\}$$
  1. Pick a misclassified point

$$ \text{sign}\left(\omega^Tx_n \right) \neq y_n$$
  2. Update the weight vector

$$\omega \leftarrow \omega + y_nx_n$$






Why Do Perceptron Updates Work?


  • Let's look at a misclassified positive example ($y_n = +1$)
    • perceptron (wrongly) thinks $\omega_{old}^T x_n < 0$

  • Updates would be

$$ \begin{align*}\omega_{new} &= \omega_{old} + y_n x_n = \omega_{old} + x_n \\ \\ \omega_{new}^T x_n &= (\omega_{old} + x_n)^T x_n = \omega_{old}^T x_n + x_n^T x_n \geq \omega_{old}^T x_n \end{align*}$$
  • Thus $\omega_{new}^T x_n$ is less negative than $\omega_{old}^T x_n$
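A quick numerical check of this argument, using made-up numbers for a single misclassified positive example:

In [ ]:
import numpy as np

w_old = np.array([-1.0, 0.5])   # arbitrary current weights
x_n = np.array([2.0, 1.0])      # a positive example (y_n = +1)
y_n = 1

print(w_old @ x_n)              # negative, so the point is misclassified

w_new = w_old + y_n*x_n         # perceptron update
print(w_new @ x_n)              # increased by x_n^T x_n compared to before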




Note:

(1) Perceptron Update Rule (discrete version):

$$\omega \leftarrow \omega \pm 1 \cdot x_n$$

(2) Gradient Descent Update Rule (continuous version):

$$\omega \leftarrow \omega - \alpha \nabla_{\omega} f$$

Understanding this similarity provides insight into the foundational mechanisms of various learning algorithms, illustrating how they iteratively adjust model parameters to enhance performance.

Summary

The perceptron is a simple yet powerful model for binary classification tasks like bank loan approval. It classifies applicants based on features like credit score, income, and debt. While it works well for linearly separable data, it has limitations for more complex datasets. Understanding how the perceptron learns and updates its weights is fundamental to understanding modern neural networks.






This diagram symbolically illustrates the Perceptron, a fundamental model in machine learning. The Perceptron operates as follows:


(1) Input Stage

  • The model takes multiple inputs, denoted as $ x_0, x_1, \cdots, x_d $, each associated with a corresponding weight $ \omega_0, \omega_1, \cdots, \omega_d $.

  • These weights are key parameters that the model adjusts during the learning process.


(2) Weighted Sum Calculation

  • Each input is multiplied by its respective weight, and the results are summed:

$$ z = \omega_0 x_0 + \omega_1 x_1 + \cdots + \omega_d x_d $$

(3) Activation Function

  • The summation result $ z $ is passed through the sign function, which determines the predicted output:

$$ \hat{y} = \text{sign}(z) $$
  • The sign function outputs either $ +1 $ or $ -1 $, representing the two possible classes in binary classification.

(4) Weight Update (Learning Process)

  • The predicted output $ \hat{y} $ is compared with the true label $ y $.
  • If the prediction is incorrect, the weights are updated as follows:

$$ \omega \leftarrow \omega + y_n x_n $$
  • This iterative adjustment is the foundation of the Perceptron Learning Algorithm, ensuring that the decision boundary improves over time based on the training data — a clear example of data-driven learning.

Important Insight for Deep Learning

This diagram is also significant because the Perceptron can be viewed as a simple neuron in an Artificial Neural Network (ANN). When we study deep learning later, this structure will serve as the fundamental building block of more complex models. Understanding this visualization now will provide valuable intuition when exploring neural networks and their hierarchical architecture.


Perceptron loss function

If you do not want to explicitly check whether each point is misclassified or not, you can write the perceptron loss function in a more compact expression that automatically accumulates contributions only from misclassified points.

The loss for an individual sample is defined as:


$$ \mathscr{L}(\omega) = \max \left\{ 0, -y_n \cdot \left(\omega^T x_n \right)\right\} $$
  • Loss $\mathscr{L}(\omega) = 0$ on examples where perceptron is correct, i.e., $y_n \cdot \left(\omega^T x_n \right) > 0$

  • Loss $\mathscr{L}(\omega) > 0$ on examples where perceptron misclassified, i.e., $y_n \cdot \left(\omega^T x_n \right) < 0$

Note:

  • $\text{sign}\left(\omega^T x_n \right) \neq y_n$ is equivalent to $ y_n \cdot \left(\omega^T x_n \right) < 0$

  • $\text{ReLU}(z) = \max(0, z)$: the Rectified Linear Unit (ReLU) will be revisited later in the discussion, as it plays a crucial role not only in perceptron loss formulation but also in modern deep learning architectures where it is widely used as an activation function.


This function returns zero when the point is correctly classified and returns a positive value proportional to the margin violation when misclassified.


Compact Expression for Total Perceptron Loss

The total loss across all samples in the dataset $D$ is given by:


$$ \mathscr{L}(\omega) = \sum_{n =1}^{m} \max \left\{ 0, -y_n \cdot \left(\omega^T x_n \right)\right\} $$

This summation aggregates the loss over all training samples and only penalizes incorrectly classified points.
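As a minimal sketch (assuming a data matrix X that already includes the artificial coordinate $x_0 = 1$ and labels y in $\{-1, +1\}$), this loss can be written in a few lines of NumPy:

In [ ]:
import numpy as np

def perceptron_loss(w, X, y):
    # y_n * (w^T x_n) for every sample; negative entries are misclassified
    margins = y*(X @ w)
    # only misclassified points contribute to the loss
    return np.sum(np.maximum(0, -margins))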


2.2. Perceptron in Python


By adding an artificial coordinate $x_0 = 1$, we simplify the perceptron implementation and avoid the need to handle the bias separately. This technique is commonly used in machine learning to streamline calculations and make the formulation more uniform.


$$g(x) = \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0$$



$$ \begin{align*} \omega &= \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}\\ \\ x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)}\\\vdots & \vdots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)}\end{bmatrix} \\ \\ y &= \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}$$
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
# training data generation
m = 100
x1 = 8*np.random.rand(m, 1)
x2 = 7*np.random.rand(m, 1) - 4

g = 0.8*x1 + x2 - 3
In [ ]:
C1 = np.where(g >= 1)
C0 = np.where(g < -1)
print(C1)
(array([ 5, 14, 16, 18, 19, 30, 32, 33, 34, 36, 46, 47, 52, 53, 56, 61, 64,
       66, 67, 70, 71, 72, 74, 77, 80, 82, 84, 91, 97]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0]))
In [ ]:
C1 = np.where(g >= 1)[0]
C0 = np.where(g < -1)[0]
print(C1.shape)
print(C0.shape)
(29,)
(37,)
In [ ]:
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.title('Linearly Separable Classes')
plt.legend(loc = 1)
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.show()

$$ \begin{align*} x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)}\\\vdots & \vdots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)}\end{bmatrix} \\\\ y &= \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}$$
In [ ]:
X1 = np.hstack([np.ones([C1.shape[0],1]), x1[C1], x2[C1]])
X0 = np.hstack([np.ones([C0.shape[0],1]), x1[C0], x2[C0]])
X = np.vstack([X1, X0])

y = np.vstack([np.ones([C1.shape[0],1]), -np.ones([C0.shape[0],1])])

X = np.asmatrix(X)
y = np.asmatrix(y)

$$\omega = \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}$$
$$\omega \leftarrow \omega + yx$$

where $(x, y)$ is a misclassified training point


In [ ]:
w = np.ones([3,1])
w = np.asmatrix(w)

n_iter = y.shape[0]

# sweep through the data repeatedly; update w whenever a point is misclassified
for k in range(n_iter):
    for i in range(n_iter):
        if y[i,0] != np.sign(X[i,:]*w)[0,0]:
            w += y[i,0]*X[i,:].T

print(w)
[[-8.        ]
 [ 2.99076528]
 [ 5.25407248]]

$$ \begin{align*} g(x) &= \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0 \\\\ \implies x_2 &= -\frac{\omega_1}{\omega_2} x_1 - \frac{\omega_0}{\omega_2} \end{align*} $$
In [ ]:
x1p = np.linspace(0,8,100).reshape(-1,1)
x2p = - w[1,0]/w[2,0]*x1p - w[0,0]/w[2,0]

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, c = 'k', linewidth = 3, label = 'perceptron')
plt.xlim([0, 8])
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.legend(loc = 1)
plt.show()

Perceptron using Scikit-Learn

In scikit-learn, the Perceptron includes the bias term by default, so you don't need to add an artificial coordinate manually. This makes it more convenient for implementing and training perceptron models on datasets.



$$ \begin{align*} x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ x_1^{(3)} & x_2^{(3)}\\ \vdots & \vdots \\ x_1^{(m)} & x_2^{(m)}\end{bmatrix} \\\\ y &= \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}$$
In [ ]:
X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])
X = np.vstack([X1, X0])

y = np.vstack([np.ones([C1.shape[0],1]), -np.ones([C0.shape[0],1])])
In [ ]:
from sklearn import linear_model

clf = linear_model.Perceptron(tol=1e-3)
clf.fit(X, np.ravel(y))
Out[ ]:
Perceptron()
In [ ]:
clf.predict([[3, -2]])
Out[ ]:
array([-1.])
In [ ]:
clf.predict([[6, 2]])
Out[ ]:
array([1.])
In [ ]:
clf.coef_
Out[ ]:
array([[3.28886445, 5.97667198]])
In [ ]:
clf.intercept_
Out[ ]:
array([-8.])

$$ \begin{align*} g(x) &= \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0 \\\\ \implies x_2 &= -\frac{\omega_1}{\omega_2} x_1 - \frac{\omega_0}{\omega_2} \end{align*} $$
In [ ]:
w0 = clf.intercept_[0]
w1 = clf.coef_[0,0]
w2 = clf.coef_[0,1]
In [ ]:
x1p = np.linspace(0,8,100).reshape(-1,1)
x2p = - w1/w2*x1p - w0/w2

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, c = 'k', linewidth = 4, label = 'perceptron')
plt.xlim([0, 8])
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.legend(loc = 1)
plt.show()

2.3. The Best Hyperplane Separator?

Limitations

The Perceptron Algorithm successfully identifies a separating hyperplane if the data is linearly separable. However, there are typically many possible hyperplanes that can separate the data. The Perceptron algorithm does not guarantee the optimal hyperplane - it merely finds one of the possible solutions.


Improvements

While the Perceptron identifies one valid separating hyperplane, identifying the best hyperplane requires an optimization framework. This concept is crucial in understanding more advanced models like Support Vector Machines (SVM) and Logistic Regression, which are designed to find optimal decision boundaries with improved performance and generalization.


2.4. Historical Notes

In the late 1960s, Marvin Minsky and Seymour Papert published the influential book "Perceptrons" (1969), where they mathematically demonstrated the limitations of the perceptron. One key result from their work was that single-layer perceptrons cannot solve the XOR problem (exclusive OR), highlighting a significant limitation of early neural networks.


Stagnation in Neural Network Research:

  • The book led to a significant decline in interest and funding for neural networks, as it was seen as proof that neural networks had severe limitations.
  • This period is often referred to as the "AI Winter".

Rediscovery with Multi-Layer Networks:

  • In the 1980s, researchers like Geoffrey Hinton showed that multi-layer perceptrons (MLPs) with backpropagation could learn non-linear functions like XOR.
  • Adding hidden layers and non-linear activation functions allows neural networks to learn complex, non-linear patterns.

3. SVM

As a first classification algorithm that improves upon the simple perceptron, we will study Support Vector Machines (SVM) in this section. To build the foundation for understanding SVMs, we will first examine how to compute the distance from a point to a line. This concept is crucial, as it will later be used to identify and optimize the best possible hyperplane separator.


3.1. Distance from a Line


$$\omega = \begin{bmatrix}\omega_1 \\ \omega_2\end{bmatrix}, \, x = \begin{bmatrix}x_1\\x_2\end{bmatrix} \; \implies g(x) = \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 $$




(1) If $\vec p$ and $\vec q$ are on the decision line


$$ \begin{align*} g\left(\vec p\right) = g\left(\vec q\right) = 0 \quad & \implies \quad \omega_0 + \omega^T \vec p = \omega_0 + \omega^T \vec q = 0 \\ & \implies \quad \omega^T \left( \vec p- \vec q \right) = 0 \end{align*} $$
$$ \begin{align*} & \therefore \, \omega : \text{orthogonal to the line} \\ & \implies \text{tells the direction of the line} \end{align*}$$

(2) Compute $d$, the signed normal distance from the origin to the line


  • If $x$ is on the line and $x = d\frac{\omega}{\lVert \omega \rVert}$ (where $d$ is a normal distance from the origin to the line)

$$ \begin{align*} g(x)& = \omega_0 + \omega^Tx = 0 \; \\ & \Rightarrow \omega_0 + \omega^Td\frac{\omega}{\lVert \omega \rVert} = \omega_0 + d\frac{\omega^T\omega}{\lVert \omega \rVert} = \omega_0 + d\lVert \omega \rVert = 0 \\\\ & \therefore d \, = - \frac{\omega_0}{\lVert \omega \rVert} \end{align*}$$

(3) Compute $h$, the signed normal distance from $x$ to the line


  • For any vector of $x$

$$ x = x_{\perp} + r \frac{\omega}{\lVert \omega \rVert}$$
$$ \omega^Tx = \omega^T \left( x_{\perp} + r \frac{\omega}{\lVert \omega \rVert}\right) = r \frac{\omega^T\omega}{\lVert \omega \rVert} = r \lVert \omega \rVert$$
$$ \begin{align*} g(x) & = \omega_0 + \omega^Tx \\ & = \omega_0 + r \lVert \omega \rVert \qquad (r = d + h) \\ & = \omega_0 + (d +h) \lVert \omega \rVert\\ & = \omega_0 + \left(- \frac{\omega_0}{\lVert \omega \rVert} + h \right)\lVert \omega \rVert\\ & = h \lVert \omega \rVert \end{align*}$$
$$\therefore \; h = \frac{g(x)}{\lVert \omega \rVert} \implies\; \text{orthogonal signed distance from the line}$$
  • $\text{sign}(h)$ indicates whether $x$ lies on the side of the hyperplane toward which $\omega$ points.
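A quick numerical check of this signed-distance formula, using an arbitrary line and point (all numbers made up):

In [ ]:
import numpy as np

w0_line = -3.0
w_line = np.array([0.8, 1.0])                    # line: w0 + w^T x = 0

x_test = np.array([4.0, 2.0])                    # arbitrary test point
h = (w0_line + w_line @ x_test)/np.linalg.norm(w_line)

print(h)    # positive sign: x_test lies on the side that w points toward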




3.2. Distance Between Two Parallel Lines $g(x) = -1$ and $g(x) = 1$

(1) Find a distance between $g(x) = -1$ and $g(x) = 1$

  • Suppose $g(x_1) = -1,\; g(x_2) = 1$

$$ \begin{align*} h_1 &= \frac{g(x_1)}{\lVert \omega \rVert} = \frac{-1}{\lVert \omega \rVert} \\ h_2 &= \frac{g(x_2)}{\lVert \omega \rVert} = \frac{1}{\lVert \omega \rVert} \\\\ \lvert h_2 - h_1 \rvert &= \frac{1}{\lVert \omega \rVert} + \frac{1}{\lVert \omega \rVert} = \frac{2}{\lVert \omega \rVert} \end{align*} $$
  • This distance formula is especially important in SVMs, where maximizing the margin between parallel decision boundaries is key to optimal classification.

(2) Another method to find a distance between $g(x) = -1$ and $g(x) = 1$




  • Suppose $g(x_1) = -1,\; g(x_2) = 1$

$$ \begin{align*} \omega_0 + \omega^Tx_1 &= -1\\ \omega_0 + \omega^Tx_2 &= 1 \end{align*} \; \implies \; \omega^T(x_2 - x_1) = 2$$
$$s = \langle\frac{\omega}{\lVert \omega \rVert}, \; x_2 - x_1 \rangle = \frac{1}{\lVert \omega \rVert}\omega^T(x_2 - x_1) = \frac {2}{\lVert \omega \rVert}$$

3.3. Illustrative Example of SVM

  • Binary classification
    • $C_1$ and $C_0$

  • Inputs or Features

$$x = \begin{bmatrix}x_1 \\ x_2 \end{bmatrix}$$
  • Is it possible to distinguish between $C_1$ and $C_0$ by its values in $x$?

  • We need to find a separating hyperplane (or a line in 2D)


$$ \begin{align*} \omega_0 + \omega_1x_1 + \omega_2x_2 &= 0 \\\\ \omega_0 + \begin{bmatrix}\omega_1 & \omega_2 \end{bmatrix} \begin{bmatrix}x_1 \\ x_2 \end{bmatrix} &= 0\\\\ \omega_0 + \omega^Tx &= 0 \end{align*} $$
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
## training data generation

np.random.seed(1)

x1 = 9*np.random.rand(100, 1)
x2 = 6*np.random.rand(100, 1)

g = 1/3*(2*x1 + 3*x2 - 18)

C1 = np.where(g >= 1)[0]
C0 = np.where(g < -1)[0]
In [ ]:
xp = np.linspace(0,9,100).reshape(-1,1)
ypt = -2/3*xp + 6

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', linewidth = 3, label = 'True')
plt.title('Linearly and Strictly Separable Classes')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 3)
plt.xlim([0, 9])
plt.ylim([0, 6])
plt.show()
  • Given:
    • Unknown hyperplane defined by $\omega$ and $\omega_0$
    • $x$ and its labels $y$

  • Decision making:

$$ \omega_0 + \omega^Tx > 0 \implies x \; \text{belongs to} \; C_1$$$$ \omega_0 + \omega^Tx < 0 \implies x \; \text{belongs to} \; C_0$$

With this setup, the goal is to find a linear classifier that separates data points using a decision boundary defined by:


$$\omega_0 + \omega^Tx = 0$$

The Key Insight in SVM: Introducing a Margin

In the traditional perceptron algorithm, the goal is simply to find any hyperplane that separates the data. However, SVM introduces a crucial improvement - maximizing the margin between the classes to improve robustness.

If the data is strictly separable, we can always establish a margin condition:

  • For all points in Class 1
$$\omega_0 + \omega^Tx > b$$
  • For all points in Class 0
$$\omega_0 + \omega^Tx < -b$$

where $b$ is a positive constant representing the margin offset.


Why Does This Work?

The above margin condition may seem arbitrary at first, but it is a crucial step in building a robust classifier. Here's why:

  • By requiring points to satisfy these inequalities, we ensure that no data points lie arbitrarily close to the decision boundary.

  • This margin condition provides a 'buffer zone' that reduces the model's sensitivity to small perturbations or noise in the data.


Scaling to Simplify the Problem

The next step is to scale the inequalities such that the margin boundary conditions are set to $\pm 1$. This step simplifies the mathematics without changing the geometry of the problem.


If the data is strictly separable, we can always establish a margin condition:

  • For all points in Class 1
$$\omega_0 + \omega^Tx > 1$$
  • For all points in Class 0
$$\omega_0 + \omega^Tx < -1$$

At first glance, it may seem counterintuitive that scaling is acceptable. However, the key insight is that only the relative positioning of points matters in defining the decision boundary. Dividing by $b$ preserves the geometry of the hyperplane and the relative margins. Therefore, the resulting classifier remains functionally identical but now operates under simplified conditions.


$$ \begin{align*} &\omega_0 + \omega^Tx > b, \quad (b>0)\\ \\ \Longleftrightarrow \;&\frac{\omega_0}{b} + \frac{\omega^T}{b}x > 1 \\\\ \Longleftrightarrow \;&\omega'_0 + \omega'^Tx > 1 \end{align*}$$
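The following small sketch (with arbitrary toy numbers) confirms that dividing $\omega_0$ and $\omega$ by a positive constant $b$ leaves every predicted label unchanged:

In [ ]:
import numpy as np

w0_demo, w_demo = -3.0, np.array([0.8, 1.0])
b = 2.5                                              # any positive constant

X_demo = np.array([[4.0, 2.0], [1.0, 1.0], [5.0, 0.5]])

labels_original = np.sign(w0_demo + X_demo @ w_demo)
labels_scaled = np.sign(w0_demo/b + X_demo @ (w_demo/b))

print(np.array_equal(labels_original, labels_scaled))   # True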

Now we will begin studying Support Vector Machine (SVM) algorithms. It is important to keep in mind that the concept of a 'buffer zone' - a critical idea linked to the robustness of the classifier - is first introduced here in SVM. This buffer zone ensures that data points are not positioned too close to the decision boundary, enhancing the model's ability to generalize effectively and resist overfitting in the presence of noise or small perturbations. This key innovation distinguishes SVM from earlier linear classifiers like the Perceptron.


In [ ]:
# see how data are generated

xp = np.linspace(0,9,100).reshape(-1,1)
ypt = -2/3*xp + 6

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', alpha = 0.3, linewidth = 3, label = 'True')
plt.plot(xp, ypt-1, '--k', alpha = 0.3)
plt.plot(xp, ypt+1, '--k', alpha = 0.3)
plt.title('Linearly and Strictly Separable Classes')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 3)
plt.xlim([0, 9])
plt.ylim([0, 6])
plt.show()

3.3.1. The First Attempt

Settings:

  • $n \;(=2)$ features

  • $m = N + M$ data points in training set

    • $N$ belongs to $C_1$ in training set
    • $M$ belongs to $C_0$ in training set

$$ x^{(i)} = \begin{bmatrix} x^{(i)}_1 \\ x^{(i)}_2 \end{bmatrix} \;\text{with}\; \omega = \begin{bmatrix} \omega_1 \\ \omega_2 \end{bmatrix}\qquad \text{or} \qquad x^{(i)} = \begin{bmatrix} 1 \\ x^{(i)}_1 \\ x^{(i)}_2 \end{bmatrix} \;\; \text{with}\; \omega = \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2 \end{bmatrix}$$

We will approach SVM as an optimization problem.

  • $\omega$ and $\omega_0$ are the unknown variables

In naive expression:


$$\begin{align*} \text{minimize} \quad & \text{something} \\ \\ \text{subject to} \quad & \begin{cases} \omega_0 + \omega^Tx^{(1)} \geq1\\ \omega_0 + \omega^Tx^{(2)} \geq1\\ \quad \quad \vdots \\ \omega_0 + \omega^Tx^{(N)}\geq1\\ \end{cases} \\ & \begin{cases} \omega_0 + \omega^Tx^{(N+1)} \leq{-1}\\ \omega_0 + \omega^Tx^{(N+2)} \leq{-1}\\ \quad \quad \vdots \\ \omega_0 + \omega^Tx^{(N+M)} \leq{-1}\\ \end{cases} \end{align*}$$

To simplify the notation and formulation, we will introduce an artificial feature by setting $x_0 = 1$


$$\begin{align*} \text{minimize} \quad & \text{something} \\ \\ \text{subject to} \quad & \begin{cases} \omega^Tx^{(1)} \geq1\\ \omega^Tx^{(2)} \geq1\\ \quad \quad \vdots \\ \omega^Tx^{(N)} \geq1\\ \end{cases} \\ & \begin{cases} \omega^Tx^{(N+1)} \leq{-1}\\ \omega^Tx^{(N+2)} \leq{-1}\\ \quad \quad \vdots \\ \omega^Tx^{(N+M)} \leq{-1}\\ \end{cases} \end{align*}$$

These constraints ensure that all correctly classified points lie outside the margin boundaries, establishing a buffer zone that enhances the classifier's robustness. However, the appropriate objective function to be minimized has not yet been determined.



In Matrix Form


$$\begin{align*} X_1 &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \vdots \\ \left(x^{(N)}\right)^T\end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ \vdots & \vdots \\ x_1^{(N)} & x_2^{(N)} \\ \end{bmatrix} \qquad \qquad X_1 = \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \vdots \\ \left(x^{(N)}\right)^T\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ \vdots & \vdots & \vdots \\ 1 & x_1^{(N)} & x_2^{(N)} \\ \end{bmatrix} \end{align*}$$



$$ \begin{align*} X_0 = \begin{bmatrix} \left(x^{(N+1)}\right)^T \\ \left(x^{(N+2)}\right)^T \\ \vdots \\ \left(x^{(N+M)}\right)^T\end{bmatrix} = \begin{bmatrix} x_1^{(N+1)} & x_2^{(N+1)} \\ x_1^{(N+2)} & x_2^{(N+2)} \\ \vdots & \vdots \\ x_1^{(N+M)} & x_2^{(N+M)} \\ \end{bmatrix} \qquad \qquad X_0 = \begin{bmatrix} \left(x^{(N+1)}\right)^T \\ \left(x^{(N+2)}\right)^T \\ \vdots \\ \left(x^{(N+M)}\right)^T\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(N+1)} & x_2^{(N+1)} \\ 1 & x_1^{(N+2)} & x_2^{(N+2)} \\ \vdots & \vdots & \vdots \\ 1 & x_1^{(N+M)} & x_2^{(N+M)} \\ \end{bmatrix} \end{align*} $$

Optimization in Form 1 (Standard Form with Separate Bias Term)


$$\begin{align*} \text{minimize} \quad & \text{something} \\ \text{subject to} \quad & \omega_0 + X_1\omega \geq 1 \\ & \omega_0 + X_0\omega \leq -1 \end{align*}$$
In [ ]:
# CVXPY

import cvxpy as cvx

X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])

X1 = np.asmatrix(X1)
X0 = np.asmatrix(X0)

N = X1.shape[0]
M = X0.shape[0]
In [ ]:
w0 = cvx.Variable([1,1])
w = cvx.Variable([2,1])

# feasibility problem: a constant objective, so any w satisfying the constraints is returned
obj = cvx.Minimize(1)
const = [w0 + X1@w >= 1, w0 + X0@w <= -1]
prob = cvx.Problem(obj, const).solve()

w0 = w0.value
w = w.value
In [ ]:
xp = np.linspace(0,9,100).reshape(-1,1)
yp = - w[0,0]/w[1,0]*xp - w0/w[1,0]

ypt = -2/3*xp + 6

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', alpha = 0.3, label = 'True')
plt.plot(xp, ypt-1, '--k', alpha = 0.3)
plt.plot(xp, ypt+1, '--k', alpha = 0.3)
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Attempt 1')
plt.title('Linearly and Strictly Separable Classes')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 3)
plt.xlim([0, 9])
plt.ylim([0, 6])
plt.show()

Interestingly, this optimization process leads to one of the possible linear boundaries that can separate the data.



Optimization in Form 2 (Augmented Form with $x_0 = 1$)


$$\begin{align*} \text{minimize} \quad & \text{something} \\ \text{subject to} \quad & X_1\omega \geq 1 \\ & X_0\omega \leq -1 \end{align*}$$
In [ ]:
import cvxpy as cvx

N = C1.shape[0]
M = C0.shape[0]

X1 = np.hstack([np.ones([N,1]), x1[C1], x2[C1]])
X0 = np.hstack([np.ones([M,1]), x1[C0], x2[C0]])

X1 = np.asmatrix(X1)
X0 = np.asmatrix(X0)
In [ ]:
w = cvx.Variable([3,1])

obj = cvx.Minimize(1)
const = [X1@w >= 1, X0@w <= -1]
prob = cvx.Problem(obj, const).solve()

w = w.value
In [ ]:
xp = np.linspace(0,9,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

ypt = -2/3*xp + 6

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', alpha = 0.3, label = 'True')
plt.plot(xp, ypt-1, '--k', alpha = 0.3)
plt.plot(xp, ypt+1, '--k', alpha = 0.3)
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Attempt 1')
plt.title('Linearly and Strictly Separable Classes')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 3)
plt.xlim([0, 9])
plt.ylim([0, 6])
plt.show()

3.3.2. Outlier

It is evident that the above implementation may fail to produce a valid boundary when the data is not linearly separable. In real-world scenarios, datasets often contain noise, errors, or outliers that deviate from the true underlying patterns.

Consequently, under these conditions, a strict margin condition (e.g., $\;y_i (\omega_0 + \omega^T x_i) \geq 1$) may be impossible to satisfy for all data points.


In [ ]:
X1 = np.hstack([np.ones([N,1]), x1[C1], x2[C1]])
X0 = np.hstack([np.ones([M,1]), x1[C0], x2[C0]])

outlier = np.array([1, 8.5, 1.8]).reshape(1,-1)
X0 = np.vstack([X0, outlier])

X1 = np.asmatrix(X1)
X0 = np.asmatrix(X0)

plt.figure(figsize = (6, 4))
plt.plot(X1[:,1], X1[:,2], 'ro', alpha = 0.4, label = 'C1')
plt.plot(X0[:,1], X0[:,2], 'bo', alpha = 0.4, label = 'C0')
plt.title('When Outliers Exist')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 3)
plt.xlim([0, 9])
plt.ylim([0, 6])
plt.show()
In [ ]:
w = cvx.Variable([3,1])

obj = cvx.Minimize(1)
const = [X1@w >= 1, X0@w <= -1]
prob = cvx.Problem(obj, const).solve()

print(w.value)
None

As observed, no feasible solution exists when the data is not linearly separable using a strict margin condition.


Allowing Misclassifications

To address this, we introduce the idea of soft margins - a critical enhancement to the original SVM formulation. In this approach:

  • Some training examples are allowed to be misclassified.
  • However, the goal remains to minimize the number of misclassified points or, more formally, to minimize the total deviation from the margin constraints.


3.3.3. The Second Attempt

In real-world scenarios, data is often not linearly separable due to noise, errors, or outliers. To address this, we introduce a more flexible framework known as the Soft Margin SVM, which allows some training points to violate the margin conditions.

Relaxing the Constraints

  • For non-separable data, we relax the original constraints by introducing slack variables.
  • These slack variables, denoted as $u$ and $v$, are non-negative values that quantify the extent to which data points violate the margin conditions.

Key Idea Behind Slack Variables

  • Each slack variable corresponds to a specific data point:

    • $u_i$ represents the margin violation for points in Class 1.
    • $v_i$ represents the margin violation for points in Class 0.
  • Both $u$ and $v$ are strictly non-negative values ($u_i \geq 0$ and $v_i \geq 0$)


Optimization Problem for the Non-separable Case

The objective is now to minimize the total margin violation (the sum of the slack variables), effectively controlling the number of misclassified points and their deviation from the margin.


$$\begin{align*} \text{minimize} \quad & \sum\limits_{i=1}^{N}u_i + \sum\limits_{i=1}^{M}\upsilon_i \\ \\ \text{subject to} \quad & \begin{cases} \omega^Tx^{(1)} \geq 1-u_1\\ \omega^Tx^{(2)} \geq 1-u_2\\ \quad \quad \vdots \\ \omega^Tx^{(N)} \geq 1-u_N\\ \end{cases} \\\\ & \begin{cases} \omega^Tx^{(N+1)} \leq {-(1-\upsilon_1)}\\ \omega^Tx^{(N+2)} \leq {-(1-\upsilon_2)}\\ \quad \quad \vdots \\ \omega^Tx^{(N+M)} \leq {-(1-\upsilon_M)}\\ \end{cases} \\\\ & \begin{cases} u \geq 0\\ v \geq 0\\ \end{cases} \end{align*}$$
  • Expressed in a matrix form

$$\begin{align*} \text{minimize} \quad & 1^Tu + 1^T\upsilon \\ \text{subject to} \quad & X_1\omega \geq 1-u \\ & X_0\omega \leq -(1-\upsilon) \\ & u \geq 0 \\ & \upsilon \geq 0 \end{align*}$$
In [ ]:
w = cvx.Variable([3,1])
u = cvx.Variable([N,1])
v = cvx.Variable([M+1,1])

obj = cvx.Minimize(np.ones((1,N))@u + np.ones((1,M+1))@v)
const = [X1@w >= 1-u, X0@w <= -(1-v), u >= 0, v >= 0 ]
prob = cvx.Problem(obj, const).solve()

w = w.value
In [ ]:
xp = np.linspace(0,9,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

ypt = -2/3*xp + 6

plt.figure(figsize = (6, 4))
plt.plot(X1[:,1], X1[:,2], 'ro', alpha = 0.4, label = 'C1')
plt.plot(X0[:,1], X0[:,2], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', alpha = 0.3, label = 'True')
plt.plot(xp, ypt-1, '--k', alpha = 0.3)
plt.plot(xp, ypt+1, '--k', alpha = 0.3)
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Attempt 2')
plt.plot(xp, yp-1/w[2,0], '--g')
plt.plot(xp, yp+1/w[2,0], '--g')
plt.title('When Outliers Exist')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 3)
plt.xlim([0, 9])
plt.ylim([0, 6])
plt.show()

The presence of outliers can significantly affect the position and orientation of the hyperplane. As a result, the hyperplane may no longer accurately represent the optimal division between classes. This issue arises because, in the second attempt, the linear classifier aims to separate the data as much as possible, making it highly sensitive to noise or extreme points.


Further Improvement: Large Margin for Better Generalization

  • The core insight in SVM is that a large margin not only separates the data effectively but also improves the model's generalization on unseen data.

  • By maximizing the margin (i.e., buffer zone), the model achieves:

    • Increased robustness to noise and outliers.
    • Better predictive performance on new data by reducing the risk of overfitting.

3.3.4. Maximize Margin (Finally, this is Support Vector Machine)

  • Distance (= margin)

$$\text{margin} = \frac{2}{\lVert \omega \rVert _2}$$
  • Minimize $\lVert \omega \rVert_2$ to maximize the margin

$$\begin{align*} \text{minimize} \quad & \lVert \omega \rVert_2 + \gamma(1^Tu + 1^T\upsilon) \\ \text{subject to} \quad & X_1\omega \geq 1-u \\ & X_0\omega \leq -(1-\upsilon) \\ & u \geq 0 \\ & \upsilon \geq 0 \end{align*}$$
  • Multiple objectives

  • Use gamma ($\gamma$) as a weighting between the following:

    • A bigger margin, giving robustness to outliers
    • A hyperplane with few (or no) classification errors

In [ ]:
g = 2

w = cvx.Variable([3,1])
u = cvx.Variable([N,1])
v = cvx.Variable([M+1,1])

obj = cvx.Minimize(cvx.norm(w,2) + g*(np.ones((1,N))@u + np.ones((1,M+1))@v))
const = [X1@w >= 1-u, X0@w <= -(1-v), u >= 0, v >= 0 ]
prob = cvx.Problem(obj, const).solve()

w = w.value
In [ ]:
xp = np.linspace(0,9,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

ypt = -2/3*xp + 6

plt.figure(figsize = (6, 4))
plt.plot(X1[:,1], X1[:,2], 'ro', alpha = 0.4, label = 'C1')
plt.plot(X0[:,1], X0[:,2], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', alpha = 0.3, label = 'True')
plt.plot(xp, ypt-1, '--k', alpha = 0.3)
plt.plot(xp, ypt+1, '--k', alpha = 0.3)
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Attempt 2')
plt.plot(xp, yp-1/w[2,0], '--g')
plt.plot(xp, yp+1/w[2,0], '--g')
plt.title('When Outliers Exist')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 3)
plt.xlim([0, 9])
plt.ylim([0, 6])
plt.show()

By shifting focus from simply separating the data to maximizing the margin, SVM introduces a powerful improvement that enhances performance, particularly in the presence of noise and outliers. This margin-based approach is a fundamental reason why SVMs are widely regarded as one of the most effective classifiers in machine learning.



3.4. Support Vector Machine

  • Probably the most popular/influential classification algorithm

  • A hyperplane based classifier (like the Perceptron)

  • Additionally uses the maximum margin principle

    • Maximize distance (margin) of closest samples from the decision line



    $$ \text{maximize {minimum distance}} $$


    • Note: the perceptron only uses the sign of the distance
    • SVM finds the hyperplane with maximum separation margin on the training data

$$\begin{align*} \text{minimize} \quad & \lVert \omega \rVert_2 + \gamma(1^Tu + 1^T\upsilon) \\ \text{subject to} \quad & X_1\omega \geq 1-u \\ & X_0\omega \leq -(1-\upsilon) \\ & u \geq 0 \\ & \upsilon \geq 0 \end{align*}$$
  • In a more compact form

$$ \begin{align*} \omega^T x_n &\geq 1 \;\text{for }\; y_n = +1 \\ \omega^T x_n &\leq -1 \;\text{for }\; y_n = -1 \end{align*} \quad \Longleftrightarrow \quad y_n \cdot \left( \omega^T x_n \right) \geq 1 $$
$$\begin{align*} \text{minimize} \quad & \lVert \omega \rVert_2 + \gamma(1^T \xi) \\ \text{subject to} \quad & y_n \cdot \left( \omega^T x_n \right) \geq 1 - \xi_n \\ & \xi \geq 0 \\ \end{align*}$$
In [ ]:
X = np.vstack([X1, X0])
y = np.vstack([np.ones([N,1]), -np.ones([M+1,1])])

m = N + M + 1

g = 2

w = cvx.Variable([3,1])
d = cvx.Variable([m,1])

obj = cvx.Minimize(cvx.norm(w,2) + g*(np.ones([1,m])@d))
const = [cvx.multiply(y, X@w) >= 1-d, d >= 0]
prob = cvx.Problem(obj, const).solve()

w = w.value
In [ ]:
xp = np.linspace(0,9,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

ypt = -2/3*xp + 6

plt.figure(figsize = (6, 4))
plt.plot(X1[:,1], X1[:,2], 'ro', alpha = 0.4, label = 'C1')
plt.plot(X0[:,1], X0[:,2], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', alpha = 0.3, label = 'True')
plt.plot(xp, ypt-1, '--k', alpha = 0.3)
plt.plot(xp, ypt+1, '--k', alpha = 0.3)
plt.plot(xp, yp, 'g', linewidth = 3, label = 'SVM')
plt.plot(xp, yp-1/w[2,0], '--g')
plt.plot(xp, yp+1/w[2,0], '--g')
plt.title('When Outliers Exist')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 3)
plt.xlim([0, 9])
plt.ylim([0, 6])
plt.show()
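For reference, scikit-learn also provides a soft-margin SVM whose parameter C plays a role similar to $\gamma$ above. The following is only a minimal sketch (it assumes a linear kernel and drops the artificial coordinate, since SVC fits its own intercept); it is not part of the CVXPY development:

In [ ]:
from sklearn import svm

# drop the artificial x0 = 1 column; SVC handles the bias internally
clf_svm = svm.SVC(kernel = 'linear', C = 2)
clf_svm.fit(np.asarray(X)[:, 1:], np.ravel(y))

print(clf_svm.coef_, clf_svm.intercept_)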

Throughout the development of SVM, we encourage you to recognize the continuous improvement process - a progression that starts with a simple scenario, encounters new challenges, and subsequently introduces innovative ideas to address those challenges.

  • We began with a straightforward linear classifier designed to separate clean, linearly separable data.

  • Upon encountering non-separable data due to noise, errors, or outliers, we introduced slack variables to relax the margin conditions, enhancing the model's robustness.

  • To further improve generalization on unseen data, we adopted the concept of a large margin, which maximizes the distance between the decision boundary and the closest data points, making the model more resilient to noise and improving overall performance.

This step-by-step refinement mirrors the natural progression of real-world problem-solving - starting with a basic concept, confronting obstacles, and evolving the model to achieve better performance and broader applicability.



4. Classification: Logistic Regression

  • Logistic regression is a classification algorithm - don't be misled by its name.

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('IRrGVQV8vZ8?si=8H3cmkAUyNIu2NDb&start=2750', width = "560", height = "315")
Out[ ]:

4.1. Using All Distances

Perceptron $\rightarrow$ SVM $\rightarrow$ Logistic Regression

  • Perceptron: Utilizes only the sign of the data to determine class labels, relying solely on binary outcomes without considering the distance of points from the decision boundary.

  • SVM: Focuses on only a subset of data points - specifically, those that lie directly on the margin (called support vectors) - to define the optimal hyperplane. While this approach effectively maximizes the margin, it ignores other data points that may still carry useful distance information.

  • Logistic Regression: To better utilize the distance information of all data points, we turn to logistic regression, which considers the positions of all points in the dataset. This broader perspective enhances the model's ability to capture patterns and improve generalization.


Let's start with the case of only two data points. Which linear classification boundary would be considered better?

  • On the left: The classification boundary is positioned near the center of the data points.

  • On the right: The classification boundary is biased toward one of the data points.





The boundary on the right is positioned too close to one of the data points, leaving minimal space (margin) on that side. This smaller margin increases the risk of misclassifying new data points that may fall within this narrow buffer zone. The boundary that maintains a balanced margin between the data points is preferred for its improved stability and performance.


Basic idea:

  • Identify the decision boundary $g(x)=\omega^T x =0$ that maximizes the product of the distances $\lvert h(x) \rvert$ from the hyperplane to the data points:

$$\prod_i \lvert h_i \rvert$$

Why?

  • By the Arithmetic Mean-Geometric Mean (AM-GM) Inequality, for non-negative real numbers $\lvert h_1 \rvert$ and $\lvert h_2 \rvert$:

$$ \frac{\lvert h_1 \rvert + \lvert h_2 \rvert}{2} \geq \sqrt{\lvert h_1 \rvert \lvert h_2 \rvert} $$
  • Equality holds if and only if $\lvert h_1 \rvert = \lvert h_2 \rvert$. Therefore, maximizing $\prod_i \lvert h_i \rvert$ encourages the distances $\lvert h_i \rvert$ to be equal, effectively positioning the hyperplane equidistantly between the two classes.
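For instance, with a fixed total distance $\lvert h_1 \rvert + \lvert h_2 \rvert = 4$ (numbers chosen only for illustration):

$$\lvert h_1 \rvert = 1,\; \lvert h_2 \rvert = 3 \;\implies\; \lvert h_1 \rvert \lvert h_2 \rvert = 3 \qquad \text{vs.} \qquad \lvert h_1 \rvert = \lvert h_2 \rvert = 2 \;\implies\; \lvert h_1 \rvert \lvert h_2 \rvert = 4$$

so the product is largest when the boundary sits midway between the two points.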

Roughly speaking, this optimization of $\max \prod_i \lvert h_i \rvert$ tends to position a hyperplane in the middle of two classes


$$h = \frac{g(x)}{\lVert \omega \rVert} = \frac{\omega^T x}{\lVert \omega \rVert} \sim \omega^T x$$

Considering all data points may seem more effective than SVM, which relies only on the support vectors near the margin. However, some data points may be located far from the decision boundary. These extreme points (or outliers) can negatively impact the model’s performance by overly influencing the decision boundary due to the multiplication of all $\lvert h_i \rvert$.

To mitigate this negative impact, we apply a squeezing function that compresses the range of extreme values, mapping $(-\infty, +\infty)$ to $(0,1)$. This transformation ensures that points far from the decision boundary have minimal impact on the model's outcome.

By squeezing the values in this way, we create a more robust model that effectively emphasizes the most informative points near the decision boundary while reducing the adverse effects of outliers.





Here, let's examine the sigmoid function, also known as the logistic function, denoted as $\sigma(z)$, which often serves as a squeezing function.


$$ \sigma(z) = \frac{1}{1+e^{-z}} \implies \sigma \left(\omega^T x \right) = \frac{1}{1+e^{-\omega^T x}}$$
  • Logistic function always generates a value between 0 and 1
  • Crosses 0.5 at the origin, then flattens out
  • Monotonic: yields the same or a similar optimization solution
  • Continuous and differentiable: good for gradient descent optimization
  • Reduces the impact of incorrect predictions with overly confident large distances.

In [ ]:
# plot a sigmoid function

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

z = np.linspace(-4,4,100)
s = 1/(1 + np.exp(-z))

plt.figure(figsize = (6, 3))
plt.plot(z, s)
plt.xlim([-4, 4])
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()

The output of the logistic function is bounded between 0 and 1, making it suitable for binary classification tasks.

  • For logistic regression, output label will be $y_i \in \{0,1\}$

The logistic function compresses large positive and negative values:

  • Large positive values are mapped close to 1.
  • Large negative values are mapped close to 0.

One significant advantage of the logistic function is that its output can be interpreted as a probability.

  • Often we do not care about predicting the label $y$
  • Rather, we want to predict the label probabilities $P\left(y \mid x\,;\omega\right)$

$$P\left(y = +1 \mid x\,;\omega\right) = \frac{1}{1+e^{-\omega^T x}} \;\; \in \; [0,1]$$
  • the probability that the label is $+1$
$$P\left(y = +1 \mid x ;\omega\right)$$
  • the probability that the label is $0$
$$P\left(y = 0 \mid x ;\omega\right) = 1 - P\left(y = +1 \mid x;\omega\right)$$
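As a small sketch (the weights and the point below are arbitrary), this probability is simply the sigmoid of $\omega^T x$:

In [ ]:
import numpy as np

w_prob = np.array([-6.0, 2.0, 1.0])      # arbitrary [w0, w1, w2]
x_prob = np.array([1.0, 2.5, 1.5])       # a point with x0 = 1 prepended

p = 1/(1 + np.exp(-(w_prob @ x_prob)))   # P(y = +1 | x ; w)
print(p, 1 - p)                          # probabilities of the two labels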

The Role of Distance in Classification

We have studied the Perceptron, SVM, and Logistic Regression as key algorithms for classification. The primary distinction among these methods lies in how each approach defines and utilizes the distance from the decision boundary to the data points. The figures below provide intuitive visual insights into these differences.

  • Perceptron: Considers only the sign of the data points, completely ignoring the distance.
  • SVM: Utilizes the distance but focuses only on the margin - the distance of the closest points (support vectors) to the decision boundary.
  • Logistic Regression: Considers the distance of all points, assigning probabilities accordingly, with points farther from the boundary contributing less significantly.


Note:

Although the term "regression" appears in logistic regression, it is, in fact, a classification algorithm. This terminology may seem counterintuitive at first. However, upon closer examination, logistic regression can be understood as a regression-based approach in which the distances between data points and the linear decision boundary are transformed via a logistic (sigmoid) function. This transformation maps the outputs to a probability distribution within the range [0, 1], thereby enabling binary classification. Notably, the transformed distance is continuous, which aligns with the regression-like nature of the model.



Goal: we need to fit $\omega$ to our data


For a single data point $(x,y)$ with parameters $\omega$


$$ \begin{align*} P\left(y = +1 \mid x\,;\omega\right) &= h_{\omega}(x) = \sigma \left(\omega^T x \right)\\ P\left(y = 0 \mid x\,;\omega\right) &= 1 - h_{\omega}(x) = 1- \sigma \left(\omega^T x \right) \end{align*} $$

It can be compactly written as (since $y$ is either $0$ or $1$)


$$P\left(y \mid x\,;\omega\right) = \left(h_{\omega}(x) \right)^y \left(1 - h_{\omega}(x)\right)^{1-y}$$

For $m$ training data points, the likelihood function of the parameters:


$$ \begin{align*} \mathscr{L}(\omega) &= P\left(y^{(1)}, \cdots, y^{(m)} \mid x^{(1)}, \cdots, x^{(m)}\,;\omega\right)\\ &= \prod\limits_{i=1}^{m}P\left(y^{(i)} \mid x^{(i)}\,;\omega\right)\\ &= \prod\limits_{i=1}^{m}\left(h_{\omega}\left(x^{(i)}\right) \right)^{y^{(i)}} \left(1 - h_{\omega}\left(x^{(i)}\right)\right)^{1-y^{(i)}} \qquad \left(\sim \prod_i \lvert h_i \rvert \right) \end{align*} $$

It would be easier to work on the log likelihood.

  • Taking the logarithm converts the product into a sum
  • Probabilities close to 0 can cause numerical instability when directly multiplied.
  • The log transformation helps keep values in a manageable range.

$$\ell(\omega) = \log \mathscr{L}(\omega) = \sum_{i=1}^{m} y^{(i)} \log h_{\omega} \left(x^{(i)} \right) + \left(1-y^{(i)} \right) \log \left(1-h_{\omega} \left(x^{(i)} \right) \right)$$
  • Its negative is known as the cross-entropy loss, a commonly used loss function in classification problems.
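A minimal NumPy sketch of this log likelihood (assuming X carries the artificial coordinate and y holds 0/1 labels):

In [ ]:
import numpy as np

def sigmoid(z):
    return 1/(1 + np.exp(-z))

def log_likelihood(w, X, y):
    # sum of y*log(h) + (1 - y)*log(1 - h) over all samples
    h = sigmoid(X @ w)
    return np.sum(y*np.log(h) + (1 - y)*np.log(1 - h))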

Then, the logistic regression problem can be solved as a (convex) optimization problem as


$$\hat{\omega} = \arg\max_{\omega} \ell(\omega)$$

To use the gradient descent method, we need to find the derivative of it


$$\nabla \ell(\omega) = \begin{bmatrix} \frac{\partial \ell(\omega)}{\partial \omega_1} \\ \vdots \\ \frac{\partial \ell(\omega)}{\partial \omega_n} \end{bmatrix}$$

Think about a single data point with a single parameter $\omega$ for simplicity.


$$ \begin{align*} \frac{\partial}{\partial \omega} \left[ y \log (\sigma) + (1-y) \log (1-\sigma)\right] & = y\frac{\sigma'}{\sigma} + (1-y)\frac{-\sigma'}{1-\sigma}\\ & = \left(\frac{y}{\sigma}-\frac{1-y}{1-\sigma} \right)\sigma'\\ & = \frac{y-\sigma}{\sigma (1-\sigma)}\sigma'\\ & = \frac{y-\sigma}{\sigma (1-\sigma)}\sigma (1-\sigma)x\\ & = (y-\sigma)x \end{align*} $$
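A quick finite-difference check of this result for a single data point (arbitrary numbers, treating $x$ as a scalar to match the single-parameter setting above):

In [ ]:
import numpy as np

def sigmoid(z):
    return 1/(1 + np.exp(-z))

w_s, x_s, y_s = 0.3, 1.7, 1.0     # one arbitrary data point and weight

def loglik(w):
    return y_s*np.log(sigmoid(w*x_s)) + (1 - y_s)*np.log(1 - sigmoid(w*x_s))

eps = 1e-6
numerical = (loglik(w_s + eps) - loglik(w_s - eps))/(2*eps)
analytical = (y_s - sigmoid(w_s*x_s))*x_s

print(numerical, analytical)      # the two values should closely agree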

For $m$ training data points with the parameters $\omega$


$$\frac{\partial \ell(\omega)}{\partial \omega_j} = \sum_{i=1}^{m} \left(y^{(i)}-h_{\omega} \left(x^{(i)} \right) \right) x_{j}^{(i)} \quad \stackrel{\text{vectorization}}{=} \quad \left(y-h_{\omega}(x)\right)^T x_{j} = x_{j}^T \left(y-h_{\omega}(x)\right) $$

4.1.1. Logistic Regression using Gradient Descent


$$ \begin{align*} \omega &= \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}, \qquad x = \begin{bmatrix} 1 \\ x_1 \\ x_2\end{bmatrix}\\ \\ X &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T \\ \vdots\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)} \\ \vdots & \vdots & \vdots \\\end{bmatrix}, \qquad y = \begin{bmatrix} y^{(1)}\\ y^{(2)} \\y^{(3)} \\ \vdots \end{bmatrix} \end{align*} $$

$$\nabla \ell(\omega) = \begin{bmatrix} \frac{\partial \ell(\omega)}{\partial \omega_0} \\ \frac{\partial \ell(\omega)}{\partial \omega_1} \\ \frac{\partial \ell(\omega)}{\partial \omega_2} \end{bmatrix} = X^T \left(y-h_{\omega}(x)\right) = X^T \left(y-\sigma(X \omega)\right) $$
In [ ]:
# data generation

m = 100
w = np.array([[-6], [2], [1]])
X = np.hstack([np.ones([m,1]), 4*np.random.rand(m,1), 4*np.random.rand(m,1)])

w = np.asmatrix(w)
X = np.asmatrix(X)

y = 1/(1 + np.exp(-X*w)) > 0.5

C1 = np.where(y == True)[0]
C0 = np.where(y == False)[0]

y = np.empty([m,1])
y[C1] = 1
y[C0] = 0

plt.figure(figsize = (6, 4))
plt.plot(X[C1,1], X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,1], X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 1)
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()

$$h_{\omega}(x) = h(x\,;\omega) = \sigma \left(\omega^T x\right) = \frac{1}{1+e^{-\omega^T x}}$$
In [ ]:
# be careful with matrix shape

def h(x,w):
    return 1/(1 + np.exp(-x*w))

Gradient descent

  • Maximization problem

$$\nabla \ell(\omega) = \begin{bmatrix} \frac{\partial \ell(\omega)}{\partial \omega_0} \\ \frac{\partial \ell(\omega)}{\partial \omega_1} \\ \frac{\partial \ell(\omega)}{\partial \omega_2} \end{bmatrix} = X^T \left(y-h_{\omega}(x)\right) = X^T \left(y-\sigma(X \omega)\right) $$
$$\omega \leftarrow \omega - \eta \left( - \nabla \ell(\omega)\right)$$
In [ ]:
w = np.zeros([3,1])

alpha = 0.01

for i in range(10000):
    df = -X.T*(y - h(X,w))
    w = w - alpha*df

print(w)
[[-31.74029027]
 [ 11.24643299]
 [  5.02396212]]
In [ ]:
xp = np.linspace(0,4,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize = (6, 4))
plt.plot(X[C1,1], X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,1], X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 4, label = 'Logistic Regression')
plt.title('Logistic Regression')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 1)
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()

4.1.2. Logistic Regression using CVXPY


$$ \begin{align*} p &= \frac{1}{1+e^{-\omega^T x}} = \frac{e^{\omega^T x}}{e^{\omega^T x} + 1}\\ 1-p &= \frac{1}{e^{\omega^T x} + 1} \end{align*} $$

We can re-order the training data so

  • for $x_1, \cdots, x_q$, the outcome is $y = +1$, and
  • for $x_{q+1}, \cdots, x_m$, the outcome is $y=0$

The likelihood function


$$\mathscr{L} = \prod\limits_{i=1}^{q}{p_i}\prod\limits_{i=q+1}^{m}{(1-p_i)} \qquad \left(\sim \prod_i \lvert h_i \rvert \right)$$

The log likelihood function


$$ \begin{align*} \ell(\omega) &= \log \mathscr{L} = \sum\limits_{i=1}^{q}{\log p_i} + \sum\limits_{i=q+1}^{m}{\log(1 - p_i)} \\ & = \sum\limits_{i=1}^{q}{\log \frac{\exp\left(\omega^T x_i\right)}{1 + \exp \left(\omega^T x_i \right)}} + \sum\limits_{i=q+1}^{m}{\log \frac{1}{1+\exp \left(\omega^T x_i \right)}} \\ &= \sum\limits_{i=1}^{q}{\left(\omega^T x_i\right)} - \sum\limits_{i=1}^{m}{\log \left(1+\exp \left(\omega^T x_i \right) \right)} \end{align*} $$

Since $\ell$ is a concave function of $\omega$, the logistic regression problem can be solved as a convex optimization problem


$$\hat{\omega} = \arg\max_{\omega} \ell(\omega)$$
$$ \begin{align*} \omega &= \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}, \qquad x = \begin{bmatrix} 1 \\ x_1 \\ x_2\end{bmatrix}\\ \\ X &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T \\ \vdots\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)} \\ \vdots & \vdots & \vdots \\\end{bmatrix}, \qquad y = \begin{bmatrix} y^{(1)}\\ y^{(2)} \\y^{(3)} \\ \vdots \end{bmatrix} \end{align*} $$



Since $y_i = 1$ for $i = 1, \cdots, q$ and $y_i = 0$ for $i = q+1, \cdots, m$, the log-likelihood can be written compactly in matrix form as


$$\ell(\omega) = y^T X\omega - \sum\limits_{i=1}^{m}{\log \left(1+\exp \left(\omega^T x_i \right) \right)},$$


which is exactly the objective maximized with CVXPY below.

Refer to cvxpy functions

  • scalar function: cvx.sum(x) = $\sum_{ij} x_{ij}$

  • elementwise function: cvx.logistic(x) = $\log \left(1+e^{x} \right)$


In [ ]:
import cvxpy as cvx

w = cvx.Variable([3, 1])

obj = cvx.Maximize(y.T@X@w - cvx.sum(cvx.logistic(X@w)))
prob = cvx.Problem(obj).solve()

w = w.value

xp = np.linspace(0,4,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize = (6, 4))
plt.plot(X[C1,1], X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,1], X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 4, label = 'Logistic Regression')
plt.title('Logistic Regression')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 1)
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()

In a more compact form


Change $y \in \{0,+1\} \; \rightarrow \; y \in \{-1,+1\}$ for computational convenience


  • Consider the following function

$$P(y=+1) = p = \sigma(\omega^T x), \quad P(y=-1) = 1-p = 1- \sigma(\omega^T x) = \sigma(-\omega^T x)$$

$$P\left(y \mid x\,;\omega\right) = \sigma\left(y\omega^Tx\right) = \frac{1}{1+\exp\left(-y\omega^T x\right)} \in [0,1]$$
  • Log-likelihood

$$ \begin{align*} \ell(\omega) = \log \mathscr{L} = \log P\left(y \mid x\,;\omega\right)& = \log \prod_{n=1}^{m} P\left(y_n \mid x_n\,;\omega\right)\\ &= \sum_{n=1}^{m} \log P\left(y_n \mid x_n\,;\omega\right)\\ &= \sum_{n=1}^{m} \log \frac{1}{1+\exp\left(-y_n\omega^T x_n\right)}\\ &= \sum\limits_{n=1}^{m}{-\log \left(1+\exp \left(-y_n\omega^T x_n \right) \right)} \end{align*} $$
  • MLE solution

$$ \begin{align*} \hat{\omega} &= \arg \max_{\omega} \sum\limits_{n=1}^{m}{-\log \left(1+\exp \left(-y_n\omega^T x_n \right) \right)}\\ &= \arg \min_{\omega} \sum\limits_{n=1}^{m}{\log \left(1+\exp \left(-y_n\omega^T x_n \right) \right)} \end{align*} $$
In [ ]:
y = np.empty([m, 1])
y[C1] = 1
y[C0] = -1
y = np.asmatrix(y)

w = cvx.Variable([3, 1])

obj = cvx.Minimize(cvx.sum(cvx.logistic(-cvx.multiply(y, X@w))))
prob = cvx.Problem(obj).solve()

w = w.value

xp = np.linspace(0,4,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize = (6, 4))
plt.plot(X[C1,1], X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,1], X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 4, label = 'Logistic Regression')
plt.title('Logistic Regression')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 1)
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()

4.1.3. Logistic Regression using Scikit-Learn


$$ \begin{align*} \omega &= \begin{bmatrix} \omega_1 \\ \omega_2\end{bmatrix}, \qquad \omega_0, \qquad x = \begin{bmatrix} x_1 \\ x_2\end{bmatrix}\\ \\ X &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T \\ \vdots\end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ x_1^{(3)} & x_2^{(3)} \\ \vdots & \vdots \\\end{bmatrix}, \qquad y = \begin{bmatrix} y^{(1)}\\ y^{(2)} \\y^{(3)} \\ \vdots \end{bmatrix} \end{align*} $$
In [ ]:
X = X[:, 1:3]    # drop the artificial coordinate x0 = 1 (scikit-learn fits the intercept separately)

X.shape
Out[ ]:
(100, 2)
In [ ]:
from sklearn import linear_model

clf = linear_model.LogisticRegression(solver = 'lbfgs')
clf.fit(np.asarray(X), np.ravel(y))
Out[ ]:
LogisticRegression()
In [ ]:
clf.coef_
Out[ ]:
array([[3.30887193, 1.4874399 ]])
In [ ]:
clf.intercept_
Out[ ]:
array([-9.74106087])
In [ ]:
w0 = clf.intercept_[0]
w1 = clf.coef_[0,0]
w2 = clf.coef_[0,1]

xp = np.linspace(0,4,100).reshape(-1,1)
yp = - w1/w2*xp - w0/w2

plt.figure(figsize = (6, 4))
plt.plot(X[C1,0], X[C1,1], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,0], X[C0,1], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 4, label = 'Logistic Regression')
plt.title('Logistic Regression')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 1)
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
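
The fitted scikit-learn model can also be queried directly for predictions. The following is a minimal usage sketch (not part of the original example; the test points are arbitrary), relying on the standard predict and predict_proba methods.

In [ ]:
# minimal sketch: query the fitted model at a few arbitrary test points
X_test = np.array([[1.0, 1.0], [3.0, 3.0]])

print(clf.predict(X_test))         # hard class labels (here -1 or +1)
print(clf.predict_proba(X_test))   # class probabilities, columns ordered as clf.classes_
print(clf.classes_)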

4.2. Cross-Entropy (Optional)

You might have seen the concept of entropy in physics. In fact, the concept of entropy is closely related to the cross-entropy that we just encountered in logistic regression.


Entropy in Physics

In statistical physics, entropy measures the uncertainty or disorder of a system:


$$ S = -k_B \sum_{i} p_i \log p_i, $$

where:

  • $ S $ is the entropy of the system (the Gibbs form, often called the Boltzmann-Gibbs entropy).
  • $ k_B $ is Boltzmann's constant.
  • $ p_i $ is the probability of the system being in state $ i $.

Entropy quantifies how much "randomness" or "information" is needed to describe the system's configuration.


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

p = np.linspace(0.01, 0.99, 100)
S = -p*np.log(p) - (1-p)*np.log(1-p)    # two-state entropy as a function of p (with k_B = 1)

plt.figure(figsize = (6, 4))
plt.plot(p, S, linewidth = 3)
plt.xlabel('p')
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()

Cross-Entropy in Logistic Regression

In logistic regression, cross-entropy measures how well the predicted probability distribution matches the true distribution:


$$ \mathcal{L} = - \sum_{i} \left[\, y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right], $$

where:

  • $ y_i $ is the true label (1 or 0).
  • $ \hat{y}_i $ is the predicted probability for class 1.
  • The loss is large when the predicted probability is far from the true class.
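
To make the formula concrete, here is a minimal NumPy sketch (not from the original material; the labels and predicted probabilities are made up) that evaluates the average binary cross-entropy.

In [ ]:
# minimal sketch: average binary cross-entropy for hypothetical labels/predictions
y_true = np.array([1, 0, 1, 1])           # true labels
y_hat  = np.array([0.9, 0.2, 0.6, 0.1])   # predicted probabilities for class 1

ce = -np.mean(y_true*np.log(y_hat) + (1 - y_true)*np.log(1 - y_hat))
print(ce)   # the confident wrong prediction (0.1 for a true 1) dominates the loss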

Connecting the Concepts

  • In physics, Boltzmann entropy measures the disorder of a system based on probabilities of different states.
  • In machine learning, cross-entropy loss measures the "disorder" or "uncertainty" when the predicted probabilities do not align with the true labels.
  • Improving classification performance corresponds to reducing this disorder or uncertainty in the predicted probabilities relative to the true labels.


4.3. Multiclass Classification

Multiclass classification is an extension of binary classification where the goal is to assign each input instance to one of three or more distinct classes. Unlike binary classification, where the model predicts between two outcomes, multiclass classification requires handling multiple labels effectively.

  • The output $y$ belongs to a set of discrete values representing multiple classes:
$$y \in \{1,2, \cdots, K \}$$
  • Unlike binary classification, where the decision boundary separates two regions, multiclass classification often requires multiple boundaries to partition the input space into several regions.

Generalization to more than 2 classes is straightforward

(1) one vs. all (one vs. rest) (a scikit-learn sketch of this strategy follows these lists)

  • Example: For three classes ($C_1$, $C_2$, $C_3$)
    • Classifier 1: Distinguishes $C_1$ from $C_2$ and $C_3$.
    • Classifier 2: Distinguishes $C_2$ from $C_1$ and $C_3$.
    • Classifier 3: Distinguishes $C_3$ from $C_1$ and $C_2$.

(2) one vs. one

  • Example: For three classes ($C_1$, $C_2$, $C_3$)
    • Classifier 1: Distinguishes $C_1$ vs. $C_2$
    • Classifier 2: Distinguishes $C_1$ vs. $C_3$
    • Classifier 3: Distinguishes $C_2$ vs. $C_3$
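
The following is a rough sketch of the one-vs-rest strategy using scikit-learn (not part of the original notebook; it assumes a synthetic three-class dataset generated with make_blobs).

In [ ]:
# sketch: one-vs-rest multiclass classification with scikit-learn
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

Xm, ym = make_blobs(n_samples = 300, centers = 3, random_state = 0)

ovr = OneVsRestClassifier(LogisticRegression())   # trains one binary classifier per class
ovr.fit(Xm, ym)

print(len(ovr.estimators_))   # 3 binary classifiers, one per class
print(ovr.predict(Xm[:5]))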

Softmax (Multinomial Logistic Regression)

  • Directly models the probability distribution over multiple classes

  • Using the soft-max function instead of the logistic function (refer to UFLDL Tutorial)


$$P\left(y = k \mid x\,;\omega\right) = \frac{\exp{\left( \omega_k^T x \right) }}{\sum_{j=1}^{K} \exp{\left(\omega_j^T x \right)}} \in [0,1]$$
  • Each output represents the estimated probability of the corresponding class.
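
A minimal NumPy sketch of the softmax computation (not from the original material; the weights and input below are arbitrary):

In [ ]:
# sketch: softmax over K = 3 classes for a single input x (with bias feature x0 = 1)
W = np.array([[ 1.0, -0.5,  0.2],
              [ 0.3,  0.8, -1.0],
              [-0.2,  0.1,  0.6]])    # column k is the weight vector omega_k of class k
x = np.array([1.0, 2.0, 0.5])

scores = W.T @ x                      # omega_k^T x for each class
p = np.exp(scores - scores.max())     # subtract the max for numerical stability
p = p/p.sum()

print(p, p.sum())                     # estimated class probabilities, summing to 1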

One-Hot Encoding

One-hot encoding is a technique used to represent categorical variables as numerical vectors. It is commonly applied in machine learning, particularly for tasks involving classification or categorical feature encoding.


Suppose you have three classes for a classification problem:

  • Class 1 $\rightarrow$ [1, 0, 0]
  • Class 2 $\rightarrow$ [0, 1, 0]
  • Class 3 $\rightarrow$ [0, 0, 1]

In this representation:

  • Each output node in the model corresponds to a specific class.
  • The predicted output vector can then be interpreted as class probabilities (e.g., using softmax in neural networks).
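
A small sketch of one-hot encoding with NumPy (not from the original material; it assumes integer class labels 0, 1, 2):

In [ ]:
# sketch: one-hot encode integer class labels
labels = np.array([0, 2, 1, 0])   # hypothetical labels for 4 samples
K = 3                             # number of classes

Y_onehot = np.eye(K)[labels]      # row i is the one-hot vector of sample i
print(Y_onehot)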

5. Nonlinear Classification

In real-world scenarios, data is often not linearly separable due to complex patterns, interactions, or overlapping features. Nonlinear classification refers to the process of classifying data points that cannot be separated by a straight line (or a hyperplane in higher dimensions). Unlike linear classifiers, which rely on linear decision boundaries, nonlinear classifiers are designed to capture complex patterns and relationships within the data.


One method to achieve nonlinear classification is to transform the original features into a higher-dimensional space where linear separation becomes possible.


Kernels: make linear models work in nonlinear settings

  • By mapping data to higher dimensions where it exhibits linear patterns
  • Apply the linear model in the new input feature space
  • Mapping $=$ changing the feature representation

1D Example

Consider the binary classification problem

  • Each example represented by a single feature $x$
  • No linear separator exists for this data





  • Now map each example as

$$x \quad \rightarrow \quad \begin{bmatrix}x\\x^2\end{bmatrix}$$
  • Data now becomes linearly separable in the new representation



  • Linear in the new representation $=$ nonlinear in the old representation
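
As a quick illustration (a minimal sketch, not from the original notebook; the 1D data below is made up), plotting the mapped points $(x, x^2)$ shows that a horizontal line now separates the two classes.

In [ ]:
# sketch: 1D data separable only after the feature map x -> [x, x^2]
x1d = np.array([-2.0, -1.5, 1.5, 2.0])   # class C1: large |x|
x0d = np.array([-0.5, 0.0, 0.3, 0.6])    # class C0: small |x|

plt.figure(figsize = (6, 4))
plt.plot(x1d, x1d**2, 'ro', label = 'C1')
plt.plot(x0d, x0d**2, 'bo', label = 'C0')
plt.axhline(1.0, color = 'g', linewidth = 3, label = 'linear separator in $(x, x^2)$')
plt.xlabel(r'$x$')
plt.ylabel(r'$x^2$')
plt.legend(loc = 2)
plt.show()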

2D Example

  • Each example is defined by two features, $ x = \begin{bmatrix}x_1\\ x_2 \end{bmatrix} $

  • No linear separator exists for this data





  • Now map each example as

$$x = \begin{bmatrix}x_1\\ x_2 \end{bmatrix} \quad \rightarrow \quad z = \begin{bmatrix}x_1^2\\ \sqrt{2}x_1x_2 \\ x_2^2 \end{bmatrix}$$
  • Each example now has three features (derived from the old representation)

  • Data now becomes linearly separable in the new representation





Visual Illustration of Kernel Mapping

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('3liCbRZPrZA', width = "560", height = "315")
Out[ ]:

Selecting the Appropriate Kernel

Determining the appropriate kernel function is a crucial aspect of effectively applying kernel-based methods. While we have demonstrated that applying the right kernel enables linear classification techniques to handle nonlinearly distributed data, we have not yet addressed the process of selecting this optimal kernel.

Throughout our previous discussions, we assumed that the kernel (feature) mapping was predefined. However, identifying the most suitable mapping for a given dataset remains a non-trivial challenge. The kernel trick provides a systematic way to work implicitly in very high-dimensional feature spaces without constructing the mapping explicitly, but we will not explore this topic further.

The primary reason for this decision is that in deep learning, the model architecture inherently learns effective feature transformations directly from the data. Unlike traditional kernel methods, modern deep learning frameworks are capable of automatically discovering complex and data-driven feature mappings, eliminating the need for manually selecting or designing an optimal kernel.

In essence, while understanding kernel selection is valuable in classical machine learning, deep learning models provide a more flexible and adaptive solution by learning these transformations directly during training.



5.1. Nonlinear Classification in SVM

In [ ]:
X1 = np.array([[-1.1,0],[-0.3,0.1],[-0.9,1],[0.8,0.4],[0.4,0.9],[0.3,-0.6],
               [-0.5,0.3],[-0.8,0.6],[-0.5,-0.5]])

X0 = np.array([[-1,-1.3], [-1.6,2.2],[0.9,-0.7],[1.6,0.5],[1.8,-1.1],[1.6,1.6],
               [-1.6,-1.7],[-1.4,1.8],[1.6,-0.9],[0,-1.6],[0.3,1.7],[-1.6,0],[-2.1,0.2]])

X1 = np.asmatrix(X1)
X0 = np.asmatrix(X0)

plt.figure(figsize = (6, 4))
plt.plot(X1[:,0], X1[:,1], 'ro', label = 'C1')
plt.plot(X0[:,0], X0[:,1], 'bo', label = 'C0')
plt.title('SVM for Nonlinear Data')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 1)
plt.show()

$$ x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \quad \Longrightarrow \quad z = \phi(x) = \begin{bmatrix} 1\\ x_1^2 \\ \sqrt{2}x_1 x_2 \\x_2^2 \end{bmatrix}$$
In [ ]:
N = X1.shape[0]
M = X0.shape[0]

X = np.vstack([X1, X0])
y = np.vstack([np.ones([N,1]), -np.ones([M,1])])

X = np.asmatrix(X)
y = np.asmatrix(y)

m = N + M
Z = np.hstack([np.ones([m,1]), np.square(X[:,0]), np.sqrt(2)*np.multiply(X[:,0],X[:,1]), np.square(X[:,1])])
In [ ]:
g = 10    # weight on the slack penalty (soft-margin trade-off)

w = cvx.Variable([4, 1])
d = cvx.Variable([m, 1])    # slack variables

obj = cvx.Minimize(cvx.norm(w, 2) + g*np.ones([1,m])@d)
const = [cvx.multiply(y, Z@w) >= 1-d, d >= 0]    # margin constraints in the lifted feature space
prob = cvx.Problem(obj, const).solve()

w = w.value
print(w)
[[ 2.79259259]
 [-1.48148148]
 [-0.58925565]
 [-1.48148148]]
In [ ]:
# to plot

[X1gr, X2gr] = np.meshgrid(np.arange(-3,3,0.1), np.arange(-3,3,0.1))

Xp = np.hstack([X1gr.reshape(-1,1), X2gr.reshape(-1,1)])
Xp = np.asmatrix(Xp)

m = Xp.shape[0]
Zp = np.hstack([np.ones([m,1]), np.square(Xp[:,0]), np.sqrt(2)*np.multiply(Xp[:,0], Xp[:,1]), np.square(Xp[:,1])])
q = Zp*w

B = []
for i in range(m):
    if q[i,0] > 0:
        B.append(Xp[i,:])

B = np.vstack(B)

plt.figure(figsize = (6, 4))
plt.plot(X1[:,0], X1[:,1], 'ro', label = 'C1')
plt.plot(X0[:,0], X0[:,1], 'bo', label = 'C0')
plt.plot(B[:,0], B[:,1], 'gs', markersize = 10, alpha = 0.1, label = 'SVM')
plt.title('SVM with Kernel')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 1)
plt.show()

5.2. Nonlinear Classification in Logistic Regression

In [ ]:
X1 = np.array([[-1.1,0],[-0.3,0.1],[-0.9,1],[0.8,0.4],[0.4,0.9],[0.3,-0.6],
               [-0.5,0.3],[-0.8,0.6],[-0.5,-0.5]])

X0 = np.array([[-1,-1.3], [-1.6,2.2],[0.9,-0.7],[1.6,0.5],[1.8,-1.1],[1.6,1.6],
               [-1.6,-1.7],[-1.4,1.8],[1.6,-0.9],[0,-1.6],[0.3,1.7],[-1.6,0],[-2.1,0.2]])

X1 = np.asmatrix(X1)
X0 = np.asmatrix(X0)

plt.figure(figsize = (6, 4))
plt.plot(X1[:,0], X1[:,1], 'ro', label = 'C1')
plt.plot(X0[:,0], X0[:,1], 'bo', label = 'C0')
plt.title('Logistic Regression for Nonlinear Data')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 4)
plt.show()

$$ x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \quad \Longrightarrow \quad z = \phi(x) = \begin{bmatrix} 1\\ \sqrt{2}x_1\\ \sqrt{2}x_2 \\x_1^2 \\ \sqrt{2}x_1 x_2 \\x_2^2 \end{bmatrix}$$
In [ ]:
N = X1.shape[0]
M = X0.shape[0]

X = np.vstack([X1, X0])
y = np.vstack([np.ones([N,1]), -np.ones([M,1])])

X = np.asmatrix(X)
y = np.asmatrix(y)

m = N + M
Z = np.hstack([np.ones([m,1]), np.sqrt(2)*X[:,0], np.sqrt(2)*X[:,1], np.square(X[:,0]),
               np.sqrt(2)*np.multiply(X[:,0], X[:,1]), np.square(X[:,1])])

w = cvx.Variable([6, 1])
obj = cvx.Minimize(cvx.sum(cvx.logistic(-cvx.multiply(y,Z @ w))))
prob = cvx.Problem(obj).solve()

w = w.value
In [ ]:
# to plot
[X1gr, X2gr] = np.meshgrid(np.arange(-3,3,0.1), np.arange(-3,3,0.1))

Xp = np.hstack([X1gr.reshape(-1,1), X2gr.reshape(-1,1)])
Xp = np.asmatrix(Xp)

m = Xp.shape[0]
Zp = np.hstack([np.ones([m,1]), np.sqrt(2)*Xp[:,0], np.sqrt(2)*Xp[:,1], np.square(Xp[:,0]),
                np.sqrt(2)*np.multiply(Xp[:,0], Xp[:,1]), np.square(Xp[:,1])])
q = Zp*w

B = []
for i in range(m):
    if q[i,0] > 0:
        B.append(Xp[i,:])

B = np.vstack(B)

plt.figure(figsize = (6, 4))
plt.plot(X1[:,0], X1[:,1], 'ro', label = 'C1')
plt.plot(X0[:,0], X0[:,1], 'bo', label = 'C0')
plt.plot(B[:,0], B[:,1], 'gs', markersize = 10, alpha = 0.1, label = 'Logistic Regression')
plt.title('Logistic Regression with Kernel')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 4)
plt.show()
In [ ]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')