Classification: Support Vector Machine


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Table of Contents

1. Classification (Linear)

  • Figure out, autonomously, which category (or class) an unknown item should be categorized into
  • Number of categories / classes

    • Binary: 2 different classes

    • Multiclass: more than 2 classes

  • Feature

    • The measurable parts that make up the unknown item (or the information you have available to categorize)

2. Distance from a Line



$$\omega = \begin{bmatrix}\omega_1 \\ \omega_2\end{bmatrix}, \, x = \begin{bmatrix}x_1\\x_2\end{bmatrix} \; \implies g(x) = \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 $$




  • If $\vec p$ and $\vec q$ are on the decision line


$$ \begin{align*} g\left(\vec p\right) = g\left(\vec q\right) = 0 \quad & \Rightarrow \quad \omega_0 + \omega^T \vec p = \omega_0 + \omega^T \vec q = 0 \\ & \Rightarrow \quad \omega^T \left( \vec p- \vec q \right) = 0 \end{align*} $$


$$ \begin{align*} & \therefore \, \omega : \text{normal to the line (orthogonal)} \\ & \implies \text{tells the direction of the line} \end{align*}$$


  • If $x$ is on the line and $x = d\frac{\omega}{\lVert \omega \rVert}$ (where $d$ is a normal distance from the origin to the line)



$$ \begin{align*} g(x)& = \omega_0 + \omega^Tx = 0 \; \\ & \Rightarrow \omega_0 + \omega^Td\frac{\omega}{\lVert \omega \rVert} = \omega_0 + d\frac{\omega^T\omega}{\lVert \omega \rVert} = \omega_0 + d\lVert \omega \rVert = 0 \\\\ & \therefore d \, = - \frac{\omega_0}{\lVert \omega \rVert} \end{align*}$$

  • For any vector of $x$
$$ x = x_{\perp} + r \frac{\omega}{\lVert \omega \rVert}$$


$$ \omega^Tx = \omega^T \left( x_{\perp} + r \frac{\omega}{\lVert \omega \rVert}\right) = r \frac{\omega^T\omega}{\lVert \omega \rVert} = r \lVert \omega \rVert$$



$$ \begin{align*} g(x) & = \omega_0 + \omega^Tx \\ & = \omega_0 + r \lVert \omega \rVert \qquad (r = d + h) \\ & = \omega_0 + (d +h) \lVert \omega \rVert\\ & = \omega_0 + \left(- \frac{\omega_0}{\lVert \omega \rVert} + h \right)\lVert \omega \rVert\\ & = h \lVert \omega \rVert \end{align*}$$



$$\therefore \; h = \frac{g(x)}{\lVert \omega \rVert} \implies\; \text{orthogonal signed distance from the line}$$




Another method to find a distance between $g(x) = -1$ and $g(x) = 1$

Suppose $g(x_1) = -1,\; g(x_2) = 1$


$$ \begin{array}{I} \omega_0 + \omega^Tx_1 = -1\\ \omega_0 + \omega^Tx_2 = 1\\ \end{array} \; \implies \; \omega^T(x_2 - x_1) = 2$$



$$s = \langle\frac{\omega}{\lVert \omega \rVert}, x_2 - x_1 \rangle = \frac{1}{\lVert \omega \rVert}\omega^T(x_2 - x_1) = \frac {2}{\lVert \omega \rVert}$$




3. Illustrative Example

  • Binary classification
    • $C_1$ and $C_0$
  • Features
    • The coordinate of the unknown animal $i$ in the zoo


$$x = \begin{bmatrix}x_1 \\ x_2 \end{bmatrix}$$

  • Is it possible to distinguish between $C_1$ and $C_0$ by its coordinates on a map of the zoo?
  • We need to find a separating hyperplane (or a line in 2D)


$$ \begin{align*} \omega_0 + \omega_1x_1 + \omega_2x_2 &= 0 \\\\ \omega_0 + \begin{bmatrix}\omega_1 & \omega_2 \end{bmatrix} \begin{bmatrix}x_1 \\ x_2 \end{bmatrix} &= 0\\\\ \omega_0 + \omega^Tx &= 0 \end{align*} $$

In [25]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#training data gerneration
x1 = 8*np.random.rand(100, 1)
x2 = 7*np.random.rand(100, 1) - 4

g = 0.8*x1 + x2 - 3
g1 = g - 1
g0 = g + 1

C1 = np.where(g1 >= 0)[0]
C0 = np.where(g0 < 0)[0]
In [26]:
xp = np.linspace(-1,9,100).reshape(-1,1)
ypt = -0.8*xp + 3

plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', linewidth = 3, label = 'True')
plt.title('Linearly and Strictly Separable Classes', fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0, 8])
plt.ylim([-4, 3])
plt.show()
  • Given:
    • Hyperplane defined by $\omega$ and $\omega_0$
    • Animals coordinates (or features) $x$

  • Decision making:
$$ \omega_0 + \omega^Tx > 0 \implies x \; \text{belongs to} \; C_1$$$$ \omega_0 + \omega^Tx < 0 \implies x \; \text{belongs to} \; C_0$$
  • Find $\omega$ and $\omega_0$ such that $x$ given $\omega_0 + \omega^Tx = 0$

    or

  • Find $\omega$ and $\omega_0$ such that $x\in C_1$ given $\omega_0 + \omega^Tx > 1$ and $x\in C_0$ given $\omega_0 + \omega^Tx < -1 $


$$ \begin{align*} &\omega_0 + \omega^Tx > b, \quad (b>0)\\ \\ \Longleftrightarrow \;&\frac{\omega_0}{b} + \frac{\omega^T}{b}x > 1 \\\\ \Longleftrightarrow \;&\omega'_0 + \omega'^Tx > 1 \end{align*}$$

  • Same problem if strictly separable
In [27]:
#  see how data are generated
xp = np.linspace(-1,9,100).reshape(-1,1)
ypt = -0.8*xp + 3

plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', linewidth = 3, label = 'True')
plt.plot(xp, ypt-1, '--k')
plt.plot(xp, ypt+1, '--k')
plt.title('Linearly and Strictly Separable Classes', fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0, 8])
plt.ylim([-4, 3])
plt.show()

3.1. The First Attempt

  • $n \;(=2)$ features

  • $m = N + M$ data points in training set


$$ x^{(i)} = \begin{bmatrix} x^{(i)}_1 \\ x^{(i)}_2 \end{bmatrix} \;\text{with}\; \omega = \begin{bmatrix} \omega_1 \\ \omega_2 \end{bmatrix}\qquad \text{or} \qquad x^{(i)} = \begin{bmatrix} 1 \\ x^{(i)}_1 \\ x^{(i)}_2 \end{bmatrix} \;\; \text{with}\; \omega = \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2 \end{bmatrix}$$

  • $N$ belongs to $C_1$ in training set

  • $M$ belongs to $C_0$ in training set

  • $\omega$ and $\omega_0$ are the unknown variables
$$\begin{align*} \text{minimize} \quad & \text{something} \\ \\ \text{subject to} \quad & \begin{cases} \omega_0 + \omega^Tx^{(1)} \geq1\\ \omega_0 + \omega^Tx^{(2)} \geq1\\ \quad \quad \vdots \\ \omega_0 + \omega^Tx^{(N)}\geq1\\ \end{cases} \\ & \begin{cases} \omega_0 + \omega^Tx^{(N+1)} \leq{-1}\\ \omega_0 + \omega^Tx^{(N+2)} \leq{-1}\\ \quad \quad \vdots \\ \omega_0 + \omega^Tx^{(N+M)} \leq{-1}\\ \end{cases} \end{align*}$$ $$\begin{align*} \text{minimize} \quad & \text{something} \\ \\ \text{subject to} \quad & \begin{cases} \omega^Tx^{(1)} \geq1\\ \omega^Tx^{(2)} \geq1\\ \quad \quad \vdots \\ \omega^Tx^{(N)} \geq1\\ \end{cases} \\ & \begin{cases} \omega^Tx^{(N+1)} \leq{-1}\\ \omega^Tx^{(N+2)} \leq{-1}\\ \quad \quad \vdots \\ \omega^Tx^{(N+M)} \leq{-1}\\ \end{cases} \end{align*}$$

Code (CVXPY)

$$ \begin{align*} X_1 &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \vdots \\ \left(x^{(N)}\right)^T\end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ \vdots & \vdots \\ x_1^{(N)} & x_2^{(N)} \\ \end{bmatrix}\\ X_0 &= \begin{bmatrix} \left(x^{(N+1)}\right)^T \\ \left(x^{(N+2)}\right)^T \\ \vdots \\ \left(x^{(N+M)}\right)^T\end{bmatrix} = \begin{bmatrix} x_1^{(N+1)} & x_2^{(N+1)} \\ x_1^{(N+2)} & x_2^{(N+2)} \\ \vdots & \vdots \\ x_1^{(N+M)} & x_2^{(N+M)} \\ \end{bmatrix}\end{align*}$$ $$ \begin{align*} X_1 &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \vdots \\ \left(x^{(N)}\right)^T\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ \vdots & \vdots & \vdots \\ 1 & x_1^{(N)} & x_2^{(N)} \\ \end{bmatrix}\\ X_0 &= \begin{bmatrix} \left(x^{(N+1)}\right)^T \\ \left(x^{(N+2)}\right)^T \\ \vdots \\ \left(x^{(N+M)}\right)^T\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(N+1)} & x_2^{(N+1)} \\ 1 & x_1^{(N+2)} & x_2^{(N+2)} \\ \vdots & \vdots & \vdots \\ 1 & x_1^{(N+M)} & x_2^{(N+M)} \\ \end{bmatrix}\end{align*}$$


$$\begin{align*} \text{minimize} \quad & \text{something} \\ \text{subject to} \quad & \omega_0 + X_1\omega \geq 1 \\ & \omega_0 + X_0\omega \leq -1 \end{align*}$$ $$\begin{align*} \text{minimize} \quad & \text{something} \\ \text{subject to} \quad & X_1\omega \geq 1 \\ & X_0\omega \leq -1 \end{align*}$$

Form 1

$$\begin{align*} \text{minimize} \quad & \text{something} \\ \text{subject to} \quad & \omega_0 + X_1\omega \geq 1 \\ & \omega_0 + X_0\omega \leq -1 \end{align*}$$
In [28]:
# CVXPY using simple classification
import cvxpy as cvx

X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])

X1 = np.asmatrix(X1)
X0 = np.asmatrix(X0)

N = X1.shape[0]
M = X0.shape[0]
In [29]:
w0 = cvx.Variable([1,1])
w = cvx.Variable([2,1])

obj = cvx.Minimize(1)
const = [w0 + X1*w >= 1, w0 + X0*w <= -1]
prob = cvx.Problem(obj, const).solve()

w0 = w0.value
w = w.value
In [30]:
xp = np.linspace(-1,9,100).reshape(-1,1)
yp = - w[0,0]/w[1,0]*xp - w0/w[1,0]

plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', alpha = 0.3, label = 'True')
plt.plot(xp, ypt-1, '--k', alpha = 0.3)
plt.plot(xp, ypt+1, '--k', alpha = 0.3)
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Attempt 1')
plt.title('Linearly and Strictly Separable Classes', fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1)
plt.axis('equal')
plt.xlim([0, 8])
plt.ylim([-4, 3])
plt.show()

Form 2

$$\begin{align*} \text{minimize} \quad & \text{something} \\ \text{subject to} \quad & X_1\omega \geq 1 \\ & X_0\omega \leq -1 \end{align*}$$
In [31]:
import cvxpy as cvx

N = C1.shape[0]
M = C0.shape[0]

X1 = np.hstack([np.ones([N,1]), x1[C1], x2[C1]])
X0 = np.hstack([np.ones([M,1]), x1[C0], x2[C0]])

X1 = np.asmatrix(X1)
X0 = np.asmatrix(X0)
In [32]:
w = cvx.Variable([3,1])

obj = cvx.Minimize(1)
const = [X1*w >= 1, X0*w <= -1]
prob = cvx.Problem(obj, const).solve()

w = w.value
In [33]:
xp = np.linspace(-1,9,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', alpha = 0.3, label = 'True')
plt.plot(xp, ypt-1, '--k', alpha = 0.3)
plt.plot(xp, ypt+1, '--k', alpha = 0.3)
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Attempt 1')
plt.title('Linearly and Strictly Separable Classes', fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0, 8])
plt.ylim([-4, 3])
plt.show()

3.2. Outlier

  • Note that in the real world, you may have noise, errors, or outliers that do not accurately represent the actual phenomena
  • Non-separable case
  • No solutions (hyperplane) exist
    • We will allow some training examples to be misclassified !
    • But we want their number to be minimized
In [34]:
X1 = np.hstack([np.ones([N,1]), x1[C1], x2[C1]])
X0 = np.hstack([np.ones([M,1]), x1[C0], x2[C0]])

outlier = np.array([1, 3, 2]).reshape(1,-1)
X0 = np.vstack([X0, outlier])

X1 = np.asmatrix(X1)
X0 = np.asmatrix(X0)

plt.figure(figsize=(10,8))
plt.plot(X1[:,1], X1[:,2], 'ro', alpha = 0.4, label = 'C1')
plt.plot(X0[:,1], X0[:,2], 'bo', alpha = 0.4, label = 'C0')
plt.title('When Outliers Exist', fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0, 8])
plt.ylim([-4, 3])
plt.show()
In [35]:
w = cvx.Variable([3,1])

obj = cvx.Minimize(1)
const = [X1*w >= 1, X0*w <= -1]
prob = cvx.Problem(obj, const).solve()

print(w.value)
None

3.3. The Second Attempt

  • $n \;(=2)$ features

  • $m = N+M$ data points in a training set

$$x^i = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \end{bmatrix}$$
  • $N$ belongs to $C_1$ in training set

  • $M$ belongs to $C_0$ in training set

  • $\omega$ and $\omega_0$ are the variables (unknown)
  • For the non-separable case, we relex the above constraints

  • Need slack variables $u$ and $\upsilon$ where all are positive

The optimization problem for the non-separable case


$$\begin{align*} \text{minimize} \quad & \sum\limits_{i=1}^{N}u_i + \sum\limits_{i=1}^{M}\upsilon_i \\ \\ \text{subject to} \quad & \begin{cases} \omega^Tx^{(1)} \geq 1-u_1\\ \omega^Tx^{(2)} \geq 1-u_2\\ \quad \quad \vdots \\ \omega^Tx^{(N)} \geq 1-u_N\\ \end{cases} \\\\ & \begin{cases} \omega^Tx^{(N+1)} \leq {-(1-\upsilon_1)}\\ \omega^Tx^{(N+2)} \leq {-(1-\upsilon_2)}\\ \quad \quad \vdots \\ \omega^Tx^{(N+M)} \leq {-(1-\upsilon_M)}\\ \end{cases} \\\\ & \begin{cases} u \geq 0\\ v \geq 0\\ \end{cases} \end{align*}$$

  • Expressed in a matrix form
$$ \begin{align*} X_1 &= \begin{bmatrix} x^{(1)^T} \\ x^{(2)^T} \\ \vdots \\ x^{(N)^T}\end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ \vdots & \vdots \\ x_1^{(N)} & x_2^{(N)} \\ \end{bmatrix}\\\\ X_0 &= \begin{bmatrix} x^{(N+1)^T} \\ x^{(N+2)^T} \\ \vdots \\ x^{(N+M)^T}\end{bmatrix} = \begin{bmatrix} x_1^{(N+1)} & x_2^{(N+1)} \\ x_1^{(N+2)} & x_2^{(N+2)} \\ \vdots & \vdots \\ x_1^{(N+M)} & x_2^{(N+M)} \\ \end{bmatrix} \\\\ u &= \begin{bmatrix} u_1\\ \vdots\\ u_N\\ \end{bmatrix}\\ \upsilon &= \begin{bmatrix} \upsilon_{1}\\ \vdots\\ \upsilon_{M}\\ \end{bmatrix} \end{align*}$$ $$ \begin{align*} X_1 &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \vdots \\ \left(x^{(N)}\right)^T\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ \vdots & \vdots & \vdots \\ 1 & x_1^{(N)} & x_2^{(N)} \\ \end{bmatrix}\\\\ X_0 &= \begin{bmatrix} \left(x^{(N+1)}\right)^T \\ \left(x^{(N+2)}\right)^T \\ \vdots \\ \left(x^{(N+M)}\right)^T\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(N+1)} & x_2^{(N+1)} \\ 1 & x_1^{(N+2)} & x_2^{(N+2)} \\ \vdots & \vdots & \vdots \\ 1 & x_1^{(N+M)} & x_2^{(N+M)} \\ \end{bmatrix} \\\\ u &= \begin{bmatrix} u_1\\ \vdots\\ u_N\\ \end{bmatrix}\\ \upsilon &= \begin{bmatrix} \upsilon_{1}\\ \vdots\\ \upsilon_{M}\\ \end{bmatrix} \end{align*}$$



$$\begin{align*} \text{minimize} \quad & 1^Tu + 1^T\upsilon \\ \text{subject to} \quad & \omega_0 + X_1\omega \geq 1-u \\ & \omega_0 + X_0\omega \leq -(1-\upsilon) \\ & u \geq 0 \\ & \upsilon \geq 0 \end{align*}$$ $$\begin{align*} \text{minimize} \quad & 1^Tu + 1^T\upsilon \\ \text{subject to} \quad & X_1\omega \geq 1-u \\ & X_0\omega \leq -(1-\upsilon) \\ & u \geq 0 \\ & \upsilon \geq 0 \end{align*}$$
In [36]:
X1 = np.hstack([np.ones([N,1]), x1[C1], x2[C1]])
X0 = np.hstack([np.ones([M,1]), x1[C0], x2[C0]])

outlier = np.array([1, 2, 2]).reshape(1,-1)
X0 = np.vstack([X0, outlier])

X1 = np.asmatrix(X1)
X0 = np.asmatrix(X0)

N = X1.shape[0]
M = X0.shape[0]
In [37]:
w = cvx.Variable([3,1])
u = cvx.Variable([N,1])
v = cvx.Variable([M,1])

obj = cvx.Minimize(np.ones((1,N))*u + np.ones((1,M))*v)
const = [X1*w >= 1-u, X0*w <= -(1-v), u >= 0, v >= 0 ]
prob = cvx.Problem(obj, const).solve()

w = w.value
In [38]:
xp = np.linspace(-1,9,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize=(10, 8))
plt.plot(X1[:,1], X1[:,2], 'ro', alpha = 0.4, label = 'C1')
plt.plot(X0[:,1], X0[:,2], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', alpha = 0.1, label = 'True')
plt.plot(xp, ypt-1, '--k', alpha = 0.1)
plt.plot(xp, ypt+1, '--k', alpha = 0.1)
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Attempt 2')
plt.plot(xp, yp-1/w[2,0], '--g')
plt.plot(xp, yp+1/w[2,0], '--g')
plt.title('When Outliers Exist', fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0, 8])
plt.ylim([-4, 3])
plt.show()

Further improvement

  • Notice that hyperplane is not as accurately represent the division due to the outlier
  • Can we do better when there are noise data or outliers?
  • Yes, but we need to look beyond LP
  • Idea: large margin leads to good generalization on the test data

3.4. Maximize Margin (Finally, it is Support Vector Machine)

  • Distance (= margin)
$$\text{margin} = \frac{2}{\lVert \omega \rVert _2}$$
  • Minimize $\lVert \omega \rVert_2$ to maximize the margin

  • Multiple objectives
  • Use gamma ($\gamma$) as a weighting betwwen the followings:
    • Bigger margin given robustness to outliers
    • Hyperplane that has few (or no) errors



$$\begin{align*} \text{minimize} \quad & \lVert \omega \rVert_2 + \gamma(1^Tu + 1^T\upsilon) \\ \text{subject to} \quad & X_1\omega \geq 1-u \\ & X_0\omega \leq -(1-\upsilon) \\ & u \geq 0 \\ & \upsilon \geq 0 \end{align*}$$

In [39]:
g = 2

w = cvx.Variable([3,1])
u = cvx.Variable([N,1])
v = cvx.Variable([M,1])

obj = cvx.Minimize(cvx.norm(w,2) + g*(np.ones((1,N))*u + np.ones((1,M))*v))
const = [X1*w >= 1-u, X0*w <= -(1-v), u >= 0, v >= 0 ]
prob = cvx.Problem(obj, const).solve()

w = w.value
In [40]:
xp = np.linspace(-1,9,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize=(10, 8))
plt.plot(X1[:,1], X1[:,2], 'ro', alpha = 0.4, label = 'C1')
plt.plot(X0[:,1], X0[:,2], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', alpha = 0.1, label = 'True')
plt.plot(xp, ypt-1, '--k', alpha = 0.1)
plt.plot(xp, ypt+1, '--k', alpha = 0.1)
plt.plot(xp, yp, 'g', linewidth = 3, label = 'SVM')
plt.plot(xp, yp-1/w[2,0], '--g')
plt.plot(xp, yp+1/w[2,0], '--g')
plt.title('When Outliers Exist', fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0, 8])
plt.ylim([-4, 3])
plt.show()

4. Support Vector Machine

  • Probably the most popular/influential classification algorithm
  • A hyperplane based classifier (like the Perceptron)
  • Additionally uses the maximum margin principle
    • Maximize distance (margin) of closest samples from the decision line

      $$ \text{maximize {minimum distance}} $$
    • Note: perceptron only utilizes a sign of distance
    • Finds the hyperplane with maximum separation margin on the training data


$$\begin{align*} \text{minimize} \quad & \lVert \omega \rVert_2 + \gamma(1^Tu + 1^T\upsilon) \\ \text{subject to} \quad & X_1\omega \geq 1-u \\ & X_0\omega \leq -(1-\upsilon) \\ & u \geq 0 \\ & \nu \geq 0 \end{align*}$$

  • In a more compact form


$$ \begin{align*} \omega^T x_n &\geq 1 \;\text{for }\; y_n = +1 \\ \omega^T x_n &\leq -1 \;\text{for }\; y_n = -1 \end{align*} \Longleftrightarrow y_n \cdot \left( \omega^T x_n \right) \geq 1 $$


$$\begin{align*} \text{minimize} \quad & \lVert \omega \rVert_2 + \gamma(1^T \xi) \\ \text{subject to} \quad & y_n \cdot \left( \omega^T x_n \right) \geq 1 - \xi_n \\ & \xi \geq 0 \\ \end{align*}$$

In [41]:
X = np.vstack([X1, X0])
y = np.vstack([np.ones([N,1]), -np.ones([M,1])])

m = N + M

g = 2

w = cvx.Variable([3,1])
d = cvx.Variable([m,1])

obj = cvx.Minimize(cvx.norm(w,2) + g*(np.ones([1,m])*d))
const = [cvx.multiply(y, X*w) >= 1-d, d >= 0]
prob = cvx.Problem(obj, const).solve()

w = w.value
In [42]:
xp = np.linspace(-1,9,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize=(10, 8))
plt.plot(X1[:,1], X1[:,2], 'ro', alpha = 0.4, label = 'C1')
plt.plot(X0[:,1], X0[:,2], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', alpha = 0.1, label = 'True')
plt.plot(xp, ypt-1, '--k', alpha = 0.1)
plt.plot(xp, ypt+1, '--k', alpha = 0.1)
plt.plot(xp, yp, 'g', linewidth = 3, label = 'SVM')
plt.plot(xp, yp-1/w[2,0], '--g')
plt.plot(xp, yp+1/w[2,0], '--g')
plt.title('Support Vector Machine (SVM)', fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0, 8])
plt.ylim([-4, 3])
plt.show()

5. Nonlinear Support Vector Machine

5.1. Kernel

  • Often we want to capture nonlinear patterns in the data
    • Nonlinear regression: input and output relationship may not be linear
    • Nonlinear classification: classes may note be separable by a linear boundary
  • Linear models (e.g. linear regression, linear SVM) are note just rich enough
  • Kernels: make linear model work in nonlinear settings
    • By mapping data to higher dimensions where it exhibits linear patterns
    • Apply the linear model in the new input feature space
    • Mapping $=$ changing the feature representation
  • Note: such mappings can be expensive to compute in general
    • Kernels give such mappings for (almost) free
    • In most cases, the mappings need not be even computed
    • Using the Kernel trick !

5.2. Classifying non-linear separable data

  • Consider the binary classification problem
    • Each example represented by a single feature $x$
    • No linear separator exists for this data