Supervised Learning

without Scikit Learn

 by Prof. Seungchul Lee, Industrial AI Lab (http://isystems.unist.ac.kr/), POSTECH

# 1. Optimization¶

• an important tool in 1) engineering problem solving and 2) decision science
• people optimize
• nature optimizes

3 key components

1. objective
2. decision variable or unknown
3. constraints

Procedures

1. The process of identifying objective, variables, and constraints for a given problem is known as "modeling"
2. Once the model has been formulated, an optimization algorithm can be used to find its solutions.

In mathematical expression

\begin{align*} \min_{x} \quad &f(x) \\ \text{subject to} \quad &g_i(x) \leq 0, \qquad i=1,\cdots,m \end{align*}

Remarks) equivalent

\begin{align*} \min_{x} f(x) \quad&\leftrightarrow \quad \max_{x} -f(x)\\ \quad g_i(x) \leq 0\quad&\leftrightarrow \quad -g_i(x) \geq 0\\ h(x) = 0 \quad&\leftrightarrow \quad \begin{cases} h(x) \leq 0 \quad \text{and} \\ h(x) \geq 0 \end{cases} \end{align*}

The good news: for many classes of optimization problems, people have already done all the "hard work" of developing numerical algorithms.

\begin{array}{Icr}\begin{align*} \max_{x} \quad & x_1 + x_2 \\ \text{subject to} \quad & 2x_1 + x_2 \leq 29 \\ & x_1 + 2x_2 \leq 25 \\ & x_1 \geq 2 \\ & x_2 \geq 5 \end{align*}\end{array} \quad\implies\quad \begin{array}{Icr} \begin{align*} \min_{x} \quad & - \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \\ \text{subject to} \quad & \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \leq \begin{bmatrix} 29 \\ 25 \end{bmatrix} \\ & \begin{bmatrix} 2 \\ 5 \end{bmatrix} \leq \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \end{align*} \end{array}
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import cvxpy as cvx

f = np.array([[1, 1]])
A = np.array([[2, 1], [1, 2]])
b = np.array([[29], [25]])
lb = np.array([[2], [5]])

x = cvx.Variable((2, 1))
obj = cvx.Minimize(-f @ x)
const = [A @ x <= b, lb <= x]

prob = cvx.Problem(obj, const).solve()

print(x.value)

[[11.]
[ 7.]]
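As a cross-check (not part of the original notes), the same small LP can be solved with plain NumPy by brute-force vertex enumeration, using the fact that an LP optimum lies at a vertex of the feasible polytope; this is a sketch for 2D toy problems only, not a general solver:

```python
import numpy as np
from itertools import combinations

# All constraints written uniformly as a_i^T x <= b_i
A = np.array([[ 2.0,  1.0],   # 2x1 +  x2 <= 29
              [ 1.0,  2.0],   #  x1 + 2x2 <= 25
              [-1.0,  0.0],   # -x1       <= -2   (x1 >= 2)
              [ 0.0, -1.0]])  #       -x2 <= -5   (x2 >= 5)
b = np.array([29.0, 25.0, -2.0, -5.0])

# Intersect every pair of constraint lines, keep the feasible
# intersection points, and pick the one maximizing x1 + x2.
vertices = []
for i, j in combinations(range(len(b)), 2):
    Aij = A[[i, j]]
    if abs(np.linalg.det(Aij)) > 1e-12:
        v = np.linalg.solve(Aij, b[[i, j]])
        if np.all(A @ v <= b + 1e-9):
            vertices.append(v)

best = max(vertices, key=lambda v: v[0] + v[1])
print(best)  # optimal vertex: x1 = 11, x2 = 7, matching CVXPY above
```

This mirrors the geometric picture behind the simplex method: only the corners of the feasible region need to be inspected.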


# 2. Linear Regression¶

Begin by considering linear regression (easy to extend to more complex predictions later on)

$\text{Given} \; \begin{cases} x_{i} \; \text{: inputs} \\ y_{i} \; \text{: outputs} \end{cases}$ , $\;$ find $\theta_{1}$ and $\theta_{2}$

$$x= \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{m} \end{bmatrix}, \qquad y= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix}, \qquad y_{i} \approx \hat{y}_{i} = \theta_{1}x_{i} + \theta_{2}$$

• $\hat{y}_{i}$ : predicted output

• $\theta = \begin{bmatrix} \theta_{1} \\ \theta_{2} \\ \end{bmatrix}$ : Model parameters
$$\hat{y}_{i} = f(x_{i}, \theta) \; \text{ in general}$$
• In many cases, a linear model to predict $y_{i}$ can be used

$$\hat{y}_{i} = \theta_{1}x_{i} + \theta_{2} \;\quad\text{ such that } \quad \min\limits_{\theta_{1}, \theta_{2}}\sum\limits_{i = 1}^{m} (\hat{y}_{i} - y_{i})^2$$

In [2]:
import numpy as np
import matplotlib.pyplot as plt

# data points in column vector [input, output]
x = np.array([0.1, 0.4, 0.7, 1.2, 1.3, 1.7, 2.2, 2.8, 3.0, 4.0, 4.3, 4.4, 4.9]).reshape(-1, 1)
y = np.array([0.5, 0.9, 1.1, 1.5, 1.5, 2.0, 2.2, 2.8, 2.7, 3.0, 3.5, 3.7, 3.9]).reshape(-1, 1)

# to plot
plt.figure(figsize=(10, 6))
plt.plot(x, y, 'ko', label="data")
plt.xlabel('X', fontsize=15)
plt.ylabel('Y', fontsize=15)
plt.axis('scaled')
plt.grid(alpha=0.3)
plt.xlim([0, 5])
plt.show()


Use CVXPY optimization (least squares)

For convenience, we define a function that maps inputs to feature vectors, $\phi$

\begin{array}{Icr}\begin{align*} \hat{y}_{i} & = \begin{bmatrix}x_{i} & 1\end{bmatrix}\begin{bmatrix}\theta_{1} \\ \theta_{2}\end{bmatrix} \\\\ & =\begin{bmatrix}x_{i} \\1\end{bmatrix}^{T}\begin{bmatrix}\theta_{1} \\ \theta_{2}\end{bmatrix} \\\\ & =\phi^{T}(x_{i})\theta \end{align*}\end{array}, \begin{array}{Icr} \quad \quad \text{feature vector} \; \phi(x_{i}) = \begin{bmatrix}x_{i} \\1\end{bmatrix} \end{array}

$$\Phi = \begin{bmatrix}x_{1} & 1 \\ x_{2} & 1 \\ \vdots & \vdots \\ x_{m} & 1 \end{bmatrix}=\begin{bmatrix}\phi^T(x_{1}) \\ \phi^T(x_{2}) \\ \vdots \\ \phi^T(x_{m}) \end{bmatrix} \quad \implies \quad \hat{y} = \begin{bmatrix}\hat{y}_{1} \\ \hat{y}_{2} \\ \vdots \\ \hat{y}_{m}\end{bmatrix}=\Phi\theta$$

Model parameter estimation

$$\min_{\theta} ~ \lVert \hat y - y \rVert_2 = \min_{\theta} ~ \lVert \Phi\theta - y \rVert_2$$
In [3]:
import cvxpy as cvx

m = y.shape[0]
#A = np.hstack([x, np.ones([m, 1])])
A = np.hstack([x, x**0])

theta2 = cvx.Variable((2, 1))
obj = cvx.Minimize(cvx.norm(A @ theta2 - y, 2))
cvx.Problem(obj, []).solve()

print('theta:\n', theta2.value)

theta:
[[0.67129519]
[0.65306531]]
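Since minimizing $\lVert \Phi\theta - y \rVert_2$ is ordinary least squares, the estimate can be cross-checked in closed form (a sketch using `np.linalg.lstsq` on the same data defined above):

```python
import numpy as np

# Same data as above; cross-check the CVXPY fit with the closed-form
# least-squares solution of min ||Phi theta - y||_2.
x = np.array([0.1, 0.4, 0.7, 1.2, 1.3, 1.7, 2.2, 2.8,
              3.0, 4.0, 4.3, 4.4, 4.9]).reshape(-1, 1)
y = np.array([0.5, 0.9, 1.1, 1.5, 1.5, 2.0, 2.2, 2.8,
              2.7, 3.0, 3.5, 3.7, 3.9]).reshape(-1, 1)

Phi = np.hstack([x, np.ones_like(x)])        # feature matrix [x_i, 1]
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(theta.ravel())  # should match the CVXPY estimate above
```

Both routes minimize the same convex objective, so they agree up to numerical tolerance.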

In [4]:
# to plot
plt.figure(figsize=(10, 6))
plt.title('$L_2$ Regression', fontsize=15)
plt.xlabel('X', fontsize=15)
plt.ylabel('Y', fontsize=15)
plt.plot(x, y, 'ko', label="data")

# to plot a straight line (fitted line)
xp = np.arange(0, 5, 0.01).reshape(-1, 1)
theta2 = theta2.value
yp = theta2[0,0]*xp + theta2[1,0]

plt.plot(xp, yp, 'r', linewidth=2, label="$L_2$")
plt.legend(fontsize=15)
plt.axis('scaled')
plt.grid(alpha=0.3)
plt.xlim([0, 5])
plt.show()


# 3. Classification (Linear)¶

• Figure out, autonomously, which category (or class) an unknown item should be categorized into
• Number of categories / classes

• Binary: 2 different classes

• Multiclass: more than 2 classes

• Feature

• The measurable parts that make up the unknown item (or the information you have available to categorize)
• Perceptron: make use of sign of data
• Support Vector Machine (SVM)
• Logistic regression is a classification algorithm
• don't be confused
• To find a classification boundary

Sign

• Sign with respect to a line
\begin{align*}&\omega = \begin{bmatrix} \omega_1 \\ \omega_2 \end{bmatrix}, \qquad x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \, \implies \, g(x) = \omega_1 x_1 + \omega_2 x_2 + \omega_0 = \omega^T x + \omega_0 \\ \\ &\omega = \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2 \end{bmatrix}, \qquad x = \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix} \, \implies \, g(x) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 = \omega^T x\end{align*}

Perceptron

• Hyperplane
• Separates a D-dimensional space into two half-spaces
• Defined by an outward pointing normal vector
• $\omega$ is orthogonal to any vector lying on the hyperplane

How to find $\omega$

• All data in class 1
• $g(x) = \omega^T x > 0$
• All data in class 0
• $g(x) = \omega^T x < 0$

Perceptron Algorithm

The perceptron implements

$$h(x) = \text{sign}\left( \omega^Tx \right)$$

Given the training set

$$(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N) \quad \text{where } y_i \in \{-1,1\}$$

1) pick a misclassified point

$$\text{sign}\left(\omega^Tx_n \right) \neq y_n$$

2) and update the weight vector

$$\omega \leftarrow \omega + y_nx_n$$

• Why do perceptron updates work?
• Let's look at a misclassified positive example ($y_n = +1$)
• perceptron (wrongly) thinks $\omega_{old}^T x_n < 0$
• updates would be \begin{align*}\omega_{new} &= \omega_{old} + y_n x_n = \omega_{old} + x_n \\ \\ \omega_{new}^T x_n &= (\omega_{old} + x_n)^T x_n = \omega_{old}^T x_n + x_n^T x_n \end{align*}
• Thus $\omega_{new}^T x_n$ is less negative than $\omega_{old}^T x_n$
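The argument above can be checked numerically on one misclassified positive example (the vectors below are made-up illustrative values, not the notebook's data):

```python
import numpy as np

# One misclassified positive example (y_n = +1): after the perceptron
# update, the score w^T x_n must increase by exactly ||x_n||^2.
w_old = np.array([-1.0, 0.5])
x_n = np.array([2.0, 1.0])
y_n = 1

assert np.sign(w_old @ x_n) != y_n      # misclassified: w_old^T x_n < 0

w_new = w_old + y_n * x_n               # perceptron update
print(w_old @ x_n, w_new @ x_n)         # score moves toward the correct sign
```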
In [5]:
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [6]:
# training data generation
m = 100
x1 = 8*np.random.rand(m, 1)
x2 = 7*np.random.rand(m, 1) - 4

g0 = 0.8*x1 + x2 - 3
g1 = g0 - 1
g2 = g0 + 1

In [7]:
C1 = np.where(g1 >= 0)
C2 = np.where(g2 < 0)
print(C1)

(array([ 0,  2,  4,  5,  6,  7, 10, 17, 22, 24, 25, 27, 28, 30, 31, 38, 51,
52, 53, 56, 57, 61, 62, 64, 70, 78, 85, 86, 89, 98], dtype=int64), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0], dtype=int64))

In [8]:
C1 = np.where(g1 >= 0)[0]
C2 = np.where(g2 < 0)[0]
print(C1.shape)
print(C2.shape)

(30,)
(41,)

In [9]:
plt.figure(figsize=(10, 6))
plt.plot(x1[C1], x2[C1], 'ro', label='C1')
plt.plot(x1[C2], x2[C2], 'bo', label='C2')
plt.title('Linearly separable classes', fontsize=15)
plt.legend(loc='upper left', fontsize=15)
plt.xlabel(r'$x_1$', fontsize=20)
plt.ylabel(r'$x_2$', fontsize=20)
plt.show()

\begin{align*} x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)}\\\vdots & \vdots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)}\end{bmatrix} \\ y &= \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}
In [10]:
X1 = np.hstack([np.ones([C1.shape[0],1]), x1[C1], x2[C1]])
X2 = np.hstack([np.ones([C2.shape[0],1]), x1[C2], x2[C2]])
X = np.vstack([X1, X2])

y = np.vstack([np.ones([C1.shape[0],1]), -np.ones([C2.shape[0],1])])

X = np.asmatrix(X)
y = np.asmatrix(y)

$$\omega = \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2 \end{bmatrix}$$

$$\omega \leftarrow \omega + yx$$

where $(x, y)$ is a misclassified training point

In [11]:
w = np.ones([3,1])
w = np.asmatrix(w)

n_iter = y.shape[0]
for k in range(n_iter):
    for i in range(n_iter):
        if y[i,0] != np.sign(X[i,:]*w)[0,0]:
            w += y[i,0]*X[i,:].T

print(w)

[[-13.        ]
[  3.35733863]
[  8.90909811]]

\begin{align*} g(x) &= \omega^Tx + \omega_0 = \omega_1x_1 + \omega_2x_2 + \omega_0 = 0 \\ \implies x_2 &= -\frac{\omega_1}{\omega_2} x_1 - \frac{\omega_0}{\omega_2} \end{align*}

$\color{Red}{\text{Not a unique solution}}$

In [12]:
x1p = np.linspace(0,8,100).reshape(-1,1)
x2p = - w[1,0]/w[2,0]*x1p - w[0,0]/w[2,0]

plt.figure(figsize=(10, 6))
plt.scatter(x1[C1], x2[C1], c='r', s=50, label='C1')
plt.scatter(x1[C2], x2[C2], c='b', s=50, label='C2')
plt.plot(x1p, x2p, c='k', label='perceptron')
plt.xlim([0,8])
plt.xlabel('$x_1$', fontsize = 20)
plt.ylabel('$x_2$', fontsize = 20)
plt.legend(loc = 4, fontsize = 15)
plt.show()


Perceptron

## 3.1. Distance¶

$$\omega = \begin{bmatrix}\omega_1 \\ \omega_2\end{bmatrix}, \, x = \begin{bmatrix}x_1\\x_2\end{bmatrix} \; \implies g(x) = \omega^Tx + \omega_0 = \omega_1x_1 + \omega_2x_2 + \omega_0$$

• Find the distance between $g(x) = -1$ and $g(x) = 1$

$\text{suppose }\; g(x_1) = -1,\; g(x_2) = 1$

$$\begin{array}{I} \omega^Tx_1+\omega_0 = -1\\ \omega^Tx_2+\omega_0 = 1\\ \end{array} \; \implies \; \omega^T(x_2 - x_1) = 2$$

$$s = \left\langle\frac{\omega}{\lVert \omega \rVert}, x_2 - x_1 \right\rangle = \frac{1}{\lVert \omega \rVert}\omega^T(x_2 - x_1) = \frac {2}{\lVert \omega \rVert}$$
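The formula $s = 2/\lVert \omega \rVert$ can be verified numerically (the $\omega$ and $\omega_0$ below are arbitrary illustrative values):

```python
import numpy as np

# Numeric check of the margin formula: for g(x) = w^T x + w0,
# the distance between the lines g(x) = -1 and g(x) = 1 is 2/||w||.
w = np.array([3.0, 4.0])
w0 = -2.0

# Pick one point on each line, both along the direction of w.
x1 = (-1 - w0) * w / (w @ w)      # satisfies w^T x1 + w0 = -1
x2 = ( 1 - w0) * w / (w @ w)      # satisfies w^T x2 + w0 =  1

margin = np.linalg.norm(x2 - x1)
print(margin, 2 / np.linalg.norm(w))   # both give the same value
```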

## 3.2. Illustrative Example¶

• Binary classification: $C_1$ and $C_2$

• Features: the coordinates of the $i$-th data point

$$x = \begin{bmatrix}x_1 \\ x_2 \end{bmatrix}$$

• Is it possible to distinguish between $C_1$ and $C_2$ by its coordinates?
• We need to find a separating hyperplane (or a line in 2D)

$$\omega_1x_1 + \omega_2x_2 + \omega_0 = 0$$

$$\begin{bmatrix}\omega_1 & \omega_2 \end{bmatrix} \begin{bmatrix}x_1 \\ x_2 \end{bmatrix} + \omega_0 = 0$$$$\omega^Tx + \omega_0 = 0$$
In [13]:
import numpy as np
import matplotlib.pyplot as plt

# training data generation
x1 = 8*np.random.rand(100, 1)
x2 = 7*np.random.rand(100, 1) - 4

g0 = 0.8*x1 + x2 - 3
g1 = g0 - 1
g2 = g0 + 1

C1 = np.where(g1 >= 0)[0]
C2 = np.where(g2 < 0)[0]

In [14]:
xp = np.linspace(0,8,100).reshape(-1,1)
ypt = -0.8*xp + 3

plt.figure(figsize=(10, 6))
plt.plot(x1[C1], x2[C1], 'ro', label='C1')
plt.plot(x1[C2], x2[C2], 'bo', label='C2')
plt.plot(xp, ypt, '--k', label='True')
plt.title('linearly and strictly separable classes', fontweight = 'bold', fontsize = 15)
plt.xlabel('$x_1$', fontsize = 20)
plt.ylabel('$x_2$', fontsize = 20)
plt.legend(loc = 4)
plt.xlim([0, 8])
plt.ylim([-4, 3])
plt.show()

• Given:
• Hyperplane defined by $\omega$ and $\omega_0$
• Data coordinates (or features) $x$

• Decision making:
$$\omega^Tx + \omega_0 >0 \implies x \; \text{belongs to} \; C_1$$$$\omega^Tx + \omega_0 <0 \implies x \; \text{belongs to} \; C_2$$
• Find $\omega$ and $\omega_0$ that define the separating hyperplane $\omega^Tx + \omega_0 = 0$

or

• Find $\omega$ and $\omega_0$ such that $x\in C_1$ given $\omega^Tx + \omega_0 >1$ and $x\in C_2$ given $\omega^Tx + \omega_0 < -1$

\begin{align*} \;&\omega^Tx + \omega_0 > b \\ \Longleftrightarrow \;&\frac{\omega^T}{b}x + \frac{\omega_0}{b} > 1 \\ \Longleftrightarrow \;&\omega'^Tx + \omega'_0 > 1 \end{align*}

• Same problem if strictly separable
In [15]:
#  see how data are generated
xp = np.linspace(0,8,100).reshape(-1,1)
ypt = -0.8*xp + 3

plt.figure(figsize=(10, 6))
plt.plot(x1[C1], x2[C1], 'ro', label='C1')
plt.plot(x1[C2], x2[C2], 'bo', label='C2')
plt.plot(xp, ypt, '--k', label='true')
plt.plot(xp, ypt-1, '-k')
plt.plot(xp, ypt+1, '-k')
plt.title('linearly and strictly separable classes', fontweight = 'bold', fontsize = 15)
plt.xlabel('$x_1$', fontsize = 20)
plt.ylabel('$x_2$', fontsize = 20)
plt.legend(loc = 4)
plt.xlim([0, 8])
plt.ylim([-4, 3])
plt.show()


### 3.2.1. Optimization Formulation 1¶

• $n \;(=2)$ features

• $m = N + M$ data points in training set

$$x^{(i)} = \begin{bmatrix} x^{(i)}_1 \\ x^{(i)}_2 \end{bmatrix} \;\text{with}\; \omega = \begin{bmatrix} \omega_1 \\ \omega_2 \end{bmatrix}\qquad \text{or} \qquad x^{(i)} = \begin{bmatrix} 1 \\ x^{(i)}_1 \\ x^{(i)}_2 \end{bmatrix} \;\; \text{with}\; \omega = \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2 \end{bmatrix}$$

• $N$ belongs to $C_1$ in training set

• $M$ belongs to $C_2$ in training set

• $\omega$ and $\omega_0$ are the unknown variables
 \begin{align*} \text{minimize} \quad & \text{something} \\ \\ \text{subject to} \quad & \begin{cases} \omega^Tx^{(1)} + \omega_0 \geq1\\ \omega^Tx^{(2)} + \omega_0 \geq1\\ \quad \quad \vdots \\ \omega^Tx^{(N)} + \omega_0 \geq1\\ \end{cases} \\\\ & \begin{cases} \omega^Tx^{(N+1)} + \omega_0 \leq{-1}\\ \omega^Tx^{(N+2)} + \omega_0 \leq{-1}\\ \quad \quad \vdots \\ \omega^Tx^{(N+M)} + \omega_0 \leq{-1}\\ \end{cases} \end{align*} or \begin{align*} \text{minimize} \quad & \text{something} \\ \\ \text{subject to} \quad & \begin{cases} \omega^Tx^{(1)} \geq1\\ \omega^Tx^{(2)} \geq1\\ \quad \quad \vdots \\ \omega^Tx^{(N)} \geq1\\ \end{cases} \\\\ & \begin{cases} \omega^Tx^{(N+1)} \leq{-1}\\ \omega^Tx^{(N+2)} \leq{-1}\\ \quad \quad \vdots \\ \omega^Tx^{(N+M)} \leq{-1}\\ \end{cases} \end{align*}

Code (CVXPY)

\begin{align*} &X_1 = \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \vdots \\ \left(x^{(N)}\right)^T\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ \vdots & \vdots & \vdots \\ 1 & x_1^{(N)} & x_2^{(N)} \\ \end{bmatrix}\\\\ &X_2 = \begin{bmatrix} \left(x^{(N+1)}\right)^T \\ \left(x^{(N+2)}\right)^T \\ \vdots \\ \left(x^{(N+M)}\right)^T\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(N+1)} & x_2^{(N+1)} \\ 1 & x_1^{(N+2)} & x_2^{(N+2)} \\ \vdots & \vdots & \vdots \\ 1 & x_1^{(N+M)} & x_2^{(N+M)} \\ \end{bmatrix}\end{align*}

\begin{align*} \text{minimize} \quad & \text{something} \\ \text{subject to} \quad & X_1\omega \geq 1 \\ & X_2\omega \leq -1 \end{align*}
In [16]:
# CVXPY using simple classification
import cvxpy as cvx

N = C1.shape[0]
M = C2.shape[0]

X1 = np.hstack([np.ones([N,1]), x1[C1], x2[C1]])
X2 = np.hstack([np.ones([M,1]), x1[C2], x2[C2]])

X1 = np.asmatrix(X1)
X2 = np.asmatrix(X2)

In [17]:
w = cvx.Variable((3, 1))
obj = cvx.Minimize(1)
const = [X1 @ w >= 1, X2 @ w <= -1]
prob = cvx.Problem(obj, const).solve()

w = w.value

In [18]:
xp = np.linspace(0,8,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize=(10, 6))
plt.plot(X1[:,1], X1[:,2], 'ro', label='C1')
plt.plot(X2[:,1], X2[:,2], 'bo', label='C2')
plt.plot(xp, yp, 'k', label='SVM')
plt.plot(xp, ypt, '--k', label='true')
plt.xlim([0,8])
plt.xlabel('$x_1$', fontsize = 20)
plt.ylabel('$x_2$', fontsize = 20)
plt.legend(loc = 4, fontsize = 15)
plt.show()


### 3.2.2. Outlier¶

• Note that in the real world, you may have noise, errors, or outliers that do not accurately represent the actual phenomena
• Non-separable case
• No solutions (hyperplane) exist
• We will allow some training examples to be misclassified !
• but we want their number to be minimized
In [19]:
X1 = np.hstack([np.ones([N,1]), x1[C1], x2[C1]])
X2 = np.hstack([np.ones([M,1]), x1[C2], x2[C2]])

outlier = np.array([1, 3, 2]).reshape(-1,1)
X2 = np.vstack([X2, outlier.T])

X1 = np.asmatrix(X1)
X2 = np.asmatrix(X2)

plt.figure(figsize=(10, 6))
plt.plot(X1[:,1], X1[:,2], 'ro', label='C1')
plt.plot(X2[:,1], X2[:,2], 'bo', label='C2')
plt.xlim([0,8])
plt.xlabel('$x_1$', fontsize = 20)
plt.ylabel('$x_2$', fontsize = 20)
plt.legend(loc = 4, fontsize = 15)
plt.show()

\begin{align*} \text{minimize} \quad & \text{something} \\ \text{subject to} \quad & X_1\omega \geq 1 \\ & X_2\omega \leq -1 \end{align*}
In [20]:
w = cvx.Variable((3, 1))
obj = cvx.Minimize(1)
const = [X1 @ w >= 1, X2 @ w <= -1]
prob = cvx.Problem(obj, const).solve()

print(w.value)

None

• No solutions (hyperplane) exist
• We will allow some training examples to be misclassified !
• but we want their number to be minimized

### 3.2.3. Optimization Formulation 2¶

• $n \;(=2)$ features

• $m = N+M$ data points in a training set

$$x^{(i)} = \begin{bmatrix} 1 \\ x_1^{(i)} \\ x_2^{(i)} \end{bmatrix} \quad \text{with} \;\; \omega = \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2 \end{bmatrix}$$

\begin{align*} \text{minimize} \quad & \text{something} \\ \text{subject to} \quad & X_1\omega \geq 1 \\ & X_2\omega \leq -1 \end{align*}
• $N$ belongs to $C_1$ in training set

• $M$ belongs to $C_2$ in training set

• $\omega$ and $\omega_0$ are the variables (unknown)
• For the non-separable case, we relax the above constraints

• Need slack variables $u$ and $\upsilon$, all of which are nonnegative

The optimization problem for the non-separable case

\begin{align*} \text{minimize} \quad & \sum\limits_{i=1}^{N}u_i + \sum\limits_{i=1}^{M}\upsilon_i \\ \\ \text{subject to} \quad & \begin{cases} \omega^Tx^{(1)} \geq1-u_1\\ \omega^Tx^{(2)} \geq1-u_2\\ \quad \quad \vdots \\ \omega^Tx^{(N)} \geq1-u_N\\ \end{cases} \\\\ & \begin{cases} \omega^Tx^{(N+1)} \leq{-(1-\upsilon_1)}\\ \omega^Tx^{(N+2)} \leq{-(1-\upsilon_2)}\\ \quad \quad \vdots \\ \omega^Tx^{(N+M)} \leq{-(1-\upsilon_M)}\\ \end{cases} \\\\ & \begin{cases} u \geq 0\\ \upsilon \geq 0\\ \end{cases} \end{align*}

• Expressed in a matrix form

\begin{align*} X_1 &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \vdots \\ \left(x^{(N)}\right)^T\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ \vdots & \vdots & \vdots \\ 1 & x_1^{(N)} & x_2^{(N)} \\ \end{bmatrix}\\\\ X_2 &= \begin{bmatrix} \left(x^{(N+1)}\right)^T \\ \left(x^{(N+2)}\right)^T \\ \vdots \\ \left(x^{(N+M)}\right)^T\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(N+1)} & x_2^{(N+1)} \\ 1 & x_1^{(N+2)} & x_2^{(N+2)} \\ \vdots & \vdots & \vdots \\ 1 & x_1^{(N+M)} & x_2^{(N+M)} \\ \end{bmatrix} \\\\ u &= \begin{bmatrix} u_1\\ \vdots\\ u_N\\ \end{bmatrix}\\\\ \upsilon &= \begin{bmatrix} \upsilon_{1}\\ \vdots\\ \upsilon_{M}\\ \end{bmatrix} \end{align*}

\begin{align*} \text{minimize} \quad & 1^Tu + 1^T\upsilon \\ \text{subject to} \quad & X_1\omega \geq 1-u \\ & X_2\omega \leq -(1-\upsilon) \\ & u \geq 0 \\ & \upsilon \geq 0 \end{align*}
In [21]:
X1 = np.hstack([np.ones([C1.shape[0],1]), x1[C1], x2[C1]])
X2 = np.hstack([np.ones([C2.shape[0],1]), x1[C2], x2[C2]])

outlier = np.array([1, 2, 2]).reshape(-1,1)
X2 = np.vstack([X2, outlier.T])

X1 = np.asmatrix(X1)
X2 = np.asmatrix(X2)

N = X1.shape[0]
M = X2.shape[0]

w = cvx.Variable((3, 1))
u = cvx.Variable((N, 1))
v = cvx.Variable((M, 1))
obj = cvx.Minimize(np.ones((1, N)) @ u + np.ones((1, M)) @ v)
const = [X1 @ w >= 1 - u, X2 @ w <= -(1 - v), u >= 0, v >= 0]
prob = cvx.Problem(obj, const).solve()

w = w.value

In [22]:
xp = np.linspace(0,8,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize=(10, 6))
plt.plot(X1[:,1], X1[:,2], 'ro', label='C1')
plt.plot(X2[:,1], X2[:,2], 'bo', label='C2')
plt.plot(xp, yp, '--k', label='SVM')
plt.plot(xp, yp-1/w[2,0], '-k')
plt.plot(xp, yp+1/w[2,0], '-k')
plt.xlim([0,8])
plt.xlabel('$x_1$', fontsize = 20)
plt.ylabel('$x_2$', fontsize = 20)
plt.legend(loc = 4, fontsize = 15)
plt.show()


Further improvement

• Notice that the hyperplane no longer represents the class division accurately, due to the outlier
• Can we do better when there are noisy data or outliers?
• Yes, but we need to look beyond LP
• Idea: large margin leads to good generalization on the test data

## 3.3. Maximize Margin (Finally, it is Support Vector Machine)¶

• Distance (= margin)
$$\text{margin} = \frac{2}{\lVert \omega \rVert _2}$$
• Minimize $\lVert \omega \rVert_2$ to maximize the margin (closest samples from the decision line)

$$\text{maximize {minimum distance}}$$

• Use gamma ($\gamma$) as a weighting between the following:
• A bigger margin gives robustness to outliers
• A hyperplane that has few (or no) errors

\begin{align*} \text{minimize} \quad & \lVert \omega \rVert_2 + \gamma(1^Tu + 1^T\upsilon) \\ \text{subject to} \quad & X_1\omega \geq 1-u \\ & X_2\omega \leq -(1-\upsilon) \\ & u \geq 0 \\ & \upsilon \geq 0 \end{align*}

In [23]:
g = 1
w = cvx.Variable((3, 1))
u = cvx.Variable((N, 1))
v = cvx.Variable((M, 1))
obj = cvx.Minimize(cvx.norm(w, 2) + g*(np.ones((1, N)) @ u + np.ones((1, M)) @ v))
const = [X1 @ w >= 1 - u, X2 @ w <= -(1 - v), u >= 0, v >= 0]
prob = cvx.Problem(obj, const).solve()

w = w.value

In [24]:
xp = np.linspace(0,8,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize=(10, 6))
plt.plot(X1[:,1], X1[:,2], 'ro', label='C1')
plt.plot(X2[:,1], X2[:,2], 'bo', label='C2')
plt.plot(xp, yp, '--k', label='SVM')
plt.plot(xp, yp-1/w[2,0], '-k')
plt.plot(xp, yp+1/w[2,0], '-k')
plt.xlim([0,8])
plt.xlabel('$x_1$', fontsize = 20)
plt.ylabel('$x_2$', fontsize = 20)
plt.legend(loc = 4, fontsize = 15)
plt.show()


## 3.4. Logistic Regression¶

• Logistic regression is a classification algorithm
• don't be confused
• Perceptron: make use of sign of data
• SVM: make use of margin (minimum distance)
• Distance from a single data point
• We want to use distance information of ALL data points
• logistic regression

Using Distances

• basic idea: find the decision boundary (hyperplane) $g(x)=\omega^T x = 0$ that maximizes $\prod_i \vert h_i \vert \rightarrow$ optimization

• Inequality of arithmetic and geometric means
\begin{align*} {x_1 + x_2 + \dots + x_m \over m} \ge \sqrt[m]{x_1 \cdot x_2 \dots x_m}\\ \text{equality holds if and only if } x_1=x_2=\dots=x_m \end{align*}

• Roughly speaking, this optimization of $\max \prod_i \vert h_i \vert$ tends to position a hyperplane in the middle of two classes
$$h = {g(x) \over \Vert \omega \Vert} = {\omega^T x \over \Vert \omega \Vert} \sim \omega^T x$$
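The AM-GM inequality quoted above can be checked numerically (illustrative values only):

```python
import numpy as np

# AM-GM check: the arithmetic mean dominates the geometric mean,
# with equality exactly when all terms are equal.
xs = np.array([1.0, 4.0, 16.0])
am = xs.mean()                       # arithmetic mean
gm = xs.prod() ** (1 / len(xs))      # geometric mean
print(am, gm)                        # AM = 7.0 >= GM = 4.0

# Equality case: all terms identical.
eq = np.array([3.0, 3.0, 3.0])
print(eq.mean(), eq.prod() ** (1 / len(eq)))
```

This is why maximizing the product $\prod_i \vert h_i \vert$ at a fixed total tends to equalize the distances, positioning the hyperplane in the middle of the two classes.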

Using all Distances with Outliers

• SVM vs. Logistic Regression

Sigmoid function

• We link or squeeze $(-\infty, +\infty)$ to $(0,1)$ for several reasons:
• If $\sigma(z)$ is the sigmoid function, or the logistic function $$\sigma(z) = \frac{1}{1+e^{-z}} \implies \sigma(\omega^T x) = \frac{1}{1+e^{-\omega^T x}}$$
• logistic function always generates a value between 0 and 1
• Crosses 0.5 at the origin, then flattens out

• Benefit of mapping via the logistic function

• monotonic: yields the same or a similar optimization solution
• continuous and differentiable: good for gradient descent optimization
• probability or confidence: can be considered as probability

$$P\left(y = +1 \mid x,\omega\right) = \frac{1}{1+e^{-\omega^T x}} \;\; \in \; [0,1]$$
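A minimal sketch of the sigmoid's properties listed above (value 0.5 at the origin, range squeezed into $(0,1)$, monotonic):

```python
import numpy as np

# The logistic (sigmoid) function: squeezes (-inf, inf) into (0, 1).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                       # 0.5 at the origin
print(sigmoid(np.array([-5.0, 5.0])))     # pushed toward 0 and 1
```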

• Goal: we need to fit $\omega$ to our data
$$\max \prod_i \vert h_i \vert$$
• Classified based on probability
In [25]:
m = 200

X0 = np.random.multivariate_normal([0, 0], np.eye(2), m)
X1 = np.random.multivariate_normal([10, 10], np.eye(2), m)

X = np.vstack([X0, X1])
y = np.vstack([np.zeros([m,1]), np.ones([m,1])])

In [26]:
from sklearn import linear_model

clf = linear_model.LogisticRegression()
clf.fit(X, np.ravel(y))

Out[26]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
In [27]:
X_new = np.array([2, 0]).reshape(1, -1)
pred = clf.predict(X_new)

print(pred)

[0.]

In [28]:
pred = clf.predict_proba(X_new)

print(pred)

[[0.9407617 0.0592383]]
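As a sanity check (a sketch using tiny hand-made data, not the notebook's random sample), `predict_proba` can be reproduced by pushing the linear score $\omega^Tx + \omega_0$ through the sigmoid using the fitted `coef_` and `intercept_`:

```python
import numpy as np
from sklearn import linear_model

# Small deterministic two-class data set (illustrative values).
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [11.0, 10.0]])
y = np.array([0, 0, 1, 1])

clf = linear_model.LogisticRegression()
clf.fit(X, y)

x_new = np.array([[2.0, 0.0]])
score = x_new @ clf.coef_.T + clf.intercept_    # w^T x + w0
p1 = 1.0 / (1.0 + np.exp(-score))               # sigmoid -> P(y = 1 | x)

print(p1.ravel(), clf.predict_proba(x_new)[:, 1])  # should coincide
```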

In [29]:
plt.figure(figsize=(10, 6))
plt.plot(X0[:,0], X0[:,1], '.b', label='Class 0')
plt.plot(X1[:,0], X1[:,1], '.k', label='Class 1')
plt.plot(X_new[0,0], X_new[0,1], 'o', label='New Data', ms=5, mew=5)

plt.title('Logistic Regression', fontsize=15)
plt.legend(loc='lower right', fontsize=15)
plt.xlabel('X1', fontsize=15)
plt.ylabel('X2', fontsize=15)
plt.grid(alpha=0.3)
plt.axis('equal')
plt.show()
