Classification

# 1. Classification¶

• where $y$ is a discrete value

• develop the classification algorithm to determine which class a new input should fall into

• Later look at multiclass classification problem, although this is just an extension of binary classification
• We could use linear regression

• Then, threshold the classifier output (i.e. anything over some value is yes, else no)
• linear regression with thresholding seems to work
• We will learn

• perceptron
• logistic regression

# 2. Perceptron¶

• For input $x = \begin{bmatrix}x_1\\ \vdots\\ x_d \end{bmatrix}\;$ 'attributes of a customer'

• weights $\omega = \begin{bmatrix}\omega_1\\ \vdots\\ \omega_d \end{bmatrix}$

\begin{align*} \text{Approve credit if} \; & \sum\limits_{i=1}^{d}\omega_ix_i > \text{threshold}, \\ \text{Deny credit if} \; & \sum\limits_{i=1}^{d}\omega_ix_i < \text{threshold}. \end{align*}

$$h(x) = \text{sign} \left(\left( \sum\limits_{i=1}^{d}\omega_ix_i \right)- \text{threshold} \right) = \text{sign}\left(\left( \sum\limits_{i=1}^{d}\omega_ix_i \right)+ \omega_0\right)$$

• Introduce an artificial coordinate $x_0 = 1$:

$$h(x) = \text{sign}\left( \sum\limits_{i=0}^{d}\omega_ix_i \right)$$

• In a vector form, the perceptron implements

$$h(x) = \text{sign}\left( \omega^T x \right)$$

• Sign function

$$\text{sgn}(x) = \begin{cases} 1, &\text{if }\; x < 0\\ 0, &\text{if }\; x = 0\\ -1, &\text{if }\; x > 0 \end{cases}$$

• Hyperplane
• Separates a D-dimensional space into two half-spaces
• Defined by an outward pointing normal vector $\omega$
• $\omega$ is orthogonal to any vector lying on the hyperplane
• Assume the hyperplane passes through origin, $\omega^T x = 0$ with $x_0 = 1$

• Sign with respect to a line

\begin{align*} \omega = \begin{bmatrix}\omega_1 \\ \omega_2 \end{bmatrix}, \quad x = \begin{bmatrix} x_1 \\ x_2\end{bmatrix} &\implies g(x) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 = \omega_0 + \omega^T x\\\\ \omega = \begin{bmatrix}\omega_0 \\ \omega_1 \\ \omega_2 \end{bmatrix}, \quad x = \begin{bmatrix} 1 \\ x_1 \\ x_2\end{bmatrix} &\implies g(x) = \omega_0 \cdot 1 + \omega_1 x_1 + \omega_2 x_2 = \omega^T x \end{align*}

• Goal: to learn the hyperplane $g_{\omega}(x)=0$ using the training data

• How to find $\omega$

• All data in class 1 ($y = +1$) $$g(x) > 0$$
• All data in class 0 ($y = -1$) $$g(x) < 0$$

## 2.1. Perceptron Algorithm¶

The perceptron implements

$$h(x) = \text{sign}\left( \omega^Tx \right)$$

Given the training set

$$(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N) \quad \text{where } y_i \in \{-1,1\}$$

1. pick a misclassified point

$$\text{sign}\left(\omega^Tx_n \right) \neq y_n$$

1. and update the weight vector

$$\omega \leftarrow \omega + y_nx_n$$

• Let's look at a misclassified positive example ($y_n = +1$)
• perceptron (wrongly) thinks $\omega_{old}^T x_n < 0$

\begin{align*}\omega_{new} &= \omega_{old} + y_n x_n = \omega_{old} + x_n \\ \\ \omega_{new}^T x_n &= (\omega_{old} + x_n)^T x_n = \omega_{old}^T x_n + x_n^T x_n \end{align*}

• Thus $\omega_{new}^T x_n$ is less negative than $\omega_{old}^T x_n$

### 2.1.1. Iterations of Perceptron¶

1. Randomly assign $\omega$

2. One iteration of the PLA (perceptron learning algorithm) $$\omega \leftarrow \omega + yx$$ where $(x, y)$ is a misclassified training point

3. At iteration $i = 1, 2, 3, \cdots,$ pick a misclassified point from $$(x_1,y_1),(x_2,y_2),\cdots,(x_N, y_N)$$

4. and run a PLA iteration on it

5. That's it!

Summary

### 2.1.2. Perceptron loss function¶

$$\mathscr{L}(\omega) = \sum_{n =1}^{m} \max \left\{ 0, -y_n \cdot \left(\omega^T x_n \right)\right\}$$

• Loss $= 0$ on examples where perceptron is correct, i.e., $y_n \cdot \left(\omega^T x_n \right) > 0$

• Loss $> 0$ on examples where perceptron misclassified, i.e., $y_n \cdot \left(\omega^T x_n \right) < 0$

Note: $\text{sign}\left(\omega^T x_n \right) \neq y_n$ is equivalent to $y_n \cdot \left(\omega^T x_n \right) < 0$

## 2.2. Perceptron in Python¶

$$g(x) = \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0$$

\begin{align*} \omega &= \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}\\ \\ x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)}\\\vdots & \vdots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)}\end{bmatrix}, \qquad y = \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [ ]:
#training data gerneration
m = 100
x1 = 8*np.random.rand(m, 1)
x2 = 7*np.random.rand(m, 1) - 4

g = 0.8*x1 + x2 - 3

In [ ]:
C1 = np.where(g >= 1)
C0 = np.where(g < -1)
print(C1)

(array([ 1,  2,  4, 10, 12, 13, 14, 15, 19, 23, 24, 32, 33, 34, 40, 43, 46,
47, 51, 52, 53, 54, 57, 62, 64, 67, 68, 77, 82, 85, 88, 93, 94, 96,
98]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))

In [ ]:
C1 = np.where(g >= 1)[0]
C0 = np.where(g < -1)[0]
print(C1.shape)
print(C0.shape)

(35,)
(40,)

In [ ]:
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.title('Linearly Separable Classes', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.show()


\begin{align*} x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)}\\\vdots & \vdots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)}\end{bmatrix} \qquad y = \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}

In [ ]:
X1 = np.hstack([np.ones([C1.shape[0],1]), x1[C1], x2[C1]])
X0 = np.hstack([np.ones([C0.shape[0],1]), x1[C0], x2[C0]])
X = np.vstack([X1, X0])

y = np.vstack([np.ones([C1.shape[0],1]), -np.ones([C0.shape[0],1])])

X = np.asmatrix(X)
y = np.asmatrix(y)


$$\omega = \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}$$

$$\omega \leftarrow \omega + yx$$ where $(x, y)$ is a misclassified training point

In [ ]:
w = np.ones([3,1])
w = np.asmatrix(w)

n_iter = y.shape[0]
for k in range(n_iter):
for i in range(n_iter):
if y[i,0] != np.sign(X[i,:]*w)[0,0]:
w += y[i,0]*X[i,:].T

print(w)

[[-9.        ]
[ 2.61068705]
[ 4.15661782]]


\begin{align*} g(x) &= \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0 \\\\ \implies x_2 &= -\frac{\omega_1}{\omega_2} x_1 - \frac{\omega_0}{\omega_2} \end{align*}

In [ ]:
x1p = np.linspace(0,8,100).reshape(-1,1)
x2p = - w[1,0]/w[2,0]*x1p - w[0,0]/w[2,0]

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, c = 'k', linewidth = 3, label = 'perceptron')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.show()


Perceptron using Scikit-Learn

\begin{align*} x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ x_1^{(3)} & x_2^{(3)}\\ \vdots & \vdots \\ x_1^{(m)} & x_2^{(m)}\end{bmatrix} \qquad y = \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}

In [ ]:
X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])
X = np.vstack([X1, X0])

y = np.vstack([np.ones([C1.shape[0],1]), -np.ones([C0.shape[0],1])])

In [ ]:
from sklearn import linear_model

clf = linear_model.Perceptron(tol=1e-3)
clf.fit(X, np.ravel(y))

Out[ ]:
Perceptron()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
In [ ]:
clf.predict([[3, -2]])

Out[ ]:
array([-1.])
In [ ]:
clf.predict([[6, 2]])

Out[ ]:
array([1.])
In [ ]:
clf.coef_

Out[ ]:
array([[4.06397006, 5.50113727]])
In [ ]:
clf.intercept_

Out[ ]:
array([-10.])

\begin{align*} g(x) &= \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0 \\\\ \implies x_2 &= -\frac{\omega_1}{\omega_2} x_1 - \frac{\omega_0}{\omega_2} \end{align*}

In [ ]:
w0 = clf.intercept_[0]
w1 = clf.coef_[0,0]
w2 = clf.coef_[0,1]

In [ ]:
x1p = np.linspace(0,8,100).reshape(-1,1)
x2p = - w1/w2*x1p - w0/w2

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, c = 'k', linewidth = 4, label = 'perceptron')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.show()


## 2.3. The Best Hyperplane Separator?¶

• Perceptron finds one of the many possible hyperplanes separating the data if one exists

• Of the many possible choices, which one is the best?

• Utilize distance information

• Intuitively we want the hyperplane having the maximum margin

• Large margin leads to good generalization on the test data

• we will see this formally when we cover Support Vector Machine

# 3. SVM¶

## 3.1. Distance from a Line¶

$$\omega = \begin{bmatrix}\omega_1 \\ \omega_2\end{bmatrix}, \, x = \begin{bmatrix}x_1\\x_2\end{bmatrix} \; \implies g(x) = \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2$$

• If $\vec p$ and $\vec q$ are on the decision line

\begin{align*} g\left(\vec p\right) = g\left(\vec q\right) = 0 \quad & \Rightarrow \quad \omega_0 + \omega^T \vec p = \omega_0 + \omega^T \vec q = 0 \\ & \Rightarrow \quad \omega^T \left( \vec p- \vec q \right) = 0 \end{align*}

\begin{align*} & \therefore \, \omega : \text{normal to the line (orthogonal)} \\ & \implies \text{tells the direction of the line} \end{align*}

• If $x$ is on the line and $x = d\frac{\omega}{\lVert \omega \rVert}$ (where $d$ is a normal distance from the origin to the line)

\begin{align*} g(x)& = \omega_0 + \omega^Tx = 0 \; \\ & \Rightarrow \omega_0 + \omega^Td\frac{\omega}{\lVert \omega \rVert} = \omega_0 + d\frac{\omega^T\omega}{\lVert \omega \rVert} = \omega_0 + d\lVert \omega \rVert = 0 \\\\ & \therefore d \, = - \frac{\omega_0}{\lVert \omega \rVert} \end{align*}

• For any vector of $x$

$$x = x_{\perp} + r \frac{\omega}{\lVert \omega \rVert}$$

$$\omega^Tx = \omega^T \left( x_{\perp} + r \frac{\omega}{\lVert \omega \rVert}\right) = r \frac{\omega^T\omega}{\lVert \omega \rVert} = r \lVert \omega \rVert$$

\begin{align*} g(x) & = \omega_0 + \omega^Tx \\ & = \omega_0 + r \lVert \omega \rVert \qquad (r = d + h) \\ & = \omega_0 + (d +h) \lVert \omega \rVert\\ & = \omega_0 + \left(- \frac{\omega_0}{\lVert \omega \rVert} + h \right)\lVert \omega \rVert\\ & = h \lVert \omega \rVert \end{align*}

$$\therefore \; h = \frac{g(x)}{\lVert \omega \rVert} \implies\; \text{orthogonal signed distance from the line}$$

Another method to find a distance between $g(x) = -1$ and $g(x) = 1$

Suppose $g(x_1) = -1,\; g(x_2) = 1$

$$\begin{array}{I} \omega_0 + \omega^Tx_1 = -1\\ \omega_0 + \omega^Tx_2 = 1\\ \end{array} \; \implies \; \omega^T(x_2 - x_1) = 2$$

$$s = \langle\frac{\omega}{\lVert \omega \rVert}, x_2 - x_1 \rangle = \frac{1}{\lVert \omega \rVert}\omega^T(x_2 - x_1) = \frac {2}{\lVert \omega \rVert}$$

## 3.2. Illustrative Example¶

• Binary classification

• $C_1$ and $C_0$
• Features

• The coordinate of the unknown animal $i$ in the zoo

$$x = \begin{bmatrix}x_1 \\ x_2 \end{bmatrix}$$

• Is it possible to distinguish between $C_1$ and $C_0$ by its coordinates on a map of the zoo?

• We need to find a separating hyperplane (or a line in 2D)

\begin{align*} \omega_0 + \omega_1x_1 + \omega_2x_2 &= 0 \\\\ \omega_0 + \begin{bmatrix}\omega_1 & \omega_2 \end{bmatrix} \begin{bmatrix}x_1 \\ x_2 \end{bmatrix} &= 0\\\\ \omega_0 + \omega^Tx &= 0 \end{align*}

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#training data gerneration
x1 = 8*np.random.rand(100, 1)
x2 = 7*np.random.rand(100, 1) - 4

g = 0.8*x1 + x2 - 3
g1 = g - 1
g0 = g + 1

C1 = np.where(g1 >= 0)[0]
C0 = np.where(g0 < 0)[0]

In [ ]:
xp = np.linspace(-1,9,100).reshape(-1,1)
ypt = -0.8*xp + 3

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', linewidth = 3, label = 'True')
plt.title('Linearly and Strictly Separable Classes', fontsize = 12)
plt.xlabel(r'$x_1$', fontsize = 12)
plt.ylabel(r'$x_2$', fontsize = 12)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0, 8])
plt.ylim([-4, 3])
plt.show()