Classification
Table of Contents
%%html
<center><iframe src="https://www.youtube.com/embed/zQkJAjlXcAk?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/_i2ouy9w2AI?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/qFjWJHFgHYw?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
$$h(x) = \text{sign} \left(\left( \sum\limits_{i=1}^{d}\omega_ix_i \right)- \text{threshold} \right) = \text{sign}\left(\left( \sum\limits_{i=1}^{d}\omega_ix_i \right)+ \omega_0\right)$$
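As a quick illustration of this hypothesis, here is a minimal sketch (the weight vector and the input point below are made up for illustration) that evaluates $h(x)$ as the sign of $\omega^Tx$, with a leading $1$ in $x$ absorbing $\omega_0$:
import numpy as np
# minimal sketch of h(x) = sign(w^T x); w and x below are illustrative values only
w = np.array([-1.0, 2.0, 0.5])   # [w0, w1, w2]; w0 plays the role of -threshold
x = np.array([1.0, 0.8, 0.3])    # [1, x1, x2]; the leading 1 multiplies w0
print(np.sign(w @ x))            # +1 or -1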
Hyperplane
$$
\begin{align*}
g\left(\vec p\right) = g\left(\vec q\right) = 0 \quad
& \Rightarrow \quad \omega_0 + \omega^T \vec p = \omega_0 + \omega^T \vec q = 0 \\
& \Rightarrow \quad \omega^T \left( \vec p- \vec q \right) = 0
\end{align*}
$$
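A small numerical check of this orthogonality (the values of $\omega_0$ and $\omega$ here are arbitrary; $\vec p$ and $\vec q$ are constructed to satisfy $g(x) = 0$):
import numpy as np
# arbitrary line w0 + w^T x = 0, for illustration only
w0 = -3.0
w = np.array([0.8, 1.0])
# pick x1 for each point and solve for x2 so that both points lie on the line
p = np.array([1.0, -w0 - 0.8*1.0])
q = np.array([4.0, -w0 - 0.8*4.0])
print(w0 + w @ p, w0 + w @ q)   # both 0: p and q are on the hyperplane
print(w @ (p - q))              # 0: w is orthogonal to any vector lying in the hyperplane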
How to find $\omega$
The perceptron implements
$$h(x) = \text{sign}\left( \omega^Tx \right)$$
Given the training set
$$(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N) \quad \text{where } y_i \in \{-1,1\}$$
1) pick a misclassified point
$$\text{sign}\left(\omega^Tx_n \right) \neq y_n$$
2) and update the weight vector
$$\omega \leftarrow \omega + y_nx_n$$
Why do perceptron updates work?
Let's look at a misclassified positive example ($y_n = +1$): the update $\omega \leftarrow \omega + x_n$ rotates $\omega$ toward $x_n$, so $\omega^Tx_n$ increases by $\lVert x_n \rVert^2$ and the point moves toward the correct ($+1$) side of the boundary. The same argument, with the signs flipped, applies to a misclassified negative example ($y_n = -1$).
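A minimal from-scratch sketch of this procedure (not the scikit-learn Perceptron used below; it assumes each row of X starts with a 1 for the bias term and that the labels are in $\{-1,+1\}$):
import numpy as np
def perceptron_fit(X, y, n_iter=1000):
    # X: (m, d) array with a leading column of ones; y: (m,) labels in {-1, +1}
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        misclassified = np.where(np.sign(X @ w) != y)[0]   # 1) find misclassified points
        if len(misclassified) == 0:                        # every point is on the correct side
            break
        n = np.random.choice(misclassified)                # pick one of them
        w = w + y[n] * X[n]                                 # 2) update w <- w + y_n x_n
    return w
On linearly separable data this loop is guaranteed to stop; n_iter only caps the work when the classes overlap.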
Summary
$$g(x) = \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0$$
$$
\begin{align*}
\omega &= \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}\\ \\
x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)}\\\vdots & \vdots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)}\end{bmatrix}, \qquad
y = \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix}
\end{align*}$$
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# training data generation
m = 100
x1 = 8*np.random.rand(m, 1)
x2 = 7*np.random.rand(m, 1) - 4
g = 0.8*x1 + x2 - 3              # underlying linear boundary used to label the points
C1 = np.where(g >= 1)            # np.where on a 2-D array returns a tuple of (row, col) indices
C0 = np.where(g < -1)
print(C1)
C1 = np.where(g >= 1)[0]         # keep only the row indices; the +-1 margin leaves a gap between classes
C0 = np.where(g < -1)[0]
print(C1.shape)
print(C0.shape)
plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.title('Linearly Separable Classes', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.show()
$$
\begin{align*}
x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ x_1^{(3)} & x_2^{(3)}\\ \vdots & \vdots \\ x_1^{(m)} & x_2^{(m)}\end{bmatrix} \qquad
y = \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix}
\end{align*}$$
X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])
X = np.vstack([X1, X0])
y = np.vstack([np.ones([C1.shape[0],1]), -np.ones([C0.shape[0],1])])
from sklearn import linear_model
clf = linear_model.Perceptron(tol=1e-3)
clf.fit(X, np.ravel(y))
clf.predict([[3, -2]])
clf.predict([[6, 2]])
clf.coef_
clf.intercept_
w0 = clf.intercept_[0]
w1 = clf.coef_[0,0]
w2 = clf.coef_[0,1]
x1p = np.linspace(0,8,100).reshape(-1,1)
x2p = - w1/w2*x1p - w0/w2        # decision boundary: w0 + w1*x1 + w2*x2 = 0 solved for x2
plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, c = 'k', linewidth = 4, label = 'perceptron')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.show()
$$\omega = \begin{bmatrix}\omega_1 \\ \omega_2\end{bmatrix}, \, x = \begin{bmatrix}x_1\\x_2\end{bmatrix} \; \implies g(x) = \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 $$
$$ \; h = \frac{g(x)}{\lVert \omega \rVert} \implies\; \text{orthogonal signed distance from the line}$$
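A quick check of this formula using the perceptron weights w0, w1, w2 obtained above, on the same two test points that were passed to clf.predict:
# orthogonal signed distance h = g(x)/||w|| for the fitted line above
w_vec = np.array([w1, w2])
for pt in np.array([[3.0, -2.0], [6.0, 2.0]]):
    g = w0 + w_vec @ pt
    print(pt, g/np.linalg.norm(w_vec))   # positive on the C1 side, negative on the C0 side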
Perceptron: makes use only of the sign of the data (which side of the boundary each point falls on)
We also want to use the distance information of all data points $\rightarrow$ logistic regression
# plot a sigmoid function
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
z = np.linspace(-4,4,100)
s = 1/(1 + np.exp(-z))
plt.figure(figsize=(10,2))
plt.plot(z, s)
plt.xlim([-4, 4])
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
Benefit of mapping via the logistic function
$$P\left(y = +1 \mid x\,;\omega\right) = \frac{1}{1+e^{-\omega^T x}} \;\; \in \; [0,1]$$
Often we do not care about predicting the label $y$
Rather, we want to predict the label probabilities $P\left(y \mid x\,;\omega\right)$
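For example, with the weight vector $\omega = [-6, 2, 1]^T$ that will be used to generate the data below (and an arbitrary example point $x$, written with a leading 1 for the bias term):
# P(y = +1 | x ; w) is the sigmoid of w^T x
w = np.array([-6, 2, 1])
x = np.array([1, 2.5, 1.5])       # [1, x1, x2]; an arbitrary example point
print(1/(1 + np.exp(-w @ x)))     # a value in [0, 1] interpreted as a probability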
$$
\begin{align*}
\omega &= \begin{bmatrix} \omega_1 \\ \omega_2\end{bmatrix}, \qquad \omega_0, \qquad x = \begin{bmatrix} x_1 \\ x_2\end{bmatrix}\\ \\
X &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T \\ \vdots\end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ x_1^{(3)} & x_2^{(3)} \\ \vdots & \vdots \\\end{bmatrix}, \qquad y = \begin{bmatrix} y^{(1)}\\ y^{(2)} \\y^{(3)} \\ \vdots \end{bmatrix}
\end{align*}
$$
# data generation
m = 100
w = np.array([[-6], [2], [1]])   # true weight vector used to generate the labels
X = np.hstack([np.ones([m,1]), 4*np.random.rand(m,1), 4*np.random.rand(m,1)])
w = np.asmatrix(w)
X = np.asmatrix(X)
y = 1/(1 + np.exp(-X*w)) > 0.5   # label each point by thresholding the sigmoid at 0.5
C1 = np.where(y == True)[0]
C0 = np.where(y == False)[0]
y = np.empty([m,1])
y[C1] = 1
y[C0] = 0
plt.figure(figsize = (10,8))
plt.plot(X[C1,1], X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,1], X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
X = X[:,1:3]
X.shape
from sklearn import linear_model
clf = linear_model.LogisticRegression(solver='lbfgs')
clf.fit(X,np.ravel(y))
clf.coef_
clf.intercept_
w0 = clf.intercept_[0]
w1 = clf.coef_[0,0]
w2 = clf.coef_[0,1]
xp = np.linspace(0,4,100).reshape(-1,1)
yp = - w1/w2*xp - w0/w2
plt.figure(figsize = (10,8))
plt.plot(X[C1,0], X[C1,1], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,0], X[C0,1], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 4, label = 'Logistic Regression')
plt.title('Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
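Since the point of logistic regression is the label probability $P(y \mid x\,;\omega)$ rather than just the label, the fitted model can also report probabilities via predict_proba (the two query points here are arbitrary):
# each row is [P(y=0 | x), P(y=1 | x)] for one query point
print(clf.predict_proba([[1, 1], [3, 3]]))
print(clf.predict([[1, 1], [3, 3]]))   # hard labels, i.e. the probabilities thresholded at 0.5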
%%html
<center><iframe
width="420" height="315" src="https://www.youtube.com/embed/3liCbRZPrZA?rel=0" frameborder="0" allowfullscreen>
</iframe></center>
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')