Classification


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Table of Contents

0. Video Lectures

In [1]:
%%html 
<center><iframe src="https://www.youtube.com/embed/zQkJAjlXcAk?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
In [2]:
%%html 
<center><iframe src="https://www.youtube.com/embed/_i2ouy9w2AI?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
In [3]:
%%html 
<center><iframe src="https://www.youtube.com/embed/qFjWJHFgHYw?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

1. Classification

  • where $y$ is a discrete value
    • develop a classification algorithm to determine which class a new input should fall into
  • start with binary class problems
    • later, look at the multiclass classification problem, although it is just an extension of binary classification
  • We could use linear regression
    • then, threshold the classifier output (i.e., anything over some value is "yes", otherwise "no")
    • linear regression with thresholding seems to work (a small sketch follows this list)
  • We will learn
    • perceptron
    • logistic regression
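A minimal sketch of the linear-regression-with-thresholding idea mentioned above (the 1-D data below is a hypothetical illustration, not the lecture's example):

In [ ]:
import numpy as np
from sklearn import linear_model

# hypothetical 1-D data with labels in {-1, +1}
x = np.array([[1.], [2.], [3.], [6.], [7.], [8.]])
y = np.array([-1, -1, -1, 1, 1, 1])

# fit ordinary linear regression to the labels, then threshold the output at 0
reg = linear_model.LinearRegression()
reg.fit(x, y)

y_hat = np.sign(reg.predict(x))
print(y_hat)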

2. Perceptron

  • For input $x = \begin{bmatrix}x_1\\ \vdots\\ x_d \end{bmatrix}\;$ 'attributes of a customer'
  • weights $\omega = \begin{bmatrix}\omega_1\\ \vdots\\ \omega_d \end{bmatrix}$
$$\begin{align*} \text{Approve credit if} \; & \sum\limits_{i=1}^{d}\omega_ix_i > \text{threshold}, \\ \text{Deny credit if} \; & \sum\limits_{i=1}^{d}\omega_ix_i < \text{threshold}. \end{align*}$$


$$h(x) = \text{sign} \left(\left( \sum\limits_{i=1}^{d}\omega_ix_i \right)- \text{threshold} \right) = \text{sign}\left(\left( \sum\limits_{i=1}^{d}\omega_ix_i \right)+ \omega_0\right)$$

  • Introduce an artificial coordinate $x_0 = 1$:
$$h(x) = \text{sign}\left( \sum\limits_{i=0}^{d}\omega_ix_i \right)$$
  • In a vector form, the perceptron implements
$$h(x) = \text{sign}\left( \omega^T x \right)$$
  • Sign function
$$ \text{sign}(x) = \begin{cases} 1, &\text{if }\; x > 0\\ 0, &\text{if }\; x = 0\\ -1, &\text{if }\; x < 0 \end{cases} $$




  • Hyperplane

    • Separates a $d$-dimensional space into two half-spaces
    • Defined by an outward pointing normal vector $\omega$
    • $\omega$ is orthogonal to any vector lying on the hyperplane
    • With the artificial coordinate $x_0 = 1$, the hyperplane can be written as $\omega^T x = 0$, i.e., it passes through the origin of the augmented space



  • Sign with respect to a line
$$ \begin{align*} \omega = \begin{bmatrix}\omega_1 \\ \omega_2 \end{bmatrix}, \quad x = \begin{bmatrix} x_1 \\ x_2\end{bmatrix} &\implies g(x) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 = \omega_0 + \omega^T x\\\\ \omega = \begin{bmatrix}\omega_0 \\ \omega_1 \\ \omega_2 \end{bmatrix}, \quad x = \begin{bmatrix} 1 \\ x_1 \\ x_2\end{bmatrix} &\implies g(x) = \omega_0 \cdot 1 + \omega_1 x_1 + \omega_2 x_2 = \omega^T x \end{align*} $$




  • If $\vec p$ and $\vec q$ are on the decision line


$$ \begin{align*} g\left(\vec p\right) = g\left(\vec q\right) = 0 \quad & \Rightarrow \quad \omega_0 + \omega^T \vec p = \omega_0 + \omega^T \vec q = 0 \\ & \Rightarrow \quad \omega^T \left( \vec p- \vec q \right) = 0 \end{align*} $$

  • Goal: to learn the hyperplane $g_{\omega}(x)=0$ using the training data
  • How to find $\omega$

    • All data in class 1 ($y = +1$) $$g(x) > 0$$
    • All data in class 0 ($y = -1$) $$g(x) < 0$$

2.1. Perceptron Algorithm

The perceptron implements

$$h(x) = \text{sign}\left( \omega^Tx \right)$$

Given the training set

$$(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N) \quad \text{where } y_i \in \{-1,1\}$$

1) pick a misclassified point

$$ \text{sign}\left(\omega^Tx_n \right) \neq y_n$$

2) and update the weight vector

$$\omega \leftarrow \omega + y_nx_n$$





Why do perceptron updates work?

  • Let's look at a misclassified positive example ($y_n = +1$)

    • perceptron (wrongly) thinks $\omega_{old}^T x_n < 0$
  • updates would be $$ \begin{align*}\omega_{new} &= \omega_{old} + y_n x_n = \omega_{old} + x_n \\ \\ \omega_{new}^T x_n &= (\omega_{old} + x_n)^T x_n = \omega_{old}^T x_n + x_n^T x_n \end{align*}$$
  • Thus $\omega_{new}^T x_n$ is less negative than $\omega_{old}^T x_n$ (since $x_n^T x_n > 0$), i.e., the update moves the prediction toward the correct sign

2.2. Iterations of Perceptron

  1. Randomly assign $\omega$

  2. One iteration of the PLA (perceptron learning algorithm) $$\omega \leftarrow \omega + yx$$ where $(x, y)$ is a misclassified training point

  3. At iteration $i = 1, 2, 3, \cdots,$ pick a misclassified point from $$(x_1,y_1),(x_2,y_2),\cdots,(x_N, y_N)$$

  4. and run a PLA iteration on it

  5. That's it!
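A minimal from-scratch sketch of the PLA iterations above (assuming the data matrix `X` already contains the artificial coordinate $x_0 = 1$ and the labels are in $\{-1, +1\}$; an illustration, not the lecture's reference implementation):

In [ ]:
import numpy as np

def perceptron_learning(X, y, max_iter=1000):
    # X: (N, d+1) with x0 = 1 in the first column, y: (N,) with labels in {-1, +1}
    w = np.random.randn(X.shape[1])             # 1. randomly assign w
    for _ in range(max_iter):
        misclassified = np.where(np.sign(X @ w) != y)[0]
        if misclassified.size == 0:             # no misclassified points left
            break
        n = np.random.choice(misclassified)     # 2. pick a misclassified point
        w = w + y[n]*X[n]                       # 3. PLA update
    return w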




Summary





2.3. Perceptron in Python


$$g(x) = \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0$$



$$ \begin{align*} \omega &= \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}\\ \\ x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)}\\\vdots & \vdots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)}\end{bmatrix}, \qquad y = \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}$$

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
# training data generation
m = 100
x1 = 8*np.random.rand(m, 1)
x2 = 7*np.random.rand(m, 1) - 4

g = 0.8*x1 + x2 - 3
In [3]:
C1 = np.where(g >= 1)
C0 = np.where(g < -1)
print(C1)
(array([ 1,  2,  3,  6, 10, 12, 16, 17, 19, 24, 26, 27, 29, 34, 36, 40, 43,
       44, 52, 56, 57, 59, 66, 67, 71, 72, 73, 74, 75, 81, 83, 87, 91, 92],
      dtype=int64), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64))
In [4]:
C1 = np.where(g >= 1)[0]
C0 = np.where(g < -1)[0]
print(C1.shape)
print(C0.shape)
(34,)
(39,)
In [5]:
plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.title('Linearly Separable Classes', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.show()



$$ \begin{align*} x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ x_1^{(3)} & x_2^{(3)}\\ \vdots & \vdots \\ x_1^{(m)} & x_2^{(m)}\end{bmatrix} \qquad y = \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}$$

In [6]:
X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])
X = np.vstack([X1, X0])

y = np.vstack([np.ones([C1.shape[0],1]), -np.ones([C0.shape[0],1])])
In [12]:
from sklearn import linear_model

clf = linear_model.Perceptron(tol=1e-3)
clf.fit(X, np.ravel(y))
Out[12]:
Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=1.0,
           fit_intercept=True, max_iter=1000, n_iter_no_change=5, n_jobs=None,
           penalty=None, random_state=0, shuffle=True, tol=0.001,
           validation_fraction=0.1, verbose=0, warm_start=False)
In [8]:
clf.predict([[3, -2]])
Out[8]:
array([-1.])
In [9]:
clf.predict([[6, 2]])
Out[9]:
array([1.])
In [10]:
clf.coef_
Out[10]:
array([[3.32453187, 7.36364365]])
In [11]:
clf.intercept_
Out[11]:
array([-14.])
$$ \begin{align*} g(x) &= \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0 \\\\ \implies x_2 &= -\frac{\omega_1}{\omega_2} x_1 - \frac{\omega_0}{\omega_2} \end{align*} $$
In [13]:
w0 = clf.intercept_[0]
w1 = clf.coef_[0,0]
w2 = clf.coef_[0,1]
In [14]:
x1p = np.linspace(0,8,100).reshape(-1,1)
x2p = - w1/w2*x1p - w0/w2

plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, c = 'k', linewidth = 4, label = 'perceptron')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.show()

2.4. The Best Hyperplane Separator?

  • Perceptron finds one of the many possible hyperplanes separating the data if one exists
  • Of the many possible choices, which one is the best?
  • Utilize distance information
  • Intuitively we want the hyperplane having the maximum margin
  • Large margin leads to good generalization on the test data
    • we will see this formally when we cover Support Vector Machine



3. Logistic Regression

  • Logistic regression is a classification algorithm (do not be confused by its name)

3.1. Distance from a Line



$$\omega = \begin{bmatrix}\omega_1 \\ \omega_2\end{bmatrix}, \, x = \begin{bmatrix}x_1\\x_2\end{bmatrix} \; \implies g(x) = \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 $$




  • For any point $x$


$$h = \frac{g(x)}{\lVert \omega \rVert}\; : \; \text{the signed orthogonal distance of } x \text{ from the line}$$
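A quick numerical check of this formula (the line coefficients and the test point below are arbitrary assumptions for illustration):

In [ ]:
import numpy as np

# arbitrary line g(x) = w0 + w1*x1 + w2*x2 and an arbitrary test point
w0 = -3
w = np.array([0.8, 1.])
x = np.array([4., 2.])

h = (w0 + w @ x)/np.linalg.norm(w)   # signed orthogonal distance from the line
print(h)                             # the sign tells which side of the line x lies on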

3.2. Using all Distances

  • Perceptron: make use of sign of data

  • We want to use distance information of all data points $\rightarrow$ logistic regression





  • Basic idea: find the decision boundary (hyperplane) $g(x)=\omega^T x =0$ that maximizes $\prod_i \lvert h_i \rvert$
    • Inequality of arithmetic and geometric means $$ \frac{h_1+h_2}{2} \geq \sqrt{h_1 h_2} $$ with equality if and only if $h_1 = h_2$; for a fixed sum of distances, the product is largest when the distances are equal
  • Roughly speaking, the optimization $\max \prod_i \lvert h_i \rvert$ tends to position the hyperplane in the middle of the two classes
$$h = \frac{g(x)}{\lVert \omega \rVert} = \frac{\omega^T x}{\lVert \omega \rVert} \sim \omega^T x$$
  • We squash (map) $(-\infty, +\infty)$ to $(0,1)$ for several reasons:





  • Let $\sigma(z)$ be the sigmoid function (also called the logistic function)

    $$ \sigma(z) = \frac{1}{1+e^{-z}} \implies \sigma \left(\omega^T x \right) = \frac{1}{1+e^{-\omega^T x}}$$

    • Logistic function always generates a value between 0 and 1
    • Crosses 0.5 at the origin, then flattens out
In [14]:
# plot a sigmoid function

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

z = np.linspace(-4,4,100)
s = 1/(1 + np.exp(-z))

plt.figure(figsize=(10,2))
plt.plot(z, s)
plt.xlim([-4, 4])
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
  • Benefits of mapping via the logistic function

    • monotonic: the ordering is preserved, so the optimization solution is the same or similar
    • continuous and differentiable: good for gradient descent optimization
    • probability or confidence: the output can be interpreted as a probability

    $$P\left(y = +1 \mid x\,;\omega\right) = \frac{1}{1+e^{-\omega^T x}} \;\; \in \; [0,1]$$

    • Often we do not care about predicting the label $y$ itself

    • Rather, we want to predict the label probabilities $P\left(y \mid x\,;\omega\right)$

      • the probability that the label is $+1$ $$P\left(y = +1 \mid x\,;\omega\right)$$
      • the probability that the label is $0$ $$P\left(y = 0 \mid x\,;\omega\right) = 1 - P\left(y = +1 \mid x\,;\omega\right)$$
  • Goal: we need to fit $\omega$ to our data
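A minimal sketch of one standard way to fit $\omega$: gradient ascent on the log-likelihood of the labels, assuming $y \in \{0,1\}$ and a data matrix with $x_0 = 1$ prepended (the lecture's own derivation may proceed differently):

In [ ]:
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    # X: (m, d+1) with x0 = 1 in the first column, y: (m,) with labels in {0, 1}
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1/(1 + np.exp(-X @ w))        # P(y = 1 | x; w) for every sample
        w += lr*(X.T @ (y - p))/len(y)    # gradient of the average log-likelihood
    return w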

3.3. Logistic Regression using Scikit-Learn



$$ \begin{align*} \omega &= \begin{bmatrix} \omega_1 \\ \omega_2\end{bmatrix}, \qquad \omega_0, \qquad x = \begin{bmatrix} x_1 \\ x_2\end{bmatrix}\\ \\ X &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T \\ \vdots\end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ x_1^{(3)} & x_2^{(3)} \\ \vdots & \vdots \\\end{bmatrix}, \qquad y = \begin{bmatrix} y^{(1)}\\ y^{(2)} \\y^{(3)} \\ \vdots \end{bmatrix} \end{align*} $$

In [15]:
# data generation

m = 100
w = np.array([[-6], [2], [1]])
X = np.hstack([np.ones([m,1]), 4*np.random.rand(m,1), 4*np.random.rand(m,1)])

w = np.asmatrix(w)
X = np.asmatrix(X)

y = 1/(1 + np.exp(-X*w)) > 0.5 

C1 = np.where(y == True)[0]
C0 = np.where(y == False)[0]

y = np.empty([m,1])
y[C1] = 1
y[C0] = 0

plt.figure(figsize = (10,8))
plt.plot(X[C1,1], X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,1], X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
In [16]:
X = X[:,1:3]

X.shape
Out[16]:
(100, 2)
In [17]:
from sklearn import linear_model

clf = linear_model.LogisticRegression(solver='lbfgs')
clf.fit(X,np.ravel(y))
Out[17]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
In [18]:
clf.coef_
Out[18]:
array([[3.31856236, 1.24254905]])
In [19]:
clf.intercept_
Out[19]:
array([-9.07267516])
In [20]:
w0 = clf.intercept_[0]
w1 = clf.coef_[0,0]
w2 = clf.coef_[0,1]

xp = np.linspace(0,4,100).reshape(-1,1)
yp = - w1/w2*xp - w0/w2

plt.figure(figsize = (10,8))
plt.plot(X[C1,0], X[C1,1], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,0], X[C0,1], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 4, label = 'Logistic Regression')
plt.title('Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
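Since logistic regression models the label probabilities, scikit-learn's predict_proba can be used to inspect $P(y \mid x\,;\omega)$ for a new point (the query point below is arbitrary):

In [ ]:
# columns: P(y = 0 | x) and P(y = 1 | x) for an arbitrary query point
clf.predict_proba([[2, 2]])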

4. Multiclass Classification

  • Generalization to more than 2 classes is straightforward
    • one vs. all (one vs. rest)
    • one vs. one
  • Using the soft-max function instead of the logistic function (refer to the UFLDL Tutorial)
    • interpret the outputs as probabilities
$$P\left(y = k \mid x\,;\omega\right) = \frac{\exp{\left( \omega_k^T x \right) }}{\sum_{j} \exp{\left(\omega_j^T x \right)}} \in [0,1]$$
  • We maintain a separator weight vector $\omega_k$ for each class $k$
  • Note that the softmax function is often used in deep learning
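A minimal numerical sketch of the softmax probabilities (the weight vectors and the input point below are arbitrary illustrations):

In [ ]:
import numpy as np

# one weight vector per class: rows are omega_1, omega_2, omega_3 (arbitrary values)
W = np.array([[1.0, -1.0],
              [0.5,  0.5],
              [-1.0, 1.0]])
x = np.array([2.0, 1.0])

z = W @ x                            # class scores omega_k^T x
p = np.exp(z)/np.sum(np.exp(z))      # softmax: nonnegative and sums to 1
print(p, p.sum())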

5. Non-linear Classification

5.1. Kernel

  • Often we want to capture nonlinear patterns in the data
    • Nonlinear regression: the input and output relationship may not be linear
    • Nonlinear classification: classes may not be separable by a linear boundary
  • Linear models (e.g., linear regression, linear SVM) are not rich enough by themselves
  • Kernels: make linear models work in nonlinear settings
    • By mapping data to a higher-dimensional space where it exhibits linear patterns
    • Apply the linear model in the new input feature space
    • Mapping $=$ changing the feature representation

5.2. Classifying non-linear separable data

  • Consider the binary classification problem
    • Each example represented by a single feature $x$
    • No linear separator exists for this data





  • Now map each example as $x \rightarrow \{x,x^2\}$
  • Data now becomes linearly separable in the new representation




  • Linear in the new representation $=$ nonlinear in the old representation
  • Let's look at another example
    • Each example is defined by two features $x=\{x_1, x_2\}$
    • No linear separator exists for this data





  • Now map each example as $x=\{x_1, x_2\} \rightarrow z=\{x_1^2,\sqrt{2}x_1x_2,x_2^2\}$
    • Each example now has three features (derived from the old representation)
  • Data now becomes linearly separable in the new representation
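A small sketch of this mapping on synthetic two-ring data (an illustrative assumption, not the lecture's dataset): the classes are not linearly separable in $(x_1, x_2)$, but a linear classifier works after mapping to $z$.

In [ ]:
import numpy as np
from sklearn import linear_model

# two concentric rings: not linearly separable in (x1, x2)
m = 200
r = np.hstack([0.5*np.random.rand(m//2), 1.5 + 0.5*np.random.rand(m//2)])
theta = 2*np.pi*np.random.rand(m)
X = np.c_[r*np.cos(theta), r*np.sin(theta)]
y = np.hstack([-np.ones(m//2), np.ones(m//2)])

# explicit feature map: x = (x1, x2) -> z = (x1^2, sqrt(2)*x1*x2, x2^2)
Z = np.c_[X[:,0]**2, np.sqrt(2)*X[:,0]*X[:,1], X[:,1]**2]

clf = linear_model.Perceptron(tol = 1e-3)
clf.fit(Z, y)
print(clf.score(Z, y))   # should be (near) 1.0: linearly separable in z-space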



In [21]:
%%html
<center><iframe 
width="420" height="315" src="https://www.youtube.com/embed/3liCbRZPrZA?rel=0" frameborder="0" allowfullscreen>
</iframe></center>
In [22]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')