Classification

Table of Contents

1. Classification¶

where $y$ is a discrete value
- develop the classification algorithm to determine which class a new input should fall into
start with binary class problems
- Later look at multiclass classification problem, although this is just an extension of binary classification
We could use linear regression
- Then, threshold the classifier output (i.e. anything over some value is yes, else no)
- linear regression with thresholding seems to work
We will learn
- perceptron
- logistic regression

2. Perceptron¶

For input $x = \begin{bmatrix}x_1\\ \vdots\\ x_d \end{bmatrix}\;$ 'attributes of a customer'

weights $\omega = \begin{bmatrix}\omega_1\\ \vdots\\ \omega_d \end{bmatrix}$

$$\begin{align*} \text{Approve credit if} \; & \sum\limits_{i=1}^{d}\omega_ix_i > \text{threshold}, \\ \text{Deny credit if} \; & \sum\limits_{i=1}^{d}\omega_ix_i < \text{threshold}. \end{align*}$$

$$h(x) = \text{sign} \left(\left( \sum\limits_{i=1}^{d}\omega_ix_i \right)- \text{threshold} \right) = \text{sign}\left(\left( \sum\limits_{i=1}^{d}\omega_ix_i \right)+ \omega_0\right)$$

Introduce an artificial coordinate $x_0 = 1$:

$$h(x) = \text{sign}\left( \sum\limits_{i=0}^{d}\omega_ix_i \right)$$

In a vector form, the perceptron implements

$$h(x) = \text{sign}\left( \omega^T x \right)$$

Sign function

$$ \text{sgn}(x) = \begin{cases} 1, &\text{if }\; x < 0\\ 0, &\text{if }\; x = 0\\ -1, &\text{if }\; x > 0 \end{cases} $$

Hyperplane
- Separates a D-dimensional space into two half-spaces
- Defined by an outward pointing normal vector $\omega$
- $\omega$ is orthogonal to any vector lying on the hyperplane
- Assume the hyperplane passes through origin, $\omega^T x = 0$ with $x_0 = 1$

Sign with respect to a line

$$ \begin{align*} \omega = \begin{bmatrix}\omega_1 \\ \omega_2 \end{bmatrix}, \quad x = \begin{bmatrix} x_1 \\ x_2\end{bmatrix} &\implies g(x) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 = \omega_0 + \omega^T x\\\\ \omega = \begin{bmatrix}\omega_0 \\ \omega_1 \\ \omega_2 \end{bmatrix}, \quad x = \begin{bmatrix} 1 \\ x_1 \\ x_2\end{bmatrix} &\implies g(x) = \omega_0 \cdot 1 + \omega_1 x_1 + \omega_2 x_2 = \omega^T x \end{align*} $$

Goal: to learn the hyperplane $g_{\omega}(x)=0$ using the training data
How to find $\omega$
- All data in class 1 ($y = +1$) $$g(x) > 0$$
- All data in class 0 ($y = -1$) $$g(x) < 0$$

2.1. Perceptron Algorithm¶

The perceptron implements

$$h(x) = \text{sign}\left( \omega^Tx \right)$$

Given the training set

$$(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N) \quad \text{where } y_i \in \{-1,1\}$$

pick a misclassified point

$$ \text{sign}\left(\omega^Tx_n \right) \neq y_n$$

and update the weight vector

$$\omega \leftarrow \omega + y_nx_n$$

Why perceptron updates work ?

Let's look at a misclassified positive example ($y_n = +1$)
- perceptron (wrongly) thinks $\omega_{old}^T x_n < 0$

updates would be

$$ \begin{align*}\omega_{new} &= \omega_{old} + y_n x_n = \omega_{old} + x_n \\ \\ \omega_{new}^T x_n &= (\omega_{old} + x_n)^T x_n = \omega_{old}^T x_n + x_n^T x_n \end{align*}$$

Thus $\omega_{new}^T x_n$ is less negative than $\omega_{old}^T x_n$

2.1.1. Iterations of Perceptron¶

Randomly assign $\omega$
One iteration of the PLA (perceptron learning algorithm) $$\omega \leftarrow \omega + yx$$ where $(x, y)$ is a misclassified training point
At iteration $i = 1, 2, 3, \cdots,$ pick a misclassified point from $$(x_1,y_1),(x_2,y_2),\cdots,(x_N, y_N)$$
and run a PLA iteration on it
That's it!

Summary

2.1.2. Perceptron loss function¶

$$ \mathscr{L}(\omega) = \sum_{n =1}^{m} \max \left\{ 0, -y_n \cdot \left(\omega^T x_n \right)\right\} $$

Loss $ = 0$ on examples where perceptron is correct, i.e., $y_n \cdot \left(\omega^T x_n \right) > 0$
Loss $ > 0$ on examples where perceptron misclassified, i.e., $y_n \cdot \left(\omega^T x_n \right) < 0$

Note: $\text{sign}\left(\omega^T x_n \right) \neq y_n$ is equivalent to $ y_n \cdot \left(\omega^T x_n \right) < 0$

2.2. Perceptron in Python¶

$$g(x) = \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0$$

$$ \begin{align*} \omega &= \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}\\ \\ x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)}\\\vdots & \vdots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)}\end{bmatrix}, \qquad y = \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}$$

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#training data gerneration
m = 100
x1 = 8*np.random.rand(m, 1)
x2 = 7*np.random.rand(m, 1) - 4

g = 0.8*x1 + x2 - 3

C1 = np.where(g >= 1)
C0 = np.where(g < -1)
print(C1)

(array([ 1,  2,  4, 10, 12, 13, 14, 15, 19, 23, 24, 32, 33, 34, 40, 43, 46,
       47, 51, 52, 53, 54, 57, 62, 64, 67, 68, 77, 82, 85, 88, 93, 94, 96,
       98]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))

C1 = np.where(g >= 1)[0]
C0 = np.where(g < -1)[0]
print(C1.shape)
print(C0.shape)

(35,)
(40,)

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.title('Linearly Separable Classes', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.show()

$$ \begin{align*} x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)}\\\vdots & \vdots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)}\end{bmatrix} \qquad y = \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}$$

X1 = np.hstack([np.ones([C1.shape[0],1]), x1[C1], x2[C1]])
X0 = np.hstack([np.ones([C0.shape[0],1]), x1[C0], x2[C0]])
X = np.vstack([X1, X0])

y = np.vstack([np.ones([C1.shape[0],1]), -np.ones([C0.shape[0],1])])

X = np.asmatrix(X)
y = np.asmatrix(y)

$$\omega = \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}$$

$$\omega \leftarrow \omega + yx$$ where $(x, y)$ is a misclassified training point

w = np.ones([3,1])
w = np.asmatrix(w)

n_iter = y.shape[0]
for k in range(n_iter):
    for i in range(n_iter):
        if y[i,0] != np.sign(X[i,:]*w)[0,0]:
            w += y[i,0]*X[i,:].T

print(w)

[[-9.        ]
 [ 2.61068705]
 [ 4.15661782]]

$$ \begin{align*} g(x) &= \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0 \\\\ \implies x_2 &= -\frac{\omega_1}{\omega_2} x_1 - \frac{\omega_0}{\omega_2} \end{align*} $$

x1p = np.linspace(0,8,100).reshape(-1,1)
x2p = - w[1,0]/w[2,0]*x1p - w[0,0]/w[2,0]

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, c = 'k', linewidth = 3, label = 'perceptron')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.show()

Perceptron using Scikit-Learn

$$ \begin{align*} x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ x_1^{(3)} & x_2^{(3)}\\ \vdots & \vdots \\ x_1^{(m)} & x_2^{(m)}\end{bmatrix} \qquad y = \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}$$

X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])
X = np.vstack([X1, X0])

y = np.vstack([np.ones([C1.shape[0],1]), -np.ones([C0.shape[0],1])])

from sklearn import linear_model

clf = linear_model.Perceptron(tol=1e-3)
clf.fit(X, np.ravel(y))

Perceptron()

Perceptron()

clf.predict([[3, -2]])

array([-1.])

clf.predict([[6, 2]])

array([1.])

clf.coef_

array([[4.06397006, 5.50113727]])

clf.intercept_

array([-10.])

$$ \begin{align*} g(x) &= \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0 \\\\ \implies x_2 &= -\frac{\omega_1}{\omega_2} x_1 - \frac{\omega_0}{\omega_2} \end{align*} $$

w0 = clf.intercept_[0]
w1 = clf.coef_[0,0]
w2 = clf.coef_[0,1]

x1p = np.linspace(0,8,100).reshape(-1,1)
x2p = - w1/w2*x1p - w0/w2

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, c = 'k', linewidth = 4, label = 'perceptron')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.show()

2.3. The Best Hyperplane Separator?¶

Perceptron finds one of the many possible hyperplanes separating the data if one exists
Of the many possible choices, which one is the best?
Utilize distance information
Intuitively we want the hyperplane having the maximum margin
Large margin leads to good generalization on the test data
- we will see this formally when we cover Support Vector Machine

3. SVM¶

3.1. Distance from a Line¶

$$\omega = \begin{bmatrix}\omega_1 \\ \omega_2\end{bmatrix}, \, x = \begin{bmatrix}x_1\\x_2\end{bmatrix} \; \implies g(x) = \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 $$

If $\vec p$ and $\vec q$ are on the decision line

$$ \begin{align*} g\left(\vec p\right) = g\left(\vec q\right) = 0 \quad & \Rightarrow \quad \omega_0 + \omega^T \vec p = \omega_0 + \omega^T \vec q = 0 \\ & \Rightarrow \quad \omega^T \left( \vec p- \vec q \right) = 0 \end{align*} $$

$$ \begin{align*} & \therefore \, \omega : \text{normal to the line (orthogonal)} \\ & \implies \text{tells the direction of the line} \end{align*}$$

If $x$ is on the line and $x = d\frac{\omega}{\lVert \omega \rVert}$ (where $d$ is a normal distance from the origin to the line)

$$ \begin{align*} g(x)& = \omega_0 + \omega^Tx = 0 \; \\ & \Rightarrow \omega_0 + \omega^Td\frac{\omega}{\lVert \omega \rVert} = \omega_0 + d\frac{\omega^T\omega}{\lVert \omega \rVert} = \omega_0 + d\lVert \omega \rVert = 0 \\\\ & \therefore d \, = - \frac{\omega_0}{\lVert \omega \rVert} \end{align*}$$

For any vector of $x$

$$ x = x_{\perp} + r \frac{\omega}{\lVert \omega \rVert}$$

$$ \omega^Tx = \omega^T \left( x_{\perp} + r \frac{\omega}{\lVert \omega \rVert}\right) = r \frac{\omega^T\omega}{\lVert \omega \rVert} = r \lVert \omega \rVert$$

$$ \begin{align*} g(x) & = \omega_0 + \omega^Tx \\ & = \omega_0 + r \lVert \omega \rVert \qquad (r = d + h) \\ & = \omega_0 + (d +h) \lVert \omega \rVert\\ & = \omega_0 + \left(- \frac{\omega_0}{\lVert \omega \rVert} + h \right)\lVert \omega \rVert\\ & = h \lVert \omega \rVert \end{align*}$$

$$\therefore \; h = \frac{g(x)}{\lVert \omega \rVert} \implies\; \text{orthogonal signed distance from the line}$$

Another method to find a distance between $g(x) = -1$ and $g(x) = 1$

Suppose $g(x_1) = -1,\; g(x_2) = 1$

$$ \begin{array}{I} \omega_0 + \omega^Tx_1 = -1\\ \omega_0 + \omega^Tx_2 = 1\\ \end{array} \; \implies \; \omega^T(x_2 - x_1) = 2$$

$$s = \langle\frac{\omega}{\lVert \omega \rVert}, x_2 - x_1 \rangle = \frac{1}{\lVert \omega \rVert}\omega^T(x_2 - x_1) = \frac {2}{\lVert \omega \rVert}$$

3.2. Illustrative Example¶

Binary classification
- $C_1$ and $C_0$
Features
- The coordinate of the unknown animal $i$ in the zoo

$$x = \begin{bmatrix}x_1 \\ x_2 \end{bmatrix}$$

Is it possible to distinguish between $C_1$ and $C_0$ by its coordinates on a map of the zoo?
We need to find a separating hyperplane (or a line in 2D)

$$ \begin{align*} \omega_0 + \omega_1x_1 + \omega_2x_2 &= 0 \\\\ \omega_0 + \begin{bmatrix}\omega_1 & \omega_2 \end{bmatrix} \begin{bmatrix}x_1 \\ x_2 \end{bmatrix} &= 0\\\\ \omega_0 + \omega^Tx &= 0 \end{align*} $$

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#training data gerneration
x1 = 8*np.random.rand(100, 1)
x2 = 7*np.random.rand(100, 1) - 4

g = 0.8*x1 + x2 - 3
g1 = g - 1
g0 = g + 1

C1 = np.where(g1 >= 0)[0]
C0 = np.where(g0 < 0)[0]

xp = np.linspace(-1,9,100).reshape(-1,1)
ypt = -0.8*xp + 3

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', linewidth = 3, label = 'True')
plt.title('Linearly and Strictly Separable Classes', fontsize = 12)
plt.xlabel(r'$x_1$', fontsize = 12)
plt.ylabel(r'$x_2$', fontsize = 12)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0, 8])
plt.ylim([-4, 3])
plt.show()

Given:
- Hyperplane defined by $\omega$ and $\omega_0$
- Animals coordinates (or features) $x$

Decision making:

$$ \omega_0 + \omega^Tx > 0 \implies x \; \text{belongs to} \; C_1$$

$$ \omega_0 + \omega^Tx < 0 \implies x \; \text{belongs to} \; C_0$$

Find $\omega$ and $\omega_0$ such that $x$ given $\omega_0 + \omega^Tx = 0$

or
Find $\omega$ and $\omega_0$ such that $x\in C_1$ given $\omega_0 + \omega^Tx > 1$ and $x\in C_0$ given $\omega_0 + \omega^Tx < -1 $

$$ \begin{align*} &\omega_0 + \omega^Tx > b, \quad (b>0)\\ \\ \Longleftrightarrow \;&\frac{\omega_0}{b} + \frac{\omega^T}{b}x > 1 \\\\ \Longleftrightarrow \;&\omega'_0 + \omega'^Tx > 1 \end{align*}$$

Same problem if strictly separable

# see how data are generated

xp = np.linspace(-1,9,100).reshape(-1,1)
ypt = -0.8*xp + 3

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', linewidth = 3, label = 'True')
plt.plot(xp, ypt-1, '--k')
plt.plot(xp, ypt+1, '--k')
plt.title('Linearly and Strictly Separable Classes', fontsize = 12)
plt.xlabel(r'$x_1$', fontsize = 12)
plt.ylabel(r'$x_2$', fontsize = 12)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0, 8])
plt.ylim([-4, 3])
plt.show()

3.2.1. The First Attempt¶

$n \;(=2)$ features
$m = N + M$ data points in training set

$$ x^{(i)} = \begin{bmatrix} x^{(i)}_1 \\ x^{(i)}_2 \end{bmatrix} \;\text{with}\; \omega = \begin{bmatrix} \omega_1 \\ \omega_2 \end{bmatrix}\qquad \text{or} \qquad x^{(i)} = \begin{bmatrix} 1 \\ x^{(i)}_1 \\ x^{(i)}_2 \end{bmatrix} \;\; \text{with}\; \omega = \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2 \end{bmatrix}$$

$N$ belongs to $C_1$ in training set
$M$ belongs to $C_0$ in training set
$\omega$ and $\omega_0$ are the unknown variables

$$\begin{align*} \text{minimize} \quad & \text{something} \\ \\ \text{subject to} \quad & \begin{cases} \omega_0 + \omega^Tx^{(1)} \geq1\\ \omega_0 + \omega^Tx^{(2)} \geq1\\ \quad \quad \vdots \\ \omega_0 + \omega^Tx^{(N)}\geq1\\ \end{cases} \\ & \begin{cases} \omega_0 + \omega^Tx^{(N+1)} \leq{-1}\\ \omega_0 + \omega^Tx^{(N+2)} \leq{-1}\\ \quad \quad \vdots \\ \omega_0 + \omega^Tx^{(N+M)} \leq{-1}\\ \end{cases} \end{align*}$$

$$\begin{align*} \text{minimize} \quad & \text{something} \\ \\ \text{subject to} \quad & \begin{cases} \omega^Tx^{(1)} \geq1\\ \omega^Tx^{(2)} \geq1\\ \quad \quad \vdots \\ \omega^Tx^{(N)} \geq1\\ \end{cases} \\ & \begin{cases} \omega^Tx^{(N+1)} \leq{-1}\\ \omega^Tx^{(N+2)} \leq{-1}\\ \quad \quad \vdots \\ \omega^Tx^{(N+M)} \leq{-1}\\ \end{cases} \end{align*}$$

Code (CVXPY)

$$ \begin{align*} X_1 &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \vdots \\ \left(x^{(N)}\right)^T\end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ \vdots & \vdots \\ x_1^{(N)} & x_2^{(N)} \\ \end{bmatrix}\\ X_0 &= \begin{bmatrix} \left(x^{(N+1)}\right)^T \\ \left(x^{(N+2)}\right)^T \\ \vdots \\ \left(x^{(N+M)}\right)^T\end{bmatrix} = \begin{bmatrix} x_1^{(N+1)} & x_2^{(N+1)} \\ x_1^{(N+2)} & x_2^{(N+2)} \\ \vdots & \vdots \\ x_1^{(N+M)} & x_2^{(N+M)} \\ \end{bmatrix}\end{align*}$$

$$ \begin{align*} X_1 &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \vdots \\ \left(x^{(N)}\right)^T\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ \vdots & \vdots & \vdots \\ 1 & x_1^{(N)} & x_2^{(N)} \\ \end{bmatrix}\\ X_0 &= \begin{bmatrix} \left(x^{(N+1)}\right)^T \\ \left(x^{(N+2)}\right)^T \\ \vdots \\ \left(x^{(N+M)}\right)^T\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(N+1)} & x_2^{(N+1)} \\ 1 & x_1^{(N+2)} & x_2^{(N+2)} \\ \vdots & \vdots & \vdots \\ 1 & x_1^{(N+M)} & x_2^{(N+M)} \\ \end{bmatrix}\end{align*}$$

$$\begin{align*} \text{minimize} \quad & \text{something} \\ \text{subject to} \quad & \omega_0 + X_1\omega \geq 1 \\ & \omega_0 + X_0\omega \leq -1 \end{align*}$$

$$\begin{align*} \text{minimize} \quad & \text{something} \\ \text{subject to} \quad & X_1\omega \geq 1 \\ & X_0\omega \leq -1 \end{align*}$$

Form 1

$$\begin{align*} \text{minimize} \quad & \text{something} \\ \text{subject to} \quad & \omega_0 + X_1\omega \geq 1 \\ & \omega_0 + X_0\omega \leq -1 \end{align*}$$

# CVXPY

import cvxpy as cvx

X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])

X1 = np.asmatrix(X1)
X0 = np.asmatrix(X0)

N = X1.shape[0]
M = X0.shape[0]

w0 = cvx.Variable([1,1])
w = cvx.Variable([2,1])

obj = cvx.Minimize(1)
const = [w0 + X1@w >= 1, w0 + X0@w <= -1]
prob = cvx.Problem(obj, const).solve()

w0 = w0.value
w = w.value

xp = np.linspace(-1,9,100).reshape(-1,1)
yp = - w[0,0]/w[1,0]*xp - w0/w[1,0]

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', alpha = 0.3, label = 'True')
plt.plot(xp, ypt-1, '--k', alpha = 0.3)
plt.plot(xp, ypt+1, '--k', alpha = 0.3)
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Attempt 1')
plt.title('Linearly and Strictly Separable Classes', fontsize = 12)
plt.xlabel(r'$x_1$', fontsize = 12)
plt.ylabel(r'$x_2$', fontsize = 12)
plt.legend(loc = 1)
plt.axis('equal')
plt.xlim([0, 8])
plt.ylim([-4, 3])
plt.show()

Form 2

$$\begin{align*} \text{minimize} \quad & \text{something} \\ \text{subject to} \quad & X_1\omega \geq 1 \\ & X_0\omega \leq -1 \end{align*}$$

import cvxpy as cvx

N = C1.shape[0]
M = C0.shape[0]

X1 = np.hstack([np.ones([N,1]), x1[C1], x2[C1]])
X0 = np.hstack([np.ones([M,1]), x1[C0], x2[C0]])

X1 = np.asmatrix(X1)
X0 = np.asmatrix(X0)

w = cvx.Variable([3,1])

obj = cvx.Minimize(1)
const = [X1@w >= 1, X0@w <= -1]
prob = cvx.Problem(obj, const).solve()

w = w.value

xp = np.linspace(-1,9,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', alpha = 0.3, label = 'True')
plt.plot(xp, ypt-1, '--k', alpha = 0.3)
plt.plot(xp, ypt+1, '--k', alpha = 0.3)
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Attempt 1')
plt.title('Linearly and Strictly Separable Classes', fontsize = 12)
plt.xlabel(r'$x_1$', fontsize = 12)
plt.ylabel(r'$x_2$', fontsize = 12)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0, 8])
plt.ylim([-4, 3])
plt.show()

3.2.2. Outlier¶

Note that in the real world, you may have noise, errors, or outliers that do not accurately represent the actual phenomena
Non-separable case
No solutions (hyperplane) exist
- We will allow some training examples to be misclassified !
- But we want their number to be minimized

X1 = np.hstack([np.ones([N,1]), x1[C1], x2[C1]])
X0 = np.hstack([np.ones([M,1]), x1[C0], x2[C0]])

outlier = np.array([1, 3, 2]).reshape(1,-1)
X0 = np.vstack([X0, outlier])

X1 = np.asmatrix(X1)
X0 = np.asmatrix(X0)

plt.figure(figsize = (6, 4))
plt.plot(X1[:,1], X1[:,2], 'ro', alpha = 0.4, label = 'C1')
plt.plot(X0[:,1], X0[:,2], 'bo', alpha = 0.4, label = 'C0')
plt.title('When Outliers Exist', fontsize = 12)
plt.xlabel(r'$x_1$', fontsize = 12)
plt.ylabel(r'$x_2$', fontsize = 12)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0, 8])
plt.ylim([-4, 3])
plt.show()

w = cvx.Variable([3,1])

obj = cvx.Minimize(1)
const = [X1@w >= 1, X0@w <= -1]
prob = cvx.Problem(obj, const).solve()

print(w.value)

None

3.2.3. The Second Attempt¶

$n \;(=2)$ features
$m = N+M$ data points in a training set

$$x^i = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \end{bmatrix}$$

$N$ belongs to $C_1$ in training set
$M$ belongs to $C_0$ in training set
$\omega$ and $\omega_0$ are the variables (unknown)
For the non-separable case, we relex the above constraints
Need slack variables $u$ and $\upsilon$ where all are positive

The optimization problem for the non-separable case

$$\begin{align*} \text{minimize} \quad & \sum\limits_{i=1}^{N}u_i + \sum\limits_{i=1}^{M}\upsilon_i \\ \\ \text{subject to} \quad & \begin{cases} \omega^Tx^{(1)} \geq 1-u_1\\ \omega^Tx^{(2)} \geq 1-u_2\\ \quad \quad \vdots \\ \omega^Tx^{(N)} \geq 1-u_N\\ \end{cases} \\\\ & \begin{cases} \omega^Tx^{(N+1)} \leq {-(1-\upsilon_1)}\\ \omega^Tx^{(N+2)} \leq {-(1-\upsilon_2)}\\ \quad \quad \vdots \\ \omega^Tx^{(N+M)} \leq {-(1-\upsilon_M)}\\ \end{cases} \\\\ & \begin{cases} u \geq 0\\ v \geq 0\\ \end{cases} \end{align*}$$

Expressed in a matrix form

$$ \begin{align*} X_1 &= \begin{bmatrix} x^{(1)^T} \\ x^{(2)^T} \\ \vdots \\ x^{(N)^T}\end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ \vdots & \vdots \\ x_1^{(N)} & x_2^{(N)} \\ \end{bmatrix}\\\\ X_0 &= \begin{bmatrix} x^{(N+1)^T} \\ x^{(N+2)^T} \\ \vdots \\ x^{(N+M)^T}\end{bmatrix} = \begin{bmatrix} x_1^{(N+1)} & x_2^{(N+1)} \\ x_1^{(N+2)} & x_2^{(N+2)} \\ \vdots & \vdots \\ x_1^{(N+M)} & x_2^{(N+M)} \\ \end{bmatrix} \\\\ u &= \begin{bmatrix} u_1\\ \vdots\\ u_N\\ \end{bmatrix}\\ \upsilon &= \begin{bmatrix} \upsilon_{1}\\ \vdots\\ \upsilon_{M}\\ \end{bmatrix} \end{align*}$$

$$ \begin{align*} X_1 &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \vdots \\ \left(x^{(N)}\right)^T\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ \vdots & \vdots & \vdots \\ 1 & x_1^{(N)} & x_2^{(N)} \\ \end{bmatrix}\\\\ X_0 &= \begin{bmatrix} \left(x^{(N+1)}\right)^T \\ \left(x^{(N+2)}\right)^T \\ \vdots \\ \left(x^{(N+M)}\right)^T\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(N+1)} & x_2^{(N+1)} \\ 1 & x_1^{(N+2)} & x_2^{(N+2)} \\ \vdots & \vdots & \vdots \\ 1 & x_1^{(N+M)} & x_2^{(N+M)} \\ \end{bmatrix} \\\\ u &= \begin{bmatrix} u_1\\ \vdots\\ u_N\\ \end{bmatrix}\\ \upsilon &= \begin{bmatrix} \upsilon_{1}\\ \vdots\\ \upsilon_{M}\\ \end{bmatrix} \end{align*}$$

$$\begin{align*} \text{minimize} \quad & 1^Tu + 1^T\upsilon \\ \text{subject to} \quad & \omega_0 + X_1\omega \geq 1-u \\ & \omega_0 + X_0\omega \leq -(1-\upsilon) \\ & u \geq 0 \\ & \upsilon \geq 0 \end{align*}$$

$$\begin{align*} \text{minimize} \quad & 1^Tu + 1^T\upsilon \\ \text{subject to} \quad & X_1\omega \geq 1-u \\ & X_0\omega \leq -(1-\upsilon) \\ & u \geq 0 \\ & \upsilon \geq 0 \end{align*}$$

X1 = np.hstack([np.ones([N,1]), x1[C1], x2[C1]])
X0 = np.hstack([np.ones([M,1]), x1[C0], x2[C0]])

outlier = np.array([1, 2, 2]).reshape(1,-1)
X0 = np.vstack([X0, outlier])

X1 = np.asmatrix(X1)
X0 = np.asmatrix(X0)

N = X1.shape[0]
M = X0.shape[0]

w = cvx.Variable([3,1])
u = cvx.Variable([N,1])
v = cvx.Variable([M,1])

obj = cvx.Minimize(np.ones((1,N))@u + np.ones((1,M))@v)
const = [X1@w >= 1-u, X0@w <= -(1-v), u >= 0, v >= 0 ]
prob = cvx.Problem(obj, const).solve()

w = w.value

xp = np.linspace(-1,9,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize = (6, 4))
plt.plot(X1[:,1], X1[:,2], 'ro', alpha = 0.4, label = 'C1')
plt.plot(X0[:,1], X0[:,2], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', alpha = 0.1, label = 'True')
plt.plot(xp, ypt-1, '--k', alpha = 0.1)
plt.plot(xp, ypt+1, '--k', alpha = 0.1)
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Attempt 2')
plt.plot(xp, yp-1/w[2,0], '--g')
plt.plot(xp, yp+1/w[2,0], '--g')
plt.title('When Outliers Exist', fontsize = 12)
plt.xlabel(r'$x_1$', fontsize = 12)
plt.ylabel(r'$x_2$', fontsize = 12)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0, 8])
plt.ylim([-4, 3])
plt.show()

Further improvement

Notice that hyperplane is not as accurately represent the division due to the outlier
Can we do better when there are noise data or outliers?
Yes, but we need to look beyond LP
Idea: large margin leads to good generalization on the test data

3.2.4. Maximize Margin (Finally, it is Support Vector Machine)¶

Distance (= margin)

$$\text{margin} = \frac{2}{\lVert \omega \rVert _2}$$

Minimize $\lVert \omega \rVert_2$ to maximize the margin
Multiple objectives
Use gamma ($\gamma$) as a weighting betwwen the followings:
- Bigger margin given robustness to outliers
- Hyperplane that has few (or no) errors

$$\begin{align*} \text{minimize} \quad & \lVert \omega \rVert_2 + \gamma(1^Tu + 1^T\upsilon) \\ \text{subject to} \quad & X_1\omega \geq 1-u \\ & X_0\omega \leq -(1-\upsilon) \\ & u \geq 0 \\ & \upsilon \geq 0 \end{align*}$$

g = 2

w = cvx.Variable([3,1])
u = cvx.Variable([N,1])
v = cvx.Variable([M,1])

obj = cvx.Minimize(cvx.norm(w,2) + g*(np.ones((1,N))@u + np.ones((1,M))@v))
const = [X1@w >= 1-u, X0@w <= -(1-v), u >= 0, v >= 0 ]
prob = cvx.Problem(obj, const).solve()

w = w.value

xp = np.linspace(-1,9,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize = (6, 4))
plt.plot(X1[:,1], X1[:,2], 'ro', alpha = 0.4, label = 'C1')
plt.plot(X0[:,1], X0[:,2], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', alpha = 0.1, label = 'True')
plt.plot(xp, ypt-1, '--k', alpha = 0.1)
plt.plot(xp, ypt+1, '--k', alpha = 0.1)
plt.plot(xp, yp, 'g', linewidth = 3, label = 'SVM')
plt.plot(xp, yp-1/w[2,0], '--g')
plt.plot(xp, yp+1/w[2,0], '--g')
plt.title('When Outliers Exist', fontsize = 12)
plt.xlabel(r'$x_1$', fontsize = 12)
plt.ylabel(r'$x_2$', fontsize = 12)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0, 8])
plt.ylim([-4, 3])
plt.show()

3.3. Support Vector Machine¶

Probably the most popular/influential classification algorithm
A hyperplane based classifier (like the Perceptron)
Additionally uses the maximum margin principle
- Maximize distance (margin) of closest samples from the decision line
  
  $$ \text{maximize {minimum distance}} $$
- Note: perceptron only utilizes a sign of distance
- Finds the hyperplane with maximum separation margin on the training data

$$\begin{align*} \text{minimize} \quad & \lVert \omega \rVert_2 + \gamma(1^Tu + 1^T\upsilon) \\ \text{subject to} \quad & X_1\omega \geq 1-u \\ & X_0\omega \leq -(1-\upsilon) \\ & u \geq 0 \\ & \nu \geq 0 \end{align*}$$

In a more compact form

$$ \begin{align*} \omega^T x_n &\geq 1 \;\text{for }\; y_n = +1 \\ \omega^T x_n &\leq -1 \;\text{for }\; y_n = -1 \end{align*} \Longleftrightarrow y_n \cdot \left( \omega^T x_n \right) \geq 1 $$

$$\begin{align*} \text{minimize} \quad & \lVert \omega \rVert_2 + \gamma(1^T \xi) \\ \text{subject to} \quad & y_n \cdot \left( \omega^T x_n \right) \geq 1 - \xi_n \\ & \xi \geq 0 \\ \end{align*}$$

X = np.vstack([X1, X0])
y = np.vstack([np.ones([N,1]), -np.ones([M,1])])

m = N + M

g = 2

w = cvx.Variable([3,1])
d = cvx.Variable([m,1])

obj = cvx.Minimize(cvx.norm(w,2) + g*(np.ones([1,m])@d))
const = [cvx.multiply(y, X@w) >= 1-d, d >= 0]
prob = cvx.Problem(obj, const).solve()

w = w.value

xp = np.linspace(-1,9,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize = (6, 4))
plt.plot(X1[:,1], X1[:,2], 'ro', alpha = 0.4, label = 'C1')
plt.plot(X0[:,1], X0[:,2], 'bo', alpha = 0.4, label = 'C0')
plt.plot(xp, ypt, 'k', alpha = 0.1, label = 'True')
plt.plot(xp, ypt-1, '--k', alpha = 0.1)
plt.plot(xp, ypt+1, '--k', alpha = 0.1)
plt.plot(xp, yp, 'g', linewidth = 3, label = 'SVM')
plt.plot(xp, yp-1/w[2,0], '--g')
plt.plot(xp, yp+1/w[2,0], '--g')
plt.title('Support Vector Machine (SVM)', fontsize = 12)
plt.xlabel(r'$x_1$', fontsize = 12)
plt.ylabel(r'$x_2$', fontsize = 12)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0, 8])
plt.ylim([-4, 3])
plt.show()

3.4. Nonlinear Support Vector Machine¶

3.4.1. Kernel¶

Often we want to capture nonlinear patterns in the data
- Nonlinear regression: input and output relationship may not be linear
- Nonlinear classification: classes may note be separable by a linear boundary
Linear models (e.g. linear regression, linear SVM) are note just rich enough
Kernels: make linear model work in nonlinear settings
- By mapping data to higher dimensions where it exhibits linear patterns
- Apply the linear model in the new input feature space
- Mapping $=$ changing the feature representation
Note: such mappings can be expensive to compute in general
- Kernels give such mappings for (almost) free
- In most cases, the mappings need not be even computed
- Using the Kernel trick !

3.4.2. Classifying non-linear separable data¶

Consider the binary classification problem
- Each example represented by a single feature $x$
- No linear separator exists for this data

Now map each example as $x \rightarrow \{x,x^2\}$
Data now becomes linearly separable in the new representation

Linear in the new representation $=$ nonlinear in the old representation
Let's look at another example
- Each example defined by a two features $x=\{x_1, x_2\}$
- No linear separator exists for this data

Now map each example as $x=\{x_1, x_2\} \rightarrow z=\{x_1^2,\sqrt{2}x_1x_2,x_2^2\}$
- Each example now has three features (derived from the old represenation)
Data now becomes linear separable in the new representation

%%html
<center><iframe
width="420" height="315" src="https://www.youtube.com/embed/3liCbRZPrZA?rel=0" frameborder="0" allowfullscreen>
</iframe></center>

X1 = np.array([[-1.1,0],[-0.3,0.1],[-0.9,1],[0.8,0.4],[0.4,0.9],[0.3,-0.6],
               [-0.5,0.3],[-0.8,0.6],[-0.5,-0.5]])

X0 = np.array([[-1,-1.3], [-1.6,2.2],[0.9,-0.7],[1.6,0.5],[1.8,-1.1],[1.6,1.6],
               [-1.6,-1.7],[-1.4,1.8],[1.6,-0.9],[0,-1.6],[0.3,1.7],[-1.6,0],[-2.1,0.2]])

X1 = np.asmatrix(X1)
X0 = np.asmatrix(X0)

plt.figure(figsize = (6, 4))
plt.plot(X1[:,0], X1[:,1], 'ro', label = 'C1')
plt.plot(X0[:,0], X0[:,1], 'bo', label = 'C0')
plt.title('SVM for Nonlinear Data', fontsize = 12)
plt.xlabel(r'$x_1$', fontsize = 12)
plt.ylabel(r'$x_2$', fontsize = 12)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.show()

$$ x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \quad \Longrightarrow \quad z = \phi(x) = \begin{bmatrix} 1\\ x_1^2 \\ \sqrt{2}x_1 x_2 \\x_2^2 \end{bmatrix}$$

N = X1.shape[0]
M = X0.shape[0]

X = np.vstack([X1, X0])
y = np.vstack([np.ones([N,1]), -np.ones([M,1])])

X = np.asmatrix(X)
y = np.asmatrix(y)

m = N + M
Z = np.hstack([np.ones([m,1]), np.square(X[:,0]), np.sqrt(2)*np.multiply(X[:,0],X[:,1]), np.square(X[:,1])])

g = 10

w = cvx.Variable([4, 1])
d = cvx.Variable([m, 1])

obj = cvx.Minimize(cvx.norm(w, 2) + g*np.ones([1,m])@d)
const = [cvx.multiply(y, Z@w) >= 1-d, d>=0]
prob = cvx.Problem(obj, const).solve()

w = w.value
print(w)

[[ 2.79259259]
 [-1.48148148]
 [-0.58925565]
 [-1.48148148]]

# to plot

[X1gr, X2gr] = np.meshgrid(np.arange(-3,3,0.1), np.arange(-3,3,0.1))

Xp = np.hstack([X1gr.reshape(-1,1), X2gr.reshape(-1,1)])
Xp = np.asmatrix(Xp)

m = Xp.shape[0]
Zp = np.hstack([np.ones([m,1]), np.square(Xp[:,0]), np.sqrt(2)*np.multiply(Xp[:,0], Xp[:,1]), np.square(Xp[:,1])])
q = Zp*w

B = []
for i in range(m):
    if q[i,0] > 0:
        B.append(Xp[i,:])

B = np.vstack(B)

plt.figure(figsize = (6, 4))
plt.plot(X1[:,0], X1[:,1], 'ro', label = 'C1')
plt.plot(X0[:,0], X0[:,1], 'bo', label = 'C0')
plt.plot(B[:,0], B[:,1], 'gs', markersize = 10, alpha = 0.1, label = 'SVM')
plt.title('SVM with Kernel', fontsize = 12)
plt.xlabel(r'$x_1$', fontsize = 12)
plt.ylabel(r'$x_2$', fontsize = 12)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.show()

4. Logistic Regression¶

Logistic regression is a classification algorithm - don't be confused

4.1. Using all Distances¶

Perceptron: make use of sign of data
We want to use distance information of all data points $\rightarrow$ logistic regression

Basic idea: to find the decision boundary (hyperplane) of $g(x)=\omega^T x =0$ such that maximizes $\prod_i \lvert h_i \rvert$
- Inequality of arithmetic and geometric means
  
  $$ \frac{h_1+h_2}{2} \geq \sqrt{h_1 h_2} $$
  and that equality holds if and only if $h_1 = h_2$

Roughly speaking, this optimization of $\max \prod_i \lvert h_i \rvert$ tends to position a hyperplane in the middle of two classes

$$h = \frac{g(x)}{\lVert \omega \rVert} = \frac{\omega^T x}{\lVert \omega \rVert} \sim \omega^T x$$

We link or squeeze $(-\infty, +\infty)$ to $(0,1)$ for several reasons:

If $\sigma(z)$ is the sigmoid function, or the logistic function

$$ \sigma(z) = \frac{1}{1+e^{-z}} \implies \sigma \left(\omega^T x \right) = \frac{1}{1+e^{-\omega^T x}}$$
Logistic function always generates a value between 0 and 1
Crosses 0.5 at the origin, then flattens out
The derivative of the sigmoid function satisfies

$$\sigma'(z) = \sigma(z)\left( 1 - \sigma(z)\right)$$

# plot a sigmoid function

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

z = np.linspace(-4,4,100)
s = 1/(1 + np.exp(-z))

plt.figure(figsize = (6,2))
plt.plot(z, s)
plt.xlim([-4, 4])
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()

Benefit of mapping via the logistic function
monotonic: same or similar optimziation solution
continuous and differentiable: good for gradient descent optimization
probability or confidence: can be considered as probability

$$P\left(y = +1 \mid x\,;\omega\right) = \frac{1}{1+e^{-\omega^T x}} \;\; \in \; [0,1]$$

Often we do not care about predicting the label $y$
Rather, we want to predict the label probabilities $P\left(y \mid x\,;\omega\right)$
the probability that the label is $+1$

$$P\left(y = +1 \mid x\,;\omega\right)$$
the probability that the label is $0$

$$P\left(y = 0 \mid x\,;\omega\right) = 1 - P\left(y = +1 \mid x\,;\omega\right)$$

Goal: we need to fit $\omega$ to our data

For a single data point $(x,y)$ with parameters $\omega$

$$ \begin{align*} P\left(y = +1 \mid x\,;\omega\right) &= h_{\omega}(x) = \sigma \left(\omega^T x \right)\\ P\left(y = 0 \mid x\,;\omega\right) &= 1 - h_{\omega}(x) = 1- \sigma \left(\omega^T x \right) \end{align*} $$

It can be written as

$$P\left(y \mid x\,;\omega\right) = \left(h_{\omega}(x) \right)^y \left(1 - h_{\omega}(x)\right)^{1-y}$$

For $m$ training data points, the likelihood function of the parameters:

$$ \begin{align*} \mathscr{L}(\omega) &= P\left(y^{(1)}, \cdots, y^{(m)} \mid x^{(1)}, \cdots, x^{(m)}\,;\omega\right)\\ &= \prod\limits_{i=1}^{m}P\left(y^{(i)} \mid x^{(i)}\,;\omega\right)\\ &= \prod\limits_{i=1}^{m}\left(h_{\omega}\left(x^{(i)}\right) \right)^{y^{(i)}} \left(1 - h_{\omega}\left(x^{(i)}\right)\right)^{1-y^{(i)}} \qquad \left(\sim \prod_i \lvert h_i \rvert \right) \end{align*} $$

It would be easier to work on the log likelihood.

$$\ell(\omega) = \log \mathscr{L}(\omega) = \sum_{i=1}^{m} y^{(i)} \log h_{\omega} \left(x^{(i)} \right) + \left(1-y^{(i)} \right) \log \left(1-h_{\omega} \left(x^{(i)} \right) \right)$$

The logistic regression problem can be solved as a (convex) optimization problem as

$$\hat{\omega} = \arg\max_{\omega} \ell(\omega)$$

To use the gradient descent method, we need to find the derivative of it

$$\nabla \ell(\omega) = \begin{bmatrix} \frac{\partial \ell(\omega)}{\partial \omega_1} \\ \vdots \\ \frac{\partial \ell(\omega)}{\partial \omega_n} \end{bmatrix}$$

Think about a single data point with a single parameter $\omega$ for the simplicity.

$$ \begin{align*} \frac{\partial}{\partial \omega} \left[ y \log (\sigma) + (1-y) \log (1-\sigma)\right] & = y\frac{\sigma'}{\sigma} + (1-y)\frac{-\sigma}{1-\sigma}\\ & = \left(\frac{y}{\sigma}-\frac{1-y}{1-\sigma} \right)\sigma'\\ & = \frac{y-\sigma}{\sigma (1-\sigma)}\sigma'\\ & = \frac{y-\sigma}{\sigma (1-\sigma)}\sigma (1-\sigma)x\\ & = (y-\sigma)x \end{align*} $$

For $m$ training data points with the parameters $\omega$

$$\frac{\partial \ell(\omega)}{\partial \omega_j} = \sum_{i=1}^{m} \left(y^{(i)}-h_{\omega} \left(x^{(i)} \right) \right) x_{j}^{(i)} \quad \stackrel{\text{vectorization}}{=} \quad \left(y-h_{\omega}(x)\right)^T x_{j} = x_{j}^T \left(y-h_{\omega}(x)\right) $$

4.1.1. Logistic Regression using Gradient Descent¶

$$ \begin{align*} \omega &= \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}, \qquad x = \begin{bmatrix} 1 \\ x_1 \\ x_2\end{bmatrix}\\ \\ X &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T \\ \vdots\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)} \\ \vdots & \vdots & \vdots \\\end{bmatrix}, \qquad y = \begin{bmatrix} y^{(1)}\\ y^{(2)} \\y^{(3)} \\ \vdots \end{bmatrix} \end{align*} $$

$$\nabla \ell(\omega) = \begin{bmatrix} \frac{\partial \ell(\omega)}{\partial \omega_0} \\ \frac{\partial \ell(\omega)}{\partial \omega_1} \\ \frac{\partial \ell(\omega)}{\partial \omega_2} \end{bmatrix} = X^T \left(y-h_{\omega}(x)\right) = X^T \left(y-\sigma(X \omega)\right) $$

# datat generation

m = 100
w = np.array([[-6], [2], [1]])
X = np.hstack([np.ones([m,1]), 4*np.random.rand(m,1), 4*np.random.rand(m,1)])

w = np.asmatrix(w)
X = np.asmatrix(X)

y = 1/(1 + np.exp(-X*w)) > 0.5

C1 = np.where(y == True)[0]
C0 = np.where(y == False)[0]

y = np.empty([m,1])
y[C1] = 1
y[C0] = 0

plt.figure(figsize = (6, 4))
plt.plot(X[C1,1], X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,1], X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.xlabel(r'$x_1$', fontsize = 12)
plt.ylabel(r'$x_2$', fontsize = 12)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()

$$h_{\omega}(x) = h(x\,;\omega) = \sigma \left(\omega^T x\right) = \frac{1}{1+e^{-\omega^T x}}$$

# be careful with matrix shape

def h(x,w):
    return 1/(1 + np.exp(-x*w))

Gradient descent

Maximization problem

$$\nabla \ell(\omega) = \begin{bmatrix} \frac{\partial \ell(\omega)}{\partial \omega_0} \\ \frac{\partial \ell(\omega)}{\partial \omega_1} \\ \frac{\partial \ell(\omega)}{\partial \omega_2} \end{bmatrix} = X^T \left(y-h_{\omega}(x)\right) = X^T \left(y-\sigma(X \omega)\right) $$

$$\omega \leftarrow \omega - \eta \left( - \nabla \ell(\omega)\right)$$

w = np.zeros([3,1])

alpha = 0.01

for i in range(10000):
    df = -X.T*(y - h(X,w))
    w = w - alpha*df

print(w)

[[-31.87603306]
 [ 10.66310721]
 [  5.18047559]]

xp = np.linspace(0,4,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize = (6, 4))
plt.plot(X[C1,1], X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,1], X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 4, label = 'Logistic Regression')
plt.title('Logistic Regression', fontsize = 12)
plt.xlabel(r'$x_1$', fontsize = 12)
plt.ylabel(r'$x_2$', fontsize = 12)
plt.legend(loc = 1)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()

4.1.2. Logistic Regression using CVXPY¶

$$ \begin{align*} p &= \frac{1}{1+e^{-\omega^T x}} = \frac{e^{\omega^T x}}{e^{\omega^T x} + 1}\\ 1-p &= \frac{1}{e^{\omega^T x} + 1} \end{align*} $$

We can re-order the training data so

for $x_1, \cdots, x_q$, the outcome is $y = +1$, and
for $x_{q+1}, \cdots, x_m$, the outcome is $y=0$

The likelihood function

$$\mathscr{L} = \prod\limits_{i=1}^{q}{p_i}\prod\limits_{i=q+1}^{m}{(1-p_i)} \qquad \left(\sim \prod_i \lvert h_i \rvert \right)$$

The log likelihood function

$$ \begin{align*} \ell(\omega) &= \log \mathscr{L} = \sum\limits_{i=1}^{q}{\log p_i} + \sum\limits_{i=q+1}^{m}{\log(1 - p_i)} \\ & = \sum\limits_{i=1}^{q}{\log \frac{\exp\left(\omega^T x_i\right)}{1 + \exp \left(\omega^T x_i \right)}} + \sum\limits_{i=q+1}^{m}{\log \frac{1}{1+\exp \left(\omega^T x_i \right)}} \\ &= \sum\limits_{i=1}^{q}{\left(\omega^T x_i\right)} - \sum\limits_{i=1}^{m}{\log \left(1+\exp \left(\omega^T x_i \right) \right)} \end{align*} $$

Since $\ell$ is a concave function of $\omega$, the logistic regression problem can be solved as a convex optimization problem

$$\hat{\omega} = \arg\max_{\omega} \ell(\omega)$$

$$ \begin{align*} \omega &= \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}, \qquad x = \begin{bmatrix} 1 \\ x_1 \\ x_2\end{bmatrix}\\ \\ X &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T \\ \vdots\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)} \\ \vdots & \vdots & \vdots \\\end{bmatrix}, \qquad y = \begin{bmatrix} y^{(1)}\\ y^{(2)} \\y^{(3)} \\ \vdots \end{bmatrix} \end{align*} $$

$$ \begin{align*} \ell(\omega) &= \log \mathscr{L} = \sum\limits_{i=1}^{q}{\log p_i} + \sum\limits_{i=q+1}^{m}{\log(1 - p_i)} \\ & = \sum\limits_{i=1}^{q}{\log \frac{\exp\left(\omega^T x_i\right)}{1 + \exp \left(\omega^T x_i \right)}} + \sum\limits_{i=q+1}^{m}{\log \frac{1}{1+\exp \left(\omega^T x_i \right)}} \\ &= \sum\limits_{i=1}^{q}{\left(\omega^T x_i\right)} - \sum\limits_{i=1}^{m}{\log \left(1+\exp \left(\omega^T x_i \right) \right)} \end{align*} $$

Refer to cvxpy functions

scalar function: cvx.sum(x) = $\sum_{ij} x_{ij}$
elementwise function: cvx.logistic(x) = $\log \left(1+e^{x} \right)$

import cvxpy as cvx

w = cvx.Variable([3, 1])

obj = cvx.Maximize(y.T@X@w - cvx.sum(cvx.logistic(X@w)))
prob = cvx.Problem(obj).solve()

w = w.value

xp = np.linspace(0,4,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize = (6, 4))
plt.plot(X[C1,1], X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,1], X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 4, label = 'Logistic Regression')
plt.title('Logistic Regression', fontsize = 12)
plt.xlabel(r'$x_1$', fontsize = 12)
plt.ylabel(r'$x_2$', fontsize = 12)
plt.legend(loc = 1)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()

/usr/local/lib/python3.10/dist-packages/cvxpy/problems/problem.py:1387: UserWarning: Solution may be inaccurate. Try another solver, adjusting the solver settings, or solve with verbose=True for more information.
  warnings.warn(

In a more compact form

Change $y \in \{0,+1\} \rightarrow y \in \{-1,+1\}$ for compuational convenience

Consider the following function

$$P(y=+1) = p = \sigma(\omega^T x), \quad P(y=-1) = 1-p = 1- \sigma(\omega^T x) = \sigma(-\omega^T x)$$

$$P\left(y \mid x\,;\omega\right) = \sigma\left(y\omega^Tx\right) = \frac{1}{1+\exp\left(-y\omega^T x\right)} \in [0,1]$$

Log-likelihood

$$ \begin{align*} \ell(\omega) = \log \mathscr{L} = \log P\left(y \mid x\,;\omega\right)& = \log \prod_{n=1}^{m} P\left(y_n \mid x_n\,;\omega\right)\\ &= \sum_{n=1}^{m} \log P\left(y_n \mid x_n\,;\omega\right)\\ &= \sum_{n=1}^{m} \log \frac{1}{1+\exp\left(-y_n\omega^T x_n\right)}\\ &= \sum\limits_{n=1}^{m}{-\log \left(1+\exp \left(-y_n\omega^T x_n \right) \right)} \end{align*} $$

MLE solution

$$ \begin{align*} \hat{\omega} &= \arg \max_{\omega} \sum\limits_{n=1}^{m}{-\log \left(1+\exp \left(-y_n\omega^T x_n \right) \right)}\\ &= \arg \min_{\omega} \sum\limits_{n=1}^{m}{\log \left(1+\exp \left(-y_n\omega^T x_n \right) \right)} \end{align*} $$

y = np.empty([m, 1])
y[C1] = 1
y[C0] = -1
y = np.asmatrix(y)

w = cvx.Variable([3, 1])

obj = cvx.Minimize(cvx.sum(cvx.logistic(-cvx.multiply(y, X@w))))
prob = cvx.Problem(obj).solve()

w = w.value

xp = np.linspace(0,4,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize = (6, 4))
plt.plot(X[C1,1], X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,1], X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 4, label = 'Logistic Regression')
plt.title('Logistic Regression', fontsize = 12)
plt.xlabel(r'$x_1$', fontsize = 12)
plt.ylabel(r'$x_2$', fontsize = 12)
plt.legend(loc = 1)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()

4.1.3. Logistic Regression using Scikit-Learn¶

$$ \begin{align*} \omega &= \begin{bmatrix} \omega_1 \\ \omega_2\end{bmatrix}, \qquad \omega_0, \qquad x = \begin{bmatrix} x_1 \\ x_2\end{bmatrix}\\ \\ X &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T \\ \vdots\end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ x_1^{(3)} & x_2^{(3)} \\ \vdots & \vdots \\\end{bmatrix}, \qquad y = \begin{bmatrix} y^{(1)}\\ y^{(2)} \\y^{(3)} \\ \vdots \end{bmatrix} \end{align*} $$

X = X[:, 1:3]

X.shape

(100, 2)

from sklearn import linear_model

clf = linear_model.LogisticRegression(solver = 'lbfgs')
clf.fit(np.asarray(X), np.ravel(y))

LogisticRegression()

LogisticRegression()

clf.coef_

array([[3.19125989, 1.35341969]])

clf.intercept_

array([-9.08772965])

w0 = clf.intercept_[0]
w1 = clf.coef_[0,0]
w2 = clf.coef_[0,1]

xp = np.linspace(0,4,100).reshape(-1,1)
yp = - w1/w2*xp - w0/w2

plt.figure(figsize = (6, 4))
plt.plot(X[C1,0], X[C1,1], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,0], X[C0,1], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 4, label = 'Logistic Regression')
plt.title('Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 12)
plt.ylabel(r'$x_2$', fontsize = 12)
plt.legend(loc = 1)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()

4.2. Multiclass Classification¶

Generalization to more than 2 classes is straightforward
- one vs. all (one vs. rest)
- one vs. one
Using the soft-max function instead of the logistic function (refer to UFLDL Tutorial)
- see them as probability

$$P\left(y = k \mid x\,;\omega\right) = \frac{\exp{\left( \omega_k^T x \right) }}{\sum_k \exp{\left(\omega_k^T x \right)}} \in [0,1]$$

We maintain a separator weight vector $\omega_k$ for each class $k$

Note that the softmax function is often used in deep learning

4.3. Non-linear Classification¶

Same idea as for linear regression: non-linear features, either explicit or implicit Kernels

X1 = np.array([[-1.1,0],[-0.3,0.1],[-0.9,1],[0.8,0.4],[0.4,0.9],[0.3,-0.6],
               [-0.5,0.3],[-0.8,0.6],[-0.5,-0.5]])

X0 = np.array([[-1,-1.3], [-1.6,2.2],[0.9,-0.7],[1.6,0.5],[1.8,-1.1],[1.6,1.6],
               [-1.6,-1.7],[-1.4,1.8],[1.6,-0.9],[0,-1.6],[0.3,1.7],[-1.6,0],[-2.1,0.2]])

X1 = np.asmatrix(X1)
X0 = np.asmatrix(X0)

plt.figure(figsize = (6, 4))
plt.plot(X1[:,0], X1[:,1], 'ro', label = 'C1')
plt.plot(X0[:,0], X0[:,1], 'bo', label = 'C0')
plt.title('Logistic Regression for Nonlinear Data', fontsize = 12)
plt.xlabel(r'$x_1$', fontsize = 12)
plt.ylabel(r'$x_2$', fontsize = 12)
plt.axis('equal')
plt.legend(loc = 4, fontsize = 12)
plt.show()

$$ x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \quad \Longrightarrow \quad z = \phi(x) = \begin{bmatrix} 1\\ \sqrt{2}x_1\\ \sqrt{2}x_2 \\x_1^2 \\ \sqrt{2}x_1 x_2 \\x_2^2 \end{bmatrix}$$

N = X1.shape[0]
M = X0.shape[0]

X = np.vstack([X1, X0])
y = np.vstack([np.ones([N,1]), -np.ones([M,1])])

X = np.asmatrix(X)
y = np.asmatrix(y)

m = N + M
Z = np.hstack([np.ones([m,1]), np.sqrt(2)*X[:,0], np.sqrt(2)*X[:,1], np.square(X[:,0]),
               np.sqrt(2)*np.multiply(X[:,0], X[:,1]), np.square(X[:,1])])

w = cvx.Variable([6, 1])
obj = cvx.Minimize(cvx.sum(cvx.logistic(-cvx.multiply(y,Z @ w))))
prob = cvx.Problem(obj).solve()

w = w.value

# to plot
[X1gr, X2gr] = np.meshgrid(np.arange(-3,3,0.1), np.arange(-3,3,0.1))

Xp = np.hstack([X1gr.reshape(-1,1), X2gr.reshape(-1,1)])
Xp = np.asmatrix(Xp)

m = Xp.shape[0]
Zp = np.hstack([np.ones([m,1]), np.sqrt(2)*Xp[:,0], np.sqrt(2)*Xp[:,1], np.square(Xp[:,0]),
                np.sqrt(2)*np.multiply(Xp[:,0], Xp[:,1]), np.square(Xp[:,1])])
q = Zp*w

B = []
for i in range(m):
    if q[i,0] > 0:
        B.append(Xp[i,:])

B = np.vstack(B)

plt.figure(figsize = (6, 4))
plt.plot(X1[:,0], X1[:,1], 'ro', label = 'C1')
plt.plot(X0[:,0], X0[:,1], 'bo', label = 'C0')
plt.plot(B[:,0], B[:,1], 'gs', markersize = 10, alpha = 0.1, label = 'Logistic Regression')
plt.title('Logistic Regression with Kernel', fontsize = 12)
plt.xlabel(r'$x_1$', fontsize = 12)
plt.ylabel(r'$x_2$', fontsize = 12)
plt.axis('equal')
plt.legend()
plt.show()

%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')