Classification: Logistic Regression


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Table of Contents

1. Logistic Regression
  1.1. Logistic Regression using Gradient Descent
  1.2. Logistic Regression using CVXPY
  1.3. Logistic Regression using Scikit-Learn
2. Multiclass Classification
3. Non-linear Classification

1. Logistic Regression

  • Logistic regression is a classification algorithm - don't be confused by its name

Using all Distances

  • Perceptron: uses only the sign of each data point (which side of the boundary it lies on)

  • We want to use the distance information of all data points $\rightarrow$ logistic regression





  • Basic idea: find the decision boundary (hyperplane) $g(x)=\omega^T x = 0$ that maximizes $\prod_i \lvert h_i \rvert$, where $h_i$ is the signed distance of data point $i$ from the boundary
    • Inequality of arithmetic and geometric means $$ \frac{h_1+h_2}{2} \geq \sqrt{h_1 h_2} $$ with equality if and only if $h_1 = h_2$; so, for a fixed total distance, the product is largest when the distances are balanced
  • Roughly speaking, maximizing $\prod_i \lvert h_i \rvert$ therefore tends to position the hyperplane in the middle of the two classes
$$h = \frac{g(x)}{\lVert \omega \rVert} = \frac{\omega^T x}{\lVert \omega \rVert} \sim \omega^T x$$
  • We squeeze the value $\omega^T x \in (-\infty, +\infty)$ into $(0,1)$ for several reasons:





  • If $\sigma(z)$ is the sigmoid function, or the logistic function

    $$ \sigma(z) = \frac{1}{1+e^{-z}} \implies \sigma \left(\omega^T x \right) = \frac{1}{1+e^{-\omega^T x}}$$

    • Logistic function always generates a value between 0 and 1
    • Crosses 0.5 at the origin, then flattens out
    • The derivative of the sigmoid function satisfies

      $$\sigma'(z) = \sigma(z)\left( 1 - \sigma(z)\right)$$

In [1]:
# plot a sigmoid function

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

z = np.linspace(-4,4,100)
s = 1/(1 + np.exp(-z))

plt.figure(figsize=(10,2))
plt.plot(z, s)
plt.xlim([-4, 4])
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
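As a quick numerical check of the derivative identity above, the following added sketch (reusing z and s from the cell above) compares a finite-difference estimate of $\sigma'(z)$ with $\sigma(z)\left(1-\sigma(z)\right)$:

In [ ]:
# sanity check (added sketch): sigma'(z) = sigma(z)*(1 - sigma(z))
dz = 1e-6
ds_numeric = (1/(1 + np.exp(-(z + dz))) - s)/dz   # forward finite difference
ds_identity = s*(1 - s)                           # closed-form derivative
print(np.max(np.abs(ds_numeric - ds_identity)))   # expected to be very small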
  • Benefits of mapping via the logistic function

    • monotonic: preserves ordering, so it gives the same or a similar optimization solution
    • continuous and differentiable: well suited to gradient descent optimization
    • probability or confidence: the output can be interpreted as a probability

    $$P\left(y = +1 \mid x\,;\omega\right) = \frac{1}{1+e^{-\omega^T x}} \;\; \in \; [0,1]$$

    • Often we do not care about predicting the label $y$ itself

    • Rather, we want to predict the label probabilities $P\left(y \mid x\,;\omega\right)$

      • the probability that the label is $+1$ $$P\left(y = +1 \mid x\,;\omega\right)$$
      • the probability that the label is $0$ $$P\left(y = 0 \mid x\,;\omega\right) = 1 - P\left(y = +1 \mid x\,;\omega\right)$$
  • Goal: we need to fit $\omega$ to our data

For a single data point $(x,y)$ with parameters $\omega$

$$ \begin{align*} P\left(y = +1 \mid x\,;\omega\right) &= h_{\omega}(x) = \sigma \left(\omega^T x \right)\\ P\left(y = 0 \mid x\,;\omega\right) &= 1 - h_{\omega}(x) = 1- \sigma \left(\omega^T x \right) \end{align*} $$

It can be written as

$$P\left(y \mid x\,;\omega\right) = \left(h_{\omega}(x) \right)^y \left(1 - h_{\omega}(x)\right)^{1-y}$$


For $m$ training data points, the likelihood function of the parameters:

$$ \begin{align*} \mathscr{L}(\omega) &= P\left(y^{(1)}, \cdots, y^{(m)} \mid x^{(1)}, \cdots, x^{(m)}\,;\omega\right)\\ &= \prod\limits_{i=1}^{m}P\left(y^{(i)} \mid x^{(i)}\,;\omega\right)\\ &= \prod\limits_{i=1}^{m}\left(h_{\omega}\left(x^{(i)}\right) \right)^{y^{(i)}} \left(1 - h_{\omega}\left(x^{(i)}\right)\right)^{1-y^{(i)}} \qquad \left(\sim \prod_i \lvert h_i \rvert \right) \end{align*} $$

It is easier to work with the log-likelihood.


$$\ell(\omega) = \log \mathscr{L}(\omega) = \sum_{i=1}^{m} y^{(i)} \log h_{\omega} \left(x^{(i)} \right) + \left(1-y^{(i)} \right) \log \left(1-h_{\omega} \left(x^{(i)} \right) \right)$$

The logistic regression problem can then be solved as a (convex) optimization problem:


$$\hat{\omega} = \arg\max_{\omega} \ell(\omega)$$

To use the gradient descent method, we need the gradient of $\ell(\omega)$:


$$\nabla \ell(\omega) = \begin{bmatrix} \frac{\partial \ell(\omega)}{\partial \omega_1} \\ \vdots \\ \frac{\partial \ell(\omega)}{\partial \omega_n} \end{bmatrix}$$

For simplicity, first consider a single data point $(x, y)$ with a scalar parameter $\omega$.


$$ \begin{align*} \frac{\partial}{\partial \omega} \left[ y \log (\sigma) + (1-y) \log (1-\sigma)\right] & = y\frac{\sigma'}{\sigma} + (1-y)\frac{-\sigma'}{1-\sigma}\\ & = \left(\frac{y}{\sigma}-\frac{1-y}{1-\sigma} \right)\sigma'\\ & = \frac{y-\sigma}{\sigma (1-\sigma)}\sigma'\\ & = \frac{y-\sigma}{\sigma (1-\sigma)}\sigma (1-\sigma)x\\ & = (y-\sigma)x \end{align*} $$

where $\sigma = \sigma(\omega x)$ and $\sigma' = \dfrac{\partial \sigma(\omega x)}{\partial \omega} = \sigma(1-\sigma)\,x$ by the chain rule.

For $m$ training data points with the parameters $\omega$


$$\frac{\partial \ell(\omega)}{\partial \omega_j} = \sum_{i=1}^{m} \left(y^{(i)}-h_{\omega} \left(x^{(i)} \right) \right) x_{j}^{(i)} \quad \stackrel{\text{vectorization}}{=} \quad \left(y-h_{\omega}(x)\right)^T x_{j} = x_{j}^T \left(y-h_{\omega}(x)\right) $$
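As a sanity check of this expression (an added sketch, not part of the original derivation), the analytic gradient $X^T\left(y-\sigma(X\omega)\right)$ can be compared with a central finite-difference approximation of $\ell(\omega)$ on a tiny made-up data set; all numbers below are arbitrary.

In [ ]:
# gradient check (added sketch) on a tiny made-up data set
import numpy as np

Xc = np.array([[1., 0.5, 1.2],
               [1., 2.0, 0.3],
               [1., 1.1, 2.2],
               [1., 0.2, 0.1]])          # rows are (augmented) data points
yc = np.array([[1.], [0.], [1.], [0.]])  # labels in {0, 1}
wc = np.array([[0.1], [-0.2], [0.3]])    # arbitrary parameter vector

def sigmoid(z):
    return 1/(1 + np.exp(-z))

def loglik(w):
    p = sigmoid(Xc @ w)
    return float(np.sum(yc*np.log(p) + (1 - yc)*np.log(1 - p)))

grad_analytic = Xc.T @ (yc - sigmoid(Xc @ wc))

eps = 1e-6
grad_numeric = np.zeros_like(wc)
for j in range(wc.shape[0]):
    dw = np.zeros_like(wc)
    dw[j] = eps
    grad_numeric[j] = (loglik(wc + dw) - loglik(wc - dw))/(2*eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))   # expected to be very small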

1.1. Logistic Regression using Gradient Descent


$$ \begin{align*} \omega &= \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}, \qquad x = \begin{bmatrix} 1 \\ x_1 \\ x_2\end{bmatrix}\\ \\ X &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T \\ \vdots\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)} \\ \vdots & \vdots & \vdots \\\end{bmatrix}, \qquad y = \begin{bmatrix} y^{(1)}\\ y^{(2)} \\y^{(3)} \\ \vdots \end{bmatrix} \end{align*} $$

$$\nabla \ell(\omega) = \begin{bmatrix} \frac{\partial \ell(\omega)}{\partial \omega_0} \\ \frac{\partial \ell(\omega)}{\partial \omega_1} \\ \frac{\partial \ell(\omega)}{\partial \omega_2} \end{bmatrix} = X^T \left(y-h_{\omega}(x)\right) = X^T \left(y-\sigma(X \omega)\right) $$
In [2]:
# data generation: labels are assigned by thresholding a known linear model

m = 100
w = np.array([[-6], [2], [1]])
X = np.hstack([np.ones([m,1]), 4*np.random.rand(m,1), 4*np.random.rand(m,1)])

w = np.asmatrix(w)
X = np.asmatrix(X)

y = 1/(1 + np.exp(-X*w)) > 0.5 

C1 = np.where(y == True)[0]
C0 = np.where(y == False)[0]

y = np.empty([m,1])
y[C1] = 1
y[C0] = 0

plt.figure(figsize = (10,8))
plt.plot(X[C1,1], X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,1], X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
$$h_{\omega}(x) = h(x\,;\omega) = \sigma \left(\omega^T x\right) = \frac{1}{1+e^{-\omega^T x}}$$
In [3]:
# be careful with matrix shape

def h(x,w):
    return 1/(1 + np.exp(-x*w))

Gradient descent

  • This is a maximization problem, so we apply gradient descent to $-\ell(\omega)$ (equivalently, gradient ascent on $\ell(\omega)$)
$$\nabla \ell(\omega) = \begin{bmatrix} \frac{\partial \ell(\omega)}{\partial \omega_0} \\ \frac{\partial \ell(\omega)}{\partial \omega_1} \\ \frac{\partial \ell(\omega)}{\partial \omega_2} \end{bmatrix} = X^T \left(y-h_{\omega}(x)\right) = X^T \left(y-\sigma(X \omega)\right) $$


$$\omega \leftarrow \omega - \eta \left( - \nabla \ell(\omega)\right)$$

In [4]:
w = np.zeros([3,1])

alpha = 0.01

for i in range(10000):   
    df = -X.T*(y - h(X,w))    # gradient of -l(w): -X^T (y - sigma(Xw))
    w = w - alpha*df          # gradient descent step on -l(w)

print(w)
[[-33.17078193]
 [ 11.30683027]
 [  5.57192522]]
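Since the labels of this toy data set were generated by thresholding a known linear model, the learned boundary should classify most (likely all) of the training points correctly. A quick added check:

In [ ]:
# training accuracy of the gradient-descent solution (added sketch)
predicted = np.asarray(h(X, w) > 0.5)     # predicted class indicator
print(np.mean(predicted == (y == 1)))     # fraction of correctly classified points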
In [5]:
xp = np.linspace(0,4,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize = (10,8))
plt.plot(X[C1,1], X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,1], X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 4, label = 'Logistic Regression')
plt.title('Logistic Regression', fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()

1.2. Logistic Regression using CVXPY


$$ \begin{align*} p &= \frac{1}{1+e^{-\omega^T x}} = \frac{e^{\omega^T x}}{e^{\omega^T x} + 1}\\ 1-p &= \frac{1}{e^{\omega^T x} + 1} \end{align*} $$

We can re-order the training data so that

  • for $x_1, \cdots, x_q$, the outcome is $y = +1$, and
  • for $x_{q+1}, \cdots, x_m$, the outcome is $y=0$

The likelihood function

$$\mathscr{L} = \prod\limits_{i=1}^{q}{p_i}\prod\limits_{i=q+1}^{m}{(1-p_i)} \qquad \left(\sim \prod_i \lvert h_i \rvert \right)$$

the log likelihood function

$$ \begin{align*} \ell(\omega) &= \log \mathscr{L} = \sum\limits_{i=1}^{q}{\log p_i} + \sum\limits_{i=q+1}^{m}{\log(1 - p_i)} \\ & = \sum\limits_{i=1}^{q}{\log \frac{\exp\left(\omega^T x_i\right)}{1 + \exp \left(\omega^T x_i \right)}} + \sum\limits_{i=q+1}^{m}{\log \frac{1}{1+\exp \left(\omega^T x_i \right)}} \\ &= \sum\limits_{i=1}^{q}{\left(\omega^T x_i\right)} - \sum\limits_{i=1}^{m}{\log \left(1+\exp \left(\omega^T x_i \right) \right)} \end{align*} $$

Since $\ell$ is a concave function of $\omega$, the logistic regression problem can be solved as a convex optimization problem


$$\hat{\omega} = \arg\max_{\omega} \ell(\omega)$$

$$ \begin{align*} \omega &= \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}, \qquad x = \begin{bmatrix} 1 \\ x_1 \\ x_2\end{bmatrix}\\ \\ X &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T \\ \vdots\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)} \\ \vdots & \vdots & \vdots \\\end{bmatrix}, \qquad y = \begin{bmatrix} y^{(1)}\\ y^{(2)} \\y^{(3)} \\ \vdots \end{bmatrix} \end{align*} $$
In matrix form (with the labels $y^{(i)} \in \{0,1\}$ stacked into $y$), the log-likelihood is $$\ell(\omega) = y^T X\omega - \sum\limits_{i=1}^{m}{\log \left(1+\exp \left(\omega^T x^{(i)} \right) \right)},$$ which maps directly onto the CVXPY code below.

Refer to cvxpy functions

  • scalar function: cvx.sum(x) = $\sum_{ij} x_{ij}$

  • elementwise function: cvx.logistic(x) = $\log \left(1+e^{x} \right)$

In [6]:
import cvxpy as cvx

w = cvx.Variable([3, 1])

obj = cvx.Maximize(y.T*X*w - cvx.sum(cvx.logistic(X*w)))
prob = cvx.Problem(obj).solve()

w = w.value

xp = np.linspace(0,4,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize = (10,8))
plt.plot(X[C1,1], X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,1], X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 4, label = 'Logistic Regression')
plt.title('Logistic Regression', fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()

In a more compact form

Change $y \in \{0,+1\} \rightarrow y \in \{-1,+1\}$ for computational convenience

  • With labels in $\{-1,+1\}$, both class probabilities can be written with a single expression


$$P(y=+1) = p = \sigma(\omega^T x), \quad P(y=-1) = 1-p = 1- \sigma(\omega^T x) = \sigma(-\omega^T x)$$

$$P\left(y \mid x\,;\omega\right) = \sigma\left(y\omega^Tx\right) = \frac{1}{1+\exp\left(-y\omega^T x\right)} \in [0,1]$$
  • Log-likelihood
$$ \begin{align*} \ell(\omega) = \log \mathscr{L} = \log P\left(y \mid x\,;\omega\right)& = \log \prod_{n=1}^{m} P\left(y_n \mid x_n\,;\omega\right)\\ &= \sum_{n=1}^{m} \log P\left(y_n \mid x_n\,;\omega\right)\\ &= \sum_{n=1}^{m} \log \frac{1}{1+\exp\left(-y_n\omega^T x_n\right)}\\ &= \sum\limits_{n=1}^{m}{-\log \left(1+\exp \left(-y_n\omega^T x_n \right) \right)} \end{align*} $$
  • MLE solution
$$ \begin{align*} \hat{\omega} &= \arg \max_{\omega} \sum\limits_{n=1}^{m}{-\log \left(1+\exp \left(-y_n\omega^T x_n \right) \right)}\\ &= \arg \min_{\omega} \sum\limits_{n=1}^{m}{\log \left(1+\exp \left(-y_n\omega^T x_n \right) \right)} \end{align*} $$
In [7]:
y = np.empty([m,1])
y[C1] = 1
y[C0] = -1
y = np.asmatrix(y)

w = cvx.Variable([3, 1])

obj = cvx.Minimize(cvx.sum(cvx.logistic(-cvx.multiply(y,X*w))))
prob = cvx.Problem(obj).solve()

w = w.value

xp = np.linspace(0,4,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize = (10,8))
plt.plot(X[C1,1], X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,1], X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 4, label = 'Logistic Regression')
plt.title('Logistic Regression', fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()

1.3. Logistic Regression using Scikit-Learn



$$ \begin{align*} \omega &= \begin{bmatrix} \omega_1 \\ \omega_2\end{bmatrix}, \qquad \omega_0, \qquad x = \begin{bmatrix} x_1 \\ x_2\end{bmatrix}\\ \\ X &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T \\ \vdots\end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ x_1^{(3)} & x_2^{(3)} \\ \vdots & \vdots \\\end{bmatrix}, \qquad y = \begin{bmatrix} y^{(1)}\\ y^{(2)} \\y^{(3)} \\ \vdots \end{bmatrix} \end{align*} $$

In [8]:
X = X[:,1:3]    # drop the bias column; scikit-learn fits the intercept separately

X.shape
Out[8]:
(100, 2)
In [9]:
from sklearn import linear_model

clf = linear_model.LogisticRegression(solver='lbfgs')
clf.fit(X,np.ravel(y))
Out[9]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)
In [10]:
clf.coef_
Out[10]:
array([[3.24350356, 1.62850268]])
In [11]:
clf.intercept_
Out[11]:
array([-9.56892254])
In [12]:
w0 = clf.intercept_[0]
w1 = clf.coef_[0,0]
w2 = clf.coef_[0,1]

xp = np.linspace(0,4,100).reshape(-1,1)
yp = - w1/w2*xp - w0/w2

plt.figure(figsize = (10,8))
plt.plot(X[C1,0], X[C1,1], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,0], X[C0,1], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 4, label = 'Logistic Regression')
plt.title('Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
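With the fitted scikit-learn model we can also query hard labels and class probabilities directly. This is an added sketch; the query points in x_new are made up for illustration.

In [ ]:
# predicted labels and class probabilities for a few made-up query points (added sketch)
x_new = np.array([[1.0, 3.0], [3.0, 1.0], [2.0, 2.0]])
print(clf.predict(x_new))          # hard labels, here in {-1, +1}
print(clf.classes_)                # column order used by predict_proba
print(clf.predict_proba(x_new))    # P(y = class | x) for each query point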

2. Multiclass Classification

  • Generalization to more than 2 classes is straightforward
    • one vs. all (one vs. rest)
    • one vs. one
  • Use the softmax function instead of the logistic function (refer to the UFLDL Tutorial)
    • interpret the outputs as class probabilities (a small numerical sketch follows after this list)
$$P\left(y = k \mid x\,;\omega\right) = \frac{\exp{\left( \omega_k^T x \right) }}{\sum_{j} \exp{\left(\omega_j^T x \right)}} \in [0,1]$$
  • We maintain a separate weight vector $\omega_k$ for each class $k$
  • Note that the softmax function is often used in deep learning
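Below is a minimal numerical sketch of the softmax mapping (added for illustration); the weight matrix W and the point x are made-up values, with column $k$ of W playing the role of $\omega_k$.

In [ ]:
# softmax over class scores for a single data point (added sketch)
W = np.array([[ 1.0, -0.5,  0.2],
              [ 0.3,  0.8, -1.0],
              [-0.7,  0.1,  0.9]])       # column k is the weight vector w_k
x = np.array([1.0, 2.0, -1.0])           # one (augmented) data point
scores = W.T @ x                         # w_k^T x for each class k
p = np.exp(scores - scores.max())        # subtract the max for numerical stability
p = p/p.sum()
print(p, p.sum())                        # class probabilities, sum to 1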

3. Non-linear Classification

  • Same idea as for linear regression: add non-linear features, either explicitly (feature maps) or implicitly (kernels)
In [13]:
X1 = np.array([[-1.1,0],[-0.3,0.1],[-0.9,1],[0.8,0.4],[0.4,0.9],[0.3,-0.6],
               [-0.5,0.3],[-0.8,0.6],[-0.5,-0.5]])
     
X0 = np.array([[-1,-1.3], [-1.6,2.2],[0.9,-0.7],[1.6,0.5],[1.8,-1.1],[1.6,1.6],
               [-1.6,-1.7],[-1.4,1.8],[1.6,-0.9],[0,-1.6],[0.3,1.7],[-1.6,0],[-2.1,0.2]])

X1 = np.asmatrix(X1)
X0 = np.asmatrix(X0)

plt.figure(figsize=(10, 8))
plt.plot(X1[:,0], X1[:,1], 'ro', label = 'C1')
plt.plot(X0[:,0], X0[:,1], 'bo', label = 'C0')
plt.title('Logistic Regression for Nonlinear Data', fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.axis('equal')
plt.legend(loc = 4, fontsize = 15)
plt.show()



$$ x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \quad \Longrightarrow \quad z = \phi(x) = \begin{bmatrix} 1\\ \sqrt{2}x_1\\ \sqrt{2}x_2 \\x_1^2 \\ \sqrt{2}x_1 x_2 \\x_2^2 \end{bmatrix}$$

In [14]:
N = X1.shape[0]
M = X0.shape[0]

X = np.vstack([X1, X0])
y = np.vstack([np.ones([N,1]), -np.ones([M,1])])

X = np.asmatrix(X)
y = np.asmatrix(y)

m = N + M
Z = np.hstack([np.ones([m,1]), np.sqrt(2)*X[:,0], np.sqrt(2)*X[:,1], np.square(X[:,0]),
               np.sqrt(2)*np.multiply(X[:,0], X[:,1]), np.square(X[:,1])])

w = cvx.Variable([6, 1])
obj = cvx.Minimize(cvx.sum(cvx.logistic(-cvx.multiply(y,Z*w))))
prob = cvx.Problem(obj).solve()

w = w.value
In [15]:
# evaluate the trained classifier on a grid to visualize the decision region
[X1gr, X2gr] = np.meshgrid(np.arange(-3,3,0.1), np.arange(-3,3,0.1))

Xp = np.hstack([X1gr.reshape(-1,1), X2gr.reshape(-1,1)])
Xp = np.asmatrix(Xp)

m = Xp.shape[0]
Zp = np.hstack([np.ones([m,1]), np.sqrt(2)*Xp[:,0], np.sqrt(2)*Xp[:,1], np.square(Xp[:,0]), 
                np.sqrt(2)*np.multiply(Xp[:,0], Xp[:,1]), np.square(Xp[:,1])])
q = Zp*w

B = []
for i in range(m):
    if q[i,0] > 0:
        B.append(Xp[i,:])       

B = np.vstack(B)

plt.figure(figsize=(10, 8))
plt.plot(X1[:,0], X1[:,1], 'ro', label = 'C1')
plt.plot(X0[:,0], X0[:,1], 'bo', label = 'C0')
plt.plot(B[:,0], B[:,1], 'gs', markersize = 10, alpha = 0.1, label = 'Logistic Regression')
plt.title('Logistic Regression with Kernel', fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.axis('equal')
plt.legend()
plt.show()
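As a cross-check (not in the original notes), the same degree-2 feature expansion can be generated with scikit-learn's PolynomialFeatures; the $\sqrt{2}$ scaling is omitted there, which only rescales features and does not change the attainable decision boundaries. The name clf_nl below is illustrative, and scikit-learn's default L2-regularized solver is used.

In [ ]:
# the explicit degree-2 feature map via scikit-learn (added sketch)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

Z_sk = PolynomialFeatures(degree = 2, include_bias = False).fit_transform(np.asarray(X))
clf_nl = LogisticRegression(solver = 'lbfgs').fit(Z_sk, np.asarray(y).ravel())
print(clf_nl.score(Z_sk, np.asarray(y).ravel()))   # training accuracy on the toy data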