Fisher Discriminant Analysis (FDA) or Linear Discriminant Analysis (LDA)


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Table of Contents

1. Dimensionality Reduction with Label

  • Dimensionality reduction with label information (when the ultimate goal is classification/regression)
  • PCA ignores label information even if it is available
    • Only chooses directions of maximum variance
  • Fisher Discriminant Analysis (FDA) takes into account the label information
    • It is also called Linear Discriminant Analysis (LDA)
  • FDA/LDA projects data while preserving class separation
    • Examples from the same class are placed close together by the projection
    • Examples from different classes are placed far apart by the projection

2. Projection onto Line $\omega$

  • As in linear regression (with zero intercept), each data point is mapped to a scalar
    • assume zero-mean data; otherwise $x \leftarrow x - \bar{x}$
    • $\omega_0 = 0$
$$ \hat{y} = \langle \omega,x\rangle = \omega^Tx = \omega_1x_1 + \omega_2x_2$$
  • Dimension reduction
$$x = \begin{bmatrix}x_1 \\ x_2 \end{bmatrix} \rightarrow \hat y \; (\text{scalar})$$
  • Each data point $x = \begin{bmatrix}x_1 \\ x_2 \end{bmatrix}$ is projected onto $\omega$ ($\hat y$ is its projected length along the $\omega$ direction)
  • For a given $\omega$, the distribution of the projected points $\{ \hat y^{(1)}, \cdots, \hat y^{(m)} \}$ is determined.


    Question: Which $\omega$ is better for classification?
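Before answering, here is a minimal numeric sketch of the projection step itself (the points and $\omega$ below are made-up values, for illustration only):

import numpy as np

# two made-up 2-D points and a unit direction w
x_a = np.array([2.0, 1.0])
x_b = np.array([0.5, 3.0])
w = np.array([1.0, 1.0])/np.sqrt(2)

# each 2-D point reduces to a single scalar: its projected length along w
y_a = w @ x_a          # approx. 2.12
y_b = w @ x_b          # approx. 2.47
print(y_a, y_b)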



3. Fisher Discriminant Analysis

  • Find $\omega$ so that when projected onto $\omega$,

    • the classes are maximally separated (maximize the distance between the projected class means)

    • each class is tight (minimize the variance within each projected class)




$$\max_\omega \frac{(\text{separation of projected means})^2}{\text{sum of within-class variances}}$$


$$ \implies \max_\omega \frac {\big(\mu_0^T\omega - \mu_1^T\omega \big)^2}{n_0\omega^TS_0\omega + n_1\omega^TS_1\omega}$$
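Here $n_k$ is the number of examples in class $k$, $\mu_k$ is the mean of class $k$, and $S_k$ is its (sample) covariance matrix, matching how they are computed in the code below:

$$\mu_k = \frac{1}{n_k}\sum_{i \,\in\, \text{class } k} x^{(i)}, \qquad S_k = \frac{1}{n_k - 1}\sum_{i \,\in\, \text{class } k} \big(x^{(i)} - \mu_k\big)\big(x^{(i)} - \mu_k\big)^T, \qquad k \in \{0, 1\}$$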




$$\omega = \text{arg}\max_\omega \bigg\{ \frac{\big((\mu_0^T - \mu_1^T)\omega\big)^2}{n_0\omega^TS_0\omega + n_1\omega^TS_1\omega} \bigg \}$$


$$J(\omega) = \frac {\big((\mu_0^T - \mu_1^T)\omega\big)^2}{\omega^T(n_0S_0 + n_1S_1)\omega} = \frac {(m^T\omega)^2}{\omega^T\Sigma \omega}$$



$$ \begin{align*} m &\equiv \mu_0 - \mu_1 &\\ \Sigma &\equiv n_0S_0 + n_1S_1 = R^TR \qquad &\text{We can always write } \Sigma \text{ like this, where } R \text{ is a "square root" matrix} \\ u &\equiv R\omega \rightarrow \omega = R^{-1}u \qquad &\text{Using } R, \text{change the coordinate systems from } \omega \text{ to } u \end{align*} $$
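One concrete choice of such a square-root matrix is the Cholesky factor; a quick numerical check with a made-up positive-definite $\Sigma$ (a sketch, not part of the derivation) is:

import numpy as np

# made-up positive-definite Sigma, standing in for n0*S0 + n1*S1
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# np.linalg.cholesky returns lower-triangular L with L @ L.T = Sigma,
# so R = L.T satisfies R.T @ R = Sigma
R = np.linalg.cholesky(Sigma).T
print(np.allclose(R.T @ R, Sigma))    # True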



$$J(u) = \frac {\big( m^TR^{-1}u\big)^2}{\omega^TR^TR\omega} = \frac {\bigg( \big(R^{-T}m \big)^Tu\bigg)^2}{u^Tu}= \bigg( \left(R^{-T}m \right)^T \frac{u}{\parallel u \parallel} \bigg)^2$$


$$J(u) = \bigg( \left(R^{-T}m \right)^T \frac{u}{\lVert u \rVert} \bigg)^2 \; \text{is maximum when} \; u = a\,R^{-T}m$$

  • Why?
    • The dot product of a fixed vector with a unit vector is maximized when the two point in the same direction: by the Cauchy–Schwarz inequality, $\big|(R^{-T}m)^T \tfrac{u}{\lVert u \rVert}\big| \leq \lVert R^{-T}m \rVert$, with equality exactly when $u$ is parallel to $R^{-T}m$.


$$\begin{align*} u &= aR^{-T}m = aR^{-T}(\mu_0 - \mu_1)\\ \omega &= R^{-1}u = aR^{-1}R^{-T}(\mu_0 - \mu_1) = a\big(R^TR\big)^{-1}(\mu_0 - \mu_1) = a\Sigma^{-1}(\mu_0 - \mu_1) \\ \\ \therefore \; \omega &= a(n_0S_0 + n_1S_1)^{-1}(\mu_0 - \mu_1) \end{align*}$$

4. Python Code

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
# generating data set 

n0 = 200
n1 = 200

sigma = [[0.9, -0.4],
         [-0.4, 0.3]]

x0 = np.random.multivariate_normal([2.5, 2.5], sigma, n0).T      # data in class 0
x1 = np.random.multivariate_normal([1, 1], sigma, n1).T          # data in class 1

x0 = np.asmatrix(x0)
x1 = np.asmatrix(x1)
In [3]:
mu0 = np.mean(x0, axis = 1)
mu1 = np.mean(x1, axis = 1)

S0 = 1/(n0 - 1)*(x0 - mu0)*(x0 - mu0).T
S1 = 1/(n1 - 1)*(x1 - mu1)*(x1 - mu1).T

w = (n0*S0 + n1*S1).I*(mu0 - mu1)
print(w)
[[0.02689314]
 [0.04831161]]

Projection line and histogram

In [4]:
plt.figure(figsize = (10, 8))
plt.plot(x0[0,:], x0[1,:], 'r.')
plt.plot(x1[0,:], x1[1,:], 'b.')

xp = np.arange(-4, 6, 0.1)
yp = w[1,0]/w[0,0]*xp
plt.plot(xp, yp, 'k')
plt.axis('equal')
plt.ylim([-2, 6])
plt.show()
In [5]:
y1 = x0.T*w
y2 = x1.T*w

plt.figure(figsize = (10, 8))
plt.hist(y1, 21, color = 'r', rwidth = 0.5)
plt.hist(y2, 21, color = 'b', rwidth = 0.5)
plt.show()

Different $\omega$ (x axis)

In [6]:
w = np.array([[1], [0]])
w = np.asmatrix(w)
In [7]:
plt.figure(figsize = (10, 8))
plt.plot(x0[0,:], x0[1,:], 'r.')
plt.plot(x1[0,:], x1[1,:], 'b.')
plt.axhline(0, color = 'k')
plt.axis('equal')
plt.ylim([-2, 6])
plt.show()
In [8]:
y1 = x0.T*w
y2 = x1.T*w

plt.figure(figsize = (10, 8))
plt.hist(y1, 21, color = 'r', rwidth = 0.5)
plt.hist(y2, 21, color = 'b', rwidth = 0.5)
plt.show()

Different $\omega$ (y axis)

In [9]:
w = np.array([[0], [1]])
w = np.asmatrix(w)

plt.figure(figsize = (10, 8))
plt.plot(x0[0,:], x0[1,:], 'r.')
plt.plot(x1[0,:], x1[1,:], 'b.')
plt.axvline(0, color = 'k')
plt.axis('equal')
plt.ylim([-2, 6])
plt.show()

y1 = x0.T*w
y2 = x1.T*w

plt.figure(figsize = (10, 6))
plt.hist(y1, 21, color = 'r', rwidth = 0.5)
plt.hist(y2, 21, color = 'b', rwidth = 0.5)
plt.show()
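To put numbers on what the three histograms show, the Fisher criterion $J(\omega)$ from Section 3 can be evaluated for each candidate direction, reusing mu0, mu1, S0, S1 computed above; the FDA direction should give the largest value (a small sketch):

def fisher_J(w):
    # Fisher criterion: (separation of projected means)^2 / (sum of within-class variances)
    w = np.asmatrix(w).reshape(2, 1)
    num = ((mu0 - mu1).T*w)[0, 0]**2
    den = (w.T*(n0*S0 + n1*S1)*w)[0, 0]
    return num/den

w_fda = (n0*S0 + n1*S1).I*(mu0 - mu1)

print(fisher_J(w_fda))                     # FDA direction
print(fisher_J(np.array([[1], [0]])))      # x axis
print(fisher_J(np.array([[0], [1]])))      # y axis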

5. Scikit-learn

In [10]:
# stack the two classes and build the label vector (x0 -> label 1, x1 -> label 0)

X = np.vstack([x0.T, x1.T])
y = np.vstack([np.ones([n0, 1]), np.zeros([n1, 1])])
In [11]:
from sklearn import discriminant_analysis

clf = discriminant_analysis.LinearDiscriminantAnalysis()
clf.fit(X, np.ravel(y))
Out[11]:
LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
              solver='svd', store_covariance=False, tol=0.0001)
In [12]:
# projection

X_LDA = clf.transform(X)

plt.figure(figsize = (10, 8))
plt.hist(X_LDA[0:200], 21, color = 'r', rwidth = 0.5)
plt.hist(X_LDA[200:400], 21, color = 'b', rwidth = 0.5)
plt.show()
In [13]:
# classifier boundary, not a projection line

clf_w = clf.coef_[0]
clf_w0 = clf.intercept_[0]
print(clf_w)
print(clf_w0)
[10.75725689 19.32464447]
-52.41110376609386
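As a quick sanity check, these coefficients reproduce scikit-learn's decision values $\omega^T x + \omega_0$ (a sketch using decision_function):

# decision values from scikit-learn vs. the explicit linear form w^T x + w0
d_sklearn = clf.decision_function(X)
d_manual = np.ravel(np.array(X).dot(clf_w) + clf_w0)
print(np.allclose(d_sklearn, d_manual))    # expected: True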
In [14]:
# projection line

mu0 = np.mean(x0, axis = 1)
mu1 = np.mean(x1, axis = 1)

S0 = 1/(n0 - 1)*(x0 - mu0)*(x0 - mu0).T
S1 = 1/(n1 - 1)*(x1 - mu1)*(x1 - mu1).T

prj_w = (n0*S0 + n1*S1).I*(mu0 - mu1)
In [15]:
xp = np.arange(-4, 6, 0.1)
prj_yp = prj_w[1,0]/prj_w[0,0]*xp
clf_yp = - clf_w[0]/clf_w[1]*xp - clf_w0/clf_w[1]

plt.figure(figsize = (10, 8))
plt.plot(x0[0,:], x0[1,:], 'r.')
plt.plot(x1[0,:], x1[1,:], 'b.')
plt.plot(xp, prj_yp, 'k', label = 'LDA projection line')
plt.plot(xp, clf_yp, 'g', label = 'LDA classification boundary')
plt.legend(fontsize = 12)
plt.axis('equal')
plt.ylim([-2, 6])
plt.show()
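For comparison with prj_w, the direction that transform actually applies can be read from the fitted model; for the default 'svd' solver it is stored in scalings_ and should be parallel, up to sign and scale, to the manual FDA direction (a sketch assuming this attribute layout):

# scikit-learn's projection weights vs. the manually computed FDA direction
sk_w = np.array(clf.scalings_[:, 0]).ravel()
my_w = np.array(prj_w).ravel()

# compare unit directions (they can differ by sign and overall scale)
print(sk_w/np.linalg.norm(sk_w))
print(my_w/np.linalg.norm(my_w))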