Bayesian Machine Learning


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH


1. Bayesian Decision Theory 1

Suppose the data $x \in \mathbb{R}$ is one-dimensional. Assume there are two classes ($\mathcal{C}_1$ and $\mathcal{C}_2$), each with a probability density function (pdf) and a cumulative distribution function (cdf):


$$ \begin{align*} f_1(x) &= \frac{\partial{F_1(x)}}{\partial x}\\\\ f_2(x) &= \frac{\partial{F_2(x)}}{\partial x}\\\\ \end{align*} $$

We further assume the two classes are Gaussian distributed with $\mu_1 < \mu_2$. Then an instance $x \in \mathbb{R}$ belongs to one of these two classes:


$$ x \sim \begin{cases} \mathcal{N}(\mu_1, \sigma_1^2), \quad \text{if } x \in \mathcal{C}_1\\ \mathcal{N}(\mu_2, \sigma_2^2), \quad \text{if } x \in \mathcal{C}_2 \end{cases} $$

Since this is a binary classification problem in a one-dimensional space, we have to determine a threshold $\omega$ with $\mu_1 < \omega < \mu_2$. Then


$$ \begin{cases} \text{if } x < \omega,\quad x \in \mathcal{C}_1 \\ \text{if } x > \omega, \quad x \in \mathcal{C}_2 \end{cases} $$

1.1. Optimal Boundary for Classes

  • We want to minimize the misclassification rate (or error)


$$ \begin{align*} P(\text{error}) &= P(x> \omega, x \in \mathcal{C}_1) + P(x< \omega, x \in \mathcal{C}_2) \\\\ &= P(x> \omega \mid x \in \mathcal{C}_1)P(x \in \mathcal{C}_1) + P(x< \omega \mid x \in \mathcal{C}_2)P(x \in \mathcal{C}_2)\\\\ &= \left(1-F_1(\omega)\right)\,\pi_1 + F_2(\omega)\, \pi_2 \end{align*} $$

  • where the priors are


$$ \begin{align*} P(x \in \mathcal{C}_1) = \pi_1\\ P(x \in \mathcal{C}_2) = \pi_2 \end{align*} $$

  • Minimize over $\omega$


$$\min_{\omega}P(\text{error}) = \min_{\omega} \left\{ \left(1-F_1(\omega)\right)\,\pi_1 + F_2(\omega)\, \pi_2 \right\}$$

  • Take the derivative with respect to $\omega$ and set it to zero


$$ \begin{align*} \frac{\partial P(\text{error})}{\partial \omega} =& -f_1(\omega) \, \pi_1 + f_2(\omega) \, \pi_2 = 0\\\\ &\implies f_1(\omega) \, \pi_1 = f_2(\omega) \, \pi_2 \end{align*} $$
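
As a quick numerical check of this condition, here is a minimal sketch assuming hypothetical parameters $\mu_1 = 160$, $\mu_2 = 173$, $\sigma_1 = \sigma_2 = 6.5$ and equal priors: the root of $f_1(\omega)\,\pi_1 - f_2(\omega)\,\pi_2$ coincides with the direct minimizer of $P(\text{error})$.

In [ ]:
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq, minimize_scalar

# hypothetical class parameters (for illustration only)
mu1, sigma1, pi1 = 160, 6.5, 0.5
mu2, sigma2, pi2 = 173, 6.5, 0.5

# optimality condition from the derivative: f1(w)*pi1 - f2(w)*pi2 = 0
g = lambda w: norm.pdf(w, mu1, sigma1)*pi1 - norm.pdf(w, mu2, sigma2)*pi2
w_root = brentq(g, mu1, mu2)

# direct minimization of P(error) = (1 - F1(w))*pi1 + F2(w)*pi2
perror = lambda w: (1 - norm.cdf(w, mu1, sigma1))*pi1 + norm.cdf(w, mu2, sigma2)*pi2
w_min = minimize_scalar(perror, bounds=(mu1, mu2), method='bounded').x

print(w_root, w_min)    # both are (mu1 + mu2)/2 = 166.5 for equal sigmas and priors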

1.2. Posterior Probabilities

Another way to obtain the equation of the classification boundary is to equate the posterior probabilities. For $x$ on the boundary,


$$ \begin{align*} P(x \in \mathcal{C}_1 \mid X=x) &= P(x \in \mathcal{C}_2 \mid X=x)\\\\ \frac{P(X=x \mid x \in \mathcal{C}_1)P(x \in \mathcal{C}_1)}{P(X=x)} &= \frac{P(X=x \mid x \in \mathcal{C}_2)P(x \in \mathcal{C}_2)}{P(X=x)}\\\\ f_1(x)\, \pi_1 &= f_2(x)\, \pi_2 \end{align*} $$
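
Reusing the hypothetical parameters from the sketch above, the point where the two posteriors cross is the same boundary found in 1.1.

In [ ]:
x = np.linspace(130, 200, 2001)
f1 = norm.pdf(x, mu1, sigma1)
f2 = norm.pdf(x, mu2, sigma2)

Px = f1*pi1 + f2*pi2                      # marginal P(X = x)
post1, post2 = f1*pi1/Px, f2*pi2/Px       # posteriors via Bayes' rule

# first grid point where class 2 becomes more probable than class 1
w_cross = x[np.argmax(post2 > post1)]
print(w_cross)                            # approximately 166.5, as before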

1.3. Boundaries for Gaussian

Now let us think of data as multivariate Gaussian distributions, $x \sim \mathcal{N}(\mu, \Sigma)$.


$$f(x) = \frac{1}{\sqrt{(2\pi)^d\lvert \Sigma\rvert}}\exp{\left(-\frac{1}{2} (x-\mu)^T\Sigma^{-1}(x-\mu)\right)}$$
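
The density can be evaluated from this formula directly or with scipy.stats.multivariate_normal; a minimal sketch with an assumed 2D mean and covariance:

In [ ]:
import numpy as np
from scipy.stats import multivariate_normal

# hypothetical 2D parameters
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
xp = np.array([1.0, -1.0])                # point at which to evaluate the density

d = len(mu)
diff = xp - mu
f_manual = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) \
           / np.sqrt((2*np.pi)**d * np.linalg.det(Sigma))

print(f_manual, multivariate_normal(mu, Sigma).pdf(xp))   # identical values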

Then the equation of the boundary is


$$\frac{1}{\sqrt{(2\pi)^d\lvert \Sigma_1\rvert}}\exp{\left(-\frac{1}{2} (x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1)\right)} \, \pi_1= \frac{1}{\sqrt{(2\pi)^d\lvert \Sigma_2\rvert}}\exp{\left(-\frac{1}{2} (x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2)\right)} \, \pi_2$$

1.3.1. Equal Covariance

  • $\Sigma_1 = \Sigma_2 = \Sigma$


$$ \begin{align*} \frac{1}{\sqrt{(2\pi)^d\lvert \Sigma\rvert}}\exp{\left(-\frac{1}{2} (x-\mu_1)^T\Sigma^{-1}(x-\mu_1)\right)} \, \pi_1 &= \frac{1}{\sqrt{(2\pi)^d\lvert \Sigma\rvert}}\exp{\left(-\frac{1}{2} (x-\mu_2)^T\Sigma^{-1}(x-\mu_2)\right)} \, \pi_2 \\\\ \exp{\left(-\frac{1}{2} (x-\mu_1)^T\Sigma^{-1}(x-\mu_1)\right)} \, \pi_1 &= \exp{\left(-\frac{1}{2} (x-\mu_2)^T\Sigma^{-1}(x-\mu_2)\right)} \, \pi_2 \\\\ -(x-\mu_1)^T\Sigma^{-1}(x-\mu_1) + 2 \ln \pi_1 &= -(x-\mu_2)^T\Sigma^{-1}(x-\mu_2) + 2 \ln \pi_2 \\\\ -x^T\Sigma^{-1} x + x^T\Sigma^{-1} \mu_1 + \mu_1^T \Sigma^{-1} x -\mu_1^T\Sigma^{-1} \mu_1 + 2 \ln \pi_1 &= -x^T\Sigma^{-1} x + x^T\Sigma^{-1} \mu_2 + \mu_2^T \Sigma^{-1} x -\mu_2^T\Sigma^{-1} \mu_2 + 2 \ln \pi_2 \end{align*} $$


$$2 \left( \Sigma^{-1}(\mu_2 - \mu_1)\right)^T x + \left( \mu_1^T\Sigma^{-1}\mu_1 - \mu_2^T\Sigma^{-1}\mu_2 \right) + 2 \ln \frac{\pi_2}{\pi_1}= \color{red}{a^Tx + b =0}$$


  • If the covariance matrices are equal, the decision boundary of classification is a line.
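
A minimal sketch, assuming hypothetical 2D parameters, that builds $a$ and $b$ from the expression above and checks that a point on the line $a^Tx + b = 0$ has equal class discriminants:

In [ ]:
import numpy as np

# hypothetical shared covariance, means, and priors
Sigma = np.array([[1.0, 0.3], [0.3, 1.5]])
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
pi1, pi2 = 0.5, 0.5

Si = np.linalg.inv(Sigma)
a = 2 * Si @ (mu2 - mu1)
b = mu1 @ Si @ mu1 - mu2 @ Si @ mu2 + 2*np.log(pi2/pi1)

# pick a point x on the line a^T x + b = 0 and compare the two discriminants
x = np.array([0.0, -b/a[1]])              # solves a[0]*0 + a[1]*x2 + b = 0
g1 = -(x - mu1) @ Si @ (x - mu1) + 2*np.log(pi1)
g2 = -(x - mu2) @ Si @ (x - mu2) + 2*np.log(pi2)
print(np.isclose(g1, g2))                 # True: x lies on the decision boundary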

1.3.2. Not Equal Covariance

  • $\Sigma_1 \neq \Sigma_2$


$$ \begin{align*} \frac{1}{\sqrt{(2\pi)^d\lvert \Sigma_1\rvert}}\exp{\left(-\frac{1}{2} (x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1)\right)} \, \pi_1 &= \frac{1}{\sqrt{(2\pi)^d\lvert \Sigma_2\rvert}}\exp{\left(-\frac{1}{2} (x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2)\right)} \, \pi_2 \\\\ \frac{1}{\sqrt{\lvert \Sigma_1\rvert}}\exp{\left(-\frac{1}{2} (x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1)\right)} \, \pi_1 &= \frac{1}{\sqrt{\lvert \Sigma_2\rvert}}\exp{\left(-\frac{1}{2} (x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2)\right)} \, \pi_2 \\\\ -\ln \left(\lvert \Sigma_1\rvert \right) -(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1) + 2 \ln \pi_1 &= -\ln \left(\lvert \Sigma_2\rvert \right) -(x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2) + 2 \ln \pi_2 \\\\ -\ln \left(\lvert \Sigma_1\rvert \right) -x^T\Sigma_1^{-1} x + x^T\Sigma_1^{-1} \mu_1 + \mu_1^T \Sigma_1^{-1} x -\mu_1^T\Sigma_1^{-1} \mu_1 + 2 \ln \pi_1 &= -\ln \left(\lvert \Sigma_2\rvert \right) -x^T\Sigma_2^{-1} x + x^T\Sigma_2^{-1} \mu_2 + \mu_2^T \Sigma_2^{-1} x -\mu_2^T\Sigma_2^{-1} \mu_2 + 2 \ln \pi_2 \end{align*} $$


$$x^T(\Sigma_1^{-1} - \Sigma_2^{-1})x + 2\left( \Sigma_2^{-1}\mu_2 - \Sigma_1^{-1} \mu_1\right)^T x + \left( \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_2^T \Sigma_2^{-1} \mu_2 \right) - \ln \frac{\lvert \Sigma_2\rvert }{\lvert \Sigma_1\rvert} + 2 \ln \frac{\pi_2}{\pi_1} = \color{red}{x^TAx + b^Tx + c =0}$$


  • If the covariance matrices are not equal, the decision boundary of classification is quadratic.
  • This is why we should be careful when assuming a linear model for a given data set.
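
A minimal sketch, again with hypothetical parameters, that forms $A$, $b$, and $c$ from the expression above and verifies that $x^TAx + b^Tx + c$ equals the difference of the two class discriminants at an arbitrary point:

In [ ]:
import numpy as np

# hypothetical parameters with different covariances
Sigma1 = np.array([[1.0, 0.2], [0.2, 0.8]])
Sigma2 = np.array([[2.0, -0.4], [-0.4, 1.5]])
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
pi1, pi2 = 0.3, 0.7

S1, S2 = np.linalg.inv(Sigma1), np.linalg.inv(Sigma2)
A = S1 - S2
b = 2 * (S2 @ mu2 - S1 @ mu1)
c = mu1 @ S1 @ mu1 - mu2 @ S2 @ mu2 \
    - np.log(np.linalg.det(Sigma2)/np.linalg.det(Sigma1)) + 2*np.log(pi2/pi1)

# class discriminant: g_k(x) = -ln|Sigma_k| - (x - mu_k)^T Sigma_k^{-1} (x - mu_k) + 2 ln pi_k
def g(x, mu, S, Sigma, pi):
    return -np.log(np.linalg.det(Sigma)) - (x - mu) @ S @ (x - mu) + 2*np.log(pi)

x = np.array([1.3, -0.7])                 # arbitrary test point
Q = x @ A @ x + b @ x + c
print(np.isclose(Q, g(x, mu2, S2, Sigma2, pi2) - g(x, mu1, S1, Sigma1, pi1)))   # True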

1.3.3. Examples of Gaussian Decision Regions

  • When the covariances are all equal, the separating surfaces are hyperplanes.
  • When the covariances are not equal, the separating surfaces are piecewise level sets of quadratic functions.
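
These regions can be visualized by plotting the zero level set of $f_1(x)\,\pi_1 - f_2(x)\,\pi_2$; a sketch with hypothetical 2D parameters and equal priors, comparing the equal-covariance and unequal-covariance cases:

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# hypothetical means and covariances
mu1, mu2 = np.array([0, 0]), np.array([3, 2])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
Sigma2 = np.array([[2.5, -0.5], [-0.5, 0.7]])    # used only in the unequal case

xx, yy = np.meshgrid(np.linspace(-4, 7, 300), np.linspace(-4, 6, 300))
grid = np.dstack([xx, yy])

plt.figure(figsize=(10,4))
cases = [(Sigma, Sigma, 'equal covariance: linear boundary'),
         (Sigma, Sigma2, 'unequal covariance: quadratic boundary')]
for i, (S1, S2, title) in enumerate(cases):
    d = multivariate_normal(mu1, S1).pdf(grid) - multivariate_normal(mu2, S2).pdf(grid)
    plt.subplot(1, 2, i + 1)
    plt.contour(xx, yy, d, levels=[0], colors='k')    # decision boundary (equal priors)
    plt.plot(*mu1, 'ro')
    plt.plot(*mu2, 'bo')
    plt.title(title)
plt.show()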

2. Bayesian Decision Theory 2

Given the height $x$ of a person, decide whether the person is male ($y=1$) or female ($y=0$).

  • Binary Classes: $y\in \{0,1\}$


$$ \begin{align*} P(y=1 \mid x) &=\frac{P(x \mid y=1)P(y=1)}{P(x)} =\frac{ \underbrace{P(x \mid y=1)}_{\text{likelihood}} \underbrace{P(y=1)}_{\text{prior}}}{\underbrace{P(x)}_{\text{marginal}}} \\ P(y=0 \mid x) &=\frac{P(x \mid y=0)P(y=0)}{P(x)} \end{align*} $$

  • Decision


$$ \begin{align*} \text{If} \; P(y=1 \mid x) > P(y=0 \mid x),\; \text{then} \; \hat{y} = 1 \\ \text{If} \; P(y=1 \mid x) < P(y=0 \mid x),\; \text{then} \; \hat{y} = 0 \end{align*} $$


$$\therefore \; \frac{P(x \mid y=0)P(y=0)}{P(x \mid y=1)P(y=1)} \quad\begin{cases} >1 \quad \implies \; \hat{y}=0 \\ =1 \quad \implies \; \text{decision boundary}\\ <1 \quad \implies \; \hat{y}=1 \end{cases}$$
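
As a sketch of this rule, with hypothetical height distributions matching the example in the cells below, classification reduces to comparing the two numerators, since $P(x)$ cancels:

In [ ]:
from scipy.stats import norm

# hypothetical class-conditional models and priors
mu_f, mu_m, sigma = 160, 173, 6.5       # female (y = 0) and male (y = 1) mean heights
prior_f, prior_m = 0.5, 0.5

def classify(x):
    # compare P(x|y=0)P(y=0) with P(x|y=1)P(y=1); the marginal P(x) cancels
    score_f = norm.pdf(x, mu_f, sigma) * prior_f
    score_m = norm.pdf(x, mu_m, sigma) * prior_m
    return 1 if score_m > score_f else 0

print(classify(163), classify(170))     # 0 (female), 1 (male)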

2.1. Equal Variance and Equal Prior


$$\sigma_0 = \sigma_1 \qquad \text{and} \qquad P(y=0)=P(y=1)=\frac{1}{2}$$

$$P(x) = P(x \mid y=0) P(y=0) + P(x \mid y=1)P(y=1) = \frac{1}{2}\left\{ P(x \mid y=0) + P(x \mid y=1)\right\}$$
  • Decision boundary
$$P(y=0 \mid x)=P(y=1 \mid x)$$
  • With equal priors $P(y=0) = P(y=1) = \frac{1}{2}$, this reduces to $P(x \mid y=0) = P(x \mid y=1)$
  • With equal variances $\sigma_0 = \sigma_1$, the boundary is the midpoint $x = \dfrac{\mu_0 + \mu_1}{2}$



In [1]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy.stats import norm
from matplotlib import animation
%matplotlib inline
In [2]:
x = np.arange(130,200)

mu0 = 160
mu1 = 173
sigma0 = 6.5
sigma1 = 6.5

L1 = norm.pdf(x,mu0,sigma0)     # class-conditional likelihood P(x|y=0)
L2 = norm.pdf(x,mu1,sigma1)     # class-conditional likelihood P(x|y=1)

prior0 = 1/2
prior1 = 1/2

# marginal P(x) and posteriors via Bayes' rule
Px = L1*prior0 + L2*prior1
posterior0 = (L1*prior0)/Px
posterior1 = (L2*prior1)/Px

# variance of the Bernoulli posterior: var(y|x) = p(1-p)
var1 = posterior1 - posterior1**2

plt.figure(figsize=(10,12))
plt.subplot(4,1,1)
plt.plot(x,L1,'r',x,L2,'b')
plt.axis([135, 195, 0, 0.1])
plt.text(mu0-10,0.08,'P(x|y=0)', color='r', fontsize=15)
plt.text(mu1,0.08,'P(x|y=1)', color='b', fontsize=15)


plt.subplot(4,1,2)
plt.plot(x,L1*prior0,'r',x,L2*prior1,'b',x,Px,'k')
plt.axis([135, 195, 0, 0.1])
plt.text(mu0-14,0.05,'P(x|y=0)P(y=0)', color='r', fontsize=15)
plt.text(mu1,0.05,'P(x|y=1)P(y=1)', color='b', fontsize=15)
plt.text(mu0+5,0.06,'P(x)', fontsize=15)

plt.subplot(4,1,3)
plt.plot(x,posterior0,'r',x,posterior1,'b')
plt.axis([135, 195, 0, 1.1])
plt.text(142,0.8,'P(y=0|x)', color='r', fontsize=15)
plt.text(180,0.8,'P(y=1|x)', color='b', fontsize=15)


plt.subplot(4,1,4)
plt.plot(x,var1,'k')
plt.axis([135, 195, 0, 1.1])
plt.text(170,0.4,'var(y|x)', fontsize=15)
plt.xlabel('x', fontsize = 15)

plt.show()

2.2. Equal Variance and Not Equal Prior



$$\sigma_0 = \sigma_1 \qquad \text{and} \qquad P(y=1) > P(y=0)$$
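
With equal variances the boundary can still be written in closed form: equating the weighted likelihoods and taking logs gives $x^\ast = \frac{\mu_0+\mu_1}{2} + \frac{\sigma^2}{\mu_0-\mu_1}\ln\frac{P(y=1)}{P(y=0)}$, so a larger prior on $y=1$ shifts the boundary toward $\mu_0$. A quick numerical check, using the same parameters as the cell below:

In [ ]:
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

mu0, mu1, sigma = 160, 173, 6.5
prior0, prior1 = 1/4, 3/4

# boundary where the weighted likelihoods are equal
w = brentq(lambda x: norm.pdf(x, mu0, sigma)*prior0 - norm.pdf(x, mu1, sigma)*prior1,
           mu0, mu1)

# closed form: midpoint shifted toward the less probable class
w_closed = (mu0 + mu1)/2 + sigma**2/(mu0 - mu1)*np.log(prior1/prior0)
print(w, w_closed)    # both about 162.9, below the equal-prior midpoint 166.5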



In [3]:
x = np.arange(130,200)

mu0 = 160
mu1 = 173
sigma0 = 6.5
sigma1 = 6.5

L1 = norm.pdf(x,mu0,sigma0)
L2 = norm.pdf(x,mu1,sigma1)

# unequal priors: P(y=1) > P(y=0)
prior0 = 1/4
prior1 = 3/4

Px = L1*prior0 + L2*prior1
posterior0 = (L1*prior0)/Px
posterior1 = (L2*prior1)/Px

var1 = posterior1 - posterior1**2

plt.figure(figsize=(10,12))
plt.subplot(4,1,1)
plt.plot(x,L1,'r',x,L2,'b')
plt.axis([135, 195, 0, 0.1])
plt.text(mu0-10,0.08,'P(x|y=0)', color='r', fontsize=15)
plt.text(mu1,0.08,'P(x|y=1)', color='b', fontsize=15)

plt.subplot(4,1,2)
plt.plot(x,L1*prior0,'r',x,L2*prior1,'b',x,Px,'k')
plt.axis([135, 195, 0, 0.1])
plt.text(mu0-10,0.05,'P(x|y=0)P(y=0)', color='r', fontsize=15)
plt.text(mu1,0.05,'P(x|y=1)P(y=1)', color='b', fontsize=15)
plt.text(185,0.03,'P(x)', fontsize=15)

plt.subplot(4,1,3)
plt.plot(x,posterior0,'r',x,posterior1,'b')
plt.axis([135, 195, 0, 1.1])
plt.text(140,0.8,'P(y=0|x)', color='r', fontsize=15)
plt.text(180,0.8,'P(y=1|x)', color='b', fontsize=15)

plt.subplot(4,1,4)
plt.plot(x,var1,'k')
plt.axis([135, 195, 0, 1.1])
plt.text(170,0.4,'var(y|x)', fontsize=15)
plt.xlabel('x', fontsize = 15)

plt.show()

2.3. Not Equal Variance and Equal Prior



$$\sigma_0 \ne \sigma_1 \qquad \text{and} \qquad P(y=0)=P(y=1)=\frac{1}{2}$$