Probabilistic Machine Learning


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Table of Contents

1. Probabilistic Linear Regression
2. Probabilistic Linear Classification
3. Probabilistic Clustering
4. Probabilistic Dimension Reduction

1. Probabilistic Linear Regression

  • Probabilistic Machine Learning
    • I personally believe this is a more fundamental way of looking at machine learning
  • Inference Idea



$$P(x \mid \theta) = \text{Probability [data } \mid \text{ pattern]}$$


data = underlying pattern + independent noise



  • Change your viewpoint of data
    • Generative model

1.1. Generative Model

  • Given observed data $ D=\{(x_1,y_1),(x_2,y_2),\cdots,(x_m,y_m)\}$
  • Generative model structure
    • Each response is generated by a linear model plus some Gaussian noise


$$ y = \hat y + \varepsilon = \omega^T x + \varepsilon, \quad \varepsilon\sim \mathcal{N} \left(0,\sigma^2\right)$$


$$P(y \mid x\,; \omega)= \mathcal{N} \left(\omega^T x,\sigma^2\right)$$
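As a quick illustration, here is a minimal sketch that draws data from this generative model; the weights, noise level, and sample size below are arbitrary choices for illustration, not values used later in this notebook:

import numpy as np

np.random.seed(0)

m = 100                        # number of samples (arbitrary)
omega = np.array([1.0, 0.5])   # hypothetical true weights [intercept, slope]
sigma = 0.3                    # hypothetical noise standard deviation

x = np.random.uniform(0, 5, m)
X = np.column_stack([np.ones(m), x])   # augment inputs with a bias term
eps = sigma*np.random.randn(m)         # independent Gaussian noise
y = X @ omega + eps                    # y = omega^T x + epsilon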





1.2. Frequentist View of Linear Regression

  • Given observed data
$$ D=\{(x_1,y_1),(x_2,y_2),\cdots,(x_m,y_m)\}$$
  • We want to estimate the weight vector $\omega$
  • Each response is generated by a linear model plus some Gaussian noise
$$ y = \omega^T x + \varepsilon, \quad \varepsilon\sim \mathcal{N} \left(0,\sigma^2\right)$$
  • Each response $y$ then becomes a draw from the following Gaussian:
$$ y \mid x \sim \mathcal{N}\left(\omega^T x,\sigma^2\right) $$
  • Probability of each response variable

$$P(y \mid x\,; \omega)= \mathcal{N} \left(\omega^T x,\sigma^2\right)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}\left(y-\omega^T x \right)^2\right)$$
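For a single observation, this conditional density can be evaluated directly; a minimal sketch using scipy.stats.norm, with arbitrary illustrative numbers:

import numpy as np
from scipy.stats import norm

omega = np.array([1.0, 0.5])    # hypothetical weights [intercept, slope]
sigma = 0.3                     # hypothetical noise standard deviation

x = np.array([1.0, 2.0])        # augmented input [1, x]
y = 2.1                         # observed response

# P(y | x ; omega) = N(y ; omega^T x, sigma^2)
p = norm.pdf(y, loc=omega @ x, scale=sigma)
print(p)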

1.2.2. Maximum Likelihood Estimation (MLE)

  • Estimate parameters $\theta=\left(\omega,\sigma^2\right)$ that maximize the likelihood under the generative model
    • Likelihood:
$$\mathcal{L} = P(D \mid \theta) = P(D \,;\, \theta)$$


$$\hat \theta_{MLE} = \underset{\theta}{\mathrm{argmax}}\;\; P(D \,;\, \theta)$$

  • Log-likelihood:

$$\begin{align*} \ell (\omega,\sigma) = \log \mathcal{L}(\omega,\sigma) &= \log P(D \,; \omega, \sigma^2) \\&= \log P(Y \mid X\,; \omega, \sigma^2) \\&= \log \prod\limits^m_{i=1}P\left(y_i \mid x_i\,; \omega,\sigma^2 \right)\\ &= \sum\limits^m_{i=1}\log P \left(y_i \mid x_i\,; \omega, \sigma^2 \right)\\ &= \sum\limits^m_{i=1}\log \frac{1}{\sqrt{2\pi\sigma^2}}\exp{\left(-\frac{\left(y_i - \omega^Tx_i \right)^2}{2\sigma^2}\right)}\\ &= \sum\limits^m_{i=1}\left\{ -\frac{1}{2}\log \left(2\pi\sigma^2 \right) - \frac{\left(y_i - \omega^Tx_i \right)^2}{2\sigma^2}\right\} \end{align*}$$
  • Maximum likelihood solution:

$$\begin{align*} \hat{\omega}_{MLE} &= \arg\max_{\omega}\log P(D \,; \omega, \sigma^2)\\ &= \arg\max_{\omega} \;- \frac{1}{2\sigma^2}\sum\limits^m_{i=1} \left(y_i-\omega^Tx_i \right)^2\\ &= \arg\min_{\omega} \frac{1}{2\sigma^2}\sum\limits^m_{i=1} \left(y_i-\omega^Tx_i \right)^2\\ &= \arg\min_{\omega} \sum\limits^m_{i=1} \left(y_i-\omega^Tx_i \right)^2 \end{align*}$$


  • It is equivalent to the least-squares objective function for linear regression (amazing!)

1.2.3. Compute MLE for Linear Regression


$$\begin{align*} \mathcal{L}(\omega,\sigma) & = P\left(y_1,y_2,\cdots,y_m \mid x_1,x_2,\cdots,x_m; \; \underbrace{\omega, \sigma}_{\theta}\right)\\ & = \prod\limits_{i=1}^{m} P\left(y_i \mid x_i; \; \omega,\sigma\right)\\ & = \frac{1}{(2\pi\sigma^2)^\frac{m}{2}}\exp\left(-\frac{1}{2\sigma^2}\sum\limits_{i=1}^m(y_i-\omega^T x_i)^2\right) \end{align*}$$


$$\begin{align*} \mathcal{L}(\omega,\sigma) & = \frac{1}{(2\pi\sigma^2)^\frac{m}{2}}\exp\left(-\frac{1}{2\sigma^2}\sum\limits_{i=1}^m(y_i-\omega^T x_i)^2\right)\\\\ \ell &= -\frac{m}{2}\log{2\pi}-m\log{\sigma}-\frac{1}{2\sigma^2}\sum\limits_{i=1}^m\left(y_i-\omega^T x_i\right)^2\\\\ \frac{d\ell}{d\omega} &=\frac{1}{\sigma^2}\left(X^TY-X^TX\omega\right)=0\; \quad \implies \; \omega_{MLE}=\left(X^TX\right)^{-1}X^TY \\\\ \frac{d\ell}{d\sigma} & =-\frac{m}{\sigma}+\frac{1}{\sigma^3}\sum\limits_{i=1}^m\left(y_i-\omega^T x_i\right)^2=0 \quad \implies \; \sigma^2_{MLE}=\frac{1}{m}\sum\limits_{i=1}^m\left(y_i-\omega^T x_i\right)^2 \end{align*}$$


  • Big lesson
    • Same as the least-squares optimization
    • In least squares, we implicitly assume that the noise is Gaussian distributed
In [1]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

%matplotlib inline
In [2]:
m = 200                                  # number of data points

a = 1                                    # true slope
x = 3 + 2*np.random.uniform(0,1,[m,1])   # inputs drawn uniformly from [3, 5]
noise = 0.1*np.random.randn(m,1)         # Gaussian noise with standard deviation 0.1

y = a*x + noise                          # generative model: y = a*x + noise
y = np.asmatrix(y)

plt.figure(figsize=(6, 6))
plt.plot(x, y, 'b.')
plt.axis('equal')
plt.grid(alpha=0.3)
plt.show()
In [3]:
# least-squares fit of y = theta[1]*x + theta[0]: theta[0] is the intercept, theta[1] is the slope
A = np.hstack([np.ones([m, 1]), x])
A = np.asmatrix(A)

theta = (A.T*A).I*A.T*y

# to plot the fitted line
xp = np.linspace(np.min(x), np.max(x))
yp = theta[1,0]*xp + theta[0,0]

plt.figure(figsize=(6, 6))
plt.plot(x, y, 'b.')
plt.plot(xp, yp, 'r', linewidth = 3)
plt.axis('equal')
plt.grid(alpha=0.3)
plt.show()
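As a small follow-up sketch (reusing A, theta, y, and m from the cells above), the noise variance can also be estimated with the MLE formula $\sigma^2_{MLE} = \frac{1}{m}\sum_{i=1}^m \left(y_i - \omega^T x_i\right)^2$ derived earlier; since the noise above was generated with standard deviation 0.1, the estimate should come out close to 0.01:

# MLE of the noise variance from the residuals of the fitted line
residual = y - A*theta
sigma2_MLE = np.sum(np.square(residual))/m
print(sigma2_MLE)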
In [4]:
yhat0 = theta[1,0]*x + theta[0,0]   # least-squares (MLE) fit
err0 = yhat0 - y

yhat1 = 1.2*x - 1                   # deliberately mis-specified model 1
err1 = yhat1 - y

yhat2 = 1.3*x - 1                   # deliberately mis-specified model 2
err2 = yhat2 - y

plt.figure(figsize=(10, 6))
plt.subplot(2,3,1), plt.plot(x,y,'b.',x,yhat0,'r'), 
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha=0.3)
plt.subplot(2,3,2), plt.plot(x,y,'b.',x,yhat1,'r'), 
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha=0.3)
plt.subplot(2,3,3), plt.plot(x,y,'b.',x,yhat2,'r'), 
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha=0.3)
plt.subplot(2,3,4), plt.hist(err0,31), plt.axvline(0, color='k'), 
plt.xlabel(r'$\epsilon$', fontsize=15), 
plt.yticks([]), plt.axis([-1, 1, 0, 15]), plt.grid(alpha=0.3)
plt.subplot(2,3,5), plt.hist(err1,31), plt.axvline(0, color='k'), 
plt.xlabel(r'$\epsilon$', fontsize=15),
plt.yticks([]), plt.axis([-1, 1, 0, 15]), plt.grid(alpha=0.3)
plt.subplot(2,3,6), plt.hist(err2,31), plt.axvline(0, color='k'), 
plt.xlabel(r'$\epsilon$', fontsize=15), 
plt.yticks([]), plt.axis([-1, 1, 0, 15]), plt.grid(alpha=0.3)
plt.show()
In [5]:
# lag-1 pairs of residuals to check whether the errors look independent
a0x = err0[1:]
a0y = err0[0:-1]

a1x = err1[1:]
a1y = err1[0:-1]

a2x = err2[1:]
a2y = err2[0:-1]

plt.figure(figsize=(10, 6))
plt.subplot(2,3,1), plt.plot(x,y,'b.',x,yhat0,'r'), 
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha=0.3)
plt.subplot(2,3,2), plt.plot(x,y,'b.',x,yhat1,'r'), 
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha=0.3)
plt.subplot(2,3,3), plt.plot(x,y,'b.',x,yhat2,'r'), 
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha=0.3)

plt.subplot(2,3,4), plt.plot(a0x, a0y, '.'), 
plt.axis('equal'), plt.axis([-0.7, 0.7, -0.7, 0.7]), plt.grid(alpha=0.3)
plt.xlabel(r'$\epsilon_i$', fontsize=15), plt.ylabel(r'$\epsilon_{i-1}$', fontsize=15)

plt.subplot(2,3,5), plt.plot(a1x, a1y, '.'), 
plt.axis('equal'), plt.axis([-0.7, 0.7, -0.7, 0.7]), plt.grid(alpha=0.3)
plt.xlabel(r'$\epsilon_i$', fontsize=15), plt.ylabel(r'$\epsilon_{i-1}$', fontsize=15)

plt.subplot(2,3,6), plt.plot(a2x, a2y, '.'), 
plt.axis('equal'), plt.axis([-0.7, 0.7, -0.7, 0.7]), plt.grid(alpha=0.3)
plt.xlabel(r'$\epsilon_i$', fontsize=15), plt.ylabel(r'$\epsilon_{i-1}$', fontsize=15)
plt.show()

1.3. Bayesian View of Linear Regression

  • With no prior information (i.e., a uniform prior) on $\omega$, MAP reduces to MLE

1.3.1. A Prior on $\omega$





  • Suppose we assume a Gaussian prior distribution over the weight vector $\omega$
    • Make sure you understand what this means





1.3.2. Posterior

  • Posterior probability
    • Bayes rule


$$P(\omega \mid D) = \frac{P(D \mid \omega) P(\omega)}{P(D)}$$

  • Log posterior probability:

$$\log P(\omega \mid D) = \log\frac{P(D \mid \omega)P(\omega)}{P(D)} = \log P(D \mid \omega) + \log P(\omega) - \underbrace{\log P(D)}_{\text{constant}}$$

1.3.3. Maximum-a-Posteriori (MAP)

  • Assume $E[\omega] = 0$ for simplicity


$$P(\omega) = \mathcal{N}\left( 0, \Sigma\right) = \mathcal{N}\left( 0, \lambda^{-1}I\right) = \left(\frac{\lambda}{2\pi}\right)^{D/2}\exp\left( -\frac{\lambda}{2} \omega^T\omega\right)$$

  • Maximize log posterior probability


$$\begin{align*} \hat{\omega}_{MAP} &= \arg\max_{\omega} \log P(\omega \mid D)\\\\ &= \arg\max_{\omega}\left\{\log P(D \mid \omega) + \log P(\omega)\right\}\\\\ &= \arg\max_{\omega}\left\{ \sum\limits^m_{i=1}\left\{ -\frac{1}{2}\log\left(2\pi\sigma^2\right) - \frac{\left(y_i-\omega^Tx_i\right)^2}{2\sigma^2} \right\} +\frac{D}{2}\log \frac{\lambda}{2\pi} - \frac{\lambda}{2}\omega^T\omega \right\}\\\\ &= \arg\min_{\omega}\frac{1}{2\sigma^2}\sum\limits^m_{i=1}\left(y_i - \omega^Tx_i\right)^2 + \frac{\lambda}{2}\omega^T\omega \quad \\&\text{ (ignoring constants and changing max to min)} \end{align*}$$


  • For $\sigma = 1$ (or any fixed constant) for each input, this is equivalent to the regularized least-squares objective


$$\hat{\omega}_{MAP} = \arg\min_{\omega} \left\{\sum\limits^m_{i=1}\left(y_i - \omega^Tx_i\right)^2 + \lambda\omega^T\omega \right\}$$

  • Big lesson: MAP $= l_2$ norm regularization
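To make the connection concrete: the regularized objective above has the closed-form solution $\hat{\omega}_{MAP} = \left(X^TX + \lambda I\right)^{-1}X^TY$. A minimal sketch reusing A and y from the regression cells above ($\lambda$ is an arbitrary choice here, and this simple version also regularizes the intercept, which one would normally exclude):

lam = 0.1                                       # regularization strength (arbitrary)
theta_map = (A.T*A + lam*np.eye(2)).I*A.T*y     # MAP / ridge estimate
print(theta_map)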

1.4. Summary: MLE vs MAP

  • MLE solution:
$$\hat{\omega}_{MLE} = \arg\min_{\omega}\frac{1}{2\sigma^2}\sum\limits^m_{i=1} \left(y_i - \omega^Tx_i \right)^2$$
  • MAP solution:
$$\hat{\omega}_{MAP} = \arg\min_{\omega}\frac{1}{2\sigma^2}\sum\limits^m_{i=1} \left(y_i - \omega^Tx_i \right)^2 + \frac{\lambda}{2}\omega^T\omega$$
  • Take-Home messages:

    • MLE estimation of a parameter leads to unregularized solutions

    • MAP estimation of a parameter leads to regularized solutions

    • The prior distribution acts as a regularizer in MAP estimation

  • Note : for MAP, different prior distributions lead to different regularizers

    • Gaussian prior on $\omega$ regularizes the $l_2$ norm of $\omega$

    • Laplace prior $\exp \left(-C\lVert\omega\rVert_1 \right)$ on $\omega$ regularizes the $l_1$ norm of $\omega$

2. Probabilistic Linear Classification

  • Often we do not just care about predicting the label $y$ for an example

  • Rather, we want to predict the label probabilities $P(y \mid x, \omega)$

    • E.g., $P(y = +1 \mid x, \omega)$: the probability that the label is $+1$

    • In a sense, it is our confidence in the predicted label

  • Consider the following compact expression for the label probability, with $y \in \{-1, +1\}$:



$$P(y \mid x, \omega) = \sigma \left(y\omega^Tx \right) = \frac{1}{1 + \exp \left(-y\omega^Tx \right)}$$



  • $\sigma$ is the logistic function, which maps any real number into $(0,1)$
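A quick sketch to visualize the logistic function:

import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-6, 6, 200)
sigma_z = 1/(1 + np.exp(-z))    # logistic (sigmoid) function

plt.figure(figsize=(6, 4))
plt.plot(z, sigma_z, linewidth=3)
plt.axhline(0.5, color='k', alpha=0.3)
plt.xlabel('z'), plt.ylabel(r'$\sigma(z)$')
plt.grid(alpha=0.3)
plt.show()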

2.1. Logistic Regression

  • What does the decision boundary look like for logistic regression?
  • At the decision boundary, the labels $-1$ and $+1$ become equiprobable

$$\begin{align*} P(y= + 1 \mid x, \omega) &= P(y = -1 \mid x, \omega)\\\\ \frac{1}{1+\exp \left(-\omega^Tx \right)} &= \frac{1}{1+\exp \left(\omega^Tx \right)}\\\\ \exp \left(-\omega^Tx \right) &= \exp \left(\omega^Tx \right)\\\\ \omega^T x&= 0 \end{align*}$$


  • The decision boundary is therefore linear $\implies$ logistic regression is a linear classifier



  • Note: it is possible to kernelize and make it nonlinear

2.2. Maximum Likelihood Solution

  • Goal: want to estimate $\omega$ from the data $D = \{ (x_1, y_1), \cdots, (x_m, y_m)\}$
  • Log-likelihood:

$$\begin{align*} \ell (\omega) = \log \mathcal{L}(\omega) &= \log P(D \mid \omega) \\&= \log P(Y \mid X, \omega) \\&= \log\prod\limits^m_{i=1}P(y_i \mid x_i, \omega)\\ &= \sum^m_{i=1}\log P(y_i \mid x_i, \omega)\\ &= \sum^m_{i=1}\log \frac{1}{1+\exp \left(-y_i\omega^Tx_i \right)}\\ &= \sum^m_{i=1}-\log\left[1+\exp \left(-y_i\omega^Tx_i \right)\right] \end{align*}$$
  • Maximum Likelihood Solution:


$$\hat{\omega}_{MLE} = \arg\max_{\omega}\log \mathcal{L}(\omega) = \arg\min_{\omega}\sum\limits^m_{i=1} \log\left[1 + \exp\left(-y_i\omega^Tx_i\right)\right]$$

  • No closed-form solution exists but we can do gradient descent on $\omega$


$$\begin{align*} \nabla_{\omega} \log \mathcal{L}(\omega) &= \sum^m_{i=1} -\frac{1}{1 + \exp\left(-y_i\omega^Tx_i\right)}\exp\left(-y_i\omega^Tx_i\right)(-y_i x_i)\\ &= \sum^m_{i=1} \frac{1}{1 + \exp\left(y_i\omega^Tx_i\right)}y_i x_i \end{align*}$$
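A minimal gradient-ascent sketch based on this gradient; the synthetic two-class data, learning rate, and number of iterations below are arbitrary choices for illustration:

import numpy as np

np.random.seed(0)

# synthetic two-class data with labels y in {-1, +1}
m = 100
x = np.vstack([np.random.randn(m//2, 2) + 2, np.random.randn(m//2, 2) - 2])
X = np.hstack([np.ones([m, 1]), x])                  # augment with a bias term
y = np.hstack([np.ones(m//2), -np.ones(m//2)])

omega = np.zeros(3)
alpha = 0.01                                         # learning rate (arbitrary)

for _ in range(1000):
    # gradient of the log-likelihood: sum_i y_i x_i / (1 + exp(y_i w^T x_i))
    grad = np.sum((y / (1 + np.exp(y*(X @ omega))))[:, None] * X, axis=0)
    omega = omega + alpha*grad                       # gradient ascent step

print(omega)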

2.3. Maximum-a-Posteriori Solution

  • Let's assume a Gaussian prior distribution over the weight vector $\omega$


$$P(\omega) = \mathcal{N}\left( 0, \lambda^{-1}I\right) = \left(\frac{\lambda}{2\pi}\right)^{D/2}\exp\left( -\frac{\lambda}{2} \omega^T\omega \right)$$

  • Maximum-a-Posteriori Solution:


$$\begin{align*} \hat{\omega}_{MAP} &= \arg\max_{\omega} \log P(\omega \mid D) \\ &= \arg\max_{\omega}\{\log P(D \mid \omega) + \log P(\omega) - \underbrace{\log P(D)}_{\text{constant}} \}&\\ &= \arg\max_{\omega}\{\log P(D \mid \omega) + \log P(\omega) \}&\\ &= \arg\max_{\omega}\bigg\{ \sum\limits^m_{i=1} - \log\left[1 + \exp\left(-y_i\omega^Tx_i\right)\right] +\frac{D}{2}\log \frac{\lambda}{2\pi} - \frac{\lambda}{2}\omega^T\omega \bigg\}&\\ &= \arg\min_{\omega}\sum\limits^m_{i=1} \log\left[1 + \exp\left(-y_i\omega^Tx_i\right)\right] + \frac{\lambda}{2}\omega^T\omega \quad \\&\text{ (ignoring constants and changing max to min)}& \end{align*}$$

  • Big lesson: MAP $= l_2$ norm regularization
  • No closed-form solution exists but we can do gradient descent on $\omega$
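Compared with the MLE sketch above, the only change to the update is an extra $-\lambda\omega$ term coming from the Gaussian prior; a minimal sketch (assuming X, y, and alpha from the previous sketch, with an arbitrary $\lambda$):

lam = 0.1              # regularization strength from the Gaussian prior (arbitrary)

omega = np.zeros(3)
for _ in range(1000):
    grad = np.sum((y / (1 + np.exp(y*(X @ omega))))[:, None] * X, axis=0)
    omega = omega + alpha*(grad - lam*omega)         # gradient of the log posterior
print(omega)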

2.4. Summary: MLE vs MAP

  • MLE solution:
$$\hat{\omega}_{MLE} = \arg\min_{\omega}\sum\limits^m_{i=1}\log\left[1 + \exp\left(-y_i\omega^Tx_i\right)\right]$$
  • MAP solution:
$$\hat{\omega}_{MAP} = \arg\min_{\omega}\sum\limits^m_{i=1}\log\left[1 + \exp\left(-y_i\omega^Tx_i\right)\right] + \frac{\lambda}{2}\omega^T\omega$$
  • Take-home messages (we already saw these before)

    • MLE estimation of a parameter leads to unregularized solutions

    • MAP estimation of a parameter leads to regularized solutions

    • The prior distribution acts as a regularizer in MAP estimation

  • Note: For MAP, different prior distributions lead to different regularizers

    • Gaussian prior on $\omega$ regularizes the $l_2$ norm of $\omega$

    • Laplace prior $\exp(-C\lVert\omega\rVert_1)$ on $\omega$ regularizes the $l_1$ norm of $\omega$

3. Probabilistic Clustering

  • Not covered in this course

4. Probabilistic Dimension Reduction

  • Not covered in this course