Probabilistic Machine Learning


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Table of Contents

1. Probabilistic Linear Regression¶

  • Probabilistic Machine Learning
    • I personally believe this is a more fundamental way of looking at machine learning
  • Inference Idea



$$P(x \mid \theta) = \text{Probability [data } \mid \text{ pattern]}$$


data = underlying pattern + independent noise



  • Change your viewpoint of data
    • Generative model

1.1. Generative Model¶

  • Given observed data $ D=\{(x_1,y_1),(x_2,y_2),\cdots,(x_m,y_m)\}$
  • Generative model structure
    • Each response generated by a linear model plus some Gaussian noise


$$ y = \hat y + \varepsilon = \omega^T x + \varepsilon, \quad \varepsilon\sim \mathcal{N} \left(0,\sigma^2\right)$$


$$P(y \mid x\,; \omega)= \mathcal{N} \left(\omega^T x,\sigma^2\right)$$
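  • As a minimal sketch of this generative model (the weights, noise level, and input range below are arbitrary choices for illustration), data can be drawn as follows:

import numpy as np

m = 100                                      # number of samples to generate
omega = np.array([0.5, 1.0])                 # assumed weights: [intercept, slope]
sigma = 0.1                                  # assumed noise standard deviation

X = np.hstack([np.ones([m, 1]),
               np.random.uniform(0, 3, [m, 1])])   # design matrix with a bias column
epsilon = sigma*np.random.randn(m)           # independent Gaussian noise
y = X @ omega + epsilon                      # y = omega^T x + epsilon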





1.2. Frequentist View of Linear Regression¶

  • Given observed data
$$ D=\{(x_1,y_1),(x_2,y_2),\cdots,(x_m,y_m)\}$$
  • We want to estimate the weight vector $\omega$
  • Each response generated by a linear model plus some Gaussian noise
$$ y = \omega^T x + \varepsilon, \quad \varepsilon\sim \mathcal{N} \left(0,\sigma^2\right)$$
  • Each response $y$ then becomes a draw from the following Gaussian:
$$ y \mid x \sim \mathcal{N}\left(\omega^T x,\sigma^2\right) $$
  • Probability of each response variable

$$P(y \mid x\,; \omega)= \mathcal{N} \left(\omega^T x,\sigma^2\right)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}\left(y-\omega^T x \right)^2\right)$$
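  • As a quick sanity check of this density (the values of $\omega^Tx$, $\sigma$, and $y$ below are hypothetical), scipy.stats.norm returns the same number as the explicit formula:

import numpy as np
from scipy.stats import norm

wx = 2.0                    # hypothetical value of omega^T x
sigma = 0.5                 # hypothetical noise standard deviation
y = 2.3                     # hypothetical observed response

p_scipy = norm.pdf(y, loc=wx, scale=sigma)
p_formula = 1/np.sqrt(2*np.pi*sigma**2)*np.exp(-(y - wx)**2/(2*sigma**2))

print(p_scipy, p_formula)   # the two values agree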

1.2.2. Maximum Likelihood Estimation (MLE)¶

  • Estimate the parameters $\theta=\left(\omega,\sigma^2\right)$ that maximize the likelihood under the generative model
    • Likelihood:
$$\mathcal{L} = P(D \mid \theta) = P(D \,;\, \theta)$$


$$\hat \theta_{MLE} = \underset{\theta}{\mathrm{argmax}}\;\; P(D \,;\, \theta)$$

  • Log-likelihood:

$$\begin{align*} \ell (\omega,\sigma) = \log \mathcal{L}(\omega,\sigma) &= \log P(D \,; \omega, \sigma^2) \\&= \log P(Y \mid X\,; \omega, \sigma^2) \\&= \log \prod\limits^m_{i=1}P\left(y_i \mid x_i\,; \omega,\sigma^2 \right)\\ &= \sum\limits^m_{i=1}\log P \left(y_i \mid x_i\,; \omega, \sigma^2 \right)\\ &= \sum\limits^m_{i=1}\log \frac{1}{\sqrt{2\pi\sigma^2}}\exp{\left(-\frac{\left(y_i - \omega^Tx_i \right)^2}{2\sigma^2}\right)}\\ &= \sum\limits^m_{i=1}\left\{ -\frac{1}{2}\log \left(2\pi\sigma^2 \right) - \frac{\left(y_i - \omega^Tx_i \right)^2}{2\sigma^2}\right\} \end{align*}$$
  • Maximum likelihood solution:

$$\begin{align*} \hat{\omega}_{MLE} &= \arg\max_{\omega}\log P(D \,; \omega, \sigma^2)\\ &= \arg\max_{\omega} \;- \frac{1}{2\sigma^2}\sum\limits^m_{i=1} \left(y_i-\omega^Tx_i \right)^2\\ &= \arg\min_{\omega} \frac{1}{2\sigma^2}\sum\limits^m_{i=1} \left(y_i-\omega^Tx_i \right)^2\\ &= \arg\min_{\omega} \sum\limits^m_{i=1} \left(y_i-\omega^Tx_i \right)^2 \end{align*}$$


  • It is equivalent to the least-squares objective function for linear regression (amazing!); a quick numerical check is sketched below
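  • As a numerical sketch of this equivalence (the toy data and the fixed $\sigma$ below are arbitrary choices), maximizing the log-likelihood over $\omega$ with a generic optimizer gives the same answer as the least-squares solution:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# toy data for illustration (intercept 0.5, slope 1.0, noise std 0.1 are arbitrary)
np.random.seed(0)
m = 100
x = np.random.uniform(0, 3, m)
y = 0.5 + 1.0*x + 0.1*np.random.randn(m)
X = np.column_stack([np.ones(m), x])

# negative log-likelihood in omega = (intercept, slope), with sigma held fixed
def neg_log_lik(w, sigma=0.1):
    return -np.sum(norm.logpdf(y, loc=X @ w, scale=sigma))

w_mle = minimize(neg_log_lik, x0=np.zeros(2)).x      # maximize the likelihood numerically
w_ls = np.linalg.lstsq(X, y, rcond=None)[0]          # ordinary least squares

print(w_mle, w_ls)                                   # the two estimates agree (up to tolerance)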

1.2.3. Compute MLE for Linear Regression¶


$$\begin{align*} \mathcal{L}(\omega,\sigma) & = P\left(y_1,y_2,\cdots,y_m \mid x_1,x_2,\cdots,x_m; \; \underbrace{\omega, \sigma}_{\theta}\right)\\ & = \prod\limits_{i=1}^{m} P\left(y_i \mid x_i; \; \omega,\sigma\right)\\ & = \frac{1}{(2\pi\sigma^2)^\frac{m}{2}}\exp\left(-\frac{1}{2\sigma^2}\sum\limits_{i=1}^m(y_i-\omega^T x_i)^2\right) \end{align*}$$


$$\begin{align*} \mathcal{L}(\omega,\sigma) & = \frac{1}{(2\pi\sigma^2)^\frac{m}{2}}\exp\left(-\frac{1}{2\sigma^2}\sum\limits_{i=1}^m(y_i-\omega^T x_i)^2\right)\\\\ \ell &= -\frac{m}{2}\log{(2\pi)}-m\log{\sigma}-\frac{1}{2\sigma^2}\sum\limits_{i=1}^m\left(y_i-\omega^T x_i\right)^2\\\\ \frac{d\ell}{d\omega} &=\frac{1}{\sigma^2}\left(X^TY-X^TX\omega\right)=0\; \quad \implies \; \omega_{MLE}=\left(X^TX\right)^{-1}X^TY \\\\ \frac{d\ell}{d\sigma} & =-\frac{m}{\sigma}+\frac{1}{\sigma^3}\sum\limits_{i=1}^m\left(y_i-\omega^T x_i\right)^2=0 \quad \implies \; \sigma^2_{MLE}=\frac{1}{m}\sum\limits_{i=1}^m\left(y_i-\omega^T x_i\right)^2 \end{align*}$$


  • Big lesson
    • The MLE solution is the same as the least-squares solution
    • In least squares, we implicitly assume that the noise is Gaussian distributed
In [1]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

%matplotlib inline
In [2]:
m = 200                                     # number of data points

a = 1                                       # true slope (no intercept)
x = 3 + 2*np.random.uniform(0, 1, [m, 1])   # inputs drawn uniformly from [3, 5]
noise = 0.1*np.random.randn(m, 1)           # Gaussian noise with sigma = 0.1

y = a*x + noise                             # generative model: y = a*x + noise
y = np.asmatrix(y)

plt.figure(figsize=(6, 6))
plt.plot(x, y, 'b.')
plt.axis('equal')
plt.grid(alpha=0.3)
plt.show()
In [3]:
# compute theta[0] (intercept) and theta[1] (slope) of the fitted line y = theta[1]*x + theta[0]
A = np.hstack([np.ones([m, 1]), x])
A = np.asmatrix(A)

# least-squares (= MLE) solution: theta = (A^T A)^{-1} A^T y
theta = (A.T*A).I*A.T*y

# to plot the fitted line
xp = np.linspace(np.min(x), np.max(x))
yp = theta[1,0]*xp + theta[0,0]

plt.figure(figsize=(6, 6))
plt.plot(x, y, 'b.')
plt.plot(xp, yp, 'r', linewidth = 3)
plt.axis('equal')
plt.grid(alpha=0.3)
plt.show()
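
  • As a sketch (reusing A, theta, y, and m from the cells above), the closed-form noise variance estimate $\sigma^2_{MLE}$ follows from the residuals of the fitted line, and its square root should be close to the 0.1 noise level used to generate the data.

residual = y - A*theta                        # y_i - omega^T x_i for all data points
sigma2_mle = np.sum(np.square(residual))/m    # sigma^2_MLE = (1/m) * sum of squared residuals

print(np.sqrt(sigma2_mle))                    # expected to be close to 0.1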
In [4]:
# residuals of the MLE (least-squares) fit and of two arbitrary alternative lines
yhat0 = theta[1,0]*x + theta[0,0]
err0 = yhat0 - y

yhat1 = 1.2*x - 1
err1 = yhat1 - y

yhat2 = 1.3*x - 1
err2 = yhat2 - y

plt.figure(figsize=(10, 6))
plt.subplot(2,3,1), plt.plot(x,y,'b.',x,yhat0,'r'), 
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha=0.3)
plt.subplot(2,3,2), plt.plot(x,y,'b.',x,yhat1,'r'), 
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha=0.3)
plt.subplot(2,3,3), plt.plot(x,y,'b.',x,yhat2,'r'), 
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha=0.3)
plt.subplot(2,3,4), plt.hist(err0,31), plt.axvline(0, color='k'), 
plt.xlabel(r'$\epsilon$', fontsize=15), 
plt.yticks([]), plt.axis([-1, 1, 0, 15]), plt.grid(alpha=0.3)
plt.subplot(2,3,5), plt.hist(err1,31), plt.axvline(0, color='k'), 
plt.xlabel(r'$\epsilon$', fontsize=15),
plt.yticks([]), plt.axis([-1, 1, 0, 15]), plt.grid(alpha=0.3)
plt.subplot(2,3,6), plt.hist(err2,31), plt.axvline(0, color='k'), 
plt.xlabel(r'$\epsilon$', fontsize=15), 
plt.yticks([]), plt.axis([-1, 1, 0, 15]), plt.grid(alpha=0.3)
plt.show()
In [5]:
# lag-1 residual pairs (epsilon_i vs. epsilon_{i-1}) to check the independence assumption
a0x = err0[1:]
a0y = err0[0:-1]

a1x = err1[1:]
a1y = err1[0:-1]

a2x = err2[1:]
a2y = err2[0:-1]

plt.figure(figsize=(10, 6))
plt.subplot(2,3,1), plt.plot(x,y,'b.',x,yhat0,'r'), 
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha=0.3)
plt.subplot(2,3,2), plt.plot(x,y,'b.',x,yhat1,'r'), 
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha=0.3)
plt.subplot(2,3,3), plt.plot(x,y,'b.',x,yhat2,'r'), 
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha=0.3)

plt.subplot(2,3,4), plt.plot(a0x, a0y, '.'), 
plt.axis('equal'), plt.axis([-0.7, 0.7, -0.7, 0.7]), plt.grid(alpha=0.3)
plt.xlabel(r'$\epsilon_i$', fontsize=15), plt.ylabel(r'$\epsilon_{i-1}$', fontsize=15)

plt.subplot(2,3,5), plt.plot(a1x, a1y, '.'), 
plt.axis('equal'), plt.axis([-0.7, 0.7, -0.7, 0.7]), plt.grid(alpha=0.3)
plt.xlabel(r'$\epsilon_i$', fontsize=15), plt.ylabel(r'$\epsilon_{i-1}$', fontsize=15)

plt.subplot(2,3,6), plt.plot(a2x, a2y, '.'), 
plt.axis('equal'), plt.axis([-0.7, 0.7, -0.7, 0.7]), plt.grid(alpha=0.3)
plt.xlabel(r'$\epsilon_i$', fontsize=15), plt.ylabel(r'$\epsilon_{i-1}$', fontsize=15)
plt.show()

1.3. Bayesian View of Linear Regression¶

  • With no prior information on $\omega$ (equivalently, a uniform prior), the estimate reduces to the MLE

1.3.1. A Prior on $\omega$¶





  • Suppose we assume a Gaussian prior distribution over the weight vector $\omega$
    • Make sure you understand what this means; a sketch of samples from such a prior follows below
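
  • As a sketch of what such a prior means (the zero mean and unit variance below are arbitrary choices), each weight vector $\omega$ sampled from the prior corresponds to one candidate regression line:

import numpy as np
import matplotlib.pyplot as plt

xp = np.linspace(-2, 2, 100)

plt.figure(figsize=(6, 6))
for _ in range(10):
    w = np.random.randn(2)                  # omega = [intercept, slope] drawn from N(0, I)
    plt.plot(xp, w[0] + w[1]*xp, alpha=0.5)
plt.xlabel('x'), plt.ylabel('y')
plt.grid(alpha=0.3)
plt.show()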