Probabilistic Machine Learning
$$P(x \mid \theta) = \text{Probability [data } \mid \text{ pattern]}$$
$$ y = \hat y + \varepsilon = \omega^T x + \varepsilon, \quad \varepsilon\sim \mathcal{N} \left(0,\sigma^2\right)$$
$$\hat \theta_{MLE} = \underset{\theta}{\mathrm{argmax}}\;\; P(D \,;\, \theta)$$
$$\begin{align*} \mathcal{L}(\omega,\sigma)
& = P\left(y_1,y_2,\cdots,y_m \mid x_1,x_2,\cdots,x_m; \; \underbrace{\omega, \sigma}_{\theta}\right)\\
& = \prod\limits_{i=1}^{m} P\left(y_i \mid x_i; \; \omega,\sigma\right)\\
& = \frac{1}{(2\pi\sigma^2)^\frac{m}{2}}\exp\left(-\frac{1}{2\sigma^2}\sum\limits_{i=1}^m(y_i-\omega^T x_i)^2\right)
\end{align*}$$
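Maximizing this likelihood in $\omega$ is equivalent to minimizing the sum of squared residuals, so the least-squares fit computed below is also the maximum likelihood estimate. As a minimal sketch of that equivalence, assuming hypothetical synthetic data, one can maximize the log-likelihood numerically with scipy.optimize.minimize and compare the result with the closed-form least-squares solution.

import numpy as np
from scipy.optimize import minimize

# hypothetical synthetic data: y = 1.0*x + Gaussian noise with sigma = 0.1
rng = np.random.default_rng(0)
m = 200
x = 3 + 2*rng.uniform(0, 1, m)
y = 1.0*x + 0.1*rng.standard_normal(m)
A = np.column_stack([np.ones(m), x])          # design matrix [1, x]

def neg_log_likelihood(params):
    w0, w1, log_sigma = params
    sigma = np.exp(log_sigma)                 # parametrize sigma > 0
    resid = y - (w0 + w1*x)
    return 0.5*m*np.log(2*np.pi*sigma**2) + np.sum(resid**2)/(2*sigma**2)

res = minimize(neg_log_likelihood, x0=[0.0, 0.0, 0.0])   # numerical MLE of (w0, w1, log sigma)
theta_ls = np.linalg.solve(A.T @ A, A.T @ y)              # closed-form least squares

print(res.x[:2])    # MLE of (intercept, slope)
print(theta_ls)     # least-squares (intercept, slope): should agree closely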
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
%matplotlib inline
# generate noisy data from the linear model y = a*x + noise, with noise ~ N(0, 0.1^2)
m = 200
a = 1
x = 3 + 2*np.random.uniform(0, 1, [m, 1])
noise = 0.1*np.random.randn(m, 1)
y = a*x + noise
plt.figure(figsize=(6, 6))
plt.plot(x, y, 'b.')
plt.axis('equal')
plt.grid(alpha=0.3)
plt.show()
# least-squares (MLE) fit: theta[0,0] is the intercept and theta[1,0] is the slope of y = theta[1]*x + theta[0]
A = np.hstack([np.ones([m, 1]), x])
theta = np.linalg.solve(A.T @ A, A.T @ y)
# to plot the fitted line
xp = np.linspace(np.min(x), np.max(x))
yp = theta[1,0]*xp + theta[0,0]
plt.figure(figsize=(6, 6))
plt.plot(x, y, 'b.')
plt.plot(xp, yp, 'r', linewidth = 3)
plt.axis('equal')
plt.grid(alpha=0.3)
plt.show()
# residuals of the fitted line (yhat0) and of two deliberately wrong lines (yhat1, yhat2)
yhat0 = theta[1,0]*x + theta[0,0]
err0 = yhat0 - y
yhat1 = 1.2*x - 1
err1 = yhat1 - y
yhat2 = 1.3*x - 1
err2 = yhat2 - y
plt.figure(figsize=(10, 6))
plt.subplot(2,3,1), plt.plot(x,y,'b.',x,yhat0,'r'),
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha=0.3)
plt.subplot(2,3,2), plt.plot(x,y,'b.',x,yhat1,'r'),
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha=0.3)
plt.subplot(2,3,3), plt.plot(x,y,'b.',x,yhat2,'r'),
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha=0.3)
plt.subplot(2,3,4), plt.hist(err0,31), plt.axvline(0, color='k'),
plt.xlabel(r'$\epsilon$', fontsize=15),
plt.yticks([]), plt.axis([-1, 1, 0, 15]), plt.grid(alpha=0.3)
plt.subplot(2,3,5), plt.hist(err1,31), plt.axvline(0, color='k'),
plt.xlabel(r'$\epsilon$', fontsize=15),
plt.yticks([]), plt.axis([-1, 1, 0, 15]), plt.grid(alpha=0.3)
plt.subplot(2,3,6), plt.hist(err2,31), plt.axvline(0, color='k'),
plt.xlabel(r'$\epsilon$', fontsize=15),
plt.yticks([]), plt.axis([-1, 1, 0, 15]), plt.grid(alpha=0.3)
plt.show()
# lag plots: pair each residual epsilon_i with its neighbor epsilon_{i-1} to check for leftover structure
a0x = err0[1:]
a0y = err0[0:-1]
a1x = err1[1:]
a1y = err1[0:-1]
a2x = err2[1:]
a2y = err2[0:-1]
plt.figure(figsize=(10, 6))
plt.subplot(2,3,1), plt.plot(x,y,'b.',x,yhat0,'r'),
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha=0.3)
plt.subplot(2,3,2), plt.plot(x,y,'b.',x,yhat1,'r'),
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha=0.3)
plt.subplot(2,3,3), plt.plot(x,y,'b.',x,yhat2,'r'),
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha=0.3)
plt.subplot(2,3,4), plt.plot(a0x, a0y, '.'),
plt.axis('equal'), plt.axis([-0.7, 0.7, -0.7, 0.7]), plt.grid(alpha=0.3)
plt.xlabel(r'$\epsilon_i$', fontsize=15), plt.ylabel(r'$\epsilon_{i-1}$', fontsize=15)
plt.subplot(2,3,5), plt.plot(a1x, a1y, '.'),
plt.axis('equal'), plt.axis([-0.7, 0.7, -0.7, 0.7]), plt.grid(alpha=0.3)
plt.xlabel(r'$\epsilon_i$', fontsize=15), plt.ylabel(r'$\epsilon_{i-1}$', fontsize=15)
plt.subplot(2,3,6), plt.plot(a2x, a2y, '.'),
plt.axis('equal'), plt.axis([-0.7, 0.7, -0.7, 0.7]), plt.grid(alpha=0.3)
plt.xlabel(r'$\epsilon_i$', fontsize=15), plt.ylabel(r'$\epsilon_{i-1}$', fontsize=15)
plt.show()
$$P(\omega \mid D) = \frac{P(D \mid \omega) P(\omega)}{P(D)}$$
$$P(\omega) = \mathcal{N}\left(\omega \mid 0, \Sigma\right) = \mathcal{N}\left(\omega \mid 0, \lambda^{-1}I\right) = \left(\frac{\lambda}{2\pi}\right)^{D/2}\exp\left( -\frac{\lambda}{2} \omega^T\omega\right)$$
$$\begin{align*}
\hat{\omega}_{MAP} &= \arg\max_{\omega} \log P(\omega \mid D)\\\\
&= \arg\max_{\omega}\left\{\log P(D \mid \omega) + \log P(\omega)\right\}\\\\
&= \arg\max_{\omega}\left\{ \sum\limits^m_{i=1}\left\{ -\frac{1}{2}\log\left(2\pi\sigma^2\right) - \frac{\left(y_i-\omega^Tx_i\right)^2}{2\sigma^2} \right\} + \frac{D}{2}\log\frac{\lambda}{2\pi} - \frac{\lambda}{2}\omega^T\omega \right\}\\\\
&= \arg\min_{\omega}\frac{1}{2\sigma^2}\sum\limits^m_{i=1}\left(y_i - \omega^Tx_i\right)^2 + \frac{\lambda}{2}\omega^T\omega \quad \\&\text{ (ignoring constants and changing max to min)}
\end{align*}$$
$$\hat{\omega}_{MAP} = \arg\min_{\omega} \left\{\sum\limits^m_{i=1}\left(y_i - \omega^Tx_i\right)^2 + \lambda\omega^T\omega \right\} \quad \text{(after multiplying through by } 2\sigma^2 \text{ and absorbing } \sigma^2 \text{ into } \lambda \text{)}$$
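A minimal sketch of what this buys us, assuming hypothetical data and an arbitrary $\lambda$: the objective above has the closed-form ridge solution $\omega = (A^TA + \lambda I)^{-1}A^Ty$, which reduces to the least-squares/MLE solution as $\lambda \to 0$.

import numpy as np

# hypothetical data from the same linear model as the earlier cells
rng = np.random.default_rng(0)
m = 200
x = 3 + 2*rng.uniform(0, 1, [m, 1])
y = 1.0*x + 0.1*rng.standard_normal((m, 1))
A = np.hstack([np.ones([m, 1]), x])

lam = 10.0   # regularization strength (prior precision); an arbitrary choice here

# MLE / least squares: argmin sum_i (y_i - w^T x_i)^2
theta_mle = np.linalg.solve(A.T @ A, A.T @ y)
# MAP with Gaussian prior (ridge): argmin sum_i (y_i - w^T x_i)^2 + lam * w^T w
theta_map = np.linalg.solve(A.T @ A + lam*np.eye(2), A.T @ y)

print(theta_mle.ravel())   # unregularized estimate
print(theta_map.ravel())   # shrunk toward zero by the prior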
Take-home messages:
MLE estimation of a parameter leads to unregularized solutions
MAP estimation of a parameter leads to regularized solutions
The prior distribution acts as a regularizer in MAP estimation
Note: for MAP, different prior distributions lead to different regularizers
Gaussian prior on $\omega$ regularizes the $l_2$ norm of $\omega$
Laplace prior $\exp \left(-C\lVert\omega\rVert_1 \right)$ on $\omega$ regularizes the $l_1$ norm of $\omega$ (both are sketched below)
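A minimal sketch of this contrast, assuming hypothetical data and an arbitrary $\lambda$: fit the same regression with an $l_2$ penalty (Gaussian prior) and an $l_1$ penalty (Laplace prior) and compare the estimated coefficients.

import numpy as np
from scipy.optimize import minimize

# hypothetical data: 5 features, only the first two actually influence y
rng = np.random.default_rng(1)
m, d = 200, 5
X = rng.standard_normal((m, d))
w_true = np.array([2.0, -1.0, 0.0, 0.0, 0.0])
y = X @ w_true + 0.1*rng.standard_normal(m)

lam = 20.0   # arbitrary regularization strength

def sq_loss(w):
    return np.sum((y - X @ w)**2)

# Gaussian prior -> l2 penalty (ridge): coefficients shrink toward zero but generally stay nonzero
w_l2 = minimize(lambda w: sq_loss(w) + lam*np.sum(w**2), np.zeros(d)).x
# Laplace prior -> l1 penalty (lasso): irrelevant coefficients tend to be driven to (near) zero
w_l1 = minimize(lambda w: sq_loss(w) + lam*np.sum(np.abs(w)), np.zeros(d), method='Powell').x

print(np.round(w_l2, 3))
print(np.round(w_l1, 3))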
Often we do not just care about predicting the label $y$ for an example
Rather, we want to predict the label probabilities $P(y \mid x, \omega)$
E.g., $P(y = +1 \mid x, \omega)$: the probability that the label is $+1$
In a sense, it is our confidence in the predicted label
$$P(y \mid x, \omega) = \sigma \left(y\omega^Tx \right) = \frac{1}{1 + \exp \left(-y\omega^Tx \right)}, \qquad y \in \{-1, +1\}$$
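A small numerical illustration, assuming a hypothetical weight vector and example: the sigmoid maps the score $\omega^Tx$ to a probability, and the two label probabilities sum to one because $\sigma(-z) = 1 - \sigma(z)$.

import numpy as np

def sigmoid(z):
    return 1.0/(1.0 + np.exp(-z))

# hypothetical weight vector and example (first component acts as a bias)
w = np.array([-1.0, 2.0])
x = np.array([1.0, 0.8])

score = w @ x                        # omega^T x = 0.6
p_pos = sigmoid(+1*score)            # P(y = +1 | x, omega) ~ 0.65
p_neg = sigmoid(-1*score)            # P(y = -1 | x, omega) ~ 0.35
print(p_pos, p_neg, p_pos + p_neg)   # the two probabilities sum to 1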
$$\hat{\omega}_{MLE} = \arg\max_{\omega}\log \mathcal{L}(\omega) = \arg\min_{\omega}\sum\limits^m_{i=1} \log\left[1 + \exp\left(-y_i\omega^Tx_i\right)\right]$$
$$\begin{align*}
\nabla_{\omega} \log \mathcal{L}(\omega) &= \sum^m_{i=1} -\frac{1}{1 + \exp\left(-y_i\omega^Tx_i\right)}\exp\left(-y_i\omega^Tx_i\right)(-y_i x_i)\\
&= \sum^m_{i=1} \frac{1}{1 + \exp\left(y_i\omega^Tx_i\right)}y_i x_i
\end{align*}$$
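There is no closed form for this MLE, so one standard option is gradient ascent on the log-likelihood using exactly the gradient above. A minimal sketch, assuming hypothetical simulated labels $y_i \in \{-1,+1\}$ and an arbitrary step size and iteration count:

import numpy as np

# hypothetical binary data with labels in {-1, +1}
rng = np.random.default_rng(0)
m = 200
X = np.hstack([np.ones((m, 1)), rng.standard_normal((m, 2))])   # bias + 2 features
w_true = np.array([0.5, 2.0, -1.0])
y = np.where(rng.uniform(size=m) < 1/(1 + np.exp(-X @ w_true)), 1, -1)

w = np.zeros(3)
lr = 0.1
for _ in range(2000):
    # gradient of log L: sum_i y_i x_i / (1 + exp(y_i w^T x_i))
    grad = np.sum((y / (1 + np.exp(y * (X @ w))))[:, None] * X, axis=0)
    w += lr * grad / m      # ascent step (averaged over the data for stability)

print(w)   # should move toward w_true as the iterations proceed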
$$P(\omega) = \mathcal{N}\left(\omega \mid 0, \lambda^{-1}I\right) = \left(\frac{\lambda}{2\pi}\right)^{D/2}\exp\left( -\frac{\lambda}{2} \omega^T\omega \right)$$
$$\begin{align*}
\hat{\omega}_{MAP} &= \arg\max_{\omega} \log P(\omega \mid D) \\
&= \arg\max_{\omega}\{\log P(D \mid \omega) + \log P(\omega) - \underbrace{\log P(D)}_{\text{constant}} \}&\\
&= \arg\max_{\omega}\{\log P(D \mid \omega) + \log P(\omega) \}&\\
&= \arg\max_{\omega}\bigg\{ \sum\limits^m_{i=1} - \log\left[1 + \exp\left(-y_i\omega^Tx_i\right)\right] + \frac{D}{2}\log\frac{\lambda}{2\pi} - \frac{\lambda}{2}\omega^T\omega \bigg\}&\\
&= \arg\min_{\omega}\sum\limits^m_{i=1} \log\left[1 + \exp\left(-y_i\omega^Tx_i\right)\right] + \frac{\lambda}{2}\omega^T\omega \quad \\&\text{ (ignoring constants and changing max to min)}&
\end{align*}$$
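Compared with the MLE gradient step, the only change the Gaussian prior introduces is the extra term $-\lambda\omega$ from differentiating $-\frac{\lambda}{2}\omega^T\omega$. A self-contained sketch, assuming the same hypothetical data construction as above and arbitrary $\lambda$, step size, and iteration count:

import numpy as np

# hypothetical binary data with labels in {-1, +1}
rng = np.random.default_rng(0)
m = 200
X = np.hstack([np.ones((m, 1)), rng.standard_normal((m, 2))])
w_true = np.array([0.5, 2.0, -1.0])
y = np.where(rng.uniform(size=m) < 1/(1 + np.exp(-X @ w_true)), 1, -1)

lam = 5.0   # prior precision / regularization strength (arbitrary)
w = np.zeros(3)
lr = 0.1
for _ in range(2000):
    grad_ll = np.sum((y / (1 + np.exp(y * (X @ w))))[:, None] * X, axis=0)
    grad = grad_ll - lam * w     # the log-prior contributes -lam * w
    w += lr * grad / m

print(w)   # pulled toward zero relative to the unregularized MLE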
Take-home messages (we already saw these before)
MLE estimation of a parameter leads to unregularized solutions
MAP estimation of a parameter leads to regularized solutions
The prior distribution acts as a regularizer in MAP estimation
Note: For MAP, different prior distributions lead to different regularizers
Gaussian prior on $\omega$ regularizes the $l_2$ norm of $\omega$
Laplace prior $\exp(-C\lVert\omega\rVert_1)$ on $\omega$ regularizes the $l_1$ norm of $\omega$