Probabilistic Machine Learning

by Prof. Seungchul Lee, iSystems Design Lab, UNIST (http://isystems.unist.ac.kr/)

# 1. Probabilistic Linear Regression

$$P(X \mid \theta) = \text{Probability [data } \mid \text{ pattern]}$$

• Inference idea: data = underlying pattern + independent noise
• Each response is generated by a linear model plus Gaussian noise

$$y = \omega^T x + \varepsilon, \quad \varepsilon\sim \mathcal{N} \left(0,\sigma^2\right)$$
• each response $y$ then becomes a draw from the following Gaussian:

$$y \sim \mathcal{N}\left(\omega^T x,\sigma^2\right)$$
• Probability of each response variable

$$P(y \mid x, \omega)= \mathcal{N} \left(y \mid \omega^T x,\sigma^2\right)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}\left(y-\omega^T x \right)^2\right)$$
• Given observed data $D=\{(x_1,y_1),(x_2,y_2),\cdots,(x_m,y_m)\}$, we want to estimate the weight vector $\omega$
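The generative model above can be sketched numerically. A minimal example; the true weights, noise level, and sample size below are illustrative assumptions, not values from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# assumed (illustrative) ground truth: pattern + independent Gaussian noise
omega_true = np.array([1.5, -2.0])   # underlying weight vector
sigma = 0.5                          # noise standard deviation

m = 200                              # number of observations
X = rng.normal(size=(m, 2))          # inputs x_1, ..., x_m

# each response y_n is a draw from N(omega^T x_n, sigma^2)
y = X @ omega_true + rng.normal(scale=sigma, size=m)
```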

## 1.1. Maximum Likelihood Solution

• Log-likelihood:

\begin{align*} \ell (\omega) = \log L(\omega) &= \log P(D \mid \omega) \\&= \log P(Y \mid X, \omega) \\&= \log \prod\limits^m_{n=1}P\left(y_n \mid x_n, \omega \right)\\ &= \sum\limits^m_{n=1}\log P \left(y_n \mid x_n, \omega \right)\\ &= \sum\limits^m_{n=1}\log \frac{1}{\sqrt{2\pi\sigma^2}}\exp{\left(-\frac{\left(y_n - \omega^Tx_n \right)^2}{2\sigma^2}\right)}\\ &= \sum\limits^m_{n=1}\left\{ -\frac{1}{2}\log \left(2\pi\sigma^2 \right) - \frac{\left(y_n - \omega^Tx_n \right)^2}{2\sigma^2}\right\} \end{align*}
• Maximum Likelihood Solution:

\begin{align*} \hat{\omega}_{MLE} &= \arg\max_{\omega}\log P(D \mid \omega)\\ &= \arg\max_{\omega} \;- \frac{1}{2\sigma^2}\sum\limits^m_{n=1} \left(y_n-\omega^Tx_n \right)^2\\ &= \arg\min_{\omega} \frac{1}{2\sigma^2}\sum\limits^m_{n=1} \left(y_n-\omega^Tx_n \right)^2\\ &= \arg\min_{\omega} \sum\limits^m_{n=1} \left(y_n-\omega^Tx_n \right)^2 \end{align*}

• It is equivalent to the least-squares objective for linear regression
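Since maximizing the likelihood is the same as minimizing the squared errors, the MLE can be computed with an ordinary least-squares solver. A sketch on synthetic data (the true weights are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
omega_true = np.array([1.5, -2.0])            # assumed ground truth
X = rng.normal(size=(200, 2))
y = X @ omega_true + rng.normal(scale=0.5, size=200)

# MLE under Gaussian noise = argmin sum_n (y_n - omega^T x_n)^2,
# i.e. ordinary least squares
omega_mle, *_ = np.linalg.lstsq(X, y, rcond=None)
print(omega_mle)   # close to omega_true
```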

## 1.2. Maximum-a-Posteriori Solution

• Let's assume a Gaussian prior distribution over the weight vector $\omega$

$$P(\omega) = \mathcal{N}\left(\omega \mid 0, \lambda^{-1}I\right) = \left(\frac{\lambda}{2\pi}\right)^{D/2}\exp\left( -\frac{\lambda}{2} \omega^T\omega\right)$$
• Log posterior probability:

$$\log P(\omega \mid D) = \log\frac{P(\omega)P(D \mid \omega)}{P(D)} = \log P(\omega) + \log P(D \mid \omega) - \underbrace{\log P(D)}_{\text{constant}}$$
• Maximum-a-Posteriori Solution:

\begin{align*} \hat{\omega}_{MAP} &= \arg\max_{\omega} \log P(\omega \mid D)\\ &= \arg\max_{\omega}\left\{\log P(\omega) + \log P(D \mid \omega)\right\}\\ &= \arg\max_{\omega}\left\{ \text{const.} - \frac{\lambda}{2}\omega^T\omega + \sum\limits^m_{n=1}\left( -\frac{1}{2}\log\left(2\pi\sigma^2\right) - \frac{\left(y_n-\omega^Tx_n\right)^2}{2\sigma^2} \right) \right\}\\ &= \arg\min_{\omega} \left\{ \frac{1}{2\sigma^2}\sum\limits^m_{n=1}\left(y_n - \omega^Tx_n\right)^2 + \frac{\lambda}{2}\omega^T\omega \right\} \quad \text{(ignoring constants and changing max to min)} \end{align*}

• For constant noise variance (e.g., $\sigma = 1$) for each input, this is equivalent to the regularized least-squares (ridge regression) objective
• BIG Lesson: MAP with a Gaussian prior $= l_2$ norm regularization
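Setting the gradient of the MAP objective to zero gives the ridge closed form $\hat{\omega}_{MAP} = \left(X^TX + \lambda\sigma^2 I\right)^{-1}X^Ty$. A small sketch comparing MAP and MLE weights; the data and $\lambda$ are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
omega_true = np.array([1.5, -2.0])            # assumed ground truth
sigma, lam = 0.5, 10.0                        # noise std and prior precision (assumed)
X = rng.normal(size=(200, 2))
y = X @ omega_true + rng.normal(scale=sigma, size=200)

# minimizing (1/(2 sigma^2)) ||y - X omega||^2 + (lam/2) ||omega||^2
# gives the ridge closed form below
A = X.T @ X + lam * sigma**2 * np.eye(2)
omega_map = np.linalg.solve(A, X.T @ y)

omega_mle = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(omega_map) < np.linalg.norm(omega_mle))   # True: prior shrinks the weights
```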

## 1.3. Summary: MLE vs MAP

• MLE solution:
$$\hat{\omega}_{MLE} = \arg\min_{\omega}\frac{1}{2\sigma^2}\sum\limits^m_{n=1} \left(y_n - \omega^Tx_n \right)^2$$
• MAP solution:
$$\hat{\omega}_{MAP} = \arg\min_{\omega}\frac{1}{2\sigma^2}\sum\limits^m_{n=1} \left(y_n - \omega^Tx_n \right)^2 + \frac{\lambda}{2}\omega^T\omega$$
• Take-Home messages:

• MLE estimation of a parameter leads to unregularized solutions

• MAP estimation of a parameter leads to regularized solutions

• The prior distribution acts as a regularizer in MAP estimation

• Note : for MAP, different prior distributions lead to different regularizers

• Gaussian prior on $\omega$ regularizes the $l_2$ norm of $\omega$

• Laplace prior $\exp \left(-C\lVert\omega\rVert_1 \right)$ on $\omega$ regularizes the $l_1$ norm of $\omega$

# 2. Probabilistic Linear Classification

• Often we do not just care about predicting the label $y$ for an example

• Rather, we want to predict the label probabilities $P(y \mid x, \omega)$

• E.g., $P(y = +1 \mid x, \omega)$: the probability that the label is $+1$

• In a sense, it is our confidence in the predicted label

• Probabilistic classification models allow us to do that
• Consider the following compact expression for $y \in \{-1, +1\}$:

$$P(y \mid x, \omega) = \sigma \left(y\omega^Tx \right) = \frac{1}{1 + \exp \left(-y\omega^Tx \right)}$$
• $\sigma$ is the logistic (sigmoid) function, which maps any real number into $(0,1)$
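A minimal sketch of the logistic function, and of the fact that the two label probabilities sum to one:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# P(y = +1 | x, w) + P(y = -1 | x, w) = sigmoid(z) + sigmoid(-z) = 1
print(sigmoid(0.0))                      # 0.5 (the decision boundary)
print(sigmoid(3.0) + sigmoid(-3.0))      # 1.0 (up to rounding)
```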

## 2.1. Logistic Regression

• What does the decision boundary look like for logistic regression?
• At the decision boundary, the labels $-1/+1$ become equiprobable

\begin{align*} P(y= + 1 \mid x, \omega) &= P(y = -1 \mid x, \omega)\\ \frac{1}{1+\exp \left(-\omega^Tx \right)} &= \frac{1}{1+\exp \left(\omega^Tx \right)}\\ \exp \left(-\omega^Tx \right) &= \exp \left(\omega^Tx \right)\\ \omega^T x&= 0 \end{align*}
• The decision boundary is therefore linear $\implies$ logistic regression is a linear classifier
• Note: it is possible to kernelize logistic regression and make the decision boundary nonlinear

## 2.2. Maximum Likelihood Solution

• Goal: want to estimate $\omega$ from the data $D = \{ (x_1, y_1), \cdots, (x_m, y_m)\}$
• Log-likelihood:

\begin{align*} \ell (\omega) = \log L(\omega) &= \log P(D \mid \omega) \\&= \log P(Y \mid X, \omega) \\&= \log\prod\limits^m_{n=1}P(y_n \mid x_n, \omega)\\ &= \sum^m_{n=1}\log P(y_n \mid x_n, \omega)\\ &= \sum^m_{n=1}\log \frac{1}{1+\exp \left(-y_n\omega^Tx_n \right)}\\ &= \sum^m_{n=1}-\log\left[1+\exp \left(-y_n\omega^Tx_n \right)\right] \end{align*}
• Maximum Likelihood Solution:

$$\hat{\omega}_{MLE} = \arg\max_{\omega}\log L(\omega) = \arg\min_{\omega}\sum\limits^m_{n=1} \log\left[1 + \exp\left(-y_n\omega^Tx_n\right)\right]$$
• No closed-form solution exists, but we can find $\omega$ by gradient ascent on the log-likelihood (equivalently, gradient descent on its negative)

\begin{align*} \nabla_{\omega} \log L(\omega) &= \sum^m_{n=1} -\frac{1}{1 + \exp\left(-y_n\omega^Tx_n\right)}\exp\left(-y_n\omega^Tx_n\right)(-y_nx_n)\\ &= \sum^m_{n=1} \frac{1}{1 + \exp\left(y_n\omega^Tx_n\right)}y_nx_n \end{align*}
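Using this gradient, a plain gradient-ascent sketch for the logistic-regression MLE; the data-generating weights, step size, and iteration count below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
omega_true = np.array([2.0, -1.0])            # assumed ground truth
X = rng.normal(size=(500, 2))
# draw labels y_n in {-1, +1} from the logistic model
p_pos = 1.0 / (1.0 + np.exp(-X @ omega_true))
y = np.where(rng.random(500) < p_pos, 1.0, -1.0)

omega = np.zeros(2)
lr = 0.5
for _ in range(1000):
    # grad log L = sum_n y_n x_n / (1 + exp(y_n omega^T x_n))
    grad = (y / (1.0 + np.exp(y * (X @ omega)))) @ X
    omega += lr * grad / len(y)               # ascend the (averaged) log-likelihood
print(omega)                                   # approaches omega_true
```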

## 2.3. Maximum-a-Posteriori Solution

• Let's assume a Gaussian prior distribution over the weight vector $\omega$

$$P(\omega) = \mathcal{N}\left(\omega \mid 0, \lambda^{-1}I\right) = \left(\frac{\lambda}{2\pi}\right)^{D/2}\exp\left( -\frac{\lambda}{2} \omega^T\omega \right)$$
• Maximum-a-Posteriori Solution:

\begin{align*} \hat{\omega}_{MAP} &= \arg\max_{\omega} \log P(\omega \mid D) \\ &= \arg\max_{\omega}\left\{\log P(\omega) + \log P(D \mid \omega) - \underbrace{\log P(D)}_{\text{constant}} \right\}\\ &= \arg\max_{\omega}\left\{\log P(\omega) + \log P(D \mid \omega)\right\}\\ &= \arg\max_{\omega}\left\{ \text{const.} - \frac{\lambda}{2}\omega^T\omega - \sum\limits^m_{n=1} \log\left[1 + \exp\left(-y_n\omega^Tx_n\right)\right] \right\}\\ &= \arg\min_{\omega} \left\{ \sum\limits^m_{n=1} \log\left[1 + \exp\left(-y_n\omega^Tx_n\right)\right] + \frac{\lambda}{2}\omega^T\omega \right\} \quad \text{(ignoring constants and changing max to min)} \end{align*}
• BIG Lesson: MAP with a Gaussian prior $= l_2$ norm regularization
• No closed-form solution exists, but we can again use gradient descent on $\omega$
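A sketch of that gradient descent on the regularized objective; comparing against the unregularized fit shows the shrinkage effect ($\lambda$ and the data are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
omega_true = np.array([2.0, -1.0])            # assumed ground truth
X = rng.normal(size=(500, 2))
p_pos = 1.0 / (1.0 + np.exp(-X @ omega_true))
y = np.where(rng.random(500) < p_pos, 1.0, -1.0)

def fit(lam, lr=0.5, iters=1000):
    """Gradient descent on sum_n log(1 + exp(-y_n w^T x_n)) + (lam/2) ||w||^2."""
    omega = np.zeros(2)
    for _ in range(iters):
        grad_nll = -(y / (1.0 + np.exp(y * (X @ omega)))) @ X
        omega -= lr * (grad_nll + lam * omega) / len(y)
    return omega

omega_map = fit(lam=5.0)      # Gaussian prior -> l2-regularized solution
omega_mle = fit(lam=0.0)      # no prior -> unregularized MLE
print(np.linalg.norm(omega_map) < np.linalg.norm(omega_mle))   # True: prior shrinks the weights
```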

## 2.4. Summary: MLE vs MAP

• MLE solution:
$$\hat{\omega}_{MLE} = \arg\min_{\omega}\sum\limits^m_{n=1}\log\left[1 + \exp\left(-y_n\omega^Tx_n\right)\right]$$
• MAP solution:
$$\hat{\omega}_{MAP} = \arg\min_{\omega}\sum\limits^m_{n=1}\log\left[1 + \exp\left(-y_n\omega^Tx_n\right)\right] + \frac{\lambda}{2}\omega^T\omega$$
• Take-home messages (we already saw these before)

• MLE estimation of a parameter leads to unregularized solutions

• MAP estimation of a parameter leads to regularized solutions

• The prior distribution acts as a regularizer in MAP estimation

• Note: For MAP, different prior distributions lead to different regularizers

• Gaussian prior on $\omega$ regularizes the $l_2$ norm of $\omega$

• Laplace prior $\exp(-C\lVert\omega\rVert_1)$ on $\omega$ regularizes the $l_1$ norm of $\omega$

# 3. Probabilistic Clustering

• Not covered in this course

# 4. Probabilistic Dimension Reduction

• Not covered in this course