Probabilistic Machine Learning

Table of Contents





1. Probabilistic Machine Learning

In the previous chapter, we introduced the core concepts of Maximum Likelihood Estimation (MLE) and Maximum-a-Posteriori Estimation (MAP). Building on these foundations, we now revisit fundamental machine learning problems such as regression and classification through a probabilistic lens.


In classical machine learning, models are often treated as deterministic mappings: given input $x$, the model predicts output $y$. Probabilistic Machine Learning, by contrast, frames learning as a process of inference: given observed data, what is the most probable underlying structure or pattern that generated it?


This probabilistic approach offers several key advantages:

  • It provides a principled way to account for uncertainty in predictions

  • It enables reasoning about confidence, credibility intervals, and Bayesian updates

  • It naturally extends to more flexible models, such as Bayesian regression, Gaussian processes, and latent variable models



Rethinking the Role of Data

In traditional supervised learning, data is assumed to be fixed, and the model is optimized to fit it as closely as possible. In contrast, the probabilistic viewpoint assumes that the generative model comes first. This model, though unknown, gives rise to the observed data via a stochastic process.

From this perspective, the dataset is seen as a realization of a random process governed by an underlying distribution. Our goal is to infer this distribution - or aspects of it, such as its parameters - using the observed data. This leads to the broader framework of generative modeling, where learning becomes a process of probabilistic inference.




2. Probabilistic Linear Regression

2.1. Generative Model

We begin with a generative perspective: assume there exists a true, underlying linear relationship between $x$ and $y$,
but our observations are corrupted by noise.

Given observed data:


$$ D = \{(x_1, y_1), (x_2, y_2), \cdots, (x_m, y_m)\} $$


We assume each output $y_i$ is generated by a linear function of the input $x_i$ with additive Gaussian noise:


$$ y = \omega^T x + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2) $$


This leads to the conditional likelihood:


$$ P(y \mid x \,;\, \omega) = \mathcal{N}(\omega^T x, \sigma^2) $$


That is, for a given input $x$, the output $y$ is normally distributed around the mean $\omega^T x$, with variance $\sigma^2$ reflecting observation noise.
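

As a quick illustration of this generative view, the following sketch samples a synthetic dataset from the model above (the weight vector, noise level, and sample size are arbitrary illustrative choices, not values from the text):

In [ ]:
import numpy as np

np.random.seed(0)

# assumed ground-truth parameters (illustrative only)
omega_true = np.array([0.5, 2.0])       # [intercept, slope]
sigma = 0.3                             # noise standard deviation
m = 100                                 # number of observations

# inputs, stacked with a constant feature so that omega^T x includes an intercept
x = np.random.uniform(-1, 1, m)
X = np.column_stack([np.ones(m), x])

# y = omega^T x + epsilon,  epsilon ~ N(0, sigma^2)
epsilon = sigma * np.random.randn(m)
y = X @ omega_true + epsilon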



2.2. Frequentist View of Linear Regression

In the frequentist approach, the parameters $\omega$ and $\sigma^2$ are treated as fixed but unknown. Our goal is to estimate them using Maximum Likelihood Estimation (MLE).


Given the dataset:


$$ D = \{(x_1, y_1), (x_2, y_2), \cdots, (x_m, y_m)\} $$


and the generative model:


$$ y_i = \omega^T x_i + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) $$


We assume that each $y_i$ is independently sampled:


$$ P(y_i \mid x_i; \omega, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (y_i - \omega^T x_i)^2 \right) $$


Likelihood and Log-Likelihood

The likelihood of the parameters given the dataset is:


$$ \begin{align*} \mathcal{L}(\omega, \sigma^2) = P(D \mid \theta) = P(D \,;\, \theta) &= P\left(y_1,y_2,\cdots,y_m \mid x_1,x_2,\cdots,x_m; \; \underbrace{\omega, \sigma}_{\theta}\right) \\ &= \prod_{i=1}^m P(y_i \mid x_i; \omega, \sigma^2) \end{align*} $$


Taking the logarithm (log-likelihood):


$$\begin{align*} \ell (\omega,\sigma^2) = \log \mathcal{L}(\omega,\sigma^2) &= \log \prod\limits^m_{i=1}P\left(y_i \mid x_i\,; \omega,\sigma^2 \right)\\ &= \sum\limits^m_{i=1}\log P \left(y_i \mid x_i\,; \omega, \sigma^2 \right)\\ &= \sum\limits^m_{i=1}\log \frac{1}{\sqrt{2\pi\sigma^2}}\exp{\left(-\frac{\left(y_i - \omega^Tx_i \right)^2}{2\sigma^2}\right)}\\ &= \sum\limits^m_{i=1}\left\{ -\frac{1}{2}\log \left(2\pi\sigma^2 \right) - \frac{\left(y_i - \omega^Tx_i \right)^2}{2\sigma^2}\right\} \end{align*}$$
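

As a sanity check, this log-likelihood can be evaluated directly and compared against the sum of Gaussian log-densities from scipy.stats.norm (the small dataset and parameter values below are illustrative assumptions):

In [ ]:
import numpy as np
from scipy.stats import norm

# a tiny illustrative dataset: rows of X are [1, x_i]
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.6, 2.4, 4.6])

def log_likelihood(omega, sigma2, X, y):
    # sum_i { -1/2 log(2*pi*sigma^2) - (y_i - omega^T x_i)^2 / (2*sigma^2) }
    resid = y - X @ omega
    m = len(y)
    return -0.5*m*np.log(2*np.pi*sigma2) - np.sum(resid**2)/(2*sigma2)

omega = np.array([0.5, 2.0])
sigma2 = 0.1

print(log_likelihood(omega, sigma2, X, y))
print(np.sum(norm.logpdf(y, loc=X @ omega, scale=np.sqrt(sigma2))))   # should match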



Maximum Likelihood Estimation (MLE)

To estimate parameters, we maximize the log-likelihood:


$$ \hat{\theta}_{MLE} = {\arg\max_{\omega, \sigma^2}} \;\; \ell(\omega, \sigma^2) $$

(1) Focusing first on maximizing with respect to $\omega$:


$$\begin{align*} \hat{\omega}_{MLE} &= \arg\max_{\omega} \;- \frac{1}{2\sigma^2}\sum\limits^m_{i=1} \left(y_i-\omega^Tx_i \right)^2\\ &= \arg\min_{\omega} \frac{1}{2\sigma^2}\sum\limits^m_{i=1} \left(y_i-\omega^Tx_i \right)^2\\ &= \arg\min_{\omega} \sum\limits^m_{i=1} \left(y_i-\omega^Tx_i \right)^2 \end{align*}$$


This is exactly the least-squares objective from classical linear regression. (Amazing!)

Maximum likelihood estimation under a Gaussian noise model leads to the least-squares solution.


In matrix form:


$$\begin{align*} \mathcal{L}(\omega,\sigma^2) & = \prod\limits_{i=1}^{m} P\left(y_i \mid x_i; \; \omega,\sigma\right)\\ & = \frac{1}{(2\pi\sigma^2)^{m/2}}\exp\left(-\frac{1}{2\sigma^2}\sum\limits_{i=1}^m(y_i-\omega^T x_i)^2\right)\\ & = \frac{1}{(2\pi \sigma^2)^{m/2}} \exp\left( -\frac{1}{2\sigma^2} \|Y - X\omega\|^2 \right) \end{align*}$$


Taking gradients of the log-likelihood and setting them to zero:


$$\begin{align*} \ell (\omega,\sigma^2) &= -\frac{m}{2}\log{2\pi}-m\log{\sigma}-\frac{1}{2\sigma^2}\|Y - X\omega\|^2\\\\ \frac{\partial \ell}{\partial \omega} &=-\frac{1}{2\sigma^2}\left(-2X^TY+2X^TX\omega\right)=0 \\\\ &\implies \; X^TX\omega = X^TY \;\implies\; \hat \omega_{MLE}=\left(X^TX\right)^{-1}X^TY \end{align*}$$



(2) Focusing then on $\sigma$:


$$ \begin{align*} \ell (\omega,\sigma^2) &= -\frac{m}{2}\log{2\pi}-m\log{\sigma}-\frac{1}{2\sigma^2}\sum\limits_{i=1}^m\left(y_i-\omega^T x_i\right)^2 \\\\ \frac{\partial \ell}{\partial \sigma} & =-\frac{m}{\sigma}+\frac{1}{\sigma^3}\sum\limits_{i=1}^m\left(y_i-\omega^T x_i\right)^2=0 \quad \\\\ & \implies \; \hat \sigma^2_{MLE}=\frac{1}{m}\sum\limits_{i=1}^m\left(y_i-\hat\omega_{MLE}^T x_i\right)^2 \end{align*} $$
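

Both closed-form estimators take only a few lines of NumPy. The following is a minimal sketch on synthetic data (the true parameters are arbitrary choices used only to generate the example):

In [ ]:
import numpy as np

np.random.seed(1)

# synthetic data generated from the assumed linear-Gaussian model
m = 200
x = np.random.uniform(-1, 1, m)
X = np.column_stack([np.ones(m), x])                   # design matrix with a bias column
y = X @ np.array([0.5, 2.0]) + 0.3*np.random.randn(m)

# omega_MLE = (X^T X)^{-1} X^T y, computed by solving the normal equations
omega_mle = np.linalg.solve(X.T @ X, X.T @ y)

# sigma^2_MLE = (1/m) * sum of squared residuals
resid = y - X @ omega_mle
sigma2_mle = np.mean(resid**2)

print(omega_mle, sigma2_mle)   # close to [0.5, 2.0] and 0.3^2 = 0.09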


Big Lessons

  • MLE under Gaussian noise leads directly to least-squares regression
  • Probabilistic modeling justifies the use of least-squares from first principles

Lab:

The following example demonstrates that if a linear model is correctly specified and well estimated, then the residual errors (i.e., the differences between predicted and observed values) should approximately follow a Gaussian distribution:


$$ \varepsilon = y - \omega^T x \sim \mathcal{N}(0, \sigma^2) $$


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

%matplotlib inline
In [ ]:
m = 200

a = 1
x = 3 + 2*np.random.uniform(0,1,[m,1])
noise = 0.1*np.random.randn(m,1)

y = a*x + noise
y = np.asmatrix(y)

plt.figure(figsize = (6, 6))
plt.plot(x, y, 'b.')
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
(Output: scatter plot of the generated data)
In [ ]:
# compute theta, the coefficients of y = theta[1]*x + theta[0] (theta[0]: intercept, theta[1]: slope)
A = np.hstack([np.ones([m, 1]), x])
A = np.asmatrix(A)

theta = (A.T*A).I*A.T*y

# to plot the fitted line
xp = np.linspace(np.min(x), np.max(x))
yp = theta[1,0]*xp + theta[0,0]

plt.figure(figsize = (6, 6))
plt.plot(x, y, 'b.')
plt.plot(xp, yp, 'r', linewidth = 3)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
(Output: data with the fitted regression line)
In [ ]:
yhat0 = theta[1,0]*x + theta[0,0]
err0 = yhat0 - y

yhat1 = 1.2*x - 1
err1 = yhat1 - y

yhat2 = 1.3*x - 1
err2 = yhat2 - y

plt.figure(figsize = (10, 6))
plt.subplot(2,3,1), plt.plot(x,y,'b.',x,yhat0,'r'),
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha = 0.3)
plt.subplot(2,3,2), plt.plot(x,y,'b.',x,yhat1,'r'),
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha = 0.3)
plt.subplot(2,3,3), plt.plot(x,y,'b.',x,yhat2,'r'),
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha = 0.3)
plt.subplot(2,3,4), plt.hist(err0,31), plt.axvline(0, color = 'k'),
plt.xlabel(r'$\epsilon$', fontsize=15),
plt.yticks([]), plt.axis([-1, 1, 0, 15]), plt.grid(alpha = 0.3)
plt.subplot(2,3,5), plt.hist(err1,31), plt.axvline(0, color = 'k'),
plt.xlabel(r'$\epsilon$', fontsize=15),
plt.yticks([]), plt.axis([-1, 1, 0, 15]), plt.grid(alpha = 0.3)
plt.subplot(2,3,6), plt.hist(err2,31), plt.axvline(0, color = 'k'),
plt.xlabel(r'$\epsilon$', fontsize=15),
plt.yticks([]), plt.axis([-1, 1, 0, 15]), plt.grid(alpha = 0.3)
plt.show()
(Output: three candidate fits (top) with their residual histograms (bottom))
In [ ]:
a0x = err0[1:]
a0y = err0[0:-1]

a1x = err1[1:]
a1y = err1[0:-1]

a2x = err2[1:]
a2y = err2[0:-1]

plt.figure(figsize = (10, 6))
plt.subplot(2,3,1), plt.plot(x,y,'b.',x,yhat0,'r'),
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha=0.3)
plt.subplot(2,3,2), plt.plot(x,y,'b.',x,yhat1,'r'),
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha=0.3)
plt.subplot(2,3,3), plt.plot(x,y,'b.',x,yhat2,'r'),
plt.axis([2.9, 5.1, 2.9, 5.1]), plt.grid(alpha=0.3)

plt.subplot(2,3,4), plt.plot(a0x, a0y, '.'),
plt.axis('equal'), plt.axis([-0.7, 0.7, -0.7, 0.7]), plt.grid(alpha=0.3)
plt.xlabel(r'$\epsilon_i$', fontsize=15), plt.ylabel(r'$\epsilon_{i-1}$', fontsize=15)

plt.subplot(2,3,5), plt.plot(a1x, a1y, '.'),
plt.axis('equal'), plt.axis([-0.7, 0.7, -0.7, 0.7]), plt.grid(alpha=0.3)
plt.xlabel(r'$\epsilon_i$', fontsize=15), plt.ylabel(r'$\epsilon_{i-1}$', fontsize=15)

plt.subplot(2,3,6), plt.plot(a2x, a2y, '.'),
plt.axis('equal'), plt.axis([-0.7, 0.7, -0.7, 0.7]), plt.grid(alpha=0.3)
plt.xlabel(r'$\epsilon_i$', fontsize=15), plt.ylabel(r'$\epsilon_{i-1}$', fontsize=15)
plt.show()
(Output: three candidate fits (top) with lag plots of their residuals, $\epsilon_i$ versus $\epsilon_{i-1}$ (bottom))

2.3. Bayesian View of Linear Regression

In the frequentist approach, we estimate the parameters $\omega$ by maximizing the likelihood (MLE), which implicitly assumes no prior information or a uniform prior over $\omega$. In contrast, the Bayesian approach treats $\omega$ as a random variable and incorporates prior beliefs about its distribution.

Bayesian linear regression not only provides a point estimate of $\omega$, but also quantifies uncertainty through its posterior distribution. This enables a principled approach to regularization and confidence assessment.


2.3.1. A Prior on $\omega$


Suppose we assume a Gaussian prior over the weight vector $\omega = \begin{bmatrix} \omega_0 \\ \omega_1 \end{bmatrix}$:


$$ P(\omega) = \mathcal{N}(0, \Sigma) = \mathcal{N}\left(0, \lambda^{-1} I\right) $$


This expresses our belief that $\omega$ is centered at zero, with uncertainty controlled by $\lambda$. A large $\lambda$ implies strong confidence that weights should be small (i.e., a stronger regularization effect).


The prior probability density, where $D$ is the dimension of $\omega$, is:


$$ P(\omega) = \left(\frac{\lambda}{2\pi}\right)^{D/2} \exp\left( -\frac{\lambda}{2} \omega^T \omega \right) $$


This means that before seeing any data, we believe the weights are centered around zero, with variance $\lambda^{-1}$ set by the precision parameter $\lambda$.

Understanding this assumption is important: it formalizes the idea that we prefer simpler models (with smaller weights), unless the data provides strong evidence otherwise. This is the Bayesian interpretation of regularization.
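

One way to build intuition for this prior is to sample a few weight vectors from $\mathcal{N}(0, \lambda^{-1}I)$ and plot the lines they define. This is a rough sketch (the value of $\lambda$ is an arbitrary illustration); increasing $\lambda$ concentrates the sampled lines around zero slope and intercept:

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2)

lam = 2.0                      # prior precision (illustrative value)
D = 2                          # omega = [omega_0, omega_1]

# draw weight vectors from the prior N(0, lambda^{-1} I)
omegas = np.random.multivariate_normal(np.zeros(D), np.eye(D)/lam, size=10)

# each sample defines a line y = omega_0 + omega_1 * x: our beliefs before seeing data
xp = np.linspace(-3, 3, 100)
plt.figure(figsize=(6, 4))
for w in omegas:
    plt.plot(xp, w[0] + w[1]*xp, alpha=0.6)
plt.grid(alpha=0.3)
plt.show()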



2.3.2. Posterior Distribution

By Bayes' Rule, the posterior distribution over $\omega$ given data $D$ is:


$$P(\omega \mid D) = \frac{P(D \mid \omega) P(\omega)}{P(D)}$$


Taking logarithms, we get the log-posterior:


$$\log P(\omega \mid D) = \log\frac{P(D \mid \omega)P(\omega)}{P(D)} = \log P(D \mid \omega) + \log P(\omega) - \underbrace{\log P(D)}_{\text{constant}}$$


Since $P(D)$ does not depend on $\omega$, we can treat it as a constant in optimization.


2.3.3. Maximum-a-Posteriori (MAP)

To find the most likely value of $\omega$ given the data (i.e., the mode of the posterior), we use MAP estimation:


$$\begin{align*} \hat{\omega}_{MAP} &= \arg\max_{\omega} \; \log P(\omega \mid D)\\\\ &= \arg\max_{\omega} \; \left\{\log P(D \mid \omega) + \log P(\omega)\right\}\\\\ &= \arg\max_{\omega} \; \left\{ \sum\limits^m_{i=1}\left\{ -\frac{1}{2}\log\left(2\pi\sigma^2\right) - \frac{\left(y_i-\omega^Tx_i\right)^2}{2\sigma^2} \right\} +\frac{D}{2}\log \frac{\lambda}{2\pi} - \frac{\lambda}{2}\omega^T\omega \right\}\\\\ &= \arg\min_{\omega} \; \frac{1}{2\sigma^2}\sum\limits^m_{i=1}\left(y_i - \omega^Tx_i\right)^2 + \frac{\lambda}{2}\omega^T\omega \quad \\&\text{ (ignoring constants and changing max to min)} \end{align*}$$


Regularized Least Squares

If we set $\sigma^2 = 1$ and scale the objective by $2$ (which does not change the minimizer), the MAP problem becomes:


$$\hat{\omega}_{MAP} = \arg\min_{\omega} \left\{\sum\limits^m_{i=1}\left(y_i - \omega^Tx_i\right)^2 + \lambda\omega^T\omega \right\}$$


This is equivalent to ridge regression, also known as $L_2$-regularized linear regression. (Amazing!)
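

Setting the gradient of this objective to zero gives a closed-form solution, $\hat{\omega}_{MAP} = (X^TX + \lambda I)^{-1}X^TY$ (with $\sigma^2 = 1$). The sketch below compares it with the unregularized MLE on synthetic data; the data and $\lambda$ are illustrative choices:

In [ ]:
import numpy as np

np.random.seed(3)

# synthetic data with a bias column (illustrative)
m = 200
x = np.random.uniform(-1, 1, m)
X = np.column_stack([np.ones(m), x])
y = X @ np.array([0.5, 2.0]) + 0.3*np.random.randn(m)

lam = 10.0   # prior precision = regularization strength (illustrative)

# ridge / MAP solution: solve (X^T X + lambda*I) omega = X^T y
omega_map = np.linalg.solve(X.T @ X + lam*np.eye(X.shape[1]), X.T @ y)

# unregularized MLE / least-squares solution for comparison
omega_mle = np.linalg.solve(X.T @ X, X.T @ y)

print(omega_map)   # shrunk toward zero relative to omega_mle
print(omega_mle)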


Big Lesson

  • MLE assumes no prior and yields the least-squares solution
  • MAP incorporates a Gaussian prior and yields $L_2$ regularization
  • Bayesian inference provides a natural framework to balance data fit and model complexity

MAP estimation illustrates how Bayesian reasoning not only extends classical approaches but also provides a principled foundation for regularization and improved generalization.



2.4. Summary: MLE vs MAP

Let us compare Maximum Likelihood Estimation (MLE) and Maximum a Posteriori (MAP) estimation in the context of linear regression with Gaussian noise.


MLE Solution

In MLE, we assume no prior over the parameters. We estimate $\omega$ by maximizing the likelihood of the observed data:


$$ \hat{\omega}_{MLE} = \arg\min_{\omega} \frac{1}{2\sigma^2} \sum_{i=1}^m \left(y_i - \omega^T x_i \right)^2 $$


This corresponds to the classical least-squares regression.


MAP Solution

In MAP estimation, we incorporate a prior belief over the parameters. Assume a Gaussian prior:


$$\omega \sim \mathcal{N}(0, \lambda^{-1} I)$$


Then the MAP estimator becomes:


$$ \hat{\omega}_{MAP} = \arg\min_{\omega} \left\{ \frac{1}{2\sigma^2} \sum_{i=1}^m \left(y_i - \omega^T x_i \right)^2 + \frac{\lambda}{2} \omega^T \omega \right\} $$


This is equivalent to $L_2$-regularized least-squares, also known as ridge regression.


Take-Home Messages

  • MLE fits the data without any regularization. It assumes no prior information and seeks to minimize prediction error alone.

  • MAP balances data fit with prior belief. The prior acts as a regularizer, discouraging large weights and controlling model complexity.

  • This probabilistic framework reveals that regularization is not an ad hoc technique, but a natural consequence of incorporating prior knowledge into the estimation process. It provides deeper insight into the trade-off between model complexity and data fidelity.


Note

Different prior distributions correspond to different regularization penalties (a short code comparison follows this list):


  • Gaussian prior:

    $$P(\omega) \propto \exp(-\lambda \|\omega\|_2^2)$$

    ⟶ $L_2$ regularization (ridge regression)


  • Laplace prior:

    $$P(\omega) \propto \exp(-\lambda \|\omega\|_1)$$

    ⟶ $L_1$ regularization (Lasso)
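

The sketch below contrasts the two penalties using scikit-learn's Ridge and Lasso estimators (this assumes scikit-learn is installed; the data, true weights, and regularization strengths are arbitrary illustrations):

In [ ]:
import numpy as np
from sklearn.linear_model import Ridge, Lasso

np.random.seed(4)

# synthetic data in which only the first two of five features matter
m, d = 100, 5
X = np.random.randn(m, d)
w_true = np.array([2.0, -1.0, 0.0, 0.0, 0.0])
y = X @ w_true + 0.1*np.random.randn(m)

ridge = Ridge(alpha=1.0).fit(X, y)   # Gaussian prior -> L2 penalty: shrinks all weights
lasso = Lasso(alpha=0.1).fit(X, y)   # Laplace prior -> L1 penalty: sets some weights exactly to zero

print('ridge:', np.round(ridge.coef_, 3))
print('lasso:', np.round(lasso.coef_, 3))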





3. Probabilistic Linear Classification

We have studied probabilistic linear regression. Now we turn to classification, another fundamental task in machine learning.

In classification problems, we are not only interested in assigning a label (e.g., $y = +1$ or $y = -1$), but also in quantifying how confident we are in that prediction.


This naturally motivates a probabilistic approach: instead of deterministically outputting a class label, we model the probability distribution over possible labels:


$$ P(y \mid x, \omega) = \text{Confidence in the label prediction} $$


3.1. Logistic Regression

We consider a probabilistic classifier for binary labels $y \in \{-1, +1\}$, defined by:


$$ P(y \mid x, \omega) = \sigma(y \omega^T x) = \frac{1}{1 + \exp(-y \omega^T x)} $$


Here, $\sigma(\cdot)$ is the logistic function, which maps any real number into the range $(0, 1)$.


Decision Boundary

The decision boundary is defined where both class probabilities are equal:


$$\begin{align*} P(y= + 1 \mid x, \omega) &= P(y = -1 \mid x, \omega)\\\\ \frac{1}{1+\exp \left(-\omega^Tx \right)} &= \frac{1}{1+\exp \left(\omega^Tx \right)}\\\\ \exp \left(-\omega^Tx \right) &= \exp \left(\omega^Tx \right)\\\\ \omega^T x&= 0 \end{align*}$$

The decision boundary is thus linear, confirming that logistic regression is a linear classifier. However, the output is probabilistic, not deterministic.
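

A small sketch of this parameterization (the weights and inputs are arbitrary illustrations): for any $x$, the two label probabilities $\sigma(\omega^Tx)$ and $\sigma(-\omega^Tx)$ sum to one, and the predicted label flips exactly where $\omega^Tx = 0$.

In [ ]:
import numpy as np

def sigmoid(z):
    # logistic function: maps any real number into (0, 1)
    return 1.0/(1.0 + np.exp(-z))

# illustrative weights and inputs (first column of X is a bias feature)
omega = np.array([-1.0, 2.0])
X = np.array([[1.0, -1.0], [1.0, 0.6], [1.0, 3.0]])

p_pos = sigmoid(X @ omega)     # P(y = +1 | x, omega)
p_neg = sigmoid(-(X @ omega))  # P(y = -1 | x, omega)

print(p_pos + p_neg)           # always 1: the two labels share the probability mass
print(np.sign(X @ omega))      # predicted label: sign of omega^T x (the linear decision boundary)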



Note: logistic regression can also be kernelized to obtain a nonlinear decision boundary.


3.2. Maximum Likelihood Solution

Given training data $D = \{(x_1, y_1), \cdots, (x_m, y_m)\}$, we estimate $\omega$ by maximizing the log-likelihood:


Log-likelihood function:


$$\begin{align*} \ell (\omega) = \log \mathcal{L}(\omega) &= \log P(D \mid \omega) \\&= \log P(Y \mid X, \omega) \\&= \log\prod\limits^m_{i=1}P(y_i \mid x_i, \omega)\\ &= \sum^m_{i=1}\log P(y_i \mid x_i, \omega)\\ &= \sum^m_{i=1}\log \frac{1}{1+\exp \left(-y_i\omega^Tx_i \right)}\\ &= \sum^m_{i=1}-\log\left[1+\exp \left(-y_i\omega^Tx_i \right)\right] \end{align*}$$

MLE solution:


$$\hat{\omega}_{MLE} = \arg\max_{\omega}\;\log \mathcal{L}(\omega) = \arg\min_{\omega}\;\sum\limits^m_{i=1} \log\left[1 + \exp\left(-y_i\omega^Tx_i\right)\right]$$


This is a convex optimization problem but has no closed-form solution. Gradient descent is commonly used.


The gradient of the log-likelihood is:


$$\begin{align*} \nabla_{\omega} \log \mathcal{L}(\omega) &= \sum^m_{i=1} -\frac{1}{1 + \exp\left(-y_i\omega^Tx_i\right)}\exp\left(-y_i\omega^Tx_i\right)(-y_i x_i)\\ &= \sum^m_{i=1} \frac{1}{1 + \exp\left(y_i\omega^Tx_i\right)}y_i x_i \end{align*}$$
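

Below is a minimal gradient-ascent sketch for this MLE problem (the synthetic data, learning rate, and iteration count are arbitrary choices):

In [ ]:
import numpy as np

np.random.seed(5)

# synthetic binary data with labels in {-1, +1}, generated from an assumed logistic model
m = 200
X = np.column_stack([np.ones(m), np.random.randn(m, 2)])   # bias + two features
w_true = np.array([0.5, 2.0, -1.0])
y = np.where(np.random.rand(m) < 1/(1 + np.exp(-X @ w_true)), 1, -1)

def grad_log_likelihood(w):
    # sum_i y_i x_i / (1 + exp(y_i w^T x_i)), as derived above
    z = y*(X @ w)
    return X.T @ (y/(1 + np.exp(z)))

w = np.zeros(X.shape[1])
lr = 0.01
for _ in range(5000):
    w += lr*grad_log_likelihood(w)      # ascend the log-likelihood

print(w)   # roughly recovers w_true (up to sampling noise)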


3.3. Maximum-a-Posteriori Estimation (MAP)

To incorporate prior knowledge, we place a Gaussian prior on $\omega$:


$$P(\omega) = \mathcal{N}\left( 0, \lambda^{-1}I\right) = \left(\frac{\lambda}{2\pi}\right)^{D/2}\exp\left( -\frac{\lambda}{2} \omega^T\omega \right)$$


Then, applying Bayes’ rule and maximizing the log-posterior, the MAP estimate becomes:


$$\begin{align*} \hat{\omega}_{MAP} &= \arg\max_{\omega}\; \log P(\omega \mid D) \\ &= \arg\max_{\omega} \; \{\log P(D \mid \omega) + \log P(\omega) - \underbrace{\log P(D)}_{\text{constant}} \}&\\ &= \arg\max_{\omega}\; \{\log P(D \mid \omega) + \log P(\omega) \}&\\ &= \arg\max_{\omega}\; \bigg\{ \sum\limits^m_{i=1} - \log\left[1 + \exp\left(-y_i\omega^Tx_i\right)\right] +\frac{D}{2}\log \frac{\lambda}{2\pi} - \frac{\lambda}{2}\omega^T\omega \bigg\}&\\ &= \arg\min_{\omega}\; \sum\limits^m_{i=1} \log\left[1 + \exp\left(-y_i\omega^Tx_i\right)\right] + \frac{\lambda}{2}\omega^T\omega \quad \\&\text{ (ignoring constants and changing max to min)}& \end{align*}$$


This is equivalent to $L_2$-regularized logistic regression.
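

For a gradient-based solver, the only change relative to the MLE sketch above is an extra $\lambda\omega$ term in the gradient of the objective. A minimal sketch (data, $\lambda$, step size, and iteration count are illustrative):

In [ ]:
import numpy as np

np.random.seed(6)

# same kind of synthetic {-1, +1} data as in the MLE sketch (illustrative)
m = 200
X = np.column_stack([np.ones(m), np.random.randn(m, 2)])
w_true = np.array([0.5, 2.0, -1.0])
y = np.where(np.random.rand(m) < 1/(1 + np.exp(-X @ w_true)), 1, -1)

lam = 1.0   # prior precision = regularization strength (illustrative)

def grad_map_objective(w):
    # gradient of  sum_i log(1 + exp(-y_i w^T x_i)) + (lam/2) w^T w
    z = y*(X @ w)
    return -X.T @ (y/(1 + np.exp(z))) + lam*w

w = np.zeros(X.shape[1])
for _ in range(5000):
    w -= 0.01*grad_map_objective(w)     # descend the regularized objective

print(w)   # compared with the MLE weights, these are shrunk toward zero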


Notes:

  • Although MAP has no closed-form solution, the objective is convex and can be efficiently solved using gradient descent (as we saw in the previous chapter).

  • See "A comparison of numerical optimizers for logistic regression" by Tom Minka for a detailed discussion of optimization methods (gradient descent and others) used for both MLE and MAP in logistic regression.


3.4. Summary: MLE vs MAP

MLE solution:

$$\hat{\omega}_{MLE} = \arg\min_{\omega}\sum\limits^m_{i=1}\log\left[1 + \exp\left(-y_i\omega^Tx_i\right)\right]$$


MAP solution:

$$\hat{\omega}_{MAP} = \arg\min_{\omega}\sum\limits^m_{i=1}\log\left[1 + \exp\left(-y_i\omega^Tx_i\right)\right] + \frac{\lambda}{2}\omega^T\omega$$


Take-Home Messages

  • MLE yields unregularized models

  • MAP yields regularized models via the prior

  • The prior distribution acts as a regularizer in MAP estimation


Note: For MAP, different prior distributions lead to different regularizers

  • Gaussian prior on $\omega$ regularizes the $L_2$ norm of $\omega$

  • Laplace prior $\exp(-\lambda\lVert\omega\rVert_1)$ on $\omega$ regularizes the $L_1$ norm of $\omega$




4. Probabilistic Clustering

  • Not covered in this course




5. Probabilistic Dimension Reduction

  • Not covered in this course

In [ ]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')