Bayesian Machine Learning

Table of Contents

Sources

Probabilistic Programming and Bayesian Methods for Hackers by Cameron Davidson-Pilon
Student Dave's Tutorials
S. Lall at Stanford
Reza Shadmehr at Johns Hopkins University

1. Bayesian Decision Theory¶

When we first studied classification, we began by assuming that the decision boundary is linear. This assumption served as a starting point, allowing us to construct simple and interpretable models such as SVM and logistic regression.

However, at that stage, the linearity of the boundary was merely a modeling choice, not something rigorously derived from the underlying probabilistic structure of the data.

In this section, we will mathematically investigate when and why this assumption is valid.

We begin with the foundational framework for making optimal classification decisions under uncertainty. Bayesian decision theory combines class-conditional probability distributions with prior beliefs to derive optimal rules for decision-making.

Suppose the data $x \in \mathbb{R}$ is drawn from one of two classes, $\mathcal{C}_1$ or $\mathcal{C}_2$, each described by its probability density function (pdf) and cumulative distribution function (cdf):

$$ \begin{align*} f_1(x) &= \frac{\partial{F_1(x)}}{\partial x}\\\\ f_2(x) &= \frac{\partial{F_2(x)}}{\partial x}\\\\ \end{align*} $$

Assume the class-conditional distributions are Gaussian, with $\mu_1 < \mu_2$:

$$ x \sim \begin{cases} \mathcal{N}(\mu_1, \sigma_1^2), \quad \text{if } x \in \mathcal{C}_1\\\\ \mathcal{N}(\mu_2, \sigma_2^2), \quad \text{if } x \in \mathcal{C}_2 \end{cases} $$

Since this is a binary classification problem in one dimension, we aim to find an optimal decision threshold $\omega$ between $\mu_1$ and $\mu_2$. We decide class membership as:

$$ \begin{cases} \text{if } x < \omega \; \implies \; x \in \mathcal{C}_1 \\ \text{if } x > \omega \; \implies \; x \in \mathcal{C}_2 \end{cases} $$

1.1. Optimal Boundary for Classes¶

(1) Minimizing Probability of Misclassification

To minimize the probability of misclassification, we define the total error as:

$$ \begin{align*} P(\text{error}) &= P(x> \omega, x \in \mathcal{C}_1) + P(x< \omega, x \in \mathcal{C}_2) \\\\ &= P(x> \omega \mid x \in \mathcal{C}_1)P(x \in \mathcal{C}_1) + P(x< \omega \mid x \in \mathcal{C}_2)P(x \in \mathcal{C}_2)\\\\ &= \left(1-F_1(\omega)\right)\,\pi_1 + F_2(\omega)\, \pi_2 \end{align*} $$

Here, the prior class probabilities are:

$$ \begin{align*} P(x \in \mathcal{C}_1) = \pi_1\\ P(x \in \mathcal{C}_2) = \pi_2 \end{align*} $$

To find the optimal threshold $\omega$, we minimize this error probability:

$$\min_{\omega}P(\text{error}) = \min_{\omega} \left\{ \left(1-F_1(\omega)\right)\,\pi_1 + F_2(\omega)\, \pi_2 \right\}$$

Taking the derivative:

$$ \begin{align*} \frac{d }{d \omega}P(\text{error}) &= -f_1(\omega) \, \pi_1 + f_2(\omega) \, \pi_2 = 0\\\\ & \implies f_1(\omega) \, \pi_1 = f_2(\omega) \, \pi_2 \end{align*} $$

This condition defines the optimal decision boundary.

(2) Equal Posterior Probabilities

An equivalent way to derive the decision rule is to compare the posterior probabilities. The optimal decision boundary is where both posteriors are equal:

$$ \begin{align*} P(x \in \mathcal{C}_1 \mid X=x) &= P(x \in \mathcal{C}_2 \mid X=x)\\\\ \frac{P(X=x \mid x \in \mathcal{C}_1)P(x \in \mathcal{C}_1)}{P(X=x)} &= \frac{P(X=x \mid x \in \mathcal{C}_2)P(x \in \mathcal{C}_2)}{P(X=x)}\\\\ \frac{f_1(x) \pi_1}{P(x)} &= \frac{f_2(x) \pi_2}{P(x)}\\\\ f_1(x)\, \pi_1 &= f_2(x)\, \pi_2 \end{align*} $$

This yields the same optimality condition as before. This confirms that minimizing classification error and equating posterior probabilities lead to the same decision boundary.

1.2. Boundaries for Gaussian Distributions¶

To deepen our understanding of Bayesian classification, we now consider the multivariate case where each class follows a multivariate Gaussian distribution:

$$f(x) = \frac{1}{\sqrt{(2\pi)^d\lvert \Sigma\rvert}}\exp{\left(-\frac{1}{2} (x-\mu)^T\Sigma^{-1}(x-\mu)\right)}$$

The decision boundary is defined by the point at which the posterior probabilities are equal:

$$f_1(x)\, \pi_1 = f_2(x)\, \pi_2$$

Substituting the Gaussian densities gives:

$$\frac{1}{\sqrt{(2\pi)^d\lvert \Sigma_1\rvert}}\exp{\left(-\frac{1}{2} (x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1)\right)} \, \pi_1= \frac{1}{\sqrt{(2\pi)^d\lvert \Sigma_2\rvert}}\exp{\left(-\frac{1}{2} (x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2)\right)} \, \pi_2$$

Solving this equation yields the decision boundary. Let’s consider two important cases.

(1) Equal Covariance Case

Assume $\Sigma_1 = \Sigma_2 = \Sigma$. The expression simplifies:

$$ \begin{align*} \frac{1}{\sqrt{(2\pi)^d\lvert \Sigma\rvert}}\exp{\left(-\frac{1}{2} (x-\mu_1)^T\Sigma^{-1}(x-\mu_1)\right)} \, \pi_1 &= \frac{1}{\sqrt{(2\pi)^d\lvert \Sigma\rvert}}\exp{\left(-\frac{1}{2} (x-\mu_2)^T\Sigma^{-1}(x-\mu_2)\right)} \, \pi_2 \\\\ \exp{\left(-\frac{1}{2} (x-\mu_1)^T\Sigma^{-1}(x-\mu_1)\right)} \, \pi_1 &= \exp{\left(-\frac{1}{2} (x-\mu_2)^T\Sigma^{-1}(x-\mu_2)\right)} \, \pi_2 \end{align*} $$

Taking logarithms:

$$ \begin{align*} -(x-\mu_1)^T\Sigma^{-1}(x-\mu_1) + 2 \ln \pi_1 &= -(x-\mu_2)^T\Sigma^{-1}(x-\mu_2) + 2 \ln \pi_2 \\\\ -x^T\Sigma^{-1} x + x^T\Sigma^{-1} \mu_1 + \mu_1 \Sigma^{-1} x -\mu_1^T\Sigma^{-1} \mu_1 + 2 \ln \pi_1 &= -x^T\Sigma^{-1} x + x^T\Sigma^{-1} \mu_2 + \mu_2 \Sigma^{-1} x -\mu_2^T\Sigma^{-1} \mu_2 + 2 \ln \pi_2 \end{align*} $$

$$2 \left( \Sigma^{-1}(\mu_2 - \mu_1)\right)^T x + \left( \mu_1^T\Sigma^{-1}\mu_1 - \mu_2^T\Sigma^{-1}\mu_2 \right) + 2 \ln \frac{\pi_2}{\pi_1}= \color{red}{a^Tx + b =0}$$

This is a linear equation in $x$, which defines a hyperplane.

$$ a^T x + b = 0 $$

Thus, when covariances are equal, the decision boundary is linear (Amazing !). In other words, linear decision boundaries arise naturally when the class-conditional distributions are Gaussian and share the same covariance matrix.

(2) Unequal Covariance Case

Now consider $\Sigma_1 \ne \Sigma_2$. Begin with:

Starting from:

$$ \begin{align*} \frac{1}{\sqrt{(2\pi)^d\lvert \Sigma_1\rvert}}\exp{\left(-\frac{1}{2} (x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1)\right)} \, \pi_1 &= \frac{1}{\sqrt{(2\pi)^d\lvert \Sigma_2\rvert}}\exp{\left(-\frac{1}{2} (x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2)\right)} \, \pi_2 \\\\ \frac{1}{\sqrt{(\lvert \Sigma_1\rvert}}\exp{\left(-\frac{1}{2} (x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1)\right)} \, \pi_1 &= \frac{1}{\sqrt{(\lvert \Sigma_2\rvert}}\exp{\left(-\frac{1}{2} (x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2)\right)} \, \pi_2 \\\\ \end{align*} $$

Taking logarithms and expanding:

$$ \begin{align*} -\ln \left(\lvert \Sigma_1\rvert \right) -(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1) + 2 \ln \pi_1 &= -\ln \left(\lvert \Sigma_2\rvert \right) -(x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2) + 2 \ln \pi_2 \\\\ -\ln \left(\lvert \Sigma_1\rvert \right) -x^T\Sigma_1^{-1} x + x^T\Sigma_1^{-1} \mu_1 + \mu_1 \Sigma_1^{-1} x -\mu_1^T\Sigma_1^{-1} \mu_1 + 2 \ln \pi_1 &= -\ln \left(\lvert \Sigma_2\rvert \right) -x^T\Sigma_2^{-1} x + x^T\Sigma_2^{-1} \mu_2 + \mu_2 \Sigma_2^{-1} x -\mu_2^T\Sigma_2^{-1} \mu_2 + 2 \ln \pi_2 \end{align*} $$

$$x^T(\Sigma_1 - \Sigma_2)^{-1}x + 2\left( \Sigma_2^{-1}\mu_2 - \Sigma_1^{-1} \mu_1\right)^T x + \left( \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_2^T \Sigma_2^{-1} \mu_2 \right) - \ln \frac{\lvert \Sigma_2\rvert }{\lvert \Sigma_1\rvert} + 2 \ln \frac{\pi_2}{\pi_1} = \color{red}{x^TAx + b^Tx + b =0}$$

This is a quadratic form in $x$:

$$x^TAx + b^Tx + b =0$$

When covariances are unequal, the decision boundary is quadratic.

Summary: Decision Boundaries

If class covariances are equal: decision boundaries are linear (hyperplanes)
If class covariances are unequal: decision boundaries are quadratic

This highlights the importance of understanding distributional assumptions before choosing a model for classification.

1.3. Python Implementation in 1D¶

To build intuition, let's now explore how class distributions and priors affect the location of the decision boundary in a one-dimensional binary classification task.

Case 1: Equal Variance and Equal Prior

Assume the following:

$$\sigma_0 = \sigma_1 \qquad \text{and} \qquad P(y=0)=P(y=1)=\frac{1}{2}$$

Under these assumptions, the marginal likelihood becomes:

$$P(x) = P(x \mid y=0) P(y=0) + P(x \mid y=1)P(y=1) = \frac{1}{2}\left\{ P(x \mid y=0) + P(x \mid y=1)\right\}$$

To determine the decision boundary, we compare posterior probabilities:

$$ P(y = 0 \mid x) = P(y = 1 \mid x) $$

Applying Bayes' rule:

$$ \frac{P(x \mid y = 0) P(y = 0)}{P(x)} = \frac{P(x \mid y = 1) P(y = 1)}{P(x)} \quad \Rightarrow \quad P(x \mid y = 0) = P(x \mid y = 1) $$

For Gaussian class-conditionals with equal variances, this equality holds exactly when $x$ is the midpoint between the two means:

$$ \omega = \frac{\mu_0 + \mu_1}{2} $$

It means that if both classes have the same prior distribution, it reflects no prior preference for one class over the other. If the variances are also equal, it implies that the class-conditional distributions differ only in their means. In this scenario, the optimal decision boundary lies at the midpoint between the two means.

Summary via Graphical Illustration

Lecture 23 in Learning Theory (Reza Shadmehr, Johns Hopkins University)

from IPython.display import YouTubeVideo
YouTubeVideo('3r5SlvjJptM', width = "560", height = "315")

Python Coding

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy.stats import norm
from matplotlib import animation
%matplotlib inline

x = np.arange(130, 200)

mu0 = 160
mu1 = 173
sigma0 = 6.5
sigma1 = 6.5

L1 = norm.pdf(x, mu0, sigma0)
L2 = norm.pdf(x, mu1, sigma1)

prior0 = 1/2
prior1 = 1/2

Px = L1*prior0 + L2*prior1
posterior0 = (L1*prior0)/Px
posterior1 = (L2*prior1)/Px

var1 = posterior1 - posterior1**2;

plt.figure(figsize = (7, 8))
plt.subplot(4, 1, 1)
plt.plot(x, L1, 'r', x, L2, 'b')
plt.axis([135, 195, 0, 0.1])
plt.text(mu0-10,0.08, 'P(x|y=0)', color = 'r', fontsize = 12)
plt.text(mu1,0.08, 'P(x|y=1)', color = 'b', fontsize = 12)

plt.subplot(4, 1, 2),
plt.plot(x, L1*prior0, 'r', x, L2*prior1, 'b', x, Px, 'k')
plt.axis([135, 195, 0, 0.1])
plt.text(mu0-14, 0.05, 'P(x|y=0)P(y=0)', color = 'r', fontsize = 12)
plt.text(mu1, 0.05, 'P(x|y=1)P(y=1)', color = 'b', fontsize = 12)
plt.text(mu0+5, 0.06, 'P(x)', fontsize = 12)

plt.subplot(4, 1, 3),
plt.plot(x, posterior0, 'r', x, posterior1, 'b')
plt.axis([135, 195, 0, 1.1])
plt.text(142, 0.8, 'P(y=0|x)', color = 'r', fontsize = 12)
plt.text(180, 0.8, 'P(y=1|x)', color = 'b', fontsize = 12)

plt.subplot(4, 1, 4),
plt.plot(x, var1, 'k')
plt.axis([135, 195, 0, 1.1])
plt.text(170, 0.4, 'var(y|x)', fontsize = 12)
plt.xlabel('x', fontsize = 12)

plt.show()

Case 2: Equal Variance but Unequal Prior

Suppose both class distributions have equal variance:

$$ \sigma_0 = \sigma_1 $$

but the prior probabilities differ:

$$ \quad P(y=0) < P(y=1) $$

In this case, the decision boundary no longer lies at the midpoint between the class means. Instead, it shifts toward the less probable class (class 0), because we require more evidence to classify a sample as belonging to the less likely class.

Summary via Graphical Illustration

Python Coding

x = np.arange(130, 200)

mu0 = 160
mu1 = 173
sigma0 = 6.5
sigma1 = 6.5

L1 = norm.pdf(x, mu0, sigma0)
L2 = norm.pdf(x, mu1, sigma1)

prior0 = 1/4
prior1 = 3/4

Px = L1*prior0 + L2*prior1
posterior0 = (L1*prior0)/Px
posterior1 = (L2*prior1)/Px

var1 = posterior1 - posterior1**2;

plt.figure(figsize = (7,8))
plt.subplot(4, 1, 1)
plt.plot(x, L1, 'r', x, L2, 'b')
plt.axis([135, 195, 0, 0.1])
plt.text(mu0-10, 0.08, 'P(x|y=0)', color = 'r', fontsize = 12)
plt.text(mu1, 0.08, 'P(x|y=1)', color = 'b', fontsize = 12)

plt.subplot(4, 1, 2),
plt.plot(x, L1*prior0, 'r', x, L2*prior1, 'b', x, Px, 'k')
plt.axis([135, 195, 0, 0.1])
plt.text(mu0-10, 0.05, 'P(x|y=0)P(y=0)', color = 'r', fontsize = 12)
plt.text(mu1, 0.05, 'P(x|y=1)P(y=1)', color = 'b', fontsize = 12)
plt.text(185, 0.03, 'P(x)', fontsize = 12)

plt.subplot(4, 1, 3),
plt.plot(x, posterior0, 'r', x, posterior1, 'b')
plt.axis([135, 195, 0, 1.1])
plt.text(140, 0.8, 'P(y=0|x)', color = 'r', fontsize = 12)
plt.text(180, 0.8, 'P(y=1|x)', color = 'b', fontsize = 12)

plt.subplot(4, 1, 4),
plt.plot(x,var1,'k')
plt.axis([135, 195, 0, 1.1])
plt.text(170,0.4,'var(y|x)', fontsize=12)
plt.xlabel('x', fontsize = 12)

plt.show()

Case 3: Unequal Variance and Equal Prior

Now suppose the class priors are equal, but the class-conditional variances differ:

$$ \sigma_0 \ne \sigma_1, \quad P(y=0) = P(y=1) = \frac{1}{2} $$

The decision boundary is again determined by the point where the posterior probabilities are equal:

$$ P(x \mid y=0) = P(x \mid y=1) $$

However, since the variances differ, the class-conditional distributions have different spreads. As a result, the point of intersection (decision boundary) is not at the midpoint between the means.

The resulting decision rule is nonlinear, and the discriminant function becomes a quadratic function of $x$.

Summary via Graphical Illustration

We have learned that the classification boundary becomes quadratic in this case. In one dimension, this implies that the decision rule involves two thresholds.

$$ \begin{cases} \text{if } x < \omega_1 &\; \implies \; y = 1 \\ \text{if } \omega_1 < x < \omega_2 &\; \implies \; y = 0 \\ \text{if } x > \omega_2 &\; \implies \; y = 1 \end{cases} $$

Python Coding

x = np.arange(130, 200)

mu0 = 160
mu1 = 173
sigma0 = 6.5
sigma1 = 15

L1 = norm.pdf(x, mu0, sigma0)
L2 = norm.pdf(x, mu1, sigma1)

prior0 = 1/2
prior1 = 1/2

Px = L1*prior0 + L2*prior1
posterior0 = (L1*prior0)/Px
posterior1 = (L2*prior1)/Px

var1 = posterior1 - posterior1**2;

plt.figure(figsize = (7, 8))
plt.subplot(4, 1, 1)
plt.plot(x, L1, 'r', x, L2, 'b')
plt.axis([135, 195, 0, 0.1])
plt.text(mu0-10, 0.08, 'P(x|y=0)', color = 'r', fontsize = 12)
plt.text(mu1, 0.08, 'P(x|y=1)', color = 'b', fontsize = 12)

plt.subplot(4, 1, 2),
plt.plot(x, L1*prior0, 'r', x, L2*prior1, 'b', x, Px, 'k')
plt.axis([135, 195, 0, 0.1])
plt.text(mu0-10, 0.05, 'P(x|y=0)P(y=0)', color = 'r', fontsize = 12)
plt.text(mu1, 0.05, 'P(x|y=1)P(y=1)', color = 'b', fontsize = 12)
plt.text(185, 0.03, 'P(x)', fontsize = 12)

plt.subplot(4, 1, 3),
plt.plot(x, posterior0, 'r', x, posterior1, 'b')
plt.axis([135, 195, 0, 1.1])
plt.text(140, 0.8, 'P(y=0|x)', color = 'r', fontsize = 12)
plt.text(180, 0.8, 'P(y=1|x)', color = 'b', fontsize = 12)

plt.subplot(4, 1, 4),
plt.plot(x, var1, 'k')
plt.axis([135, 195, 0, 1.1])
plt.text(170, 0.4, 'var(y|x)', fontsize = 12)
plt.xlabel('x', fontsize = 12)

plt.show()

Lecture 23 in Learning Theory (Reza Shadmehr, Johns Hopkins University)

from IPython.display import YouTubeVideo
YouTubeVideo('X1WB6IJqMjM', width = "560", height = "315")

Summary

Bayesian decision theory selects the most probable class based on observed data and prior knowledge.
Posterior probabilities result from combining likelihood (how well the data fits the class) and prior (how likely each class is before seeing the data).
The shape and position of the decision boundary depend on:
- The class means and variances
- The prior probabilities of each class
When the class-conditional distributions are Gaussian:
- Equal variances → linear decision boundary
- Unequal priors → bias in the decision threshold toward the more probable class
- Unequal variances → quadratic decision boundary

2. Probability Density Estimation¶

Probability Density Estimation refers to the task of estimating an unknown probability distribution from observed data. Instead of producing a single estimate (e.g., a mean), the goal is to recover the entire probability density function (pdf), allowing for a more complete representation of uncertainty.

There are two major approaches:

Kernel Density Estimation
Bayesian Density Estimation

2.1. Kernel Density Estimation¶

A non-parametric method that estimates the probability density function (pdf) by placing a smooth kernel (e.g., Gaussian) at each data point and summing their contributions.

Given samples $x_1, \cdots, x_m$, the estimated pdf at point $x$ is:

$$ \hat{p}(x) = \frac{1}{m h} \sum_{i=1}^{m} K\left( \frac{x - x_i}{h} \right) $$

Where:

$K(\cdot)$: Kernel function (commonly a Gaussian kernel)
$h$: Bandwidth (controls the smoothness of the estimate)

Properties

Makes minimal assumptions about the form of the underlying distribution
Sensitive to the choice of kernel and bandwidth
Does not generalize well in high dimensions (curse of dimensionality)

Note:

Imagine that each sampled data point is a realization from an unknown underlying probability density function (pdf). It is therefore reasonable to treat each sample as representative of a local mean of that density. If we assume the density is Gaussian around each sample, the spread of each contribution is determined by a user-defined bandwidth parameter. This bandwidth controls how smooth or spiky the estimated density will be.

In this setting, kernel density estimation (KDE) provides a natural method for combining multiple samples to approximate the unknown pdf.

from IPython.display import YouTubeVideo
YouTubeVideo('a357NoXy4Nk', width = "560", height = "315")

Python Coding

m = 10
mu = 0
sigma = 5

x = np.random.normal(mu, sigma, [m, 1])
xp = np.linspace(-20, 20, 100)
y0 = np.zeros([m, 1])

X = []

for i in range(m):
    X.append(norm.pdf(xp, x[i, 0], sigma))

X = np.array(X).T
Xnorm = np.sum(X, 1)/m

plt.figure(figsize = (6, 4))
plt.plot(x, y0, 'kx', markersize = 8)
plt.plot(xp, X, 'b--', alpha = 0.4)
plt.plot(xp, Xnorm, 'r', linewidth = 3)
plt.show()

2.2. Bayesian Density Estimation¶

In Bayesian density estimation, the goal is to infer the full probability distribution of a hidden state or variable, rather than estimating a fixed set of parameters as in traditional approaches.

For example, in parametric models such as the Gaussian, specifying parameters (like mean and variance) fully determines the density function. However, the Bayesian perspective generalizes this by treating those parameters as random variables, and systematically updating our beliefs about them as data is observed.

Prior and Posterior Distributions

The Bayesian framework begins with a prior distribution over the hypothesis $H$, which encodes our belief about $H$ before seeing any data. This belief may be informed by prior knowledge or be intentionally uninformative when such information is unavailable.

As we observe data $D$, we revise this belief by computing the posterior distribution:

$$ P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)} $$

Where:

$P(H)$ is the prior
$P(D \mid H)$ is the likelihood
$P(H \mid D)$ is the posterior
$P(D)$ is the marginal likelihood (normalization constant)

This rule reflects how evidence $D$ updates our belief about the hidden hypothesis $H$.

from IPython.display import YouTubeVideo
YouTubeVideo('w1u6-_2jQJo', width = "560", height = "315")

2.2.1. Updating Beliefs with Multiple Observations¶

Suppose we observe multiple data points $d_1, d_2, \cdots, d_m$. If we assume these observations are conditionally independent given a hidden state $H$, then the joint likelihood factorizes:

$H$: latent hypothesis or hidden variable
$D = {d_1, d_2, \cdots, d_m}$: observed data or evidence

Using Bayes' rule:

$$ P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)} $$

Because the observations are conditionally independent:

$$ \begin{align*} P(H \mid \underbrace{d_1,d_2,\cdots,d_m}_{\text{multiple evidences}}) &= \frac{P(d_1,d_2,\cdots,d_m \mid H) \; P(H)}{P(d_1,d_2,\cdots,d_m)} \\ \\ &=\frac{P(d_1 \mid H)P(d_2 \mid H)\cdots P(d_m \mid H) \;P(H)}{P(d_1,d_2,\cdots,d_m)} \\ \\ &= \eta \cdot \left( \prod_{i=1}^{m} P(d_i \mid H) \right) P(H) \end{align*}$$

Here, $\eta$ is a normalizing constant that ensures the posterior integrates to one.

This formulation allows the posterior to be computed from the prior and the sequence of observations, enabling systematic belief revision.

2.2.2. Recursive Bayesian Estimation¶

Rather than recomputing the posterior from scratch every time new data arrives, we can update our beliefs recursively. This is crucial in online learning and real-time systems.

Key Identities from Probability Theory:

$$ \begin{align*}P(a,b) &= P(a \mid b) P(b) \\ P(a,b \mid c) & = P(a \mid b,c) P(b \mid c) \end{align*}$$

Let us define the sequence of posterior distributions:

$$ \begin{align*} P_0(H) &= P(H) \quad \text{(initial prior)} \\\\ P_1(H) &= P(H \mid d_1) \quad \text{(posterior after first observation)} \\\\ P_2(H) &= P(H \mid d_1, d_2) \quad \text{(posterior after second observation)} \\\\ &\;\; \vdots \\\\ P_m(H) &= P(H \mid d_1, d_2, \cdots, d_m) \end{align*} $$

We now recursively update the posterior as follows:

$$ \begin{align*} P(H\mid d_1) &= \frac{P(d_1 \mid H) P(H)}{P(d_1)} = \eta_1 \, P(d_1 \mid H) \underbrace{P(H)}_{\text{prior}} \\ P(H\mid d_1 d_2) &= \frac{P(d_1 d_2\mid H) P(H)}{P(d_1 d_2)} = \frac{P(d_1 \mid d_2, H)P(d_2 \mid H) P(H)}{P(d_1 d_2)} = \frac{P(d_1 \mid H)P(d_2 \mid H) P(H)}{P(d_1 d_2)} \\& = \eta_2 \, P(d_2 \mid H) \underbrace{P(H\mid d_1)}_{\text{acting as a prior}} \\ & \quad \vdots \\ \\ P(H \mid d_1,d_2,\cdots,d_m) &= \eta_m \,P(d_m \mid H) \underbrace{P(H \mid d_1,d_2,\cdots,d_{m-1})}_{\text{acting as a prior}} \end{align*} $$

Thus, each posterior can be recursively computed as:

$$ \begin{align*} P(H \mid d_1, \cdots, d_m) &= \eta_m P(d_m \mid H) \, P(H \mid d_1, \cdots, d_{m-1})\\\\ \text{or equivalently} \quad P_m(H) &= \eta_m \cdot P(d_m \mid H)\,P_{m-1}(H) \end{align*} $$

Thus, we obtain the recursive chain:

$$ P_0(H) \xrightarrow{\;d_1\;} P_1(H) \xrightarrow{\;d_2\;} P_2(H) \xrightarrow{\;d_3\;} \quad \cdots \quad \xrightarrow{\;d_m\;} P_m(H) $$

This reflects the core idea of recursive Bayesian estimation: "the current posterior becomes the prior for the next update."

This recursive update mechanism is at the heart of Bayesian inference and forms the foundation for many sequential inference techniques such as the Kalman filter, particle filters, and various Bayesian learning frameworks.

In summary:

The prior represents our initial belief.
The likelihood incorporates newly observed data.
The posterior updates the prior using the likelihood.
As more data arrives, we iteratively update our belief.

2.2.3. Lab 1: Bernoulli Model¶

The Bernoulli distribution

$$d = \{0,1\},\qquad \theta \in [0,1]$$

$$p(d \mid \theta) = P[D=d \mid \theta] = \theta^d (1-\theta)^{1-d} = \begin{cases} 1-\theta, & d = 0\\ \theta, & d = 1 \end{cases} $$

import numpy as np

from scipy import stats
import matplotlib.pyplot as plt

def normalize(y, x):
    return y / np.trapezoid(y, x)

N = 101

theta = np.linspace(0, 1, N)
prior = normalize(np.repeat(1, N), theta)

d = np.random.choice([0,1])

likelihood = theta**d * (1 - theta)**(1 - d)

posterior = likelihood * prior
posterior = normalize(posterior, theta)

plt.figure(figsize = (6, 4))
plt.plot(d, 0, 'kx', markersize = 8)
plt.plot(theta, prior, linestyle = '--', label = 'prior')
plt.plot(theta, posterior, label = 'posterior')
plt.title('observed: %d' % d)
plt.xlabel(r'$\theta$')
plt.legend()
plt.show()

def bernoulli_model(d, theta, prior):
    likelihood = theta**d * (1 - theta)**(1 - d)
    posterior = likelihood * prior
    return normalize(posterior, theta)

fig, ax = plt.subplots(ncols = 3, nrows = 3, figsize = (7, 7))
ax = np.ravel(ax)

prior = normalize(np.repeat(1, N), theta)

for n in range(9):
    observed = np.random.choice([0, 1])
    posterior = bernoulli_model(observed, theta, prior)

    ax[n].plot(theta, prior, linestyle = '--')
    ax[n].plot(theta, posterior)
    ax[n].set_title('observed: %d' % observed)
    ax[n].set_xlim([0, 1])

    prior = posterior

plt.tight_layout()
plt.show()

prior = normalize(np.repeat(1, N), theta)

observation = []
for _ in range(200):
    observed = np.random.choice([0,1])
    observation.append(observed)
    posterior = bernoulli_model(observed, theta, prior)

    prior = posterior

print(np.mean(observation), '\n')

plt.figure(figsize = (6, 4))
plt.plot(theta, posterior)
plt.axvline(0.5, color = 'red', linestyle = '--')
plt.xlabel(r'$\theta$')
plt.show()

0.475

2.2.4. Lab 2: Gaussian Model¶

$$p(d \mid h) \sim \mathcal{N}(h, \sigma^2)$$

h = 4
sigma = 5

N = 201
D = np.linspace(-30, 30, N)

likelihood = stats.norm.pdf(D, h, sigma)

plt.figure(figsize = (6, 4))
plt.plot(D, likelihood)
plt.plot(h, 0, 'kx')
plt.xlabel('D')
plt.ylabel('P(D|H)')
plt.title('Given h = {}'.format(h))
plt.show()

H = np.linspace(-30, 30, N)
prior = normalize(np.repeat(1, N), H)

d = np.random.normal(0, sigma)

likelihood = []
for h in H:
    likelihood.append(stats.norm.pdf(d, h, sigma))

posterior = likelihood * prior
posterior = normalize(posterior, H)
plt.figure(figsize = (6, 4))
plt.plot(H, prior, '--', label = 'prior')
plt.plot(H, posterior, label = 'posterior')
plt.plot(d, 0, 'kx')
plt.axvline(h, color = 'k', linestyle = '--')
# plt.axvline(d, color = 'k', linestyle = '--')
plt.xlim([-15, 15])
plt.xlabel('H')
plt.ylabel('P(D|H)')
plt.title('Given d = {0:1.2f}'.format(d))
plt.legend()
plt.show()

def Gaussian_model(d, H, prior):
    likelihood = []
    for h in H:
        likelihood.append(stats.norm.pdf(d, h, sigma))

    posterior = likelihood * prior
    return normalize(posterior, H)

fig, ax = plt.subplots(ncols = 3, nrows = 3, figsize = (7, 7))
ax = np.ravel(ax)

for n in range(9):
    observed = np.random.normal(0, sigma)
    posterior = Gaussian_model(observed, H, prior)

    ax[n].plot(H, prior, '--')
    ax[n].plot(H, posterior)
    ax[n].plot(observed, 0, 'kx')
    ax[n].axvline(h, color = 'k', linestyle = '--')
    ax[n].set_title('observed: %1.2f' % observed)
    ax[n].set_ylim([-0.01, 0.3])
    ax[n].set_xlim([-15, 15])

    prior = posterior

plt.tight_layout()
plt.show()

prior = normalize(np.repeat(1, N), H)

observation = []

for _ in range(20):
    d = np.random.normal(0, sigma)
    observation.append(d)
    posterior = Gaussian_model(d, H, prior)

    prior = posterior

print(np.mean(observation),'\n')

plt.figure(figsize = (6, 4))
plt.plot(H, posterior, color = u'#ff7f0e')
plt.plot(observation, np.zeros(np.size(observation)), 'kx')
plt.xlim([-15, 15])
plt.xlabel('H')
plt.ylabel('P(H|D)')
plt.show()

-0.7987971935315983

Object Tracking in Computer Vision

Lecture: Introduction to Computer Vision by Prof. Aaron Bobick at Georgia Tech

from IPython.display import YouTubeVideo
YouTubeVideo('rf3DKqWajWY?rel=0', width = "560", height = "315")

Recursive Bayesian Estimation in 2D Gaussian Distribution

from IPython.display import YouTubeVideo
YouTubeVideo('qsLF3KgavJk', width = "560", height = "315")

Implementation (but in MATLAB)

from IPython.display import YouTubeVideo
YouTubeVideo('dhsK-mRzxFw', width = "560", height = "315")

Lastly, watch the following video clip (https://youtu.be/AEqkpb_5fe0), which is taken from the TV show NUMB3RS. Reflect on it in the context of recursive Bayesian estimation.

%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')