Parameter Estimation in Probabilistic Models

1. Probability Density Estimation

In many practical situations, we do not know the exact probability distribution of the data. Instead, we are given a collection of samples:


$$y_1, y_2, \cdots, y_m$$


Our goal is to estimate the underlying probability density function (PDF) that most likely generated this dataset. This process is called probability density estimation, and it forms the foundation of statistical modeling, inference, and machine learning.


When we observe a finite set of data points, we assume that they are drawn from some unknown probability distribution. One common approach is to model this unknown distribution using a parametric form, such as a Gaussian.


Picture three colored curves (1), (2), and (3), each representing a Gaussian probability density function with different parameters. Our goal is to find the Gaussian distribution that most likely generated the observed samples.



This naturally leads to a parameter estimation problem: we want to estimate the values of $\mu$ and $\sigma^2$ that best explain the observed data.

To estimate these parameters, we define a likelihood function that quantifies how likely it is to observe the data, given a specific choice of parameters.



1.1. Given $m$ Data Points Drawn from a Gaussian Distribution



Assume each data point $y_i$ is drawn from a normal distribution with unknown parameters $\mu$ and $\sigma^2$:


$$ P(y) = \mathcal{N}(y;\, \mu, \sigma^2) $$


We are given $m$ i.i.d. samples drawn from this distribution:


$$ y_1, y_2, \cdots, y_m \sim P(y) $$


Step 1: Likelihood Function

The probability density of a single data point is:


$$ P(y = y_i \,;\, \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{1}{2\sigma^2} (y_i - \mu)^2 \right) \quad \text{(generative model)} $$


Thus, the likelihood of observing all $m$ samples is:


$$ \begin{align*} \mathcal{L}(\mu, \sigma^2) &= \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi} \sigma} \exp\left( -\frac{1}{2\sigma^2} (y_i - \mu)^2 \right) \\\\ &= \frac{1}{(2\pi)^{m/2} \sigma^m} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - \mu)^2 \right) \end{align*} $$


Step 2: Log-Likelihood

To simplify optimization, we take the natural logarithm of the likelihood:


$$ \ell(\mu, \sigma^2) = \log \mathcal{L} = -\frac{m}{2} \log 2\pi - m \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - \mu)^2 $$


Step 3: Maximum Likelihood Estimation (MLE)

To find the best estimates of $\mu$ and $\sigma^2$, we maximize $\ell$:


$$ \frac{\partial \ell}{\partial \mu} = 0, \qquad \frac{\partial \ell}{\partial \sigma} = 0 $$


Compute each derivative:


$$ \begin{align*} \frac{\partial \ell}{\partial \mu} &= \frac{1}{\sigma^2} \sum_{i=1}^{m} (y_i - \mu) = 0 \quad \Rightarrow \quad \mu_{\text{MLE}} = \frac{1}{m} \sum_{i=1}^{m} y_i \quad \text{(sample mean)} \\\\ \frac{\partial \ell}{\partial \sigma} &= -\frac{m}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{m} (y_i - \mu)^2 = 0 \quad \Rightarrow \quad \sigma^2_{\text{MLE}} = \frac{1}{m} \sum_{i=1}^{m} (y_i - \mu)^2 \quad \text{(sample variance)} \end{align*} $$


Key Observations

  • The maximum likelihood estimate of $\mu$ is the sample mean
  • The maximum likelihood estimate of $\sigma^2$ is the sample variance
  • These results justify the intuitive practice of using the mean and variance as descriptors of data

A surprising and powerful result:
Even if the data isn't Gaussian, the distribution of the sample mean becomes approximately Gaussian by the central limit theorem as $m$ grows large.
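
A quick numerical check of this claim. The sketch below is illustrative only: the exponential source distribution, the sample size, and the number of trials are arbitrary choices, not part of the derivation above.

In [ ]:
# Central limit theorem demo:
# sample means of non-Gaussian (exponential) data look approximately Gaussian
import numpy as np
import matplotlib.pyplot as plt

m = 100           # samples per trial
trials = 5000     # number of sample means to collect

means = [np.mean(np.random.exponential(1.0, m)) for _ in range(trials)]

plt.figure(figsize = (6, 4))
plt.hist(means, 50, density = True)
plt.title('Distribution of the sample mean (exponential data)')
plt.grid(alpha = 0.3)
plt.show()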



1.2. Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation is a general method used in statistics to estimate parameters of a model by maximizing the likelihood function.


Definition

The likelihood function is the probability (or density) of observing the data $D$ given a model with parameters $\theta$:


$$ \mathcal{L}(\theta) = P(D \mid \theta) = P(D \,;\, \theta) $$


The MLE is defined as:


$$ \theta_{\text{MLE}} = \arg \max_{\theta} \;\; P(D \,;\, \theta) $$


Interpretation

  • You assume the data is generated by a probabilistic model with unknown parameters
  • You find the parameters $\theta$ that make the observed data most probable
  • This approach is model-driven, consistent with generative modeling

Example (Gaussian)

If you assume the data is Gaussian:


$$ \mu_{\text{MLE}} = \frac{1}{m} \sum_{i=1}^{m} x_i $$


This is the result we derived earlier and serves as a canonical example of how MLE works in practice.


In [ ]:
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
# MLE of Gaussian distribution
# mu

m = 20
mu = 0
sigma = 5

x = np.random.normal(mu,sigma,[m,1])
xp = np.linspace(-20, 20, 100)
y0 = np.zeros([m, 1])

muhat = [-5, 0, 5, np.mean(x)]

plt.figure(figsize = (6, 6))

for i in range(4):
    yp = norm.pdf(xp, muhat[i], sigma)
    y = norm.pdf(x, muhat[i], sigma)
    logL = np.sum(np.log(y))

    plt.subplot(4, 1, i+1)
    plt.plot(xp, yp, 'r')
    plt.plot(x, y, 'bo')
    plt.plot(np.hstack([x, x]).T, np.hstack([y, y0]).T, 'k--')

    plt.title(r'$\hat\mu$ = {0:.2f}'.format(muhat[i]))
    plt.text(-15,0.06,np.round(logL,4))
    plt.axis([-20, 20, 0, 0.11])

plt.tight_layout()
plt.show()

Compare to the result from the formula:

$$\mu_{MLE}=\frac{1}{m}\sum_{i = 1}^{m}x_i$$

In [ ]:
# mean is unknown in this example
# variance is known in this example

m = 10
mu = 0
sigma = 5

x = np.random.normal(mu,sigma,[m,1])

mus = np.arange(-10, 10.5, 0.5)
LOGL = []

for i in range(np.size(mus)):
    y = norm.pdf(x, mus[i], sigma)
    logL = np.sum(np.log(y))
    LOGL.append(logL)

muhat = np.mean(x)
print(muhat,'\n')

plt.figure(figsize = (6, 4))
plt.plot(mus, LOGL, '.')
plt.title(r'$\log (\prod \mathcal{N}(x \mid \mu , \sigma^2))$')
plt.xlabel(r'$\hat \mu$')
plt.grid(alpha = 0.3)
plt.show()
1.018080164522729 


$$\sigma_{MLE}^2=\frac{1}{m}\sum\limits_{i=1}^m(y_i-\mu)^2$$

In [ ]:
# mean is known in this example
# variance is unknown in this example

m = 100
mu = 0
sigma = 3

x = np.random.normal(mu,sigma,[m,1])  # samples

sigmas = np.arange(1, 10, 0.1)
LOGL = []

for i in range(sigmas.shape[0]):
    y = norm.pdf(x, mu, sigmas[i])     # likelihood
    logL = np.sum(np.log(y))
    LOGL.append(logL)

sigmahat = np.sqrt(np.var(x))
print(sigmahat)

plt.figure(figsize = (6, 4))
plt.title(r'$\log (\prod \mathcal{N} (x|\mu,\sigma^2))$')
plt.plot(sigmas, LOGL, '.')
plt.xlabel(r'$\hat \sigma$')
plt.axis([0, np.max(sigmas), np.min(LOGL), -200])
plt.grid(alpha = 0.3)
plt.show()
3.0438236510057792




2. Data Fusion with Uncertainties

From Learning Theory (Reza Shadmehr, Johns Hopkins University)

  • Lecture 6

In many real-world systems, measurements are collected from multiple sensors, each of which may be noisy or imprecise. An important question is: How can we combine noisy measurements from multiple sensors to produce a more accurate estimate?


This question leads us to the concept of data fusion - combining observations while accounting for their associated uncertainties.


2.1. Two Noisy Sensors

Assume we are trying to estimate a true quantity $x$ using two sensors, each affected by additive Gaussian noise:



$$\begin{align*} y_a &= x + \varepsilon_a, \quad \varepsilon_a \sim \mathcal{N}\left(0,\sigma^2_a\right) \\ y_b &= x + \varepsilon_b, \quad \varepsilon_b \sim \mathcal{N}\left(0,\sigma^2_b\right) \end{align*}$$


Each sensor provides a noisy measurement of the same true value $x$, but with possibly different noise levels.


Matrix Formulation

We rewrite the two measurements in vector form:


$$ y = \begin{bmatrix} y_a \\ y_b \end{bmatrix} = Cx + \varepsilon = \begin{bmatrix} 1 \\ 1 \end{bmatrix} x + \begin{bmatrix} \varepsilon_a \\ \varepsilon_b \end{bmatrix}, \quad \varepsilon \sim \mathcal{N}(0, R) $$


The noise covariance matrix is:


$$ R = \begin{bmatrix} \sigma_a^2 & 0 \\ 0 & \sigma_b^2 \end{bmatrix} $$


Thus, the likelihood model for $y$ given $x$ is:


$$ y \mid x \sim \mathcal{N}(Cx, R) $$


Written out in full:


$$ P(y \mid x) = \frac{1}{\sqrt{(2\pi)^2 |R|}} \exp\left( -\frac{1}{2} (y - Cx)^T R^{-1} (y - Cx) \right) $$


Goal: Estimate $\hat{x}$ from $y$

We aim to find the maximum likelihood estimate (MLE) of $x$, given the observation $y$.


Step 1: Log-Likelihood

The negative log-likelihood is:


$$ \ell(x) = \text{const} + \frac{1}{2} (y - Cx)^T R^{-1} (y - Cx) $$


We minimize this with respect to $x$.


Step 2: Expand the Quadratic Term


$$ \begin{align*} (y - Cx)^T R^{-1} (y - Cx) &= y^T R^{-1} y - 2 x^T C^T R^{-1} y + x^T C^T R^{-1} C x \end{align*} $$


Step 3: Set Gradient to Zero

Differentiating with respect to $x$ and setting the derivative to zero:


$$ \frac{d\ell}{dx} = -C^T R^{-1} y + C^T R^{-1} C x = 0 $$


Solving for $\hat{x}$:


$$ \hat{x} = \left(C^T R^{-1} C\right)^{-1} C^T R^{-1} y $$


This is the weighted least squares estimator, where weights are determined by the inverse noise variances.


Analytical Evaluation

We compute the matrix terms explicitly:


$$ \begin{align*} C &= \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad R^{-1} = \begin{bmatrix} \frac{1}{\sigma_a^2} & 0 \\ 0 & \frac{1}{\sigma_b^2} \end{bmatrix} \\\\ C^T R^{-1} C &= \frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2} \\\\ C^T R^{-1} &= \begin{bmatrix} \frac{1}{\sigma_a^2} & \frac{1}{\sigma_b^2} \end{bmatrix} \end{align*} $$


Plug into the estimator:


$$ \hat{x} = \left( \frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2} \right)^{-1} \left( \frac{1}{\sigma_a^2} y_a + \frac{1}{\sigma_b^2} y_b \right) $$


This gives the weighted average of the two measurements, where the weights are inversely proportional to their variances.
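
As a sanity check, the matrix estimator and the scalar weighted average can be compared numerically. Below is a minimal sketch; the measurement values and noise levels are arbitrary illustrative choices.

In [ ]:
# verify: matrix WLS estimator == precision-weighted average
import numpy as np

sigma_a, sigma_b = 1.0, 2.0       # illustrative noise levels
ya, yb = 4.8, 5.3                 # two hypothetical noisy measurements

# matrix form: xhat = (C^T R^-1 C)^-1 C^T R^-1 y
C = np.array([[1.0], [1.0]])
R = np.diag([sigma_a**2, sigma_b**2])
y = np.array([[ya], [yb]])
Rinv = np.linalg.inv(R)
xhat_matrix = np.linalg.inv(C.T @ Rinv @ C) @ C.T @ Rinv @ y

# scalar form: weights inversely proportional to variances
wa, wb = 1/sigma_a**2, 1/sigma_b**2
xhat_scalar = (wa*ya + wb*yb)/(wa + wb)

print(xhat_matrix.item(), xhat_scalar)   # identical values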


Variance of the Estimate

We now calculate the variance of the fused estimate $\hat{x}$:


$$ \text{var}(\hat{x}) = \left(C^T R^{-1} C\right)^{-1} = \frac{1}{\frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2}} $$


Why?


$$\begin{align*} \text{var}\left(\hat{x}\right) &= \left(\left(C^TR^{-1}C \right)^{-1}C^TR^{-1}\right) \cdot \text{var}(y) \cdot \left(\left(C^TR^{-1}C \right)^{-1}C^TR^{-1}\right)^T\\ &= \left(\left(C^TR^{-1}C \right)^{-1}C^TR^{-1}\right) \cdot R \cdot \left(\left(C^TR^{-1}C \right)^{-1}C^TR^{-1}\right)^T\\ &= \left(C^TR^{-1}C \right)^{-1}C^T \left(R^{-1} \right)^TC \left(\left(C^TR^{-1}C \right)^{-1} \right)^T \\ &= \underbrace{\left(C^TR^{-1}C\right)^{-1}}\,\underbrace{C^TR^{-1}C}\,\left(\left(C^TR^{-1}C\right)^{-1}\right)^T = \left(C^TR^{-1}C\right)^{-1}\\\\ &= \frac{1}{\frac{1}{\sigma^2_a}+\frac{1}{\sigma^2_b}} \end{align*}$$


Note:


$$ \text{var}(\hat{x}) \leq \min(\sigma_a^2, \sigma_b^2) $$


This inequality reflects the benefit of combining information: the estimate is more precise than either of the individual sensors.


Summary


$$ \begin{align*} \hat{x} &= \frac{ \frac{1}{\sigma_a^2} y_a + \frac{1}{\sigma_b^2} y_b } { \frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2} } = \alpha_a y_a + \alpha_b y_b \quad \text{(weighted average)} \\\\ \text{var}(\hat{x}) &= \frac{1}{ \frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2} } \end{align*} $$


Where:


$$ \alpha_a = \frac{ \frac{1}{\sigma_a^2} }{ \frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2} }, \quad \alpha_b = \frac{ \frac{1}{\sigma_b^2} }{ \frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2} } $$




Big Lesson

  • Two sensors are better than one
  • Even if both sensors are noisy, fusing them reduces overall uncertainty
  • Sensor accuracy matters - better sensors get more weight
  • This principle is at the core of probabilistic robotics, Kalman filtering, sensor networks, and machine learning models with uncertainty


2.2. Example: Two Rulers (1D Sensor Fusion in the Brain)


Imagine you are estimating the length of an object using two different sources of sensory information:

  • Visual measurement: looking at the object
  • Haptic measurement: feeling the object with eyes closed


Each of these measurements is affected by Gaussian noise. Let the true length be $x$, and the measurements be:


$$ \begin{align*} y_{\text{vision}} = y_v &= x + \varepsilon_v, \quad \varepsilon_v \sim \mathcal{N}(0, \sigma_v^2) \\ y_{\text{haptic}} = y_h &= x + \varepsilon_h, \quad \varepsilon_h \sim \mathcal{N}(0, \sigma_h^2) \end{align*} $$


Where:

  • $x$: the true length of the object
  • $y_v$: measurement from vision (visual ruler)
  • $y_h$: measurement from touch (haptic ruler)
  • $\sigma_v^2$, $\sigma_h^2$: variances associated with visual and haptic uncertainty, respectively

Each sensor independently observes the same true value $x$ with its own uncertainty. Our goal is to estimate $x$ based on both observations.


Optimal Fusion

The optimal estimate $\hat{x}$ that fuses both noisy measurements is:


$$ \hat{x} = \frac{ \frac{1}{\sigma_v^2} y_v + \frac{1}{\sigma_h^2} y_h } { \frac{1}{\sigma_v^2} + \frac{1}{\sigma_h^2} } $$


More reliable measurements (i.e., smaller variance) receive greater weight.

  • If $\sigma_v^2 \ll \sigma_h^2$, visual information is more reliable, and $\hat{x} \approx y_{\text{vision}}$

  • If both variances are equal, then $\hat{x}$ becomes the average: $\hat{x} = \frac{y_{\text{vision}} + y_{\text{haptic}}}{2}$


Variance of the Fused Estimate

The uncertainty (variance) of the fused estimate is:


$$ \text{var}(\hat{x}) = \frac{1}{ \frac{1}{\sigma_v^2} + \frac{1}{\sigma_h^2} } < \min(\sigma_v^2, \sigma_h^2) $$


This is always less than either individual variance, showing that fusing two noisy measurements leads to a more accurate (less uncertain) estimate.


Neuroscience Perspective

This model reflects how the human brain integrates sensory input:

  • It doesn't rely solely on one modality

  • It combines information in a statistically optimal way (approximately)

  • Uncertainty matters: The brain gives more weight to more reliable sources


This example beautifully illustrates how probabilistic reasoning and Bayesian estimation naturally arise in both artificial systems (sensor fusion) and biological ones (perception).


1D Fusion Example

In [ ]:
x = 5       # true state (length in this example)
a = 1       # sigma of a
b = 2       # sigma of b

YA = []
YB = []
XML = []

for i in range(2000):
    ya = x + np.random.normal(0,a)
    yb = x + np.random.normal(0,b)
    xml = (1/a**2*ya + 1/b**2*yb)/(1/a**2+1/b**2)
    YA.append(ya)
    YB.append(yb)
    XML.append(xml)

plt.figure(figsize = (6, 5))
labels = [r'$y_a$', r'$y_b$', r'$x_{ML}$']

for i, data in enumerate([YA, YB, XML]):
    plt.subplot(3, 1, i+1)
    plt.hist(data, 41, density = True)
    plt.axis([0, 10, 0, 0.5])
    plt.title(labels[i])
    plt.grid(alpha = 0.3)
plt.tight_layout()
plt.show()

2D Fusion Example

In [ ]:
x = np.array([5, 10]).reshape(-1, 1)    # true position

mu = np.array([0, 0])
Ra = np.array([[9, 1],
               [1, 1]])
Rb = np.array([[1, 1],
               [1, 9]])

Ra_inv = np.linalg.inv(Ra)
Rb_inv = np.linalg.inv(Rb)

YA = []
YB = []
XML = []

for i in range(1000):
    ya = x + np.random.multivariate_normal(mu, Ra).reshape(-1, 1)
    yb = x + np.random.multivariate_normal(mu, Rb).reshape(-1, 1)
    xml = np.linalg.inv(Ra_inv + Rb_inv) @ (Ra_inv @ ya + Rb_inv @ yb)
    YA.append(ya.T)
    YB.append(yb.T)
    XML.append(xml.T)

YA = np.vstack(YA)
YB = np.vstack(YB)
XML = np.vstack(XML)
In [ ]:
plt.figure(figsize = (6, 4))
plt.title('Data Fusion')
plt.plot(YA[:,0], YA[:,1], 'b.', label='Observation 1')
plt.plot(YB[:,0], YB[:,1], 'r.', label='Observation 2')
plt.plot(XML[:,0], XML[:,1], 'k.', label='MLE')
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.legend()
plt.show()




3. Maximum-a-Posteriori Estimation (MAP)

3.1. Think Differently

In many traditional estimation settings, we are given a dataset $D = \{d_1, d_2, \cdots, d_m\}$.



We aim to estimate the underlying probability distribution $P(D)$ that generated the data. Once $P(D)$ is estimated, a common prediction for the next observation $d_{m+1}$ is simply the expected value:


$$ d_{m+1} = E[D] $$


But What If the Data Comes from a Hidden Parameter?

Suppose the data is generated not directly, but through a hidden variable $\theta$. In this case:

  • $\theta$ is unknown
  • We assume $\theta$ itself is a random variable



The assumption is now that the data points are conditionally independent given $\theta$:


$$ d_i \sim P(d_i \mid \theta) $$


Our goal shifts from estimating $P(D)$ to inferring the hidden parameter $\theta$ that explains the data.


Key Perspective

We no longer treat $\theta$ as a fixed constant to be estimated (as in MLE), but rather as a random variable with its own prior distribution:


$$ \theta \sim P(\theta) $$


Upon observing data $D$, we apply Bayes' rule to update our belief about $\theta$:


$$ P(\theta \mid D) = \frac{P(D \mid \theta) \cdot P(\theta)}{P(D)} $$


This framework is the foundation of Bayesian inference, where we do not merely look for the best-fit model, but update our belief about parameters based on evidence.


In Maximum Likelihood Estimation (MLE), we choose the parameter $\theta$ that maximizes the likelihood of the observed data. In contrast, Maximum-a-Posteriori (MAP) estimation incorporates both the observed data and our prior belief about $\theta$.


MAP as Posterior Maximization

We aim to choose $\theta$ that maximizes the posterior probability given the data $D$:


$$ \theta_{\text{MAP}} = \mathrm{arg} \max_{\theta} \; P(\theta \mid D) $$


By Bayes' rule, the posterior is expressed as:


$$ P(\theta \mid D) = \frac{P(D \mid \theta) \cdot P(\theta)}{P(D)} $$


  • $P(\theta)$: Prior distribution over parameters (before observing data)
  • $P(D \mid \theta)$: Likelihood of the observed data under parameter $\theta$
  • $P(D)$: Marginal likelihood (a normalizing constant)

Since $P(D)$ does not depend on $\theta$, we can ignore it in the optimization:


$$ \theta_{\text{MAP}} = \mathrm{arg} \max_{\theta} \; P(D \mid \theta) \cdot P(\theta) $$


Logarithmic Form

Maximizing the log-posterior simplifies the computation:


$$ \theta_{\text{MAP}} = \mathrm{arg} \max_{\theta} \; \log P(D \mid \theta) + \log P(\theta) $$


This is just like MLE with an additional regularization term $\log P(\theta)$ that encodes prior belief.


For Independent Observations

If the data consists of $m$ i.i.d. observations $D = \{d_1, d_2, \dots, d_m\}$, the MAP estimator becomes:


$$ \theta_{\text{MAP}} = \mathrm{arg} \max_{\theta} \; \left\{ \sum_{i=1}^m \log P(d_i \mid \theta) + \log P(\theta) \right\} $$


Compare this to MLE:


$$ \theta_{\text{MLE}} = \mathrm{arg} \max_{\theta} \; \sum_{i=1}^m \log P(d_i \mid \theta) $$


Key Insight

  • MAP is similar to MLE but incorporates a prior belief about the parameter $\theta$
  • The prior acts like an additional observation or constraint
  • MAP estimation leads to regularized solutions

$$ \begin{align*} \text{MAP:} \quad & \theta_{\text{MAP}} = \mathrm{arg} \max_{\theta} \; \log P(D \mid \theta) + \log P(\theta) \\\\ \text{MLE:} \quad & \theta_{\text{MLE}} = \mathrm{arg} \max_{\theta} \; \log P(D \mid \theta) \end{align*} $$


Takeaway: MAP provides a principled way to incorporate prior knowledge into parameter estimation.



3.2. MAP for a Univariate Gaussian

Suppose we observe data points:

$$D = \{x_1, x_2, \cdots, x_m\}$$


and assume that each sample is drawn independently from a Gaussian distribution with unknown mean $\theta$ and known variance $\sigma^2$:


$$ x_i \sim \mathcal{N}(\theta,\sigma^2) $$


Now suppose that we also have a prior belief about the value of $\theta$:


$$ \theta \sim \mathcal{N}(\mu, 1^2) $$


That is, our prior belief is that $\theta$ is normally distributed with mean $\mu$ and variance $1$.



Step 1: Joint Probability

The joint probability of the data given $\theta$ (i.e., the likelihood) is:


$$ P(x_1,x_2,\cdots,x_m \mid \theta) = \prod\limits_{i=1}^m P(x_i \mid \theta) $$


Step 2: MAP Estimation

The MAP estimate maximizes the posterior probability of $\theta$:


$$ \theta_{MAP} = \mathrm{arg} \max_{\theta} \; P(\theta \mid D) = \mathrm{arg} \max_{\theta} \; P(D \mid \theta)P(\theta) $$


Taking the logarithm:


$$ \theta_{MAP} = \mathrm{arg} \max_{\theta} \; \left\{\log P(D \mid \theta) + \log P(\theta)\right\} $$


Step 3: Compute Gradients

We proceed by taking the derivative with respect to $\theta$:


  • For the likelihood term:

$$ \frac{\partial}{\partial \theta} \left(\log P(D \mid \theta)\right) = \frac{1}{\sigma^2} \left( \sum_{i=1}^m x_i - m\theta \right) $$


  • For the prior term:

$$ \frac{\partial}{\partial \theta} \left(\log P(\theta)\right) = \mu - \theta $$


Step 4: Set Derivative to Zero

We combine both terms and set the derivative to zero:


$$ \frac{1}{\sigma^2} \sum_{i=1}^m x_i + \mu - \left(\frac{m}{\sigma^2} + 1\right)\theta^* = 0 $$


Solving for $\theta^*$:


$$ \theta^* = \frac{\frac{1}{\sigma^2} \sum_{i=1}^m x_i + \mu}{\frac{m}{\sigma^2} + 1} $$


Thus, the MAP estimator for $\theta$ is:


$$ \theta_{MAP} = \frac{\frac{m}{\sigma^2}}{\frac{m}{\sigma^2}+1}\bar{x} + \frac{1}{\frac{m}{\sigma^2}+1}\mu $$


This is a weighted average of the sample mean $\bar{x}$ and the prior mean $\mu$.



Interpretation

The MAP estimate blends data and prior belief based on their respective uncertainties:

  • $\bar{x} = \frac{1}{m} \sum x_i$: sample mean from observed data

  • $\mu$: prior guess of $\theta$

  • The weight on $\bar{x}$ increases as the number of samples $m$ increases or as noise variance $\sigma^2$ decreases

  • The prior acts like a virtual data point with unit variance


Analogy

  • $\mu$ = 1st observation (prior), drawn from $\mathcal{N}(\mu, 1)$
  • $\bar{x}$ = 2nd observation (data), drawn from $\mathcal{N}(\theta, \frac{\sigma^2}{m})$

MAP estimation is like combining two observations with known variances.


Big Lesson

A prior behaves like additional data. It stabilizes the estimate, especially when observed data is limited or noisy. As more data is collected, the influence of the prior diminishes.


Note:

Prior knowledge can come from various sources such as:

  • Education: e.g., learning that volume is not always proportional to weight
  • Experience: accumulated through interactions over time
  • Aging: as people grow older, they refine priors based on repeated sensory outcomes

Example: Perceiving Object Weight via Multiple Senses

You're presented with two objects of different sizes and asked: "Which one is heavier?"

We consider four different sensory conditions:

  • Without Touching and with Eyes Closed
  • Only Visual Inspection
  • Only Haptic (Touch) Inspection
  • Both Visual and Haptic Inspection

(1) Without Touching and with Eyes Closed

  • No sensory input (neither visual nor haptic)
  • You have to guess purely based on prior knowledge or random chance

(2) Only Visual Inspection

  • You see the objects but do not touch them
  • You likely assume the larger object is heavier
  • This is guided by a prior based on daily experience (volume $\propto$ mass)
  • Outcome: You likely say "the left one is heavier"

(3) Only Haptic (Touch) Inspection

  • Eyes are closed, but you feel the objects
  • No visual bias is introduced
  • This engages the haptic sensory system, which may have higher precision for weight
  • Outcome: You may correctly identify which is heavier if your sense of touch is reliable

(4) Both Visual and Haptic Inspection

  • You integrate visual and haptic signals
  • The brain uses a Bayesian strategy to weight each signal inversely to its uncertainty


3.3. MAP in Python


Suppose $\theta$ is a random variable with prior distribution:


$$\theta \sim \mathcal{N}(\mu, 1^2)$$


We observe data drawn from a Gaussian distribution with unknown mean $\theta$ and known variance $\sigma^2$:


$$ x_i \sim \mathcal{N}(\theta, \sigma^2) $$


This setup corresponds to estimating the mean of a univariate Gaussian using MAP, where we incorporate prior knowledge about $\theta$ centered at $\mu$.


In [ ]:
# known
mu = 5
sigma = 2

# unknown theta
theta = np.random.normal(mu,1)
x = np.random.normal(theta, sigma)

print('theta = {:.4f}\n'.format(theta))
print('x = {:.4f}'.format(x))
theta = 5.4504

x = 4.7356

$$\theta_{MAP} = \frac{\frac{m}{\sigma^2}}{\frac{m}{\sigma^2}+1}\bar{x}+\frac{1}{\frac{m}{\sigma^2}+1}\mu $$

In [ ]:
# MAP

m = 4
X = np.random.normal(theta,sigma,[m,1])

xbar = np.mean(X)
theta_MAP = m/(m+sigma**2)*xbar + sigma**2/(m+sigma**2)*mu

print('mu = 5\n')
print('xbar = {:.4f}\n'.format(xbar))
print('theta_MAP = {:.4f}'.format(theta_MAP))
mu = 5

xbar = 4.3017

theta_MAP = 4.6508
In [ ]:
# theta
mu = 5
theta = np.random.normal(mu,1)

sigma = 2
m = 2000

X = np.random.normal(theta,sigma,[m,1])

print('theta = {:.4f}\n'.format(theta))
plt.figure(figsize = (6, 4))
plt.hist(X, 31)
plt.xlim([-5, 15])
plt.show()
theta = 3.6052

In [ ]:
m = 50
XMLE = []
XMAP = []

for k in range(m):
    n = k + 1                                   # number of samples used so far
    xmle = np.mean(X[0:n,0])
    xmap = n/(n + sigma**2)*xmle + sigma**2/(n + sigma**2)*mu
    XMLE.append(xmle)
    XMAP.append(xmap)

print('theta = {:.4f}\n'.format(theta))
plt.figure(figsize = (6,4))
plt.plot(XMLE)
plt.plot(XMAP)
plt.legend(['MLE','MAP'])
plt.show()
theta = 3.6052





4. Linear Measurement with Noise

In many estimation problems, we deal with measurements contaminated by noise. A common setup is:


$$y = Ax + \omega$$


Where:

  • $x$ is the true quantity we wish to estimate.

  • $y$ is the measurement.

  • $A$ is a matrix representing the measurement model or the sensor's characteristics.

  • $\omega$ represents sensor noise



We assume that both $x$ and $\omega$ are normally distributed:


$$x \sim \mathcal{N}(\mu_x, \Sigma_x) \quad \rightarrow \quad \text{for simplicity } x \sim \mathcal{N}(0, \Sigma_x)$$

$$\omega \sim \mathcal{N}(\mu_{\omega}, \Sigma_{\omega}) \quad \rightarrow \quad \text{for simplicity } \omega \sim \mathcal{N}(0, \Sigma_{\omega})$$


Further, we assume $x$ and $\omega$ are independent. The joint distribution of $x$ and $\omega$ is:


$$\begin{bmatrix} x \\ \omega \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_x \\ \mu_{\omega}\end{bmatrix}, \begin{bmatrix} \Sigma_{x}& 0\\ 0& \Sigma_{\omega} \end{bmatrix} \right)$$


Since $y = A x + \omega$, the joint distribution of $x$ and $y$ becomes:


$$\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} I & 0 \\A & I \end{bmatrix} \begin{bmatrix} x \\ \omega \end{bmatrix}$$


Therefore,


$$ \begin{align*} \begin{bmatrix} x \\ y \end{bmatrix} &\sim \mathcal{N}\left( \begin{bmatrix} I & 0 \\A & I \end{bmatrix} \begin{bmatrix} \mu_x \\ \mu_{\omega}\end{bmatrix}, \begin{bmatrix} I & 0 \\A & I \end{bmatrix} \begin{bmatrix} \Sigma_{x}& 0\\ 0& \Sigma_{\omega} \end{bmatrix} \begin{bmatrix} I & 0 \\A & I \end{bmatrix}^T \right) \\\\ &\sim \mathcal{N}\left( \begin{bmatrix} \mu_x \\ A \mu_x + \mu_{\omega}\end{bmatrix}, \begin{bmatrix} \Sigma_{x}& \Sigma_x A^T\\ A\Sigma_x& A\Sigma_xA^T + \Sigma_{\omega} \end{bmatrix} \right) = \mathcal{N} \left( \begin{bmatrix} \mu_x \\ \mu_y\end{bmatrix}, \begin{bmatrix} \Sigma_{x}& \Sigma_{xy}\\ \Sigma_{yx}& \Sigma_y \end{bmatrix} \right) \end{align*} $$



$$\Sigma_y = \underbrace{ A\Sigma_xA^T}_{\text{signal covariance}} + \underbrace{\Sigma_{\omega}}_{\text{noise covariance}}$$


Posterior Distribution: Estimate $x$ Given $y$

Given a measurement $y$, the posterior distribution of $x$ given $y$ is also Gaussian:


$$x \mid y \sim \mathcal{N} \left(\mu_x + \Sigma_{xy}\Sigma_{y}^{-1}(y-\mu_y), \; \Sigma_{x} - \Sigma_{xy}\Sigma_{y}^{-1}\Sigma_{yx} \right)$$

Where:

  • $\Sigma_{xy} = A \Sigma_x$

  • $\Sigma_y = A \Sigma_x A^T + \Sigma_{\omega}$


The optimal estimator for $x$ is:


$$ \begin{align*} \hat{x} = \phi(y) &= E[x \mid y]\\ & = \mu_x + \Sigma_{xy}\Sigma_{y}^{-1}(y-\mu_y)\\ & = \mu_x + \Sigma_x A^T(A\Sigma_xA^T + \Sigma_{\omega})^{-1}(y-\mu_y)\\ & = \mu_x + K(y-\mu_y)\\\\ \text{cov}(x \mid y) &= \Sigma_{x} - \Sigma_{xy}\Sigma_{y}^{-1}\Sigma_{yx}\\\\ &= \Sigma_{x} - \Sigma_x A^T (A\Sigma_xA^T + \Sigma_{\omega})^{-1} A \Sigma_x\leq \Sigma_x \end{align*} $$


Where the Kalman gain $K$ is:


$$ K = \Sigma_x A^T \left( A \Sigma_x A^T + \Sigma_\omega \right)^{-1} $$


Interpretation

  • $\mu_x$ is our best prior guess for $x$.

  • The term $(y - \mu_y)$ represents the discrepancy between the observed value and the expected value.

  • The estimator adjusts the prior estimate by $K$ times the discrepancy between the measured and expected values.

  • The covariance of the estimation error is always smaller than the prior covariance of $x$, reflecting the fact that measurements reduce our uncertainty about $x$.
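
The estimator above can be exercised numerically in a few lines. Below is a minimal sketch; the matrices $A$, $\Sigma_x$, and $\Sigma_\omega$ are arbitrary illustrative choices, not values from the text.

In [ ]:
# posterior estimate of x from a linear measurement y = Ax + w
import numpy as np

np.random.seed(0)

mu_x = np.zeros(2)                       # prior mean (zero, for simplicity)
Sigma_x = np.array([[4.0, 0.0],
                    [0.0, 1.0]])         # prior covariance (illustrative)
A = np.array([[1.0, 0.0],
              [1.0, 1.0]])               # measurement matrix (illustrative)
Sigma_w = 0.5*np.eye(2)                  # noise covariance (illustrative)

# simulate one measurement
x_true = np.random.multivariate_normal(mu_x, Sigma_x)
w = np.random.multivariate_normal(np.zeros(2), Sigma_w)
y = A @ x_true + w

# Kalman gain K = Sigma_x A^T (A Sigma_x A^T + Sigma_w)^{-1}
S = A @ Sigma_x @ A.T + Sigma_w
K = Sigma_x @ A.T @ np.linalg.inv(S)

mu_y = A @ mu_x
x_hat = mu_x + K @ (y - mu_y)            # posterior mean E[x | y]
Sigma_post = Sigma_x - K @ A @ Sigma_x   # posterior covariance

print('x_true :', x_true)
print('x_hat  :', x_hat)
print('prior var    :', np.diag(Sigma_x))
print('posterior var:', np.diag(Sigma_post))   # reduced uncertainty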




5. Recursive Bayesian Estimation from Two Sensors

Bayesian reasoning provides a powerful framework for estimation and inference under uncertainty. It allows us to systematically update our beliefs as new data becomes available, making it a cornerstone of modern probabilistic modeling and filtering.


5.1. Re-visit Two Sensors Problem



In the two-sensor fusion problem, each sensor provides a noisy observation of the same true value $x$. We assume that the measurements are independent and corrupted by Gaussian noise.

Let:


$$ \text{var}(y_a) = \sigma_a^2, \qquad \text{var}(y_b) = \sigma_b^2 $$


Then, the optimal estimator of $x$ based on both measurements is a precision-weighted average:


$$ \begin{align*} E[\hat x] &= \frac{ \frac{1}{\sigma_a^2} y_a + \frac{1}{\sigma_b^2} y_b }{ \frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2} } = \frac{\frac{1}{\sigma^2_a}}{\frac{1}{\sigma^2_a} + \frac{1}{\sigma^2_b}}y_a + \frac{\frac{1}{\sigma^2_b}}{\frac{1}{\sigma^2_a} + \frac{1}{\sigma^2_b}}y_b \\\\ \text{var}[\hat x] &= \frac{1}{\frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2}} \quad < \min(\sigma_a^2, \sigma_b^2) \end{align*} $$


This result shows that fusing two noisy estimates reduces uncertainty: the variance of the combined estimate is strictly smaller than that of either individual measurement.



5.2. Recursive Bayesian Update

Now let's view this fusion process from a recursive Bayesian perspective: we start with one measurement and then update our belief after observing the second.


Step 1: Prior from Sensor A

Suppose we first receive a measurement from sensor A:


$$ \hat{x}_1 = y_a, \qquad \hat{\sigma}_1^2 = \sigma_a^2 $$


This forms our initial belief about $x$, with a corresponding uncertainty $\hat{\sigma}_1^2$.


Step 2: Update with Sensor B

We then incorporate sensor B's measurement $y_b$, which has its own uncertainty $\sigma_b^2$.

The updated estimate becomes:


$$ \hat{x}_2 = (1 - K) \hat{x}_1 + K y_b, \quad K = \frac{ \frac{1}{\sigma_b^2} }{ \frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2} } $$


This can be interpreted as:


$$ \hat{x}_2 = \hat{x}_1 + K (y_b - \hat{x}_1) $$


Interpretation

This recursive form illustrates the essence of Bayesian updating:

  • $\hat{x}_1$ is the prior estimate

  • $y_b - \hat{x}_1$ is the innovation or prediction error (difference between new observation and prior)

  • $K$ determines how much we trust the new observation


This equation has the form:

$$ \hat{x}_{\text{new}} = \text{prior} + \text{gain} \times \text{prediction error} $$


The more confident we are in sensor B (i.e., smaller $\sigma_b^2$), the larger the gain $K$ and the more we adjust the prior.


This recursive update structure forms the basis of the Kalman filter, a powerful tool for sequential state estimation in dynamic systems.
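
A short numerical sketch of this two-step update, using arbitrary illustrative values, confirms that the recursive estimate matches the batch fusion result from Section 5.1.

In [ ]:
# recursive Bayesian update for two sensors
import numpy as np

np.random.seed(1)

x_true = 5.0
sigma_a, sigma_b = 1.0, 2.0               # illustrative noise levels

ya = x_true + np.random.normal(0, sigma_a)
yb = x_true + np.random.normal(0, sigma_b)

# Step 1: prior from sensor A
xhat = ya
var = sigma_a**2

# Step 2: update with sensor B
K = (1/sigma_b**2)/(1/var + 1/sigma_b**2)
xhat = xhat + K*(yb - xhat)               # prior + gain x prediction error
var = 1/(1/var + 1/sigma_b**2)

# batch fusion for comparison
xhat_batch = (ya/sigma_a**2 + yb/sigma_b**2)/(1/sigma_a**2 + 1/sigma_b**2)

print(xhat, xhat_batch)                   # identical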



5.3. Bayesian Kalman Filter

The Kalman filter is a recursive Bayesian estimator for linear dynamical systems with Gaussian noise. It formalizes the intuition we saw in sensor fusion: we start with a belief, predict based on system dynamics, and update that belief when a new observation arrives.



State Dynamics

$$ x_{t+1} = A x_t + B u_t + \omega_t, \quad \omega_t \sim \mathcal{N}(0, Q) $$


Measurement Model

$$ y_t = C x_t + v_t, \quad v_t \sim \mathcal{N}(0, R) $$


All variables are assumed Gaussian, and noise terms are independent:

  • $x_t$: system state

  • $u_t$: control input

  • $y_t$: observed measurement

  • $\omega_t$: process noise

  • $v_t$: observation noise



Kalman Filtering as Recursive Bayesian Inference

At each time step, the Kalman filter performs two operations:


(1) Prediction Step (Prior Update)

We use the system model to project the prior belief forward in time:


Predicted mean:

$$ \hat{x}_{t \mid t-1} = A \hat{x}_{t-1} + B u_{t-1} $$


Predicted covariance:

$$ \Sigma_{t \mid t-1} = A \Sigma_{t-1} A^T + Q $$


This gives us the prior distribution before seeing the new measurement:


$$ p(x_t \mid y_{1:t-1}) = \mathcal{N}(\hat{x}_{t \mid t-1}, \Sigma_{t \mid t-1}) $$


(2) Correction Step (Posterior Update)

When a new measurement $y_t$ arrives, we update the prediction:


Kalman gain:

$$ K_t = \Sigma_{t \mid t-1} C^T (C \Sigma_{t \mid t-1} C^T + R)^{-1} $$


Updated mean:

$$ \hat{x}_t = \hat{x}_{t \mid t-1} + K_t \left( y_t - C \hat{x}_{t \mid t-1} \right) $$


Updated covariance:

$$ \Sigma_t = \Sigma_{t \mid t-1} - K_t C \Sigma_{t \mid t-1} $$


This gives the posterior distribution:

$$ p(x_t \mid y_{1:t}) = \mathcal{N}(\hat{x}_t, \Sigma_t) $$


Summary of Bayesian Kalman Filter

  • Maintains a Gaussian belief over the system state
  • Alternates between prediction (using dynamics) and correction (using measurements)
  • Each step is computationally efficient (closed-form update)
  • The Kalman gain $K_t$ balances trust between the model prediction and the new data
  • Covariance decreases over time as more information is accumulated
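
To make the two-step recursion concrete, here is a minimal scalar (1D) sketch. All system parameters below are illustrative assumptions, and the control input $B u_t$ is set to zero.

In [ ]:
# minimal 1D (scalar) Kalman filter sketch
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2)

# scalar system: x_{t+1} = a x_t + w_t,  y_t = c x_t + v_t  (no control input)
a, c = 1.0, 1.0
Q, R = 0.01, 1.0                  # process / measurement noise variances
T = 100

# simulate true state and measurements
x = np.zeros(T)
y = np.zeros(T)
x[0] = 5.0
y[0] = c*x[0] + np.random.normal(0, np.sqrt(R))
for t in range(1, T):
    x[t] = a*x[t-1] + np.random.normal(0, np.sqrt(Q))
    y[t] = c*x[t] + np.random.normal(0, np.sqrt(R))

# Kalman filter: alternate prediction and correction
xhat = np.zeros(T)
P = np.zeros(T)
xhat[0], P[0] = 0.0, 10.0         # initial (vague) belief

for t in range(1, T):
    # prediction step (prior update)
    x_pred = a*xhat[t-1]
    P_pred = a*P[t-1]*a + Q
    # correction step (posterior update)
    K = P_pred*c/(c*P_pred*c + R)
    xhat[t] = x_pred + K*(y[t] - c*x_pred)
    P[t] = P_pred - K*c*P_pred

plt.figure(figsize = (6, 4))
plt.plot(y, 'r.', label = 'measurements')
plt.plot(x, 'k', label = 'true state')
plt.plot(xhat, 'b', label = 'Kalman estimate')
plt.grid(alpha = 0.3)
plt.legend()
plt.show()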


5.4. Object Tracking in Computer Vision

  • Lecture: Introduction to Computer Vision by Prof. Aaron Bobick at Georgia Tech
In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('5yUjYCkm2jI?rel=0', width = "560", height = "315")
Out[ ]: