Parameter Estimation in Probabilistic Models

1. Probability Density Estimation

In many practical situations, we do not know the exact probability distribution of the data. Instead, we are given a collection of samples:


$$y_1, y_2, \cdots, y_m$$


Our goal is to estimate the underlying probability density function (PDF) that most likely generated this dataset. This process is called probability density estimation, and it forms the foundation of statistical modeling, inference, and machine learning.


When we observe a finite set of data points, we assume that they are drawn from some unknown probability distribution. One common approach is to model this unknown distribution using a parametric form, such as a Gaussian.


Picture three colored curves (1), (2), and (3), each representing a Gaussian probability density function with different parameters. Our goal is to find the Gaussian distribution that most likely generated the observed samples.



This naturally leads to a parameter estimation problem: we want to estimate the values of $\mu$ and $\sigma^2$ that best explain the observed data.

To estimate these parameters, we define a likelihood function that quantifies how likely it is to observe the data, given a specific choice of parameters.



1.1. Given $m$ Data Points Drawn from a Gaussian Distribution



Assume each data point $y_i$ is drawn from a normal distribution with unknown parameters $\mu$ and $\sigma^2$:


$$ P(y) = \mathcal{N}(y;\, \mu, \sigma^2) $$


We are given $m$ i.i.d. samples drawn from this distribution:


$$ y_1, y_2, \cdots, y_m \sim P(y) $$


Step 1: Likelihood Function

The probability density of a single data point is:


$$ P(y = y_i \,;\, \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{1}{2\sigma^2} (y_i - \mu)^2 \right) \quad \text{(generative model)} $$


Thus, the likelihood of observing all $m$ samples is:


$$ \begin{align*} \mathcal{L}(\mu, \sigma^2) &= \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi} \sigma} \exp\left( -\frac{1}{2\sigma^2} (y_i - \mu)^2 \right) \\\\ &= \frac{1}{(2\pi)^{m/2} \sigma^m} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - \mu)^2 \right) \end{align*} $$


Step 2: Log-Likelihood

To simplify optimization, we take the natural logarithm of the likelihood:


$$ \ell(\mu, \sigma^2) = \log \mathcal{L} = -\frac{m}{2} \log 2\pi - m \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - \mu)^2 $$


Step 3: Maximum Likelihood Estimation (MLE)

To find the best estimates of $\mu$ and $\sigma^2$, we maximize $\ell$:


$$ \frac{\partial \ell}{\partial \mu} = 0, \qquad \frac{\partial \ell}{\partial \sigma} = 0 $$


Compute each derivative:


$$ \begin{align*} \frac{\partial \ell}{\partial \mu} &= \frac{1}{\sigma^2} \sum_{i=1}^{m} (y_i - \mu) = 0 \quad \Rightarrow \quad \mu_{\text{MLE}} = \frac{1}{m} \sum_{i=1}^{m} y_i \quad \text{(sample mean)} \\\\ \frac{\partial \ell}{\partial \sigma} &= -\frac{m}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{m} (y_i - \mu)^2 = 0 \quad \Rightarrow \quad \sigma^2_{\text{MLE}} = \frac{1}{m} \sum_{i=1}^{m} (y_i - \mu)^2 \quad \text{(sample variance)} \end{align*} $$


Key Observations

  • The maximum likelihood estimate of $\mu$ is the sample mean
  • The maximum likelihood estimate of $\sigma^2$ is the sample variance
  • These results justify the intuitive practice of using the mean and variance as descriptors of data

A surprising and powerful result:
Even if the data isn't Gaussian, the distribution of the sample mean becomes approximately Gaussian by the central limit theorem as $m$ grows large.
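
A quick numerical check of this claim. The sketch below is illustrative only: the exponential source distribution, the sample size, and the number of trials are arbitrary choices, not part of the derivation above.

In [ ]:
# Central limit theorem demo:
# sample means of non-Gaussian (exponential) data look approximately Gaussian
import numpy as np
import matplotlib.pyplot as plt

m = 100           # samples per trial
trials = 5000     # number of sample means to collect

means = [np.mean(np.random.exponential(1.0, m)) for _ in range(trials)]

plt.figure(figsize = (6, 4))
plt.hist(means, 50, density = True)
plt.title('Distribution of the sample mean (exponential data)')
plt.grid(alpha = 0.3)
plt.show()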



1.2. Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation is a general method used in statistics to estimate parameters of a model by maximizing the likelihood function.


Definition

The likelihood function is the probability (or density) of observing the data $D$ given a model with parameters $\theta$:


$$ \mathcal{L}(\theta) = P(D \mid \theta) = P(D \,;\, \theta) $$


The MLE is defined as:


$$ \theta_{\text{MLE}} = \arg \max_{\theta} \;\; P(D \,;\, \theta) $$


Interpretation

  • You assume the data is generated by a probabilistic model with unknown parameters
  • You find the parameters $\theta$ that make the observed data most probable
  • This approach is model-driven, consistent with generative modeling

Example (Gaussian)

If you assume the data is Gaussian:


$$ \mu_{\text{MLE}} = \frac{1}{m} \sum_{i=1}^{m} x_i $$


This is the result we derived earlier and serves as a canonical example of how MLE works in practice.


In [ ]:
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
# MLE of Gaussian distribution
# mu

m = 20
mu = 0
sigma = 5

x = np.random.normal(mu,sigma,[m,1])
xp = np.linspace(-20, 20, 100)
y0 = np.zeros([m, 1])

muhat = [-5, 0, 5, np.mean(x)]

plt.figure(figsize = (6, 6))

for i in range(4):
    yp = norm.pdf(xp, muhat[i], sigma)
    y = norm.pdf(x, muhat[i], sigma)
    logL = np.sum(np.log(y))

    plt.subplot(4, 1, i+1)
    plt.plot(xp, yp, 'r')
    plt.plot(x, y, 'bo')
    plt.plot(np.hstack([x, x]).T, np.hstack([y, y0]).T, 'k--')

    plt.title(r'$\hat\mu$ = {0:.2f}'.format(muhat[i]))
    plt.text(-15,0.06,np.round(logL,4))
    plt.axis([-20, 20, 0, 0.11])

plt.tight_layout()
plt.show()

Compare to the result from the formula:

$$\mu_{MLE}=\frac{1}{m}\sum_{i = 1}^{m}x_i$$

In [ ]:
# mean is unknown in this example
# variance is known in this example

m = 10
mu = 0
sigma = 5

x = np.random.normal(mu,sigma,[m,1])

mus = np.arange(-10, 10.5, 0.5)
LOGL = []

for i in range(np.size(mus)):
    y = norm.pdf(x, mus[i], sigma)
    logL = np.sum(np.log(y))
    LOGL.append(logL)

muhat = np.mean(x)
print(muhat,'\n')

plt.figure(figsize = (6, 4))
plt.plot(mus, LOGL, '.')
plt.title(r'$\log (\prod \mathcal{N}(x \mid \mu , \sigma^2))$')
plt.xlabel(r'$\hat \mu$')
plt.grid(alpha = 0.3)
plt.show()
1.018080164522729 


$$\sigma_{MLE}^2=\frac{1}{m}\sum\limits_{i=1}^m(y_i-\mu)^2$$

In [ ]:
# mean is known in this example
# variance is unknown in this example

m = 100
mu = 0
sigma = 3

x = np.random.normal(mu,sigma,[m,1])  # samples

sigmas = np.arange(1, 10, 0.1)
LOGL = []

for i in range(sigmas.shape[0]):
    y = norm.pdf(x, mu, sigmas[i])     # likelihood
    logL = np.sum(np.log(y))
    LOGL.append(logL)

sigmahat = np.sqrt(np.var(x))
print(sigmahat)

plt.figure(figsize = (6, 4))
plt.title(r'$\log (\prod \mathcal{N} (x|\mu,\sigma^2))$')
plt.plot(sigmas, LOGL, '.')
plt.xlabel(r'$\hat \sigma$')
plt.axis([0, np.max(sigmas), np.min(LOGL), -200])
plt.grid(alpha = 0.3)
plt.show()
3.0438236510057792




2. Data Fusion with Uncertainties

From Learning Theory (Reza Shadmehr, Johns Hopkins University)

  • Lecture 6

In many real-world systems, measurements are collected from multiple sensors, each of which may be noisy or imprecise. An important question is: How can we combine noisy measurements from multiple sensors to produce a more accurate estimate?


This question leads us to the concept of data fusion - combining observations while accounting for their associated uncertainties.


2.1. Two Noisy Sensors

Assume we are trying to estimate a true quantity $x$ using two sensors, each affected by additive Gaussian noise:



$$\begin{align*} y_a &= x + \varepsilon_a, \quad \varepsilon_a \sim \mathcal{N}\left(0,\sigma^2_a\right) \\ y_b &= x + \varepsilon_b, \quad \varepsilon_b \sim \mathcal{N}\left(0,\sigma^2_b\right) \end{align*}$$


Each sensor provides a noisy measurement of the same true value $x$, but with possibly different noise levels.


Matrix Formulation

We rewrite the two measurements in vector form:


$$ y = \begin{bmatrix} y_a \\ y_b \end{bmatrix} = Cx + \varepsilon = \begin{bmatrix} 1 \\ 1 \end{bmatrix} x + \begin{bmatrix} \varepsilon_a \\ \varepsilon_b \end{bmatrix}, \quad \varepsilon \sim \mathcal{N}(0, R) $$


The noise covariance matrix is:


$$ R = \begin{bmatrix} \sigma_a^2 & 0 \\ 0 & \sigma_b^2 \end{bmatrix} $$


Thus, the likelihood model for $y$ given $x$ is:


$$ y \mid x \sim \mathcal{N}(Cx, R) $$


Written out in full:


$$ P(y \mid x) = \frac{1}{\sqrt{(2\pi)^2 |R|}} \exp\left( -\frac{1}{2} (y - Cx)^T R^{-1} (y - Cx) \right) $$


Goal: Estimate $\hat{x}$ from $y$

We aim to find the maximum likelihood estimate (MLE) of $x$, given the observation $y$.


Step 1: Log-Likelihood

The negative log-likelihood is:


$$ \ell(x) = \text{const} + \frac{1}{2} (y - Cx)^T R^{-1} (y - Cx) $$


We minimize this with respect to $x$.


Step 2: Expand the Quadratic Term


$$ \begin{align*} (y - Cx)^T R^{-1} (y - Cx) &= y^T R^{-1} y - 2 x^T C^T R^{-1} y + x^T C^T R^{-1} C x \end{align*} $$


Step 3: Set Gradient to Zero

Differentiating with respect to $x$ and setting the derivative to zero:


$$ \frac{d\ell}{dx} = -C^T R^{-1} y + C^T R^{-1} C x = 0 $$


Solving for $\hat{x}$:


$$ \hat{x} = \left(C^T R^{-1} C\right)^{-1} C^T R^{-1} y $$


This is the weighted least squares estimator, where weights are determined by the inverse noise variances.


Analytical Evaluation

We compute the matrix terms explicitly:


$$ \begin{align*} C &= \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad R^{-1} = \begin{bmatrix} \frac{1}{\sigma_a^2} & 0 \\ 0 & \frac{1}{\sigma_b^2} \end{bmatrix} \\\\ C^T R^{-1} C &= \frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2} \\\\ C^T R^{-1} &= \begin{bmatrix} \frac{1}{\sigma_a^2} & \frac{1}{\sigma_b^2} \end{bmatrix} \end{align*} $$


Plug into the estimator:


$$ \hat{x} = \left( \frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2} \right)^{-1} \left( \frac{1}{\sigma_a^2} y_a + \frac{1}{\sigma_b^2} y_b \right) $$


This gives the weighted average of the two measurements, where the weights are inversely proportional to their variances.
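
As a sanity check, the matrix estimator and the scalar weighted average can be compared numerically. Below is a minimal sketch; the measurement values and noise levels are arbitrary illustrative choices.

In [ ]:
# verify: matrix WLS estimator == precision-weighted average
import numpy as np

sigma_a, sigma_b = 1.0, 2.0       # illustrative noise levels
ya, yb = 4.8, 5.3                 # two hypothetical noisy measurements

# matrix form: xhat = (C^T R^-1 C)^-1 C^T R^-1 y
C = np.array([[1.0], [1.0]])
R = np.diag([sigma_a**2, sigma_b**2])
y = np.array([[ya], [yb]])
Rinv = np.linalg.inv(R)
xhat_matrix = np.linalg.inv(C.T @ Rinv @ C) @ C.T @ Rinv @ y

# scalar form: weights inversely proportional to variances
wa, wb = 1/sigma_a**2, 1/sigma_b**2
xhat_scalar = (wa*ya + wb*yb)/(wa + wb)

print(xhat_matrix.item(), xhat_scalar)   # identical values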


Variance of the Estimate

We now calculate the variance of the fused estimate $\hat{x}$:


$$ \text{var}(\hat{x}) = \left(C^T R^{-1} C\right)^{-1} = \frac{1}{\frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2}} $$


Why?


$$\begin{align*} \text{var}\left(\hat{x}\right) &= \left(\left(C^TR^{-1}C \right)^{-1}C^TR^{-1}\right) \cdot \text{var}(y) \cdot \left(\left(C^TR^{-1}C \right)^{-1}C^TR^{-1}\right)^T\\ &= \left(\left(C^TR^{-1}C \right)^{-1}C^TR^{-1}\right) \cdot R \cdot \left(\left(C^TR^{-1}C \right)^{-1}C^TR^{-1}\right)^T\\ &= \left(C^TR^{-1}C \right)^{-1}C^T \left(R^{-1} \right)^TC \left(\left(C^TR^{-1}C \right)^{-1} \right)^T \\ &= \underbrace{\left(C^TR^{-1}C\right)^{-1}}\,\underbrace{C^TR^{-1}C}\,\left(\left(C^TR^{-1}C\right)^{-1}\right)^T = \left(C^TR^{-1}C\right)^{-1}\\\\ &= \frac{1}{\frac{1}{\sigma^2_a}+\frac{1}{\sigma^2_b}} \end{align*}$$


Note:


$$ \text{var}(\hat{x}) \leq \min(\sigma_a^2, \sigma_b^2) $$


This inequality reflects the benefit of combining information: the estimate is more precise than either of the individual sensors.


Summary


$$ \begin{align*} \hat{x} &= \frac{ \frac{1}{\sigma_a^2} y_a + \frac{1}{\sigma_b^2} y_b } { \frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2} } = \alpha_a y_a + \alpha_b y_b \quad \text{(weighted average)} \\\\ \text{var}(\hat{x}) &= \frac{1}{ \frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2} } \end{align*} $$


Where:


$$ \alpha_a = \frac{ \frac{1}{\sigma_a^2} }{ \frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2} }, \quad \alpha_b = \frac{ \frac{1}{\sigma_b^2} }{ \frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2} } $$




Big Lesson

  • Two sensors are better than one
  • Even if both sensors are noisy, fusing them reduces overall uncertainty
  • Sensor accuracy matters - better sensors get more weight
  • This principle is at the core of probabilistic robotics, Kalman filtering, sensor networks, and machine learning models with uncertainty


2.2. Example: Two Rulers (1D Sensor Fusion in the Brain)


Imagine you are estimating the length of an object using two different sources of sensory information:

  • Visual measurement: looking at the object
  • Haptic measurement: feeling the object with eyes closed


Each of these measurements is affected by Gaussian noise. Let the true length be $x$, and the measurements be:


$$ \begin{align*} y_{\text{vision}} = y_v &= x + \varepsilon_v, \quad \varepsilon_v \sim \mathcal{N}(0, \sigma_v^2) \\ y_{\text{haptic}} = y_h &= x + \varepsilon_h, \quad \varepsilon_h \sim \mathcal{N}(0, \sigma_h^2) \end{align*} $$


Where:

  • $x$: the true length of the object
  • $y_v$: measurement from vision (visual ruler)
  • $y_h$: measurement from touch (haptic ruler)
  • $\sigma_v^2$, $\sigma_h^2$: variances associated with visual and haptic uncertainty, respectively

Each sensor independently observes the same true value $x$ with its own uncertainty. Our goal is to estimate $x$ based on both observations.


Optimal Fusion

The optimal estimate $\hat{x}$ that fuses both noisy measurements is:


$$ \hat{x} = \frac{ \frac{1}{\sigma_v^2} y_v + \frac{1}{\sigma_h^2} y_h } { \frac{1}{\sigma_v^2} + \frac{1}{\sigma_h^2} } $$


More reliable measurements (i.e., smaller variance) receive greater weight.

  • If $\sigma_v^2 \ll \sigma_h^2$, visual information is more reliable, and $\hat{x} \approx y_{\text{vision}}$

  • If both variances are equal, then $\hat{x}$ becomes the average: $\hat{x} = \frac{y_{\text{vision}} + y_{\text{haptic}}}{2}$


Variance of the Fused Estimate

The uncertainty (variance) of the fused estimate is:


$$ \text{var}(\hat{x}) = \frac{1}{ \frac{1}{\sigma_v^2} + \frac{1}{\sigma_h^2} } < \min(\sigma_v^2, \sigma_h^2) $$


This is always less than either individual variance, showing that fusing two noisy measurements leads to a more accurate (less uncertain) estimate.


Neuroscience Perspective

This model reflects how the human brain integrates sensory input:

  • It doesn't rely solely on one modality

  • It combines information in a statistically optimal way (approximately)

  • Uncertainty matters: The brain gives more weight to more reliable sources


This example beautifully illustrates how probabilistic reasoning and Bayesian estimation naturally arise in both artificial systems (sensor fusion) and biological ones (perception).


1D Fusion Example

In [ ]:
x = 5       # true state (length in this example)
a = 1       # sigma of a
b = 2       # sigma of b

YA = []
YB = []
XML = []

for i in range(2000):
    ya = x + np.random.normal(0,a)
    yb = x + np.random.normal(0,b)
    xml = (1/a**2*ya + 1/b**2*yb)/(1/a**2+1/b**2)
    YA.append(ya)
    YB.append(yb)
    XML.append(xml)

plt.figure(figsize = (6, 5))
labels = [r'$y_a$', r'$y_b$', r'$x_{ML}$']

for i, data in enumerate([YA, YB, XML]):
    plt.subplot(3, 1, i+1)
    plt.hist(data, 41, density = True)
    plt.axis([0, 10, 0, 0.5])
    plt.title(labels[i])
    plt.grid(alpha = 0.3)
plt.tight_layout()
plt.show()

2D Fusion Example

In [ ]:
x = np.array([5, 10]).reshape(-1, 1)    # true position

mu = np.array([0, 0])
Ra = np.array([[9, 1],
               [1, 1]])
Rb = np.array([[1, 1],
               [1, 9]])

Ra_inv = np.linalg.inv(Ra)
Rb_inv = np.linalg.inv(Rb)

YA = []
YB = []
XML = []

for i in range(1000):
    ya = x + np.random.multivariate_normal(mu, Ra).reshape(-1, 1)
    yb = x + np.random.multivariate_normal(mu, Rb).reshape(-1, 1)
    xml = np.linalg.inv(Ra_inv + Rb_inv) @ (Ra_inv @ ya + Rb_inv @ yb)
    YA.append(ya.T)
    YB.append(yb.T)
    XML.append(xml.T)

YA = np.vstack(YA)
YB = np.vstack(YB)
XML = np.vstack(XML)
In [ ]:
plt.figure(figsize = (6, 4))
plt.title('Data Fusion')
plt.plot(YA[:,0], YA[:,1], 'b.', label='Observation 1')
plt.plot(YB[:,0], YB[:,1], 'r.', label='Observation 2')
plt.plot(XML[:,0], XML[:,1], 'k.', label='MLE')
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.legend()
plt.show()




3. Maximum-a-Posteriori Estimation (MAP)

3.1. Think Differently

In many traditional estimation settings, we are given a dataset $D = \{d_1, d_2, \cdots, d_m\}$.



We aim to estimate the underlying probability distribution $P(D)$ that generated the data. Once $P(D)$ is estimated, a common prediction for the next observation $d_{m+1}$ is simply the expected value:


$$ d_{m+1} = E[D] $$


But What If the Data Comes from a Hidden Parameter?

Suppose the data is generated not directly, but through a hidden variable $\theta$. In this case:

  • $\theta$ is unknown
  • We assume $\theta$ itself is a random variable



The assumption is now that the data points are conditionally independent given $\theta$:


$$ d_i \sim P(d_i \mid \theta) $$


Our goal shifts from estimating $P(D)$ to inferring the hidden parameter $\theta$ that explains the data.


Key Perspective

We no longer treat $\theta$ as a fixed constant to be estimated (as in MLE), but rather as a random variable with its own prior distribution:


$$ \theta \sim P(\theta) $$


Upon observing data $D$, we apply Bayes' rule to update our belief about $\theta$:


$$ P(\theta \mid D) = \frac{P(D \mid \theta) \cdot P(\theta)}{P(D)} $$


This framework is the foundation of Bayesian inference, where we do not merely look for the best-fit model, but update our belief about parameters based on evidence.


In Maximum Likelihood Estimation (MLE), we choose the parameter $\theta$ that maximizes the likelihood of the observed data. In contrast, Maximum-a-Posteriori (MAP) estimation incorporates both the observed data and our prior belief about $\theta$.


MAP as Posterior Maximization

We aim to choose $\theta$ that maximizes the posterior probability given the data $D$:


$$ \theta_{\text{MAP}} = \mathrm{arg} \max_{\theta} \; P(\theta \mid D) $$


By Bayes' rule, the posterior is expressed as:


$$ P(\theta \mid D) = \frac{P(D \mid \theta) \cdot P(\theta)}{P(D)} $$


  • $P(\theta)$: Prior distribution over parameters (before observing data)
  • $P(D \mid \theta)$: Likelihood of the observed data under parameter $\theta$
  • $P(D)$: Marginal likelihood (a normalizing constant)

Since $P(D)$ does not depend on $\theta$, we can ignore it in the optimization:


$$ \theta_{\text{MAP}} = \mathrm{arg} \max_{\theta} \; P(D \mid \theta) \cdot P(\theta) $$


Logarithmic Form

Maximizing the log-posterior simplifies the computation:


$$ \theta_{\text{MAP}} = \mathrm{arg} \max_{\theta} \; \log P(D \mid \theta) + \log P(\theta) $$


This is just like MLE with an additional regularization term $\log P(\theta)$ that encodes prior belief.


For Independent Observations

If the data consists of $m$ i.i.d. observations $D = \{d_1, d_2, \dots, d_m\}$, the MAP estimator becomes:


$$ \theta_{\text{MAP}} = \mathrm{arg} \max_{\theta} \; \left\{ \sum_{i=1}^m \log P(d_i \mid \theta) + \log P(\theta) \right\} $$


Compare this to MLE:


$$ \theta_{\text{MLE}} = \mathrm{arg} \max_{\theta} \; \sum_{i=1}^m \log P(d_i \mid \theta) $$


Key Insight

  • MAP is similar to MLE but incorporates a prior belief about the parameter $\theta$
  • The prior acts like an additional observation or constraint
  • MAP estimation leads to regularized solutions

$$ \begin{align*} \text{MAP:} \quad & \theta_{\text{MAP}} = \mathrm{arg} \max_{\theta} \; \log P(D \mid \theta) + \log P(\theta) \\\\ \text{MLE:} \quad & \theta_{\text{MLE}} = \mathrm{arg} \max_{\theta} \; \log P(D \mid \theta) \end{align*} $$


Takeaway: MAP provides a principled way to incorporate prior knowledge into parameter estimation.



3.2. MAP for a Univariate Gaussian

Suppose we observe data points:

$$D = \{x_1, x_2, \cdots, x_m\}$$


and assume that each sample is drawn independently from a Gaussian distribution with unknown mean $\theta$ and known variance $\sigma^2$:


$$ x_i \sim \mathcal{N}(\theta,\sigma^2) $$


Now suppose that we also have a prior belief about the value of $\theta$:


$$ \theta \sim \mathcal{N}(\mu, 1^2) $$


That is, our prior belief is that $\theta$ is normally distributed with mean $\mu$ and variance $1$.



Step 1: Joint Probability

The joint probability of the data given $\theta$ (i.e., the likelihood) is:


$$ P(x_1,x_2,\cdots,x_m \mid \theta) = \prod\limits_{i=1}^m P(x_i \mid \theta) $$


Step 2: MAP Estimation

The MAP estimate maximizes the posterior probability of $\theta$:


$$ \theta_{MAP} = \mathrm{arg} \max_{\theta} \; P(\theta \mid D) = \mathrm{arg} \max_{\theta} \; P(D \mid \theta)P(\theta) $$


Taking the logarithm:


$$ \theta_{MAP} = \mathrm{arg} \max_{\theta} \; \left\{\log P(D \mid \theta) + \log P(\theta)\right\} $$


Step 3: Compute Gradients

We proceed by taking the derivative with respect to $\theta$:


  • For the likelihood term:

$$ \frac{\partial}{\partial \theta} \left(\log P(D \mid \theta)\right) = \frac{1}{\sigma^2} \left( \sum_{i=1}^m x_i - m\theta \right) $$


  • For the prior term:

$$ \frac{\partial}{\partial \theta} \left(\log P(\theta)\right) = \mu - \theta $$


Step 4: Set Derivative to Zero

We combine both terms and set the derivative to zero:


$$ \frac{1}{\sigma^2} \sum_{i=1}^m x_i + \mu - \left(\frac{m}{\sigma^2} + 1\right)\theta^* = 0 $$


Solving for $\theta^*$:


$$ \theta^* = \frac{\frac{1}{\sigma^2} \sum_{i=1}^m x_i + \mu}{\frac{m}{\sigma^2} + 1} $$


Thus, the MAP estimator for $\theta$ is:


$$ \theta_{MAP} = \frac{\frac{m}{\sigma^2}}{\frac{m}{\sigma^2}+1}\bar{x} + \frac{1}{\frac{m}{\sigma^2}+1}\mu $$


This is a weighted average of the sample mean $\bar{x}$ and the prior mean $\mu$.



Interpretation

The MAP estimate blends data and prior belief based on their respective uncertainties:

  • $\bar{x} = \frac{1}{m} \sum x_i$: sample mean from observed data

  • $\mu$: prior guess of $\theta$

  • The weight on $\bar{x}$ increases as the number of samples $m$ increases or as noise variance $\sigma^2$ decreases

  • The prior acts like a virtual data point with unit variance


Analogy

  • $\mu$ = 1st observation (prior), drawn from $\mathcal{N}(\mu, 1)$
  • $\bar{x}$ = 2nd observation (data), drawn from $\mathcal{N}(\theta, \frac{\sigma^2}{m})$

MAP estimation is like combining two observations with known variances.


Big Lesson

A prior behaves like additional data. It stabilizes the estimate, especially when observed data is limited or noisy. As more data is collected, the influence of the prior diminishes.


Note:

Prior knowledge can come from various sources such as:

  • Education: e.g., learning that volume is not always proportional to weight
  • Experience: accumulated through interactions over time
  • Aging: as people grow older, they refine priors based on repeated sensory outcomes

Example: Perceiving Object Weight via Multiple Senses

You're presented with two objects of different sizes and asked: "Which one is heavier?"

We consider four different sensory conditions:

  • Without Touching and with Eyes Closed
  • Only Visual Inspection
  • Only Haptic (Touch) Inspection
  • Both Visual and Haptic Inspection

(1) Without Touching and with Eyes Closed

  • No sensory input (neither visual nor haptic)
  • You have to guess purely based on prior knowledge or random chance

(2) Only Visual Inspection

  • You see the objects but do not touch them
  • You likely assume the larger object is heavier
  • This is guided by a prior based on daily experience (volume $\propto$ mass)
  • Outcome: You likely say "the left one is heavier"

(3) Only Haptic (Touch) Inspection

  • Eyes are closed, but you feel the objects
  • No visual bias is introduced
  • This engages the haptic sensory system, which may have higher precision for weight
  • Outcome: You may correctly identify which is heavier if your sense of touch is reliable

(4) Both Visual and Haptic Inspection

  • You integrate visual and haptic signals
  • The brain uses a Bayesian strategy to weight each signal inversely to its uncertainty


3.3. MAP in Python


Suppose $\theta$ is a random variable with prior distribution:


$$\theta \sim \mathcal{N}(\mu, 1^2)$$


We observe data drawn from a Gaussian distribution with unknown mean $\theta$ and known variance $\sigma^2$:


$$ x_i \sim \mathcal{N}(\theta, \sigma^2) $$


This setup corresponds to estimating the mean of a univariate Gaussian using MAP, where we incorporate prior knowledge about $\theta$ centered at $\mu$.


In [ ]:
# known
mu = 5
sigma = 2

# unknown theta
theta = np.random.normal(mu,1)
x = np.random.normal(theta, sigma)

print('theta = {:.4f}\n'.format(theta))
print('x = {:.4f}'.format(x))
theta = 5.4504

x = 4.7356

$$\theta_{MAP} = \frac{\frac{m}{\sigma^2}}{\frac{m}{\sigma^2}+1}\bar{x}+\frac{1}{\frac{m}{\sigma^2}+1}\mu $$

In [ ]:
# MAP

m = 4
X = np.random.normal(theta,sigma,[m,1])

xbar = np.mean(X)
theta_MAP = m/(m+sigma**2)*xbar + sigma**2/(m+sigma**2)*mu

print('mu = 5\n')
print('xbar = {:.4f}\n'.format(xbar))
print('theta_MAP = {:.4f}'.format(theta_MAP))
mu = 5

xbar = 4.3017

theta_MAP = 4.6508
In [ ]:
# theta
mu = 5
theta = np.random.normal(mu,1)

sigma = 2
m = 2000

X = np.random.normal(theta,sigma,[m,1])

print('theta = {:.4f}\n'.format(theta))
plt.figure(figsize = (6, 4))
plt.hist(X, 31)
plt.xlim([-5, 15])
plt.show()
theta = 3.6052

In [ ]:
m = 50
XMLE = []
XMAP = []

for k in range(m):
    n = k + 1                                   # number of samples used so far
    xmle = np.mean(X[0:n,0])
    xmap = n/(n + sigma**2)*xmle + sigma**2/(n + sigma**2)*mu
    XMLE.append(xmle)
    XMAP.append(xmap)

print('theta = {:.4f}\n'.format(theta))
plt.figure(figsize = (6,4))
plt.plot(XMLE)
plt.plot(XMAP)
plt.legend(['MLE','MAP'])
plt.show()
theta = 3.6052





4. Linear Measurement with Noise

In many estimation problems, we deal with measurements contaminated by noise. A common setup is:


$$y = Ax + \omega$$


Where:

  • $x$ is the true quantity we wish to estimate.

  • $y$ is the measurement.

  • $A$ is a matrix representing the measurement model or the sensor's characteristics.

  • $\omega$ represents sensor noise



We assume that both $x$ and $\omega$ are normally distributed:


$$x \sim \mathcal{N}(\mu_x, \Sigma_x) \quad \rightarrow \quad \text{for simplicity } x \sim \mathcal{N}(0, \Sigma_x)$$

$$\omega \sim \mathcal{N}(\mu_{\omega}, \Sigma_{\omega}) \quad \rightarrow \quad \text{for simplicity } \omega \sim \mathcal{N}(0, \Sigma_{\omega})$$


Further, we assume $x$ and $\omega$ are independent. The joint distribution of $x$ and $\omega$ is:


$$\begin{bmatrix} x \\ \omega \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_x \\ \mu_{\omega}\end{bmatrix}, \begin{bmatrix} \Sigma_{x}& 0\\ 0& \Sigma_{\omega} \end{bmatrix} \right)$$


Since $y = A x + \omega$, the joint distribution of $x$ and $y$ becomes:


$$\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} I & 0 \\A & I \end{bmatrix} \begin{bmatrix} x \\ \omega \end{bmatrix}$$


Therefore,


$$ \begin{align*} \begin{bmatrix} x \\ y \end{bmatrix} &\sim \mathcal{N}\left( \begin{bmatrix} I & 0 \\A & I \end{bmatrix} \begin{bmatrix} \mu_x \\ \mu_{\omega}\end{bmatrix}, \begin{bmatrix} I & 0 \\A & I \end{bmatrix} \begin{bmatrix} \Sigma_{x}& 0\\ 0& \Sigma_{\omega} \end{bmatrix} \begin{bmatrix} I & 0 \\A & I \end{bmatrix}^T \right) \\\\ &\sim \mathcal{N}\left( \begin{bmatrix} \mu_x \\ A \mu_x + \mu_{\omega}\end{bmatrix}, \begin{bmatrix} \Sigma_{x}& \Sigma_x A^T\\ A\Sigma_x& A\Sigma_xA^T + \Sigma_{\omega} \end{bmatrix} \right) = \mathcal{N} \left( \begin{bmatrix} \mu_x \\ \mu_y\end{bmatrix}, \begin{bmatrix} \Sigma_{x}& \Sigma_{xy}\\ \Sigma_{yx}& \Sigma_y \end{bmatrix} \right) \end{align*} $$



$$\Sigma_y = \underbrace{ A\Sigma_xA^T}_{\text{signal covariance}} + \underbrace{\Sigma_{\omega}}_{\text{noise covariance}}$$


Posterior Distribution: Estimate $x$ Given $y$

Given a measurement $y$, the posterior distribution of $x$ given $y$ is also Gaussian:


$$x \mid y \sim \mathcal{N} \left(\mu_x + \Sigma_{xy}\Sigma_{y}^{-1}(y-\mu_y), \; \Sigma_{x} - \Sigma_{xy}\Sigma_{y}^{-1}\Sigma_{yx} \right)$$

Where:

  • $\Sigma_{xy} = A \Sigma_x$

  • $\Sigma_y = A \Sigma_x A^T + \Sigma_{\omega}$


The optimal estimator for $x$ is:


$$ \begin{align*} \hat{x} = \phi(y) &= E[x \mid y]\\ & = \mu_x + \Sigma_{xy}\Sigma_{y}^{-1}(y-\mu_y)\\ & = \mu_x + \Sigma_x A^T(A\Sigma_xA^T + \Sigma_{\omega})^{-1}(y-\mu_y)\\ & = \mu_x + K(y-\mu_y)\\\\ \text{cov}(x \mid y) &= \Sigma_{x} - \Sigma_{xy}\Sigma_{y}^{-1}\Sigma_{yx}\\\\ &= \Sigma_{x} - \Sigma_x A^T (A\Sigma_xA^T + \Sigma_{\omega})^{-1} A \Sigma_x\leq \Sigma_x \end{align*} $$


Where the Kalman gain $K$ is:


$$ K = \Sigma_x A^T \left( A \Sigma_x A^T + \Sigma_\omega \right)^{-1} $$


Interpretation

  • $\mu_x$ is our best prior guess for $x$.

  • The term $(y - \mu_y)$ represents the discrepancy between the observed value and the expected value.

  • The estimator adjusts the prior estimate by $K$ times the discrepancy between the measured and expected values.

  • The covariance of the estimation error is always smaller than the prior covariance of $x$, reflecting the fact that measurements reduce our uncertainty about $x$.
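
The estimator above can be exercised numerically in a few lines. Below is a minimal sketch; the matrices $A$, $\Sigma_x$, and $\Sigma_\omega$ are arbitrary illustrative choices, not values from the text.

In [ ]:
# posterior estimate of x from a linear measurement y = Ax + w
import numpy as np

np.random.seed(0)

mu_x = np.zeros(2)                       # prior mean (zero, for simplicity)
Sigma_x = np.array([[4.0, 0.0],
                    [0.0, 1.0]])         # prior covariance (illustrative)
A = np.array([[1.0, 0.0],
              [1.0, 1.0]])               # measurement matrix (illustrative)
Sigma_w = 0.5*np.eye(2)                  # noise covariance (illustrative)

# simulate one measurement
x_true = np.random.multivariate_normal(mu_x, Sigma_x)
w = np.random.multivariate_normal(np.zeros(2), Sigma_w)
y = A @ x_true + w

# Kalman gain K = Sigma_x A^T (A Sigma_x A^T + Sigma_w)^{-1}
S = A @ Sigma_x @ A.T + Sigma_w
K = Sigma_x @ A.T @ np.linalg.inv(S)

mu_y = A @ mu_x
x_hat = mu_x + K @ (y - mu_y)            # posterior mean E[x | y]
Sigma_post = Sigma_x - K @ A @ Sigma_x   # posterior covariance

print('x_true :', x_true)
print('x_hat  :', x_hat)
print('prior var    :', np.diag(Sigma_x))
print('posterior var:', np.diag(Sigma_post))   # reduced uncertainty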




5. Recursive Bayesian Estimation from Two Sensors

Bayesian reasoning provides a powerful framework for estimation and inference under uncertainty. It allows us to systematically update our beliefs as new data becomes available, making it a cornerstone of modern probabilistic modeling and filtering.


5.1. Re-visit Two Sensors Problem



In the two-sensor fusion problem, each sensor provides a noisy observation of the same true value $x$. We assume that the measurements are independent and corrupted by Gaussian noise.

Let:


$$ \text{var}(y_a) = \sigma_a^2, \qquad \text{var}(y_b) = \sigma_b^2 $$


Then, the optimal estimator of $x$ based on both measurements is a precision-weighted average:


$$ \begin{align*} E[\hat x] &= \frac{ \frac{1}{\sigma_a^2} y_a + \frac{1}{\sigma_b^2} y_b }{ \frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2} } = \frac{\frac{1}{\sigma^2_a}}{\frac{1}{\sigma^2_a} + \frac{1}{\sigma^2_b}}y_a + \frac{\frac{1}{\sigma^2_b}}{\frac{1}{\sigma^2_a} + \frac{1}{\sigma^2_b}}y_b \\\\ \text{var}[\hat x] &= \frac{1}{\frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2}} \quad < \min(\sigma_a^2, \sigma_b^2) \end{align*} $$


This result shows that fusing two noisy estimates reduces uncertainty: the variance of the combined estimate is strictly smaller than that of either individual measurement.



5.2. Recursive Bayesian Update

Now let's view this fusion process from a recursive Bayesian perspective: we start with one measurement and then update our belief after observing the second.


Step 1: Prior from Sensor A

Suppose we first receive a measurement from sensor A:


$$ \hat{x}_1 = y_a, \qquad \hat{\sigma}_1^2 = \sigma_a^2 $$


This forms our initial belief about $x$, with a corresponding uncertainty $\hat{\sigma}_1^2$.


Step 2: Update with Sensor B

We then incorporate sensor B's measurement $y_b$, which has its own uncertainty $\sigma_b^2$.

The updated estimate becomes:


$$ \hat{x}_2 = (1 - K) \hat{x}_1 + K y_b, \quad K = \frac{ \frac{1}{\sigma_b^2} }{ \frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2} } $$


This can be interpreted as:


$$ \hat{x}_2 = \hat{x}_1 + K (y_b - \hat{x}_1) $$


Interpretation

This recursive form illustrates the essence of Bayesian updating:

  • $\hat{x}_1$ is the prior estimate

  • $y_b - \hat{x}_1$ is the innovation or prediction error (difference between new observation and prior)

  • $K$ determines how much we trust the new observation


This equation has the form:

$$ \hat{x}_{\text{new}} = \text{prior} + \text{gain} \times \text{prediction error} $$


The more confident we are in sensor B (i.e., smaller $\sigma_b^2$), the larger the gain $K$ and the more we adjust the prior.


This recursive update structure forms the basis of the Kalman filter, a powerful tool for sequential state estimation in dynamic systems.
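
A short numerical sketch of this two-step update, using arbitrary illustrative values, confirms that the recursive estimate matches the batch fusion result from Section 5.1.

In [ ]:
# recursive Bayesian update for two sensors
import numpy as np

np.random.seed(1)

x_true = 5.0
sigma_a, sigma_b = 1.0, 2.0               # illustrative noise levels

ya = x_true + np.random.normal(0, sigma_a)
yb = x_true + np.random.normal(0, sigma_b)

# Step 1: prior from sensor A
xhat = ya
var = sigma_a**2

# Step 2: update with sensor B
K = (1/sigma_b**2)/(1/var + 1/sigma_b**2)
xhat = xhat + K*(yb - xhat)               # prior + gain x prediction error
var = 1/(1/var + 1/sigma_b**2)

# batch fusion for comparison
xhat_batch = (ya/sigma_a**2 + yb/sigma_b**2)/(1/sigma_a**2 + 1/sigma_b**2)

print(xhat, xhat_batch)                   # identical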



5.3. Bayesian Kalman Filter

The Kalman filter is a recursive Bayesian estimator for linear dynamical systems with Gaussian noise. It formalizes the intuition we saw in sensor fusion: we start with a belief, predict based on system dynamics, and update that belief when a new observation arrives.



State Dynamics

$$ x_{t+1} = A x_t + B u_t + \omega_t, \quad \omega_t \sim \mathcal{N}(0, Q) $$


Measurement Model

$$ y_t = C x_t + v_t, \quad v_t \sim \mathcal{N}(0, R) $$


All variables are assumed Gaussian, and noise terms are independent:

  • $x_t$: system state

  • $u_t$: control input

  • $y_t$: observed measurement

  • $\omega_t$: process noise

  • $v_t$: observation noise



Kalman Filtering as Recursive Bayesian Inference

At each time step, the Kalman filter performs two operations:


(1) Prediction Step (Prior Update)

We use the system model to project the prior belief forward in time:


Predicted mean:

$$ \hat{x}_{t \mid t-1} = A \hat{x}_{t-1} + B u_{t-1} $$


Predicted covariance:

$$ \Sigma_{t \mid t-1} = A \Sigma_{t-1} A^T + Q $$


This gives us the prior distribution before seeing the new measurement:


$$ p(x_t \mid y_{1:t-1}) = \mathcal{N}(\hat{x}_{t \mid t-1}, \Sigma_{t \mid t-1}) $$


(2) Correction Step (Posterior Update)

When a new measurement $y_t$ arrives, we update the prediction:


Kalman gain:

$$ K_t = \Sigma_{t \mid t-1} C^T (C \Sigma_{t \mid t-1} C^T + R)^{-1} $$


Updated mean:

$$ \hat{x}_t = \hat{x}_{t \mid t-1} + K_t \left( y_t - C \hat{x}_{t \mid t-1} \right) $$


Updated covariance:

$$ \Sigma_t = \Sigma_{t \mid t-1} - K_t C \Sigma_{t \mid t-1} $$


This gives the posterior distribution:

$$ p(x_t \mid y_{1:t}) = \mathcal{N}(\hat{x}_t, \Sigma_t) $$


Summary of Bayesian Kalman Filter

  • Maintains a Gaussian belief over the system state
  • Alternates between prediction (using dynamics) and correction (using measurements)
  • Each step is computationally efficient (closed-form update)
  • The Kalman gain $K_t$ balances trust between the model prediction and the new data
  • Covariance decreases over time as more information is accumulated
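
To make the two-step recursion concrete, here is a minimal scalar (1D) sketch. All system parameters below are illustrative assumptions, and the control input $B u_t$ is set to zero.

In [ ]:
# minimal 1D (scalar) Kalman filter sketch
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2)

# scalar system: x_{t+1} = a x_t + w_t,  y_t = c x_t + v_t  (no control input)
a, c = 1.0, 1.0
Q, R = 0.01, 1.0                  # process / measurement noise variances
T = 100

# simulate true state and measurements
x = np.zeros(T)
y = np.zeros(T)
x[0] = 5.0
y[0] = c*x[0] + np.random.normal(0, np.sqrt(R))
for t in range(1, T):
    x[t] = a*x[t-1] + np.random.normal(0, np.sqrt(Q))
    y[t] = c*x[t] + np.random.normal(0, np.sqrt(R))

# Kalman filter: alternate prediction and correction
xhat = np.zeros(T)
P = np.zeros(T)
xhat[0], P[0] = 0.0, 10.0         # initial (vague) belief

for t in range(1, T):
    # prediction step (prior update)
    x_pred = a*xhat[t-1]
    P_pred = a*P[t-1]*a + Q
    # correction step (posterior update)
    K = P_pred*c/(c*P_pred*c + R)
    xhat[t] = x_pred + K*(y[t] - c*x_pred)
    P[t] = P_pred - K*c*P_pred

plt.figure(figsize = (6, 4))
plt.plot(y, 'r.', label = 'measurements')
plt.plot(x, 'k', label = 'true state')
plt.plot(xhat, 'b', label = 'Kalman estimate')
plt.grid(alpha = 0.3)
plt.legend()
plt.show()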


5.4. Object Tracking in Computer Vision

  • Lecture: Introduction to Computer Vision by Prof. Aaron Bobick at Georgia Tech
In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('5yUjYCkm2jI?rel=0', width = "560", height = "315")
Out[ ]: