Parameter Estimation in Probabilistic Model

By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Table of Contents

1. Probability Density Estimation¶

Reconstructing the probability density function from a set of given data samples $y_1, y_2, \cdots, y_m$
Want to recover the underlying probability density function generating our dataset

Assumption
- Gaussian distributed with unknown parameters 𝜇 and 𝜎
- Samples are independent

It becomes a parameter estimation problem

Think about how to compute the probability
- Gaussian distributed with unknown parameters 𝜇 and 𝜎
- Samples are independent

$$P \left(y_1,y_2,\cdots,y_m \,;\, \mu,\sigma^2 \right) = \prod_{i=1}^{m} P \left(y_i \,;\, \mu,\sigma^2 \right)$$

1.1. Given $m$ Data Points, Drawn from a Gaussian Distribution¶

You will often see the following derivation

$$ \begin{align*} P\left(y=y_i \,;\, \mu,\sigma^2\right) & =\frac{1}{\sqrt{2\pi}\sigma}\exp \left(-\frac{1}{2\sigma^2}(y_i-\mu)^2 \right)\;\text{: generative model}\\\\ \mathcal{L} = P \left(y_1,y_2,\cdots,y_m \,;\, \mu,\sigma^2 \right) & =\prod\limits_{i=1}^m\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{1}{2\sigma^2}\left(y_i-\mu\right)^2\right)\\ & = \frac{1}{(2\pi)^\frac{m}{2}\sigma^m}\exp\left(-\frac{1}{2\sigma^2}\sum\limits_{i=1}^m(y_i-\mu)^2 \right)\\\\ \ell =\log \mathcal{L} &=-\frac{m}{2} \text{log}{2\pi}-m \text{log}{\sigma}-\frac{1}{2\sigma^2}\sum\limits_{i=1}^m(y_i-\mu)^2 \end{align*}$$

To maximize, $\frac{\partial \ell}{\partial \mu}=0, \frac{\partial \ell}{\partial \sigma}=0$

$$ \begin{align*} \frac{\partial \ell}{\partial \mu} &=\frac{1}{\sigma^2}\sum\limits_{i=1}^m(y_i-\mu)=0 \;\;\;\;\qquad\quad \implies \quad \mu_{MLE}=\frac{1} {m}\sum\limits_{i=1}^m y_i \quad \text{: sample mean}\\\\ \frac{\partial \ell}{\partial \sigma} &= -\frac{m}{\sigma}+\frac{1}{\sigma^3}\sum\limits_{i=1}^m(y_i-\mu)^2=0 \;\; \implies \quad \sigma_{MLE}^2=\frac{1}{m}\sum\limits_{i=1}^m(y_i-\mu)^2 \quad \text{: sample variance} \end{align*} $$

Big lesson
- We often compute a mean and variance to represent data statistics
- We kind of assume that a data set is Gaussian distributed
- Good news: sample mean is Gaussian distributed by the central limit theorem

1.2. Maximum Likelihood Estimation (MLE)¶

Method of estimating the parameters of a probability distribution by maximizing a likelihood function
Under the assumed statistical model the observed data is most probable

Likelihood function
- The probability density at a particular outcome $D$ when the true value of the parameter is $\theta$

$$\mathcal{L} = P(D \mid \theta) = P(D \,;\, \theta)$$

$$\theta_{MLE} = \underset{\theta}{\mathrm{argmax}}\;\; P(D \,;\, \theta)$$

A model exists first and it generates samples
- Generative model

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
%matplotlib inline

# MLE of Gaussian distribution 
# mu

m = 20
mu = 0
sigma = 5

x = np.random.normal(mu,sigma,[m,1])
xp = np.linspace(-20, 20, 100)
y0 = np.zeros([m, 1])

muhat = [-5, 0, 5, np.mean(x)]

plt.figure(figsize=(8, 8))

for i in range(4):
    yp = norm.pdf(xp, muhat[i], sigma)
    y = norm.pdf(x, muhat[i], sigma)
    logL = np.sum(np.log(y))
    
    plt.subplot(4, 1, i+1)
    plt.plot(xp, yp, 'r')
    plt.plot(x, y, 'bo')
    plt.plot(np.hstack([x, x]).T, np.hstack([y, y0]).T, 'k--')
    
    plt.title(r'$\hat\mu$ = {0:.2f}'.format(muhat[i]), fontsize=15)
    plt.text(-15,0.06,np.round(logL,4),fontsize=15)
    plt.axis([-20, 20, 0, 0.11])

plt.tight_layout()
plt.show()

Compare to a result from formula

$$\mu_{MLE}=\frac{1}{m}\sum_{i = 1}^{m}x_i$$

# mean is unknown in this example
# variance is known in this example

m = 10
mu = 0
sigma = 5

x = np.random.normal(mu,sigma,[m,1])

mus = np.arange(-10, 10.5, 0.5)
LOGL = []

for i in range(np.size(mus)):
    y = norm.pdf(x, mus[i], sigma)
    logL = np.sum(np.log(y))
    LOGL.append(logL)

muhat = np.mean(x)
print(muhat)
    
plt.figure(figsize=(10, 6))
plt.plot(mus, LOGL, '.')
plt.title('$log (\prod \mathcal{N}(x \mid \mu , \sigma^2))$', fontsize=20)
plt.xlabel(r'$\hat \mu$', fontsize=15)
plt.grid(alpha=0.3)
plt.show()

-1.4011074292982022

$$\sigma_{MLE}^2=\frac{1}{m}\sum\limits_{i=1}^m(y_i-\mu)^2$$

# mean is known in this example
# variance is unknown in this example

m = 100
mu = 0
sigma = 3

x = np.random.normal(mu,sigma,[m,1])  # samples

sigmas = np.arange(1, 10, 0.1)
LOGL = []

for i in range(sigmas.shape[0]):
    y = norm.pdf(x, mu, sigmas[i])     # likelihood
    logL = np.sum(np.log(y))
    LOGL.append(logL)

sigmahat = np.sqrt(np.var(x))
print(sigmahat)

plt.figure(figsize=(10,6))
plt.title(r'$\log (\prod \mathcal{N} (x|\mu,\sigma^2))$',fontsize=20)
plt.plot(sigmas, LOGL, '.')
plt.xlabel(r'$\hat \sigma$', fontsize=15)
plt.axis([0, np.max(sigmas), np.min(LOGL), -200])
plt.grid(alpha=0.3)
plt.show()

2.726137849417368

2. Data Fusion with Uncertainties¶

From Learning Theory (Reza Shadmehr, Johns Hopkins University)
- Lecture 6

Suppose there are two sensors

$$\begin{align*} y_a &= x + \varepsilon_a, \; \varepsilon_a \sim \mathcal{N}\left(0,\sigma^2_a\right) \\ y_b &= x + \varepsilon_b, \; \varepsilon_b \sim \mathcal{N}\left(0,\sigma^2_b\right) \end{align*}$$

In a matrix form

$$ y = \begin{bmatrix} y_a \\ y_b\end{bmatrix} = Cx+\varepsilon = \begin{bmatrix}1\\1\end{bmatrix}x+\begin{bmatrix}\varepsilon_a\\ \varepsilon_b \end{bmatrix} \quad \quad \varepsilon \sim \mathcal{N}\left(0,R\right),\;\; R=\begin{bmatrix} \sigma^2_a & 0\\ 0 & \sigma^2_b\end{bmatrix}$$

$$\begin{align*}P\left(y \mid x\right) & \sim \mathcal{N}\left(Cx,R\right)\\ &= \frac{1}{\sqrt{\left(2\pi\right)^2\vert R \vert}}\exp\left(-\frac{1}{2}\left(y-Cx\right)^TR^{-1}\left(y-Cx\right)\right)\end{align*}$$

Find $\,\hat{x}$

$$\ell = -\log{2\pi}-\frac{1}{2}\log{\vert R \vert}-\frac{1}{2} \underbrace{\left(y-Cx\right)^TR^{-1}\left(y-Cx\right)}$$

$$ \begin{align*} \left(y-Cx\right)^TR^{-1}\left(y-Cx\right) &= y^TR^{-1}y-y^TR^{-1}Cx-x^TC^TR^{-1}y+x^TC^TR^{-1}Cx \\ \\ \implies \frac{d\ell}{dx} & =0=-2C^TR^{-1}y + 2C^TR^{-1}Cx\\ \therefore \;\; \hat{x} &=\left(C^TR^{-1}C\right)^{-1}C^TR^{-1}y \end{align*} $$

$ \left(C^TR^{-1}C\right)^{-1}C^TR^{-1} $

$$ \begin{align*} \left(C^TR^{-1}C\right) &= \begin{bmatrix}1 & 1\end{bmatrix}\begin{bmatrix}\frac{1}{\sigma^2_a} & 0 \\ 0 & \frac{1}{\sigma^2_b}\end{bmatrix}\begin{bmatrix}1\\ 1\end{bmatrix} = \frac{1}{\sigma^2_a}+\frac{1}{\sigma^2_b}\\ C^TR^{-1} &= \begin{bmatrix}1 & 1\end{bmatrix}\begin{bmatrix}\frac{1}{\sigma^2_a} & 0 \\ 0&\frac{1}{\sigma^2_b}\end{bmatrix} = \begin{bmatrix}\frac{1}{\sigma^2_a} &\frac{1}{\sigma^2_b} \end{bmatrix}\\ \\ \end{align*} $$

$$\begin{align*} \hat{x} &= \left(C^TR^{-1}C\right)^{-1}C^TR^{-1}y = \left(\frac{1}{\sigma^2_a}+\frac{1}{\sigma^2_b}\right)^{-1} \begin{bmatrix}\frac{1}{\sigma^2_a} &\frac{1}{\sigma^2_b} \end{bmatrix} \begin{bmatrix}y_a\\y_b\end{bmatrix} \\&= \frac{\frac{1}{\sigma^2_a}y_a+\frac{1}{\sigma^2_b}y_b}{\frac{1}{\sigma^2_a}+\frac{1}{\sigma^2_b}}\\ \\ \text{var}\left(\hat{x}\right) &= \left(\left(C^TR^{-1}C \right)^{-1}C^TR^{-1}\right) \cdot \text{var}(y) \cdot \left( \left(C^TR^{-1}C)^{-1}C^TR^{-1} \right) \right)^T\\ & = \left( \left(C^TR^{-1}C \right)^{-1}C^TR^{-1}\right) \cdot R \cdot \left( \left(C^TR^{-1}C)^{-1}C^TR^{-1} \right) \right)^T\\ &= \left(C^TR^{-1}C \right)^{-1}C^T \cdot \left(R^{-1} \right)^TC \left(\left(C^TR^{-1}C \right)^{-1} \right)^T \\ &=\underbrace{\left(C^TR^{-1}C\right)^{-1}} \underbrace{C^TR^{-1}C}\left(\left(C^TR^{-1}C\right)^{-1}\right)^T = \left(C^TR^{-1}C\right)^{-1}\\\\ &=\frac{1}{\frac{1}{\sigma^2_a}+\frac{1}{\sigma^2_b}} \leq \;\; \sigma^2_a,\;\sigma^2_b \end{align*}$$

Summary

$$ \begin{align*} \hat x &= \frac{\frac{1}{\sigma_a^2} y_a + \frac{1}{\sigma_b^2}y_b}{\frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2}} = \frac{\frac{1}{\sigma^2_a}}{\frac{1}{\sigma^2_a} + \frac{1}{\sigma^2_b}}y_a + \frac{\frac{1}{\sigma^2_b}}{\frac{1}{\sigma^2_a} + \frac{1}{\sigma^2_b}}y_b\\\\ \text{var}(\hat x) &= \frac{1}{\frac{1}{\sigma^2_a}+\frac{1}{\sigma^2_b}} \quad < \quad \sigma_a^2, \;\sigma_b^2 \qquad \implies \text{more accurate} \end{align*} $$

Big lesson:
- Two sensors are better than one sensor $\implies$ less uncertainties
- Accuracy or uncertainty information is also important in sensors

Example of two rulers

1D example
how brain works on human measurements from both haptic and visual channels

x = 5       # true state (length in this example)
a = 1       # sigma of a
b = 2       # sigma of b

YA = []
YB = []
XML = []

for i in range(2000):
    ya = x + np.random.normal(0,a)
    yb = x + np.random.normal(0,b)
    xml = (1/a**2*ya + 1/b**2*yb)/(1/a**2+1/b**2)
    YA.append(ya)
    YB.append(yb)
    XML.append(xml)

plt.figure(figsize=(8, 6))
plt.subplot(3, 1, 1), plt.hist(YA, 41, density=True), plt.axis([0, 10, 0, 0.5]), 
plt.title(r'$y_a$', fontsize=20), plt.grid(alpha=0.3)
plt.subplot(3, 1, 2), plt.hist(YB, 41, density=True), plt.axis([0, 10, 0, 0.5]), 
plt.title(r'$y_b$', fontsize=20), plt.grid(alpha=0.3)
plt.subplot(3, 1, 3), plt.hist(XML, 41, density=True), plt.axis([0, 10, 0, 0.5]), 
plt.title(r'$x_{ML}$', fontsize=20), plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Example of two GPSs

2D example

x = np.array([5, 10]).reshape(-1, 1) # true position

mu = np.array([0, 0])
Ra = np.matrix([[9, 1],
               [1, 1]])
Rb = np.matrix([[1, 1],
                [1, 9]])

YA = []
YB = []
XML = []

for i in range(1000):
    ya = x + np.random.multivariate_normal(mu, Ra).reshape(-1, 1)
    yb = x + np.random.multivariate_normal(mu, Rb).reshape(-1, 1)
    xml = (Ra.I+Rb.I).I*(Ra.I*ya+Rb.I*yb)
    YA.append(ya.T)
    YB.append(yb.T)
    XML.append(xml.T)

YA = np.vstack(YA)
YB = np.vstack(YB)
XML = np.vstack(XML)

plt.figure(figsize=(10, 6))
plt.title('Data Fusion', fontsize=15)
plt.plot(YA[:,0], YA[:,1], 'b.', label='Observation 1')
plt.plot(YB[:,0], YB[:,1], 'r.', label='Observation 2')
plt.plot(XML[:,0], XML[:,1], 'k.', label='MLE')
plt.axis('equal')
plt.grid(alpha=0.3)
plt.legend(fontsize=15)
plt.show()

%%html
<center><iframe src="https://www.youtube.com/embed/52jlBrAcw9Q?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

3. Maximum-a-Posteriori Estimation (MAP)¶

3.1. Think Differently¶

Given $d_1, d_2, \cdots, d_m$
- We estimated a probability density function $P(D)$
- The most reasonable guess for the next sample: $d_{m+1} = E[D]$

What if there is a hidden variable $\theta$, and we know that samples are generated from it?
- Here, $\theta$ is a random variable

Choose $\theta$ that maximizes the posterior probability of $\theta$ (i.e. probability in the light of the observed data)

Posterior probability of $\theta$ is given by the Bayes Rule

$$P(\theta \mid D) = \frac{P(D \mid \theta)P(\theta)}{P(D)}$$
- $P(\theta)$: Prior probability of $\theta$ (without having seen any data)
- $P(D \mid \theta)$: Likelihood
- $P(D)$: Probability of the data (independent of $\theta$)

$$ P(D) = \int P(\theta)P(D \mid \theta) d\theta $$

The Bayes rule lets us update our belief about $\theta$ in the light of observed data

While doing MAP, we usually maximize the log of the posterior probability

$$\begin{align*} \theta_{MAP} = \underset{\theta}{\mathrm{argmax}}\;\;P(\theta \mid D) &= \underset{\theta}{\mathrm{argmax}}\;\;\frac{P(D \mid \theta)P(\theta)}{P(D)}\\ & =\underset{\theta}{\mathrm{argmax}} \;\; P(D \mid \theta)P(\theta)\\ & = \underset{\theta}{\mathrm{argmax}} \;\; \log P(D \mid \theta)P(\theta)\\ & = \underset{\theta}{\mathrm{argmax}} \;\; \left\{\log{P \left(D \mid \theta \right)} + \log{P(\theta) }\right\}\end{align*}$$

For multiple observations $D = \{d_1,d_2,\cdots,d_m\}$
- Assume independent

$$ \theta_{MAP} = \underset{\theta}{\mathrm{argmax}} \;\; \left\{\sum_{i=1}^{m}\log{P \left(d_i \mid \theta \right)} + \log{P(\theta) }\right\} $$

Same as MLE except the extra log-prior-distribution term

$$ \theta_{MLE} = \underset{\theta}{\mathrm{argmax}} \;\; \left\{\sum_{i=1}^{m}\log{P \left(d_i \mid \theta \right)}\right\} $$

MAP allows incorporating our prior knowledge about $\theta$ in its estimation

$$\theta_{MAP} = \underset{\theta}{\mathrm{argmax}}\;\;P(\theta \mid D)$$

$$ \theta_{MLE} = \underset{\theta}{\mathrm{argmax}}\;\;P(D \mid \theta) $$

3.2. MAP for a Univariate Gaussian¶

Observations $ D=\{x_1,x_2,\cdots,x_m\}: \text{conditionally independent given}\; \theta$

$$ x_i \sim \mathcal{N}(\theta,\sigma^2) $$

Suppose that $\theta$ is a random variable with $\theta \sim \mathcal{N}(\mu,1^2)$, but a prior knowledge
- Unknown $\theta$ and known $\mu, \; \sigma^2$

Joint Probability

$$ P(x_1,x_2,\cdots,x_m \mid \theta) = \prod\limits_{i=1}^m P(x_i \mid \theta) $$

MAP: choose $\theta_{MAP}$

$$\begin{align*} \theta_{MAP} &= \underset{\theta}{\mathrm{argmax}}\;\;P(\theta \mid D)=\frac{P(D \mid \theta)P(\theta)}{P(D)}\\ & = \underset{\theta}{\mathrm{argmax}} \;\; P(D \mid \theta)P(\theta)\\ & = \underset{\theta}{\mathrm{argmax}} \;\; \left\{\log{P \left(D \mid \theta \right)} + \log{P(\theta) }\right\}\end{align*}$$

$$\begin{align*} \frac{\partial}{\partial \theta} \left(\log{P \left(D \mid \theta \right)} \right) & =\; \cdots \; = \frac{1}{\sigma^2} \left(\sum\limits_{i=1}^m x_i-m\theta \right)\quad (\text{we did in MLE})\\ \\ \frac{\partial}{\partial \theta} \left(\log{P \left(\theta \right)} \right) &=\frac{\partial}{\partial\theta} \left(\log \left(\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(\theta-\mu)^2} \right) \right)\\ & \; \vdots \\ &=\frac{\partial}{\partial\theta} \left(-\frac{1}{2}\log{2\pi}-\frac{1}{2} \left(\theta - \mu \right)^2 \right)\\ & =\mu-\theta \end{align*}$$

$$\begin{align*} \implies & \frac{\partial}{\partial \theta} \left(\log{P \left(D \mid \theta \right)} \right) + \frac{\partial}{\partial \theta} \left(\log{P \left(\theta \right)} \right)\\ & = \frac{1}{\sigma^2}\left(\sum\limits_{i=1}^m x_i - m\theta^* \right) + \mu - \theta^* = 0 \\ & = \frac{1}{\sigma^2}\sum\limits_{i=1}^m x_i + \mu - \left(\frac{m}{\sigma^2}+1\right)\theta^* = 0 \\ \theta^* &= \frac{\frac{1}{\sigma^2}\sum\limits_{i=1}^m x_i + \mu}{\frac{m}{\sigma^2}+1} = \frac{\frac{m}{\sigma^2}\cdot\frac{1}{m}\sum\limits_{i=1}^m x_i+1\cdot\mu}{\frac{m}{\sigma^2}+1} \end{align*}$$

$$\therefore \;\theta_{MAP} = \frac{\frac{m}{\sigma^2}}{\frac{m}{\sigma^2}+1}\bar{x}+\frac{1}{\frac{m}{\sigma^2}+1}\mu \;\;\;\text{: look familiar ?}$$

MLE interpretation:

$$\begin{align*} &\begin{cases} \mu = \text{prior mean}\\ \bar{x} = \text{sample mean} \end{cases} \\ &\begin{cases} \mu = 1\text{st} \;\;\text{observation} \; \sim \mathcal{N}\left(0,1^2\right)\\ \bar{x} = 2\text{nd}\;\; \text{observation} \; \sim \mathcal{N}\left(0,\left(\frac{\sigma}{\sqrt{m}}\right)^2\right) \end{cases} \\ \end{align*}$$

Big lesson: a prior acts as a data

Note: prior knowledge

Education
Get older

Example) Experiment in class

Which one do you think is heavier?
- with eyes closed
- with visual inspection
- with haptic (touch) inspection

3.3. MAP in Python¶

Suppose that $\theta$ is a random variable with $\theta \sim \mathcal{N}(\mu,1^2)$, but a prior knowledge (unknown $\theta$ and known $\mu, \; \sigma^2$)

$$ x_i \sim \mathcal{N}(\theta,\sigma^2) $$

for mean of a univariate Gaussian

# known
mu = 5 
sigma = 2

# unknown theta 
theta = np.random.normal(mu,1)
x = np.random.normal(theta, sigma)

print('theta = {:.4f}'.format(theta))
print('x = {:.4f}'.format(x))

theta = 3.1817
x = 5.6953

$$\theta_{MAP} = \frac{\frac{m}{\sigma^2}}{\frac{m}{\sigma^2}+1}\bar{x}+\frac{1}{\frac{m}{\sigma^2}+1}\mu $$

# MAP

m = 4
X = np.random.normal(theta,sigma,[m,1])

xbar = np.mean(X)
theta_MAP = m/(m+sigma**2)*xbar + sigma**2/(m+sigma**2)*mu

print('mu = 5')
print('xbar = {:.4f}'.format(xbar))
print('theta_MAP = {:.4f}'.format(theta_MAP))

mu = 5
xbar = 4.6384
theta_MAP = 4.8192

# theta
mu = 5
theta = np.random.normal(mu,1)

sigma = 2
m = 2000

X = np.random.normal(theta,sigma,[m,1])
X = np.asmatrix(X)

print('theta = {:.4f}'.format(theta))
plt.figure(figsize=(10,6))
plt.hist(X,31)
plt.xlim([-5,15])
plt.show()

theta = 5.4629

m = 50
XMLE = []
XMAP = []

for k in range(m):
    xmle = np.mean(X[0:k+1,0])
    xmap = k/(k+sigma**2)*xmle + sigma**2/(k + sigma**2)*mu
    XMLE.append(xmle)
    XMAP.append(xmap)
    
print('theta = {:.4f}'.format(theta))
plt.figure(figsize=(10,6))
plt.plot(XMLE)
plt.plot(XMAP)
plt.legend(['MLE','MAP'])
plt.show()

theta = 5.4629

4. Linear Measurement with Noise¶

$$y = Ax + \omega$$

$x$ is what we want to estimate
$y$ is measured
$A$ characterizes sensors or measurements
$\omega$ is sensor noise

Common assumptions
- $x \sim \mathcal{N}(\mu_x, \Sigma_x)$ $\rightarrow$ for simplicity $x \sim \mathcal{N}(0, \Sigma_x)$
- $\omega \sim \mathcal{N}(\mu_{\omega}, \Sigma_{\omega})$ $\rightarrow$ $\omega \sim \mathcal{N}(0, \Sigma_{\omega})$ is usually
- $x$ and $\omega$ are independent

$$\begin{bmatrix} x \\ \omega \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_x \\ \mu_{\omega}\end{bmatrix}, \begin{bmatrix} \Sigma_{x}& 0\\ 0& \Sigma_{\omega} \end{bmatrix} \right)$$

$$\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} I & 0 \\A & I \end{bmatrix} \begin{bmatrix} x \\ \omega \end{bmatrix}$$

$$ \begin{align*} \begin{bmatrix} x \\ y \end{bmatrix} &\sim \mathcal{N}\left( \begin{bmatrix} I & 0 \\A & I \end{bmatrix} \begin{bmatrix} \mu_x \\ \mu_{\omega}\end{bmatrix}, \begin{bmatrix} I & 0 \\A & I \end{bmatrix} \begin{bmatrix} \Sigma_{x}& 0\\ 0& \Sigma_{\omega} \end{bmatrix} \begin{bmatrix} I & 0 \\A & I \end{bmatrix}^T \right) \\\\ &\sim \mathcal{N}\left( \begin{bmatrix} \mu_x \\ A \mu_x + \mu_{\omega}\end{bmatrix}, \begin{bmatrix} \Sigma_{x}& \Sigma_x A^T\\ A\Sigma_x& A\Sigma_xA^T + \Sigma_{\omega} \end{bmatrix} \right) = \mathcal{N} \left( \begin{bmatrix} \mu_x \\ \mu_y\end{bmatrix}, \begin{bmatrix} \Sigma_{x}& \Sigma_{xy}\\ \Sigma_{yx}& \Sigma_y \end{bmatrix} \right) \end{align*} $$

$$\Sigma_y = \underbrace{ A\Sigma_xA^T}_{\text{signal covariance}} + \underbrace{\Sigma_{\omega}}_{\text{noise covariance}}$$

Back to estimation problem: estimate $x$ given $y$

$$x \mid y \sim \mathcal{N} \left(\mu_x + \Sigma_{xy}\Sigma_{y}^{-1}(y-\mu_y), \; \Sigma_{x} - \Sigma_{xy}\Sigma_{y}^{-1}\Sigma_{yx} \right)$$

$$ \begin{align*} \hat{x} = \phi(y) &= E[x \mid y]\\ & = \mu_x + \Sigma_{xy}\Sigma_{y}^{-1}(y-\mu_y)\\ & = \mu_x + \Sigma_x A^T(A\Sigma_xA^T + \Sigma_{\omega})^{-1}(y-\mu_y)\\ & = \mu_x + K(y-\mu_y)\\\\ \text{cov}(x \mid y) &= \Sigma_{x} - \Sigma_{xy}\Sigma_{y}^{-1}\Sigma_{yx}\\\\ &= \Sigma_{x} - \Sigma_x A^T (A\Sigma_xA^T + \Sigma_{\omega})^{-1} A \Sigma_x\leq \Sigma_x \end{align*} $$

Interpretation
- $\mu_x$ is our best prior guess of $x$ (before measurement)
- $y - \mu_y$ is the discrepancy between what we actually measure ($y$) and the expected value of $y$
- Estimator modifies prior guess by $K$ times this discrepancy
- Estimator blends prior information with measurement
- Covariance of estimation error is always less than prior covariance of $x$

5. Bayesian Thinking¶

5.1. Re-visit Two Sensors Problem¶

Assumptions
- Follows Gaussian distribution
- Variances of two sensors

$$\text{var}(y_a) = \sigma_a^2\\ \text{var}(y_b) = \sigma_b^2$$

Optimal position estimation $\hat x$

$$ \begin{align*} E[\hat x] &= \frac{\frac{1}{\sigma_a^2} y_a + \frac{1}{\sigma_b^2} y_b} {\frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2}} = \frac{\frac{1}{\sigma^2_a}}{\frac{1}{\sigma^2_a} + \frac{1}{\sigma^2_a}}y_a + \frac{\frac{1}{\sigma^2_b}}{\frac{1}{\sigma^2_a} + \frac{1}{\sigma^2_b}}y_b\\\\ \text{var}[\hat x] &= \frac{1}{\frac{1}{\sigma^2_a}+\frac{1}{\sigma^2_b}} \quad < \quad \sigma_a^2, \;\sigma_b^2 \qquad \implies \text{more accurate} \end{align*} $$

5.2. Different Perspective¶

Observe sensor A first $\;y_a, \; \sigma_a^2$

$$ \begin{align*} \hat x_1 &= y_a\\ \hat{\sigma}_{1}^2 &= \sigma_a^2 \end{align*} $$

Combine sensor B's obervation $\;y_b, \; \sigma_b^2$

$$ \begin{align*} E[\hat x] &= \frac{\frac{1}{\sigma_a^2} y_a + \frac{1}{\sigma_b^2}y_b}{\frac{1}{\sigma_a^2} + \frac{1} {\sigma_b^2}} = \frac{\frac{1}{\sigma^2_a}}{\frac{1}{\sigma^2_a} + \frac{1}{\sigma^2_b}}y_a + \frac{\frac{1}{\sigma^2_b}}{\frac{1}{\sigma^2_a} + \frac{1}{\sigma^2_b}}y_b \quad \text{and set } K = \frac{\frac{1}{\sigma_b^2}}{\frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2}}\\\\ \hat{x}_2 = E[\hat x] &= (1-K) y_a + K y_b\\ &= (1-K) \hat x_1 + K y_b\\ &= \hat x_1 + K (y_b - \hat x_1 ) \end{align*} $$

Optimal estimation
- prediction first, then correct or update with prediction error

$$\underbrace{\hat x_2}_{\text{final estimation}} = \underbrace{\hat x_1}_{\text{prediction}} + K ( \overbrace{ \underbrace{y_b}_{\text{measured}} - \hat x_1}^{\text{prediction error}})$$

5.3. Bayesian Kalman Filter¶

Discrete linear dynamical system of motion

$$ \begin{align*} x_{t+1} &= A x_t + B u_t\\ y_t &= C x_t \end{align*} $$

Kalman filter has a very nice Bayesian interpretation.
- Model $x_t$ with a Gaussian
  
  $$p(x_t) = \mathcal{N}(x_t, \Sigma_t)$$
- Prediction using state dynamics model
  
  $$p(x_{t} \mid x_{t-1})$$
- Inference from noisy measurements
  
  $$p(y_t \mid x_t)$$

Summary
- Prediction
  - We do not have any more data, we just update our earlier posterior of $\hat x_{t-1}$ to give a new posterior of $\hat x_{t}$
  - We are just propagating the pmf under the linear dynamics
    
    $$ \begin{align*} x_{t+1} &= A x_t + B u_t\\ y_t &= C x_t \end{align*} $$
- Correction
  - We start with $\hat{x}_t$, and use it as the prior for our next measurement of $y_t$
    
    $$ \hat x_{t} \leftarrow \hat x_t + K (y_{t} - C\hat x_t ) $$
- (Recursive) Repeat over time

5.4. Object Tracking in Computer Vision¶

Lecture: Introduction to Computer Vision by Prof. Aaron Bobick at Georgia Tech

%%html
<center><iframe src="https://www.youtube.com/embed/rf3DKqWajWY?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

%%html
<center><iframe src="https://www.youtube.com/embed/5yUjYCkm2jI?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

5.5. Bayesian Inference¶

Start with prior beliefs, which can be thought of as a summary of opinions.
- might be subjective
Given our prior, we can update our opinion, and produce a new opinion.
- This new distribution is called the posterior
Iterate
- if more data is available

Coin-flip example

See how two different priors can result in two different posteriors

Gaussian distributions
- Explaining away

%%javascript
$.getScr`aipt('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')