Parameter Estimation in Probabilistic Model


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Table of Contents

1. Probability Density EstimationĀ¶

  • Reconstructing the probability density function from a set of given data samples $y_1, y_2, \cdots, y_m$

  • Want to recover the underlying probability density function generating our dataset



  • Assumption
    • Gaussian distributed with unknown parameters šœ‡ and šœŽ
    • Samples are independent
  • It becomes a parameter estimation problem
  • Think about how to compute the probability
    • Gaussian distributed with unknown parameters šœ‡ and šœŽ
    • Samples are independent


$$P \left(y_1,y_2,\cdots,y_m \,;\, \mu,\sigma^2 \right) = \prod_{i=1}^{m} P \left(y_i \,;\, \mu,\sigma^2 \right)$$



1.1. Given $m$ Data Points, Drawn from a Gaussian DistributionĀ¶

  • You will often see the following derivation



$$ \begin{align*} P\left(y=y_i \,;\, \mu,\sigma^2\right) & =\frac{1}{\sqrt{2\pi}\sigma}\exp \left(-\frac{1}{2\sigma^2}(y_i-\mu)^2 \right)\;\text{: generative model}\\\\ \mathcal{L} = P \left(y_1,y_2,\cdots,y_m \,;\, \mu,\sigma^2 \right) & =\prod\limits_{i=1}^m\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{1}{2\sigma^2}\left(y_i-\mu\right)^2\right)\\ & = \frac{1}{(2\pi)^\frac{m}{2}\sigma^m}\exp\left(-\frac{1}{2\sigma^2}\sum\limits_{i=1}^m(y_i-\mu)^2 \right)\\\\ \ell =\log \mathcal{L} &=-\frac{m}{2} \text{log}{2\pi}-m \text{log}{\sigma}-\frac{1}{2\sigma^2}\sum\limits_{i=1}^m(y_i-\mu)^2 \end{align*}$$



  • To maximize, $\frac{\partial \ell}{\partial \mu}=0, \frac{\partial \ell}{\partial \sigma}=0$


$$ \begin{align*} \frac{\partial \ell}{\partial \mu} &=\frac{1}{\sigma^2}\sum\limits_{i=1}^m(y_i-\mu)=0 \;\;\;\;\qquad\quad \implies \quad \mu_{MLE}=\frac{1} {m}\sum\limits_{i=1}^m y_i \quad \text{: sample mean}\\\\ \frac{\partial \ell}{\partial \sigma} &= -\frac{m}{\sigma}+\frac{1}{\sigma^3}\sum\limits_{i=1}^m(y_i-\mu)^2=0 \;\; \implies \quad \sigma_{MLE}^2=\frac{1}{m}\sum\limits_{i=1}^m(y_i-\mu)^2 \quad \text{: sample variance} \end{align*} $$


  • Big lesson
    • We often compute a mean and variance to represent data statistics
    • We kind of assume that a data set is Gaussian distributed
    • Good news: sample mean is Gaussian distributed by the central limit theorem

1.2. Maximum Likelihood Estimation (MLE)Ā¶

  • Method of estimating the parameters of a probability distribution by maximizing a likelihood function

  • Under the assumed statistical model the observed data is most probable

  • Likelihood function
    • The probability density at a particular outcome $D$ when the true value of the parameter is $\theta$


$$\mathcal{L} = P(D \mid \theta) = P(D \,;\, \theta)$$


$$\theta_{MLE} = \underset{\theta}{\mathrm{argmax}}\;\; P(D \,;\, \theta)$$

  • A model exists first and it generates samples
    • Generative model
InĀ [1]:
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
%matplotlib inline
InĀ [2]:
# MLE of Gaussian distribution 
# mu

m = 20
mu = 0
sigma = 5

x = np.random.normal(mu,sigma,[m,1])
xp = np.linspace(-20, 20, 100)
y0 = np.zeros([m, 1])

muhat = [-5, 0, 5, np.mean(x)]

plt.figure(figsize=(8, 8))

for i in range(4):
    yp = norm.pdf(xp, muhat[i], sigma)
    y = norm.pdf(x, muhat[i], sigma)
    logL = np.sum(np.log(y))
    
    plt.subplot(4, 1, i+1)
    plt.plot(xp, yp, 'r')
    plt.plot(x, y, 'bo')
    plt.plot(np.hstack([x, x]).T, np.hstack([y, y0]).T, 'k--')
    
    plt.title(r'$\hat\mu$ = {0:.2f}'.format(muhat[i]), fontsize=15)
    plt.text(-15,0.06,np.round(logL,4),fontsize=15)
    plt.axis([-20, 20, 0, 0.11])

plt.tight_layout()
plt.show()

Compare to a result from formula

$$\mu_{MLE}=\frac{1}{m}\sum_{i = 1}^{m}x_i$$
InĀ [3]:
# mean is unknown in this example
# variance is known in this example

m = 10
mu = 0
sigma = 5

x = np.random.normal(mu,sigma,[m,1])

mus = np.arange(-10, 10.5, 0.5)
LOGL = []

for i in range(np.size(mus)):
    y = norm.pdf(x, mus[i], sigma)
    logL = np.sum(np.log(y))
    LOGL.append(logL)

muhat = np.mean(x)
print(muhat)
    
plt.figure(figsize=(10, 6))
plt.plot(mus, LOGL, '.')
plt.title('$log (\prod \mathcal{N}(x \mid \mu , \sigma^2))$', fontsize=20)
plt.xlabel(r'$\hat \mu$', fontsize=15)
plt.grid(alpha=0.3)
plt.show()
-1.4011074292982022
$$\sigma_{MLE}^2=\frac{1}{m}\sum\limits_{i=1}^m(y_i-\mu)^2$$
InĀ [4]:
# mean is known in this example
# variance is unknown in this example

m = 100
mu = 0
sigma = 3

x = np.random.normal(mu,sigma,[m,1])  # samples

sigmas = np.arange(1, 10, 0.1)
LOGL = []

for i in range(sigmas.shape[0]):
    y = norm.pdf(x, mu, sigmas[i])     # likelihood
    logL = np.sum(np.log(y))
    LOGL.append(logL)

sigmahat = np.sqrt(np.var(x))
print(sigmahat)

plt.figure(figsize=(10,6))
plt.title(r'$\log (\prod \mathcal{N} (x|\mu,\sigma^2))$',fontsize=20)
plt.plot(sigmas, LOGL, '.')
plt.xlabel(r'$\hat \sigma$', fontsize=15)
plt.axis([0, np.max(sigmas), np.min(LOGL), -200])
plt.grid(alpha=0.3)
plt.show()
2.726137849417368

2. Data Fusion with UncertaintiesĀ¶

  • Suppose there are two sensors



$$\begin{align*} y_a &= x + \varepsilon_a, \; \varepsilon_a \sim \mathcal{N}\left(0,\sigma^2_a\right) \\ y_b &= x + \varepsilon_b, \; \varepsilon_b \sim \mathcal{N}\left(0,\sigma^2_b\right) \end{align*}$$
  • In a matrix form

$$ y = \begin{bmatrix} y_a \\ y_b\end{bmatrix} = Cx+\varepsilon = \begin{bmatrix}1\\1\end{bmatrix}x+\begin{bmatrix}\varepsilon_a\\ \varepsilon_b \end{bmatrix} \quad \quad \varepsilon \sim \mathcal{N}\left(0,R\right),\;\; R=\begin{bmatrix} \sigma^2_a & 0\\ 0 & \sigma^2_b\end{bmatrix}$$



$$\begin{align*}P\left(y \mid x\right) & \sim \mathcal{N}\left(Cx,R\right)\\ &= \frac{1}{\sqrt{\left(2\pi\right)^2\vert R \vert}}\exp\left(-\frac{1}{2}\left(y-Cx\right)^TR^{-1}\left(y-Cx\right)\right)\end{align*}$$

  • Find $\,\hat{x}$

$$\ell = -\log{2\pi}-\frac{1}{2}\log{\vert R \vert}-\frac{1}{2} \underbrace{\left(y-Cx\right)^TR^{-1}\left(y-Cx\right)}$$



$$ \begin{align*} \left(y-Cx\right)^TR^{-1}\left(y-Cx\right) &= y^TR^{-1}y-y^TR^{-1}Cx-x^TC^TR^{-1}y+x^TC^TR^{-1}Cx \\ \\ \implies \frac{d\ell}{dx} & =0=-2C^TR^{-1}y + 2C^TR^{-1}Cx\\ \therefore \;\; \hat{x} &=\left(C^TR^{-1}C\right)^{-1}C^TR^{-1}y \end{align*} $$

  • $ \left(C^TR^{-1}C\right)^{-1}C^TR^{-1} $
$$ \begin{align*} \left(C^TR^{-1}C\right) &= \begin{bmatrix}1 & 1\end{bmatrix}\begin{bmatrix}\frac{1}{\sigma^2_a} & 0 \\ 0 & \frac{1}{\sigma^2_b}\end{bmatrix}\begin{bmatrix}1\\ 1\end{bmatrix} = \frac{1}{\sigma^2_a}+\frac{1}{\sigma^2_b}\\ C^TR^{-1} &= \begin{bmatrix}1 & 1\end{bmatrix}\begin{bmatrix}\frac{1}{\sigma^2_a} & 0 \\ 0&\frac{1}{\sigma^2_b}\end{bmatrix} = \begin{bmatrix}\frac{1}{\sigma^2_a} &\frac{1}{\sigma^2_b} \end{bmatrix}\\ \\ \end{align*} $$
$$\begin{align*} \hat{x} &= \left(C^TR^{-1}C\right)^{-1}C^TR^{-1}y = \left(\frac{1}{\sigma^2_a}+\frac{1}{\sigma^2_b}\right)^{-1} \begin{bmatrix}\frac{1}{\sigma^2_a} &\frac{1}{\sigma^2_b} \end{bmatrix} \begin{bmatrix}y_a\\y_b\end{bmatrix} \\&= \frac{\frac{1}{\sigma^2_a}y_a+\frac{1}{\sigma^2_b}y_b}{\frac{1}{\sigma^2_a}+\frac{1}{\sigma^2_b}}\\ \\ \text{var}\left(\hat{x}\right) &= \left(\left(C^TR^{-1}C \right)^{-1}C^TR^{-1}\right) \cdot \text{var}(y) \cdot \left( \left(C^TR^{-1}C)^{-1}C^TR^{-1} \right) \right)^T\\ & = \left( \left(C^TR^{-1}C \right)^{-1}C^TR^{-1}\right) \cdot R \cdot \left( \left(C^TR^{-1}C)^{-1}C^TR^{-1} \right) \right)^T\\ &= \left(C^TR^{-1}C \right)^{-1}C^T \cdot \left(R^{-1} \right)^TC \left(\left(C^TR^{-1}C \right)^{-1} \right)^T \\ &=\underbrace{\left(C^TR^{-1}C\right)^{-1}} \underbrace{C^TR^{-1}C}\left(\left(C^TR^{-1}C\right)^{-1}\right)^T = \left(C^TR^{-1}C\right)^{-1}\\\\ &=\frac{1}{\frac{1}{\sigma^2_a}+\frac{1}{\sigma^2_b}} \leq \;\; \sigma^2_a,\;\sigma^2_b \end{align*}$$
  • Summary

$$ \begin{align*} \hat x &= \frac{\frac{1}{\sigma_a^2} y_a + \frac{1}{\sigma_b^2}y_b}{\frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2}} = \frac{\frac{1}{\sigma^2_a}}{\frac{1}{\sigma^2_a} + \frac{1}{\sigma^2_b}}y_a + \frac{\frac{1}{\sigma^2_b}}{\frac{1}{\sigma^2_a} + \frac{1}{\sigma^2_b}}y_b\\\\ \text{var}(\hat x) &= \frac{1}{\frac{1}{\sigma^2_a}+\frac{1}{\sigma^2_b}} \quad < \quad \sigma_a^2, \;\sigma_b^2 \qquad \implies \text{more accurate} \end{align*} $$


  • Big lesson:
    • Two sensors are better than one sensor $\implies$ less uncertainties
    • Accuracy or uncertainty information is also important in sensors



Example of two rulers

  • 1D example
  • how brain works on human measurements from both haptic and visual channels


InĀ [16]:
x = 5       # true state (length in this example)
a = 1       # sigma of a
b = 2       # sigma of b

YA = []
YB = []
XML = []

for i in range(2000):
    ya = x + np.random.normal(0,a)
    yb = x + np.random.normal(0,b)
    xml = (1/a**2*ya + 1/b**2*yb)/(1/a**2+1/b**2)
    YA.append(ya)
    YB.append(yb)
    XML.append(xml)

plt.figure(figsize=(8, 6))
plt.subplot(3, 1, 1), plt.hist(YA, 41, density=True), plt.axis([0, 10, 0, 0.5]), 
plt.title(r'$y_a$', fontsize=20), plt.grid(alpha=0.3)
plt.subplot(3, 1, 2), plt.hist(YB, 41, density=True), plt.axis([0, 10, 0, 0.5]), 
plt.title(r'$y_b$', fontsize=20), plt.grid(alpha=0.3)
plt.subplot(3, 1, 3), plt.hist(XML, 41, density=True), plt.axis([0, 10, 0, 0.5]), 
plt.title(r'$x_{ML}$', fontsize=20), plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Example of two GPSs

  • 2D example
InĀ [6]:
x = np.array([5, 10]).reshape(-1, 1) # true position

mu = np.array([0, 0])
Ra = np.matrix([[9, 1],
               [1, 1]])
Rb = np.matrix([[1, 1],
                [1, 9]])

YA = []
YB = []
XML = []

for i in range(1000):
    ya = x + np.random.multivariate_normal(mu, Ra).reshape(-1, 1)
    yb = x + np.random.multivariate_normal(mu, Rb).reshape(-1, 1)
    xml = (Ra.I+Rb.I).I*(Ra.I*ya+Rb.I*yb)
    YA.append(ya.T)
    YB.append(yb.T)
    XML.append(xml.T)

YA = np.vstack(YA)
YB = np.vstack(YB)
XML = np.vstack(XML)
InĀ [7]:
plt.figure(figsize=(10, 6))
plt.title('Data Fusion', fontsize=15)
plt.plot(YA[:,0], YA[:,1], 'b.', label='Observation 1')
plt.plot(YB[:,0], YB[:,1], 'r.', label='Observation 2')
plt.plot(XML[:,0], XML[:,1], 'k.', label='MLE')
plt.axis('equal')
plt.grid(alpha=0.3)
plt.legend(fontsize=15)
plt.show()
InĀ [8]:
%%html
<center><iframe src="https://www.youtube.com/embed/52jlBrAcw9Q?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

3. Maximum-a-Posteriori Estimation (MAP)Ā¶

3.1. Think DifferentlyĀ¶

  • Given $d_1, d_2, \cdots, d_m$
    • We estimated a probability density function $P(D)$
    • The most reasonable guess for the next sample: $d_{m+1} = E[D]$



  • What if there is a hidden variable $\theta$, and we know that samples are generated from it?
    • Here, $\theta$ is a random variable



  • Choose $\theta$ that maximizes the posterior probability of $\theta$ (i.e. probability in the light of the observed data)
  • Posterior probability of $\theta$ is given by the Bayes Rule

    $$P(\theta \mid D) = \frac{P(D \mid \theta)P(\theta)}{P(D)}$$
    • $P(\theta)$: Prior probability of $\theta$ (without having seen any data)
    • $P(D \mid \theta)$: Likelihood
    • $P(D)$: Probability of the data (independent of $\theta$)


$$ P(D) = \int P(\theta)P(D \mid \theta) d\theta $$

  • The Bayes rule lets us update our belief about $\theta$ in the light of observed data
  • While doing MAP, we usually maximize the log of the posterior probability

$$\begin{align*} \theta_{MAP} = \underset{\theta}{\mathrm{argmax}}\;\;P(\theta \mid D) &= \underset{\theta}{\mathrm{argmax}}\;\;\frac{P(D \mid \theta)P(\theta)}{P(D)}\\ & =\underset{\theta}{\mathrm{argmax}} \;\; P(D \mid \theta)P(\theta)\\ & = \underset{\theta}{\mathrm{argmax}} \;\; \log P(D \mid \theta)P(\theta)\\ & = \underset{\theta}{\mathrm{argmax}} \;\; \left\{\log{P \left(D \mid \theta \right)} + \log{P(\theta) }\right\}\end{align*}$$
  • For multiple observations $D = \{d_1,d_2,\cdots,d_m\}$
    • Assume independent


$$ \theta_{MAP} = \underset{\theta}{\mathrm{argmax}} \;\; \left\{\sum_{i=1}^{m}\log{P \left(d_i \mid \theta \right)} + \log{P(\theta) }\right\} $$

  • Same as MLE except the extra log-prior-distribution term


$$ \theta_{MLE} = \underset{\theta}{\mathrm{argmax}} \;\; \left\{\sum_{i=1}^{m}\log{P \left(d_i \mid \theta \right)}\right\} $$

  • MAP allows incorporating our prior knowledge about $\theta$ in its estimation


$$\theta_{MAP} = \underset{\theta}{\mathrm{argmax}}\;\;P(\theta \mid D)$$


$$ \theta_{MLE} = \underset{\theta}{\mathrm{argmax}}\;\;P(D \mid \theta) $$

3.2. MAP for a Univariate GaussianĀ¶

  • Observations $ D=\{x_1,x_2,\cdots,x_m\}: \text{conditionally independent given}\; \theta$


$$ x_i \sim \mathcal{N}(\theta,\sigma^2) $$

  • Suppose that $\theta$ is a random variable with $\theta \sim \mathcal{N}(\mu,1^2)$, but a prior knowledge
    • Unknown $\theta$ and known $\mu, \; \sigma^2$



  • Joint Probability
$$ P(x_1,x_2,\cdots,x_m \mid \theta) = \prod\limits_{i=1}^m P(x_i \mid \theta) $$
  • MAP: choose $\theta_{MAP}$

$$\begin{align*} \theta_{MAP} &= \underset{\theta}{\mathrm{argmax}}\;\;P(\theta \mid D)=\frac{P(D \mid \theta)P(\theta)}{P(D)}\\ & = \underset{\theta}{\mathrm{argmax}} \;\; P(D \mid \theta)P(\theta)\\ & = \underset{\theta}{\mathrm{argmax}} \;\; \left\{\log{P \left(D \mid \theta \right)} + \log{P(\theta) }\right\}\end{align*}$$



$$\begin{align*} \frac{\partial}{\partial \theta} \left(\log{P \left(D \mid \theta \right)} \right) & =\; \cdots \; = \frac{1}{\sigma^2} \left(\sum\limits_{i=1}^m x_i-m\theta \right)\quad (\text{we did in MLE})\\ \\ \frac{\partial}{\partial \theta} \left(\log{P \left(\theta \right)} \right) &=\frac{\partial}{\partial\theta} \left(\log \left(\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(\theta-\mu)^2} \right) \right)\\ & \; \vdots \\ &=\frac{\partial}{\partial\theta} \left(-\frac{1}{2}\log{2\pi}-\frac{1}{2} \left(\theta - \mu \right)^2 \right)\\ & =\mu-\theta \end{align*}$$



$$\begin{align*} \implies & \frac{\partial}{\partial \theta} \left(\log{P \left(D \mid \theta \right)} \right) + \frac{\partial}{\partial \theta} \left(\log{P \left(\theta \right)} \right)\\ & = \frac{1}{\sigma^2}\left(\sum\limits_{i=1}^m x_i - m\theta^* \right) + \mu - \theta^* = 0 \\ & = \frac{1}{\sigma^2}\sum\limits_{i=1}^m x_i + \mu - \left(\frac{m}{\sigma^2}+1\right)\theta^* = 0 \\ \theta^* &= \frac{\frac{1}{\sigma^2}\sum\limits_{i=1}^m x_i + \mu}{\frac{m}{\sigma^2}+1} = \frac{\frac{m}{\sigma^2}\cdot\frac{1}{m}\sum\limits_{i=1}^m x_i+1\cdot\mu}{\frac{m}{\sigma^2}+1} \end{align*}$$


$$\therefore \;\theta_{MAP} = \frac{\frac{m}{\sigma^2}}{\frac{m}{\sigma^2}+1}\bar{x}+\frac{1}{\frac{m}{\sigma^2}+1}\mu \;\;\;\text{: look familiar ?}$$

  • MLE interpretation:

$$\begin{align*} &\begin{cases} \mu = \text{prior mean}\\ \bar{x} = \text{sample mean} \end{cases} \\ &\begin{cases} \mu = 1\text{st} \;\;\text{observation} \; \sim \mathcal{N}\left(0,1^2\right)\\ \bar{x} = 2\text{nd}\;\; \text{observation} \; \sim \mathcal{N}\left(0,\left(\frac{\sigma}{\sqrt{m}}\right)^2\right) \end{cases} \\ \end{align*}$$


  • Big lesson: a prior acts as a data



Note: prior knowledge

  • Education
  • Get older

Example) Experiment in class

  • Which one do you think is heavier?
    • with eyes closed
    • with visual inspection
    • with haptic (touch) inspection



3.3. MAP in PythonĀ¶

Suppose that $\theta$ is a random variable with $\theta \sim \mathcal{N}(\mu,1^2)$, but a prior knowledge (unknown $\theta$ and known $\mu, \; \sigma^2$)

$$ x_i \sim \mathcal{N}(\theta,\sigma^2) $$
  • for mean of a univariate Gaussian
InĀ [9]:
# known
mu = 5 
sigma = 2

# unknown theta 
theta = np.random.normal(mu,1)
x = np.random.normal(theta, sigma)

print('theta = {:.4f}'.format(theta))
print('x = {:.4f}'.format(x))
theta = 3.1817
x = 5.6953
$$\theta_{MAP} = \frac{\frac{m}{\sigma^2}}{\frac{m}{\sigma^2}+1}\bar{x}+\frac{1}{\frac{m}{\sigma^2}+1}\mu $$
InĀ [10]:
# MAP

m = 4
X = np.random.normal(theta,sigma,[m,1])

xbar = np.mean(X)
theta_MAP = m/(m+sigma**2)*xbar + sigma**2/(m+sigma**2)*mu

print('mu = 5')
print('xbar = {:.4f}'.format(xbar))
print('theta_MAP = {:.4f}'.format(theta_MAP))
mu = 5
xbar = 4.6384
theta_MAP = 4.8192
InĀ [11]:
# theta
mu = 5
theta = np.random.normal(mu,1)

sigma = 2
m = 2000

X = np.random.normal(theta,sigma,[m,1])
X = np.asmatrix(X)

print('theta = {:.4f}'.format(theta))
plt.figure(figsize=(10,6))
plt.hist(X,31)
plt.xlim([-5,15])
plt.show()
theta = 5.4629
InĀ [12]:
m = 50
XMLE = []
XMAP = []

for k in range(m):
    xmle = np.mean(X[0:k+1,0])
    xmap = k/(k+sigma**2)*xmle + sigma**2/(k + sigma**2)*mu
    XMLE.append(xmle)
    XMAP.append(xmap)
    
print('theta = {:.4f}'.format(theta))
plt.figure(figsize=(10,6))
plt.plot(XMLE)
plt.plot(XMAP)
plt.legend(['MLE','MAP'])
plt.show()
theta = 5.4629

4. Linear Measurement with NoiseĀ¶




$$y = Ax + \omega$$

  • $x$ is what we want to estimate

  • $y$ is measured

  • $A$ characterizes sensors or measurements

  • $\omega$ is sensor noise

  • Common assumptions

    • $x \sim \mathcal{N}(\mu_x, \Sigma_x)$ $\rightarrow$ for simplicity $x \sim \mathcal{N}(0, \Sigma_x)$

    • $\omega \sim \mathcal{N}(\mu_{\omega}, \Sigma_{\omega})$ $\rightarrow$ $\omega \sim \mathcal{N}(0, \Sigma_{\omega})$ is usually

    • $x$ and $\omega$ are independent


$$\begin{bmatrix} x \\ \omega \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_x \\ \mu_{\omega}\end{bmatrix}, \begin{bmatrix} \Sigma_{x}& 0\\ 0& \Sigma_{\omega} \end{bmatrix} \right)$$


$$\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} I & 0 \\A & I \end{bmatrix} \begin{bmatrix} x \\ \omega \end{bmatrix}$$

$$ \begin{align*} \begin{bmatrix} x \\ y \end{bmatrix} &\sim \mathcal{N}\left( \begin{bmatrix} I & 0 \\A & I \end{bmatrix} \begin{bmatrix} \mu_x \\ \mu_{\omega}\end{bmatrix}, \begin{bmatrix} I & 0 \\A & I \end{bmatrix} \begin{bmatrix} \Sigma_{x}& 0\\ 0& \Sigma_{\omega} \end{bmatrix} \begin{bmatrix} I & 0 \\A & I \end{bmatrix}^T \right) \\\\ &\sim \mathcal{N}\left( \begin{bmatrix} \mu_x \\ A \mu_x + \mu_{\omega}\end{bmatrix}, \begin{bmatrix} \Sigma_{x}& \Sigma_x A^T\\ A\Sigma_x& A\Sigma_xA^T + \Sigma_{\omega} \end{bmatrix} \right) = \mathcal{N} \left( \begin{bmatrix} \mu_x \\ \mu_y\end{bmatrix}, \begin{bmatrix} \Sigma_{x}& \Sigma_{xy}\\ \Sigma_{yx}& \Sigma_y \end{bmatrix} \right) \end{align*} $$



$$\Sigma_y = \underbrace{ A\Sigma_xA^T}_{\text{signal covariance}} + \underbrace{\Sigma_{\omega}}_{\text{noise covariance}}$$

  • Back to estimation problem: estimate $x$ given $y$


$$x \mid y \sim \mathcal{N} \left(\mu_x + \Sigma_{xy}\Sigma_{y}^{-1}(y-\mu_y), \; \Sigma_{x} - \Sigma_{xy}\Sigma_{y}^{-1}\Sigma_{yx} \right)$$


$$ \begin{align*} \hat{x} = \phi(y) &= E[x \mid y]\\ & = \mu_x + \Sigma_{xy}\Sigma_{y}^{-1}(y-\mu_y)\\ & = \mu_x + \Sigma_x A^T(A\Sigma_xA^T + \Sigma_{\omega})^{-1}(y-\mu_y)\\ & = \mu_x + K(y-\mu_y)\\\\ \text{cov}(x \mid y) &= \Sigma_{x} - \Sigma_{xy}\Sigma_{y}^{-1}\Sigma_{yx}\\\\ &= \Sigma_{x} - \Sigma_x A^T (A\Sigma_xA^T + \Sigma_{\omega})^{-1} A \Sigma_x\leq \Sigma_x \end{align*} $$

  • Interpretation

    • $\mu_x$ is our best prior guess of $x$ (before measurement)

    • $y - \mu_y$ is the discrepancy between what we actually measure ($y$) and the expected value of $y$

    • Estimator modifies prior guess by $K$ times this discrepancy

    • Estimator blends prior information with measurement

    • Covariance of estimation error is always less than prior covariance of $x$

5. Bayesian ThinkingĀ¶

5.1. Re-visit Two Sensors ProblemĀ¶




  • Assumptions
    • Follows Gaussian distribution
    • Variances of two sensors


$$\text{var}(y_a) = \sigma_a^2\\ \text{var}(y_b) = \sigma_b^2$$

  • Optimal position estimation $\hat x$
$$ \begin{align*} E[\hat x] &= \frac{\frac{1}{\sigma_a^2} y_a + \frac{1}{\sigma_b^2} y_b} {\frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2}} = \frac{\frac{1}{\sigma^2_a}}{\frac{1}{\sigma^2_a} + \frac{1}{\sigma^2_a}}y_a + \frac{\frac{1}{\sigma^2_b}}{\frac{1}{\sigma^2_a} + \frac{1}{\sigma^2_b}}y_b\\\\ \text{var}[\hat x] &= \frac{1}{\frac{1}{\sigma^2_a}+\frac{1}{\sigma^2_b}} \quad < \quad \sigma_a^2, \;\sigma_b^2 \qquad \implies \text{more accurate} \end{align*} $$

5.2. Different PerspectiveĀ¶

  • Observe sensor A first $\;y_a, \; \sigma_a^2$


$$ \begin{align*} \hat x_1 &= y_a\\ \hat{\sigma}_{1}^2 &= \sigma_a^2 \end{align*} $$

  • Combine sensor B's obervation $\;y_b, \; \sigma_b^2$


$$ \begin{align*} E[\hat x] &= \frac{\frac{1}{\sigma_a^2} y_a + \frac{1}{\sigma_b^2}y_b}{\frac{1}{\sigma_a^2} + \frac{1} {\sigma_b^2}} = \frac{\frac{1}{\sigma^2_a}}{\frac{1}{\sigma^2_a} + \frac{1}{\sigma^2_b}}y_a + \frac{\frac{1}{\sigma^2_b}}{\frac{1}{\sigma^2_a} + \frac{1}{\sigma^2_b}}y_b \quad \text{and set } K = \frac{\frac{1}{\sigma_b^2}}{\frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2}}\\\\ \hat{x}_2 = E[\hat x] &= (1-K) y_a + K y_b\\ &= (1-K) \hat x_1 + K y_b\\ &= \hat x_1 + K (y_b - \hat x_1 ) \end{align*} $$

  • Optimal estimation
    • prediction first, then correct or update with prediction error


$$\underbrace{\hat x_2}_{\text{final estimation}} = \underbrace{\hat x_1}_{\text{prediction}} + K ( \overbrace{ \underbrace{y_b}_{\text{measured}} - \hat x_1}^{\text{prediction error}})$$

5.3. Bayesian Kalman FilterĀ¶

  • Discrete linear dynamical system of motion


$$ \begin{align*} x_{t+1} &= A x_t + B u_t\\ y_t &= C x_t \end{align*} $$




  • Kalman filter has a very nice Bayesian interpretation.
    • Model $x_t$ with a Gaussian

      $$p(x_t) = \mathcal{N}(x_t, \Sigma_t)$$
    • Prediction using state dynamics model

      $$p(x_{t} \mid x_{t-1})$$
    • Inference from noisy measurements

      $$p(y_t \mid x_t)$$


  • Summary

    • Prediction
      • We do not have any more data, we just update our earlier posterior of $\hat x_{t-1}$ to give a new posterior of $\hat x_{t}$
      • We are just propagating the pmf under the linear dynamics

        $$ \begin{align*} x_{t+1} &= A x_t + B u_t\\ y_t &= C x_t \end{align*} $$
    • Correction
      • We start with $\hat{x}_t$, and use it as the prior for our next measurement of $y_t$

        $$ \hat x_{t} \leftarrow \hat x_t + K (y_{t} - C\hat x_t ) $$
    • (Recursive) Repeat over time

5.4. Object Tracking in Computer VisionĀ¶

  • Lecture: Introduction to Computer Vision by Prof. Aaron Bobick at Georgia Tech
InĀ [13]:
%%html
<center><iframe src="https://www.youtube.com/embed/rf3DKqWajWY?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
InĀ [14]:
%%html
<center><iframe src="https://www.youtube.com/embed/5yUjYCkm2jI?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

5.5. Bayesian InferenceĀ¶

  • Start with prior beliefs, which can be thought of as a summary of opinions.

    • might be subjective
  • Given our prior, we can update our opinion, and produce a new opinion.

    • This new distribution is called the posterior
  • Iterate

    • if more data is available

  • Coin-flip example

  • See how two different priors can result in two different posteriors

  • Gaussian distributions
    • Explaining away

InĀ [15]:
%%javascript
$.getScr`aipt('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')