Parameter Estimation in Probabilistic Model

By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

1. Probability Density Estimation¶

• Reconstructing the probability density function from a set of given data samples $y_1, y_2, \cdots, y_m$

• Want to recover the underlying probability density function generating our dataset

• Assumption
• Gaussian distributed with unknown parameters 𝜇 and 𝜎
• Samples are independent
• It becomes a parameter estimation problem
• Think about how to compute the probability
• Gaussian distributed with unknown parameters 𝜇 and 𝜎
• Samples are independent

$$P \left(y_1,y_2,\cdots,y_m \,;\, \mu,\sigma^2 \right) = \prod_{i=1}^{m} P \left(y_i \,;\, \mu,\sigma^2 \right)$$

1.1. Given $m$ Data Points, Drawn from a Gaussian Distribution¶

• You will often see the following derivation

\begin{align*} P\left(y=y_i \,;\, \mu,\sigma^2\right) & =\frac{1}{\sqrt{2\pi}\sigma}\exp \left(-\frac{1}{2\sigma^2}(y_i-\mu)^2 \right)\;\text{: generative model}\\\\ \mathcal{L} = P \left(y_1,y_2,\cdots,y_m \,;\, \mu,\sigma^2 \right) & =\prod\limits_{i=1}^m\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{1}{2\sigma^2}\left(y_i-\mu\right)^2\right)\\ & = \frac{1}{(2\pi)^\frac{m}{2}\sigma^m}\exp\left(-\frac{1}{2\sigma^2}\sum\limits_{i=1}^m(y_i-\mu)^2 \right)\\\\ \ell =\log \mathcal{L} &=-\frac{m}{2} \text{log}{2\pi}-m \text{log}{\sigma}-\frac{1}{2\sigma^2}\sum\limits_{i=1}^m(y_i-\mu)^2 \end{align*}

• To maximize, $\frac{\partial \ell}{\partial \mu}=0, \frac{\partial \ell}{\partial \sigma}=0$

\begin{align*} \frac{\partial \ell}{\partial \mu} &=\frac{1}{\sigma^2}\sum\limits_{i=1}^m(y_i-\mu)=0 \;\;\;\;\qquad\quad \implies \quad \mu_{MLE}=\frac{1} {m}\sum\limits_{i=1}^m y_i \quad \text{: sample mean}\\\\ \frac{\partial \ell}{\partial \sigma} &= -\frac{m}{\sigma}+\frac{1}{\sigma^3}\sum\limits_{i=1}^m(y_i-\mu)^2=0 \;\; \implies \quad \sigma_{MLE}^2=\frac{1}{m}\sum\limits_{i=1}^m(y_i-\mu)^2 \quad \text{: sample variance} \end{align*}

• Big lesson
• We often compute a mean and variance to represent data statistics
• We kind of assume that a data set is Gaussian distributed
• Good news: sample mean is Gaussian distributed by the central limit theorem

1.2. Maximum Likelihood Estimation (MLE)¶

• Method of estimating the parameters of a probability distribution by maximizing a likelihood function

• Under the assumed statistical model the observed data is most probable

• Likelihood function
• The probability density at a particular outcome $D$ when the true value of the parameter is $\theta$

$$\mathcal{L} = P(D \mid \theta) = P(D \,;\, \theta)$$

$$\theta_{MLE} = \underset{\theta}{\mathrm{argmax}}\;\; P(D \,;\, \theta)$$

• A model exists first and it generates samples
• Generative model
In [1]:
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# MLE of Gaussian distribution
# mu

m = 20
mu = 0
sigma = 5

x = np.random.normal(mu,sigma,[m,1])
xp = np.linspace(-20, 20, 100)
y0 = np.zeros([m, 1])

muhat = [-5, 0, 5, np.mean(x)]

plt.figure(figsize=(8, 8))

for i in range(4):
yp = norm.pdf(xp, muhat[i], sigma)
y = norm.pdf(x, muhat[i], sigma)
logL = np.sum(np.log(y))

plt.subplot(4, 1, i+1)
plt.plot(xp, yp, 'r')
plt.plot(x, y, 'bo')
plt.plot(np.hstack([x, x]).T, np.hstack([y, y0]).T, 'k--')

plt.title(r'$\hat\mu$ = {0:.2f}'.format(muhat[i]), fontsize=15)
plt.text(-15,0.06,np.round(logL,4),fontsize=15)
plt.axis([-20, 20, 0, 0.11])

plt.tight_layout()
plt.show()


Compare to a result from formula

$$\mu_{MLE}=\frac{1}{m}\sum_{i = 1}^{m}x_i$$
In [3]:
# mean is unknown in this example
# variance is known in this example

m = 10
mu = 0
sigma = 5

x = np.random.normal(mu,sigma,[m,1])

mus = np.arange(-10, 10.5, 0.5)
LOGL = []

for i in range(np.size(mus)):
y = norm.pdf(x, mus[i], sigma)
logL = np.sum(np.log(y))
LOGL.append(logL)

muhat = np.mean(x)
print(muhat)

plt.figure(figsize=(10, 6))
plt.plot(mus, LOGL, '.')
plt.title('$log (\prod \mathcal{N}(x \mid \mu , \sigma^2))$', fontsize=20)
plt.xlabel(r'$\hat \mu$', fontsize=15)
plt.grid(alpha=0.3)
plt.show()

-1.4011074292982022

$$\sigma_{MLE}^2=\frac{1}{m}\sum\limits_{i=1}^m(y_i-\mu)^2$$
In [4]:
# mean is known in this example
# variance is unknown in this example

m = 100
mu = 0
sigma = 3

x = np.random.normal(mu,sigma,[m,1])  # samples

sigmas = np.arange(1, 10, 0.1)
LOGL = []

for i in range(sigmas.shape[0]):
y = norm.pdf(x, mu, sigmas[i])     # likelihood
logL = np.sum(np.log(y))
LOGL.append(logL)

sigmahat = np.sqrt(np.var(x))
print(sigmahat)

plt.figure(figsize=(10,6))
plt.title(r'$\log (\prod \mathcal{N} (x|\mu,\sigma^2))$',fontsize=20)
plt.plot(sigmas, LOGL, '.')
plt.xlabel(r'$\hat \sigma$', fontsize=15)
plt.axis([0, np.max(sigmas), np.min(LOGL), -200])
plt.grid(alpha=0.3)
plt.show()

2.726137849417368


2. Data Fusion with Uncertainties¶

• Suppose there are two sensors

\begin{align*} y_a &= x + \varepsilon_a, \; \varepsilon_a \sim \mathcal{N}\left(0,\sigma^2_a\right) \\ y_b &= x + \varepsilon_b, \; \varepsilon_b \sim \mathcal{N}\left(0,\sigma^2_b\right) \end{align*}
• In a matrix form

$$y = \begin{bmatrix} y_a \\ y_b\end{bmatrix} = Cx+\varepsilon = \begin{bmatrix}1\\1\end{bmatrix}x+\begin{bmatrix}\varepsilon_a\\ \varepsilon_b \end{bmatrix} \quad \quad \varepsilon \sim \mathcal{N}\left(0,R\right),\;\; R=\begin{bmatrix} \sigma^2_a & 0\\ 0 & \sigma^2_b\end{bmatrix}$$

\begin{align*}P\left(y \mid x\right) & \sim \mathcal{N}\left(Cx,R\right)\\ &= \frac{1}{\sqrt{\left(2\pi\right)^2\vert R \vert}}\exp\left(-\frac{1}{2}\left(y-Cx\right)^TR^{-1}\left(y-Cx\right)\right)\end{align*}

• Find $\,\hat{x}$

$$\ell = -\log{2\pi}-\frac{1}{2}\log{\vert R \vert}-\frac{1}{2} \underbrace{\left(y-Cx\right)^TR^{-1}\left(y-Cx\right)}$$

\begin{align*} \left(y-Cx\right)^TR^{-1}\left(y-Cx\right) &= y^TR^{-1}y-y^TR^{-1}Cx-x^TC^TR^{-1}y+x^TC^TR^{-1}Cx \\ \\ \implies \frac{d\ell}{dx} & =0=-2C^TR^{-1}y + 2C^TR^{-1}Cx\\ \therefore \;\; \hat{x} &=\left(C^TR^{-1}C\right)^{-1}C^TR^{-1}y \end{align*}

• $\left(C^TR^{-1}C\right)^{-1}C^TR^{-1}$
\begin{align*} \left(C^TR^{-1}C\right) &= \begin{bmatrix}1 & 1\end{bmatrix}\begin{bmatrix}\frac{1}{\sigma^2_a} & 0 \\ 0 & \frac{1}{\sigma^2_b}\end{bmatrix}\begin{bmatrix}1\\ 1\end{bmatrix} = \frac{1}{\sigma^2_a}+\frac{1}{\sigma^2_b}\\ C^TR^{-1} &= \begin{bmatrix}1 & 1\end{bmatrix}\begin{bmatrix}\frac{1}{\sigma^2_a} & 0 \\ 0&\frac{1}{\sigma^2_b}\end{bmatrix} = \begin{bmatrix}\frac{1}{\sigma^2_a} &\frac{1}{\sigma^2_b} \end{bmatrix}\\ \\ \end{align*}
\begin{align*} \hat{x} &= \left(C^TR^{-1}C\right)^{-1}C^TR^{-1}y = \left(\frac{1}{\sigma^2_a}+\frac{1}{\sigma^2_b}\right)^{-1} \begin{bmatrix}\frac{1}{\sigma^2_a} &\frac{1}{\sigma^2_b} \end{bmatrix} \begin{bmatrix}y_a\\y_b\end{bmatrix} \\&= \frac{\frac{1}{\sigma^2_a}y_a+\frac{1}{\sigma^2_b}y_b}{\frac{1}{\sigma^2_a}+\frac{1}{\sigma^2_b}}\\ \\ \text{var}\left(\hat{x}\right) &= \left(\left(C^TR^{-1}C \right)^{-1}C^TR^{-1}\right) \cdot \text{var}(y) \cdot \left( \left(C^TR^{-1}C)^{-1}C^TR^{-1} \right) \right)^T\\ & = \left( \left(C^TR^{-1}C \right)^{-1}C^TR^{-1}\right) \cdot R \cdot \left( \left(C^TR^{-1}C)^{-1}C^TR^{-1} \right) \right)^T\\ &= \left(C^TR^{-1}C \right)^{-1}C^T \cdot \left(R^{-1} \right)^TC \left(\left(C^TR^{-1}C \right)^{-1} \right)^T \\ &=\underbrace{\left(C^TR^{-1}C\right)^{-1}} \underbrace{C^TR^{-1}C}\left(\left(C^TR^{-1}C\right)^{-1}\right)^T = \left(C^TR^{-1}C\right)^{-1}\\\\ &=\frac{1}{\frac{1}{\sigma^2_a}+\frac{1}{\sigma^2_b}} \leq \;\; \sigma^2_a,\;\sigma^2_b \end{align*}
• Summary

\begin{align*} \hat x &= \frac{\frac{1}{\sigma_a^2} y_a + \frac{1}{\sigma_b^2}y_b}{\frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2}} = \frac{\frac{1}{\sigma^2_a}}{\frac{1}{\sigma^2_a} + \frac{1}{\sigma^2_b}}y_a + \frac{\frac{1}{\sigma^2_b}}{\frac{1}{\sigma^2_a} + \frac{1}{\sigma^2_b}}y_b\\\\ \text{var}(\hat x) &= \frac{1}{\frac{1}{\sigma^2_a}+\frac{1}{\sigma^2_b}} \quad < \quad \sigma_a^2, \;\sigma_b^2 \qquad \implies \text{more accurate} \end{align*}

• Big lesson:
• Two sensors are better than one sensor $\implies$ less uncertainties
• Accuracy or uncertainty information is also important in sensors

Example of two rulers

• 1D example
• how brain works on human measurements from both haptic and visual channels

In [16]:
x = 5       # true state (length in this example)
a = 1       # sigma of a
b = 2       # sigma of b

YA = []
YB = []
XML = []

for i in range(2000):
ya = x + np.random.normal(0,a)
yb = x + np.random.normal(0,b)
xml = (1/a**2*ya + 1/b**2*yb)/(1/a**2+1/b**2)
YA.append(ya)
YB.append(yb)
XML.append(xml)

plt.figure(figsize=(8, 6))
plt.subplot(3, 1, 1), plt.hist(YA, 41, density=True), plt.axis([0, 10, 0, 0.5]),
plt.title(r'$y_a$', fontsize=20), plt.grid(alpha=0.3)
plt.subplot(3, 1, 2), plt.hist(YB, 41, density=True), plt.axis([0, 10, 0, 0.5]),
plt.title(r'$y_b$', fontsize=20), plt.grid(alpha=0.3)
plt.subplot(3, 1, 3), plt.hist(XML, 41, density=True), plt.axis([0, 10, 0, 0.5]),
plt.title(r'$x_{ML}$', fontsize=20), plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()


Example of two GPSs

• 2D example
In [6]:
x = np.array([5, 10]).reshape(-1, 1) # true position

mu = np.array([0, 0])
Ra = np.matrix([[9, 1],
[1, 1]])
Rb = np.matrix([[1, 1],
[1, 9]])

YA = []
YB = []
XML = []

for i in range(1000):
ya = x + np.random.multivariate_normal(mu, Ra).reshape(-1, 1)
yb = x + np.random.multivariate_normal(mu, Rb).reshape(-1, 1)
xml = (Ra.I+Rb.I).I*(Ra.I*ya+Rb.I*yb)
YA.append(ya.T)
YB.append(yb.T)
XML.append(xml.T)

YA = np.vstack(YA)
YB = np.vstack(YB)
XML = np.vstack(XML)

In [7]:
plt.figure(figsize=(10, 6))
plt.title('Data Fusion', fontsize=15)
plt.plot(YA[:,0], YA[:,1], 'b.', label='Observation 1')
plt.plot(YB[:,0], YB[:,1], 'r.', label='Observation 2')
plt.plot(XML[:,0], XML[:,1], 'k.', label='MLE')
plt.axis('equal')
plt.grid(alpha=0.3)
plt.legend(fontsize=15)
plt.show()